ASPECTS OF COMPUTATIONAL TOPOLOGY AND MATHEMATICAL VIROLOGY
                                      By
                                   Rui Wang
                             A DISSERTATION
                                 Submitted to
                         Michigan State University
                 in partial fulfillment of the requirements
                              for the degree of
               Applied Mathematics – Doctor of Philosophy
                                     2022


                                        ABSTRACT
  ASPECTS OF COMPUTATIONAL TOPOLOGY AND MATHEMATICAL VIROLOGY
                                             By
                                         Rui Wang
Being able to describe the shape of data is of paramount importance to the fields of biol-
ogy, physics, chemistry, pharmaceutics, etc. Therefore, in recent years, scientists from
the TDA community have been applying advanced mathematical tools to decode the
topological structures of data. Methods such as persistent homology, path homology,
and de Rham-Hodge theory have become the main workhorse of TDA, which pioneered
new branches in algebraic topology and differential geometry. Later, various topolog-
ical Laplacians such as graph Laplacian, Hodge Laplacian, sheaf Laplacian, and Dirac
Laplacian are proposed to preserve topological invariants and geometric shapes simul-
taneously. However, such Laplacians fail to extract the topological and geometric de-
formations when one introduces the filtration parameters in. Therefore, we proposed a
new topological Laplacians called persistent Laplacians to fully recover the topological
persistence and homotopic shape evolution during filtration.
    It is worth mentioning that persistent Laplacians are insensitive to asymmetry or di-
rected relations, which limits their power to preserve the directional information of struc-
tures in practical applications. Therefore, we proposed persistent path Laplacians to over-
come this issue. Similar to the persistent Laplacians, one can also extract the topological
persistence and geometric deformations during filtration from the persistent path Lapla-
cians by calculating their harmonic and non-harmonic spectra. In addition, the persistent
path Laplacians are constructed on the directed graphs or network, which address the
importance of directional representation in datasets such as gene regulation datasets in
biology.
    Versatile mathematical tools have been playing an essential role in various biological


applications. Since the first COVID-19 case was reported in December 2019, researchers
worldwide have been pursuing scientific endeavors in the SARS-CoV-2 projects. Instead
of designing promising vaccines and antibody therapies that required wet lab resources,
we proposed a new mathematical-AI model called TopNetmAb to systematically ana-
lyze the mutation-induced impacts on the SARS-CoV-2 infectivity, vaccines, and antibody
drugs. In this dissertation, the topological data analysis (including the persistent Lapla-
cians mentioned above), artificial intelligence, various network models, and genomics
analysis are all included in our SARS-CoV-2-related projects to provide comprehensive
representations for the understanding of the transmission and evolution of SARS-CoV-2.


Copyright by
RUI WANG
2022


This thesis is dedicated to my family.
                   v


                                 ACKNOWLEDGEMENTS
Especially, I would like to thank Prof. Guowei Wei for being an energetic, meticulous,
and long-headed advisor throughout my Ph.D. journey. He is an outstanding advisor
with enough foresight in various research topics, which steered me to tackle projects that
yielded fruitful results. Furthermore, his deep thinking and intuitive explanations have
enhanced my comprehension of how mathematical tools will benefit life sciences, which
has led to my firm interest in the field of mathematical biology. I genuinely appreci-
ate him for providing valuable opportunities and constant support during my graduate
studies. Without his encouragement, I probably would not have chosen to pursue a career
in academia.
    Importantly, I would like to express gratitude to my committee members, Professors
Chichia Chiu, Moxun Tang, Yiying Tong, and Jeanne Wald, for their precious comments
and feedback. Furthermore, I want to thank Professors Yiying Tong and Ping Wang for
welcoming me into their research projects, which have been invaluable to my interdisci-
plinary research experience. Also, I would like to thank Professor Changchuan Yin for
teaching me multiple techniques in genetic biology that are indispensable to the applica-
tions of my thesis. Besides, I genuinely appreciated my lab mates Drs. Jiahui Chen, Kaifu
Gao, and Duc Nguyen for helping me in all our collaboration projects against SARS-CoV-
2 since 2020. Leading by Prof. Wei, we had a quite tacit collaboration and it was probably
one of the best collaborative experience I have ever had.
    Particularly, I would like to thank Prof. Jeanne Wald and her husband, Mark, for the
cherished time we spent together. I deeply appreciated their experienced and valuable
advice on my career and life. In addition, I would like to thank my lab mates, colleagues,
and friends for their helpful comments and suggestions. Further, I want to thank Drs.
Jiahui Chen, Duc Nguyen, Kaifu Gao, Menglun Wang, Timothy Szocinski, Zixuan Cang,
Dong Chen, and Mrs. Jing Huang for their friendly conversations and helpful discussions
                                              vi


of diverse topics during my time at MSU.
    Undoubtedly, my husband, Shihao Liu, deserves all my thanks. His supportive words
of encouragement have been keeping me at ease and focused during this journey. Thank
you, Shihao, for all the companionship and understanding you have given to enrich
my research career. Most importantly, I should thank my parents, Ling Lu and Yunru
Wang, for the enduring sacrifices they have made for me. My Ph.D. journey would have
never started without their support. I do sincerely thank them for their endless and un-
requited support and love. Foremost, I am grateful to my grandparents, Guojun Gao and
Weizhong Lu, who raised me and gave me a great childhood with plenty of family activ-
ities, indoor and outdoor. They shaped my character and set an example for me to be a
gentle and decent person. My late grandfather Weizhong Lu, passed away in 2020 at the
age of 81. I will always miss him.
    Lastly, I would like to acknowledge that all of the work included in this thesis was sup-
ported in part by NIH grants R01GM126189 and R01AI164266, NSF grants DMS-2052983,
DMS-1761320, IIS-1900473, and NASA grant 80NSSC21M0023.
                                             vii


                                TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    xi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
CHAPTER 1 INTRODUCTION . . . . . . .              . . . . . . . . . . . . . . . . . . . . . . .  1
   1.1 Topological Laplacian . . . . . . . . .    . . . . . . . . . . . . . . . . . . . . . . .  1
   1.2 Mathematical Modeling of Virology          . . . . . . . . . . . . . . . . . . . . . . .  3
   1.3 Outline . . . . . . . . . . . . . . . . .  . . . . . . . . . . . . . . . . . . . . . . .  8
CHAPTER 2 METHODS ON TOPOLOGICAL LAPLACIANS . .                             . . . . . . . . . .  9
   2.1 Persistent Laplacians . . . . . . . . . . . . . . . . . . . . . .    . . . . . . . . . .  9
       2.1.1 Simplex . . . . . . . . . . . . . . . . . . . . . . . . . .    . . . . . . . . . .  9
       2.1.2 Simplicial Complex . . . . . . . . . . . . . . . . . . .       . . . . . . . . . . 10
              2.1.2.1 Delaunay Triangulation and Alpha Shapes               . . . . . . . . . . 11
              2.1.2.2 Vietoris-Rips Complex . . . . . . . . . . . .         . . . . . . . . . . 16
       2.1.3 Chain Complex . . . . . . . . . . . . . . . . . . . . .        . . . . . . . . . . 16
       2.1.4 Combinatorial Laplacians . . . . . . . . . . . . . . .         . . . . . . . . . . 18
       2.1.5 Persistent Laplacian . . . . . . . . . . . . . . . . . .       . . . . . . . . . . 19
       2.1.6 Variants of Persistent Laplacians . . . . . . . . . . .        . . . . . . . . . . 31
   2.2 Persistent Path Laplacian . . . . . . . . . . . . . . . . . . . .    . . . . . . . . . . 35
       2.2.1 Paths on a Finite Set . . . . . . . . . . . . . . . . . .      . . . . . . . . . . 35
       2.2.2 Path Complex . . . . . . . . . . . . . . . . . . . . . .       . . . . . . . . . . 37
       2.2.3 Path Homology . . . . . . . . . . . . . . . . . . . . .        . . . . . . . . . . 38
       2.2.4 Path Homology on Directed Graphs . . . . . . . . .             . . . . . . . . . . 39
       2.2.5 Homologies of Directed Subgraphs . . . . . . . . . .           . . . . . . . . . . 41
       2.2.6 Path Laplacian . . . . . . . . . . . . . . . . . . . . . .     . . . . . . . . . . 42
       2.2.7 Persistent Path Laplacian . . . . . . . . . . . . . . .        . . . . . . . . . . 48
CHAPTER 3 METHODS ON MATHEMATICAL MODELING OF VIROLOGY                                      . . 54
   3.1 Genomics Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      . . 54
       3.1.1 Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . . . . .         . . 54
              3.1.1.1 Pairwise Sequence Alignment . . . . . . . . . . . . . . .             . . 54
              3.1.1.2 Multiple Sequence Alignment (MSA) . . . . . . . . . . .               . . 55
       3.1.2 Single Nucleotide Polymorphism Calling . . . . . . . . . . . . . .             . . 57
       3.1.3 Jaccard Distance of SNP profiles . . . . . . . . . . . . . . . . . . .         . . 57
       3.1.4 k-nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . .        . . 58
       3.1.5 k-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . .       . . 59
   3.2 Mathematical-assisted Machine Learning Models in SARS-CoV-2 . . . .                  . . 60
       3.2.1 Data Collection and Pre-processing . . . . . . . . . . . . . . . . . .         . . 60
       3.2.2 Preparation of Machine learning Datasets . . . . . . . . . . . . . .           . . 62
       3.2.3 Features Generalization . . . . . . . . . . . . . . . . . . . . . . . .        . . 62
                                             viii


             3.2.3.1 Generation of Topological Features for PPIs . . . . . . .           . . 63
             3.2.3.2 Generation of Residue-level Features for PPIs . . . . . .           . . 66
             3.2.3.3 Generation of Atom-level Features for PPIs . . . . . . . .          . . 67
      3.2.4  Models for the Binding Free Energy Change Prediction of Protein-
             protein Interaction on SARS-CoV-2 . . . . . . . . . . . . . . . . .         . .  68
             3.2.4.1 TopNet Model . . . . . . . . . . . . . . . . . . . . . . . .        . .  68
             3.2.4.2 TopNetmAb Model . . . . . . . . . . . . . . . . . . . . .           . .  70
      3.2.5  Other Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    . .  70
CHAPTER 4 APPLICATIONS IN TOPOLOGICAL LAPLACIANS . . . . . . .                           . .  72
  4.1 Persistent Laplacians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . .  72
      4.1.1 Benzene Structure Analysis . . . . . . . . . . . . . . . . . . . . . .       . .  75
      4.1.2 Fullerene Analysis and Prediction . . . . . . . . . . . . . . . . . .        . .  77
             4.1.2.1 Fullerene Structure Analysis . . . . . . . . . . . . . . . .        . .  78
             4.1.2.2 Fullerene stability prediction . . . . . . . . . . . . . . . .      . .  81
      4.1.3 Protein flexibility analysis . . . . . . . . . . . . . . . . . . . . . . .   . .  85
      4.1.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . .        . .  88
  4.2 Persistent Path Laplacian . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . .  89
      4.2.1 Constructions of Persistent Path Laplacian for Tetra and Pyramid             . .  92
      4.2.2 Constructions of Persistent Path Laplacian for CB7 . . . . . . . . .         . .  95
      4.2.3 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . .        . .  98
CHAPTER 5    HERMES: AN OPEN-SOURCE SOFTWARE FOR THE SPECTRAL
             ANALYSIS OF PERSISTENT LAPLACIANS . . . . . . . . . . . .                   . .  99
  5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  99
  5.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   . . 102
      5.2.1 Construction of Alpha Shape . . . . . . . . . . . . . . . . . . . . .        . . 102
      5.2.2 Implementation details for alpha shape . . . . . . . . . . . . . . .         . . 103
             5.2.2.1 Boundary operator construction . . . . . . . . . . . . . .          . . 104
             5.2.2.2 Persistent boundary operator . . . . . . . . . . . . . . . .        . . 104
             5.2.2.3 Persistent spectrum computation . . . . . . . . . . . . . .         . . 105
      5.2.3 Implementation Details for Rips Complex . . . . . . . . . . . . . .          . . 106
  5.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
      5.3.1 Validation on Fullerene structures . . . . . . . . . . . . . . . . . .       . . 107
      5.3.2 Validation on proteins . . . . . . . . . . . . . . . . . . . . . . . . .     . . 109
  5.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . .    . . 114
CHAPTER 6 APPLICATIONS IN MATHEMATICAL MODELING OF VIROLOGY 117
  6.1 Mutations on COVID-19 diagnostic targets . . . . . . . . . . . . . . . . . . . 117
      6.1.1 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
      6.1.2 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
      6.1.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
  6.2 Mechanisms of SARS-CoV-2 evolution . . . . . . . . . . . . . . . . . . . . . . 130
      6.2.1 Evolutionary trajectories of viral RBD single mutations . . . . . . . . 132
  6.3 Mutational impacts on SARS-CoV-2 infectivity . . . . . . . . . . . . . . . . . 135
                                            ix


       6.3.1 Impacts of S RBD single mutation on SARS-CoV-2 Infectivity . . .             . . 137
       6.3.2 Impacts of S RBD co-mutations on SARS-CoV-2 Infectivity . . . .              . . 140
   6.4 Mutational impacts on SARS-CoV-2 antibodies and vaccines . . . . . . .             . . 144
       6.4.1 Impacts of S RBD single mutation on SARS-CoV-2 antibodies and
              vaccines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . 144
       6.4.2 Impacts of S RBD single mutation on SARS-CoV-2 antibodies and
              vaccines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . 147
   6.5 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
   6.6 Websites Designed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    . . 153
       6.6.1 Mutation Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . .     . . 153
       6.6.2 Mutation Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . .      . . 153
   6.7 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . .    . . 154
CHAPTER 7     DISSERTATION CONTRIBUTION . . . . . . . . . . . . . . . . . . . . 157
APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
   APPENDIX A       SUPPLEMENTARY MATERIALS IN PERSISTENT LAPLA-
                    CIAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
   APPENDIX B       SUPPLEMENTARY MATERIALS IN PERSISTENT PATH
                    LAPLACIAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
                                             x


                                        LIST OF TABLES
Table 2.1: The Betti number of simplicial complexes in Figure 2.2. Each color rep-
            resents different faces. The tetrahedron-shaped simplicial complexes
            are demonstrated in (a)-(c), and the cube-shaped simplicial complexes
            are depicted in (d) - (f). (a) and (d) only has 0-simplices and 1-simplices,
            (b) has four 2-simplices, and (c) has one more 4-simplex. (e) and (f) do
            not have any 2-simplex. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Table 2.2: The matrix representation of q-boundary operator and its qth-order
            persistent Laplacian with corresponding dimension, rank, nullity, and
            spectra from alpha complex K0.6 → K0.6 . . . . . . . . . . . . . . . . . . . . 16
Table 2.3: The matrix representation of q-boundary operator and its qth-order
            persistent Laplacian with corresponding dimension, rank, nullity, and
            spectra from alpha complex K0.2 → K0.6 . . . . . . . . . . . . . . . . . . . . 17
Table 2.4: The number of q-cycles of simplicial complexes demonstrated in Figure 2.6. 23
Table 2.5: K1 → K3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Table 2.6: K3 → K4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Table 2.7: K4 → K4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Table 2.8: K4 → K5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Table 2.9: K5 → K6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Table 2.10: K6 → K6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Table 2.11: Illustration of digraph c in Figure 2.8. . . . . . . . . . . . . . . . . . . . . . 49
Table 2.12: Illustration of digraph d in Figure 2.8. . . . . . . . . . . . . . . . . . . . . . 49
Table 2.13: Illustration of digraph e in Figure 2.8. . . . . . . . . . . . . . . . . . . . . . 50
Table 2.14: Illustration of digraph f in Figure 2.8. . . . . . . . . . . . . . . . . . . . . . 50
Table 4.1: Distances between atoms in the benzene molecule and the radii when
            the changes of (λ̃2 )r+0
                                 0    occur (Values increase from left to right). . . . . . . 77
                                                 xi


Table 4.2: The heat of formation energy of fullerenes [1] and its corresponding
           predicted energies with α = Max. The unit is EV/atom. . . . . . . . . . . 84
Table 4.3: The correlation coefficients under different type index α. . . . . . . . . . . 85
Table 6.1: The mutation distribution clusters with sample counts (SC) and to-
           tal single mutation counts (MC). The listed countries are United States
           (US), Canada (CA), Australia (AU), Germany (DE), France (FR), United
           Kingdom (UK), Italy (IT), Russia (RU), China (CN), Japan (JP), Korean
           (KR), India (IN), Iceland (IS), Brazil (BR), Spain (ES), Belgium (BE),
           Saudi Arabia (SA), Turkey (TR), Peru(PE), and Chile (CL). . . . . . . . . . 119
Table 6.2: Summary of mutations on COVID-19 diagnostic primers and probes
           and their occurrence frequencies in clusters. Here, SC is the sample
           counts and MC is the mutation counts. . . . . . . . . . . . . . . . . . . . . 121
Table 6.3: Gene-specific statistics of SARS-CoV-2 single mutations on 26 proteins. . 129
Table 6.4: List of top 40 high-frequency (HF) mutations and their corresponding
           BFE changes (unit: kcal/mol) of the binding of S protein and ACE2.
           Here, count shows the frequency occurred in 2021. . . . . . . . . . . . . . 139
Table 6.5: Top 25 most observed S protein RBD mutations. Here, BFE change
           refers to the BFE change for the S protein and human ACE2 complex
           induced by a single-site S protein RBD mutation. A positive mutation-
           induced BFE change strengthens the binding between S protein and
           ACE2, which results in more infectious variants. Counts of antibody
           disruption represent the number of antibody and S protein complexes
           disrupted by a specific RBD mutation. Here, an antibody and S pro-
           tein complex is to be disrupted if its binding affinity is reduced by
           more than 0.3 kcal/mol [2]. In addition, we calculate the antibody
           disruption ratio (%), which is the ratio of the number of disrupted an-
           tibody and S protein complexes over 130 known complexes. Ranks are
           computed from 683 observed RBD mutations. . . . . . . . . . . . . . . . . 141
Table 6.6:  List of vaccine escape (VE) and vaccine weakening (VW) Their cor-
           responding BFE changes (unit: kcal/mol) of the binding of S protein
           and ACE2 are provided as well. Here, the count shows the number of
           antibodies that will make a specific mutation to be an AD mutation. . . . 146
Table A.1: K1 → K1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Table A.2: K2 → K2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Table A.3: K3 → K3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
                                             xii


Table A.4: K5 → K5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Table A.5: K1 → K2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Table A.6: K1 → K4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Table A.7: K1 → K5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Table A.8: K1 → K6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Table A.9: K2 → K3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Table A.10:K2 → K4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Table A.11:K2 → K5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Table A.12:K2 → K6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Table A.13:K3 → K5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Table A.14:K3 → K6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Table A.15:K4 → K6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Table A.16:Fitting parameters from w0 to w5 .     . . . . . . . . . . . . . . . . . . . . . . . 177
Table A.17:Fitting parameters from w6 to w11 . . . . . . . . . . . . . . . . . . . . . . . . 177
Table B.1: Matrix construction of graph G1 (with isolated points included) in the
           top panel of Figure 4.10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Table B.2: Matrix construction of graph G1 (without isolated points) in the top
           panel of Figure 4.10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Table B.3: Matrix construction of graph G2 in the top panel of Figure 4.10. . . . . . . 179
Table B.4: Matrix construction of graph G3 in the top panel of Figure 4.10. . . . . . . 179
Table B.5: Matrix construction of graph G4 in the top panel of Figure 4.10. . . . . . . 180
Table B.6: Matrix construction of graph G5 in the top panel of Figure 4.10. . . . . . . 180
Table B.7: Matrix construction of graph G1 (with isolated points included) in the
           bottom panel of Figure 4.10. . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
                                             xiii


Table B.8: Matrix construction of graph G1 (without isolated points) in the bot-
            tom panel of Figure 4.10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Table B.9: Matrix construction of graph G2 (with isolated points included) in the
            bottom panel of Figure 4.10. . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Table B.10: Matrix construction of graph G2 (without isolated points) in the bot-
            tom panel of Figure 4.10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Table B.11: Matrix construction of graph G3 (with isolated points included) in the
            bottom panel of Figure 4.10. . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Table B.12: Matrix construction of graph G3 (without isolated points) in the bot-
            tom panel of Figure 4.10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Table B.13: Matrix construction of graph G4 in the bottom panel of Figure 4.10. . . . . 184
Table B.14: Matrix construction of graph G5 in the bottom panel of Figure 4.10. . . . . 184
                                             xiv


                                      LIST OF FIGURES
Figure 1.1: Genomics organization of SARS-CoV-2. . . . . . . . . . . . . . . . . . . .        4
Figure 1.2: Six stages of the SARS-CoV-2 life cycle. Stage I: Virus entry. I(a) Virus
            can enter the host cell via plasma membrane fusion. I(b) Virus can
            enter the host cell via endosomes. Stage II: Translation of viral repli-
            cation. Stage III: Replication. Here, nsp12 (RdRp) and nsp13 (heli-
            case) cooperate to perform the replication of the viral genome. Stage
            IV: Translation of viral structure proteins. Stage V: Virion assembly.
            Stage VI: Release of a virus. . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Figure 2.1: Illustration of simplices. (a) 0-simplex (a vertex), (b) 1-simplex (an
            edge), (c) 2-simplex (a triangle), and (d) 3-simplex (a tetrahedron). . . .       9
Figure 2.2: Illustrations of simplicial complexes. . . . . . . . . . . . . . . . . . . . . . 10
Figure 2.3: Illustration of Voronoi diagram, Delaunay triangulation, and Non-
            Delaunay triangulation. Left chart: The Voronoi diagram and its dual
            Delaunay triangulation. The points set is P = {A,B,C,D,E} and the
            Delaunay is defined as DT(P ). The blue lines tessellate the plane
            into Voronoi cells. The red circle are the circumcircles of triangles in
            DT(P ). Right chart: A Non-Delaunay triangulation. Vertices E and D
            are in the green circumcircles, implying the right chart is an example
            of Non-Delaunay triangulation. . . . . . . . . . . . . . . . . . . . . . . . 13
Figure 2.4: Illustration of 2D Delaunay triangulation, alpha shapes, and alpha
            complexes for a set of 6 points A, B, C, D, E, and F. Top left: The
            2D Delaunay triangulation. Top right: The alpha shape and alpha
            complex at filtration value α = 0.2. Bottom left : The alpha shape
            and alpha complex at filtration value α = 0.6. Bottom right: The
            alpha shape and alpha complex at filtration value α = 1.0. Here, we
            use dark blue color to fill the alpha shape. . . . . . . . . . . . . . . . . . . 15
Figure 2.5: The persistent barcode for a set of points as illustrated in Figure 2.4
            that are generated from Gudhi and DioDe. . . . . . . . . . . . . . . . . . 15
Figure 2.6: Illustration of filtration. We use 0, 1, 2, 3, and 4 to stand for 0-simplices,
            01, 12, 23, 03, 24, 02, and 13 for 1-simplices, 012, 023, 013, and 123 for 2-
            simplices, and 0123 for the 3-simplex. Here, K1 has five 0-cycles, K2
            has four 0-cycles, K3 has two 0-cycles and a 1-cycle, K4 has a 0-cycle
            and a 1-cycle, K5 has one 0-cycle, and K6 has a 0-cycle. . . . . . . . . . . 23
                                                xv


Figure 2.7: Homologies of directed subgraphs. a, b, and c illustrate three sub-
            graphs whose homology groups or homology group dimensions are
            related to the original digraphs. . . . . . . . . . . . . . . . . . . . . . . . . 42
Figure 2.8: Five digraphs. a and b Digraphs with 3 vertices and 3 directed edges.
            c and d Digraphs with 4 vertices and 4 directed edges. e A digraph
            with 6 vertices and 8 directed edges. f A digraph with 6 vertices and
            8 directed edges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Figure 3.1: The flowchart of k-NN algorithm. The features of the training set is
            {xi }ni=1 with xi ∈ Rm , k shows the number of the nearest neighbors,
            and x ∈ Rm is a feature representation of the training set. . . . . . . . . . 59
Figure 3.2: Illustration of genome sequence data pre-processing and BFE change
            predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Figure 4.1: Benzene molecule and its topological changes during the filtration
            process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Figure 4.2: Persistent spectral analysis of the benzene molecule induced by filtra-
            tion parameter r. Blue line, orange line, and green line represent Lr+0      0 ,
            L̂0 , and Ľ0 respectively. (a) Plot of the smallest non-zero eigen-
              r+0          r+0
            values with radius filtration under Lr+0    0    (blue line), L̂r+0
                                                                             0   (red line),
            and Ľr+00  (green line). Total 10 jumps   observed    in this plot which   rep-
            resent 10 possible distances between atoms. (b) Plot of the number of
            zero eigenvalues (β0r+0 ) with radius filtration under Lr+0     0 , L̂0 , and
                                                                                   r+0
            Ľr+0
              0     (three spectra are superimposed). When r = 0.00 Å, 12 atoms
            are disconnected with each other. After r = 0.54 Å, H atoms and
            their adjacent C atoms are connected with one another resulting in
            β0r+0 = 6. With r keeps growing, all of the atoms are connected with
            one another and then β0r+0 = 1. (c) Plot of the number of zero eigen-
            values (β1r+0 ) with radius filtration under Lr+0    1 . When r = 0.70 Å,
            a 1-cycle created since all of the C atoms are connected and form a
            hexagon, resulting in β1r+0 = 1. After the radius reached 1.21 Å, the
            hexagon disappears and β1r+0 = 0. . . . . . . . . . . . . . . . . . . . . . . 77
Figure 4.3: (a) Illustration of filtration built on fullerene C20 . Each carbon atom
            of C20 is plotted by its given coordinates, which are associated with
            an ever-increasing radius r. The solid balls centered at given coor-
            dinates keep growing along with the radius filtration parameter. (b)
            The accumulated Lr+0   0   matrix for C20 . For clarity, the diagonal terms
            are set to 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
                                                xvi


Figure 4.4: Illustration of persistent multiscale analysis of C60 in terms of 0-combinatorial
            Laplacian matrices (b)-(f) and their accumulated matrix (a) induced
            by filtration. As the value of filtration parameter r increases, high-
            dimensional simplicial complex forms and grows accordingly. (b),
            (c), (d), (e), and (d) demonstrate the 0-combinatorial Laplacian matri-
            ces (i.e., the connectivity among C60 atoms) at filtration r = 1.0 Å, 1.5 Å, 2.5 Å,
            3.0 Å, and 3.6 Å, respectively. The blue cell located at the ith row and
            jth column represents the balls centered at atom i and atom j con-
            nected with each other. For clarity, the diagonal terms are set to 0 in
            all plots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Figure 4.5: Illustration of persistent spectral analysis of C20 and C60 using the
            spectra of Lr+0 q   (q = 1, 2 and 3). (a) The number of zero eigenval-
            ues of L0 , i.e., β0r+0 , under radius filtration. (b) The number of zero
                       r+0
            eigenvalues of Lr+0  1 , i.e., β1
                                             r+0
                                                 under radius filtration. (c) The num-
            ber of zero eigenvalues of L2 , i.e., β2r+0 under radius filtration. (d)
                                              r+0
            The smallest non-zero eigenvalue (λ̃2 )r+0   0   under radius filtration. The
            radius grid spacing is 0.01 Å. . . . . . . . . . . . . . . . . . . . . . . . . . 82
Figure 4.6: Persistent spectral analysis and prediction of fullerene heat forma-
            tion energies. Left chart: the heat of formation energies of fullerenes
            obtained from quantum calculations [1]. Middle chart: PST model
            using the area under the plot of (λ̃2 )r+0  0 . Right chart: Correlation be-
            tween the quantum calculation and the PST prediction. The highest
            correlation coefficient form the least-squares fitting is 0.986 with the
            type index of α = Max. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Figure 4.7: Illustration of persistent spectral prediction of protein B-factors. (a)
            Plot of the secondary structure of protein 2Y7L. (b) Accumulated per-
            sistent Laplacian matrix (For clarity, the diagonal terms are set to 0.).
            Note that the accumulated persistent Laplacian matrix maps out the
            detailed distance between each pair of residues. (c) Comparison of
            experimental B-factors and those predicted by PST for protein 2Y7L. . . 87
Figure 4.8: Illustration of filtration on a tetrahedron. Here, 1, 2, 3, and 4 rep-
            resent four elementary 0-paths e1 , e2 , e3 , and e4 . The top panel is
            a tetrahedron that has √     edge lengths |e12 | = |e32 | = |e24 | = 1 and
            |e13 | = |e14 | = |e34 | = 2. The bottom panel√is a tetrahedron         √ that
            has edge lengths |e32 | = |e24 | = 1, |e34 | = 2, |e12 | = 3, and
            |e13 | = |e14 | = 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
                                                 xvii


Figure 4.9: Comparison of Betti numbers and non-harmonic spectra of Lδ,δ               n when
             n = 0, 1, and 2 on tetrahedrons Tetra 1 and Tetra 2. Note that since
             β1δ,δ = 0 and β2δ,δ = 0 for Tetra 1 and Tetra 2, topological variants
             from persistent path homology cannot discriminate Tetra 1 and Tetra
             2. However λδ,δ  1 and λ2 show the differences between Tetra 1 and
                                           δ,δ
             Tetra 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Figure 4.10: Illustration of filtration on a pyramid. Here, 1, 2, 3, 4, and 5 represent
             five elementary 0-paths e1 , e2 , e3 , e4 , and e5 . The top panel is a pyra-
             mid that has edge√       lengths |e13
                                                 √| = |e25 | = |e32 | = |e34 | = |e54 | = 1,
             |e12 | = |e14 | = 2, and |e15 | = 3. The bottom panel is a pyramid that
             has edge √ lengths |e25 | = |e32 | = |e34 | = |e54 | = 1, |e12 | = |e14 | = 2, and
             |e15 | = 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Figure 4.11: Comparison of Betti number and non-harmonic spectra of Lδ,δ               n when
             n = 0, 1,c and 2 on pyramids Pyra 1 and Pyra 2. Note that since
             β2δ,δ = 0, it cannot distinguish Pyra 1 and Pyra 2. But λδ,δ       2 can tell the
             difference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Figure 4.12: a The 3D structures of CB7, 2 glycolurils, and path direction assign-
             ment. Here, from left to right, the side view of CB7, top view of CB7,
             the structure of two glycoluril units (=C10 H4 N8 O4 =), and electronegativity-
             based path direction assignment are depicted as well. b Illustra-
             tion of filtration-induced geometries Gi (i = 1, 2, . . . , 8) of CB7. Eight
             digraphs G1 = G0.200   0    , G2 = G00.565 , G3 = G0.710
                                                                   0    , G4 = G00.745 , G5 =
                0.800          1.210
             G0 , G6 = G0 , G7 = G0 , G8 = G0  1.315          1.800
                                                                     are constructed under
             filtration parameter δ. c Illustration of filtration-induced path com-
             plexes within two glycoluril units. Path directions can be inferred
             from their colors as shown in the last chart of a. d Betti numbers
             βnδ,δ and non-harmonic spectra λ̃δ,δ    n of persistent path Laplacians (Ln
                                                                                              δ,δ
             when n = 0, 1, and 2) for CB7. . . . . . . . . . . . . . . . . . . . . . . . . 97
Figure 5.1: The 3D structures of C20 and C60 . (a) C20 molecule. A total of 12
             pentagon rings can be found in C20 . (b) C60 molecule. 12 pentagon
             rings and 20 hexagon rings form the structure of C60 . . . . . . . . . . . . 106
                                                  xviii


Figure 5.2: Illustration of the harmonic spectra (for Rips complex) β0r,0 , β0r,0 , and
            β2r,0 (green curves from top chart to bottom chart) and the smallest
            non-zero eigenvalue λr,0       0 , λ1 , and λ2 (yellow curves from top chart
                                                 r,0         r,0
            to bottom chart) of C20 molecule (the bottom left chart in Figure 5.6)
            at different filtration values α calculated from HERMES. Here, the
            x-axis represents the radius filtration value r (unit: Å), the left-y-
            axes represents the number of zero eigenvalues of Lr,0               0 , L1 , and L1
                                                                                        r,0          r,0
            from top to bottom, and the right-y-axes represents the first non-zero
            eigenvalue of Lr,0  0 , L1 , and L2 from top to bottom. . . . . . . . . . . . . 108
                                       r,0           r,0
Figure 5.3: Illustration of the harmonic spectra (for alpha complex) β0α,0.05 , β0α,0.05 ,
            and β2α,0.05 (green curves from top chart to bottom chart) and the
            smallest non-zero eigenvalue λ0α,0.05 , λ1α,0.05 , and λα,0.05   2     (yellow curves
            from top chart to bottom chart) of the C20 molecule (the bottom left
            chart in Figure 5.6) at different filtration value α calculated from HER-
            MES. Here, the x-axis represents the radius filtration value α (unit:
            Å), the left-y-axes represents the number of zero eigenvalues of L0α,0.05 ,
            Lα,0.05
              1      , and Lα,0.05
                             1      from top to bottom, and the right-y-axes represents
            the first non-zero eigenvalue of L0α,0.05 , Lα,0.05      1   , and L2α,0.05 from top to
            bottom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Figure 5.4:  Illustration of the harmonic spectra β0r,0 , β0r,0 , and β2r,0 (blue curves
            from top chart to bottom chart) and the smallest non-zero eigenvalue
              0 , λ1 , and λ2 (red curves from top chart to bottom chart) of the
            λr,0     r,0       r,0
            C60 molecule (the bottom left chart in Figure 5.6) at different filtration
            value α calculated from HERMES. Here, the x-axis represents the ra-
            dius filtration value α (unit: Å), the left-y-axes represents the number
            of zero eigenvalues of Lr,0      0 , L1 , and L1 from top to bottom, and the
                                                    r,0          r,0
            right-y-axes represents the first non-zero eigenvalue of Lr,0              0 , L1 , and
                                                                                             r,0
              2 from top to bottom. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
            Lr,0
Figure 5.5:  Illustration of the harmonic spectra β0α,0.05 , β0α,0.05 , and β2α,0.05 (green
            curves from top chart to bottom chart) and the smallest non-zero
            eigenvalue λα,0.05
                             0     , λα,0.05
                                       1      , and λ2α,0.05 (yellow curves from top chart
            to bottom chart) of the C60 molecule (the bottom left chart in Fig-
            ure 5.6) at different filtration value α calculated from HERMES. Here,
            the x-axis represents the radius filtration value α (unit: Å), the left-y-
            axes represents the number of zero eigenvalues of L0α,0.05 , Lα,0.05           1     , and
            L1α,0.05
                       from top to bottom, and the right-y-axes represents the first
            non-zero eigenvalue of Lα,0.05     0     , Lα,0.05
                                                         1     , and Lα,0.05
                                                                       2     from top to bottom. . . 111
                                                       xix


Figure 5.6: The alpha carbon network plots of 15 proteins: PDB IDs 1CCR, 1NKO,
            1O08, 1OPD, 1QTO, 1R7J, 1V70, 1W2L, 1WHI, 2CG7, 2FQ3, 2HQK,
            2PKT, 2VIM, and 5CYT from left to right and top to bottom. The
            color represents the normalized diagonal element of the accumulated
            Laplacian at each alpha carbon atom. . . . . . . . . . . . . . . . . . . . . 112
Figure 5.7: Illustration of the harmonic spectra βqα,0 (blue curve) and the smallest
            non-zero eigenvalue λα,0     q (red curve) of PDB ID 5CYT (the bottom left
            chart in Figure 5.6) at different filtration values α when q = 0, 1, 2. The
            βqα,0 are calculated from Gudhi, DioDe, and HERMES, and λα,0               q   are
            obtained only from HERMES. Here, the x-axis represents the radius
            filtration value α (unit: Å), the left-y-axis represents the number of
            zero eigenvalues of Lα,0    q , and the right-y-axis represents the first non-
            zero eigenvalue of Lα,0   q . Note that the harmonic spectra from the three
            methods are indistinguishable. . . . . . . . . . . . . . . . . . . . . . . . . 113
Figure 5.8: Illustration of the harmonic spectra β0α,0.5 , β0α,0.5 , and β2α,0.5 (green curves
            from top chart to bottom chart) and the smallest non-zero eigenvalue
            λα,0.5
              0    , λα,0.5
                      1     , and λα,0.5
                                    2    (yellow curves from top chart to bottom chart)
            of PDB ID 5CYT (the bottom left chart in Figure 5.6) at different filtra-
            tion values α calculated from HERMES. Here, the x-axis represents
            the radius filtration value α (unit: Å), the left-y-axes represents the
            number of zero eigenvalues of Lα,0.5       0  , L1α,0.5 , and L1α,0.5 from top to
            bottom, and the right-y-axes represents the first non-zero eigenvalue
            of Lα,0.5
                  0   , Lα,0.5
                          1     , and L2α,0.5 from top to bottom. . . . . . . . . . . . . . . . 114
Figure 5.9: (a) The 3D secondary structure of PDB ID 1O08. The blue, purple,
            and orange colors represent helix, sheet, and random coils of PDB ID
            1O08. The ball represents the alpha carbon of PDB ID 1O08. (b) Il-
            lustration of the harmonic spectra βqα,0 (blue curve) and the smallest
            non-zero eigenvalue λα,0      q   (red curve) of PDB ID 1O08 at different fil-
            tration values α when q = 0, 1, 2. The βqα,0 are calculated from Gudhi,
            DioDe, and HERMES, and λα,0           q  are calculated only from HERMES.
            Here, the x-axis represents the radius filtration value α (unit: Å), the
            left-y-axis represents for the number of zero eigenvalue of Lα,0          q , and
            the right-y-axis represents for the non-zero eigenvalues of Lα,0         q . Note
            that the harmonic spectra from three methods are indistinguishable. . . 115
Figure 6.1: The scatter plot of six distinct clusters in the world in July 2020. The
            light blue, dark blue, green, red, pink, and yellow represent Cluster
            I, Cluster II, Cluster III, Cluster IV, Cluster V, and Cluster VI, respec-
            tively. The base color of each country is decided by the color of the
            dominated Cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
                                                    xx


Figure 6.2: Illustration of mutation positions and frequencies on the primer and/or
            probes of RX7038-N1 primer (Fw), RX7038-N1 primer (Rv), RX7038-
            N2 primer (Fw), RX7038-N2 primer (Rv), RX7038-N3 primer (Fw),
            RX7038-N3 primer (Rv), N1-U.S.-P, N2-U.S.-P, N3-U.S.-P, N-Sarbeco-F. . 123
Figure 6.3: Illustration of mutation positions and frequencies on the primer and/or
            probes of N-Sarbeco-P, N-Sarbeco-R, N-China-F, N-China-R, N-China-
            P, N-HK-F, N-HK-R, N-JP-F, N-JP-P, N-TL-F. . . . . . . . . . . . . . . . . 124
Figure 6.4: Illustration of mutation positions and frequencies on the primer and/or
            probes of N-TL-R, N-TL-P, E-Sarbeco-F1, E-Sarbeco-R2, E-Sarbeco-
            P1, nCoV-IP2-12669Fw, nCoV-IP2-12759Rv, nCoV-IP2-12696bProbe(+),
            nCoV-IP4-14059Fw, nCoV-IP4-14146Rv. . . . . . . . . . . . . . . . . . . . 125
Figure 6.5: Illustration of mutation positions and frequencies on the primer and/or
            probes of nCoV-IP4-14084Probe(+), RdRP-SARSr-F2, RdRP-SARSr-
            R1, RdRP-SARSr-P2, ORF1ab-China-F, ORF1ab-China-R, ORF1ab-China-
            P, ORF1b-nsp14-HK-F, ORF1b-nsp14-HK-R, ORF1b-nsp14-HK-P. . . . . 126
Figure 6.6: Illustration of mutation positions and frequencies on the primer and/or
            probes of SC2-F, SC2-R,NIID_WH-1_F501,NIID_WH-1_R913, NIID_WH-
            1_F509, NIID_WH-1_R85, NIID_WH-1_Seq F519, NIID_WH-1_Seq
            R840, WuhanCoV-spk1-f, WuhanCoV-spk1-r, NIID_WH-1_F24381, NIID_WH-
            1_R24873, NIID_WH-1_Seq F24383, NIID_WH-1_Seq R24865. . . . . . . 127
Figure 6.7: The pie chart of the distribution of 12 different types of mutations. . . . 128
Figure 6.8: Illustration of SARS-CoV-2 mutation ratio and mutation h-index one
            various genes. For each gene, its length is given in the mutation ratio
            bar while the number of unique SNPs is given in the h-index bar. . . . . 128
                                            xxi


Figure 6.9: a The mechanism of mutagenesis. Nine mechanisms are grouped
             into three scales: 1) molecular-based mechanism (green color); 2)
             organism-based mechanism (red color); 3) population-based mech-
             anism (blue color). The random shifts (Random), replication error
             (Rep), Transcription error (Transcr), viral proofreading (Proof), and
             recombination (Recomb) are the six molecular-based mechanisms.
             The gene editing and the host-virus recombination are the organism-
             based mechanism. In addition, the natural selection (Natural) is the
             population-based mechanism, which is the mainly driven source in
             the transmission of SARS-CoV-2. b A sketch of SARS-CoV-2 and its
             interaction with host cell. c Illustration of 25 single-site RBD muta-
             tions with top frequencies. The height of each bar shows the BFE
             change of each mutation, the color of each bar represents the natu-
             ral log of frequency of each mutation, and the number at the top of
             each bar means the AI-predicted number of antibody and RBD com-
             plexes that may be significantly disrupted by a single site mutation.
             d Illustration of SARS-CoV-2 S protein with human ACE2. The blue
             chain represents the human ACE2, the pink chain represents the S
             protein, and the purple fragment on the S protein points out the two
             vaccine-resistant mutations Y449S/H. . . . . . . . . . . . . . . . . . . . . 134
Figure 6.10: Most significant RBD mutations. a Time evolution of RBD muta-
             tions with its mutation-induced BFE changes per 60-day from April
             30, 2020, to August 31, 2021. Here, only the top 100 most observed
             RBD mutations are displayed. The height and color of each bar rep-
             resent the log frequency and ACE-S BFE change induced by a given
             RBD mutation. The red star marks the vaccine-resistant mutations
             with significantly negative BFE changes. b Time evolution of RBD
             mutations with its experimental mutation-induced log2 enrichment
             ratio changes per 60-day from April 30, 2020, to August 31, 2021.
             The height and color of each bar represent the log frequency and
             enrichment ratio change induced by a given RBD mutation. The
             red star marks vaccine-resistant mutations with significantly nega-
             tive BFE changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Figure 6.11: Illustration of SARS-CoV-2 mutation-induced BFE changes for the
             complexes of S protein and ACE2. Here, 100 most observed muta-
             tions on S RBD are illustrated. . . . . . . . . . . . . . . . . . . . . . . . . . 138
Figure 6.12: Illustration of the time evolution of 424 ACE2 binding-strengthening
             RBD mutations (blue) and 227 ACE2 binding-weakening RBD muta-
             tions (red) on the S protein RBD of SARS-CoV-2 from Jan 07, 2020 to
             April 18, 2021. The x-axis represents date and y-axis represents the
             natural log of frequency of each mutation. . . . . . . . . . . . . . . . . . 138
                                             xxii


Figure 6.13: The 3D structure of SARS-CoV-2 S protein RBD bound with ACE2
             (PDB ID: 6M0J). We choose blue and red colors to mark the binding-
             strengthening and binding-weakening mutations, respectively. Vac-
             cine escape mutations described in Table 6.6 are labeled. . . . . . . . . . 140
Figure 6.14: Most significant RBD mutations. a The 3D structure of SARS-CoV-2 S
             protein RBD and ACE2 complex (PDB ID: 6M0J). The RBD mutations
             in ten variants are marked with color. b Illustration of the time evo-
             lution of 455 ACE2 binding-strengthening RBD mutations (blue) and
             228 ACE2 binding-weakening RBD mutations (red). The x-axis rep-
             resents the date and the y-axis represents the natural log of frequency.
             There has been a surge in the number of infections since early 2021.
             c BFE changes of RBD complexes with ACE2 and 130 antibodies in-
             duced by 75 significant RBD mutations. A positive BFE change (blue)
             means the mutation strengthens the binding, while a negative BFE
             change (red) means the mutation weakens the binding. Most muta-
             tions, except for vaccine-resistant Y449H and Y449S, strengthen the
             RBD binding with ACE2. Y449S and K417N are highly disruptive to
             antibodies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Figure 6.15: Illustration of SARS-CoV-2 S RBD 100 most observed mutations in-
             duced BFE changes for the complexes of S protein and 106 antibod-
             ies or ACE2. Here, red represents the negative changes that will
             weaken the binding, while green shows the positive changes that will
             strengthen the binding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Figure 6.16: Properties of RBD co-mutations. a Illustration of RBD 2 co-mutations
             with a frequency greater than 90. b Illustration of RBD 3 co-mutations
             with a frequency greater than 30. c Illustration of RBD 2 co-mutations
             with a frequency greater than 20. Here, the x-axis lists RBD co-mutations
             and the y-axis represents the predicted total BFE change between S
             RBD and ACE2 of each set of RBD co-mutations. The number on
             the top of each bar is the AI-predicted number of antibody and RBD
             complexes that may be significantly disrupted by the set of RBD co-
             mutations, and the color of each bar represents the natural log of fre-
             quency for each set of RBD co-mutations. (Please check the interac-
             tive HTML files in the Supporting Information S2.2.4 for a better view
             of these plots.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
                                               xxiii


Figure 6.17: a 2D histograms of antibody disruption count and total BFE changes
             for RBD 2 co-mutations (unit: kcal/mol). b 2D histograms of anti-
             body disruption count and total BFE changes (unit: kcal/mol) for
             RBD 3 co-mutations. c 2D histograms of antibody disruption count
             and total BFE changes (unit: kcal/mol) for RBD 4 co-mutations. d
             The histograms of total BFE changes (unit: kcal/mol) for RBD co-
             mutations. e The histograms of the natural log of frequency for RBD
             co-mutations. f The histograms of antibody disruption count for RBD
             co-mutations. In figures a, b, and c, the color bar represents the num-
             ber of co-mutations that fall into the restriction of x-axis and y-axis.
             The reader is referred to the web version of these plots in the Sup-
             porting Information S2.2.2 and S2.2.3. . . . . . . . . . . . . . . . . . . . . 150
Figure 6.18: A comparison between experimental RBD deep mutation enrichment
             data and predicted BFE changes for SARS-CoV-2 RBD binding to
             ACE2 (6M0J) [3]. Top left: deep mutational scanning heatmap show-
             ing the average effect on the enrichment for single-site mutants of
             RBD when assayed by yeast display for binding to the S protein RBD
             [3]. Right: RBD colored by average enrichment at each residue posi-
             tion bound to the S protein RBD. Bottom left: machine learning pre-
             dicted BFE changes for single-site mutants of the S protein RBD. . . . . . 151
Figure 6.19: Illustration of SARS-CoV-2 mutations given by Mutation Tracker. In-
             teractive version is available at Mutation Tracker. . . . . . . . . . . . . . 153
Figure 6.20: Illustration of the analysis of SARS-CoV-2 mutations given by inter-
             active Mutation Analyzer that is available at Mutation Analyzer. . . . . 154
                                             xxiv


                                      CHAPTER 1
                                    INTRODUCTION
1.1    Topological Laplacian
Persistent homology (PH) is one of the most popular tools in topological data analysis
(TDA), which is constrained to purely topological persistence obtained from its persis-
tent betti numbers. PH has had tremendous success in various fields such as biology
[4], chemistry [5], drug discovery [6], and 3D shape analysis [7]. Inspired by the suc-
cess of PH, multiple advanced mathematical tools in TDA have emerged, and one of
the new rising stars in TDA is the de Rham-Hodge theory in differential geometry. De
Rham-Hodge theory aims to use the differential forms to represent the cohomology of an
oriented closed Riemannian manifold with boundary in terms of a topological Laplacian
named Hodge Laplacian [8]. Similar to homology, the de Rham-Hodge theory fails to
give an in-depth analysis of data through Hodge Laplacians. Therefore, the evolutionary
de Rham-Hodge theory [9] was introduced to alleviate or heal problems arising in the de
Rham-Hodge. A persistent Hodge Laplacian was developed to offer a multiscale-level
analysis on a family of evolutionary manifolds. Such a method provides an answer to
the old question “can one hear the shape of a drum" [10]. One can decode the topological
persistence and the homotopic shape evolution of data during filtration by calculating the
harmonic and non-harmonic spectra of persistent Hodge Laplacians.
    Nonetheless, one main concern we should address in evolutionary de Rham-Hodge
theory is that it is set up on the Riemannian manifold, which is quite computational-
consuming in real applications. Therefore, seeking a method that can reduce the compu-
tational complexity is indeed needed. One natural idea to overcome this issue is to set
up a similar system on the discrete points instead of the Riemannian manifold. Hence,
a multiscaled-based topological Laplacian, namely persistent spectral graph (PSG) [11],
                                            1


was introduced by creating low-dimensional multiscale representations (i.e., persistent
combinatorial graph Laplacians, , persistent Laplacians) on graphs. In PSG theory, fami-
lies of persistent Laplacian matrices (PLMs) corresponding to various topological dimen-
sions are constructed via filtration to sample a given dataset at multiple scales. The har-
monic spectra from the null spaces of PLMs offer the same topological invariants, namely
persistent Betti numbers, at various dimensions as those provided by PH, while the non-
harmonic spectra of PLMs give rise to additional geometric analysis of the shape of the
data. Meanwhile, we developed an open-source software package called highly efficient
robust multidimensional evolutionary spectra (HERMES), to enable broad applications
of PSGs in science, engineering, and technology. To ensure the reliability and robustness
of HERMES, we have validated the software with simple geometric shapes and complex
datasets from three-dimensional (3D) protein structures. We found that the smallest non-
zero eigenvalues are very sensitive to data abnormality.
    It is noticed that the persistent Laplacians are insensitive to asymmetry or directed
relations (i.e, they treat all data points equally). That is to say, each point does not carry
any labeled information such as the type, mass, color, etc. Therefore, they fail to represent
the structures that have directional information. Undoubtedly, we need a method that
has a flavor to deal with asymmetry structures. Notably, the path homology [12] pro-
posed by Grigor’yan, Yong Lin, Yuri Muranov, and S.-T.Yau provides a powerful tool to
analyze datasets with asymmetric structures. To encode richer information, Chowdhury
and Mémoli extended path homology to a persistent framework on a directed network
[13] call persistent path homology (PPH). Such methods are perfect tools for us to fix the
aforementioned issue in the persistent Laplacian. Similar to the PH, PPH also decodes
purely topological persistence and cannot track the homotopic shape evolution of data
during filtration. To overcome the limitation of PPH, persistent path Laplacian (PPL) is
introduced to capture the shape evolution of data. PPL’s harmonic spectra fully recover
PPH’s topological persistence and its non-harmonic spectra reveal the homotopic shape
                                               2


evolution of data during filtration.
    Topological Laplacians are powerful tools to extract both topological invariants and
geometric deformation of a given system. In this dissertation, we mainly discuss two
new multiscale-based topological Laplacians: persistent Laplacians and persistent path
Laplacians, and their applications in life science, especially in the fields of molecular bi-
ology.
1.2    Mathematical Modeling of Virology
Since its first case was identified in Wuhan, China, in December 2019, coronavirus disease
2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-
2) has expeditiously spread to as many as 226 countries and territories worldwide and
led to over 541 million confirmed cases and over 6.3 million fatalities as of June 2022.
This pandemic has also brought a massive economic recession globally. The countries all
around the world have implemented a variety of policies to tackle the COVID-19 pan-
demic.
    Many SARS-CoV-2 vaccines and monoclonal antibodies (mAbs) have already obtained
the use authorization worldwide (See Coronavirus Vaccine Tracker). Additionally, U.S.
Food and Drug Administration (FDA) has given the emergency use authorization to the
oral SARS-CoV-2 Mpro inhibitor PAXLOVID (PF-07321332) developed by Pfizer[14, 15].
However, COVID-19 has a high infection rate, high prevalence, long incubation period
[16], asymptomatic transmission [17, 18, 19], and potential seasonal pattern [20]. SARS-
CoV-2 keeps involving into new infectious and antibody resistant variants [21, 22, 23].
Therefore, it is imperative to understand its viral molecular mechanism [24], track its
genetic evolution [25], and continuously improve the efficacy of antiviral drugs and anti-
body therapies.
    Belonging to the β-coronavirus genus and coronaviridae family, SARS-CoV-2 is an un-
segmented positive-sense single-stranded RNA (+ssRNA) virus with a compact 29,903
                                              3


        2,000 4,000      6,000    8,000    10,000    12,000    14,000    16,000     18,000   20,000  22,000    24,000    26,000     28,000   29,903
                                                           13,468
 5’                      ORF1a                                                                                    ORF2-ORF10                     3’
                                                                             ORF1b
                                                          13,442
                                                        NSP11
                                                NSP6 NSP10                              NSP14             Spike              EM          N
                                                                                                    (21,563 - 25,384)
       NSP2            NSP3             NSP4                      NSP12        NSP13                         (26,245 - 26,472)
  NSP1                                        NSP5                                                             (26,523 - 27,191) (28,274 - 29,533)
                    Papain-like                              RNA-dependent
                                       Main Protease        RNA polymerase                NSP15
                     Protease                                                                                            Accessory factors
                  (4,955 - 5,900)     (10,055 - 10,972)      (13,442 - 16,236)               NSP16
                                                                               Helicase                                          6 7b 9c 10
                                                   NSP7                                                                 3a            8
                                                                          (16,237 - 18,043)                                        7a 9b
                                                     NSP8
                                                      NSP9                                                             3b
                             Figure 1.1: Genomics organization of SARS-CoV-2.
nucleotide-long genome and the diameter of each SARS-CoV-2 virion is about 50-200
nm [26]. In the first 20 years of the 21st century, β-coronaviruses have triggered three
major outbreaks of deadly pneumonia: SARS-CoV (2002), Middle East respiratory syn-
drome coronavirus (MERS-CoV) (2012), and SARS-CoV-2 (2019) [27]. Like SARS-CoV
and MERS-CoV, SARS-CoV-2 also causes respiratory infections, but at a much higher in-
fection rate [28, 29]. The complete genome of SARS-CoV-2 comprises 15 open reading
frames (ORFs), which encodes 29 structural and non-structural proteins (nsps). The 16
non-structural proteins nsp1-nsp16 get expressed by protein-coding genes ORF1a and
ORF1b, while four canonical 3’ structural proteins: spike (S), envelope (E), membrane
(M), and nucleocapsid (N) proteins, as well as accessory factors, are encoded by other
four major ORFs, namely ORF2, ORF4, ORF5, and ORF9 (see Figure 1.1) [30, 31, 32, 33].
    The viral structure of SARS-CoV-2 can be found in Figure 1.1. This structure is formed
by the four structural proteins: the N protein holds the RNA genome, the S protein helps
virus enter into the host cell, and M and E proteins define the shape of the viral envelope
[34]. The studies on SARS-CoV-2 as well as previous SARS-CoV and other coronaviruses
have mostly identified the functions of these structural proteins, nonstructural proteins as
well as accessory proteins. Their 3D structures are also largely known from experiments
or predictions.
    With these SARS-CoV-2 proteins, the intracellular viral life cycle of SARS-CoV-2 can
                                                                       4


                                                                                                                                                           SARS-CoV-2
     New SARS-CoV-2                                                      I(a)                                                                                           S
                                                                                                                                                    N
                                                                                                                                                                        M
                                                                             TMPRSS2                                                                E
                                                                                                                         ne
                                                                                        Ribosome       at
                                                                                                                      ra
                                                                                   5'               3'                                          ACE2
                                                                                                                try b                      e
                                                                                                              en m
                                 VI Virus release
                                                                                                             s me                       om
                                                                                  II            R
                                                                                                          iru a                       en
                                                                                                 el
                                                                                                   ea                                          I(b) Virus entry
                                                                                                       ) V sm                      lg
                                                               pp1a and pp1b
                                                                                                                                               via endosomes
                                                                                                                                ra
                                                                                                    I(a pla                  vi
             Golgi                                              Mpro/PLpro                                                of
                                                                                              Replicase               se
                            V                                 Functional nsps
                                                           RNA replication and packing
                                      N                                                  3'                   5'
                                          in                           III
                                               cy
                     En                          top                                               Transcription
                        p                           las                             IV
                     do                                m
                      las                                                       Translation
                          mi                               Nuclecapsid (N)
                            cr
                            eti c
                                                                 Spike (S)
                                ulu
                                 m
   Nucleus                          (E                      Membrane (M)
                                      R)
                                                              Envelope (E)
Figure 1.2: Six stages of the SARS-CoV-2 life cycle. Stage I: Virus entry. I(a) Virus can
enter the host cell via plasma membrane fusion. I(b) Virus can enter the host cell via
endosomes. Stage II: Translation of viral replication. Stage III: Replication. Here, nsp12
(RdRp) and nsp13 (helicase) cooperate to perform the replication of the viral genome.
Stage IV: Translation of viral structure proteins. Stage V: Virion assembly. Stage VI:
Release of a virus.
be realized [35]. This life cycle has six stages as shown in Figure 1.2. The first stage is
the entry of the virus. SARS-CoV-2 enters the host cell either via endosomes or plasma
membrane fusion. In both ways, the S protein of SARS-CoV-2 first attaches to the host
cell-surface protein, angiotensin-converting enzyme 2 (ACE2). Then, the cell’s protease,
TMPRSS2, cuts and opens the S protein of the virus, exposing a fusion peptide in the S2
subunit of the S protein [36]. After fusion, an endosome forms around the virion, sepa-
rating it from the rest of the host cell. The virion escapes when the pH of the endosome
drops or when cathepsin, a host cysteine protease, cleaves it. The virion then releases its
RNA into the cell [37]. After the RNA release, polyproteins pp1a and pp1ab are trans-
lated. Notably, facilitated by viral papain-like protease (PLpro), nsp1, nsp2, nsp3, and the
                                                                                  5


amino terminus of nsp4 from the pp1a and pp1ab are released. Moreover, nsp5-nsp16 are
also cleaved proteolytically by the main protease [38]. The next stage of the life cycle is
the replication process, where nsp12 (RdRp) and nsp13 (helicase) cooperate to perform
the replication of the viral genome. Stages IV and V are the translation of viral structural
proteins and the virion assembly process. In these stages, structural proteins S, E, and M
are translated by ribosomes and then present on the surface of the endoplasmic reticulum
(ER), which is transported from the ER through the Golgi apparatus for the preparation
of virion assembly. Meanwhile, multiple copies of N protein package the genomics RNA
in cytoplasm, which interacts with other 3 structural proteins to direct the assembly of
virions. Finally, virions will be secreted from the infected cell through exocytosis.
    Since the initial outbreak of the COVID-19, the raging pandemic caused by SARS-
CoV-2 has lasted over two years. We do have many promising vaccines, but they might
have side effects and their full side effects, particularly, long-term side effects, remain
unknown. To make things worse, near 29260 unique mutations have been recorded for
SARS-CoV-2 as shown by Mutation Tracker ( https://users.math.msu.edu/users/weig/
SARS-CoV-2_Mutation_Tracker.html). All of these reveal the sad reality that our cur-
rent understanding of life science, virology, epidemiology, and medicine is severely lim-
ited. Ultimately, the core of challenges is the lack of molecular mechanistic understand-
ings of many aspects, namely coronavirus RNA proofreading, virus-host cell interactions,
antibody-antigen interactions, protein-protein interactions, protein-drug interactions, vi-
ral regulation of host cell functions, including autophagocytosis and apoptosis, and ir-
regular host immune response behavior such as cytokine storm and antibody-dependent
enhancement. Molecular-level experiments on SARS-CoV-2 are both expensive and time-
consuming and require to take heavy safety measures. Moreover, disparities among re-
ported experimental binding affinities can be more than 100 fold for the receptor-binding
domain (RBD) of S protein binding to ACE2 or antibodies (see Table 1 of Ref. [39]). All
these complicated realities make the understanding of viral evolution and transmission
                                               6


mechanism some of the most challenging tasks.
    On the other hand, computational tools provide alternative approaches in understand-
ing viral evolution and transmission with higher efficiency and lower costs. The increas-
ing computer power, the accumulation of molecular data, the availability of artificial in-
telligence (AI) algorithms, and the development of new mathematical tools have paved
the road for mechanistic understanding from molecular modeling, simulations, and pre-
dictions.
    In May 2020, we developed an intensively validated topology-based neural network
model [40] called TopNetmAb to predict certain RBD mutations. It showed that RBD
residues 452 and 501 were predicted to “have very high chances to mutate into signifi-
cantly more infectious COVID-19 strains” in summer 2020 [41] and were later confirmed
in prevailing SARS-CoV-2 variants Alpha, Beta, Gamma, Delta, Theta, Epsilon, Kappa,
Lambda, Mu, and Omicron. These predictions [41], achieved via the integration of deep
learning, biophysics, genotyping, and advanced mathematics, are some of the most re-
markable events.
    Additionally, 3,696 possible RBD mutations were classified into three categories with
different appearance likelihoods, namely, 1149 most likely, 1912 likely, and 625 unlikely
[41]. The predicted “most likely” partition successfully contained all the newly observed
RBD mutations, until the recent appearance of S371L from Omicron BA.1. Most remark-
ably, the mechanism governing SARS-CoV-2 evolution and transmission, i.e., natural se-
lection via mutation-strengthened infectivity, was discovered in July 2020 [41] when there
were only 89 RBD mutations with the highest observed frequency of merely 50 globally
[41].
    In April 2021, this mechanism was confirmed beyond any doubt. By using 506,768
sequences isolated from patients, the authors demonstrated that the predicted binding
free energy (BFE) changes of the 100 most observed RBD mutations out of 651 existing
RBD mutations are all above the BFE change of -0.28 kcal/mol, indicating evolution fa-
                                             7


vors variants having higher infectivity [2]. Moreover, using network-based modeling for
drug repurposing, it was found out Baricitinib as a potential treatment for COVID-19[42].
These extraordinary results prove that mathematical modeling of virology spearhead the
discovery of new drugs and the mechanisms of SARS-CoV-2 evolution and transmission.
1.3    Outline
In Chapter 2, we provide a mathematical background in two topological Laplacians: per-
sistent Laplacians and persistent path Laplacians. Also, vital examples are involved to
illustrate how we construct two types of topological Laplacians on a given point-cloud
dataset. In Chapter 3, we review the theoretically details in the mathematical modeling of
virology, including the methods in the genomics analysis and the structure of the math-
AI models that we used in the SARS-CoV-2 studies. In Chapter 4, we mainly discuss
the applications in the PL and PPL, and their advantages compared to other topologi-
cal Laplacians. We further introduce an open-source package called HERMES, which is
designed to extract the harmonic and non-harmonic spectra of persistent Laplacians. In
addition, the validation of the HERMES is also discussed in the Chapter 5 to show its ac-
curacy, robustness, and reliability on standard test datasets and multiple complex protein
structures. Chapter 6 includes several applications in the study of SARS-CoV-2, including
the mutational impacts on the SARS-CoV-2 diagnostic targets, vaccines, antibodies, along
with the discussion about the mechanisms of SARS-CoV-2 evolution and transmission.
The dissertation contribution is summarized in Chapter 7.
                                             8


                                                 CHAPTER 2
                            METHODS ON TOPOLOGICAL LAPLACIANS
2.1     Persistent Laplacians
2.1.1    Simplex
                                                                            q
                                                                          X
Let {v0 , v1 , · · · , vq } be a set of points in R . A point v =
                                                     n
                                                                               λi vi , λi ∈ R is an affine
                                                                           i=0
                               q
                             X
combination of vi if             λi = 1. An affine hull is the set of affine combinations. Here,
                             i=0
q + 1 points v0 , v1 , · · · , vq are affinely independent if v1 − v0 , v2 − v0 , · · · , vq − v0 are linearly
independent. A q-plane is well-defined if the q + 1 points are affinely independent. In Rn ,
one can have at most n linearly independent vectors. Therefore, there are at most n + 1
                                                                      q
                                                                    X
affinely independent points. An affine combination v =                   λi vi is a convex combination
                                                                    i=0
if all λi are non-negative. The convex hull is the set of convex combinations.
    A (geometric) q-simplex denoted as σq is the convex hull of q + 1 affinely independent
points in Rq with dimension dim(σq ) = q. A 0-simplex is a vertex, a 1-simplex is an edge,
a 2-simplex is a triangle, and a 3-simplex is a tetrahedron, as shown in Figure 2.1. The
convex hull of each nonempty subset of q + 1 points forms a subsimplex and is regraded
as a face of σq denoted τ . The p-face of a q-simplex is the subset {vi1 , · · · , vip } of the q-
simplex.
                   (a)                    (
                                          b)            (
                                                        c)                         (d)
Figure 2.1: Illustration of simplices. (a) 0-simplex (a vertex), (b) 1-simplex (an edge), (c)
2-simplex (a triangle), and (d) 3-simplex (a tetrahedron).
                                                      9


2.1.2     Simplicial Complex
A simplicial complex is a powerful algebraic topology tool that has wide applications in
graph theory, topological data analysis [43], and many physical fields [44]. We briefly
review simplicial complexes to generate notation and provide essential preparation for
introducing persistent spectral graphs. A (finite) simplicial complex K is a (finite) collec-
tion of simplices in Rn satisfying the following conditions
   (1) If σq ∈ K and σp is a face of σq , then σp ∈ K.
   (2) The non-empty intersection of any two simplices σq , σp ∈ K is a face of both of σq
        and σp .
Each element σq ∈ K is a q-simplex of K. The dimension of K is defined as dim(K) =
max{dim(σq ) : σq ∈ K}. To distinguish topological spaces based on the connectivity of
simplicial complexes, one uses Betti numbers. The k-th Betti number, βk , counts the num-
ber of k-dimensional holes on a topological surface. The geometric meaning of Betti num-
bers in R3 is the following: β0 represents the number of connected components, β1 counts
the number of one-dimensional loops or circles, and β2 describes the number of two-
dimensional voids or holes. In a nutshell, the Betti number sequence {β0 , β1 , β2 , · · · } re-
veals the intrinsic topological property of the system. To illustrate the simplicial complex
and its corresponding Betti number, we have designed two simple models as is shown in
               1
Figure 2.2.
           (
           a )               (b)            ( c
                                              )            (
                                                           d)           (
                                                                        e)               (f)
                            Figure 2.2: Illustrations of simplicial complexes.
     1
       These examples show an intuitive way to count Betti numbers. However, it is impossible to generate
structures (b), (e), and (f) in Rips complex.
                                                    10


Table 2.1: The Betti number of simplicial complexes in Figure 2.2. Each color represents
different faces. The tetrahedron-shaped simplicial complexes are demonstrated in (a)-(c),
and the cube-shaped simplicial complexes are depicted in (d) - (f). (a) and (d) only has
0-simplices and 1-simplices, (b) has four 2-simplices, and (c) has one more 4-simplex. (e)
and (f) do not have any 2-simplex.
   Betti number     Fig. 3 (a)   Fig. 3 (b)   Fig. 3 (c)    Fig. 3 (d)   Fig. 3 (e)   Fig. 3 (f)
         β0             1             1            1            1            1            1
         β1             3             0            0            5            0            0
         β2             0             1            0            0            1            0
    Recall that in graph theory, the degree of a vertex (0-simplex) v is the number of edges
that are adjacent to the vertex, denoted as deg(v). However, once we generalize this
notion to q-simplex, problem arouse since a q-simplex can have (q − 1)-simplices and
(q + 1)-simplices adjacent to it at the same time. Therefore, the upper adjacency and
lower adjacency are required to define the degree of a q-simplex for q > 0 [45, 46].
Defination 2.1.1 Two q-simplices σqi and σqj of a simplicial complex K are lower adjacent if they
                                          L
share a common (q − 1)-face, denoted σqi ∼ σqj . The lower degree of q-simplex, denoted degL (σq ),
is the number of nonempty (q − 1)-simplices in K that are faces of σq , which is always q + 1.
Defination 2.1.2 Two q-simplices σqi and σqj of a simplicial complex K are upper adjacent if they
                                          U
share a common (q + 1)-face, denoted σqi ∼ σqj . The upper degree of q-simplex, denoted degU (σq ),
is the number of (q + 1)-simplices in K of which σq is a face.
Then, the degree of a q-simplex (q > 0) is defined as:
                     deg(σq ) = degL (σq ) + degU (σq ) = degU (σq ) + q + 1.                 (2.1)
2.1.2.1   Delaunay Triangulation and Alpha Shapes
In this section, we provide the details on a practical construction of filtration for persistent
spectral graph theory based on the alpha complex. The alpha complex can be regarded
as a simplicial complex, which is a homotopy equivalent to the nerve of balls around
data points. Its geometric realization built as the union of convex hulls of points in each
                                                11


simplex is called the alpha shape. First proposed in 1983, t he alpha shape defined the
shape associated with a finite set of points in the plane controlled by one parameter [47].
    In the following, we first describe how to construct the alpha shape, and then provide
some necessary concepts for the implementation of the alpha complex in PSG theory. Let
P be a finite set of points in qD Euclidean space Rq (q = 2 or 3 in most applications), and
α be a positive real number. Denote an open ball with radius α as an alpha ball (α-ball).
We say that an α-ball is empty if it contains no point of P , and the alpha hull (α-hull) of
P is the set of points that do not belong to any empty α-ball. For any subset T ⊆ P with
size |T | = k + 1, 0 ≤ k ≤ q, the geometric realization of k-simplex σT is the convex hull
of T . We say that a k-simplex σT is α-exposed if there exists an empty α-ball b such that
T = ∂b ∩ P for 0 ≤ k ≤ q − 1. Denoting the collection of α-exposed k-simplices as Fk,α for
0 ≤ k ≤ q − 1, the alpha shape (α-shape) of P is the polytope whose boundary consists
of the k-simplices in Fk,α . The alpha complex is just the simplicial complex that is the
collection of the simplices in the alpha shape.
    There are two structures that are closely related to the alpha shape and helpful in
efficient implementation of alpha shape and alpha complex. One is the Voronoi diagram
[48] and the other is its dual structure, the Delaunay tessellation [49]. The latter is the
alpha complex for sufficiently large α, e.g., when α is greater than the diameter of P .
Thus, the Delaunay tessellation is the final complete simplicial complex in the filtration
that we use.
    For a given set of points P = {p1 , p2 , · · · , pn } ⊆ Rq , the Voronoi cell Vi of a point pi ∈ P
contains all of the points for which pi is the closest among all the points in P ,
                      Vi = {x ∈ Rq |    ∥x − pi ∥ ≤ ∥x − pj ∥,        ∀pj ∈ P }.                  (2.2)
The Voronoi diagram of P is the set of Voronoi cells, which is defined as
                              VorP = {Vi | ∀i ∈ {1, 2, · · · , |P |}}.                            (2.3)
The Delaunay tessellation for a given set P in general position (i.e., no q + 1 ponits are in
                                                   12


a (q −1)-D linear subspace, and no q + 2 points share the same circumsphere) is the dual
simplicial complex to the Voronoi diagrams. For instance, a Delaunay tessellation for a
given set P in 2D is a triangulation DT(P ) such that no point in P is inside the circumcircle
of any triangle in DT(P ) [50, 51]. A formal way to define the Delaunay tessellation is to
use the nerve of the collection of Voronoi cells (Nrv(VorP )), which can be expressed as
                                                                        \
                         DT(P ) = Nrv(VorP ) = {J ⊆ {1, 2, ..., |P |} |     Vi ̸= ∅},         (2.4)
                                                                        i∈J
under the condition that the points in P are general position. Note that, in practice, a
set of points that are not in general position can be symbolically perturbed to general
position.
Figure 2.3: Illustration of Voronoi diagram, Delaunay triangulation, and Non-Delaunay
triangulation. Left chart: The Voronoi diagram and its dual Delaunay triangulation. The
points set is P = {A,B,C,D,E} and the Delaunay is defined as DT(P ). The blue lines tessel-
late the plane into Voronoi cells. The red circle are the circumcircles of triangles in DT(P ).
Right chart: A Non-Delaunay triangulation. Vertices E and D are in the green circumcir-
cles, implying the right chart is an example of Non-Delaunay triangulation.
    Next, we introduce the mathematical description of the construction of alpha complex
through the union of balls centered at points in P , which is essentially a van der Waals
surface for atoms positioned at P with the same radius α. For a given set of points P =
{p1 , p2 , · · · , pn } in Rq and a positive real number α, we can denote the closed ball centered
at pi as Bi (α) = pi + αBq , where Bq is a qD unit ball around the origin. The union of these
                                                    13


balls can be expressed as
                         U (α) = {x ∈ Rq | ∃pi ∈ P s.t. ∥x − pi ∥ ≤ α}.                  (2.5)
To ensure that we obtain a subcomplex of the Delaunay tessellation, we intersect Bi (α)
with its corresponding Voronoi cell,
                                     Ri (α) = Bi (α) ∩ Vi .                              (2.6)
It can be observed that U (α) = ∪pi ∈P Ri (α), so the Ri ’s is a covering of U (α). The alpha
complex Kα is the simplicial complex representing the nerve of this covering,
                                                        \
                          Kα = {J ⊆ {1, 2, ..., |P |} |     Ri (α) ̸= ∅}.                (2.7)
                                                        i∈J
The equivalence to the original definition can be readily checked. The union of all sim-
plices in the alpha complex forms the alpha shape. Figure 2.3 illustrates the Voronoi
diagram, Delaunay triangulation, and non-Delaunay triangulation. The point set is P =
{A,B,C,D,E}, and the blue lines in the left chart of Figure 2.3 separate the plane into the
Voronoi cells. The red circles are the empty circumcircles for triples of points in P . We
can notice that no four points are on the same red circle, which satisfies the uniqueness
condition for constructing the Delaunay triangulation. In the right chart of Figure 2.3,
the green circumcircle of ACD contains E and the green circumcirlce of AEC contains D,
indicating that those two triangles do not belong to the Delaunay triangulation.
    Figure 2.4 illustrates the standard filtration of alpha complexes. The top left figure is
the Delaunay triangulation of six 2D points A, B, C, D, E, and F. With an ever-growing
radius α centered at these points, a family of sub-complexes of the Delaunay triangulation
can be constructed. Figure 2.5 shows the persistence barcode of these 6 points. It can
be seen that when α = 0.2, all six points are disconnected, indicating that 6 0-cycles
(connected components) existed, which matches with Figure 2.5, where there are a total
of 6 bars when α = 0.2. With the radius α continually increasing, a 1-cycle will be formed,
and the associated alpha shape are shown in the bottom left chart of Figure 2.4. One
                                                14


can notice that in Figure 2.5, when α = 0.6, β1α,0 = 1. When α reaches 0.83, the 1-cycle
disappears and β1α,0 = 0 as shown in the bottom left panel of Figure 2.4. Table 2.2 and
Table 2.3 show how we construct the qth-order persistent Laplacian Lt,p q and calculate the
                                                          q from the simplicial complexes
harmonic (βqt,p ) and non-harmonic persistent spectra of Lt,p
K0.2 to K0.6 and K0.6 to K0.6 .
Figure 2.4: Illustration of 2D Delaunay triangulation, alpha shapes, and alpha complexes
for a set of 6 points A, B, C, D, E, and F. Top left: The 2D Delaunay triangulation. Top
right: The alpha shape and alpha complex at filtration value α = 0.2. Bottom left : The
alpha shape and alpha complex at filtration value α = 0.6. Bottom right: The alpha shape
and alpha complex at filtration value α = 1.0. Here, we use dark blue color to fill the
alpha shape.
Figure 2.5: The persistent barcode for a set of points as illustrated in Figure 2.4 that are
generated from Gudhi and DioDe.
                                             15


Table 2.2: The matrix representation of q-boundary operator and its qth-order persistent
Laplacian with corresponding dimension, rank, nullity, and spectra from alpha complex
K0.6 → K0.6 .
          q                               q=0                                             q=1                           q=2
                                                                                             DEF
                         AB    BC    CD      DE    EF    DF    AE
                                                                                      AB       0
                                                                                                   
                    A    −1     0      0       0    0     0    −1
                                                                 
                                                                                      BC       0
                    B     1   −1       0       0    0     0     0
                                                                                                   
                                                                                      CD       0
                                                                                                 
         0.6,0
      Bq+1          C     0     1     −1       0    0     0     0                                                         /
                                                                                                 
                                                                                      DE       1
                                                                                                 
                    D     0     0      1     −1     0    −1     0
                                                                                                 
                                                                                      EF       1
                                                                                                 
                    E     0     0      0       1   −1     0     1
                                                                                                 
                                                                                      DF     −1    
                    F     0     0      0       0    1     1     0
                                                                                      AE       0
                                                                                                                          DEF
                                                                           AB  BC      CD    DE      EF    DF AE
                                                                                                                     AB     0
                                                                                                                             
                                                                      A    −1   0       0     0       0     0 −1
                                                                                                                  
                                                                                                                     BC     0
                                                                      B     1  −1       0     0       0     0   0
                                                                                                                             
                                A    B    C     D  E   F                                                             CD     0
                                                                                                                           
        Bq0.6                                                         C     0   1      −1     0       0     0   0
                                                                                                                           
                              [ 0    0    0     0  0    0 ]                                                          DE     1
                                                                                                                           
                                                                      D     0   0       1    −1       0    −1   0
                                                                                                                           
                                                                                                                     EF     1
                                                                                                                           
                                                                      E     0   0       0     1     −1      0   1
                                                                                                                           
                                                                                                                     DF   −1 
                                                                      F     0   0       0     0       1     1   0
                                                                                                                     AE     0
                                                                           2   −1      0     0      0      0   1
                                                                                                                 
                            2    −1      0       0   −1     0
                                                              
                                                                         −1   2      −1     0      0      0  0
                           −1     2     −1       0    0      0
                                                                                                                  
                                                                         0    −1      2   −1       0     −1  0
                                                                                                               
                            0    −1      2      −1    0      0
                                                                                                                
      L0.6,0                                                             0    0      −1     3      0      0  1          [3]
                                                                                                               
         q
                            0     0     −1       3   −1     −1
                                                                                                                
                                                                         0    0       0     0      3      0  −1
                                                                                                               
                           −1     0      0      −1    3     −1
                                                                                                                
                                                                         0    0      −1     0      0      3  0   
                            0     0      0      −1   −1     2
                                                                           1   0       0     1     −1      0   2
       βq0.6,0                                1                                              1                            0
   dim(L0.6,0
            q   )                             6                                              7                            1
   rank(L0.6,0
            q   )                             5                                              6                            1
  nullity(L0.6,0
              q   )                           1                                              1                            0
   Spec(L0.6,0
            q   )             {0, 1, 1.5858, 3, 4, 4.4142}                    {0, 1, 1.5858, 3, 3, 4, 4.4142}            {3}
2.1.2.2         Vietoris-Rips Complex
Vietoris-Rips complex is an abstract simplicial complex. It is commonly used in various
applications. For a given set of points P = {p1 .p2 , · · · , pn } in a metric space and a real
value r > 0, a k-simplex σk = [pi0 , · · · , pik ] is in the Vietoris-Rips complex if and only if
                                            ′
B(pij,r ) ∩ B(pij ′ ,r ) ̸= ∅, ∀j, j ∈ [0, k].
2.1.3         Chain Complex
Chain complex is an important concept in topology, geometry, and algebra. A q-chain is
a formal sum of q-simplices in simplicial complex K with Z2 coefficients. The set of all
q-chains has a basis which the set of q-simplices in K, thus forming a finitely generated
free abelian group denoted as Cq (K). The boundary operator is a group homomorphism
                                                                    16


Table 2.3: The matrix representation of q-boundary operator and its qth-order persistent
Laplacian with corresponding dimension, rank, nullity, and spectra from alpha complex
K0.2 → K0.6 .
                     q                                           q=0                            q=1 q=2
                                             AB      BC     CD DE        EF     DF   AE
                                         A      −1                                    −1
                                                                                         
                                                       0      0     0      0      0
                                         B   1       −1      0     0      0      0    0 
                   0.2,0.4
                 Bq+1                    C   0        1     −1     0      0      0    0         /   /
                                                                                        
                                                                                         
                                         D   0
                                                      0      1    −1      0     −1    0 
                                                                                         
                                         E   0        0      0     1     −1      0    1 
                                         F       0     0      0     0      1      1    0
                                                        A B C D E              F
                   Bq0.2                                                                         /   /
                                                      [ 0 0 0 0 0              0 ]
                                                        −1                  −1
                                                                                     
                                                   2            0     0             0
                                                 −1     2     −1     0      0      0 
                                                   0    −1      2    −1      0      0
                                                                                     
                 L0.2,0.4                                                                        /   /
                                                                                     
                   q                          
                                                  0     0     −1     3     −1     −1 
                                                                                      
                                                 −1     0      0    −1      3     −1 
                                                   0     0      0    −1     −1      2
                 βq0.2,0.4                                         1                             /   /
              dim(L0.2,0.4
                      q     )                                      6                             /   /
              rank(L0.2,0.4
                      q     )                                      5                             /   /
             nullity(L0.2,0.4
                        q     )                                    1                             /   /
              Spec(L0.2,0.4
                      q     )                        {0, 1, 1.5858, 3, 4, 4.4142}                /   /
defined by ∂q : Cq (K) → Cq−1 (K) to relate the chain groups. More specifically, denoting q-
simplex as σq = [v0 , v1 , · · · , vq ] by its vertices vi , the boundary operator is defined through
its action on the basis,
                                                                 q
                                                               X
                                                    ∂q σq =         (−1)i σq−1 i
                                                                                    .                   (2.8)
                                                                i=0
Here,    i
        σq−1  = [v0 , · · · , v̂i , · · · , vq ] is the (q −1)-simplex with vi omitted. The following se-
quence of chain groups connected by boundary operators is a chain complex (defined as a
set of abelian groups connected by homomorphisms such that the composite of any two
consecutive homomorphisms is zero, ∂q ∂q+1 = 0.)
                                   ∂q+2                 ∂q+1                ∂q             ∂q−1
                            · · · −→ Cq+1 (K) −→ Cq (K) −→ Cq−1 (K) −→ · · ·
                                                                  17


2.1.4    Combinatorial Laplacians
Combinatorial Laplacians[52] offer both spectral analysis and topological analysis [53].
One central role played by the chain complex associated with a simplicial complex is to
define its q-th homology group (Hq = ker ∂q / im ∂q+1 ), which is a topological invariant of
the simplicial complex. The dimension of Hq is denoted by βq = dim Hq , the q-th Betti
number, which, roughly speaking, measures the number of q-dimensional holes in the
simplicial complex, or the geometric object tessellated into the simplicial complex.
     A dual chain complex can be defined on any chain complex through the adjoint op-
erator of ∂q defined on the dual spaces C q (K) = Cq∗ (K). The q-coboundary operator
∂q∗ : C q−1 (K) → C q (K) is defined as:
                                    ∂ ∗ ω q−1 (cq ) ≡ ω q−1 (∂cq ),                    (2.9)
where ω q−1 ∈ C q−1 (K) is a (q − 1)-cochain, which is a homomorphism mapping a chain
to the coefficient group, and cq ∈ Cq (K) is a q-chain. The homology of the dual chain
complex is often called cohomology.
     If we denote by Bq the matrix representation of a q-boundary operator with respect
to the standard basis for Cq (K) and Cq−1 (K), the number of rows and the number of
columns in Bq correspond to the number of (q − 1)-simplices and that of q-simplices in K,
respectively. Moreover, the matrix representation of q-coboundary operator is denoted
BqT .
     In de Rham-Hodge theory, homology and cohomology are often studied through their
correspondences to the q-combinatorial Laplacian operator, defined as the linear operator
∆q : C q (K) → C q (K) as follows,
                                     ∆q := ∂q+1 ∂q+1 ∗
                                                         + ∂q∗ ∂q ,                   (2.10)
where the isomorphism C q (K) ∼    = Cq (K) is assumed, where each q-simplex is mapped to
its own dual, i.e., the isomorphism keeps the coefficients of chains and cochains in the
                                                   18


standard simplicial basis. Correspondingly, the matrix representation of ∆q is the qth-
order Laplacian, which is denoted Lq (K),
                                                 T
                                 Lq (K) = Bq+1 Bq+1  + BqT Bq .                       (2.11)
Assume the number of q-simplices existing in K to be Nq , then Lq (K) is an Nq×Nq -matrix.
Since the qth-order Laplacian Lq (K) is symmetric and positive semi-definite, its spectrum
consists of only real and non-negative eigenvalues. We denote the spectrum of Lq (K) as
                            Spec(Lq (K)) = {λ1,q , λ2,q , · · · , λNq ,q }.
The multiplicity of zero in the spectrum (also called the harmonic spectrum) reveals the
topological information βq , whereas the non-harmonic spectrum encodes further geomet-
ric information. The correspondence between the multiplicity of zero spectra of Lq (K)
and the qth Betti number defined in the homology is an important result in de Rham-
Hodge theory, [54, 55, 56]
      βq = dim ker ∂q − dim im ∂q+1 = dim ker Lq (K) = #0 eigenvalues of Lq (K).      (2.12)
Intuitively, β0 represents the number of connected components in K, β1 reveals the num-
ber of 1D noncontractible loops or circles in K, and β2 shows the number of 2D voids or
cavities in K.
2.1.5   Persistent Laplacian
Both topological and geometric information can be derived from analyzing the spectra of
qth-order Laplacian. However, the information is restricted to those pieces contained in
the connectivity of the simplicial complex. A single simplicial complex produces insuffi-
cient information for practical problems such as feature extraction for machine learning
analysis. To enrich the spectral information, persistent spectral graph (PSG) is proposed
by creating a sequence of simplicial complexes induced by varying a filtration parameter,
                                             19


which is inspired by persistent homology as well as our earlier multiscale graph Lapla-
cians [57].
    First, we consider a filtration of simplicial complex K which is a nested sequence of
                  t=0 of the final complex K:
subcomplexes (Kt )m
                                ∅ = K0 ⊆ K1 ⊆ K2 ⊆ · · · ⊆ Km = K.                                      (2.13)
For each subcomplex Kt , we denote its corresponding chain group to be Cq (Kt ), and the
q-boundary operator will be denoted by ∂qt : Cq (Kt ) → Cq−1 (Kt ). As conventionally done,
we define Cq (Kt ) for q < 0 as the zero group {0} and ∂qt as a zero map. 2 If 0 < q ≤ dim Kt ,
then
                                                 q
                                                 X
                                 ∂qt (σq )   =         (−1)i σq−1
                                                              i
                                                                  ,    ∀σq ∈ Kt ,                       (2.14)
                                                  i
with σq = [v0 , · · · , vq ] being any q-simplex, and σq−1
                                                       i
                                                           = [v0 , · · · , v̂i , · · · , vq ] being the (q −
1)-simplex constructed by removing vi . The adjoint operator of ∂qt is the coboundary
                ∗
operator ∂qt : C q−1 (Kt ) → C q (Kt ), which can be regarded as a map from Cq−1 (Kt ) to
Cq (Kt ) through the isomorphisms C q (Kt ) ∼
                                            = Cq (Kt ) between cochain groups and chain
groups.
    Similar to the persistent homology, a sequence of chain complexes can be defined as
below:
                     1
                    ∂q+1        ∂q1      3       ∂1              ∂1
                                                                 2             ∂1
                                                                                1          ∂1
                                                                                            0
   ···    1
         Cq+1       −
                    ↽−
                     −∗⇀
                       − Cq1    −
                                −
                                ↽⇀
                                 −− ··· −
                                        −
                                        ↽⇀
                                         −−               C21   −
                                                                −
                                                                ↽⇀
                                                                 −−     C11    −
                                                                               −
                                                                               ↽⇀
                                                                                −−   C01   −
                                                                                           −
                                                                                           ↽⇀
                                                                                            −−  1
                                                                                               C−1 = {0}
                     1            ∗       ∗                       ∗              ∗           ∗
                    ∂q+1        ∂q1              ∂31             ∂21           ∂11         ∂01
          ⊆                ⊆                               ⊆             ⊆           ⊆
                     2
                    ∂q+1        ∂q2      3       ∂2              ∂2
                                                                 2             ∂2
                                                                                1          ∂2
                                                                                            0
   ···    2
         Cq+1       −
                    ↽−
                     −∗⇀
                       − Cq2    −
                                −
                                ↽⇀
                                 −− ··· −
                                        −
                                        ↽⇀
                                         −−               C22   −
                                                                −
                                                                ↽⇀
                                                                 −−     C12    −
                                                                               −
                                                                               ↽⇀
                                                                                −−   C02   −
                                                                                           −
                                                                                           ↽⇀
                                                                                            −−  2
                                                                                               C−1 = {0}
                     2            ∗       ∗                       ∗              ∗           ∗
                    ∂q+1        ∂q2              ∂32             ∂22           ∂12         ∂02
           ..              ..                              ..             ..          ..           ..
            .               .                               .              .           .            .
          ⊆          m
                           ⊆                               ⊆             ⊆           ⊆
                    ∂q+1        ∂qm
                                3        2       ∂m
                                                  1        0     ∂m            ∂m          ∂m
          m
   · · · Cq+1 −−
              ↽−⇀
                −     −−
                  Cqm ↽−⇀
                        −  ··· −
                               ↽−
                                −⇀− C2m −
                                        ↽−
                                         −⇀− C1m −
                                                 ↽−
                                                  −⇀− C0m −
                                                          ↽−
                                                           −⇀−  m
                                                               C−1 = {0}
                ∗    m   ∗        ∗        ∗        ∗        ∗
                    ∂q+1        ∂qm              ∂3m             ∂2m           ∂1m         ∂0m
                                                                                                        (2.15)
    2
      We define the boundary matrix B0t for the boundary operator ∂0t as a zero matrix. The number of
columns of B0t is the number of 0-simplices in Kt , the number of rows will be 1.
                                                           20


For simplicity, we use Cqt to denote the chain group Cq (Kt ).
    Next, we introduce persistence to the Laplacian spectra. We define the subset of Cqt+p
whose boundary is in Cq−1            q , assuming the natural inclusion map from Cq−1 to Cq−1 ,
                               as Ct,p
                             t                                                                   t        t+p
                               Ct,p
                                 q := {β ∈ Cq
                                                   t+p
                                                        | ∂qt+p (β) ∈ Cq−1 t
                                                                              }.                       (2.16)
On this subset, one may define the p-persistent q-boundary operator denoted by ðt,p                      q    :
  q → Cq−1 . Its corresponding adjoint operator is (ðq ) : Cq−1 → Cq , again through the
                                                                   t,p ∗
Ct,p       t                                                                  t       t,p
identification of cochains with chains. We then define the q-order p-persistent Laplacian
operator ∆t,p q : Cq → Cq associated with the filtration as
                    t      t
                                                              ∗      ∗
                                    ∆t,p       t,p
                                       q = ðq+1 ðq+1
                                                        t,p
                                                                 + ∂qt ∂qt .                           (2.17)
The matrix representation of ∆t,p   q in the simplicial basis is
                                  Lt,p       t,p     t,p T
                                    q = Bq+1 (Bq+1 ) + (Bq ) Bq ,
                                                                     t T t
                                                                                                       (2.18)
where Bq+1t,p
               is the matrix representation of ðt,p    q+1 .
    We denote the spectrum of Lt,p    q as
                                Spec(Lt,p            t,p     t,p         t,p
                                         q ) = {λ1,q , λ2,q , · · · , λNqt ,q },
where Nqt = dim Cqt is the number of q-simplices in Kt , and the eigenvalues are listed in
the ascending order. Thus, the smallest non-zero eigenvalue of Lt,p                q is denoted as λ2,q . We
                                                                                                     t,p
may recognize the multiplicity of zero in the spectrum of Lt,p               q as the qth order p-persistent
Betti number βqt,p , which counts the number of (independent) q-dimensional holes in Kt
that still exists in Kt+p . The relation can be observed in
                                        q+1 = dim ker Lq = #0 eigenvalues of Lq .
        βqt,p = dim ker ∂qt − dim im ðt,p                      t,p                          t,p
                                                                                                       (2.19)
In this paper, we focus on the 0, 1, 2th-order persistent Laplacians, which depict the rela-
tions among vertices, edges, triangles, and tetrahedra, as we target 3D real-world appli-
cations.
                                                      21


    For instance, given a set of vertices V = {v0 , v1 , · · · , vN0 −1 } , N0 embedded in R3 , we
consider a nested family of simplicial complexes that may be created for a positive real
number α. Denoting the simplicial complex generated for α by Kα , the traditional qth-
order Laplacian is just a special case of qth-order 0-persistent Laplacian at Kα
                                    Lα,0
                                      q
                                                α,0
                                           = Bq+1     α,0 T
                                                    (Bq+1  ) + (Bqα )T Bqα .                       (2.20)
The spectrum of Lα,0   q is simply associated with a snapshot of the filtration,
                                  Spec(Lα,0           α,0   α,0          α,0
                                           q ) = {λ1,q , λ2,q , · · · , λNqα ,q }.                 (2.21)
Correspondingly, the q-th 0-persistent Betti number βqα,0 = βqα . In addition to the tradi-
tional homology information, and persistent homology information, our proposed per-
sistent spectral graph theory, through the nonzero eigenvalues in the spectrum of the per-
sistent Laplacian operator, provide richer spatial information induced by varying the fil-
tration parameters. Thus it provides a powerful tool to encode high-dimensional datasets
into various topological and geometric features in a coherent fashion.3
    Figure 2.6 demonstrates an example of a standard filtration process. Here the initial
setup K1 consists of five 0-simplices (vertices). We construct Vietoris-Rips complexes by
using an ever-growing circle centered at each vertex with radius r. Once two circles over-
lapped with each other, an 1-simplex (edge) is formed. A 2-simplex (triangle) will be
created when 3 circles contact with one another, and a 3-simplex will be generated once
4 circles get overlapped one another. As Figure 2.6 shows, we can attain a series of sim-
plicial complexes from K1 to K6 with the radius of circles increasing. To fully illustrate
how to construct p-persistent q-combinatorial Laplacian matrices by the boundary oper-
ator and determine persistent Betti numbers, we analyze 6 p-persistent q-combinatorial
Laplacian matrices and their corresponding harmonic persistent spectra (i.e., persistent
Betti numbers) and non-harmonic persistent spectra. Additional matrices are analyzed in
Appendix Section A.1.
    3
      In this work, we use notations Ct,p
                                       q , ðq , ∆q , Lq , and βq instead of Cq , ðq , ∆q , Lq , and βq
                                            t,p   t,p  t,p       t,p               t+p t+p t+p t+p    t+p
used in Ref. [11].
                                                       22


          3              K1                               K2                                K3
                    2
        0                  4
              1
                         K4                               K5                                K6
Figure 2.6: Illustration of filtration. We use 0, 1, 2, 3, and 4 to stand for 0-simplices,
01, 12, 23, 03, 24, 02, and 13 for 1-simplices, 012, 023, 013, and 123 for 2-simplices, and 0123
for the 3-simplex. Here, K1 has five 0-cycles, K2 has four 0-cycles, K3 has two 0-cycles
and a 1-cycle, K4 has a 0-cycle and a 1-cycle, K5 has one 0-cycle, and K6 has a 0-cycle.
  Table 2.4: The number of q-cycles of simplicial complexes demonstrated in Figure 2.6.
                     # of q-cycles      K1     K2      K3      K4      K5        K6
                         q=0            5       4       2       1      1          1
                         q=1            0       0       1       1      0          0
                         q=2            0       0       0       0      0          0
               Case 1. In this case, the initial setup is K1 and the end status is K3 . Therefore,
                         t = 1 and p = 2 in Eq. (2.18). We will calculate L1+2     0 , L1 , and L2
                                                                                        1+2         1+2
                         first and find out their corresponding persistent spectra.
                         The 2-persistent 0, 1, 2-combinatorial Laplacian operators are:
                                                                   ∗      ∗
                                               ∆1+2
                                                 0   = ð1+2
                                                         1    ð1+2
                                                               1      + ∂01 ∂01 ,
                                                                   ∗      ∗
                                               ∆1+2
                                                 1   = ð1+2
                                                         2    ð1+2
                                                               2      + ∂11 ∂11 ,
                                                                   ∗      ∗
                                               ∆1+2
                                                 2   = ð1+2
                                                         3    ð1+2
                                                               3      + ∂21 ∂21 ,
                         Since 2-simplex and 3-simplex do not exist in K1 and K3 , ð1+2        1    1+2
                                                                                          2 , ∂1 , ð3 ,
                         and ∂21 do not exist and ∂01 is a zero map. Then, there is only one per-
                                                  23


            sistent combinatorial Laplacian matrix
                                   L1+2
                                     0   = B11+2 (B11+2 )T + (B01 )T B01 .
            It can be seen in Figure 2.6 that two 0-cycles (connected components)
            in K1 are still alive in K3 , while no 1-cycle and 2-cycle exist in the
            initial set up K1 , which perfectly match the calculations in Table 2.5:
            β01+2 = 2.
                           Table 2.5: K1 → K3 .
       q                               q=0                     q=1         q=2
                            01 12 23 03 
                         0     −1 0          0 −1
                         1  1 −1 0               0 
       1+2
     Bq+1                                                        /        /
                         2  0
                                      1    −1    0  
                                                     
                         3  0         0     1    1 
                         4      0      0     0    0
       Bq1                      0 1 2 3 4                        /        /
                            / 0 0 0 0 0
                                                     
                             2 −1 0 −1 0
                          −1 2 −1 0 0                
                                                                   /        /
                                                     
     L1+2
       q
                         
                            0 −1 2 −1 0              
                                                      
                          −1 0 −1 2 0                
                             0      0     0     0 0
     βq1+2                               2                         /        /
  dim(L1+2 q )                           5                         /        /
  rank(L1+2q )                           3                         /        /
 nullity(L1+2
            q )                          2                         /        /
Spectrum(L1+2 q )                 {0, 0, 2, 2, 4}                  /        /
  Case 2. The initial setup is K3 and the end status is K4 . The 1-persistent
                                        24


                  0, 1, 2-combinatorial Laplacian operators are
                                                                            ∗         ∗
                                                 ∆3+1
                                                    0     = ð3+1
                                                             1      ð3+1
                                                                       1       + ∂03 ∂03 ,
                                                                            ∗         ∗
                                                 ∆3+1
                                                    1     = ð3+1
                                                             2      ð3+1
                                                                       2       + ∂13 ∂13 ,
                                                                            ∗         ∗
                                                 ∆3+1
                                                    2     = ð3+1
                                                             3      ð3+1
                                                                       3       + ∂23 ∂23 ,
                  Since 2-simplex and 3-simplex do not exist in K4 , ∂23 , ∂23+1 , and ∂23 do
                  not exist, then
                                                                          T
                                               L3+1
                                                  0     = B13+1 B13+1          + (B03 )T B03 ,
                                               L3+1
                                                  1     = (B13 )T B13 .
                  From Table 2.6, one can see that β03+1 = 0 and β13+1 = 1, which reveals
                  only one 0-cycle and one 1-cycle in K3 are still alive in K4 .
                                        Table 2.6: K3 → K4 .
        q                            q=0                                        q=1              q=2
                          01    12     23    03     24
                    0     −1     0      0    −1      0
                                                       
       3+1          1      1    −1      0     0      0
    Bq+1                                                                          /               /
                                                       
                    2      0     1     −1     0     −1
                                                       
                                                       
                    3     0     0      1     1      0  
                    4      0     0      0     0      1
                                                                         01    12     23  03
                                                                  0      −1     0      0  −1
                                                                                              
                              0    1   2   3    4                 1       1    −1      0   0
      Bq3                                                                                         /
                                                                                              
                            [ 0    0   0   0    0 ]               2       0     1    −1    0
                                                                                              
                                                                                              
                                                                  3      0     0      1   1   
                                                                  4       0     0      0   0
                          2     −1     0    −1      0
                                                                                          
                                                                          2    −1      0  1
                         −1     2     −1     0       0                −1       2     −1  0 
                                                       
    L3+1                  0     −1     3    −1      −1                                            /
                                                       
       q
                                                                                            
                                                                      0       −1      2  1 
                                                       
                        −1     0     −1     2       0  
                                                                          1     0      1  2
                          0     0     −1     0       1
     βq3+1                             1                                          1               /
 dim(L3+1 q   )                        5                                          4               /
 rank(L3+1q   )                        4                                          3               /
nullity(L3+1q   )                      1                                          1               /
Spectra(L3+1q   )       {0, 0.8299, 2, 2.6889, 4.4812}                       {0, 2, 2, 4}         /
                                                       25


Case 3. The initial setup is K4 and the end status is K4 . Similarly,
                                                  T
                             L4+0
                              0   = B14+0 B14+0      + (B04 )T B04 ,
                             L4+0
                              1   = (B14 )T B14 ,
        and L4+02  does not exist. In this case, the 0-persistent q-combinatorial
        Laplacian matrix is actually the q-combinatorial Laplacian matrix de-
        fined in Eq. (2.11). Therefore, β04+0 , β14+0 , and β24+0 actually represent
        the number of 0, 1, 2-cycles in K4 . With the filtration parameter r in-
        creasing, all the circles overlapped with at least another circle in K4 ,
        which results in β04+0 = 1. Since only one 1-cycle formed in K4 , one
        has β14+0 = 1.
Case 4. The initial setup is K4 and the end status is K5 . Using similar analysis
        as in previous cases, we have
                                                  T
                             L4+1
                              0   = B14+1 B14+1      + (B04 )T B04 ,
                                                  T
                             L4+1
                              1   = B24+1 B24+1      + (B14 )T B14 ,
        and L4+12  does not exist. Notice that two 2-simplices 012 and 023 are
        created under the filtration process. The appearance of these two
        newborns results in the 1-cycle that was alive in K4 being killed.
        Therefore β14+1 = 0 and β04+1 = 1 because only one connected compo-
        nent keeps alive until K5 .
Case 5. The initial setup is K5 and the end status is K6 . The 1-persistent
        0, 1, 2-combinatorial Laplacian matrices are
                                                  T
                             L5+1
                              0   = B15+1 B15+1      + (B05 )T B05 ,
                                                  T
                             L5+1
                              1   = B25+1 B25+1      + (B15 )T B15 ,
                                                  T
                             L5+1
                              2   = B35+1 B35+1      + (B25 )T B25 .
                                 26


                                       Table 2.7: K4 → K4 .
      q                          q=0                                 q=1                q=2
                      01 12 23 03 24 
                  0     −1 0        0 −1 0
                  1   1 −1 0            0     0 
      4+0
    Bq+1                                                              /                /
                  2   0
                              1   −1    0    −1  
                                                  
                  3   0       0    1    1     0 
                  4      0     0    0    0     1
                                                           01 12 23 03 24 
                                                       0     −1 0        0 −1 0
                                                           1 −1 0
      Bq4                   0 1 2 3 4                1                     0    0   /
                        / 0 0 0 0 0                    2   0
                                                                 1     −1    0    −1 
                                                                                      
                                                       3   0     0      1    1    0 
                                                       4      0   0      0    0    1
                                                                                  
                        2 −1 0 −1 0                           2 −1 0 1 0
                      −1 2 −1 0              0            −1 2 −1 0 −1 
                                                                                         /
                                                                                  
    L4+0
      q
                    
                       0 −1 3 −1 −1            
                                                          
                                                             0 −1 2 1 1            
                      −1 0 −1 2              0             1    0     1 2 0 
                        0     0 −1 0          1               0 −1 1 0 2
    βq4+0                          1                                    1                /
 dim(L4+0 q )                      5                                    5                /
 rank(L4+0q )                      4                                    4                /
nullity(L4+0
           q )                     1                                    1                /
Spectra(L4+0q )     {0, 0.8299, 2, 2.6889, 4.4812}       {0, 0.8299, 2, 2.6889, 4.4812}  /
                     In this situation, a new 3-simplex is formed in K6 , which means that
                     B35+1 is no long a non-zero matrix. From Table 2.9, we can see that
                     β25+1 = 0 because K5 does not own any 2-cycle and thus, there is no 2-
                     cycle keeping alive up to K6 . β05+1 implies only one 0-cycle preserved
                     along the filtration process.
             Case 6. The initial setup is K6 and the end status is K6 . The 0-persistent
                                                 27


                                      Table 2.8: K4 → K5 .
      q                         q=0                                   q=1                 q=2
                   01 12 23 03 24 02                             012     023  
                0   −1 0        0 −1 0 −1                       01     1     0
                1  1 −1 0             0    0    0             12  1       0
                                                                                           /
      4+1
                                                                                 
    Bq+1                                                                      
                2  0
                         1 −1 0 −1 1                         23  0
                                                                            1   
                                                                                 
                3  0     0     1      1    0    0             03  0      −1   
                4    0    0     0      0    1    0              24     0     0
                                                            01 12 23 03 24 
                                                       0     −1 0         0 −1 0
                                                            1 −1 0
      Bq4                 0 1 2 3 4                  1                       0    0   /
                       /    0 0 0 0 0                  2    0
                                                                  1 −1 0 −1           
                                                       3    0     0      1     1    0 
                                                       4      0    0      0     0    1
                                                                                   
                       3 −1 −1 −1 0                            3 0       0 1 0
                     −1 2 −1 0               0              0 3 −1 0 −1 
                                                                                           /
                                                                                   
    L4+1
      q
                    
                     −1 −1 4 −1 −1             
                                                            
                                                              0 −1 3 0 1            
                     −1 0 −1 2               0              1 0       0 3 0 
                       0     0 −1 0           1                0 −1 1 0 2
    βq4+1                          1                                     0                 /
 dim(L4+1 q )                      5                                     5                 /
 rank(L4+1q )                      4                                     5                 /
nullity(L4+1
           q )                     1                                     0                 /
Spectra(L4+1q )             {0, 1, 2, 4, 5}                  {1.2677, 2, 2, 4, 4.7321}     /
                                                28


                                              Table 2.9: K5 → K6 .
      q                         q=0                                   q=1                           q=2
                                                                012 023 013 123
                   01 12 23 03 24 02                      01     1    0      1 0
                0   −1 0        0 −1 0 −1
                   1 −1 0                                  12  1      0      0 1                   0123 
      5+1       1                      0    0   0                                     
    Bq+1                                                  23  0      1      0 1             012     −1
                2  0     1 −1 0 −1 1                         
                                                                0 −1 −1 0 
                                                                                        
                                                          03                                  023     −1
                3  0     0     1      1    0   0                                     
                                                            24  0      0      0 0 
                4    0    0     0      0    1   0
                                                            02    −1 1         0 0
                                                                                                  012 023 
                                                          01 12 23 03 24 02                 01     1     0
                                                       0   −1 0       0 −1 0 −1
                                                          1 −1 0                             12  1       0 
                          0 1 2 3 4                  1                     0      0     0                 
      Bq5                                                                                   23  0       1 
                       / 0 0 0 0 0                     2  0     1   −1      0    −1      1     
                                                                                                  0 −1 
                                                                                                              
                                                                                            03
                                                       3  0     0    1      1      0     0                 
                                                                                              24  0       0 
                                                       4    0    0    0      0      1     0
                                                                                              02    −1 1
                                                                                         
                                                            4 0 0       0 0         0
                       3 −1 −1 −1 0                          0 4 0       0 −1 0          
                     −1 2 −1 0               0                                                        
                                                           0 0 4       0 1         0             4 0
    L5+1
      q
                     −1 −1 4 −1 −1                                                     
                                                           0 0 0       4 0         0             0 4
                     −1 0 −1 2               0                                         
                                                             0 −1 1      0 2 −1          
                       0     0 −1 0           1
                                                              0 0 0       0 −1 4
    βq5+1                          1                                     0                             0
 dim(L5+1 q )                      5                                     6                             2
 rank(L5+1q )                      4                                     6                             2
nullity(L5+1
           q )                     1                                     0                             0
Spectra(L5+1q )             {0, 1, 2, 4, 5}                      {1, 4, 4, 4, 4, 5}                 {4, 4}
                         0, 1, 2-combinatorial Laplacian operators are
                                                  L6+0
                                                    0    = B16+0 (B16+0 )T + (B06 )T B06 ,
                                                  L6+0
                                                    1    = B26+0 (B26+0 )T + (B16 )T B16 ,
                                                  L6+0
                                                    2    = B36+0 (B36+0 )T + (B26 )T B26 ,
                         β06+0 = 1, β16+0 = 0, and β26+0 = 0 imply that only one 0-cycle (con-
                         nected component) exists in K6 .
                                                       29


                                              Table 2.10: K6 → K6 .
      q                           q=0                                            q=1                        q=2
                                                                         012 023 013 123         
                   01 12 23 03 24 02 13                            01    1      0     1      0
                0   −1 0        0 −1 0 −1 0                          12  1
                                                                                 0     0      1  
                                                                                                  
      6+0       1   1 −1 0           0     0    0 −1               23  0       1     0      1  
    Bq+1                                                                                                 B36+0
                2 
                    0    1 −1 0 −1 1                0             03  0 −1 −1 0
                                                                        
                                                                                                  
                                                                                                  
                3   0    0     1     1     0    0   1              24  0
                                                                                 0     0      0  
                                                                                                  
                4    0    0     0     0     1    0   0               02  −1 1          0      0  
                                                                     13    0      0     1 −1
                                                                01 12 23 03 24 02 13 
                                                             0    −1 0        0 −1 0 −1 0
                                                                1 −1 0                          0 −1 
      Bq6                    0 1 2 3 4                     1                      0     0                 B26
                         /    0 0 0 0 0                      2  0
                                                                       1 −1 0 −1 1                   0 
                                                                                                        
                                                             3  0      0     1      1     0     0    1 
                                                             4     0    0     0      0     1     0    0
                                                                                                    
                                                                    4 0 0 0 0               0 0
                         3 −1 −1 −1 0                             
                                                                     0 4 0 0 −1 0 0                 
                                                                                                     
                       −1 2 −1 0               0                   0 0 4 0 1               0 0    
                                                                                                  
    L6+0
      q
                       −1 −1 4 −1 −1 
                                                  
                                                                  
                                                                     0 0 0 4 0               0 0    
                                                                                                            L6+0
                                                                                                               3
                       −1 0 −1 2               0                   0 −1 1 0 2 −1 0                
                                                                                                    
                         0     0 −1 0           1                    0 0 0 0 −1 4 0                 
                                                                      0 0 0 0 0               0 4
    βq6+0                            1                                              0                          0
 dim(L6+0 q )                        5                                              7                          4
 rank(L6+0q )                        4                                              7                          4
nullity(L6+0
           q )                       1                                              0                          0
Spectra(L6+0q )               {0, 1, 4, 4, 5}                             {1, 4, 4, 4, 4, 4, 5}           {4, 4, 4, 4}
                        with
                                                                                  012 023 013 123
                                                                                                         
                                                                         01        1       0      1    0 
                                                     0123                                                
                                                                       12        1       0      0    1 
                                                      −1 
                                                                                                         
                                              012                                                       
                                                                         23         0       1      0    1 
                                                                                                         
                                                                  6
                                 B36+0
                                                                            
                                         = 023       −1      , B2 =                                    
                                                                                    0 −1 −1 0 
                                                                                                       
                                                                       03 
                                              013    1  
                                                                              
                                                                              
                                                                                                          
                                                                                                          
                                                                       24       0       0      0    0 
                                                                                                          
                                              123      1                                                 
                                                                         02     −1 1             0    0 
                                                                                                          
                                                                                                         
                                                                         13         0       0      1 −1
                                                         30


                     and                                                          
                                                                 4 0 0         0 
                                                                                  
                                                                 0 4 0         0 
                                                      L6+0  =                     .
                                                                                  
                                                        3
                                                                 0 0 4         0 
                                                                                  
                                                                                  
                                                                   0 0 0        4
2.1.6  Variants of Persistent Laplacians
The traditional approach in defining the q-boundary operator ∂q : Cq (K) → Cq−1 (K) can
be expressed as:
                                                     q
                                                   X
                                          ∂q σq =       (−1)i σq−1
                                                                 i
                                                                    ,
                                                    i=0
which leads to the corresponding elements in the boundary matrices being either 1 or
−1. However, to encode more geometric information into the Laplacian operator, we add
volume information of q-simplex σq to the expression of q-boundary operator.
    Given a vertex set V = {v0 , v1 , · · · , vq } with q+1 isolated points (0-simplices) randomly
arranged in the n-dimensional Euclidean space Rn , often with n ≥ q. Set dij to be the
distances between vi and vj with 0 ≤ i ≤ j ≤ q and obviously, dij = dji . The Cayley-
Menger determinant can be expressed as [58]
                                                        0   d201 d202 · · · d20q 1
                                                       d210  0     d212 · · · d21q 1
                                                       d220 d221    0    · · · d22q 1
                    DetCM (v0 , v1 , · · · , vq ) =     ..   ..     ..     .. . . ..          (2.22)
                                                         .    .      .      .       . .
                                                       d2q0 d2q1 d2q2 · · ·       0   1
                                                        1    1      1      1      1   0
The q-dimensional volume of q-simplex σq with vertices {v0 , v1 , · · · , vq } is defined by
                                         s
                                             (−1)q+1
                         Vol(σq ) =                    DetCM (v0 , v1 , · · · , vq ).         (2.23)
                                              (q!)2 2q
                                                      31


In trivial cases, Vol(σ0 ) = 1, meaning the 0-dimensional volume of 0-simplex is 1, i.e.,
there is only 1 vertex in a 0-simplex. Also, the 1-dimensional volume of 1-simplex σ1 =
[vi , vj ] is the distance between vi and vj , and the 2-dimensional volume of 2-simplex is the
area of a triangle [vi , vj , vk ].
      The weighted boundary operator equipped with volume, denoted ∂ˆq , is given by
                                                    q
                                                  X
                                         ∂ˆq σq =      (−1)i Vol(σqi )σq−1 i
                                                                                 .          (2.24)
                                                   i=0
Employed the same concept to the persistent spectral theory, we have the volume-weighted
p-persistent q-combinatorial Laplacian operator. We also define
                                               
                                               ∂ˆqt+p (σq ), if σq ∈ Ct+p
                                               
                                               
                                                                             q
                                t+p
                             ð̂q (σq ) :=                                                   (2.25)
                                                                if σq ∈ Cqt+p \ Ct+p
                                               
                                               0,
                                               
                                                                                        q
with
                                  Cqt+p := {σq ∈ Cqt+p | ∂ˆqt+p (σq ) ∈ Cq−1       t
                                                                                       }.
Similarly, an inverse-volume weighted boundary operator, denoted ∂ˇq , is given by
                                                    q
                                                  X                1
                                        ∂ˇq σq =       (−1)i             σi .               (2.26)
                                                   i=0
                                                              Vol(σqi ) q−1
To define an inverse-volume weighted p-persistent q-combinatorial Laplacian operator.
We define                                      
                                               ∂ˇqt+p (σq ), if σq ∈ Ct+p
                                               
                                               
                                                                             q
                             ð̌qt+p (σq ) :=                                                (2.27)
                                                                if σq ∈   Cqt+p        Ct+p
                                               
                                               0,
                                                                                    \  q
with
                                  Cqt+p := {σq ∈ Cqt+p | ∂ˇqt+p (σq ) ∈ Cq−1       t
                                                                                       }.
Then volume-weighted and inverse-volume-weighted p-persistent q-combinatorial Lapla-
cian operators defined along the filtration can be expressed as
                                                                ∗        ∗
                                        ∆ˆ t+p = ð̂t+p
                                            q        q+1   ð̂t+p
                                                             q+1     + ∂ˆqt ∂ˆqt ,
                                                                                            (2.28)
                                                                ∗        ∗
                                        ∆ˇ qt+p = ð̌t+p     t+p
                                                     q+1 ð̌q+1      + ∂ˇqt ∂ˇqt .
                                                          32


The corresponding weighted matrix representations of boundary operators ð̂t+p           q+1 , ð̂q , ð̌q+1 ,
                                                                                                t     t+p
and ð̌tq are denoted B̂q+1
                        t+p
                            , B̂qt , B̌q+1
                                       t+p
                                           , and B̌qt , respectively. Therefore, volume-weighted
and inverse-volume-weighted p-persistent q-combinatorial Laplacian matrices can be ex-
pressed as
                                             t+p    t+p T
                                L̂t+p
                                    q   = B̂q+1  (B̂q+1 ) + (B̂qt )T (B̂qt ),
                                                                                                   (2.29)
                                             t+p    t+p T
                                Ľt+p
                                    q   = B̌q+1  (B̌q+1 ) + (B̌qt )T (B̌qt ).
Although the expressions of the weighted persistent Laplacian matrices are different from
the original persistent Laplacian matrices, some properties of Lt+p           q are preserved. The
weighted persistent Laplacian operators are still symmetric and positive semi-defined.
Additionally, their ranks are the same as Lt+p       q . With the embedded volume information,
weighted PSGs can provide richer topological and geometric information through the as-
sociated persistent Betti numbers and non-harmonic spectra (i.e., non-zero eigenvalues).
    In real applications, we are more interested in the 0, 1, 2-combinatorial Laplacian ma-
trices because its more intuitive to depict the relation among vertex, edges, and faces.
Given a set of vertices V = {v0 , v2 , · · · , vN } with N + 1 isolated points (0-simplices) ran-
domly arranged in Rn . By varying the radius r of the (n − 1)-sphere centered at each
vertex, a variety of simplicial complexes is created. We denote the simplicial complex
generated at radius r to be Kr , then the 0-persistent q-combinatorial Laplacian operator
and matrix at initial set up Kr is
                                  Lr+0
                                     q
                                              r+0
                                         = Bq+1          ) + (Bqr )T Bqr .
                                                      r+0 T
                                                  (Bq+1                                            (2.30)
The volume of any 1-simplex σ1 = [vi , vj ] is Vol(σ1 ) is actually the distance between vi and
vj denoted dij . Then the 0-persistent 0-combinatorial Laplacian matrix based on filtration
                                                     33


r can be expressed explicitly as
                                  
                                      X
                                           (Lr+0      if i = j
                                  
                                  
                                  
                                  
                                   −         0 )ij ,
                                       j
                                  
                                  
                                  
                     (Lr+0  )   =                                                   (2.31)
                        0    ij
                                  
                                   −1,               if i ̸= j and dij − 2r < 0
                                  
                                  
                                  
                                                      otherwise.
                                  
                                  0,
                                  
Correspondingly, we can denote the 0-persistent 1-combinatorial Laplacian matrix based
on filtration r by Lr+0
                      1 , and the 0-persistent 2-combinatorial Laplacian matrix based on
filtration r by Lr+0
                 2 .
     Alternatively, variants of persistent 0-combinatorial Laplacian matrices can be de-
fined by adding the Euclidean distance information. The distance-weight persistent 0-
combinatorial Laplacian matrix based on filtration r can be expressed explicitly as
                                  
                                      X
                                           (L̂r+0     if i = j
                                  
                                  
                                  
                                  
                                   −         0 )ij ,
                                   j
                                  
                                  
                        r+0
                     (L̂0 )ij =                                                     (2.32)
                                  
                                   −dij ,            if i ̸= j and dij − 2r < 0
                                  
                                  
                                  
                                                      otherwise.
                                  
                                  0,
                                  
Moreover, the inverse-distance-weight persistent 0-combinatorial Laplacian matrix based
on filtration r can also be implemented:
                                  
                                      X
                                    −      (Ľr+0
                                              0 )ij , if i = j
                                  
                                  
                                  
                                  
                                  
                                   j
                                  
                                  
                     (Ľ0 )ij = − 1 ,
                        r+0
                                                      if i ̸= j and dij − 2r < 0    (2.33)
                                  
                                  
                                  
                                     d ij
                                  
                                  
                                  0,                 otherwise.
                                  
     The spectra of the aforementioned 0-persistent 0-combinatorial Laplacian matrices
based on filtration are given by
                        Spectra(Lr+0              r+0       r+0             r+0
                                   0 ) = {(λ1 )0 , (λ2 )0 , · · · , (λN )0 },
                        Spectra(L̂r+0             r+0       r+0             r+0
                                   0 ) = {(λ̂1 )0 , (λ̂2 )0 , · · · , (λ̂N )0 },
                        Spectra(Ľr+0             r+0       r+0             r+0
                                   0 ) = {(λ̌1 )0 , (λ̌2 )0 , · · · , (λ̌N )0 },
                                                   34


where N is the dimension of persistent Laplacian matrices, (λ̂j )r+0                  0    and (λ̌j )r+0
                                                                                                     0   are the j-th
eigenvalues of L̂r+0  0  and Ľr+0
                                 0 , respectively. We denote β̂q
                                                                               r+0
                                                                                   and β̌qr+0 the qth Betti for L̂r+0
                                                                                                                   q
       q , respectively.
and Ľr+0
    The smallest non-zero eigenvalue of Lr+0            0 , denoted (λ̃2 )0 , is particularly useful in
                                                                                    r+0
many applications. Similarly, the smallest non-zero eigenvalues of L̂r+0                          0    and Ľr+0
                                                                                                              0    are
                ˜            ˜ )r+0 , respectively.
denoted as (λ̂2 )r+00   and (λ̌ 2 0
    Finally, it is mentioned that using the present procedure, more general weights, such
as the radial basis function of the Euclidean distance, can be employed to construct weighted
boundary operators and associated persistent combinatorial Laplacian matrices.
2.2    Persistent Path Laplacian
2.2.1   Paths on a Finite Set
Denote set V an arbitrary nonempty finite set, and elements in V are called vertices. For
p ∈ Z+0 (i.e., a set with integers p ≥ 0), an elementary p-path on V is any sequence i0 . . . ip of
p + 1 vertices in V . An elementary p-path is an empty set ∅ for p = −1. For a fixed field K,
a vector space that consists of all formal linear combinations of elementary p-paths with
its coefficients in K is called the space generated by the elementary paths, denoted as
Λp = Λp (V, K) = Λp (V ). One says the elements in Λp are p-paths on V , and an elementary
p-path i0 . . . ip ∈ Λp is denoted by ei0 ...ip . By definition, ∀v ∈ Λp , its unique representation
can be given by the basis in Λp :
                                                 X
                                        v=                ci0 ...ip ei0 ...ip ,                                 (2.34)
                                            i0 ,...,ip ∈V
where ci0 ...ip is the coefficient in K. For instance, Λ0 contains all linear combination of ei
with i ∈ V , Λ1 has all linear combination of eij with (i, j) ∈ V × V , and so on so forth.
Since Λ−1 consists of all multiples of e, one has Λ−1 ∼               = K.
    Additionally, ∀p ∈ Z+     0 , the linear boundary operator from Λp to Λp−1 that acts on ele-
                                                        35


mentary paths can be defined as
                                                  ∂ : Λp → Λp−1                                (2.35)
with
                                                         p
                                                       X
                                        ∂ei0 ...ip =        (−1)q ei0 ...îq ...ip ,           (2.36)
                                                        q=0
where îq denotes the omission of index iq from the elementary p-path ei0 ...ip . One sets
Λ−2 = {0}, and for p = −1, defines ∂ : Λ−1 → Λ−2 to be a zero map. Following Lemma
2.1 in [59], one has ∂ 2 = 0, which indicates that the collection of boundary operator ∂ and
space Λp can form a chain complex of V denoted as Λ∗ = {Λp } as
                                       ∂              ∂        ∂              ∂      ∂
                             · · · Λp −→ Λp−1 −→ · · · −→ Λ0 −→ K −→ 0.                        (2.37)
    Next, the concepts of regular path and non-regular path are introduced according to
[59]. An elementary path ei0 ...ip on a set V is regular if ik−1 ̸= ik , and non-regular if ik−1 = ik
for k = 1, . . . , p. For any p ∈ Z+  0 ∪{−1}, let Rp be the subspace of Λp spanned by all regular
elementary paths, and Np be the subspace of Λp spanned by all non-regular elementary
paths. Therefore, one has
                              Rp = span{ei0 ...ip : i0 . . . ip is regular}
                              Np = span{ei0 ...ip : i0 . . . ip is non-regular}.
Note that Rp = Λp for integers p = −1, 0.
    Then ∀p ∈ Z+     0 ∪ {−1}, Λp = Rp ⊕ Np . Therefore,
                                                   Rp ∼  = Λp /Np .
According to Section 2.4 in [59], the boundary operator ∂ is well-defined on the quotient
space Λp /Np . Moreover, ∂ 2 = 0 and the product rules are satisfied in the quotient space
Λp /Np as well. One has an induced regular boundary operator:
                                                 ∂¯ : Rp → Rp−1 ,                              (2.38)
                                                           36


where the regular boundary operator ∂¯ satisfies (2.36) except that all non-regular terms
on the right hand side should be treated as 0. Then a chain complex of V , denoted as
R∗ (V ) = (Rp )p and equipped with ∂,          ¯ can be expressed as:
                                        ∂¯          ∂¯         ∂¯      ∂¯       ∂¯
                             · · · Rp −→ Rp−1 −→ · · · −→ R0 −→ K −→ 0.                   (2.39)
It can be verified that Rp ∼         = Λp /Np is an isomorphism of chain complexes [60]. In the
following sections, for simplicity, we use ∂ to denote the boundary operator of Eq. (2.39)
unless specified differently.
2.2.2   Path Complex
A path complex over set V is a nonempty collection P of elementary paths on V for any
n ∈ Z+0,
                        if i0 . . . in ∈ P , then i0 . . . in−1 ∈ P, and i1 . . . in ∈ P. (2.40)
For a fixed path complex, all the paths from P are called allowed (i.e. ik−1 → ik for any
k = 1, . . . , n), while the elementary paths on V that are not in P are non-allowed. We say a
path complex P is perfect if any subsequence of any path from P is also in P . We choose
Pn to denote all n-paths from P . Then the set P−1 has a single empty path e, the set P0
consists of all the vertices of P , and clearly, V = P0 . To be noted, a path complex P is a
collection {Pn }∞   n=−1 satisfying Eq. (2.40). Let K be an abstract simplicial complex defined
over a finite vertex set V , satisfying
                               if σ ∈ K, then any subset of σ is also in K.
The collection of elementary paths on V is denoted by P (K). Follows from [59] (cf. Ex-
ample 3.2), the family P (K) is a path complex, and the allowed n-paths are n-simplices.
                                                          37


2.2.3   Path Homology
For any n ∈ Z+  0 , the K-linear space An is spanned by all the elementary n-paths from a
given path complex P = {Pn }∞         n=0 over a finite set V , i.e.,
                           An = An (P ) = span{ei0 ...in : i0 . . . in ∈ Pn }.
We call the elements of An the allowed n-paths. By the definition of An , An ⊂ Λn , and
An = Λn for n ≤ 0. It is natural that the boundary operator ∂ defined on Rn can be
introduced to An under certain condition: ∂An ⊆ An−1 . For example, for perfect path
complexes, we can obtain a chain complex:
                                       ∂        ∂        ∂          ∂    ∂
                         · · · An −→ An−1 −→ · · · −→ A0 −→ K −→ 0.
    Next, we consider a general path complex P (i.e., ∂An does not have to be a subspace
of An−1 ). For any n ∈ Z+  0 ∪ {−1}, we define a subspace of An :
                                Ωn = Ωn (P ) = {v ∈ An : ∂v ∈ An−1 }.                          (2.41)
The elements of Ωn are called ∂-invariant n-paths. To be noted, ∂Ωn ⊂ Ωn−1 always sat-
isfies. Moreover, ∂ 2 = 0 has been established in the previous section. Therefore, the
augmented chain complex of ∂-invariant paths can be denoted as
                                       ∂         ∂        ∂         ∂    ∂
                          · · · Ωn −→ Ωn−1 −→ · · · −→ Ω0 −→ K −→ 0,                           (2.42)
whose homology group H̃n (P ) of the chain complex in Eq. (2.42) are called the reduced
path homology groups of the path complex P for n ∈ Z+             0 ∪ {−1}. The truncated version of
the chain complex in Eq. (2.42) for n ∈ Z+       0 is:
                                          ∂          ∂        ∂        ∂
                                · · · Ωn −→ Ωn−1 −→ · · · −→ Ω0 −→ 0,                          (2.43)
whose homology group Hn (P ) of the chain complex in Eq. (2.43) are called the path
homology groups of the path complex P .
                                                    38


2.2.4   Path Homology on Directed Graphs
A directed graph is an ordered pair G = (V, E), where V is a set of all vertices and E
is a set of ordered pairs of vertices (i.e. directed edges that satisfy E ⊆ V × V ). If G =
(V, E) does not contain any loop and multiple edge, then it is called simple directed graph.
Moreover, for the path homology of multigraph or quiver, one can refer to Ref. [61]. In
the following section of this work, we use G(V, E) to represent the simple directed graphs
unless specified differently.
    The path complex P (G) is regular if G = (V, E) is a simple directed graph. In this
section, we mainly discuss the regular spaces Ωn (G) = Ωn (P (G)) and their associated
regular homology groups H(G) = Hn (P (G)). Similar to the discussion in Subsection 2.2.3,
given a simple digraph G(V, E), for any n ∈ Z+       0 ∪ {−1}, the space of ∂-invariant n-paths
on G is defined by the subspace of An (G) = An (V, E; K):
                               Ωn = Ωn (G) = {v ∈ An : ∂v ∈ An−1 },
with Ω−1 = A−1 ∼  = K and Ω−2 = A−2 = {0}. Since ∂(Ωn ) ⊆ Ωn−1 (as ∂ 2 = 0), then we have
the following chain complex of V denoted as Ω∗ (V ) = {Ωn },
                                ∂       ∂     ∂        ∂      ∂       ∂
                       · · · −→ Ω3 −→ Ω2 −→ Ω1 −→ Ω0 −→ K −→ 0,
and the associated n- dimensional path homology groups of G = (V, E) are defined as:
                       Hn (G) = Hn (V, E; K) := ker(∂|Ωn )/ im(∂|Ωn+1 ).                  (2.44)
To be noted, the elements of ker(∂|Ωn ) are called n-cycles, and the elements of im(∂|Ωn+1 )
are referred to as n-boundaries. For simplicity, we define ∂n = ∂|Ωn , and the chain complex
of ∂-invariant paths is written as
                                        ∂n+1   ∂ n       ∂n−1
                             · · · Ωn+1 −→ Ωn −→    Ωn−1 −→ Ωn−2 · · · .
Notably, the path cohomology, introduced in Refs. [60, 62], is isomorphic to the dual
space of path homology when the coefficient ring is a field. The associated n- dimensional
                                                  39


path homology groups of digraphs are defined as:
                             H n (G) = H n (V, E; K) := ker(dn+1 )/ im(dn ),                          (2.45)
where d is called coboundary operator.
    Given two simple digraphs G = (V, E) and G′ = (V ′ , E ′ ). According to the Definition
2.2 in [63], a morphism of digraphs/digraphs map from G to G′ is a map f : V → V ′ such that
for any directed edge i → j in E, one has either f (i) → f (j) is a directed edge on E ′ or
f (i) = f (j).
    Let f be a digraph map from G to G′ . For n ∈ Z+                    0 ∪ {−1}, one defines a map (f∗∗ )n :
Λn (V ) → Λn (V ′ ) such that:
                                        (f∗∗ )n (ei0 ...in ) = ef (i0 )...f (in ) .                   (2.46)
Assume ∂ and ∂ ′ are the boundary operators of chain complexes Λ∗ (V ) and Λ∗ (V ′ ), then
for ei0 ...in ∈ Λn , one has
                                                          X n
                         ((f∗∗ )n−1 ◦ ∂)(ei0 ...in ) =        (−1)q (f∗∗ )n−1 (ei0 ...îq ...in )     (2.47)
                                                          q=0
                                                          X n
                                                     =        (−1)q (ef (i0 )...fˆ(iq )...f (in ) )   (2.48)
                                                          q=0
                                                     = (∂ ′ ◦ (f∗ )n )(ei0 ...in ).                   (2.49)
Hence f∗∗ is a chain map. By the definition of digraph map, (f∗∗ )n maps non-regular
elementary n-paths on V to non-regular elementary n-paths on V ′ . Therefore, one has
(f∗∗ )n (Nn (V )) ⊆ Nn (V ′ ), and then (f∗∗ )n descends to a quotient homomorphism of chain
complexes:
                              (f˜∗∗ )n : Λn (V )/Nn (V ) → Λn (V ′ )/Nn (V ′ ).                       (2.50)
It can be verified that Rp ∼     = Λp /Np is an isomorphism of chain complexes [60], then the
map in (2.50) induces a morphism of chain complexes:
                                         (f∗ )n : Rn (V ) → Rn (V ′ ).                                (2.51)
                                                           40


Since (f∗∗ )n maps non-regular paths to non-regular, then similarly to what Eq. (2.47)
shows, (f∗ )n is also a chain map that follows:
                                                   
                                                                       if ef (i0 )...f (in ) is regular,
                                                   
                                                   ef (i0 )...f (in )
                                                   
                           (f∗ )n (ei0 ...in ) :=                                                                      (2.52)
                                                                       otherwise.
                                                   
                                                   0
                                                   
Following the Theorem 2.10 in [63], the induced map (f∗ )n induces a morphism of chain
complexes:
                                              (f∗ )n : Ωn (G; K) → Ωn (G′ ; K)                                         (2.53)
and consequently induces a homomorphism between the path homology groups:
                                     (f∗ )n : Hn (G; K) → Hn (G′ ; K),                  n ≥ 0.                         (2.54)
2.2.5    Homologies of Directed Subgraphs
Some interesting propositions on the homologies of subgraphs provide a way to simplify
complicated digraphs to relatively simple ones. Following the Section 4.2 in [59], three
propositions are discussed.
Proposition 2.2.1 Given a simple digraph G that has a vertex v with n outcoming arrows v →
v0′ , v → v1′ , . . . , v → vn−1
                              ′
                                   . Note that v does not have any incoming arrows. Assume that for all
i ≥ 1, one has v0′ → vi′ . Denote G′ be the subgraph of G by removing the vertex v with all adjacent
edges (i.e. V ′ = V \{v} and E ′ = E\{vvi′ }n−1                                                    ∼        ′
                                                          i=0 ). Then, one has H∗ (G) = H∗ (G ) (See Figure 2.7
a).
Proposition 2.2.2 Given a simple digraph G = (V, E) that has a vertex v with n incoming
arrows v0′ → v, v1′ → v, . . . , vn−1    ′
                                               → v. Note that v does not have any outcoming arrows. Assume
that for all i ≥ 1, one has vi′ → v0′ . Denote G′ = (V ′ , E ′ ) be the subgraph of G by removing
                                                                                                         n−1
the vertex v with all adjacent edges (i.e. V ′ = V \{v} and E ′ = E\{vi′ v}i=0                               ). Then, one has
H∗ (G) ∼ = H∗ (G′ ) (See Figure 2.7 b).
                                                                  41


              a                              b                          c
Figure 2.7: Homologies of directed subgraphs. a, b, and c illustrate three subgraphs
whose homology groups or homology group dimensions are related to the original di-
graphs.
Proposition 2.2.3 Given a simple digraph G = (V, E) that has a vertex v with only one outcom-
ing arrow v → vi′ and only one incoming arrow vj′ → v, where i ̸= j. Denote G′ = (V ′ , E ′ ) be
the subgraph of G (See Figure 2.7 c) by removing the vertex v and the adjacent edges v → vi′ and
vj′ → v (i.e. V ′ = V \{v} and E ′ = E\{vvi′ , vj′ v}). Then,
    (i) dim Hp (G) = dim Hp (G′ ) for p ̸= 2 or for p = 0, 1 if vj′ vi′ is an edge/semi-edge in G′ .
   (ii) If vj′ vi′ is neither an edge or a semi-edge in G′ , but vj′ and vi′ are in the same connected
        component of G′ , then dim H1 (G) = dim H1 (G′ + 1), and dim H0 (G) = dim H0 (G′ ).
 (iii) If vj′ and vi′ are not in the same connected component of G′ , then dim H1 (G) = dim H1 (G′ )
        and dim H0 (G) = dim H0 (G′ ) − 1.
2.2.6     Path Laplacian
Recall that a chain complex of ∂-invariant paths is given by
                                            ∂n+1        ∂ n       ∂n−1
                                 · · · Ωn+1 −→ Ωn −→         Ωn−1 −→ Ωn−2 · · · ,
where Ωn = Ωn (P ) = {v ∈ An : ∂v ∈ An−1 } and ∂n := ∂|Ωn . Alternatively, assume
Sn := Sn (P ) to be the set of n-th elementary paths in P , then we define an inner product
                                              ⟨·, ·⟩ : Sn × Sn → R
                                                           42


such that for any ei0 ...in , ej0 ...jn ∈ Sn , the following satisfies
                                                          
                                                          1 if ei0 ...in = ej0 ...jn ,
                                                          
                                                          
                               ⟨ei0 ...in , ej0 ...jn ⟩ =                               (2.55)
                                                          0 otherwise.
                                                          
                                                          
    Let Mn be a matrix representation of ∂ : An → An−1 with respect to the standard basis
of An and An−1 . Define an inclusion map ιn : Ωn ,→ An , then the matrix representation
of ιn with respect to the basis of Ωn (i.e., the standard basis of An with the removal of
generators that are not in Ωn ) and the standard basis of An is denoted as On . Denote the
boundary matrix representation of ∂n as Bn , then we have
                                                  On−1 Bn = M̃n On .                    (2.56)
If On−1 is a square matrix, then On is actually an identity matrix, and we have
                                          Bn = On−1    −1
                                                           M̃n On = M̃n On ,            (2.57)
where M̃n is Mn with the removal of rows that their basis are not elementary (n − 1)-paths
in P . Otherwise, Bn is the least-square solution to Eq. (2.56).
    Note that Bn is the matrix representation of ∂n with respect to the basis of Ωn and Ωn−1 .
Dual space Ωn := Hom(Ωn , K) of Ωn is equipped with dual maps d to form a cochain
complex
                                             dn+1         d          dn−1
                               · · · Ωn+1 ←− Ωn ←−          n
                                                               Ωn−1 ←− Ωn−2 · · · ,
where dn is called a coboundary operator. The inner product on Ωn induces an inner
product ≪ ·, · ≫ on Ωn such that
                                                     X
                                ≪ f, g ≫=                 f (e)g(e),   ∀f, g ∈ Ωn .
                                                    e∈Sn
We denote the adjoint operator of ∂n be ∂n∗ : Ωn−1 → Ωn . Note that similar inner product
≪ ·, · ≫ on Ωn was defined in the literature [64]. Hence, the coboundary operator dn is
                                                             43


consistent with the adjoint operator ∂n∗ . Then, for integers p ≥ 0, the n-th path Laplacian
operator is a linear operator: ∆n : Ωn → Ωn given by
                                                ∗
                                    ∆n = ∂n+1 ∂n+1    + ∂n∗ ∂n ,                                (2.58)
and ∆0 = ∂1 ∂1∗ . The n-th path Laplacian matrix corresponding to ∆n is expressed by
                                                T
                                   Ln = Bn+1 Bn+1    + BnT Bn .                                 (2.59)
Since Ln is positive semi-definite and symmetric, its eigenvalues are all real and non-
negative. Additionally, recall that the Betti number βn of path complex P satisfies
                          βn = dim ker ∂n − dim im ∂n+1 = dim ker ∆n .                          (2.60)
It is easy to show that
                  βn = nullity(Ln ) = the number of zero eigenvalues of Ln .                    (2.61)
Moreover, assume the dimension of Ln is N , then the set of spectra of Ln is denoted as
                           Spectra(Ln ) = {(λ1 )n , (λ2 )n , · · · , (λN )n }.
Figure 2.8 shows 5 digraphs with multiple vertices and directed edges. Here, we take
them as examples to give a detailed illustration of Ln matrix constructions.
     Construction of L0 – Figure 2.8a Since L0 = B1 B1T , then we first construct B1 , where
                                                                              e1 e2 e3
                                                                                      
                                                                        e1  1   0  0
B1 = O0−1 M̃1 O1 according to Eq. (2.57), we have O0 =                                 , and M1 =
                                                                                      
                                                                        e2 
                                                                           0    1  0
                                                                                      
                                                                        e3 0     0  1
       e12  e23   e31                     e12  e23    e31
                                                        
 e1  −1 0         1               e12  1     0      0 
                      , and O1 = e  0
                                                           . Since e1 , e2 , and e3 are all elemen-
                                                        
 e2 
       1   −1     0                23        1      0 
                                                        
 e3 0        1 −1                   e31 0       0      1
                                              44


                                a                       b
                                c                       d
                           e                             f
Figure 2.8: Five digraphs. a and b Digraphs with 3 vertices and 3 directed edges. c and d
Digraphs with 4 vertices and 4 directed edges. e A digraph with 6 vertices and 8 directed
edges. f A digraph with 6 vertices and 8 directed edges.
                                                                    e12  e23  e31
                                                                                 
                                                               e1  −1 0       1 
tary 0-paths (vertices), M1 = M̃1 . We have B1 = O0−1 M̃1 O1 = e2                . Then
                                                                                 
                                                                    1   −1    0  
                                                                                 
                                                               e3 0       1 −1
                             
                2 −1 −1
               −1 2 −1, which gives Spectra(L0 ) = {0, 3, 3} and thus, one finally
                             
L0 = B1 B1T =                
                             
                 −1 −1 2
has β0 = 1.
   Construction of L1 – Figure 2.8a We have L1 = B2 B2T + B1T B1 , where B1 has been
formed, so we focus on the construction of B2 = O1−1 M̃2 O2 according to Eq. (2.57). Since
                                            45


                                                  e123   e231  e312
                                                                   
                                            e11   0      0     0 
                                                                   
                                            e12 
                                                  1      0     1  
                                                                   
              e12 e23  e31                  e13 
                                                
                                                  −1      0     0 
                                                                    
                                                                 
                                                                   
        e12  1    0    0                  e21 
                                                  0     −1     0  
                           ,  and M2 = e22                        , and O2 is a 3 × 0 empty
                                                                 
O1 = e23    0    1    0                        0      0     0 
                                                                 
        e31 0      0    1                   e23   1      1     0 
                                                                   
                                                                   
                                                                   
                                            e31 
                                                  0      1     1  
                                                                   
                                            e32 
                                                  0      0    −1  
                                                                   
                                            e33    0      0     0
matrix since Ω2 = {0}.Therefore, B    2 = O1 M̃2 O2 is a 3 × 0 empty matrix. Additionally,
                                             −1
                        2 −1 −1
                                        , where Spectra(L1 ) = {0, 3, 3} and thus, one finally
                                       
L1 = B2 B2T +B1T B1 =  −1     2   −1  
                                       
                         −1 −1 2
has β1 = 1.
    Construction of L2 – Figure 2.8a We have L2 = B3 B3T + B2T B2 , where B2 is an empty
matrix. Hence, we focus on the construction of B3 = O2−1 M̃3 O3 according to Eq. (2.57).
We have A2 = span{e123 , e231 , e312 } and A1 = span{e12 , e23 , e31 }. Note that ∂2 (e123 ) =
e23 − e13 + e12 where e13 is not in A1 . Hence, e123 is not in Ω2 . The same conclusion can be
deduced for e231 and e312 . Therefore, we have Ω2 = {0}, and it is straightforward to get
that L2 is an empty matrix.
    Construction of L0 – Figure 2.8b Since L0 = B1 B1T , then we should first construct
                                                                                  e1  e2  e3
                                                                                            
                                                                             e1  1   0   0
B1 , where B1 = O0−1 M̃1 O1 according to Eq. (2.57). Since O0 =                              ,
                                                                                            
                                                                             e2 
                                                                                0    1   0
                                                                                            
                                                                             e3 0     0   1
                                               46


                 e12    e13 e23                         e12 e13   e23
                                                                   
           e1  −1 −1 0                          e12  1    0     0 
                                , and O1 =
                                                                      . Since e1 , e2 , and e3 are
                                                                   
M1 =       e2 
                 1      0  −1                   e13 
                                                       0    1     0 
                                                                   
           e3 0          1   1                    e23 0      0     1
all elementary 0-paths (vertices). Therefore, M1 = M̃1 , and we have B1 = O0−1 M̃1 O1 =
       e12   e13    e23
                                                             
 e1  −1 −1 0                                   2 −1 −1
                        . Then L0 = B1 B1 = −1 2 −1, which gives the Spectra(L0 ) =
                                          T
                                                               
 e2 
     1       0 −1                                            
                                                             
 e3 0         1      1                            −1 −1 2
{0, 3, 3} and thus, one finally has β0 = 1.
    Construction of L1 – Figure 2.8b We have L1 = B2 B2T + B1T B1 , where B1 has been
formed, so we focus on the construction of B2 = O1−1 M̃2 O2 according to Eq. (2.57).
First, A2 = span{e123 } and A1 = span{e12 , e13 , e23 }. Note that ∂2 (e123 ) = e23 − e13 + e12
where e12 , e23 , and e13 are all in A1 . Hence, Ω2 = A2 = span{e123 }. Note that O1 =
                                         e123
                                             
                                   e11   0 
                                             
                                   e12 
                                         1  
                                             
        e12   e13    e23           e13 
                                       
                                         −1 
                                              
                                           
                                                                   e
                                             
 e12  1       0      0           e21 
                                         0                     123 
                      0 , M2 = e22          , and O2 = e123       1 . The e11 , e21 , e22 , e31 , e32 ,
                                           
 e13 
      0       1         
                                         0 
                                           
 e23 0         0      1            e23   1 
                                             
                                             
                                             
                                   e31 
                                         0  
                                             
                                   e32 
                                         0  
                                             
                                   e33    0
                                                  47


                                                                      e123
                                                                          
                                                                e12  1 
and e33 are not elementary 1-paths in P . Hence, M̃2 =                     , and then B2 =
                                                                          
                                                                e13 
                                                                     −1   
                                                                          
                                                                e23    1
                     e123
                                                                    
               e12  1                                      3 0 0
O1−1 M̃2 O2 = e13  −1 . Therefore, L1 = B2 B2 +B1 B1 = 0 3 0, where Spectra(L1 ) =
                                                T   T
                                                                      
                                                                     
                                                                    
               e23    1                                        0 0 3
{3, 3, 3} and thus, we finally have β1 = 0.
    Construction of L2 – Figure 2.8b According to Eq. (2.59), we have L2 = B3 B3T + B2T B2
and B3 = O2−1 M̃3 O3 . Since there is no 3-path existing, so the M3 and O3 are both empty
matrix. Hence L2 = (3), Spectra(L2 ) = {3}, and thus, one has β2 = 0.
    In the following section, we will omit the detailed construction steps of boundary
matrix Bn . Table 2.11, Table 2.12, Table 2.13, and Table 2.14 list the boundary matrix Bn
and the n-th path Laplacian matrix Ln for with its corresponding Betti numbers βn and
spectrum Spectra(Ln ) for Figure 2.8 c, d, e, and f. It is worth to mention that βn can
distinguish the same graph with different paths assigned. For example, Figure 2.8 c and
d have the same undirected graph structure with different paths assigned. We have β1 = 0
for Figure 2.8 c and β1 = 1 for Figure 2.8 d.
2.2.7   Persistent Path Laplacian
From Section 2.2.6, the way to calculate both harmonic spectra (topological invariants)
and non-harmonic spectra of n-th path Laplacian matrix is genuinely free of metrics or
coordinates, which contains too little information to fully describe the object. Therefore,
inspired by the idea of the persistent spectral graph (PSG), persistent path Laplacian (PPL)
is proposed to create a sequence of digraphs induced by varying a filtration parameter to
encode more geometric or structural information.
                                              48


                       Table 2.11: Illustration of digraph c in Figure 2.8.
         n                    n=0                               n=1                            n=2
       Ωn              span{e1 , e2 , e3 , e4 }          span{e12 , e14 , e23 , e43 }    span{e143 − e123 }
                       e12 e14 e23 e43                      e143 − e123 
                  e1     −1 −1 0             0            e12        −1
     Bn+1         e2  1
                              0 −1 0           
                                                         e14 
                                                                     1        
                                                                                       1 × 0 empty matrix
                  e3  0       0      1      1           e23       −1        
                  e4      0    1      0 −1                e43         1
                                                                                 
                         2 −1 0 −1                          3   0      0 −1
                      −1 2 −1 0                        0     3 −1 0                             
        Ln           
                      0 −1 2 −1 
                                                       
                                                         0 −1 3
                                                                                                4
                                                                                0 
                        −1 0 −1 2                          −1 0        0        3
        βn                       1                                  0                            0
 Spectra(Ln )               {0, 2, 2, 4}                      {2, 2, 4, 4}                      {4}
                       Table 2.12: Illustration of digraph d in Figure 2.8.
               n                        n=0                                 n=1                n=2
              Ωn               span{e1 , e2 , e3 , e4 }         span{e12 , e14 , e32 , e34 }    {0}
                              e12 e14 e32 e34            
                          e1     −1 −1 0             0
                                                                 4 × 0 empty matrix
                                                                                                     
             Bn+1         e2  1
                                         0     1    0    
                                                                                                 /
                          e3      0      0 −1 −1         
                          e4       0      1     0    1
                                                                                     
                                 2 −1 0 −1                               2    1    1  0
                              −1 2 −1 0                            1       2    0  1             
              Ln             
                              0 −1 2 −1 
                                                                   
                                                                     1
                                                                                                 /
                                                                              0    2  1 
                                −1 0 −1 2                                0    1    1  2
              βn                            1                                    1                0
          Spectra(Ln )               {0, 2, 2, 4}                         {0, 2, 4, 4}            /
    First, we consider a filtration of digraphs G : R → D, which is a morphism fs,t :
Hp (Gt ; K) → Hp (Gs ; K) from the category of real number R to the category of digraphs
D that satisfies:
                                           G(t) ⊆ G(s), ∀t ≤ s,
                                                    49


                                Table 2.13: Illustration of digraph e in Figure 2.8.
      n                                 n=0                                                    n=1                                              n=2
     Ωn                    span{e1 , e2 , e3 , e4 , e5 , e6 }              span{e12 , e13 , e24 , e25 , e34 , e35 , e64 , e65 } span{e134 − e124 , e135 − e125 }
                                                                                   e134 − e124        e135 − e125 
                   e12 e13 e24 e25 e34 e35 e64 e65                          e12        −1                −1
              e1    −1 −1 0              0        0       0    0    0         e13  
                                                                                          1                 1          
                                                                                                                        
              e2    1     0 −1 −1 0                      0    0    0        e24       −1                  0          
                                                                                                                                        2 × 0 empty matrix
                                                                                                                     
   Bn+1       e3 
                    0     1     0       0 −1 −1 0                  0 
                                                                             e25  
                                                                                          0               −1           
                                                                                                                        
              e4 
                    0     0     1       0        1       0    1    0 
                                                                             e34  
                                                                                          1                 0          
                                                                                                                        
              e5    0     0     0       1        0       1    0    1        e35  
                                                                                          0                 1          
                                                                                                                        
              e6     0     0     0       0        0       0 −1 −1             e64         0                 0          
                                                                              e65          0                 0
                                                                                                                                
                                                                            4 −1 0             0 −1 −1 0 0
                          2 −1 −1 0                     0     0           
                                                                            −1 4 −1 −1 0                        0 0 0          
                     
                        −1 3           0 −1 −1 0                
                                                                          
                                                                             0 −1 3             1       0       0 1 0                             
                        −1 0           3 −1 −1 0                           0 −1 1             3       0       0 0 1                         4 2
     Ln                                                                                                                       
                     
                         0 −1 −1 3                     0 −1     
                                                                          
                                                                            −1 0        0       0       3       1 1 0                        2 4
                         0 −1 −1 0                     3 −1             
                                                                            −1 0        0       0       1       3 0 1          
                          0     0       0 −1 −1 2                            0     0    1       0       1       0 2 1 
                                                                              0     0    0       1       0       1 1 2
     βn                                     1                                                     1                                               0
Spectra(Ln )                 {0, 1.4384, 3, 3, 3, 5}                            {0, 1.4384, 2, 3, 3, 3, 5.5616, 6}                              {2,6}
                                Table 2.14: Illustration of digraph f in Figure 2.8.
     n                             n=0                                                              n=1                                               n=2
    Ωn                   span{e1 , e2 , e3 , e4 , e5 , e6 }                     span{e12 , e15 , e23 , e26 , e42 , e45 , e53 , e56 }            span{e153 − e123 ,
                                                                                                                                                   e156 − e126 ,
                                                                                                                                                   e453 − e423 ,
                                                                                                                                                   e456 − e426 }
                                                                            e − e123     e156 − e126        e453 − e423         e456 − e426
                                                                           153
                  e12 e15 e23 e26 e42 e45 e53 e56                     e12       −1            −1
                                                                                                                                            
                                                                                                                 0                  0
             e1   −1 −1 0                                             e15
                                                                  
                                    0        0      0       0   0         
                                                                                1             1                 0                  0       
                                                                                                                                            
             e2   1     0 −1 −1 1                  0       0   0    e23      −1             0               −1                   0       
                                                                                                                                               4 × 0 empty matrix
                                                                                                                                         
   Bn+1      e3 
                  0     0    1     0        0      0       1   0  
                                                                     e26 
                                                                                0            −1                 0                 −1       
                                                                                                                                            
             e4 
                  0     0    0     0 −1 −1 0                   0  
                                                                     e42 
                                                                                0             0               −1                  −1       
                                                                                                                                            
             e5   0     1    0     0        0      1 −1 −1          e45 
                                                                                0             0                 1                  1       
                                                                                                                                            
             e6    0     0    0     1        0      0       0   1     e53       1             0                 1                  0       
                                                                      e56        0             1                 0                  1
                                                                                 4 −1 0                            0 −1 −1
                                                                                                                                     
                                                                                                   0       1
                            −1 0           0 −1 0                               −1 4 −1 −1 0
                                                             
                       2                                                     
                                                                                                                  1      0       0                           
                   
                      −1    4 −1 −1 0 −1                    
                                                                             
                                                                                0 −1 4            1       0 −1 −1 0                             4 2     2   0
                       0   −1 2           0 −1 0                              0 −1 1            4       0 −1 0 −1                            2 4      0   2 
    Ln                                                                                                                                                      
                   
                       0   −1 0           2 −1 0            
                                                                             
                                                                                1     0     0     0       4 −1 −1 −1                
                                                                                                                                                 2 0      4   2 
                      −1    0 −1 −1 4 −1                                   
                                                                                0     1 −1 −1 −1 4                       0       0              0 2     2   4
                        0   −1 0           0 −1 2                              −1 0 −1 0 −1 0                            4       1 
                                                                                −1 0         0 −1 −1 0                    1       4
    βn                                 1                                                               0                                                 1
Spectra(Ln )                  {0, 2, 2, 2, 4, 6}                                           {2, 2, 2, 4, 4, 4, 6, 8}                                  {0,4,4,8}
where Gt := G(t) ∈ D and Gs := G(s) ∈ D. Consider a sequence of finitely many positive
                                                                           50


integers 1, 2, . . . , m, we have a sequence of digraphs
                                         G1 ⊆ G2 ⊆ · · · ⊆ Gm .
For each digraph Gt , we denote its corresponding chain group to be Ωn (Gt ), and the n-
boundary operator of Gt is denoted by ∂nt : Ωn (Gt ) → Ωn−1 (Gt ), ∀n ≥ 0 .
   Similarly, as in persistent homology, a sequence of chain complexes can be denoted as
                   1
                  ∂n+1            ∂1          ∂1          ∂1                ∂1         ∂1
     · · · Ω1n+1 −−−→ Ω1n          n
                                  −→        3
                                     · · · −→      Ω12     2
                                                          −→       Ω11       1
                                                                            −→   Ω10    0
                                                                                       −→ Ω1−1
            ,→             ,→                      ,→               ,→           ,→
                   2
                  ∂q+1            ∂2          ∂2          ∂2                ∂2         ∂2
     · · · Ω2n+1 −−→ Ω2n           n
                                  −→        3
                                     · · · −→      Ω22     2
                                                          −→       Ω21       1
                                                                            −→   Ω20    0
                                                                                       −→ Ω2−1
                                                                                                      (2.62)
            ···            ···                     ···              ···          ···
            ,→             ,→                      ,→               ,→           ,→
                   m
                  ∂q+1            ∂m         ∂m           ∂m                ∂m         ∂m
     · · · Ωm        m
            n+1 −−→ Ωn
                        n
                       −→        3
                          · · · −→ Ωm
                                    2
                                       2
                                      −→ Ωm
                                          1
                                             1
                                            −→ Ωm
                                                0
                                                   0
                                                  −→ Ωm
                                                      −1
For the sake of simplicity, we use Ωtn to represent Ωn (Gt ). Suppose a subset of Ωsn whose
boundary is in Ωtn−1 as:
                                   Ωt,s        s    s      t
                                    n := {α ∈ Ωn | ∂n α ∈ Ωn−1 }.                                     (2.63)
                                                  n : Ωn → Ωn−1 , and its corresponding
The persistent n-boundary operator is denoted as ðt,s  t,s  t
adjoint operator is (ðt,s
                      n ) : Ωn−1 → Ωn . Therefore, the persistent n-th path Laplacian
                          ∗  t      t,s
operator ∆t,s
          n : Ωn → Ωn defined along the filtration is:
               t    t
                                                         ∗        ∗
                                       ∆t,s  t,s  t,s
                                        n = ðn+1 ðn+1         + ∂nt ∂nt .                             (2.64)
                                                                                            ∗
Since ∆t,s
       n inherits the inner product from ðn+1 , then the adjoint map ðn+1
                                          t,s                         t,s
                                                                                                 is well de-
fined. Intuitively, the matrix representation of ∆t,s
                                                  n is
                                 Lt,s  t,s
                                  n = Bn+1 P
                                             −1   t,s T
                                                (Bn+1 ) + (Bnt )T Bnt ,                               (2.65)
where P −1 is the associated inner product matrix of Ωt,s
                                                      n+1 with arbitrary basis. Moreover,
assume the dimension of Lt,s
                         n is N , then the spectra of Ln that are arranged in ascending
                                                       t,s
                                                   51


order can be displayed as:
                           Spectra(Lt,s                  t,s       t,s
                                        n ) = {(λ1 )n , (λ2 )n , · · · , (λN )n }.
                                                                                   t,s
Note that the smallest non-harmonic spectra of Lt,s               n is denoted as (λ̃2 )n . We call the mul-
                                                                                           t,s
tiplicity of zero spectra of Lt,s
                                q to be persistent n-th Betti number βn from Gt to Gs .
                                                                                       t,s
                   n ) = the number of zero eigenvalues (i.e., harmonic eigenvalues) of Ln .
 βnt,s = nullity(Lt,s                                                                                     t,s
                                                                                                       (2.66)
      Distanced-based filtration Specifically, suppose G(w) = (V, E, w) is a weighted di-
graph, where V is the set of the vertices and E is the set of the directed edges. Assume w
is a weight function w : E → R. For example, if V is in the Euclidean space, then a digraph
G(w) is a geometric digraph (a geometric digraph is a digraph in which the vertices are
embedded as points in the Euclidean space, and the edges are embedded as non-crossing
directed line segments). For any (i, j) ∈ E where i, j ∈ V , we define w(i, j) = ∥i − j∥,
where ∥ · ∥ is a Euclidean metric. Hence, for every δ ∈ R, a digraph can be described as
Gδ = (V, E δ ) = (V, {e ∈ E : w(e) ≤ δ}), and a filtration of digraphs can be described as
            ′
{Gδ ,→ Gδ }δ≤δ′ .
      Therefore, the persistent n-th path Laplacian matrix defined on the filtration is
                                     ′          ′              ′
                               Lδ,δ
                                  n = Bn+1 P
                                            δ,δ   −1
                                                     (Bn+1  δ,δ T
                                                                 ) + (Bnδ )T Bnδ ,                     (2.67)
where its corresponding Betti numbers and spectra can be expressed as:
      ′               ′                                                                                        ′
βnδ,δ = nullity(Lδ,δ
                   n ) = the number of zero eigenvalues (i.e., harmonic eigenvalues) of Ln .
                                                                                                           δ,δ
                                                                                                       (2.68)
              ′             ′         ′                ′
Spectra(Lnδ,δ ) = {(λ1 )δ,δ       δ,δ              δ,δ
                         n , (λ2 )n , · · · , (λN )n }.                                                (2.69)
                                                                      ′
Notably, the Fiedler value (i.e., spectral gap) of Lδ,δ            n is widely used in many other areas
                                                                          ′
such as physics and geography, which is denoted as λ̃δ,δ                n . As shown below, it is sensitive
to both topological and geometric changes.
                                                        52


    Moreover, it is worth to mention that isolated points (vertices) can be either included
in the digraphs (under the distance-based filtration) or removed from the digraphs (under
the distanced-based filtration with removal of isolated points).
                                            53


                                          CHAPTER 3
            METHODS ON MATHEMATICAL MODELING OF VIROLOGY
3.1    Genomics Analysis
3.1.1   Sequence Alignment
Sequence alignment is a method in which one can arrange DNA, RNA, or amino acid
sequences to identify their similar regions [65]. Such similar regions may arise from func-
tional, structural, geometrical, or evolutionary similarities. Though sequence alignment
offers the best accuracy, it is not practical to be used for a large sample size. There are two
main categories of sequence alignment, namely pair-wise sequence alignment and mul-
tiple sequence alignment. The former only compares two sequences at a time, while the
latter compares many sequences. There are many popular tools for sequence alignment
such as BLAST (Basic Local Alignment Search Tool) for pair-wise alignment and MAFFT,
Clustal Omega, ClustalW, and MUSCLE, for multiple sequence alignment. The following
section describes BLAST first followed by several multiple sequence alignment tools.
3.1.1.1   Pairwise Sequence Alignment
One of the popular pair-wise sequence alignment tools is BLAST. BLAST is a local sim-
ilarity search tool that is commonly used to find similar DNA, RNA, and amino acid
sequences to the sequence in question. BLAST was created in 1990 based on the k-tuple
method, and has since been implemented in the GenBank, and had numerous updates to
increase efficiency and accuracy. k-tuple method [66] is a fast heuristic method for pair-
wise alignment and is commonly used as an initial step for a large sample size. Similarity
score, Sij between sequences i and j is defined as the number of k-tuple matches in the
best pairwise alignment minus a fixed gap penalty term. For DNA and RNA, k usually
                                                54


ranges from 2 to 4, and for amino acids, k is 1 or 2. Sij is calculated as the number of
identities divided by the number of residues compared between i and j. The distance is
defined as,
                                                  Sij
                                       dij = 1 −      .                                 (3.1)
                                                  100
Note that this method does not guarantee optimal alignment, but it is a fast heuristic
method and can be used for the initialization of BLAST and multiple sequence alignment.
    BLAST begins by first creating a list of k-letter words. It then searches for possible
matching k-letter words in the databank and scores them, and any words that score above
a threshold are kept. The high-scoring words are kept in a search tree. This process is then
extended to high scoring pairs (HSPs), which also looks for similar words, rather than
only looking at exact matching words. After searching for HSPs, the significance of the
HSPs score is considered by utilizing Gumbel extreme value distribution (EVD). Further
details can be found in the literature [67, 68]. The GenBank tutorial can be found in Ref.
[69]. As a basic tool for sequence alignment, it is utilized to detect, identify, or search
sequences in a database. For example, similar coronavirus strands in other organisms,
such as that of pangolins [70, 71] and bats[72] were found. This tool is also used to detect
SARS-CoV-2 virus in the environment[73, 74] such as waste waters[75, 76].
3.1.1.2  Multiple Sequence Alignment (MSA)
Unlike pair-wise sequence alignment, MSA arranges 3 or more DNA, RNA, or protein
sequences by identical regions. Through multiple sequence alignment, one can further
analyze sequence homology to find evolutionary origins. In many cases, one uses a ref-
erence sequence, which is the first sequenced data, to observe mutation in SARS-CoV-2
genome [77]. There are several popular tools, Clustal[78], MUSCLE[79], MAFFT[80, 81],
etc.
                                             55


Clustal Clustal is a series of multiple sequence alignment tools for sequence analysis.
With the first version Clustal released in 1988[78], its package has been developed for
several generations based on different methods. ClustalW is the third generation and
is updated to ClustalW2 currently, which aligns sequences with the best similarity score
first, and progressively aligns more distant scores[82, 83]. This is achieved by first ob-
taining a rough pairwise sequence alignment using the k-tuple method [66], followed
by a neighbor-joining method [84], which uses midpoint rooting to create a guided tree.
ClustalW2 is used as the basis for global alignment.
     As for Clustal Omega, unlike the ClustalW, it uses a guided tree approach, rather
than a progressive alignment method. Clustal Omega begins with first producing a pair-
wise alignment using the k-tuple method. This, however, does not guarantee finding
optimal alignment, but it is time-efficient. Then, the sequences are clustered using the
mBed method [85], which calculates pairwise distance using the embedding method. Af-
terward, K-means clustering is used to further cluster the sequence. Then, a guided
tree is formed utilizing the UPGMA method [86]. Lastly, MSA is produced using the
HHAlign package from HH-Suite [86]. Clustal Omega’s advantage comes from the large-
scale MSA. The accuracy and time complexity are average for a low number of samples.
For a large number of samples with a long sequence, Clustal Omega produces high ac-
curacy and is time-efficient. ClustalW is the updated version of the original Clustal MSA
tool.
Multiple alignment using fast Fourier transform (MAFFT) MAFFT is a MSA package
based on fast Fourier transform (FFT). Given two sequences v1 and v2 , the correlation
cv (s) of volume between the two sequences with positional lag of s sites can be defined as
                                          X
                            cv (s) =                v̂1 (n)v̂2 (n + s)
                                     1≤n≤N,1≤n+s≤M
where v̂1 and v̂2 are the FFT of the two sequences. If homologous regions exists, through
Fourier analysis, there will be a peak in similar region. For amino acid sequences, MAFFT
                                              56


also calculates correlation between polarity:
                                           X
                            cρ (s) =                 ρ̂1 (n)ρ̂2 (n + s)
                                     1≤n≤N,1≤n+s≤M
where ρ(s) is the polarity of each amino acid, N is the length of v1 , and M is the length of
v2 . Then, a scoring function can be calculated through the sum of the two correlations
                                      c(s) = cv (s) + cρ (s).
To reduce the computational complexity, only peaks above some threshold are consid-
ered. Note that the peak does not tell the location of the homologous region directly,
and only shows the lag. Therefore, neighboring regions at the peak must be analyzed
carefully. Further details of MAFFT can be found in the literature [80, 81].
3.1.2   Single Nucleotide Polymorphism Calling
Single nucleotide polymorphism (SNP) calling measures the genetic variations between
different members of a species. Establishing the SNP calling method to the investigation
of the genotype changes during the transmission and evolution of SARS-CoV-2 is of great
importance [21, 25]. By analyzing the rearranged genome sequences, SNP profiles, which
record all of the SNP positions in teams of the nucleotide changes and their corresponding
positions, can be constructed. The SNP profiles of a given SARS-CoV-2 genome isolated
from a COVID-19 patient capture all the differences from a complete reference genome
sequence and can be considered as the genotype of the individual SARS-CoV-2.
3.1.3   Jaccard Distance of SNP profiles
In this work, we use the Jaccard distance to measure the similarity between SNP profiles
and compare the difference between the SNP variant profiles of SARS-CoV-2 genomes.
                                               57


    The Jaccard similarity coefficient is defined as the intersection size divided by the
union of two sets A and B [87]:
                                       |A ∩ B|         |A ∩ B|
                           J(A, B) =           =                     .                   (3.2)
                                       |A ∪ B|   |A| + |B| − |A ∩ B|
The Jaccard distance of two sets A and B is scored as the difference between one and the
Jaccard similarity coefficient and is a metric on the collection of all finite sets:
                                                    |A ∪ B| − |A ∩ B|
                          dJ (A, B) = 1 − J(A, B) =                    .                 (3.3)
                                                         |A ∪ B|
Therefore, the genetic distance of two genomes corresponds to the Jaccard distance of
their SNP profiles.
    In principle, the Jaccard distance of SNP profiles takes account of the ordering of
SNP positions, i.e., transmission trajectory, when an appropriate reference sample is se-
lected. However, one may fail to identify the infection pathways from the mutual Jaccard
distances of multiple samples. In this case, the dates of the sample collection provide
key information. Additionally, clustering techniques, such as k-means, UMAP, and t-
distributed stochastic neighbor embedding (t-SNE), enable us to characterize the spread
of COVID-19 onto the communities.
3.1.4   k-nearest Neighbors
The k-nearest neighbors algorithm (k-NN) is a non-parametric technique proposed by
Thomas Cover and P. E. Hart in 1967 [88]. k-NN can be used for solving both regression
and classification problems [89], and it is sensitive to the local structure of the data. The
flowchart of the k-NN algorithm can be found in Figure 3.1. The features of the training
set is {xi }ni=1 with xi ∈ Rm , k shows the number of the nearest neighbors, and x ∈ Rm is
a feature representation of the training set. Different distance metrics can be employed
in the k-NN algorithm, such as Euclidean distance, Manhattan distance, Minkowski dis-
tance, Chebyshev distance, natural log distance, generalized exponential distance, gener-
                                               58


alized Lorentzian distance, Canberra distance, quadratic distance, and Mahalanobis dis-
tance.
                          Start                Input feature vector x          Set k
                                       Compute the distance between x and xi
                                    Sort the distance values in an ascending order
                                     Choose the top k rows from the sorted array
                                 If Classification                      If Regression
                     Assign the label of xi based                   Assign the label of xi based
                     on the most frequent label of k rows           on the average label of k rows
                                      Yes        Is the performance of                   No
                           End                   the model satisfying?
Figure 3.1: The flowchart of k-NN algorithm. The features of the training set is {xi }ni=1
with xi ∈ Rm , k shows the number of the nearest neighbors, and x ∈ Rm is a feature
representation of the training set.
3.1.5   k-means Clustering
k-means clustering is an unsupervised learning algorithm, aiming to partition a set of
observations into k subsets or clusters. It typically partitions a given dataset
                                X = {x1 , x2 , · · · , xn , · · · , xN }, xn ∈ Rd
into k different clusters {C1 , C2 , · · · , Ck }, k ≤ N such that the specific clustering criteria
are optimized. The standard procedure of k-means clustering method aims to obtain the
optimal partition for a fixed number of clusters. First, we randomly pick k points as
the cluster centers and then assign each data to its nearest cluster. Next, we calculate
the within-cluster sum of squares (WCSS) defined below to update the cluster centers
iteratively.
                                                X k    X
                                                              ∥xi − µk ∥22 ,                       (3.4)
                                                i=1 xi ∈Ck
                                                              59


where µk is the mean value of the points located in the k-th cluster Ck . Here, ∥ · ∥2 de-
notes the L2 distance. It is noted that the k-mean clustering method described above aims
to find the optimal partition for a fixed number of clusters. However, seeking the best
number of clusters for the SNP profiles is essential as well. In this work, by varying the
number of clusters k, a set of WCSS with its corresponding number of clusters can be
plotted. The location of the elbow in this plot will be taken as the optimal number of
clusters. Such a procedure is called the Elbow method which is frequently applied in the
k-means clustering problem.
    Specifically, in this work we apply the k-means clustering with the Elbow method for
the analysis of the optimal number of the subtypes of SARS-CoV-2 SNP profiles. The
pairwise Jaccard distances between different SNP profiles are considered as the input
features for the k-means clustering method.
3.2    Mathematical-assisted Machine Learning Models in SARS-CoV-2
In this section, the workflow of the deep learning-based BFE change predictions of protein-
protein interactions induced by mutations for the present SARS-CoV-2 variant analysis
and prediction will be firstly introduced, which includes three steps as shown in Fig-
ure 3.2: (1) Data collection and pre-processing; (2) training data preparation; (3) feature
generations of protein-protein interaction complexes; (4) predictive models of protein-
protein interactions.
3.2.1   Data Collection and Pre-processing
The first step is to pre-process the original SARS-CoV-2 sequences data. In this step, a
total of 1,983,328 complete SARS-CoV-2 genome sequences with high coverage and ex-
act collection date are downloaded from the GISAID database [90] ( https://www.gisa
id.org/) as of August 05, 2021. Complete SARS-CoV-2 genome sequences are available
from the GISAID database [90]. Next, the 1,983,328 complete SARS-CoV-2 genome se-
                                              60


 a                                                                             b
                                                                                                                                               Pre. BFE change (reduction)
1,489,884 patients       Viruses      Sequencing           Genotyping
                                                                               Exp. IC 50 fold change (reduction)
                                                                                                                                         10
                                                             28,478                                                 30
                                                             single                                                                      8
                                                            mutations                                               20                   6
                                         GISAID
                                                                                                                                         4
                                                                                                                    10
     Convolution      Topological barcodes                                                                                               2
                                                   ACE2            683 RBD                                          0                    0
                                                                   mutations                                             L452R   T478K
                                                                               c
                                                                                                                                               BFE change (kcal/mol)
                      +
                                                                               Relative luciferase units
                      Other inputs                 130                                                       8×105                       0.6
                                                   antibodies
              Pooling & Dropout                                                                             6×105
                                                                                                                                         0.4
                                                                                                            4×10     5
                                                                    BFE                                                                  0.2
                                                                    changes                                2×105
                                                                                                                    0                    0
                                     Convolution      Flattening                                                         L452R   N501Y
Figure 3.2: Illustration of genome sequence data pre-processing and BFE change predic-
tions.
quences were rearranged according to the reference genome downloaded from the Gen-
Bank (NC_045512.2)[91], and multiple sequence alignment (MSA) is applied by using
Cluster Omega with default parameters. Then, single nucleotide polymorphism (SNP)
genotyping is applied to measure the genetic variations between different isolates of
SARS-CoV-2 by analyzing the rearranged sequences [21, 92], which is of paramount im-
portance for tracking the genotype changes during the pandemic. The SNP genotyping
captures all of the differences between patients’ sequences and the reference genome,
which decodes a total of 28,865 unique single mutations from 1,983,328 complete SARS-
CoV-2 genome sequences. Among them, 724 non-degenerate mutations on the S protein
RBD (S protein residues from 329 to 530) are detected. In this work, the co-mutation anal-
ysis is more crucial than the unique single mutation analysis. Notably, the SARS-CoV-2
unique single mutations in the world are available at Mutation Tracker. The analysis of
RBD mutations is available at Mutation Analyzer.
                                                          61


3.2.2   Preparation of Machine learning Datasets
Dataset is important to train accurate machine learning models. Both the BFE changes
and enrichment ratios describe the effects on the binding affinity of protein-protein inter-
actions. Therefore, integrating both kinds of datasets can improve the prediction accu-
racy. Especially, due to the urgency of COVID-19, the BFE changes of SARS-CoV-2 data
are rarely reported, while the enrichment ratio data via high-throughput deep mutations
are relatively easy to obtain. The most important dataset that provides the information for
binding free energy changes upon mutations is the SKEMPI 2.0 dataset [93]. The SKEMPI
2.0 is an updated version of the SKEMPI database, which contains new mutations and
data from other three databases: AB-Bind [94], PROXiMATE[95], and dbMPIKT [96].
There are 7,085 elements, including single- and multi-point mutations in SKEMPI 2.0.
4,169 variants in 319 different protein complexes are filtered as single-point mutations are
used for our TopNetTree model training. Moreover, SARS-CoV-2 related datasets are also
included to improve the prediction accuracy after a label transformation. They are all
deep mutation enrichment ratio data, mutational scanning data of ACE2 binding to the
receptor-binding domain (RBD) of the S protein [97], mutational scanning data of RBD
binding to ACE2 [98, 3], and mutational scanning data of RBD binding to CTC-445.2 and
of CTC-445.2 binding to the RBD [3]. Note that our training datasets used in the valida-
tion do not include the test dataset, which is a mutational scanning data of RBD binding
to ACE2.
3.2.3   Features Generalization
Once the data pre-processing and SNP genotyping are carried out, we will firstly pro-
ceed with the training data preparation process, which plays a key role in reliability and
accuracy. A library of 130 antibodies and RBD complexes, as well as an ACE2-RBD com-
plex, are obtained from Protein Data Bank (PDB). RBD mutation-induced BFE changes of
these complexes are evaluated by the following machine learning model. According to
                                            62


the emergency and the rapid change of RNA virus, it is rare to have massive experimental
BFE change data of SARS-CoV-2, while, on the other hand, next-generation sequencing
data is relatively easy to collect. In the training process, the dataset of BFE changes in-
duced by mutations of the SKEMPI 2.0 dataset [93] is used as the basic training set, while
next-generation sequencing datasets are added as assistant training sets. The SKEMPI 2.0
contains 7,085 single- and multi-point mutations and 4,169 elements of that in 319 dif-
ferent protein complexes used for the machine learning model training. The mutational
scanning data consists of experimental data of the binding of ACE2 and RBD induced
mutations on ACE2[97] and RBD[98, 3], and the binding of CTC-445.2 and RBD with
mutations on both protein[3].
    Next, the feature generations of protein-protein interaction complexes are performed.
The element-specific algebraic topological analysis on complex structures is implemented
to generate topological bar codes [99, 100, 101, 4]. In addition, biochemistry and bio-
physics features such as Coulomb interactions, surface areas, electrostatics, et al., are
combined with topological features [102].
3.2.3.1  Generation of Topological Features for PPIs
Algebraic topology [100, 101] has had tremendous success in describing biochemical and
biophysical properties [4]. Element-specific and site-specific persistent homology can ef-
fectively simplify the structural complexity of protein-protein complex and extract the ab-
stract properties of the vital biological information in PPIs [40, 41]. The algebraic topologi-
cal analysis on PPIs is constructed based on a series of atom subsets of complex structures,
which are atoms of the mutation sites, Am , atoms in the neighborhood of the mutation site
within a cut-off distance r, Amn (r), antibody atoms within r of the binding site, AAb (r),
antigen atoms within r of the binding site, AAg (r), and atoms in the system that has atoms
of element type of {C, N, O}, Aele (E). Additionally, a bipartition graph is introduced to
describe the antibody and antigen in PPIs. Then, molecular atoms construct point clouds
                                               63


for simplicial complex, which is a finite collection of sets of linear combinations of points.
We apply the Vietoris-Rips (VR) complex for dimension 0 topology, and alpha complex
for point cloud of dimensions 1 and 2 topology [4]. Overall, element-specific and site-
specific persistent homology is devised to capture the multiscale topological information
over different scales along a filtration [100] and is important for our machine learning
predictions.
Simplex and simplicial complex Given a set of independent k+1 points U = {u0 , u1 , ..., uk }
in RN , the convex combination is a point u = ki=0 αi ui , where i αi = 1 and αi ≥ 0. The
                                                           P                         P
convex hull of U is the collection of convex combinations of U , and a k-simplex σ is the
convex hull of k+1 independent points U . For example, a 0-simplex is a point, a 1-simplex
is an edge, a 2-simplex is a triangle, and a 3-simplex is a tetrahedron. A proper m-face of
the k-simplex is a subset of the k + 1 vertices of a k-simplex with m + 1 vertices forms a
convex hull in a lower dimension and m < k. The boundary of a k-simplex σ is defined
as a sum of all its (k−1)–faces as
                                              X k
                                       ∂k σ =      (−1)i ⟨u0 , ..., ûi , ..., uk ⟩,                (3.5)
                                               i=1
where ⟨u0 , ..., ûi , ..., uk ⟩ is a convex hull formed by vertices of σ excluding ui . A simpli-
cial complex denotes by K is a collection of finitely many simplices forms a simplicial
complex. Thus, faces of any simplex in K are also simplices in K, and intersections of
any 2 simplices are only faces of both or an empty set. A k-simplex σ = ⟨ui0 , ..., uik ⟩ is in
Vietoris–Rips complex Rr (U ) if and only if B(uij , r) ∩ B(uij′ , r) ̸= ∅ for j, j ′ ∈ [0, k] and is
in alpha complex Ar (U ) if and only if ∩uij ∈σ B(uij , r) ̸= ∅.
Homology For a simplicial complex K, a k-chain ck of K is a formal sum of the k-
simplices in K defined as ck =                αi σi , where σi is the k-simplices and αi is coefficients.
                                           P
αi can be in different fields such as R, Q, and Z. Typically, αi is chosen to be Z2 , which
is {−1, 0, 1} and forms an Abelian group Ck (K, Z2 ). Then, the boundary operator can be
                                                        64


extended to a k-chain ck as
                                                 X
                                        ∂k ck =      αi ∂k σi ,                         (3.6)
such that ∂k : Ck → Ck−1 and satisfies ∂k−1 ∂k = ∅, follows from that boundaries are
boundaryless. The chain complex is defined as a sequence of complexes by boundary
maps is called a chain complex
                    ∂i+1       ∂              ∂i−1      ∂           ∂           ∂
              · · · −→ Ci (K) −→ i
                                   Ci−1 (K) −→ · · · −→   2
                                                            C1 (K) −→ 1
                                                                        C0 (K) −→ 0
                                                                                    0.  (3.7)
The k-homology group is the quotient group defined by taking k-cycle group module of
k-boundary group as
                                           Hk = Zk /Bk ,                                (3.8)
where Hk is the k-homology group, and k-cycle group Zk and the k-boundary group Bk
are the subgroups of Ck defined as,
                             Zk = ker ∂k = {c ∈ Ck        | ∂k c = ∅},
                                                                                        (3.9)
                            Bk = im ∂k+1 = {∂k+1 c | c ∈ Ck+1 }
The Betti numbers are defined by the ranks of kth homology group Hk as βk = rank(Hk ).
β0 reflects the number of connected components, β1 reflects the number of loops, and β2
reflects the number of cavities.
Filtration and Persistent Homology A filtration of a topology space K is a nested se-
quence of K such that
                                 ∅ = K0 ⊆ K1 ⊆ · · · ⊆ Km = K.                         (3.10)
Then, a sequence of chain complexes and a homology sequence are constructed on the
filtration. The pth persistent of kth homology group of Kt are defined as
                                                         \
                                    Hkt,p = Zkt /(Bkt+p     Zkt ),                     (3.11)
and the Betti numbers βkt,p = rank(Hkt,p ). These persistent Betti numbers are applied to
represent topological fingerprints.
                                                 65


3.2.3.2  Generation of Residue-level Features for PPIs
Mutation site neighborhood amino acid composition Neighbor residues are the residues
within 10 Å of the mutation site. Distances between residues are calculated based on
residue Cα atoms. Six categories of amino acid residues are counted, which are hydropho-
bic, polar, positively charged, negatively charged, special cases, and pharmacophore changes.
The count and percentage of the 6 amino acid groups in the neighbor site are regrading as
the environment composition features of the mutation site. The sum, average, and vari-
ance of residue volumes, surface areas, weights, and hydropathy scores are used but only
the sum of charges is included.
pKa shifts The pKa values are calculated by the PROPKA software [103], namely the
values of 7 ionizable amino acids, namely, ASP, GLU, ARG, LYS, HIS, CYS, and TYR. The
maximum, minimum, sum, the sum of absolute values, and the minimum of the absolute
value of total pKa shifts are calculated. We also consider the difference of pKa values
between a wild type and its mutant. Additionally, the sum and the sum of the absolute
value of pKa shifts based on ionizable amino acid groups are included.
Position-specific scoring matrix (PSSM) Features are computed from the conservation
scores in the position-specific scoring matrix of the mutation site for the wild type and the
mutant as well as their difference. The conservation scores are generated by PSI-BLAST
[104].
Secondary structure The SPIDER2 software is used to compute the probability scores
for residue torsion angle and residues being in a coil, alpha helix, and beta strand based
on the sequences for the wild type and the mutant [105].
                                             66


3.2.3.3   Generation of Atom-level Features for PPIs
Seven groups of atom types, including C, N, O, S, H, all heavy atoms, and all atoms,
are considered when generating the element-type features. Meanwhile, other three atom
types, i.e., mutation site atoms, all heavy atoms, and all atoms, are used when generating
the general atom-level features.
Surface areas Atom-level solvent excluded surface areas are computed by ESES [106].
Partial changes Partial change of each atom is generated by pdb2pqr software [107]
using the Amber force field [108] for wild type and CHARMM force field [109] for mutant.
The sum of the partial charges and the sum of absolute values of partial charges for each
atomic group are collected.
Atomic pairwise interaction interactions Coulomb energy of the ith single atom is cal-
culated as the sum of pairwise coulomb energy with every other atom as
                                                X         qi q j
                                           Ci =        ke        ,                   (3.12)
                                                j,j̸=i
                                                           rij
where ke is the Coulomb’s constant, rij is the distance of ith atom to jth atom, and qi is
the charge of ith atom. The van der Waals energy of the ith atom is modeled as the sum
of pairwise Lennard-Jones potentials with other atoms as
                                  X h ri + rj 12                  ri + rj 6 i
                             Vi =        ϵ                −2                    ,    (3.13)
                                  j,j̸=i
                                              rij                     rij
where ϵ is the depth of the potential well, and ri is van der Waals radii.
    In atomic pairwise interaction, 5 groups (C, N, O, S, and all heavy atoms) are counted
both for Coulomb interaction energy and van der Waals interaction energy.
Electrostatic solvation free energy Electrostatic solvation free energy of each atom is
calculated using the Poisson-Boltzmann equation via MIBPB [110] and are summed up
by atom groups.
                                                  67


3.2.4   Models for the Binding Free Energy Change Prediction of Protein-protein Inter-
        action on SARS-CoV-2
3.2.4.1   TopNet Model
In this section, we illustrate the construction of a topology-based network (TopNet) model
for the BFE change prediction of protein-protein interactions (PPIs) on SARS-CoV-2 stud-
ies. These approaches have been widely applied in studying protein-ligand and protein-
protein binding free energy predictions [41, 102]. Firstly, one ensemble method, gradient
boosting decision tree (GBDT), is studied as baselines in comparison to deep neural net-
work methods. The ensemble methods naturally handle correlation between descriptors
and are robust to redundant features. Therefore, they usually do not depend on a sophisti-
cated feature selection procedure and a complicated grid search of hyper-parameters. The
implemented GBDT is a function from the scikit-learn package (version 0.22.2.post1)[111].
The number of estimators and the learning is optimized for ensemble methods as 20000
and 0.01, respectively. For each set, 10 runs (with different random seeds) were done and
the average result is reported in this work. Considering a large number of features, the
maximum number of features to consider is set to the square root of the given descriptor
length for GBDT methods to accelerate the training process. The parameter setting shows
that the performance of the average of sufficient runs is decent.
    A neural network is a network of neurons that maps an input feature layer to an out-
put layer. The neural network simulates a biological brain solves problems with numer-
ous neuron units by backpropagation to update weights on each layer. To reveal the
facts of input features at different levels and abstract more properties, one can construct
more layers and more neurons in each layer, which is known as a deep neural network.
Optimization methods for feedforward neural networks and dropout methods are ap-
plied to prevent overfitting. In 10-fold cross validations, the neural network model has a
slightly better performance than the GBDT model, where Pearson correlations for these
algorithms are 0.864 and 0.838 and root mean square errors are 1.019 kcal/mol and 1.063
                                               68


kcal/mol, respectively. Thus, we applied the deep neural network for predictions, vali-
dation, and comparison.
    Deep learning algorithms A deep neural network is a neural network methods with
multi-layers (hidden layer) of neurons between the input and output layers. In each layer,
the single neuron gets fully connecting with the neurons in next layer. It should be pre-
serve the consistency of all labels when applying the model for mutation-induced BFE
change predictions. The loss function is constructed as following:
                                           N
                                        1X                         2
              argmin L(W, b) = argmin          yi − f (xi ; {W, b}) + λ∥W ∥2        (3.14)
                W,b               W,b   2 i=1
where N is the number of samples, f is a function of the feature vector xi parameterized
by a weight vector W and bias term b, and λ represents a penalty constant.
    Optimization The backpropagation is applied to evaluated the loss function start from
the output layer and propagates backward through the network structure to update the
weight vector W and bias term b. According to that the gradient calculation is required,
we apply the stochastic gradient descent method with momentum which only evaluates
a small part of training data and can be considered as calculating exponentially weighted
averages, which is given as
                                Vi = βVi−1 + η∇Wi L(Wi , bi )
                                                                                    (3.15)
                                Wi+1 = Wi − Vi ,
where Wi is the parameters in the network, L(Wi , bi ) is the objective function, η is the
learning rate, X and y are the input and target of the training set, and β ∈ [0, 1] is a
scalar coefficient for the momentum term. The momentum term involved accelerates the
converging speed.
    Dropout Fully connected layers possess a large number of degrees of freedom. This
can easily cause an over-fitting issue, while the dropout technique is an easy way of pre-
venting network over-fitting.[112] In the training process, hidden units are randomly set
                                              69


zero values to their connected neurons in the next layer. Suppose that a percentage of
neurons at a certain layer is chosen to be dropped during training. The number of com-
puted neurons of this layer is equal to the neuron number multiplied by a coefficient such
as 1-p, where p is the dropout rate. Then, in the testing process, the output of these layers
is computed by randomly dropouts the same rate of neurons, to approximate the network
in each training step.
3.2.4.2  TopNetmAb Model
In this section, the TopNet model trained with additional experimental data was intro-
duced to predict mAb binding free energy changes [99]. Such a model is called Top-
NetmAb model. Persistent homology is the main workhorse for TopNetmAb, but auxil-
iary features inherited from our earlier TopNetTree [40] are utilized. The detailed descrip-
tions of dataset and machine learning model are found in the literature [41, 22, 99] and
are available at TopNetmAb.
3.2.5   Other Models
As mentioned above, we constructed a TopNet model for the BFE change prediction of
protein-protein interactions (PPIs) on SARS-CoV-2 studies. A topology-based GBT model
(TopBGT) is also developed in the present work by replacing Net in the TopNet model
with GBT. Both TopNet and TopGBT include a set of auxiliary features inherited from our
earlier TopNetTree [40] and TopNetmAb [99] to enhance their performance.
    Additionally, to evaluate the performance of persistent Laplacian (Lap) for PPIs, we
construct persistent Laplacian-based GBT (LapGBT) and persistent Laplacian-based deep
neural network (LapNet). Note that unlike TopNet and TopGBT, LapGBT and LapNet
employ only persistent Laplacian features extracted from protein structures. Therefore,
their performance depends purely on persistent Laplacian.
                                             70


   Moreover, TopLapGBT and TopLapNet are constructed by adding persistent Lapla-
cian features to TopGBT and TopNet, respectively. Furthermore, the consensus of GBT
and Net predictions are also used for validations, denoted as TopNetGBT and LapNet-
GBT, respectively. Finally, the consensus of TopLapNet and TopLapGBT is called TopLap-
NetGBT.
                                             71


                                        CHAPTER 4
                  APPLICATIONS IN TOPOLOGICAL LAPLACIANS
4.1    Persistent Laplacians
Graph theory, a branch of discrete mathematics, concerns the relationship between ob-
jects. These objects can be either simple vertices, i.e., nodes and/or points (zero sim-
plexes), or high-dimensional simplexes. Here, the relationship refers to connectivity with
possible orientations. Graph theory has many branches, such as geometric graph theory,
algebraic graph theory, and topological graph theory. The study of graph theory draws
on many other areas of mathematics, including algebraic topology, knot theory, algebra,
geometry, group theory, combinatorics, etc. For example, algebraic graph theory can be
investigated by using either linear algebra, group theory, or graph invariants. Among
them, the use of learning algebra in graph study leads to spectral graph theory.
    Precursors of the spectral theory have often had a geometric flavor. An interesting
spectral geometry question asked by Mark Kac was “Can one hear the shape of a drum?”
[10]. The Laplace-Beltrami operator on a closed Riemannian manifold has been inten-
sively studied [54]. Additionally, eigenvalues and isoperimetric properties of graphs are
the foundation of the explicit constructions of expander graphs [113]. Moreover, the study
of random walks and rapidly mixing Markov chains utilized the discrete analog of the
Cheeger inequality [114]. The interaction between spectral theory and differential geom-
etry became one of the critical developments [115]. For example, the spectral theory of
the Laplacian on a compact Riemannian manifold is a central object of de Rham-Hodge
theory [54]. Note that the Hodge Laplacian spectrum contains the topological informa-
tion of the underlying manifold. Specifically, the harmonic part of the Hodge Laplacian
spectrum corresponds to topological cycles. Connections between topology and spec-
tral graph theory also play a central role in understanding the connectivity properties
                                             72


of graphs [116, 117, 118, 119]. Similarly, as the topological invariants revealing the con-
nectivity of a topological space, the multiplicity of 0 eigenvalues of a 0-combinatorial
Laplacian matrix is the number of connected components of a graph. Indeed, the num-
ber of q-dimensional holes can also be unveiled from the number of 0 eigenvalues of the
q-combinatorial Laplacian [45, 53, 46, 120]. Nonetheless, spectral graph theory offers ad-
ditional non-harmonic spectral information beyond topological invariants.
    The traditional topology and homology are independent of metrics and coordinates
and thus, retain little geometric information. This obstacle hinders their practical appli-
cability in data analysis. Recently, persistent homology has been introduced to overcome
this difficulty by creating low-dimensional multiscale representations of a given object
of interest [121, 101, 122, 43, 123, 124]. Specifically, a filtration parameter is devised to
induce a family of geometric shapes for a given initial data. Consequently, the study of
the underlying topologies or homology groups of these geometric shapes leads to the
so-called topological persistence. Like the de Rham-Hodge theory which bridges differ-
ential geometry and algebraic topology, persistent homology bridges multiscale analysis
and algebraic topology. Topological persistence is the most important aspect of the pop-
ular topological data analysis (TDA) [125, 126, 127, 128] and has had tremendous success
in computational biology [129, 44] and worldwide competitions in computer-aided drug
design [6].
    Graph theory has been applied in various fields [130]. For example, spectral graph the-
ory is applied to the quantum calculation of π-delocalized systems. The Hückel method,
or Hückel molecular orbital theory, describes the quantum molecular orbitals of π-electrons
in π-delocalized systems in terms of a kind of adjacency matrix that contains atomic con-
nectivity information [131, 132]. Additionally, the Gaussian network model (GNM) [133]
and anisotropic network model (ANM) [134] represent protein Cα atoms as an elastic
mass-and-spring network by graph Laplacians. These approaches were influenced by
the Flory theory of elasticity and the Rouse model [135]. Like traditional topology, tra-
                                              73


ditional graph theory extracts very limited information from data. In our earlier work,
we have proposed multiscale graphs, called multiscale flexibility rigidity index (mFRI),
to describe the multiscale nature of biomolecular interactions [136], such as hydrogen
bonds, electrostatic effects, van der Waals interactions, hydrophilicity, and hydrophobic-
ity. A multiscale spectral graph method has also been proposed as generalized GNM
and generalized ANM [57]. Our essential idea is to create a family of graphs with dif-
ferent characteristic length scales for a given dataset. We have demonstrated that our
multiscale weighted colored graph (MWCG) significantly outperforms traditional spec-
tral graph methods in protein flexibility analysis [137]. More recently, we demonstrate
that our MWCG outperforms other existing approaches in protein-ligand binding scor-
ing, ranking, docking, and screening [138].
    The objective of the present work is to introduce persistent spectral graph as a new
paradigm for the multiscale analysis of the topological invariants and geometric shapes
of high-dimensional datasets. Motivated by the success of persistent homology [44] and
multiscale graphs [138] in dealing with complex biomolecular data, we construct a fam-
ily of spectral graphs induced by a filtration parameter. In the present work, we con-
sider the radius filtration via the Vietoris-Rips complex while other filtration methods
can be implemented as well. As the filtration radius is increased, a family of persistent
q-combinatorial Laplacians are constructed for a given point-cloud dataset. The diago-
nalization of these persistent q-combinatorial Laplacian matrices gives rise to persistent
spectra. It is noted that our harmonic persistent spectra of 0-eigenvalues fully recover the
persistent barcode or persistent diagram of persistent homology. Additional information
is generated from non-harmonic persistent spectra, namely, the non-zero eigenvalues and
associated eigenvectors. In a combination with a simple machine learning algorithm, this
additional spectral information is found to provide a powerful new tool for the quantita-
tive analysis of molecular data.
                                             74


4.1.1    Benzene Structure Analysis
 In the past few years, we have developed a multiscale spectral graph method such as
generalized GNM and generalized ANM [136, 57], to create a family of spectral graphs
with different characteristic length scales for a given dataset. Similarly, in our persis-
tent spectral theory, we can construct a family of spectral graphs induced by a filtration
parameter. Moreover, we can sum over all the multiscale spectral graphs as an accumu-
lated spectral graph. Specifically, a family of Lr+0   0   matrices, as well as the accumulated
combinatorial Laplacian matrices, can be generated via the filtration. By analyzing the
persistent spectra of these matrices, the topological invariants and geometric shapes can
be revealed from the given input point-cloud data.
    The spectra of Lr+0    0 , L̂0 , and Ľ0
                                 r+0        r+0
                                                 mentioned above carry similar information on
how the topological structures of a graph are changed during the filtration. Benzene
molecule (C6 H6 ), a typical aromatic hydrocarbon which is composed of six carbon atoms
bonded in a planar regular hexagon ring with one hydrogen joined with each carbon
atom. It provides a good example to demonstrate the proposed PST. Figure 4.1 illustrates
the filtration of the benzene molecule. Here, we label 6 hydrogen atoms by H1 , H2 , H3 ,
H4 , H5 , and H6 , and the carbon adjacent to the labeled hydrogen atoms are labeled by
C1 , C2 , C3 , C4 , C5 , and C6 , respectively. Figure Figure 4.1 b depicts that when the radius
of the solid sphere reaches 0.54 Å, each carbon atom in the benzene ring is overlapped
with its joined hydrogen atom, resulting in the reduction of β0r+0 to 6. Moreover, once
the radius of solid spheres is larger than 0.70 Å, all the atoms in the benzene molecule
will connect and constitute a single component which gives rise β0r+0 = 1. Furthermore,
we can deduce that the C-C bond length of the benzene ring is about 1.40 Å, and the C-
H bond length is around 1.08 Å, which are the real bond lengths in benzene molecule.
Figure Figure 4.1 c shows that a 1-dimensional hole (1-cycle) is born when the filtration
parameter r increase to 0.70 Å and dead when r = 1.21 Å. In Figures Figure 4.1 b and
Figure 4.1 c, it can be seen that variants of 0-persistent 0-combinatorial Laplacian and 1
                                                   75


-combinatorial Laplacian matrices based on filtration give us the identical β0r+0 and β1r+0
information respectively.
                                   H1
                        H6                     H2
                        H5                     H3
                                   H4
  Figure 4.1: Benzene molecule and its topological changes during the filtration process.
    The C-C bond length of benzene is 1.39 Å, and the C-H bond length is 1.09 Å. Due
to the perfect hexagon structure of the benzene ring, we can calculate all of the distances
between atoms. The shortest and longest distances between carbons and the hydrogen
atoms are 1.09 Å and 3.87 Å. In Figure Figure 4.1a, a total of 10 changes of (λ̃2 )r+0
                                                                                    0  values
is observed at various radii. Table 4.1 lists all the distances between atoms and the values
of radii when the changes of (λ̃2 )r+0
                                   0   occur. It can be seen that the distance between atoms
approximately equals twice of the radius value when a jump of (λ̃2 )r+0 0   occurs. Therefore,
we can detect all the possible distances between atoms with the nonzero spectral infor-
mation. Moreover, in Figure Figure 4.1 b, the values of the smallest nonzero eigenvalues
of Lr+0
     0 , L̂0 , and Ľ0
           r+0       r+0
                         change concurrently.
                                               76


Table 4.1: Distances between atoms in the benzene molecule and the radii when the
changes of (λ̃2 )r+0
                  0    occur (Values increase from left to right).
        Type       C1 -H1 C1 -C2 C2 -H1 C1 -C3 H1 -H2 C1 -C4 C3 -H1 C4 -H1 H1 -H3 H1 -H4
  Distance (Å) 1.09 1.39 2.15 2.41 2.48 2.78 3.39 3.87 4.30                               4.96
        r (Å)       0.54 0.70 1.08 1.21 1.24 1.40 1.70 1.94 2.15                          2.49
Figure 4.2: Persistent spectral analysis of the benzene molecule induced by filtration pa-
rameter r. Blue line, orange line, and green line represent Lr+0      0 , L̂0 , and Ľ0
                                                                            r+0       r+0
                                                                                          respec-
tively. (a) Plot of the smallest non-zero eigenvalues with radius filtration under L0 (blue
                                                                                        r+0
line), L̂r+0
          0  (red line), and Ľr+0
                               0   (green line). Total 10 jumps observed in this plot which rep-
resent 10 possible distances between atoms. (b) Plot of the number of zero eigenvalues
(β0r+0 ) with radius filtration under Lr+0
                                         0 , L̂0 , and Ľ0
                                                r+0        r+0
                                                               (three spectra are superimposed).
When r = 0.00 Å, 12 atoms are disconnected with each other. After r = 0.54 Å, H atoms
and their adjacent C atoms are connected with one another resulting in β0r+0 = 6. With
r keeps growing, all of the atoms are connected with one another and then β0r+0 = 1.
(c) Plot of the number of zero eigenvalues (β1r+0 ) with radius filtration under Lr+0  1 . When
r = 0.70 Å, a 1-cycle created since all of the C atoms are connected and form a hexagon, re-
sulting in β1r+0 = 1. After the radius reached 1.21 Å, the hexagon disappears and β1r+0 = 0.
4.1.2    Fullerene Analysis and Prediction
In 1985 Kroto et all discovered the first structure of C60 [139], which was confirmed by
Kratschmer et al in 1990 [140]. Since then, the quantitative analysis of fullerene molecules
has become an interesting research topic. The understanding of the fullerene structure-
function relationship is important for nanoscience and nanotechnology. Fullerene molecules
are only made of carbon atoms that have various topological shapes, such as the hollow
spheres, ellipsoids, tubes, or rings. Due to the monotony of the atom type and the vari-
ety of geometric shapes, the minor heterogeneity of fullerene structures can be ignored.
                                                  77


The fullerene system offers a moderately large dataset with relatively simple structures.
Therefore, it is suitable for validating new computational methods because every single
change in the spectra is interpretable. The proposed persistent spectral theory, i.e., per-
sistent spectral analysis, is applied to characterize fullerene structures and predict their
stability.
    All the structural data can be downloaded from CCL.NET Webpage. This dataset
gives the coordinates of fullerene carbon atoms. In this section, we will analyze fullerene
structures and predict the heat of formation energy.
4.1.2.1    Fullerene Structure Analysis
The smallest member of the fullerene family is C20 molecule with a dodecahedral cage
structure. Note that 12 pentagons are required to form a closed fullerene structure. Fol-
lowing the Euler’s formula, the number of vertices, edges, and faces on a polygon have
the relationship V − E + F = 2. Therefore, the 20 carbon atoms in the dodecahedral cage
form 30 bonds with the same bond length. The C20 is the only fullerene smaller than C60
that has the molecular symmetry of the full icosahedral point group Ih . C60 is a molecule
that consists of 60 carbon atoms arranged as 12 pentagon rings and 20 hexagon rings.
Unlike C20 , C60 has two types of bonds: 6 : 6 bonds and 6 : 5 bonds. The 6 : 6 bonds
are shorter than 6 : 5 bonds, which can also be considered as “double bond" [141]. C60 is
the most well-know fullerene with geometric symmetry Ih . Since C20 and C60 are highly
symmetrical, they are ideal systems for illustrating the persistent spectral analysis.
    Figure 4.3 (a) illustrates the radius filtration process built on C20 . As the radius in-
creases, the solid balls corresponding to carbon atoms grow, and a sequence of Lr+0   0   ma-
trices can be defined through the overlap relations among the set of balls. At the initial
state (r = 0.00 Å), all of the atoms are isolated from one another. Therefore, Lr+0
                                                                                 0   is a zero
matrix with dimension 20 × 20. Since the C20 molecule has the same bond length which
can be denoted as l(C20 ), once the radius of solid balls is greater than l(C20 ), all of the
                                               78


                           (
                           a)                                            (b)
Figure 4.3: (a) Illustration of filtration built on fullerene C20 . Each carbon atom of C20 is
plotted by its given coordinates, which are associated with an ever-increasing radius r.
The solid balls centered at given coordinates keep growing along with the radius filtration
parameter. (b) The accumulated Lr+0    0    matrix for C20 . For clarity, the diagonal terms are
set to 0.
balls are overlapped, which makes the system a singly connected component. Figure 4.3
(b) depicts the accumulated Lr+0  0   for C20 . For C60 , the accumulated Lr+00   is described in
Figure 4.4 (a). Figure 4.4 (b)-(f) are the plots of Lr+0 0   under different filtration r values.
The blue cell located at the ith row and jth column means the balls centered at atom i and
atom j connected with each other, i.e., a 1-simplex formed with its vertex to be i and j.
When the radius filtration increases, more and more bluer cells are created. In Figure 4.4
(f), the color of cells, except the cells located in the diagonal, turns to blue, which means
all of the carbon atoms are connected with one another at r = 3.6 Å. For clarity, we set the
diagonal terms to 0.
     In Figure 4.5, the blue solid line represents C20 properties and the dash orange line
represents C60 properties. For Figure Figure 4.5 a, the blue line drops at r = 0.72 Å,
which means the bond length of C20 is around 1.44 Å. The orange line drops at r =
0.68 Å and 0.72 Å, which means the “double bond" length of C60 is around 1.36 Å and the
                                                79


    1                            0       1                           1
                                 -
                                 60
    20                          -1
                                 20     2
                                        0                           20
                                -1
                                 80
    4
    0                                   4
                                        0                           40
                                -2
                                 40
                                -30
                                  0
    6
    0                                   6
                                        0                           60
      1     20     40     60              1    2
                                               0      4
                                                      0     60         1     2
                                                                             0     40    60
                (
                a)                                 (
                                                   b)                            (
                                                                                 c)
     1                                  1                            1
    20                                 20                           20
    40                                 40                           40
    60                                 60                           60
      1     20      4
                    0     60               1   2
                                               0      40     60        1     20     40   60
                 (
                 d)                                (
                                                   e)                             (
                                                                                  f)
Figure 4.4: Illustration of persistent multiscale analysis of C60 in terms of 0-combinatorial
Laplacian matrices (b)-(f) and their accumulated matrix (a) induced by filtration. As the
value of filtration parameter r increases, high-dimensional simplicial complex forms and
grows accordingly. (b), (c), (d), (e), and (d) demonstrate the 0-combinatorial Laplacian
matrices (i.e., the connectivity among C60 atoms) at filtration r = 1.0 Å, 1.5 Å, 2.5 Å, 3.0 Å,
and 3.6 Å, respectively. The blue cell located at the ith row and jth column represents the
balls centered at atom i and atom j connected with each other. For clarity, the diagonal
terms are set to 0 in all plots.
6 : 5 bond length is around 1.44 Å. Moreover, the total number of “double bond" is 30,
yielding β0r+0 = 30 when the radius of solid balls is over 0.68 Å. In conclusion, one can
deduce the number of different types of bonds as well as the bond length information
from the number of zero eigenvalues (i.e., β0r+0 ) under the radius filtration. Furthermore,
the geometric information can also be derived from the plot of (λ̃2 )r+0    0 . Each jump in
Figure Figure 4.5 d at a specific radius represents the change of geometric and topological
structure. The smallest non-zero eigenvalue (λ̃2 )r+0   0  of Lr+0
                                                                0  matrices for C20 changes 5
times in Figure Figure 4.5 d, which means C20 has 5 different distances between carbon
atoms. Furthermore, as (λ̃2 )r+0
                               0    of C20 keeps increasing, the smallest vertex connectivity of
the connected subgraph continues growing and the topological structure becomes steady.
                                                80


As can be seen in the right-corner chart of Figure 4.3, the carbon atoms will finally grow
to a solid object with a steady topological structure.
    Figure Figure 4.5 b depicts the changes of Betti 1 value β1r+0 (i.e., the number of zero
eigenvalues for Lr+0
                   1 ) under the filtration r. Since C20 has 12 pentagonal rings, β1
                                                                                        r+0
                                                                                            jumps
to 11 when radius r equals to the half of the bond length of l(C20 ). These eleven 1-cycles
disappear at r = 1.17 Å. There are 12 pentagons and 20 hexagons in C60 , which results
in β1r+0 = 12 at r = 0.72 Å, β1r+0 = 31 at r = 1.17 Å. All of the pentagons and hexagons
disappear at r = 1.22 Å.
    As the filtration process, even more structure information can be derived from the
number of zero eigenvalues of Lr+0   2  (i.e., β2r+0 ) in Figure Figure 4.5 c. For C20 , β2r+0 = 1
when r = 1.17 Å, which corresponds to the void structure in the center of the dodecahe-
dral cage. The void disappears at r = 1.65 Å since a solid structure is generated at this
point. For fullerene C60 , 20 hexagonal cavities and a center void exist from 1.12 Å to 1.40 Å
yielding β2r+0 = 21. As the filtration goes, hexagonal cavities disappear which results β2r+0
decrease to 1. The central void keeps alive until a solid block is formed at r = 3.03 Å. In a
nutshell, we can deduce the number of different types of bonds, the bond length, and the
topological invariants from the present persistent spectral analysis.
4.1.2.2   Fullerene stability prediction
Having shown that the detailed fullerene structural information can be extracted into
the spectra of Lr+0
                  q , we further illustrate that fullerene functions can be predicted from
their structures by using our persistent spectral theory in this section. Similar structure-
function analysis has been carried out by using other methods [136, 142, 143]. For small
fullerene molecule series C20 to C60 , with the increase in the number of atoms, the ground-
state heat of formation energies decrease [144, 1]. The left chart in Figure 4.6 describes
this phenomenon. Similar patterns can also be found in the total energy (STO-3G/SCF
at MM3) per atom and the average binding energy of C2n . To analyze these patterns,
                                                81


Figure 4.5: Illustration of persistent spectral analysis of C20 and C60 using the spectra of
Lr+0
  q   (q = 1, 2 and 3). (a) The number of zero eigenvalues of Lr+0   0 , i.e., β0 , under radius
                                                                                r+0
filtration. (b) The number of zero eigenvalues of Lr+0 1 , i.e., β1
                                                                  r+0
                                                                      under radius filtration. (c)
The number of zero eigenvalues of L2 , i.e., β2 under radius filtration. (d) The smallest
                                         r+0       r+0
non-zero eigenvalue (λ̃2 )r+0
                            0  under radius filtration. The radius grid spacing is 0.01 Å.
many theories have been proposed. Isolated pentagon rule assumes that the most stable
fullerene molecules are those in which all the pentagons are isolated. Zhang et al. [1]
stated that fullerene stability is related to the ratio between the number of pentagons and
                                                82


the number of carbon atoms. Xia and Wei [142] proposed that the stability of fullerene de-
pends on the average number of hexagons per atom. However, these theories all focused
on the pentagon and hexagon information. More specifically, they use topological infor-
mation to reveal the stability of fullerene. In contrast, we believe that the non-harmonic
persistent spectra can also model the structure-function relationship of fullerenes. We hy-
pothesize that the non-harmonic persistent spectra of Lr+0 0   matrices are powerful enough
to model the stability of fullerene molecules. To verify our hypothesis, we compute the
summation, mean, maximal, standard deviation, variance of its eigenvalues, and (λ̃2 )r+0    0
of the persistent spectra of Lr+0
                               0  over various filtration radii r. We depict a plot with the
horizontal axis represents radius r and the vertical axis represents the particular spectrum
value, which is actually the same as Figure 4.5. Then we define the area under the plot of
spectra with a negative sign as
                                               X
                                      Aα = −       Λαi δr,                                (4.1)
                                               i=1
where δr is the radius grid spacing, in Figure 4.5, δr = 0.01 Å. Here, α = Sum, Avg, Max,
Std, Var, Sec is the type index and thus, Λαi represent the summation, mean, maximal,
standard deviation, variance, and the smallest non-zero eigenvalue (λ̃2 )r+00   of Lr+0
                                                                                    0   at i-th
radius step, respectively. The right chart in Figure 4.6 describes the area under the plot
of spectra and closely resembles that of the heat of formation energy. We can see that
generally the left chart and the middle chart show the same pattern. The integration of
(λ̃2 )r+0
      0   decreases as the number of carbon atoms increases. However, the structural data
we used might not be the same ground-state data as in Ref. [1], which results in C36 do not
match the corresponding energy perfectly. Limited by the availability of the ground-state
structural data, we are not able to analyze the full set of the fullerene family.
     To quantitatively validate our model, we apply one of the simplest machine learning
algorithms, linear least-squares method, to predict the heat of formation energy. The
                                              83


Figure 4.6: Persistent spectral analysis and prediction of fullerene heat formation energies.
Left chart: the heat of formation energies of fullerenes obtained from quantum calcula-
tions [1]. Middle chart: PST model using the area under the plot of (λ̃2 )r+0  0 . Right chart:
Correlation between the quantum calculation and the PST prediction. The highest corre-
lation coefficient form the least-squares fitting is 0.986 with the type index of α = Max.
Pearson correlation coefficient is defined as
                                         XN
                                             (Aiα − Āα )(Ei − Ē)
                                         i=1
                            Ccα = "                                   # 21                 (4.2)
                                    X N                XN
                                         (Aiα − Āα )2     (Ei − Ē)2
                                     i=1               i=1
where Aiα represents the theoretically predicted energy of the i-th fullerene molecule, Ei
represents the heat of formation energy of the i-th fullerene molecule, and Āα and Ē are
the corresponding mean values. When α = Max, the Pearson correlation coefficient is
around 0.986. The right chart of Figure 4.6 plots the correlation between predicted ener-
gies and the heat of formation energy of the fullerene molecules computed from quantum
mechanics [1]. These results agree very well.
Table 4.2: The heat of formation energy of fullerenes [1] and its corresponding predicted
energies with α = Max. The unit is EV/atom.
    Fullerene type                C20       C24     C26      C30    C32    C36  C50    C60
    Heat of formation energy 1.180 1.050 0.989 0.850 0.781 0.706 0.509 0.401
    Predicted energy             1.138 1.050 0.964 0.821 0.857 0.766 0.474 0.391
    The right chart of Figure 4.6 illustrates the fitting results under different type index α.
Table 4.3 lists the correlation coefficient under different type index α. The highest corre-
                                                 84


lation coefficient is close to unity (0.986) obtained with α = Max. The lowest correlation
coefficient is 0.942 with α = Sum. We can see that all the correlation coefficients are close
to unity, which verifies our hypothesis that the non-harmonic spectra of Lr+0    0   have the
capacity of modeling the stability of fullerene molecules. Although we ignore the topo-
logical information (Betti numbers), our persistent spectral theory still works extremely
well only with non-harmonic spectra, which means our persistent spectral theory is a
powerful tool for quantitative data analysis and prediction.
            Table 4.3: The correlation coefficients under different type index α.
        Type index                  Sum       Avg     Max       Std      Var      Sec
        Correlation coefficient     0.942    0.985    0.986    0.969    0.977    0.981
4.1.3   Protein flexibility analysis
As clarified earlier, the number of zero eigenvalues of p-persistent q-Laplacian matrix (p-
persistent qth Betti number) can also be derived from persistent homology. Persistent
homology has been used to model fullerene stability [142]. In this section, we further il-
lustrate the applicability of present persistent spectral theory by a case that non-harmonic
persistent spectra offer a unique theoretical model whereas it may be difficult to come up
with a suitable persistent homology model for this problem.
    The protein flexibility is known to correlate with a wide variety of protein functions.
It can be modeled by the beta factors or B-factors, which are also called Debye-Waller
factors. B-factors are a measure of the atomic mean-square displacement or uncertainty
in the X-ray scattering structure determination. Therefore, understanding the protein
structure, flexibility, and function via the accurate protein B-factor prediction is a vital
task in computational biophysics [145]. Over the past few years, quite many methods
are developed to predict protein B-factors, such as GNM, [133], ANM [134], FRI, [146,
147] and MWCG [57, 145]. However, all of the aforementioned methods are based on
a particular matrix derived from the graph network which is constructed using alpha
                                               85


carbon as nodes and connections between nodes as edges. In this section, we apply our
persistent spectral theory to create richer geometric information in B-factor prediction.
    To illustrate our method, we consider protein 2Y7L whose total number of residues is
N = 319. In this work, we employ the coarse-grained Cα representation of 2Y7L. There-
fore, 319 particles are taken into consideration in protein 2Y7L. Similarly, like in the previ-
ous application of fullerene structure analysis, we treat each Cα atom as a 0-simplex at the
initial setup and assign it a solid ball with a radius of r. By varying the filtration param-
eter r, we can obtain a family of Lr+0    0 . For each matrix L0 , its corresponding ordered
                                                                         r+0
spectrum is given by
                                  (λ1 )0r+0 , (λ2 )r+0               r+0
                                                   0 , · · · , (λN )0 .
Suppose the number of zero eigenvalues is m, then, we have β0r+0 = m. Since L0r+0 is
symmetric, then eigenvectors of Lr+0    0    corresponding to different eigenvalues must be or-
thogonal to each other. The Moore-Penrose inverse of Lr+0             0   can be calculated by the non-
harmonic spectra of Lr+0 0 :
                                           N
                                −1
                                          X          1           r+0       r+0 T
                          (Lr+0
                            0 )    =                   r+0 [(uk )0 ((uk )0 ) ],
                                       k=m+1
                                                 (λ  )
                                                    k 0
where T is the transpose and (uk )r+0  0     is the kth eigenvector of Lr+0     0 . The modeling of ith
B-factor of 2Y7L at filtration parameter r can be expressed as
                                                 −1
                                Bir = (Lr+0 0 )ii , ∀i = 1, 2, · · · , N,
and the final model of ith B-factor of 2Y7L is given by
                            BiPST =
                                    X
                                           wr Bir + w0 , ∀i = 1, 2, · · · , N,
                                       r
where wr and w0 are fitting parameters which can be derived by linearly fitting B-factors
from experimental data B Exp . Consider the filtration radius from 2 to 12 with the grid
spacing of 1, then totally 11 different Lr+0  0   are created. By calculating all the non-harmonic
spectra together with their eigenvectors, 11 Moore-Penrose inverse matrices (Lr+0            0 )
                                                                                                 −1
                                                                                                    can
                                                     86


be constructed. Therefore, the predicted ith B-factor is
                                                    12
                                         BiPST
                                                   X
                                               =        wr Bir + w0 .
                                                   r=2
The specific values of wr and w0 can be found in Table A.16 and Table A.17 of Appendix
Section A.2. Figure 4.7 (c) shows that the prediction B-factors are in an excellent agree-
ment with the experimental B-factors of protein 2Y7L. The Pearson correlation coefficient
is 0.925 1 .
                                        (
                                        a)                         (
                                                                   b)
                                                      (c
                                                       )
Figure 4.7: Illustration of persistent spectral prediction of protein B-factors. (a) Plot of the
secondary structure of protein 2Y7L. (b) Accumulated persistent Laplacian matrix (For
clarity, the diagonal terms are set to 0.). Note that the accumulated persistent Laplacian
matrix maps out the detailed distance between each pair of residues. (c) Comparison of
experimental B-factors and those predicted by PST for protein 2Y7L.
    This example shows that our persistent spectral theory can be used beyond the persis-
tent homology analysis. The number of zero eigenvalues of 0-persistent q-combinatorial
Laplacian matrices fully recover the persistent barcode or persistent diagram of persis-
tent homology. Additional spectral information from non-harmonic persistent spectra
and persistent eigenvectors provides valuable information for data modeling, analysis,
and prediction.
    1
      We carry out feature scaling to make sure all Bir are on a similar scale.
                                                      87


4.1.4   Discussion and Conclusion
Spectral graph theory is a powerful tool for data analysis due to its ability to extract ge-
ometric and topological information. However, its performance can be quite limited for
various reasons. One of them is that the current spectral graph theory does not pro-
vide a multiscale analysis. Motivated by persistent homology and multiscale graphs, we
introduce persistent spectral theory as a unified paradigm to unveil both topological per-
sistence and geometric shape from high-dimensional datasets.
    For a point set V ⊂ Rn without additional structures, we construct a filtration using
an (n − 1)-sphere of a varying radius r centered at each point. A series of persistent com-
binatorial Laplacian matrices are induced by the filtration. It is noted that our harmonic
persistent spectra (i.e., zero eigenvalues) fully recover the persistent barcode or persistent
diagram of persistent homology. Specifically, the numbers of zero eigenvalues of persis-
tent q-combinatorial Laplacian matrices are the q-dimensional persistent Betti numbers
for the same filtration given filtration. However, additional valuable spectral information
is generated from the non-harmonic persistent spectra. In this work, in addition to per-
sistent Betti numbers and the smallest non-zero eigenvalues, five statistic values, namely,
sum, mean, maximum, standard deviation, and variance, are also constructed for data
analysis. We use a few simple two-dimensional (2D) and three-dimensional (3D) struc-
tures to carry out the proof of principle analysis of the persistent spectral theory. The
detailed structural information can be incorporated into the persistent spectra of. For in-
stant, for the benzene molecule, the approximate C-C bond and C-H bond length can be
intuitively read from the plot of the 0-dimensional persistent Betti numbers. Moreover,
persistent spectral theory also has the capacity to accurately predict the heat of forma-
tion energy of small fullerene molecules. We use the area under the plot of the persistent
spectra to model fullerene stability and apply the linear least-squares method to fit our
prediction with the heat of formation energy. The resulting correlation coefficient is close
to 1, which shows that our persistent spectral theory has an excellent performance on
                                               88


molecular data. Furthermore, we have applied our persistent spectral theory to the pro-
tein B-factor prediction. In this case, persistent homology does not give a straightforward
model. This example shows that the additional non-harmonic persistent spectral infor-
mation provides a powerful tool for dealing with molecular data.
    It is pointed out that the proposed persistent spectral analysis can be paired with
advanced machine learning algorithms, including various deep learning methods, for
a wide variety of applications in data science. In particular, the further construction of
element-specific persistent spectral theory and its application to protein-ligand binding
affinity prediction and computer-aided drug design will be reported elsewhere.
4.2     Persistent Path Laplacian
Recent years witness the emergence of a variety of advanced mathematical tools in topo-
logical data analysis (TDA) [148]. As the main workhorse of TDA, persistent homology
(PH) [100, 43, 122, 101] pioneered a new branch in algebraic topology, offering a power-
ful tool to decode the topological structures of data during filtration in terms of persistent
Betti numbers. Persistent homology has had tremendous success in many areas of science
and technology, such as biology [4], chemistry [5], drug discovery [6], 3D shape analysis
[7], etc.
    Inspired by the success of PH, other mathematical tools have been given due atten-
tion. One of them is de Rham-Hodge theory in differential geometry, which uses the
differential forms to represent the cohomology of an oriented closed Riemannian mani-
fold with boundary in terms of a topological Laplacian, namely Hodge Laplacian [8]. The
de Rham-Hodge theory has been applied to computational biology [55], graphic [149],
and robotics [150]. However, like homology, the de Rham-Hodge theory does not offer
an in-depth analysis of data, which is a famous problem in spectral geometry [10]. To
overcome this drawback, the evolutionary de Rham-Hodge theory [9] was introduced in
terms of persistent Hodge Laplacian to offer a multiscale analysis of the de Rham-Hodge
                                               89


theory. Defined on a family of evolutionary manifolds, the evolutionary de Rham-Hodge
theory gives a new answer to, or at least reopens, the famous 55-years old question “can
one hear the shape of a drum". [10] The persistent Hodge Laplacian captures both the
topological persistence and the homotopic shape evolution of data during filtration.
    Nevertheless, the evolutionary de Rham-Hodge theory is set up on Riemannian man-
ifolds, which may be computationally demanding for large datasets. Hence, a similar
multiscaled-based topological Laplacian, called persistent spectral graph (PSG) [11], was
proposed by introducing a filtration to combinatorial graph Laplacians. PSG, aka persis-
tent Laplacian (PL) [151], extends persistent homology to non-harmonic analysis of data,
showing much advantage in sophisticated applications [152, 153]. Dealing with point
cloud data instead of manifolds, PL encodes a point cloud to a family of simplicial com-
plexes generated from filtration and analyzes both harmonic and non-harmonic spectra.
It is worthy to notice that the harmonic spectra from the null spaces of PLs reveal the
same topological persistence like that of persistent homology, whereas, the non-harmonic
spectra of PLs capture the homotopic shape evolution of data during the filtration. Mean-
while, open-source software called HERMES [154] was developed for the simultaneous
topological and geometric analysis of data. However, like persistent homology, PSG treats
all data points equally. That is to say, each point does not carry any labeled information
such as the type, mass, color, etc. Therefore, an extension of PSG, called persistent sheaf
Laplacian (PSL), was proposed to generalize cellular sheaves [155, 156] for the multiscale
analysis of point cloud data with attached labeled information [157]. PSL is also a topo-
logical Laplacian that carries topological information in its null space but tracks homo-
topic shape evolution during filtration. Another interesting development is the persistent
Dirac Laplacian (PDL) by Ameneyro, Maroulas, and Siopsis [158]. PDL offers an efficient
quantum computation of persistent Betti numbers across different scales. These new ap-
proaches have great potentials to deal with complex data in science and engineering.
    It is noticed that the aforementioned homologies and topological Laplacians are in-
                                              90


sensitive to asymmetry or directed relations, which limits their representational power in
encoding structures that have directional information. For example, in gene regulation
data, the directions of gene regulations are indicated by arrowheads or perpendicular
edges in systems biology [159]. Therefore, a technique that can deal with directed graphs
(digraphs) is of vital importance to inferring gene regulation relationships. Notably, the
path homology [12] proposed by Grigor’yan, Lin, Muranov, and Yau provides a powerful
tool to analyze datasets with asymmetric structures using the path complex. Particular
cases of homologies of digraphs and their path cohomology were also discussed [12, 60].
The notion of path homology of digraphs has a richer mathematical structure than the
earlier homology and Laplacian, opening new directions for both pure and applied math-
ematics. For example, path homology theory was extended to various objects such as
quivers, multigraphs, digraphs pairs, cylinder, cone, hypergraphs, etc. [160, 161, 162]
Path homology has drawn much attention from researchers in the TDA community. To
encode richer information, Chowdhury and Mémoli extended path homology to a persis-
tent framework on a directed network [13]. Wang, Ren, and Wu constructed a weighted
path homology for weight digraphs and proved a persistent version of a Künneth-type
formula for joins of weighted digraphs [163]. Recently, Dey, Li, and Wang have designed
an efficient algorithm for 1-dimensional persistent path homology [164], which is useful
in real applications.
    Similar to persistent homology, persistent path homology cannot track the homotopic
shape evolution of data during filtration. To overcome this limitation, we introduce path
Laplacian as a new topological Laplacian to analyze the spectral geometry of data, in ad-
dition to its topology. Moreover, we introduce a filtration to path Laplacian to obtain a
persistent path Laplacian (PPL), a new framework that captures both the topological per-
sistence and shape evolution of directed graphs and networks. By varying the filtration
parameter, one can construct a series of digraphs, which result in a family of persistent
path Laplacian matrices. The harmonic spectra of the persistent path Laplacian recover
                                             91


all the topological invariants of the digraphs, while the non-harmonic spectra provide ad-
ditional geometric information, which can distinguish two systems when they are homo-
topy but geometrically different. PPL has potential applications in science, engineering,
industry, and technology. This work is organized as follows: Section 2 reviews the nec-
essary background on path homology. Section 3 describes path Laplacian and persistent
path Laplacian. Detailed PPL matrix constructions are illustrated with various examples
for the interested readers in Section 3 and Section 4.
4.2.1   Constructions of Persistent Path Laplacian for Tetra and Pyramid
Figure 4.8: Illustration of filtration on a tetrahedron. Here, 1, 2, 3, and 4 represent four
elementary 0-paths e1 , e2 , e3 , and e4 . The top panel√is a tetrahedron that has edge lengths
|e12 | = |e32 | = |e24 | = 1 and |e13 | = |e14 | = |e34√| = 2. The
                                                                √ bottom panel is a tetrahedron
that has edge lengths |e32 | = |e24 | = 1, |e34 | = 2, |e12 | = 3, and |e13 | = |e14 | = 2.
     One can get both abstract information (revealed by Betti numbers) and geometric
information (revealed by non-harmonic spectra) from digraphs along filtration. For in-
stance, Figure 4.8 illustrates the filtration on two tetrahedrons. The top panel is a tetrahe-
                                                                                             √
dron (Tetra 1) with edge lengths |e12 | = |e32 | = |e24 | = 1, and |e13 | = |e14 | = |e34 | = 2. The
                                                                                   √
bottom panel is another tetrahedron (Tetra 2) with edge lengths |e12 | = 3, |e32 | = |e24 | =
                                          √                                            √          √
1, and |e13 | = |e14 | = 2, and |e34 | = 2. We say G1 = G0 , G2 = G1 , G3 = G 2 , G4 = G 3 ,
                                                    92


Figure 4.9: Comparison of Betti numbers and non-harmonic spectra of Lδ,δ   n when n = 0, 1,
and 2 on tetrahedrons Tetra 1 and Tetra 2. Note that since β1 = 0 and β2δ,δ = 0 for Tetra
                                                              δ,δ
1 and Tetra 2, topological variants from persistent path homology cannot discriminate
Tetra 1 and Tetra 2. However λδ,δ1 and λ2 show the differences between Tetra 1 and Tetra
                                         δ,δ
2.
             √
                                                          n of persistent n-th path Lapla-
and G5 = G 5 . Figure 4.9 shows the changes of βnδ,δ and λδ,δ
        n along filtration. It can be seen that by varying the filtration parameter δ from
cian Lδ,δ
0 to 1, the Betti 1 and Betti 2 are always 0. However, the smallest nonzero eigenvalue
  n of Tetra 1 and Tetra 2 have changes along filtration parameter δ. Additionally, when
λ̃δ,δ
n = 1, 2, the λ̃δ,δ
                 n can distinguish Tetra 1 and Tetra 2, while βn cannot. This indicates
                                                                  δ,δ
that non-harmonic spectra of persistent path Laplacian can reveal more geometric infor-
mation than the persistent Betti numbers in distinguishing similar topological structures.
Notably, we remove all the isolated points from each digraph for the simplicity of calcu-
lation.
      Moreover, a more complicated example is also illustrated in Figure 4.10 to describe
                                             93


Figure 4.10: Illustration of filtration on a pyramid. Here, 1, 2, 3, 4, and 5 represent five
elementary 0-paths e1 , e2 , e3 , e4 , and e5 . The top panel    √ is a pyramid√  that has edge lengths
|e13 | = |e25 | = |e32 | = |e34 | = |e54 | = 1, |e12 | = |e14 | = 2, and |e15 | = 3. The bottom panel
is a pyramid
         √       that has edge lengths |e25 | = |e32 | = |e34 | = |e54 | = 1, |e12 | = |e14 | = 2, and
|e15 | = 5.
the filtration on two pyramids. The top panel is a pyramid (Pyra 1) with edge lengths
                                                               √
|e12 | = |e32 | = |e24 | = 1, and |e13 | = |e14 | = |e34 | = 2. The bottom panel is a pyramid (Pyra
                                     √                                                              √
2) with edge lengths |e12 | = 3, |e32 , | = |e24 | = 1, and |e13 | = |e14 | = 2, and |e34 | = 2.
                                             √             √                 √
We say G1 = G0 , G2 = G1 , G3 = G 2 , G4 = G 3 , and G5 = G 5 . Figure 4.11 depicts the
changes of βnδ,δ and λδ,δ  n of persistent n-th path Laplacian Ln for objects Pyra 1 and Pyra
                                                                         δ,δ
2 along filtration. For Pyra 1 and Pyra 2, when n = 0 and δ = 1, their corresponding
digraphs form, which result in β01,1 = 1 and β11,1 = 1 for both Pyra 1 and Pyra 2. When
      √                  √ √
δ = 3, we have β1 3, 3 = 0 for Pyra 1 since the introducing of a new directed edges e15 .
              √                  √ √
When δ = 5, we have β1 5, 5 = 0 for Pyra 2 since the introducing of a new directed edges
e15 kills the 1-cycle formed by e25 , e32 , e34 , and e54 . Furthermore, although Pyra 1 and Pyra
2 do not have exactly the same geometric structure, their share the same β2δ,δ value from
                   √
δ = 0 to δ = 5. However, Pyra 1 and Pyra 2 can be distinguished by the λ̃δ,δ                   2 along
filtration. Therefore, we can see that similar to the PSG, one can use the non-harmonic
spectra from the persistent path laplacian to reveal the intrinsic geometric information of
a givens point-cloud dataset by varying the filtration parameters. In addition, the detailed
                                                        94


Figure 4.11: Comparison of Betti number and non-harmonic spectra of Lδ,δ   n when n =
0, 1,c and 2 on pyramids Pyra 1 and Pyra 2. Note that since β2 = 0, it cannot distinguish
                                                             δ,δ
Pyra 1 and Pyra 2. But λδ,δ
                          2 can tell the difference.
calculations of Lδ,δ
                  n can be found in the Appendix.
4.2.2   Constructions of Persistent Path Laplacian for CB7
In this section, we apply the persistent path Laplacian to the analysis of the curcur-
bit[n]urils system. Cucurbiturils are macrocyclic molecules, which are made of glycoluril
(=C6 H2 N4 O2 =) monomers linked by methylene bridges (-CH2 -). CBn is commonly used
as an abbreviation of Cucurbiturils. Here, n is the number of glycoluril units. In this
work, we consider CB7 as an example. The molecular formulas of CB7 is C42 H14 N28 O14 .
The molecular structure of CB7 is obtained from the Supporting Information of Ref. [165].
    Figure 4.12 illustrates how PPL is employed for a molecular system to extract its rich
topological and geometric information. The first two charts of Figure 4.12a describe the
                                              95


three-dimensional (3D) top view and side view of CB7. The green, blue, red, and gray
colors represent C, N, O, and H atoms, respectively. The third chart of Figure 4.12a is
a basic “Octagon-pentagon” unit that consists of two glycolurils. It can be seen that 7
glycolurils exist in CB7. The last chart of Figure 4.12a demonstrates the path direction
assignment to pairs of atoms based on atomic electronegativity. The periodic table of
electronegativity is given by the Pauling scale [166], in which the electronegativities of C,
N, O, and H are 2.55, 3.04, 3.44, and 2.20, respectively. Then, we set the directions of edges
following the order “H → C → N → O".
    Figure 4.12b depicts the distance-based filtration of CB7. Here, structures Gi (i =
1, 2, ..., 8) were obtained at the filtration radii of 0.200, 0.565, 0.710, 0.745, 0.800, 1.210,
1.315, and 1.800 Å, respectively. In our digraph notation, we denote these structures as
G1 = G0.200 0  , G2 = G0.565
                         0   , G3 = G0.710
                                      0    , G4 = G00.745 , G5 = G00.800 , G6 = G01.210 , G7 = G1.315
                                                                                                0     ,
and G8 = G1.800  0   . Note that, in the present formulation, all of the isolated points were
removed from these digraphs.
    Figure 4.12c illustrates the filtration-induced path complexes in the aforementioned
Gi (i = 1, 2, ..., 8). To clearly show the topological and geometric changes, only the path
complexes in one “Octagon-pentagon” unit (or two glycolurils) are considered and de-
picted for each structure. For simplicity, only edges are presented. However, their path
directions can be easily assigned based on their color map as shown in the last chart of
Figure 4.12a.
    Figure 4.12d depicts the PPL spectra of CB7. We can see that at the initial state (G1 )
when δ = 0.200 Å ), total 98 atoms are isolated from one another. When radius δ =
0.565 Å (G2 ), C atoms on each pentagon are connected with their H atom neighborhoods.
Therefore, four isolated components are formed in each glycoluril, which makes β0δ,δ =
4 × 7 = 28. At G3 (r = 0.710 Å), C atoms on each pentagon are connected with their N
and O neighborhoods. At this stage, two more connected components are involved in
one glycoluri structure, which makes β0δ,δ = 6 × 7 = 42. Only one connected structure can
                                                 96


be formed if all of the atoms get connected with their neighborhood atoms. Therefore,
β0δ,δ = 1 (see G5 - G8 ). Notably, the β2δ,δ and λ̃δ,δ      2 provide rich topological and geometric
information when the filtration parameter δ increases.
                      a    Side View          Top View      2 glycolurils in Stick 2 glycolurils in StickBall
                      b
                      c
                      d
Figure 4.12: a The 3D structures of CB7, 2 glycolurils, and path direction assignment.
Here, from left to right, the side view of CB7, top view of CB7, the structure of two
glycoluril units (=C10 H4 N8 O4 =), and electronegativity-based path direction assignment
are depicted as well. b Illustration of filtration-induced geometries Gi (i = 1, 2, . . . , 8) of
CB7. Eight digraphs G1 = G0.200        0   , G2 = G00.565 , G3 = G00.710 , G4 = G00.745 , G5 = G00.800 , G6 =
   1.210        1.315
G0 , G7 = G0 , G8 = G0             1.800
                                         are constructed under filtration parameter δ. c Illustration
of filtration-induced path complexes within two glycoluril units. Path directions can be
inferred from their colors as shown in the last chart of a. d Betti numbers βnδ,δ and non-
harmonic spectra λ̃δ,δ  n of persistent path Laplacians (Ln when n = 0, 1, and 2) for CB7.
                                                                             δ,δ
     This example shows that PPL can decode topological persistence and the shape evo-
lution of a given molecular system with chemical- or biological-based directional assign-
ment. Specifically, λ̃δ,δ0 can still offer geometric information when β0 does not changes for
                                                                                                     δ,δ
                                                       97


large radii. Therefore, PPL keeps revealing homotopic shape evolution when the topolog-
ical invariant from persistent path homology does not change.
    Additionally, unlike persistent Laplacian, high-order PPL operators provide rich topo-
logical information. For instance, when the filtration parameter δ increases to 1.68, β2δ,δ
from PPL dramatically goes up. Whereas, in persistent Laplacian, the value of Betti 2 is
quite limited since the CB7 system can barely form 2-cycles at a similar filtration param-
eter using either Rips complex or alpha complex. This trait endows PPL with a better
ability to characterize the geometry and topology of an object at large scales.
4.2.3   Discussion and Conclusion
Path homology, a rich mathematical concept introduced by Grigor’yan, Lin, Muranov,
and Yau, has stimulated a variety of new developments in pure and applied mathemat-
ics, including much attention from the topological data analysis (TDA) community. Un-
like original homology or persistent homology, path homology enables the treatment of
directed graphs and networks. Persistent path homology bridges path homology with
multiscale analysis, making it a powerful tool for practical applications. Nonetheless,
these formulations are insensitive to homotopic shape evolution during filtration.
    Topological Laplacians, including Hodge Laplacian, graph Laplacian, sheaf Laplacian,
and Dirac Laplacian, are versatile mathematical tools that not only preserve all topolog-
ical invariants but also describe geometric shapes. This work introduces a new topo-
logical Laplacian, namely persistent path Laplacian, as a new mathematical tool for the
multi-scale analysis of directed graphs and networks. For a given data, the proposed per-
sistent path Laplacian fully recovers the topological persistence of persistent homology
in its harmonic spectra and meanwhile, captures homotopic shape evolution of the data
during filtration in its non-harmonic spectra.
                                            98


                                         CHAPTER 5
  HERMES: AN OPEN-SOURCE SOFTWARE FOR THE SPECTRAL ANALYSIS OF
                                PERSISTENT LAPLACIANS
5.1     Introduction
As a branch of discrete mathematics, graph theory focuses on the relations among vertices
or nodes (0-simplices), edges (1-simplices), faces (2-simplices), and their high-dimensional
extensions. Benefiting from the capability of graph formulations that encode inter-dependencies
among constituents of versatile data into simple representations, graph theory has been
regarded as the mathematical scaffold in the study of various complex systems in bi-
ology, material science, physical infrastructure, and network science. However, tradi-
tional graphs only represent the pairwise relationships between entries. Therefore, hy-
pergraphs, a generalization of graphs that describe the multi-way relationships of math-
ematical structures have been developed to capture the high-level complexity of data
[167, 168]. Mathematically, graphs and hypergraphs are intrinsically related to the sim-
plicial complexes, which have broader use in computational topology. Moreover, many
other areas such as algebra, group theory, knot theory, spectral graph theory (SGT), al-
gebraic topology (AT), and combinatorics are closely related to graph theory. Among
them, the applications of SGT have been driven by various real-life problems in chem-
istry, physics, and life science in the past few decades [138, 169].
    In its early days, spectral graph theory studied the properties of a graph by its graph
Laplacian matrix and adjacency matrix. Later on, developments in spectral graph the-
ory involved some geometric flavor. The explicit constructions of expander graphs rely
on studying eigenvalues and isoperimetric properties of graphs. The discrete analog of
Cheeger’s inequality for graphs in Riemannian geometry is related to the study of man-
ifolds [170]. Specifically, an eigenvalue of the Laplacian of a manifold is related to the
                                               99


isoperimetric constant of the manifold, which motivates the study of graphs by employ-
ing manifolds. Benefiting from increasingly rich connections with differential geometry,
spectral graph theory entered a new era [171]. One of the critical developments is the
Laplacian on a compact Riemannian manifold in the context of the de Rham-Hodge the-
ory [54, 55]. The harmonic part of the Hodge Laplacian spectrum contains the topologi-
cal information, whereas the non-harmonic part of the Hodge Laplacian spectrum offers
additional geometric information for shape analysis [56]. Indeed, the connectivity of a
graph/topological space can be revealed from topological invariants. It is well-known
that the number of eigenvalues in the harmonic spectra of qth-order persistent Laplacian
represents the dimension of persistent q-cohomology of a graph [172, 53, 11], which builds
the connection between spectral graph theory and algebraic topology.
    Homology and cohomology are key concepts in the algebraic topology, which were
developed to analyze and classify manifolds according to their cycles. Traditional ho-
mology is genuinely metric-independent, indicating that geometric information is barely
considered [173]. Therefore, for practical computation, a new branch of algebraic topol-
ogy named persistent homology (PH) [122, 124, 43] was implemented to create a sequence
of topological spaces characterized by a filtration parameter, such as the radius of a ball
or the level set of a real-valued function. As the most important realization of topological
data analysis (TDA) [125, 128, 174], topological persistence has had great success in com-
putational chemistry [5, 175] and biology [4, 44, 176, 177, 178]. For instance, the superior
performance of using PH features of protein-drug complexes in the free energy predic-
tion and ranking at D3R Grand Challenges, a worldwide competition series in computer-
aided drug design [6], was a remarkable success for TDA. Additionally, a weighted per-
sistent homology is proposed as a unified paradigm for the analysis of the biomolecular
data system [179].
    Recently, we introduced persistent spectral graph (PSG) theory to bridge persistent
homology and spectral graph theory [11, 11]. The PSG theory extends the persistence no-
                                             100


tion or multiscale analysis to algebraic graph theory. A family of spectral graphs induced
by a filtration overcomes the difficulty of using traditional spectral graph theory in ana-
lyzing graph structures with a single geometry, giving rise to persistent spectral analysis
(PSA). Additionally, the evolution of the null space dimension of the persistent Laplacian
matrix (PLM) over the filtration offers the topological persistence. Therefore, PSG the-
ory provides simultaneous TDA and PSA. Specifically, by varying a filtration parameter,
a series of qth-order persistent Laplacians (or q-persistent Laplacian) provide persistent
spectra. Notably, the persistent harmonic spectra of 0-eigenvalues span the null space
of the q-th order persistent Laplacian and fully recover the persistent q-th Betti numbers
or persistent barcodes [180] of the associated persistent homology. Specifically, the num-
ber of 0-eigenvalues of qth-order persistent Laplacian reveals the number of q-cocycles
for a given point-cloud dataset. Moreover, the additional geometric shape information
of the data will be unveiled in the non-harmonic spectra.        For example, the spectral
gap (the difference between the moduli of the first two smallest eigenvalues of a Lapla-
cian) reveals the energy difference/density changes between the ground state and first
excited state of a system/dataset. Additionally, the B-factor prediction performance can
be significantly improved by using the non-harmonic spectra involved in the prediction
model, as discussed in [11]. Recently, the theoretical properties and algorithms of PSGs
have been further studied [151] and the application of PSG methods to drug discovery
has been reported [181] . The de Rham-Hodge theory counterpart, called evolutionary de
Rham-Hodge theory, has also been formulated [56].
    Currently, many open-source packages have been developed for the applications of
persistent homology, including Ripser [182], Dionysus [183], Gudhi [184], Perseus [123],
DIPHA [185], Javaplex [186], CliqueTop [187], DioDe [188], Hera, Eirene, and “TDA”
package in R [189]. These packages are able to construct a family of complexes with
the point clouds data as input and calculate its corresponding Betti numbers, which are
equivalent to the harmonic spectra of the persistent Laplacian. However, there is no soft-
                                            101


ware package for simultaneous TDA and PSA. While we developed the theoretical part
of the persistent spectral graph in 2019, we have not constructed efficient and robust soft-
ware yet.
    The objective of present work is to provide the first open-source package, dubbed
highly efficient robust multidimensional evolutionary spectra (HERMES), for evaluating
both the harmonic and non-harmonic spectra of persistent Laplacian matrices, which en-
able broad and convenient applications of the PSG method. In the present release, we
consider an implementation in both alpha complexes [47] and Vietoris–Rips complexes.
To verify the reliability of HERMES, 15 complicated 3D structures of proteins as well as
two fullerene structures are used to calculate the spectra of qth-order persistent Lapla-
cians for q = 0, 1, 2. Moreover, as a validation, the persistent harmonic spectra generated
by HERMES are compared with those obtained from Gudhi and DioDe. Furthermore,
with the use of the spectra of PLMs, molecular data abnormality detection is also dis-
cussed.
    In a nutshell, HERMES provides a powerful tool in various applications such as drug
discovery, protein flexibility analysis, and complex protein structures analysis. It can be
potentially applied to various fields where persistent homology has had success.
5.2    Implementation
5.2.1   Construction of Alpha Shape
Recall that, given a set of points, the alpha shape with any α value is a subcomplex of De-
launay tessellation. Thus, to construct the filtration of alpha complexes, it is necessary to
first compute the complete simplicial complex through the Delaunay tessellation formed
by the set of points. A number of efficient implementations is available in existing soft-
ware packages. Our implementation employs the Computational Geometry Algorithms
Library (CGAL), an efficient and robust software package for many commonly used cal-
culations. We then assign each simplex σ with an alpha value ασ . Finally, the alpha shape
                                              102


given at an α value α0 is constructed by union of convex hulls of all the simplices σ sat-
isfying ασ ≤ α0 , which naturally forms the nerve of balls centered at the given points
truncated by the Voronoi regions, i.e., the corresponding alpha complex.
     We illustrate our implementation with point sets P in 3D, as it is the most common
use scenario. We also assume that all the points are in general positions, which means
that no 4 points of P lie on the same plane and no 5 points of P lie on the same sphere.
Given a simplex σ, which can be a point, an edge, a triangle or a tetrahedron, denote the
open ball bounded by its minimal circumsphere as Bσ . The simplex σ is called Gabriel
([190]) if Bσ ∩ P = ∅. Note that for vertices (0-simplices) the circumradius is considered
0. The above discussion can be directly adapted for 2D implementation by replacing
circumsphere with circumcircle and omitting tetrahedra.
     The filtration parameter α for every simplex σ can be defined as follows. If the simplex
is Gabriel, the filtration value is the corresponding circumradius (for efficiency, we actu-
ally store its square) because the corresponding ball can be considered as an empty α-ball
touching all its vertices. If the simplex is not Gabriel, the filtration value is the minimum
of all the filtration values of the cofaces of σ that contain the points making the simplex
non-Gabriel. When α value reaches that number, we will have an empty α-ball making
the simplex α-exposed.
5.2.2   Implementation details for alpha shape
To ensure the valid calculation of the filtration parameter for non-Gabriel simplices, the
filtration values are always computed from the highest dimension (tetrahedra) down to
0 (vertices). We initialize the filtration value for all the simplices to be positive infinity.
For dimension k, we iterate through each k-simplex. If the current filtration value ασ2 is
positive infinity, we assign the filtration value as the square of the corresponding circum-
radius. Then, we check every (k −1)-dimensional face τ in ∂σ. If the circumsphere of τ
enclosed the other vertex of σ in the interior, it is not Gabriel, and does not correspond to
                                              103


an empty α-ball. In this case, ασ2 is assigned to ατ2 if ασ > ατ .
     With this procedure, we ensure that ασ for every simplex σ corresponding to the fil-
tration value α is α-exposed to an empty α-ball. In other words, we ensure each simplex
represented by its vertex index set J ⊆ {1, 2, ..., |P |} is in the nerve of the Ri ’s, which are
the intersections Ri = Vi ∩ Bi of Voronoi cells Vi ’s and balls Bi ’s around the points pi ’s.
5.2.2.1    Boundary operator construction
With ασ assigned, we sort the k-simplices with increasing filtration parameter value. This
allows us to construct a single boundary operator Bq∞ (the matrix representation of ∂q∞ )
for the entire filtration, which is that of the Delaunay tessellation. For any given α, we
can read off the top left block of the full boundary matrix Bq∞ , i.e.,
                         Bqα     = Bq∞                          α
                                                                   , 1 ≤ j ≤ Nqα ,             (5.1)
                                        
                              ij            ij
                                               ,    ∀1 ≤ i ≤ Nq−1
where Nqα is the number of q-simplices in the alpha complex with the filtration parameter
α. Alternative, we can consider the Nqα × Nq∞ projection matrix Pqα from the Delaunay
tessellation to the alpha complex, Pqα ij = δij (1 on the diagonal and 0 elsewhere), with
                                                
which we have Bqα = Pq−1  α
                              Bq∞ (Pqα )T .
5.2.2.2    Persistent boundary operator
The construction of p-persistent boundary matrix Bqα,p (the representation of operator
  q ) is more involved than reading off Bq . We first construct the projection matrix Pq
                                                    ∞
ðα,p                                                                                             α,p
from Cqα+p to Cα,pq . Then, the p-persistent boundary matrix can be assembled as Bq
                                                                                             α,p
                                                                                                  =
   α
Pq−1  Bq∞ (Pα,p T
            q ) .
     To construct the projection matrix, we first note that it is the projection to the kernel
of an operator that measures the difference between the boundary operator mapped onto
   α+p
Cq−1   and the boundary restricted to Cq−1        α
                                                     , Diffα,p
                                                           q
                                                                    α+p
                                                               = (Iq−1         ) Bq , where Rqα,p =
                                                                            α,p T α+p
                                                                        − Rq−1
                                                      104


Pqα+p (Pqα )T Pqα (Pqα+p )T is the restriction from Cqα+p to Cqα and Iqα+p is the identity matrix on
Cqα+p .
    Instead of storing a dense matrix, we propose to use a procedural representation in-
volving the inverse of persistent Laplacians with gauge ([191]) to reduce the storage as
well as speed up the computation. More specifically, we construct the projection matrix
as follows
                                Pα,p = Iqα+p − (Diff                      ˜ α,p ,
                                                     ˜ α,p )T (L̃α,p )−1 Diff                       (5.2)
                                  q                     q        q−1         q
where (L̃α,pq−1 )
                 −1
                    can be implemented through rank deficiency fixing in [191], and the re-
stricted operator Diff  ˜ α,p is defined below. Note that this sparse linear equation solving
                           q
approach is essentially the graph version of the harmonic extension described in Ref.
[55].
    The reason that the projection matrix can be defined this way is that starting from an
arbitrary element ωq ∈ Cqα+p , we can modify it into ωq −(Diffα,p           q ) fq−1 ∈ Cq , where fq−1 is
                                                                               T         α,p
nonzero only in the difference complex Cl(Tα+p −Tα ), the closure of the difference between
Tα+p and Tα . Denoting any chain f on the difference complex as f˜ and any operator
B on it as B̃ α,p , and the B̃qα,p (B̃qα,p )T f˜q−1 = B̃qα,p ω̃q . Noticing that f˜q−1 is determined up
to a gauge transform fq−1 − (B̃q−1          ) g̃q−2 for some (q − 2)-chain gq−2 in Cl(Tα+p − Tα ),
                                        α,p T
we introduce the gauge fixing term B̃q−1          α,p
                                                      fq−1 = 0, which leads us to the sparse linear
system L̃α+p     ˜         ˜ α,p                      ˜
            q−1 fq−1 = Diffq ωq where the Diff operator is the above operator projected to
the difference complex. Note that fixing the rank deficiency of persistent Laplacians (in
the difference complex) is computationally efficient as its kernel dimension is far smaller
than that of the corresponding boundary or coboundary operators.
5.2.2.3   Persistent spectrum computation
The q-order p-persistent Laplacian operators can then be implemented by direct eval-
uation of Lα,p q
                         α,p
                    = Bq+1          ) + (Bqα )T Bqα . Their spectra can be evaluated through any
                                α,p T
                             (Bq+1
off-the-shelf sparse matrix eigensolver.
                                                       105


    Thus, the dimension of the null space of Lα,p0 is the number of p-persistent connected
components. The dimension of the null space of Lα,p 1 is the number of p-persistent handles
or tunnels. Similarly, the dimension of the null space of Lα,p
                                                             2 is the number of p-persistent
cavities.
5.2.3  Implementation Details for Rips Complex
The Vietoris–Rips complex at different filtration values is also considered in HERMES.
Following the definition of the Vietoris–Rips complex, the implementation is straightfor-
ward. However, due to the large number of simplices, the calculation of non-harmonic
spectra of PLMs Lt,pq can be resource-intensive. Therefore, we may set a maximum cutoff
distance for the filtration r and an upper limit for persistent p for practical applications.
5.3    Validation
Figure 5.1: The 3D structures of C20 and C60 . (a) C20 molecule. A total of 12 pentagon
rings can be found in C20 . (b) C60 molecule. 12 pentagon rings and 20 hexagon rings form
the structure of C60 .
    We construct the alpha complex at different filtration values from the finite cells of
a Delaunay tessellation from the Computational Geometry Algorithms Library (CGAL).
Moreover, the Vietoris–Rips complex at different filtration values is also constructed in
HERMES. Gudhi and DioDe are two of the most frequently applied open-source libraries
                                             106


that are able to compute Betti numbers (harmonic persistent spectra) based on CGAL,
while Ripser is based on the blazing fast C++ Ripser package. As shown in [11], the
0-persistent qth Betti numbers βqα,0 at filtration parameter t is the number of zero eigen-
values of qth-order 0-persistent Laplacian Lt,0     q :
                           βqt,0 = dim(Cqt ) − rank(Lt,0                   t,0
                                                          q ) = dim ker Lq ,                     (5.3)
where t = α if we choose to construct the alpha complex, and t = r if we choose to
construct the Vietoris–Rips complex.
    In fact, βqt,0 counts the number of q-cycles in the alpha complex Kt that persists in
Kt . Although Gudhi and DioDe can calculate the number of zero eigenvalues, the non-
harmonic persistent spectra also play an important role in applications as shown in our
earlier work [11]. Therefore, we developed an open-source package HERMES, which not
only tracks the topological changes from the persistent Betti numbers but also derives
the geometric changes from the non-harmonic spectra of persistent Laplacians. In the
following, we compare the Betti numbers βqt,p that are calculated from HERMES with the
Betti numbers that are derived from Gudhi and DioDe on a set of 2D and 3D points,
aiming to validate the robustness and accuracy of HERMES.
5.3.1   Validation on Fullerene structures
In this section, we will validate the correctness of HERMES with simple systems such as
C20 and C60 molecules with known persistent Betti numbers [4] for Rips complex. More-
over, the persistent Betti numbers for the alpha complex are also included in this section.
    C20 molecule. The C20 molecule is the smallest member of the fullerene family, which
has a dodecahedral cage structure as illustrated in Figure 5.1 (a). Both C20 and C60 have
the molecular symmetry of the full icosahedral point group Ih . Figure 5.2 illustrates the
persistent Betti numbers for Rips complex β0r,0.05 , β1r,0.05 , and β2r,0.05 (green curves) and
the smallest non-zero eigenvalue λr,0.05
                                       0      , λr,0.05
                                                 1      , and λr,0.05
                                                               2      (yellow curves) of C2 0 that are
                                                  107


                       20
                                                                                    15
             r, 0.00                                                                10  r, 0.00
             0         10                                                               0
                                                                                    5
                                                                                    0
                             0.0   0.5   1.0   1.5    2.0   2.5   3.0   3.5   4.0
                       10                                                           15
             r, 0.00                                                                10  r, 0.00
             1          5                                                               1
                                                                                    5
                         0                                                          0
                             0.0   0.5   1.0   1.5    2.0   2.5   3.0   3.5   4.0
                       1.0
                                                                                    10
             r, 0.00                                                                    r, 0.00
             2         0.5                                                              2
                                                                                    5
                       0.0                                                          0
                             0.0   0.5   1.0   1.5    2.0   2.5   3.0   3.5   4.0
Figure 5.2: Illustration of the harmonic spectra (for Rips complex) β0r,0 , β0r,0 , and β2r,0 (green
curves from top chart to bottom chart) and the smallest non-zero eigenvalue λr,0       0 , λ1 , and
                                                                                             r,0
λ2 (yellow curves from top chart to bottom chart) of C20 molecule (the bottom left chart
 r,0
in Figure 5.6) at different filtration values α calculated from HERMES. Here, the x-axis
represents the radius filtration value r (unit: Å), the left-y-axes represents the number of
zero eigenvalues of Lr,0
                       0 , L1 , and L1 from top to bottom, and the right-y-axes represents
                             r,0       r,0
the first non-zero eigenvalue of Lr,00 , L1 , and L2 from top to bottom.
                                           r,0     r,0
computed from HERMES. Similarly, Figure 5.3 illustrates the persistent Betti numbers
for the alpha complex β0α,0.05 , β1α,0.05 , and β2α,0.05 (green curves) and the smallest non-zero
eigenvalue the λα,0.05
                0      , λα,0.05
                          1      , and λ2α,0.05 (yellow curves) of C2 0 that are computed from
HERMES.
   Note that although the Rips complex and the alpha complex have similar Betti-0 and
Betti-1 patterns, their Betti-2 patterns differ from each other over the filtration range. Ad-
ditionally, the non-harmonic spectra of the Rips complex and the alpha complex differ
much from each other. Moreover, the non-harmonic spectra of the Rips complex appear
to carry more information than those of the alpha complex.
   C60 molecule. The C60 molecule is a well-known structure that is also called buck-
minsterfullerene. A total of 12 pentagon rings and 20 hexagon rings consist of C60 . Fig-
ure 5.1 (b) shows the 3D structure of C60 . Figure 5.4 and Figure 5.5 demonstrate the 0.05-
                                                     108


                       20
                                                                                             3
              , 0.05   10                                                                    , 0.05
                                                                                             2
                                                                                             1
                 0                                                                               0
                                                                                             0
                             0.00   0.25   0.50   0.75    1.00   1.25   1.50   1.75   2.00
                       10                                                                    0.75
              , 0.05
                                                                                             0.50     , 0.05
                        5
                 1
                                                                                             0.25        1
                         0                                                                   0.00
                           0.00     0.25   0.50   0.75    1.00   1.25   1.50   1.75   2.00
                       1.0
                                                                                             0.3
             , 0.05    0.5                                                                   0.2 , 0.05
                                                                                             0.1
                2                                                                                     2
                       0.0                                                                   0.0
                             0.00   0.25   0.50   0.75    1.00   1.25   1.50   1.75   2.00
Figure 5.3: Illustration of the harmonic spectra (for alpha complex) β0α,0.05 , β0α,0.05 , and
β2α,0.05 (green curves from top chart to bottom chart) and the smallest non-zero eigen-
value λ0α,0.05 , λα,0.05
                  1      , and λα,0.05
                                2      (yellow curves from top chart to bottom chart) of the C20
molecule (the bottom left chart in Figure 5.6) at different filtration value α calculated from
HERMES. Here, the x-axis represents the radius filtration value α (unit: Å), the left-y-axes
represents the number of zero eigenvalues of Lα,0.05    0   , Lα,0.05
                                                               1      , and L1α,0.05 from top to bot-
tom, and the right-y-axes represents the first non-zero eigenvalue of Lα,0.05       0   , L1α,0.05 , and
L2α,0.05
         from top to bottom.
persistent Betti numbers for rips complex and alpha complex, respectively. Figure 5.2 -
Figure 5.5 indicate the capacity of HERMES for the direct calculation of the persistent
            q and Lq (p > 0).
spectra of Lr,p    α,p
5.3.2   Validation on proteins
In this section, we further validate HERMES using 15 proteins. Their Protein Data Bank
(PDB) IDs of these proteins are 1CCR, 1NKO, 1O08, 1OPD, 1QTO, 1R7J, 1V70, 1W2L,
1WHI, 2CG7, 2FQ3, 2HQK, 2PKT, 2VIM, and 5CYT. The 3D structures of these 15 pro-
teins can be downloaded from the PDB ( https://www.rcsb.org/). Here, only the alpha
carbon atoms are considered in our calculations. The harmonic spectra of HERMES are
compared with the persistent Betti numbers of Gudhi and DioDe. Figure 5.6 illustrates
                                                         109


                       60
                       40                                                          4
             r, 0.00                                                               r, 0.00
             0                                                                     0
                       20                                                          2
                       0                                                           0
                            0.0   0.5   1.0   1.5    2.0   2.5   3.0   3.5   4.0
                       30                                                          2
             r, 0.00   20                                                          r, 0.00
             1                                                                     1
                                                                                   1
                       10
                        0                                                          0
                            0.0   0.5   1.0   1.5    2.0   2.5   3.0   3.5   4.0
                       20
                                                                                   4
             r, 0.00                                                               r, 0.00
             2         10                                                          2
                                                                                   2
                       0                                                           0
                            0.0   0.5   1.0   1.5    2.0   2.5   3.0   3.5   4.0
Figure 5.4: Illustration of the harmonic spectra β0r,0 , β0r,0 , and β2r,0 (blue curves from top
chart to bottom chart) and the smallest non-zero eigenvalue λr,0    0 , λ1 , and λ2 (red curves
                                                                          r,0      r,0
from top chart to bottom chart) of the C60 molecule (the bottom left chart in Figure 5.6) at
different filtration value α calculated from HERMES. Here, the x-axis represents the ra-
dius filtration value α (unit: Å), the left-y-axes represents the number of zero eigenvalues
of Lr,0
     0 , L1 , and L1 from top to bottom, and the right-y-axes represents the first non-zero
          r,0       r,0
eigenvalue of Lr,00 , L1 , and L2 from top to bottom.
                        r,0     r,0
the network structures of 15 proteins. For each protein, the color at atomic positions rep-
resents the normalized diagonal values of the accumulated 0th-order 0-persistent Lapla-
                                                                               √
cians: max 1L0 (L00 )jj , with L00 =    L0 . Here, the filtration α goes from 1.5 Å to
                                     P α,0
          i( 0)
                                      α
√              ii
  10 Å with the step size of 0.01 Å. Figure 5.7 depicts the persistent Betti numbers βqα,0
(blue curve) of PDB ID 5CYT that are calculated from Gudhi, DioDe, and HERMES, to-
gether with the smallest non-zero eigenvalue λα,0
                                              q (red curve) that are obtained only from
HERMES.
   It can be seen that all of these three packages return exactly the same persistent Betti
numbers, suggesting that the calculation of our package HERMES is reliable. Addition-
ally, the values of the smallest non-zero eigenvalues λα,0
                                                       0 and λ1 increase around 1.86 Å,
                                                              α,0
indicating the dramatic topological changes at this point. Similarly, with the increment
of the α, the curve of λα,0
                        2 also records the topological and geometric changes at a specific
                                                    110


                        60                                                                    2
                        40
               , 0.05                                                                         , 0.05
                                                                                              1
                  0
                        20                                                                        0
                         0                                                                    0
                              0.00   0.25   0.50   0.75    1.00   1.25   1.50   1.75   2.00
                         30                                                                   2
               , 0.05
                         20                                                                   , 0.05
                                                                                              1
                  1
                         10                                                                       1
                          0                                                                   0
                            0.00     0.25   0.50   0.75    1.00   1.25   1.50   1.75   2.00
                        1.0                                                                   3
                                                                                              2
              , 0.05    0.5                                                                   , 0.05
                 2
                                                                                              1   2
                        0.0                                                                   0
                              0.00   0.25   0.50   0.75    1.00   1.25   1.50   1.75   2.00
Figure 5.5: Illustration of the harmonic spectra β0α,0.05 , β0α,0.05 , and β2α,0.05 (green curves
from top chart to bottom chart) and the smallest non-zero eigenvalue λα,0.05         0   , λ1α,0.05 , and
λ2α,0.05
         (yellow curves from top chart to bottom chart) of the C60 molecule (the bottom left
chart in Figure 5.6) at different filtration value α calculated from HERMES. Here, the x-
axis represents the radius filtration value α (unit: Å), the left-y-axes represents the number
of zero eigenvalues of Lα,0.05
                           0   , Lα,0.05
                                   1     , and Lα,0.05
                                                1      from top to bottom, and the right-y-axes
represents the first non-zero eigenvalue of L0    α,0.05
                                                         , Lα,0.05
                                                            1      , and Lα,0.05
                                                                          2      from top to bottom.
filtration value. The use of non-harmonic spectra for biophysical modeling was described
in our earlier work [11].
   To be noted, HERMES can also deal with the qth-order p-persistent Laplacians Lα,p
                                                                                 q .
Figure 5.8 illustrates the persistent Betti numbers β0α,0.5 , β1α,0.05 , and β2α,0.5 (green curves)
and the smallest non-zero eigenvalue λ0α,0.5 , λα,0.5
                                                1     , and λα,0.5
                                                             2     (yellow curves) of 5CYT that
are computed from HERMES, demonstrating the capacity of HERMES for the direct cal-
culation of the persistent spectra of Lα,p
                                       q   (p > 0). Compared with the middle chart of
Figure 5.7, β1α,0.5 in the middle chart of Figure 5.8 is always smaller than β1α,0 at the same
filtration α. Moreover, λα,0.5
                         1     also goes up around 1.86 Å, which has the same behavior
as λα,0
    1 . Similar behaviors can be also observed from the bottom charts of Figure 5.7 and
Figure 5.8.
   Furthermore, HERMES can be used to detect the abnormality of a protein structure.
                                                          111


Figure 5.6: The alpha carbon network plots of 15 proteins: PDB IDs 1CCR, 1NKO, 1O08,
1OPD, 1QTO, 1R7J, 1V70, 1W2L, 1WHI, 2CG7, 2FQ3, 2HQK, 2PKT, 2VIM, and 5CYT from
left to right and top to bottom. The color represents the normalized diagonal element of
the accumulated Laplacian at each alpha carbon atom.
                                            112


                 100                                                                2
            ,0
             0   50                                                                 1,0
                                                                                         0
                  0                                                                  0
                       1.25   1.50   1.75   2.00     2.25   2.50   2.75   3.00   3.25
                                                                                     2
                 40
            ,0
             1                                                                      1,0
                                                                                         1
                 20
                  0                                                                  0
                       1.25   1.50   1.75   2.00     2.25   2.50   2.75   3.00   3.25
                                                                                     3
                 7.5
                                                                                    2
            ,0
             2   5.0                                                                 ,0
                                                                                         2
                 2.5                                                                1
                 0.0                                                                 0
                       1.25   1.50   1.75   2.00     2.25   2.50   2.75   3.00   3.25
Figure 5.7: Illustration of the harmonic spectra βqα,0 (blue curve) and the smallest non-zero
eigenvalue λα,0q (red curve) of PDB ID 5CYT (the bottom left chart in Figure 5.6) at differ-
ent filtration values α when q = 0, 1, 2. The βqα,0 are calculated from Gudhi, DioDe, and
HERMES, and λα,0  q   are obtained only from HERMES. Here, the x-axis represents the ra-
dius filtration value α (unit: Å), the left-y-axis represents the number of zero eigenvalues
    q , and the right-y-axis represents the first non-zero eigenvalue of Lq . Note that the
of Lα,0                                                                      α,0
harmonic spectra from the three methods are indistinguishable.
Figure 5.9 (a) shows a 3D secondary structure of PDB 1O08, where the balls represent the
alpha carbon atoms. The light blue, purple, and orange colors represent helix, sheet, and
random coils of PDB ID 1O08. Figure 5.9 (b) depicts its harmonic spectra βqα,0 (blue curve)
and the smallest non-zero eigenvalue λα,0
                                      q (red curve). Notably, two unusual onsets of β0
                                                                                     α,0
and β1α,0 are detected when α << 1.9 Å, indicating something is wrong with the structure
data. Usually, the distance between the two alpha carbon atoms is around 3.8 Å. By
examining the structure of PDB 1O08, we found that two pairs of alpha carbon atoms in
PDB 1O08 have abnormal distances as marked with black frames. The distance of alpha
carbon atoms in the upper box is 2.914 Å and that in the lower box is 2.996 Å, which are
too short. The plots of the other proteins can be found in the Appendix. Similar structural
defects were detected for PDB IDs 1V70, 2HQK, 2PKT, and 2VIM.
   Although our package provides additional geometric information by calculating the
                                                   113


                             100                                                                 2
               , 0.50         50                                                                 1, 0.50
                    0                                                                                 0
                               0                                                                  0
                                    1.25   1.50   1.75   2.00     2.25   2.50   2.75   3.00   3.25
                              10                                                                  2
                    , 0.50     5                                                                 1, 0.50
                       1                                                                              1
                                0                                                                 0
                                    1.25   1.50   1.75   2.00     2.25   2.50   2.75   3.00   3.25
                             0.05                                                                 3
                                                                                                 2
           , 0.50            0.00                                                                 , 0.50
              2
                                                                                                 1    2
                             0.05                                                                 0
                                    1.25   1.50   1.75   2.00     2.25   2.50   2.75   3.00   3.25
Figure 5.8: Illustration of the harmonic spectra β0α,0.5 , β0α,0.5 , and β2α,0.5 (green curves from
top chart to bottom chart) and the smallest non-zero eigenvalue λα,0.5           0  , λα,0.5
                                                                                       1     , and λα,0.5
                                                                                                    2
(yellow curves from top chart to bottom chart) of PDB ID 5CYT (the bottom left chart
in Figure 5.6) at different filtration values α calculated from HERMES. Here, the x-axis
represents the radius filtration value α (unit: Å), the left-y-axes represents the number
of zero eigenvalues of Lα,0.5
                           0   , Lα,0.5
                                   1    , and L1α,0.5 from top to bottom, and the right-y-axes
represents the first non-zero eigenvalue of Lα,0.50    , Lα,0.5
                                                          1     , and L2α,0.5 from top to bottom.
non-harmonic spectra of qth-order persistent Laplacians, there are two limitations of HER-
MES. First, the construction of the Vietoris–Rips complex is the primary bottleneck in
the calculation of non-harmonic spectra of persistent Laplacian matrices (PLMs). Addi-
tionally, the input format of HERMES is point cloud data. Other input formats, such as
pairwise distances, point cloud with van der Waals radii, and volumetric density are not
supported. These limitations will be addressed in our future implementation.
5.4    Discussion and Conclusion
While spectral graph theory has had tremendous success in data science to capture the
geometric and topological information, it is limited by representing a graph structure at
a given characteristic length scale, which hinders its practical application in data anal-
ysis. Motivated by the persistent (co)homology in dealing with a given initial data by
                                                                114


Figure 5.9: (a) The 3D secondary structure of PDB ID 1O08. The blue, purple, and orange
colors represent helix, sheet, and random coils of PDB ID 1O08. The ball represents the
alpha carbon of PDB ID 1O08. (b) Illustration of the harmonic spectra βqα,0 (blue curve)
and the smallest non-zero eigenvalue λα,0q (red curve) of PDB ID 1O08 at different filtration
values α when q = 0, 1, 2. The βqα,0 are calculated from Gudhi, DioDe, and HERMES, and
λα,0
  q  are calculated only from HERMES. Here, the x-axis represents the radius filtration
value α (unit: Å), the left-y-axis represents for the number of zero eigenvalue of Lα,0
                                                                                     q , and
the right-y-axis represents for the non-zero eigenvalues of Lq . Note that the harmonic
                                                                α,0
spectra from three methods are indistinguishable.
constructing a family of simplicial complexes to track their topological invariants, and
the multiscale graphs by creating a set of spectral graphs aiming to extract rich geometric
information, we proposed persistent spectral graph (PSG) theory as a unified multiscale
paradigm for simultaneous geometric and topological analysis [192]. PSG theory has
stimulated mathematical analysis and algorithm development [151], as well as applica-
tions to drug discovery [181], and protein flexibility analysis [11].
    To enable broad and convenient applications of the PSG method, we present an open-
source software package called highly efficient robust multidimensional evolutionary
spectra (HERMES). For a given point-cloud dataset, HERMES creates persistent Lapla-
cian matrices (PLMs) at various topological dimensions via filtration. The spectrum of
PLMs includes harmonic parts and non-harmonic parts. It turns out that the harmonic
part spans the kernel spaces of PLMs and carries the full topological information of the
                                              115


dataset. As a result, HERMES delivers the same topological data analysis (TDA) as does
persistent homology. The non-harmonic part of PLMs provides valuable geometric anal-
ysis of the shape of data at various topological dimensions. The smallest non-zero eigen-
values are found to be very sensitive to data abnormality. In the present HERMES, both
the alpha complex and the Vietoris–Rips complex are implemented. Due to the poten-
tially large number of simplicies, the eigenvalue problem of persistent Laplacian for the
Vietoris–Rips complex becomes memory-intensive for large systems. This difficulty may
be overcome with approximate eigenvalue solvers. We will continue improving the effi-
ciency of HERMES. HERMES has been extensively validated for its accuracy, robustness,
and reliability by standard test datasets and a large number of complex protein structures,
including comparison with Gudhi and DioDe.
                                             116


                                           CHAPTER 6
           APPLICATIONS IN MATHEMATICAL MODELING OF VIROLOGY
6.1     Mutations on COVID-19 diagnostic targets
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which was first reported
in Wuhan in December 2019, is an unsegmented positive-sense single-stranded RNA
virus that belongs to the β-coronavirus genus and coronaviridae family. Coronaviruses
are some of the most sophisticated viruses with their genome size ranging from 26 to 32
kilobases in length. Caused by SARS-CoV-2, the coronavirus disease 2019 (COVID-19)
pandemic outbreak has spread to more than 200 countries and territories with more than
15,012,731 infection cases and 619,150 fatalities worldwide by July 23, 2020 [193]. Addi-
tionally, travel restrictions, quarantines, and social distancing measures have essentially
put the global economy on hold. Furthermore, we remain without efficacious testing,
medications and vaccines for COVID-19. Undoubtedly, effective and widely available
COVID-19 diagnostic testing, medications and vaccines would not only save lives, but
would play a crucial role in a recovering worldwide economic1 .
    There are three types of diagnostic tests for COVID-19, namely polymerase chain re-
action (PCR) tests, antibody tests, and antigen tests. PCR tests detect the genetic material
from the virus. Antibody tests, also called serological tests, examine the presence of an-
tibodies produced from immune response to the virus infection. Antigen tests detect the
presence of viral antigens, e.g., parts of the viral spike protein. PCR tests are relatively
more accurate but take time to show the test result. The protein tests based on antibody
or antigen can display test results in minutes but are relatively insensitive and subject to
host immune response.
    1
      This work is published on Nov 2020. No vaccines and medications available for COVID-19 at that
time.
                                                117


    PCR diagnostic test reagents were designed based on early clinical specimens con-
taining a full spectrum of SARS-CoV-2 [194], particularly the reference genome collected
on January 5, 2020, in Wuhan (SARS-CoV-2, NC004718) [91]. Approved by the United
States (US) Food and Drug Administration (FDA), the US Centers for Disease Control
and Prevention (CDC) has detailed guidelines for COVID-19 diagnostic testing, called
“CDC 2019-Novel Coronavirus (2019-nCoV) Real-Time RT-PCR Diagnostic Panel” ( https:
//www.fda.gov/media/134922/download). The US CDC has designated two oligonu-
cleotide primers from regions of the virus nucleocapsid (N) gene, i.e., N1 and N2, as
probes for the specific detection of SARS-CoV-2. The panel has also selected an addi-
tional primer/probe set, the human RNase P gene (RP), as control samples. Many other
diagnostic primers and probes based on RNA-dependent RNA polymerase (RdRP), en-
velope (E), and nucleocapsid (N) genes have been designed [195] and/or designated by
the World Health Organization (WHO) as shown in Table S1 of the Supporting Material,
which provides the details of 54 commonly used diagnostic primers and probes [196]. The
diagnostic kits are often static over time, yet SARS-CoV-2 is undergoing fast mutations.
Hence, it is reported that different primers and probes show nonuniform performance
[197, 198, 199].
    In this study, we genotype 31421 SARS-CoV-2 genome isolates in the globe and reveal
numerous mutations on the COVID-19 diagnostic targets commonly used around the
world, including those designated by the US CDC. We identify and analyze the SARS-
CoV-2 mutation positions, frequencies, and encoded proteins in the global setting. These
mutations may impact the diagnostic sensitivity and specialty, and therefore, they should
be considered in designing new testing kits as the current effort in COVID-19 testing,
prevention, and control. We propose diagnostic target selection and optimization based
on nucleotide-based and gene-based mutation-frequency analysis.
                                             118


6.1.1   Results and Analysis
Genotyping analysis We first genotype 31421 SARS-CoV-2 genome samples from the
globe as of July 23, 2020. The genotyping results unravel 13402 single mutations among
these virus isolates. Typically, a SARS-CoV-2 isolate can have eight co-mutations on av-
erage. A large number of mutations may occur on all of the SARS-CoV-2 genes and have
broad effects on diagnostic kits, vaccines, and drug developments. Moreover, we cluster
these mutations by k-means methods, resulting in globally at least six distinct subtypes
of the SARS-CoV-2 genomes, from Cluster I to Cluster VI. Table 6.1 shows the mutation
distribution clusters with sample counts (SC) and total single mutation counts (MC) in 20
countries.
Table 6.1: The mutation distribution clusters with sample counts (SC) and total single mu-
tation counts (MC). The listed countries are United States (US), Canada (CA), Australia
(AU), Germany (DE), France (FR), United Kingdom (UK), Italy (IT), Russia (RU), China
(CN), Japan (JP), Korean (KR), India (IN), Iceland (IS), Brazil (BR), Spain (ES), Belgium
(BE), Saudi Arabia (SA), Turkey (TR), Peru(PE), and Chile (CL).
               Cluster I    Cluster II    Cluster III   Cluster IV   Cluster V   Cluster VI
  Country     SC     MC    SC     MC     SC     MC     SC     MC    SC     MC    SC    MC
     US      3252 24846   2013 14737     286    3686  2366 27012   562     3798 304    2706
    CA       113     835   80     561     9      106    42     417  84     525   33     290
    AU       173    1204   587    5048   75     1010   195    2127 165      885 132    1076
    DE        69     504   25     121     5       58    26     209  27     144   43     366
     FR      100     718   14      55     2       22    48     523  74     465   10      83
    UK       295    2328  1927 12777    2171 27636    1623 16123   1890 11835   2919 25576
     IT        1       8    8      104   33      561    24     308  57     283   24     192
    RU         7      52    2       32   19      219    7      53   32     187  119     968
    CN         3      22   287    1155    2       32    7      50    8       35   3     26
     JP       18     134   243    1001   23      272    9      79   23     139  191    1676
    KR         0       0   58     327     0        0    0       0    0       0    0      0
     IN       29     212   268    3045   200    2703   399    4840 141      847  51     487
     IS       66     446   103     595   30      345    10      89 152      924  59     525
     ES        4      33   163    1198    3       33    37     365 170     1103  42     359
     BR        3      26    7       51   78     1009    2      10    7       42  63     591
     BE       56     411   85     400    66      783   115    1031 230     1381 141    1239
     SA       16     110    9       61    0        0    14     126  17     133    1      7
     TR        0       0   28     339    13      158    50     476   4       28  31     273
     PE        2      12    5       36   10      124    5      48    9       58   2     17
    CL        13      91   27     282    21      285    49     665  32     200   20     169
    All of the countries are involved in six clusters except Korean (KR), Saudi Arabia (SA),
                                               119


and Turkey (TR). Among them, China initially had samples only in clusters II and its
sample distributions reached to other Clusters after March 2020. Cluster I, II, and IV
dominate in the United States. Germany (DE) and France (FR) samples are mainly in
Cluster I, IV, and VI. Italy (IT) samples are mainly in Clusters III, IV, V, and VI. Samples in
Turkey (TR) are mainly in Cluster II, III, IV, and VI. Japan (JP) samples are dominated in
Cluster II and VI, Korea (KR) samples belong to Cluster II only. Cluster II is common to
all countries. Figure 6.1 depicts the distribution of six distinct clusters in the world. The
light blue, dark blue, green, red, pink, and yellow represent Cluster I, Cluster II, Cluster
III, Cluster IV, Cluster V, and Cluster VI, respectively. The color of the dominated Cluster
decides the base color of each country.        To be noted, although some countries have a
lot of confirmed sequences, a very limited number of complete genome sequences are
deposited in the GISAID, which causes the geographical bias in the Table 6.1.
Figure 6.1: The scatter plot of six distinct clusters in the world in July 2020. The light blue,
dark blue, green, red, pink, and yellow represent Cluster I, Cluster II, Cluster III, Cluster
IV, Cluster V, and Cluster VI, respectively. The base color of each country is decided by
the color of the dominated Cluster.
     Mutations on Diagnostic Targets
                                               120


Table 6.2: Summary of mutations on COVID-19 diagnostic primers and probes and their
occurrence frequencies in clusters. Here, SC is the sample counts and MC is the mutation
counts.
    Primer                      MC  SC Cluster I Cluster II Cluster III Cluster IV Cluster V Cluster VI
    RX7038-N1 primer (Fw)a      15   79   5         14         12           28        14         6
    RX7038-N1 primer (Rv)a      17  113   1         66         14            9         2        21
    RX7038-N2 primer (Fw)a       7   60   3          10        24          21          1         1
    RX7038-N2 primer (Rv)a       6   50   2          17         6          15          3         7
    RX7038-N3 primer (Fw) [200] 13  287    4        224        13          26          14        6
    RX7038-N3 primer (Rv) [200] 12   70    4         10         7          39           6        4
    N1-U.S.-P [196]             15  856    4        782         20          31         15         4
    N2-U.S.-P [196]             11   70   10         40          4          12          4         0
    N3-U.S.-P [196]             16   84    5         27         15          21         10         6
    N-Sarbeco-Fb [195]          12  63     4         20        10           15         10         4
    N-Sarbeco-Pb [195]          12  116    1        19         30           42         15         9
    N-Sarbeco-Rb [195]          17  156   37        26          4           80          5         4
    N-China-F [196]             23 26280  38        226       10873        139         17      14987
    N-China-R [196]             17  217    5         15         17         157          8        15
    N-China-P [196]              7   20    1          4         6           8           1        0
    N-HK-F [196]                 5  149    1          2        74           7           1       64
    N-HK-R [196]                14   84   14         12        14          35           4        5
    N-JP-F [196]                10   66    5         10         9          16          26        0
    N-JP-P [196]                 9   32    0          5         1          16           3        7
    N-TL-F [196]                17  149    1         84        14          31          13        6
    N-TL-R [196]                17  115   29          7          7          66          3         3
    N-TL-P [196]                11   45    1          5         13           5          1        20
    E-Sarbeco-F1c                5   23   0           0        10            9         2         2
    E-Sarbeco-R2c                4   18   0           6         5            1         6         0
    E-Sarbeco-P1c                9   48   1          29         6            9         3         0
    nCoV-IP2-12669Fwc            3   50   0          17        12           11         0        10
    nCoV-IP2-12759Rvc           11  739  123        244        77          168        127        0
    nCoV-IP2-12696bProbe(+)c     8   17   2           4         1            6         4         0
    nCoV-IP4-14059Fwc            3    9   0           0         7            2         0         0
    nCoV-IP4-14146Rvc           11   38   7          7          9            9         1         5
    nCoV-IP4-14084Probe(+)c     11   49   3          12         6           19         5         4
    RdRP-SARSr-F2d               5   89   2           1         5           37         44        0
    RdRP-SARSr-R1d [195]         3    4   2          0          0            2         0         0
    RdRP-SARSr-P2d [195]         4   10   0          6          2            2         0         0
    ORF1ab-China-F [196]         4   19    0          4          2           6          5         2
    ORF1ab-China-R [196]         0    0    0          0          0           0          0         0
    ORF1ab-China-P [196]        14   61    1          6         30          11          3        10
    ORF1b-nsp14-HK-F [196]       6   12    2          1          6           3          0         0
    ORF1b-nsp14-HK-R[196]        9   89    3          9         52          14          6         5
    ORF1b-nsp14-HK-P[196]        6   37    2          1          9          13          0        12
    SC2-Fe                      11   88   0           5        34           29        13         7
    SC2-Re                       0    0   0           0         0            0         0         0
    NIID_WH-1_F501[201]         13  255    0        205         25          18          3         4
    NIID_WH-1_R913[201]         14  128    1         94          9          18          4         2
    NIID_WH-1_F509[201]         10   30    7          5          7           6          3         2
    NIID_WH-1_R854[201]          9  261   63         25         33         117          5        18
    NIID_WH-1_Seq[201] F519     19  130    8         89         17          11          3         2
    NIID_WH-1_Seq R840[201]     12   66    6          9         21           8          3        19
    WuhanCoV-spk1-f[201]        14  433  265         22         11         123          8         4
    WuhanCoV-spk1-r[201]         4   10    0          2          3           1          2         2
    NIID_WH-1_F24381[201]       20  494  275         30         16         153         13         7
    NIID_WH-1_R24873[201]        5   15    1          4          3           7          0         0
    NIID_WH-1_Seq_F24383[201]   21  503  275         30         22         153         13        10
    NIID_WH-1_Seq_R24865[201]    6   17    2          4          5           6          0         0
                                               121


    Table 6.2 provides all mutations on various primers and probes and their occurring
frequencies in various clusters, where SC is the sample counts and MC is the mutation
counts. More detailed mutation information is given in Tables S4-S56 of the Supporting
Material. We plot the mutation position and frequency for 54 primers and probes in this
work in Figure 6.2 - Figure 6.6.
    It is noted that N-China-F [196] is the mostly-used reagent among all primers/probes,
but the primer target gene of SARS-CoV-2 has 15 mutations involving thousands of sam-
ples, which may account for low efficacy of certain COVID-19 diagnostic kits in China
according to this website. Note that primers and probes typically have a small length of
around 20 nucleotides.
    Currently, most primers and probes used in the US target are the N gene [196]. How-
ever, Table 6.2 shows that a plurality of mutations has been found in all of the targets of
the US CDC designated COVID-19 diagnostic primers. The targets of N gene primers and
probes used in Japan, Thailand, and China, including Hong Kong, have undergone mul-
tiple mutations involving many clusters. Therefore, the N gene may not be an optimal
target for diagnostic kits, and the current test kits targeting the N gene should be updated
accordingly for testing accuracy.
    It can be seen that so far, no mutation has been detected on ORF1ab-China-R and SC2-
R, showing that they are two relatively reliable diagnostic primers. Notably, the targets
of four E gene primers and probes have only six mutations. Also, no mutation has been
found on the targets of ORF1ab-China-R and SC2-R. However, the target of nCoV-IP2-
12759R recommended by Institute Pasteur, Paris has six mutations. Overall, targets of the
envelope and RNA-dependent RNA polymerase based primers and probes have fewer
mutations than the N gene. This observation leads to an assumption that the N gene is
particularly prone to mutations.
                                              122


                  RX7038-N1 primer (Fw)                           RX7038-N1 primer (Rv)
                                                   50
   10
                                                   25
    0                                               0
      G A C C C C A A A A T C A G C G A A A T         CAGAT TCAACTGGCAGTAACCAGA
                  RX7038-N2 primer (Fw)                           RX7038-N2 primer (Rv)
                                                   20
   20
                                                   10
    0                                               0
      T T A C A A A C A T T G G C C G C A A A         T T C T T C G G A A T G T C G C G C
                  RX7038-N3 primer (Fw)                           RX7038-N3 primer (Rv)
  200
                                                   20
  100                                              10
                                                    0
      G G G A G C C T T G A A T A C A C C A A A A     C A A T G C T G C A A T C G T G C T A C A
                        N1-U.S-P                                       N2-U.S-P
  200
                                                   40
  100                                              20
      A C C C C G C A T T A C G T T T G               ACA A T T T GCCCCCAGCGC T T CAG
                        N3-U.S-P                                      N-Sarbeco-F
   20                                              20
    0                                               0
      ATCACAT TGGCACCCGCAATCCTG                       C A C A T T G G C A C C C G C A A T C
Figure 6.2: Illustration of mutation positions and frequencies on the primer and/or probes
of RX7038-N1 primer (Fw), RX7038-N1 primer (Rv), RX7038-N2 primer (Fw), RX7038-N2
primer (Rv), RX7038-N3 primer (Fw), RX7038-N3 primer (Rv), N1-U.S.-P, N2-U.S.-P, N3-
U.S.-P, N-Sarbeco-F.
6.1.2   Discussions
Mechanisms of mutation and mutation impact on diagnostics The accumulation of the
frequency of virus mutations is due to natural selection, polymerase fidelity, cellular envi-
ronment, features of recent epidemiology, random genetic drift, host immune responses,
gene editing [202], replication mechanism, etc [203, 204]. SARS-CoV-2 has a higher fi-
delity in its transcription and replication process than other single-stranded RNA viruses
because it has a proofreading mechanism regulated by NSP14 [205]. However, 13402 sin-
gle mutations have been detected from 31421 SARS-CoV-2 genome isolates.
    Due to technical constraints, genome sequencing is subject to errors. Some “muta-
                                                123


                      N-Sarbeco-P                                     N-Sarbeco-R
   20                                             50
                                                  25
    0                                              0
       ACTTCCTCAAGGAACAACATTGCCA                     C A A G C C T C T T C T C G T T C C T C
                       N-China-F                                       N-China-R
  200
                                                 100
  100
                                                  50
                                                   0
       GGGG A A C T T C T C C T G C T A G A A T      C A G C T T G A G A G C A A A A T G T C T G
                       N-China-P                                        N-HK-F
   10
                                                 100
    5                                             50
    0                                              0
       T T G C T G C T G C T T G A C A G A T T       T A A T C A G A C A A GG A A C T G A T T A
                        N-HK-R                                           N-JP-F
                                                  20
   20                                             10
    0
       C A T G G A A G T C A C A C C T T C G         A A A T T T T G G G G A C C A G G A A C
                        N-JP.P                                          N-TL-F
   20
                                                  50
   10                                             25
    0                                              0
       A T G T C G C G C A T T G G C A T G G A       C G T T T G G T G G A C C C T C A G A T
Figure 6.3: Illustration of mutation positions and frequencies on the primer and/or probes
of N-Sarbeco-P, N-Sarbeco-R, N-China-F, N-China-R, N-China-P, N-HK-F, N-HK-R, N-JP-
F, N-JP-P, N-TL-F.
tions” might result from sequencing errors, instead of actual mutations. Additionally,
mRNA editing, such as APOBEC [202], in defending virus invasion in the human im-
mune system can create fatal mutations. Both cases may lead to single-nucleotide poly-
morphisms (SNPs) without a descendant. We report that among all of 31421 genome
isolates, 13402 individual mutations have at least one descendant.
    It is well known that the sensitivity of diagnostic primers and probes depends on their
target positions. Specifically, the beginning part of a primer or probe is not as important as
its ending part. A high-frequency mutation on the right end of a primer or probe position
of a target would possibly produce more false-negatives in diagnostics. Also, importantly,
                                              124


                         N-TL-R                                         N-TL-P
  20                                            20
    0                                             0
      A A T G G A G A A C G C A G T G G G G         C A A C T G G C A G T A A C C A
                      E-Sarbeco-F1                                   E-Sarbeco-R2
                                                5.0
  10
                                                2.5
    0                                           0.0
      ACAGGTACGT TAATAGT TAATAGCGT                  T G T G T G C G T A C T G C T G C A A T A T
                      E-Sarbeco-P1                                nCoV-IP2-12669Fw
                                                  2
  20
  10                                              1
    0                                             0
      ACACTAGCCATCCTTACTGCGCTTCG                    A T G A G C T T A G T C C T G T T G
                   nCoV-IP2-12759Rv                            nCoV-IP2-12696bProbe(+)
                                                  4
  20                                              2
    0                                             0
      A C A A C A C A A C A A A G G G A G           A G A T G T C T T G T G C T G C C G G T A
                   nCoV-IP4-14059Fw                               nCoV-IP4-14146Rv
  5.0                                           10
  2.5
  0.0                                             0
      G G T A A C T G G T A T G A T T T C G         C C T A T A T T A A C C T T G A C C A G
Figure 6.4: Illustration of mutation positions and frequencies on the primer and/or probes
of N-TL-R, N-TL-P, E-Sarbeco-F1, E-Sarbeco-R2, E-Sarbeco-P1, nCoV-IP2-12669Fw, nCoV-
IP2-12759Rv, nCoV-IP2-12696bProbe(+), nCoV-IP4-14059Fw, nCoV-IP4-14146Rv.
for primers involving significant mutations, polymerase chain reaction (PCR) annealing
temperatures are estimated based on correctly matched sequences [206]. Annealing tem-
peratures for primers and probes involving mutations of are given in Tables S4-S56 of the
Supporting Material.
     Nucleotide-based diagnostic target optimization Table 6.2 shows that the degree of
mutations on various diagnostic targets vary dramatically. Therefore, it is of great im-
portance to know how to select an optimal viral diagnostics target to avoid potential
mutations. We discuss such a target optimization via both nucleotide-based analysis and
gene-based mutation analysis.
                                             125


                  nCoV-IP4-14084Probe(+)                           RdRP-SARSr-F2
  20                                             50
  10                                             25
   0                                              0
      T C A T A C A A A C C A C G C C A G G T       G T G A A A T G G T C A T G T G T G G C G G
                     RdRP-SARSr-R1                                 RdRP-SARSr-P2
   2
                                                  4
   1                                              2
   0                                              0
      TATGCTAATAGTGTTTTTAACATTTG                    CAGGTGGAACCTCATCAGGAGATGC
                     ORF1ab-China-F                                ORF1ab-China-R
  10                                             10
   5                                              5
   0
      C C C T G T G G G T T T T A C A C T T A A     T C A G C T G A T G C A C A A T C G T
                     ORF1ab-China-P                               ORF1b-nsp14-HK-F
  20                                              2
  10
   0                                              0
      CCGTCTGCGGT A TGTGGAAAGGT T A TGG             T G G G G T T T T A C A G G T A A C C T
                    ORF1b-nsp14-HK-R                              ORF1b-nsp14-HK-P
  40                                             20
  20                                             10
   0                                              0
      G A G T G C T T T G T T A A G C G T G T T     T AGT TGTGA TGCAA TCA TGACT AG
Figure 6.5: Illustration of mutation positions and frequencies on the primer and/or
probes of nCoV-IP4-14084Probe(+), RdRP-SARSr-F2, RdRP-SARSr-R1, RdRP-SARSr-P2,
ORF1ab-China-F, ORF1ab-China-R, ORF1ab-China-P, ORF1b-nsp14-HK-F, ORF1b-nsp14-
HK-R, ORF1b-nsp14-HK-P.
     Figure 6.7 illustrates the rates of 12 different types of mutations among 31421 SNP
variants. It is interesting to note that 51.4% mutations on the SARS-CoV-2 are of C>T
type, due to strong host cell mRNA editing knows as APOBEC cytidine deaminase [202].
Therefore, researchers should avoid cytosine bases as much as possible when designing
the diagnostic test kits.
     Gene-based diagnostic target optimization
     To further understand how to design the most reliable SARS-CoV-2 diagnostic targets,
we carry out gene-level mutation analysis. Figure 6.8 and Table 6.3 present the muta-
tion ratio, i.e., the number of unique single-nucleotide polymorphisms (SNPs) over the
                                              126


                                            SC2-F                                          SC2-R
                                                                    10
                      50
                                                                     5
                       0
                         C TGCAGA T T TGGA TGA T T T C T CC            CTAAACTCATGCAGACCACACAAGG
                                       NIID_WH-1_F501                                 NIID_WH-1_R913
                     200
                                                                    50
                       0                                             0
                         T T C G G A T G C T C G A A C T G C A C C     CC T T C T AGCACGTGC TGGT A A AG
                                       NIID_WH-1_F509                                 NIID_WH-1_R854
                      10
                                                                   100
                       0                                             0
                         C T C G A A C T G C A C C T C A T G G         G C T A T G T C G A T A A C A A C T T C T G
                                     NIID_WH-1_Seq F519                             NIID_WH-1_Seq R840
                      50                                            20
                       0                                             0
                         A C C T C A T G G T C A T G T T A T G G       G G C A T A C A C T C G C T A T G T C
                                       WuhanCoV-spk1-f                                WuhanCoV-spk1-r
                     250                                             2
                       0                                             0
                         T TGGCAAAAT TCAAGACTCACT T T                  CACAAAGGAATTTTTATGAACCACA
                                      NIID_WH-1_F24381                               NIID_WH-1_R24873
                     100
                                                                     5
                      50
                                                                     0
                         T C A A G A C T C A C T T T C T T C C A C     GTGA AGGTGT C T T TGT T T CA A A T
                                    NIID_WH-1_Seq F24383                           NIID_WH-1_Seq R24865
                     100
                                                                     5
                      50
                                                                     0
                         A A G A C T C A C T T T C T T C C A C A G     C C T C G T G A A G G T G T C T T T G
Figure 6.6: Illustration of mutation positions and frequencies on the primer and/or probes
of SC2-F, SC2-R,NIID_WH-1_F501,NIID_WH-1_R913, NIID_WH-1_F509, NIID_WH-
1_R85, NIID_WH-1_Seq F519, NIID_WH-1_Seq R840, WuhanCoV-spk1-f, WuhanCoV-
spk1-r, NIID_WH-1_F24381, NIID_WH-1_R24873, NIID_WH-1_Seq F24383, NIID_WH-
1_Seq R24865.
corresponding gene length, for each SARS-CoV-2 gene. A smaller mutation ratio for a
given gene indicates a higher degree of conservativeness. Clearly, the ORF7b gene has
the smallest mutation ratio of 0.155, while the ORF7a gene has the largest mutation ratio
of 0.642. The N gene has the fourth-largest mutation rate of 0.558, which is very close to
the largest ratio of 0.594 for the ORF3a gene and 0.559 for the ORF8 gene. Additionally,
two ends of the SARS-CoV-2 genome, i.e., NSP1, NSP2, ORF10, N gene, ORF8, ORF7a,
and ORF6, exception for ORF7b, have higher mutation ratios. Considering the mutation
frequency, we introduce the mutation h-index, defined as the maximum value of h such
that the given gene section has h single mutations that have each occurred at least h times.
Normally, larger genes tend to have a higher h-index. Figure 6.8 shows that, with a mod-
erate length, the N gene has the second-largest h-index of 44, which is close to the largest
                                                                127


                             Figure 6.7: The pie chart of the distribution of 12 different types of mutations.
                                                 2626                                                                                                                                                                                                            363
                 0.6
                                                                                                                                                                      Mutation ratio                              825                                                                               701               40
                 0.5                                                                                                                                                                                       1651                                      183                                363
                                                                                                                                                                                                                                                                                                      1257   114
Mutation ratio
                                    973                                                                                                                               h-index
                       540
                                                                                                                                                                           1038
                                   1914
                                          5835                                                                                                                       706                                                                                                                                              30
                                                                                                                                                                                                                                                                                                                           h-index
                 0.4                                                                                                                             1030                                               3819                490   225
                                                          604
                                                                918         870         249        594         339                                        653                           894                                              666
                 0.3                                                                                                       417                     2769           1581
                                                        1500                                                                                               1803                                                                                271                                                                    20
                                                                      353         348                                                  39
                             273                                                                                                                                                  476         358
                 0.2                                                                                                                                                                                                                                                                          203
                                                                                                                                                                                                                                                                       233
                                                                                                         242         135                                                                                                            95                                                                             61 10
                 0.1
                                                                                              99
                                                                                                                                 147                                                                                                                       101               129
                                                                                                                                            11                                                                                                                                     20
                  0                                                                                                                                                                                                                                                                                                   0
                       SP1         SP2    SP3           SP4   3C
                                                                L           SP6         SP7        SP8         SP9     SP          SP            Rp       e
                                                                                                                                                          as        se     se           es           e
                                                                                                                                                                                                    ik       RF          ve
                                                                                                                                                                                                                              e          e
                                                                                                                                                                                                                                         an      RF
                                                                                                                                                                                                                                                     6       RF              7b          8
                                                                                                                                                                                                                                                                                        RF          id       10
                                                                                                                         10           11     Rd       el       le a        A       Ta s        Sp              3a           lo       br                        7a       RF                          ps   RF
                   N          N          N         N                   N           N           N          N          N           N                      ic     uc        RN                                O                   p                O          O                        O          ca
                                                                                                                                                   H         on      do       se                                    En              em                                 O                    le o
                                                                                                                                                                                                                                                                                                         O
                                                                                                                                                          Ex        en           M                                             M                                                         uc
                                                                                                                                                                           bo                                                                                                           N
                                                                                                                                                                         -ri
                                                                                                                                                                      -O
                                                                                                                                                                     2’
Figure 6.8: Illustration of SARS-CoV-2 mutation ratio and mutation h-index one various
genes. For each gene, its length is given in the mutation ratio bar while the number of
unique SNPs is given in the h-index bar.
                                                                                                                                                             128


h-index of 47 for NSP3. Therefore, selecting SARS-CoV-2 N gene primers and probes as
diagnostic reagents for combating COVID-19 is not an optimal choice. Moreover, a few
primers and probes used in Japan are designed on the spike and NSP2 gene. However,
the high mutation ratio and h-index of spike and NSP2 gene indicate that these diagnos-
tic reagents may not perform well. Furthermore, we design a website called Mutation
Tracker to track the single mutations on 26 SARS-CoV-2 proteins, which will be an in-
tuitive tool to inform other research on regions to be avoided in future diagnostic test
development.
      Table 6.3: Gene-specific statistics of SARS-CoV-2 single mutations on 26 proteins.
  Gene type                       Gene site   Gene length Unique SNPs mutation ratio h-index
  NSP1                             266:805        540          273       0.506          19
  NSP2                            806:2719       1914          973       0.508          36
  NSP3                           2720:8554       5835         2626       0.450          47
  NSP4                           8555:10054      1500          604       0.403          25
  NSP5(3CL)                     10055:10972       918          353       0.385          22
  NSP6                          10973:11842       870          348       0.400          22
  NSP7                          11843:12091       249           99       0.398          12
  NSP8                          12092:12685       594          242       0.407          14
  NSP9                          12686:13024       339          135       0.398          13
  NSP10                         13025:13441       417          147       0.353          11
  NSP11                         13442:13480        39           11       0.282           4
  RNA-dependent-polymerase      13442:16236      2796         1030       0.368          31
  Helicase                      16237:18039      1803          653       0.362          29
  3’-to-5’ exonuclease          18040:19620      1581          706       0.447          27
  endoRNAse                     19621:20658      1038          476       0.459          19
  2’-O-ribose methyltransferase 20659:21552       894          358       0.400          20
  Spike protein                 21563:25384      3819         1651       0.432          42
  ORF3a protein                 25393:26220       825          490       0.594          32
  Envelope protein              26245:26472       225           95       0.422          13
  Membrane glycoprotein         26523:27191       666          271       0.407          23
  ORF6 protein                  27202:27387       183          101       0.552          12
  ORF7a protein                 27394:27759       363          233       0.642          16
  ORF7b protein                 27756:27887       129           20       0.155           5
  ORF8 protein                  27894:28259       363          203       0.559          18
  Nucleocapsid protein          28274:29533      1257          701       0.558          44
  ORF10 protein                 29558:29674       114           61       0.535          12
                                               129


6.1.3   Conclusion
In summary, the targets of currently used COVID-19 diagnostic tests have numerous mu-
tations that impact the diagnostic test accuracy in identifying COVID-19. There is a need
for continued surveillance of viral evolution and diagnostic test performance, as the emer-
gence of viral variants that are no longer detectable by certain diagnostics tests is a real
possibility. A cocktail test kit is needed to mitigate mutations. We propose nucleotide-
based and gene-based diagnostic target optimizations to design the most reliable diag-
nostic targets. We analyze a full list of SNPs for all 31421 genome isolates, including
their positions and mutation types. This information, together with ranking of the de-
gree of the conservativeness of SARS-CoV-2 genes or proteins given in Table 6.3, enables
researchers to avoid non-conservative genes (or their proteins) and mutated nucleotide
segments in designing COVID-19 diagnosis, vaccine, and drugs.
6.2   Mechanisms of SARS-CoV-2 evolution
The mechanism of mutagenesis is driven by various competitive processes [203, 204, 207,
208, 24], which can be categorized into 3 different scales with many factors as illustrated
in Figure 6.9 a: 1) the molecular scale, 2) the organism scale, and 3) the population scale.
From the molecular-scale perspective, the random shifts, replication errors, transcription
errors, translation errors, viral proofreading, and viral recombination are the main driven
sources. Moreover, the host gene editing induced by the adaptive immune response [24]
and the recombination between the host and virus are the key-driven factors at the organ-
ism level. Furthermore, the natural selection popularized by Charles Darwin is a critical
process, which favors mutations that have reproductive advantages for the virus to have
adaptive traits in evolution. Such complicated mechanisms of viral mutagenesis make
the comprehension of viral transmission and evolution a grand challenge.
    Although there are 28,780 unique single mutations distributed evenly on the whole
SARS-CoV-2 genome, the mutations on the S gene stand out among all 29 genes on SARS-
                                              130


CoV-2 due to the mechanism of viral infection. Under assistant with host transmembrane
protease, serine 2 (TMPRSS2), SARS-CoV-2 enters the host cell by interacting with its
S protein and the host angiotensin-converting enzyme 2 (ACE2) [37] (See Figure 6.9 b).
Later on, antibodies will be generated by the host immune system, aiming to eliminate
the invading virus through direct neutralization or non-neutralizing binding [209, 210],
which makes the S protein the main target for the current vaccines. Specifically, there is
a short immunogenic fragment located on the S protein of SARS-CoV-2 that can facili-
tate the SARS-CoV-2 S protein binding with ACE2, which is called the receptor-binding
domain (RBD) [211]. Studies have shown that the binding free energy (BFE) between
the S RBD and the ACE2 is proportional to the infectivity [212, 213, 214, 37, 28]. There-
fore, tracking and monitoring the RBD mutations and their corresponding BFE changes
will expedite understanding the infectivity, transmission, and evolution of SARS-CoV-2,
especially for the new SARS-CoV-2 variants, such as Alpha, Beta, Gamma, Delta, and
Lambda, etc. [21]
    The current prevailing variants Alpha, Beta, Gamma, Delta, Kappa, Theta, Lambda,
and Mu carry at least one vital mutation at residues 452 and 501 on the S RBD 2 . Notably,
in July 2020, we successfully predicted that residues 452 and 501 "have high chances to
mutate into significantly more infectious COVID-19 strains" [41]. In the same work, we
hypothesized that “natural selection favors those mutations that enhance the viral trans-
mission" and provided the first evidence for infectivity-based natural selection. In other
words, we revealed the mechanism of SARS-CoV-2 evolution and transmission based on
very limited genome data in July 2020 [41]. Additionally, we predicted three categories
of RBD mutations: 1) most likely (1149 mutations), 2) likely (1912 mutations), and 3) un-
likely (625 mutations) [41]. Up to now, all of the RBD mutations we detected fall into
our first category [102, 2]. Until now, all of the top 100 most observed RBD mutations
have BFE change greater than the average BFE changes of -0.28kcal/mol (the average
    2
      This work was published in 2020
                                            131


BFE changes for all RBD mutations[215]). There are extremely low odds (i.e.,           1
                                                                                   1.27×1030
                                                                                             )
for 100 RBD mutations to accidentally have BFE changes simultaneously above the av-
erage value. This provides convincing evidence for our hypothesis that the transmission
and evolution of new SARS-CoV-2 variants are governed by infectivity-based natural se-
lection, despite all other competing mechanisms [41]. Our predictions rely on algebraic
topology [100, 101, 4]-assisted deep learning [40, 41], but have been extensively validated
[102, 99].
    However, infectivity is not the only transmission pathway that governs viral evolu-
tion. Vaccine-resistant mutations or more precisely, antibody-resistant mutations, that
can disrupt the protection of antibodies has become a viable mechanism for new variants
to transmit among the vaccinated population since the vaccine was put on the market. In
early January 2021, we have predicted that RBD mutations W353R, I401N, Y449D, Y449S,
P491R, P491L, Q493P, etc., will weaken most antibody bindings to the S protein [102].
Later on, we have provided a list of most likely vaccine escape RBD mutations with high
frequency, including S494P, Q493L, K417N, F490S, F486L, R403K, E484K, L452R, K417T,
F490L, E484Q, and A475S [2]. Moreover, we have pointed out that Y449S and Y449H
are two vaccine-resistant mutations, and “Y449S, S494P, K417N, F490S, L452R, E484K,
K417T, E484Q, L452Q, and N501Y" are the top 10 mutations that will disrupt most anti-
bodies with high-frequency [215]. As mentioned in Ref. [216], RBD mutations such as
E484K/A, Y489H, Q493K, and N501Y found in late-stage evolved S variants “confer re-
sistance to a common class of SARS-CoV-2 neutralizing antibodies", which suggests the
viral evolution is also regulated by vaccine-resistant mutations.
6.2.1  Evolutionary trajectories of viral RBD single mutations
Studying the mechanisms of SARS-CoV-2 mutagenesis is beneficial to the understand-
ing of viral transmission and evolution. The mainly driven force of viral evolution is
regulated by natural selection, which is employed by two complementary transmission
                                            132


pathways: 1) infectivity-based pathway and 2) vaccine-resistant pathway. We have dis-
cussed the infectivity-based pathways in Ref.[215] and [39]. This section focuses on the
vaccine-resistant pathway and its impact on the transmission and evolution of SARS-
CoV-2. To understand the mechanisms of vaccine-resistant mutations, we first analyze
1,983,328 complete SARS-CoV-2 genomes, and a total of 28,780 unique single mutations
are decoded. Among them, there are 737 non-degenerate RBD mutations. The infectivity
of SARS-CoV-2 is proportional to the BFE between the S RBD and ACE2 [212, 213, 214,
37, 28]. Therefore, the BFE change induced by a specific RBD mutation reveals whether
the RBD mutation is an infectivity-strengthen or an infectivity-weaken mutation. Simi-
larly, the BFE change between S RBD and antibody induced by a given mutation reveals
whether this mutation will strengthen the binding between S and antibody or not.
    Up to now, we have collected 130 antibody structures (see the Supporting Informa-
tion S4), which includes Food and Drug Administration (FDA)-approved mAbs from Eli
Lilly and Regeneron. For a specific RBD mutation, its antibody disruption count shows
the number of antibodies that have antibody-S BFE changes smaller than -0.3 kcal/mol.
The ACE2-S and antibody-S BFE changes induced by RBD mutations are predicted from
our TopNetTree model [41], which is available at TopNetmAb. All of the predicted BFE
changes induced by RBD mutations can be found at Mutation Analyzer. Figure 6.9 c
illustrates the top 25 most observed RBD mutations. The height and color of each bar
represent the ACE2-S BFE changes and frequency of each RBD mutation. The number
at the top of each bar shows the antibody disruption count of each mutation. The de-
tailed information can be viewed in Supplementary Information S4. It can be seen that
23 mutations have positive ACE2-S BFE changes, suggesting they are regulated by the
infectivity-based transmission pathway.
    Howbeit, 2 RBD mutations D427N and Y449S, have negative BFE changes. Notably,
mutation Y449S has a significantly negative BFE change (-0.8112 kcal/mol) and a pretty
large antibody disruption count (89), revealing a non-typical mechanism of mutagenesis.
                                           133


Such a mutation with significantly negative ACE2-S BFE change together with a high an-
tibody disruption count is called a vaccine-resistant or antibody-resistant mutation. Fig-
ure 6.9 d is the illustration of SARS-CoV-2 S protein (blue color) with human ACE2 (pink
color), and the Y449 residue (purple color) is located on the random coil of the S protein.
Among all of the vaccine-resistant mutations, Y449S has the highest frequency (1189). In
addition, at residue 449, mutations Y449H, Y449N, Y449D are all vaccine-resistant muta-
tions that have been observed in more than 20 SARS-CoV-2 genome isolates.
                            a                                                                         c
                                                                                                                                                                                                                   14
                                               m     Proof
                                          Reco
                                                               s                                       1 2 27
                                                             Tra                                                                                                                                                   13
                                            b                    n      cr
                                                                                                                                                                                                                        Natural log of frequency
                                                                     Tr
                                                                       ns
                                                                                                                0 39                                                                                               12
                                                                        a                                            24                                               Vaccine-resistant mutation
                                                                                                                          11
                                          Molecular scale                        p                    0.5                      17 51                                                                               11
                                                                                                                                       3
                                                                              Re
                                                                                         BFE change
                                                                                                                                           10 18
                                                                                                                                                                                                                   10
                                                                                                                                                                                           S477I   D427N   Y449S
                                                                                                                                                   11 0 53 9 2 3 5
                                                                                                                                                                   2 6 38 62
                                                                                                                                                                             1 0
                                                                                Shift                                                                                            0 37 30                           9
                                             Mechanism of                                              0
                                                                                                                                 V367F                         E484Q                         2
                                                                                                                                                       K417N
                                                                                                            N440K   N501T                                      K417T
                                                                                                                                                               S477N
                                             SARS-CoV-2                                                                                                                                                            8
                                                                                                                                 N439K
                                                                                                                                                                                                    1
                                                                                                                                                               V367L
                                                                                                            L452Q   A411S        P348L
                                                                                                                                 A475V
                                                                                                                                                               A522V
                                                                                                                                                               S494P
                                                                                                                                                               E484K
                                                                                                                    N501Y                                      R346K
                                             Mutagenesis                                                                         P479S                         A522S
                                                                                                                                                               R357K
                                                                                                            T478K   L452R        F490S                         A520S
                                                                                                                                                               N440S
                                                                                                                                                                                                                   7
                                                                                                                                                               G446V
                                                                                                -0.5
                                                             Population                                                                                                                                            6
                                                               scale
                                               Organism                                                                                                                                                            5
                                                scale                                                                                                                                                      85
                                 Gene                                 Natural
                                          Recom
                                editing
                                                b
                            b                                                                         d
                                                    TMPRSS2
                                                                                                                                                                                          Spike
                                                     ACE2
                                                                Spike
                                     Host cell                              SARS-CoV-2                                                         ACE2                                        Y449
Figure 6.9: a The mechanism of mutagenesis. Nine mechanisms are grouped into three
scales: 1) molecular-based mechanism (green color); 2) organism-based mechanism (red
color); 3) population-based mechanism (blue color). The random shifts (Random), repli-
cation error (Rep), Transcription error (Transcr), viral proofreading (Proof), and recom-
bination (Recomb) are the six molecular-based mechanisms. The gene editing and the
host-virus recombination are the organism-based mechanism. In addition, the natural se-
lection (Natural) is the population-based mechanism, which is the mainly driven source
in the transmission of SARS-CoV-2. b A sketch of SARS-CoV-2 and its interaction with
host cell. c Illustration of 25 single-site RBD mutations with top frequencies. The height of
each bar shows the BFE change of each mutation, the color of each bar represents the nat-
ural log of frequency of each mutation, and the number at the top of each bar means the
AI-predicted number of antibody and RBD complexes that may be significantly disrupted
by a single site mutation. d Illustration of SARS-CoV-2 S protein with human ACE2. The
blue chain represents the human ACE2, the pink chain represents the S protein, and the
purple fragment on the S protein points out the two vaccine-resistant mutations Y449S/H.
   To track the evolution trajectory of vaccine-resistant mutations, the BFE changes, log2
enrichment ratios 3 , and log10 frequencies of RBD mutations are analyzed from April 30,
   3
       Log2 enrichment ratio is collected from the experimental deep mutation enrichment data in Ref. [3]
                                                                                                                    134


2020, to August 23, 2021, in every 60 days, as illustrated in Figure 6.10. Here, the top 100
most observed RBD mutations are displayed. In Figure 6.10 a, red stars mark the vaccine-
resistant mutations that have negative BFE changes. Although a few vaccine-resistant
mutations S438F, I434K, Y505C, and Q506K were detected before November 2020, they
had relatively low frequencies. However, since December 2020, such vaccine-resistant
mutations were no longer in the top 100 most observed RBD mutation list, suggesting
that in this period, the evolution of SARS-CoV-2 is mainly regulated by natural selection
through the infectivity-based transmission pathway. Notably, in May 2021, two vaccine-
resistant mutations Y449S and Y449H, came back to the top 100 most observed RBD mu-
tation list. In addition, Y449S has a relatively high frequency. Such finding indicates that
natural selection not only favors those mutations that enhance the transmission but also
those mutations that can disrupt plenty of antibodies since SARS-CoV-2 vaccines started
to provide protection among populations in early May. Similarly, patterns can be found
in Figure 6.10 b, suggesting our AI-predicted BFE changes are highly consistent with the
deep mutational enrichment ratio from experiments [3].
6.3    Mutational impacts on SARS-CoV-2 infectivity
Recently, the SARS-CoV-2 variants from the United Kingdom (UK), South Africa, and
Brazil have received much attention for their increased infectivity, potentially high vir-
ulence, and possible threats to existing vaccines and antibody therapies. The question
remains if there are other more infectious variants transmitted around the world. We
carry out a large-scale study of 506,768 SARS-CoV-2 genome isolates from patients to
identify many other rapidly growing mutations on the spike (S) protein receptor-binding
domain (RBD). We reveal that essentially all 100 most observed mutations strengthen the
binding between the RBD and the host angiotensin-converting enzyme 2 (ACE2), indi-
cating the virus evolves toward more infectious variants. In particular, we discover new
fast-growing RBD mutations N439K, S477N, S477R, and N501T that also enhance the RBD
                                              135


                                a                                                                           b
                                                                                04/30/20                                                                   04/30/20
                                2
                                1                                  I434K            S438F                                                      I434K          S438F
                                0                                    *               *                                                          *                *
                                                                                06/29/20                                                                   06/29/20
                                                       Y505C                                                                         Y505C
                                2
                                                                        Q506K                       S438F                                        Q506K
                                1                                                                                                                                            S438F
                                                        *                *                                                            *              *
                                0                                                                    *                                                                        *
                                4                                               08/28/20                                                                   08/28/20
                                                                Y505C
                                3
                                2                                            Q506K                                                           Y505C       Q506K
                                1                                *              *                                                             * *
                                0
                                4                                               10/27/20                                                                   10/27/20
                                                                                            Y505C                                                                    Y505C
                                3
                                2
                                1
                                0
                                                                                             *                                                                        *
                                4                                                   12/26/20                                                               12/26/20
                                3
             Log10(Frequency)
                                2
                                1
                                0
                                                                                    02/24/21                                                                02/24/21
                                4
                                2
                                0
                                                                                    04/25/21                                                                04/25/21
                                4
                                2
                                0
                                6                                                   06/24/21                                                                06/24/21
                                    Y449S
                                                                                                                  Y449S
                                4     *          Y449H
                                                                                                                    *         Y449H
                                2                  *                                                                            *
                                0
                                6                                                   08/23/21                                                                08/23/21
                                4     Y449S                                                                         Y449S
                                                       Y449H                                                                     Y449H
                                          *                 *                                                        *                 *
                                2
                                0
                                6                                                   10/22/21                                                               10/22/21
                                4     Y449S                                                                         Y449S
                                                                Y449H
                                2
                                          *                          *                                                  *                    Y449H
                                                                                                                                               *
                                0
                                          BFE change                                                        Log2 enrichment change
                                                         -2             -1            0              1                                -3      -1          0           1       2
Figure 6.10: Most significant RBD mutations. a Time evolution of RBD mutations with its
mutation-induced BFE changes per 60-day from April 30, 2020, to August 31, 2021. Here,
only the top 100 most observed RBD mutations are displayed. The height and color of
each bar represent the log frequency and ACE-S BFE change induced by a given RBD mu-
tation. The red star marks the vaccine-resistant mutations with significantly negative BFE
changes. b Time evolution of RBD mutations with its experimental mutation-induced
log2 enrichment ratio changes per 60-day from April 30, 2020, to August 31, 2021. The
height and color of each bar represent the log frequency and enrichment ratio change
induced by a given RBD mutation. The red star marks vaccine-resistant mutations with
significantly negative BFE changes.
                                                                                         136


and ACE2 binding. We further unveil that mutation N501Y involved in United Kingdom
(UK), South Africa, and Brazil variants may moderately weaken the binding between the
RBD and many known antibodies, while mutations E484K and K417N found in South
Africa and Brazilian variants, L452R and E484Q found in India variants, can potentially
disrupt the binding between the RBD and many known antibodies. Among these RBD
mutations, L452R is also now known as part of the California variant B.1.427.
6.3.1  Impacts of S RBD single mutation on SARS-CoV-2 Infectivity
The RBD is located on the S1 domain of the S protein, which plays a vital role in binding
with the human ACE2 to get entry into host cells. The mutations that are detected on the
RBD may affect the binding process and lead to the BFE changes. In this section, we ap-
ply the TopNetTree model [217] to predict the mutation-induced BFE changes of RBD and
ACE2. Figure 6.11 illustrates the predicted BFE changes for S protein and human ACE2
induced by single-site mutations on the RBD. Here, we consider 100 most observed mu-
tations. The bar plot of the other mutations on S RBD can be found in the Supporting
Information. In this figure, a total of 100 most observed mutations are displayed. Among
them, 9 mutations induced negligible negative BFE changes, while the other 91 muta-
tions are binding-strengthening mutations. Mutation T478K has the largest BFE change
which is nearly 1 kcal/mol. It may have made the Mexico variant B.1.1.222 the most
infectious observed variant.
    To be noted, the residue T478 is not conservative among different species. The N501Y,
S477N, L452R, N439K, and E484K mutations are the top mutations with significant fre-
quencies. Among them, the N501Y and L452R mutations have a relatively high BFE
change of 0.55 kcal/mol and 0.58kcal/mol. Moreover, the frequency and predicted BFE
changes are both at a high level for mutations N501T, Y508H. Figure 6.12 illustrates the
time evolution of 651 binding-strengthening (blue) and binding-weakening mutations
(red) on the S protein RBD. Here, the y-axis reveals the natural log frequency of each mu-
                                             137


                         1.0
                                                                                                            105
                         0.8
BFE changes (kcal/mol)
                         0.6                                                                                104
                                                                                                              Frequency
                         0.4
                                                                                                            103
                         0.2
                         0.0                                                                                102
                                L335F
                                F338L
                               G339D
                               A344S
                               R346K
                               R346S
                               A348S
                               A352S
                               N354D
                               N354K
                               K356R
                               R357K
                               V362F
                               V367L
                               V367F
                               V367A
                               N370H
                               N370S
                               A372V
                               S373P
                               S373L
                                T376I
                               K378N
                               V382L
                               P384S
                                P384L
                               T385N
                                T385I
                               R403K
                               E406Q
                               R408K
                                R408I
                                I410V
                               A411S
                               Q414K
                               Q414R
                               K417T
                               K417N
                               D427N
                               D427Y
                                T430I
                                I434V
                               N439K
                               N440K
                               K444R
                               K444N
                               V445A
                               G446V
                               N450K
                               L452M
                               L452R
                               Y453F
                                L455F
                               K458N
                               S459Y
                               S459F
                               P463S
                                I468V
                               T470N
                                T470I
                               E471Q
                               A475S
                               A475V
                               G476S
                               S477G
                               S477N
                                S477I
                               S477R
                               T478K
                               T478R
                                T478I
                               P479S
                                P479L
                               N481K
                               V483F
                               V483A
                               E484K
                               E484Q
                                F486L
                                F490L
                               F490S
                               Q493R
                               Q493L
                               Q493H
                               S494P
                               G496S
                               N501Y
                               N501T
                                V503I
                               Y508H
                               S514F
                               E516Q
                                L517F
                               H519Q
                               A520S
                               A520V
                               P521S
                               A522P
                               A522S
                               A522V
Figure 6.11: Illustration of SARS-CoV-2 mutation-induced BFE changes for the complexes
of S protein and ACE2. Here, 100 most observed mutations on S RBD are illustrated.
tation. Based on the our previous findings in [41], at this stage, 651 out of 1149 RBD
mutations that we predicted as "most likely" mutations have been observed, and none
of the 1912 "likely" and 625 "unlikely" mutations are tracked on the S protein RBD, sug-
gesting the reliability of our model for predicting the BFE changes of S protein RBD and
ACE2. Among 651 mutations that are detected on RBD, mutations N501Y, S477N, L452R,
N439K, and E484K have the highest frequency up to April 18, 2021.
Figure 6.12: Illustration of the time evolution of 424 ACE2 binding-strengthening RBD
mutations (blue) and 227 ACE2 binding-weakening RBD mutations (red) on the S protein
RBD of SARS-CoV-2 from Jan 07, 2020 to April 18, 2021. The x-axis represents date and
y-axis represents the natural log of frequency of each mutation.
                           It is important to track those mutations that have high frequency since the beginning
of 2021. Table 6.4 gives such information for top 40 mutations in 2021. It can be seen that
mutations N501Y, L452R, T478K, N501T, N550K, F490S, V483F, L452M, and A348S have
                                                                  138


relatively high BFE changes of the binding of S protein and ACE2, suggesting that they
may lead to more infectious variants.
Table 6.4: List of top 40 high-frequency (HF) mutations and their corresponding BFE
changes (unit: kcal/mol) of the binding of S protein and ACE2. Here, count shows the
frequency occurred in 2021.
   Rank   HF mutation Count BFE change            Rank   HF mutation Count BFE change
  Top 1      N501Y      168801        0.5499     Top 21     N450K        184     0.3535
  Top 2       L452R      9843         0.5752     Top 22     E484Q        182     0.0057
  Top 3       E484K      9350         0.0946     Top 23      P330S       182     0.0533
  Top 4       S477N      9276          0.018     Top 24     A522V        179     0.0705
  Top 5      N439K       6056         0.1792     Top 25     D427N        164     -0.1133
  Top 6       T478K      4935         0.9994     Top 26      P479S       153     0.3844
  Top 7      K417N       1634         0.1661     Top 27     V382L        151     0.0355
  Top 8       K417T      1508         0.0116     Top 28     T385N        151     0.0049
  Top 9       S494P      1483         0.0902     Top 29     Q414R        143     0.0708
  Top 10     N501T       1295         0.4514     Top 30     R346K        135     0.1234
  Top 11      A520S       819         0.1495     Top 31      T385I       127     0.0314
  Top 12      A522S       621         0.1283     Top 32     R403K        121     0.1778
  Top 13      V367F       536         0.1764     Top 33     L455F         99     -0.0415
  Top 14     N440K        432         0.6161     Top 34     V483F         99     0.5428
  Top 15      S477R       394          0.082     Top 35     A475V         96     0.3069
  Top 16      P384L       389         0.2681     Top 36     G446V         86     0.1583
  Top 17      R357K       373         0.1393     Top 37     L452M         83     0.5966
  Top 18      F490S       363         0.4406     Top 38     A348S         82     0.4616
  Top 19      P384S       263         0.1151     Top 39      T478I        81     0.1269
  Top 20     Q414K        224         0.1234     Top 40     A352S         78     0.2576
    Figure 6.13 shows the 3D structure of SARS-CoV-2 S protein RBD bound with ACE2.
Here, we mark 13 mutations with either high frequency or high BFE changes. The blue
and red colors represent the mutations that have positive and negative BFE changes, re-
spectively. The darker the color is, the larger the absolute value of BFE changes is. While
mutations occur everywhere on the spike protein, the ones that are most important to
COVID-19 infectivity and the efficacy of antibodies and vaccines are located at the inter-
face between the spike protein and ACE2 or antibodies.
                                             139


                                       F486L                S477R/N
                                                              T478K
                                                             E484K/Q K417N
                                                                  L452R
                                                                       S494P
                                                    N501Y/T    N439K  P384N
Figure 6.13: The 3D structure of SARS-CoV-2 S protein RBD bound with ACE2 (PDB ID:
6M0J). We choose blue and red colors to mark the binding-strengthening and binding-
weakening mutations, respectively. Vaccine escape mutations described in Table 6.6 are
labeled.
6.3.2   Impacts of S RBD co-mutations on SARS-CoV-2 Infectivity
To understand the molecular mechanisms of vaccine-escape mutations, we analyze single
nucleotide polymorphisms (SNPs) of 1,489,884 complete SARS-CoV-2 genome sequences,
resulting in 683 non-degenerate RBD mutations and their associated frequencies. A full
set of mutation information is available on our interactive web page Mutation Tracker.
The infectivity of each mutation is mainly determined by the mutation-induced BFE
change to the binding complex of RBD and ACE2. To estimate the impact of each muta-
tion on vaccines, we collect a library of 130 antibody structures (Supporting Information
S2.1.2), including Food and Drug Administration (FDA)-approved mAbs from Eli Lilly
and Regeneron. For a given RBD mutation, its number of antibody disruptions is given
by the number of antibodies whose mutation-induced antibody-RBD BFE changes are
smaller than -0.3kcal/mol (A list of names for antibodies that are disrupted by mutations
can be found in the Supporting Information S2.1.1.). BFE changes following mutations are
predicted by our deep learning model, TopNetTree [40]. We have created an interactive
web page, Mutation Analyzer, to list all RBD mutations, their observed frequencies, their
RBD-ACE2 BFE changes following mutations, their number of antibody disruptions, and
various ranks. Figure 6.14 illustrates RBD mutations associated with prevailing SARS-
                                             140


CoV-2 variants, time evolution trajectories of all RBD mutations, and the BFE changes of
RBD-ACE2 and 130 RBD-antibodies induced by 75 significant mutations. A summary of
our analysis is given in Table 6.5.
Table 6.5: Top 25 most observed S protein RBD mutations. Here, BFE change refers to
the BFE change for the S protein and human ACE2 complex induced by a single-site S
protein RBD mutation. A positive mutation-induced BFE change strengthens the binding
between S protein and ACE2, which results in more infectious variants. Counts of anti-
body disruption represent the number of antibody and S protein complexes disrupted by
a specific RBD mutation. Here, an antibody and S protein complex is to be disrupted if its
binding affinity is reduced by more than 0.3 kcal/mol [2]. In addition, we calculate the
antibody disruption ratio (%), which is the ratio of the number of disrupted antibody and
S protein complexes over 130 known complexes. Ranks are computed from 683 observed
RBD mutations.
                           Worldwide BFE change Antibody disruption
               Mutation
                          Count Rank Change Rank Count Ratio Rank
                 N501Y 744354 1         0.5499     30     24 18.46    160
                 L452R 259345 2         0.5752     28     39  30.0     98
                 T478K 239619 3         0.9994     2       2  1.54    557
                 E484K 84167 4          0.0946    272     38 29.23    104
                 K417T 37748 5          0.0116    433     37 28.46    107
                 S477N 32673 6          0.0180    422      0   0.0    650
                 N439K 16154 7          0.1792    159     11  8.46    272
                 K417N 8399         8   0.1661    176     53 40.77     61
                 F490S     5617     9   0.4406     52     51 39.23     67
                 S494P     5119 10      0.0902    282     62 47.69     46
                 N440K 3379 11          0.6161     22      0   0.0    645
                 E484Q 3229 12          0.0057    442     30 23.08    130
                 L452Q 2858 13          0.9802     3      27 20.77    144
                 A520S     2727 14      0.1495    199      3  2.31    497
                 N501T 2054 15          0.4514     48     17 13.08    202
                 R357K 1973 16          0.1393    208      5  3.85    388
                 A522S     1959 17      0.1283    221      2  1.54    543
                 R346K 1686 18          0.1234    229      6  4.62    380
                 V367F     1395 19      0.1764    161      0   0.0    637
                 N440S 1361 20          0.1499    197      2  1.54    542
                 P384L     1155 21      0.2681    105     18 13.85    199
                 Y449S     1146 22     -0.8112    632     85 65.38     16
                 D427N 1106 23         -0.1133    558      1  0.77    589
                 R346S     1037 24      0.0374    386     20 15.38    182
                 A475V      891     25  0.3069     94     10  7.69    289
                                            141


a                                                                     b
                                                                                                 14         Positive      N501Y
Variants of Concern (VOC):                               T478K                                              Negative      L452R                                    T478K
 Alpha: N501Y                                            S477N                                   12
                                                                      Natural log of frequency
                                                            E484K/Q                                                       E484K
 Beta: K417N, E484K, N501Y                                                                                                K417T                                    N439K
                                                            F490S                                10          K417N
Gamma: K417T, E484K, N501Y
                                                            S494P                                            S477N                                                 F490S
Delta: L452R, T478K                                                                              8                                                                 S494P
                                                            L452Q/R
Variants of Interest (VOI):                                 K417N/T
 Eta: E484K
                                                                                                 6
                                                            N501Y
 Iota: E484K                                                                                     4
 Kappa: L452R, E484Q          hACE2
Lambda: L452Q, F490S                                                                              2
Mu: R346K, E484K, N501Y
                                                                                                  0
Other variants:                                                                                      Ja    Ma       Ma    Ju    Se   No    Ja     Ma    Ma     J
 Delta plus: K417N, L452R, T478K                                                                       n0      r0      y     l0    p    v     n0    r0     y 0 ul 01
                                         S protein RBD                                                   82       8 2 07 2 6 20 04 2 03         22     32     22     2
                                                                                                            02       02               02   20                    02 021
 Beta plus: P384L, K417N, E484K, N501Y                                                                         0        0  02
                                                                                                                              0
                                                                                                                                 20
                                                                                                                                         0    20 021     02
                                                                                                                                                           1       1
c
                                                                                                             BFE changes (kcal/mol)
                                                                                                                                      -4     -2       0       2           4
Figure 6.14: Most significant RBD mutations. a The 3D structure of SARS-CoV-2 S protein
RBD and ACE2 complex (PDB ID: 6M0J). The RBD mutations in ten variants are marked
with color. b Illustration of the time evolution of 455 ACE2 binding-strengthening RBD
mutations (blue) and 228 ACE2 binding-weakening RBD mutations (red). The x-axis rep-
resents the date and the y-axis represents the natural log of frequency. There has been
a surge in the number of infections since early 2021. c BFE changes of RBD complexes
with ACE2 and 130 antibodies induced by 75 significant RBD mutations. A positive BFE
change (blue) means the mutation strengthens the binding, while a negative BFE change
(red) means the mutation weakens the binding. Most mutations, except for vaccine-
resistant Y449H and Y449S, strengthen the RBD binding with ACE2. Y449S and K417N
are highly disruptive to antibodies.
                                                               142


    First, the 10 most observed or fast-growing RBD mutations are N501Y, L452R, T478K,
E484K, K417T, S477N, N439K, K417N, F490S, and S494P, as shown in Table 6.5. Inclu-
sively, these top mutations strengthen their BFEs and become more infectious, following
the natural selection mechanism [41]. Figure 6.14b shows that the frequencies of the top
three mutations increased dramatically since 2021 due to Alpha, Beta, Gamma, Delta, and
other variants. Second, among the top 25 most observed RBD mutations, T478K, L452Q
N440K, L452R, N501Y, N501T, F490S, A475V, and P384L are the 8 most infectious ones
judged by their ability to strengthen the binding with ACE2, as shown in Figure 6.14c.
The BFE changes of S protein and ACE2 for mutation T478K is nearly 1.00 kcal/mol,
which strongly enhances the binding of the RBD-ACE2 complex [218]. Together with
L452R (BFE change: 0.58kcal/mol), T478K makes Delta the most infectious variant in
VOCs. Third, among the top 25 most observed RBD mutations, Y449S, S494P, K417N,
F490S, L452R, E484K, K417T, E484Q, L452Q, and N501Y are the 10 most antibody disrup-
tive ones, judged by their interactions with 130 antibodies shown in Figure 6.14c. It can
be seen that mutations L452R, E484K, K417T, K417N, F490S, and S494P disrupt more than
30% of antibody-RBD complexes, while mutations E484K and K417T may disrupt nearly
30% antibody-RBD complexes, indicating their disruptive ability to the efficacy and relia-
bility of antibody therapies and vaccines. The most dangerous mutations are the ones that
are both infectivity-strengthening and antibody disruptive. Four RBD mutations, N501Y,
L452R, F490S, and L452Q, appear in both lists and are key mutations in WHO’s VOC and
VOI lists. Among them, F490S and L452Q are the key RBD mutations in Lambda, making
Lambda a more dangerous emerging variant than Delta. Note that high-frequency muta-
tion S477N does not significantly weaken any antibody and RBD binding, and thus does
not appear in any prevailing variants.
                                            143


6.4     Mutational impacts on SARS-CoV-2 antibodies and vaccines
6.4.1    Impacts of S RBD single mutation on SARS-CoV-2 antibodies and vaccines
It is of paramount importance to track not only ACE2-binding-strengthening RBD muta-
tions and FG mutations but also the antibody-binding-weakening RBD mutations. Our
early work reported nearly 71% mutations on the S protein RBD will weaken the bind-
ing of S protein and antibodies, while 64.9% mutations on the RBD will strengthen the
binding of S protein and ACE2, suggesting that these mutations may potentially enhance
the infectivity of SARS-CoV-2 and make the existing antibodies less effective [217]. We
call those mutations that weaken the binding of the S protein and most SARS-CoV-2 anti-
bodies as antibody disrupting (AD) mutations [217]. Notably, most antibody disrupting
mutations have negative BFE changes, suggesting that they will make the SARS-CoV-2
less infectious and thus, will not frequently occur due to natural selection. As a result,
many of them may not be able to evade the existing vaccines in a population. Therefore,
it is necessary to focus on the BFE changes of S protein and antibodies that are induced
by 100 most observed mutations on S protein RBD.
     In this work, we have collected a total of 106 antibodies. The detailed information of
these 106 antibodies can be found in the Supporting Information. Figure 6.15 shows the
BFE changes for the S protein and 106 antibody complexes together with ACE2 following
100 most observed mutations on the S protein RBD. The red color marks the mutation-
induced negative BFE changes for the complexes of S protein and antibodies, which indi-
cates that these mutations may weaken the binding and make the antibody less effective.
Meanwhile, the green color represents the positive BFE changes induced by mutations,
which suggests that these mutations may strengthen the binding of S protein and anti-
bodies. From Figure 6.15, we can see that mutation E484K will disruptively weaken the
binding of S protein with antibodies such as LY-CoV555 and DH1041, which are marked
in dark red. Mutation S494P will disruptively weaken the binding of S protein with an-
                                             144


tibodies such as H11-D4, H11-H4, and LY-CoV555. Mutation K417N will disruptively
weaken the binding of S protein with a large number of antibodies. Moreover, muta-
tion N501Y will moderately weaken the binding of S protein with antibodies such as
CC12.1/CR3022, COVOX-88/-45, COVOX-88 etc.
                         L335F
                         F338L
                         G339D
                         A344S
                         R346K
                         R346S
                         A348S
                         A352S
                         N354D
                         N354K
                         K356R
                         R357K
                         V362F
                         V367L
                         V367F
                         V367A
                         N370H
                         N370S
                         A372V
                         S373P
                         S373L
                         T376I
                         K378N
                         V382L
                         P384S
                         P384L
                         T385N
                         T385I
                         R403K
                         E406Q
                         R408K
                         R408I
                         I410V
                         A411S
                         Q414K
                         Q414R
                         K417T
                         K417N
                         D427N
                         D427Y
                         T430I
                         I434V
                         N439K
                         N440K
                         K444R
                         K444N
                         V445A
                         G446V
                         N450K
                         L452M
                         L452R
                         Y453F
                         L455F
                         K458N
                         S459Y
                         S459F
                         P463S
                         I468V
                         T470N
                         T470I
                         E471Q
                         A475S
                         A475V
                         G476S
                         S477G
                         S477N
                         S477I
                         S477R
                         T478K
                         T478R
                         T478I
                         P479S
                         P479L
                         N481K
                         V483F
                         V483A
                         E484K
                         E484Q
                         F486L
                         F490L
                         F490S
                         Q493R
                         Q493L
                         Q493H
                         S494P
                         G496S
                         N501Y
                         N501T
                         V503I
                         Y508H
                         S514F
                         E516Q
                         L517F
                         H519Q
                         A520S
                         A520V
                         P521S
                         A522P
                         A522S
                         A522V
              CR3022
                 S309
              CC12.1
     CC12.1/CR3022
              CC12.3
     CC12.3/CR3022
                 C105
  REGN10933/10987
                 CV30
              Fab 2-4
            CV07-270
            CV07-250
              H11-D4
     CR3022/H11-D4
              H11-H4
             EY6Z/Nb
                 EY6Z
                 Sb23
          STE90-C11
             P2B-2F6
                 BD23
                  B38
                  CB6
                  SR4
                 MR17
                 H014
          MR17-K99Y
            P2C-1F11
             P2C-1A3
              BD-604
              BD-629
              BD-236
    BD-236/BD-368-2
    BD-604/BD-368-2
            BD-368-2
                 A fab
               CT-P59
                   P17
            P17/H014
           COVA2-04
           COVA2-39
           COVA1-16
               S2H13
                 S2A4
                 S304
           VH binder
   S309/S2H14/S304
               S2M11
                S2E12
                 C102
                                                                                                 BFE changes (kcal/mol)
                 C002
                 C104
                 C110
                 C119
                 C121
                 C135
                 C144
         Fabs 298/52
             C1A-B12
              C1A-B3
              C1A-C2
             C1A-F10
          LY-CoV555
          LY-CoV488
          LY-CoV481
             VHH E/U
  VHH V/Fab CC12.3
 VHH W/Fab CC12.3
 CR3014-C8/CR3022
             DH1047
                  7D6
              Fab 2-7
             DH1041
                  1-57
                                                                                                                          4
       P5A-3C12_1B
             P5A-2G7
             P5A-1B8
             P5A-3A1
                                                                                                                          3
       P5A-3C12_2B
             P5A-1B9
        P5A-2F11_2B
         P5A-1B6_3B                                                                                                       2
        P5A-2F11_3B
             P2B-1A1
            P2B-1A10
             P5A-2G9                                                                                                      1
         P5A-1B6_2B
         P5A-1B8_2B
         P5A-1B8_3B
      COVOX-88/-45                                                                                                        0
    COVOX-269 scFV
         COVOX-158
    COVOX-384/S309
     COVOX-253/-75                                                                                                        -1
COVOX-253H55L/-75
         COVOX-316
         COVOX-150
    COVOX-253H55L
   COVOX-253H165L
                                                                                                                          -2
         COVOX-384
           COVOX-40
           COVOX-88
   CV05-163/CR3022
                                                                                                                          -3
                MW06
               910-30
                  2-51
                 ACE2                                                                                                     -4
          Frequency
                                                   Frequency
                                                               0   101   102   103   104   105
Figure 6.15: Illustration of SARS-CoV-2 S RBD 100 most observed mutations induced BFE
changes for the complexes of S protein and 106 antibodies or ACE2. Here, red repre-
sents the negative changes that will weaken the binding, while green shows the positive
changes that will strengthen the binding.
                                         145


    Considering the impact of the possible calculation error, we set -0.3 kcal/mol as the
threshold of the binding of S protein and antibodies induced by AD mutations. Specif-
ically, we say a mutation is an AD mutation to the binding complex of S protein and
antibody if its BFE change for the complex is less than 0.3 kcal/mol.
    We hypothesize that RBD mutations that can simultaneously strengthen the infectivity
and disrupt the binding between the S protein and existing antibodies will pose imminent
threats to the current crop of vaccines. We define a vaccine escape (VE) mutation as a
high-frequency mutation that is an AD mutation for at least 24 (23%) different antibodies.
We also define a vaccine-weakening (AW) mutation as a high-frequency mutation and
AD mutation for 11 (10%) to 21 (20%) different antibodies.
Table 6.6: List of vaccine escape (VE) and vaccine weakening (VW) Their corresponding
BFE changes (unit: kcal/mol) of the binding of S protein and ACE2 are provided as well.
Here, the count shows the number of antibodies that will make a specific mutation to be
an AD mutation.
   VE Mutation        BFE change     Count     VW Mutation       BFE change      Count
       S494P             0.0902       50           N501Y           0.5499          21
       Q493L             0.2279       43           Q493R           0.1271          21
       K417N             0.1661       43            R408I          0.1949          19
       F490S             0.4406       42           Q493H           0.2385          18
       F486L             0.1456       41            P384S          0.1151          18
       R403K            0.1778        34           K378N           0.0573          16
       E484K             0.0946       31           G496S            0.0187         15
       L452R             0.5752       28            L455F          -0.0415         15
       K417T             0.0116       28            I410V          0.7105          14
       F490L             0.5139       25            R346S          0.0374          14
       E484Q             0.0057       25           V483A           0.6695          13
       A475S            -0.0732       24           K444N           0.1024          12
                                                   N501T           0.4514          11
                                                    P384L          0.2681          11
    Table 6.6 lists vaccine-escape (VE) and vaccine-weakening (VW) RBD mutations to-
gether with their corresponding BFE changes (unit: kcal/mol) of the binding of S pro-
tein and ACE2. The count represents the number of antibodies that will make a specific
mutation to be an AD mutation. We can see that VE mutations F490S, L452R, VW muta-
                                           146


tions F490L, N501Y, V483A, and N501T have relatively high BFE changes of the binding
of S protein and ACE2, suggesting that they are high-risk mutations. Moreover, L452R,
N501Y, and N501T are also HF mutations, which should receive high attention.
6.4.2  Impacts of S RBD single mutation on SARS-CoV-2 antibodies and vaccines
The recent surge in COVID-19 infections is due to the occurrence of RBD co-mutations
that combine two or more infectivity-strengthening mutations. The most dangerous fu-
ture SARS-CoV-2 variants are highly likely to be RBD co-mutations that combine infectivity-
strengthening mutation(s) with antibody disruptive mutation(s). A list of 1,139,244 RBD
co-mutations that are decoded from 1,489,884 complete SARS-CoV-2 genome sequences
can be found in Section S2.1.3 of the Supporting Information, and all of the non-degenerate
RBD co-mutations with their frequencies, antibody disruption counts, total BFE changes,
and the first detection dates and countries can be found in Section S2.1.4 of the Supporting
Information.
    Figure 6.16 illustrates the properties of S protein RBD 2, 3, and 4 co-mutations. The
height of each bar shows the predicted total BFE change of each set of co-mutations on
RBD, the color represents the natural log of frequency for each set of RBD co-mutations,
and the number at the top of each bar is the AI-predicted number of antibody-RBD com-
plexes that each set of RBD co-mutations may disrupt based on a total of 130 RBD and an-
tibody complexes. Notably, for a specific set of co-mutations, the higher the number at the
top of the bar is, the stronger ability to break through vaccines will be. From Figure 6.16,
RBD 2 co-mutation set [L452R, T478K] (Delta variant) has the highest frequency (219,362)
and the highest BFE change (1.575 kcal/mol). Moreover, the Delta variant would disrupt
40 antibody-RBD complexes, suggesting that Delta would not only enhance the infectiv-
ity but also be a vaccine breakthrough variant. Moreover, [L452Q, F490S] (Lambda) is
another co-mutation with high frequency, high BFE changes (1.421 kcal/mol), and high
antibody disruption count (59). In addition, Lambda is considered to be more dangerous
                                              147


than Delta due to its higher antibody disruption count. Further, [R346K, E484K, N501Y]   (Mu variant) has a BFE change of 0.768 kcal/mol and high antibody disruption count   (60). It is not as infectious as Delta and Lambda, but has a similar ability as Lambda in   escaping vaccines. Note that among all VOCs and VOIs, Beta has the highest ability to
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Figure 6.16: Properties of RBD co-mutations. a Illustration of RBD 2 co-mutations with a
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Furthermore, high-frequency 2 co-mutation sets [E484K, N501Y], [F490S, N501Y], and
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ]
                                                                                                                                                                                                                                                                                                                                                                  break through vaccines, but its infectivity is relatively low (BFE change: 0.656 kcal/mol).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1Y
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                50 ]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Natural log of frequency                                                                           Natural log of frequency                                                           , N 1Y
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           4K 50
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         48 , N Q]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   12         11        10         9                   8       7         6            5                                  5.5           5             4.5       4              3.5                     , E 4K 16
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   7N 48 E5 Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 41 , E 1Y, 01
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              , K 7N 50 N5 Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            4L 41 N K, 501
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         38 K K, 4
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Beta plus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     101                              [P 6K, 484 E48 K, N 0S]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                c 4 co-mutations
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [Y449S, N501Y]              94                                                                                             34 , E      , 84 52 ]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [R 7N 70N E4 Y,A 01Y
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       91 93
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   78          [E484K, S494P]                                                                                                         ,
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         41 T4 Q 01 N5 ]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         frequency greater than 90. b Illustration of RBD 3 co-mutations with a frequency greater
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            73                                 [S494P, N501Y]                                                                                         [K 7T, 471 N5 K, 01Y
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         41 E K, 84 5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        72                                                     [F490S, N501Y]                                                                   84                    [K 7T, 484 , E4 K, N
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         41 E N 84
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          60                                   [E484K, N501Y]                                                    83                                   [K 7T, 427 E4
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         41 D K,
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    59                                                                         [L452Q, F490S]                                                              82                         [K 7T, 478
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         41 T
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  58                                                           [L452R, N501Y]                                                                     82                  [K 2R,
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         45
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Lambda
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       58                                      [K417N, N501Y]                  75                                                                     [L
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   56                                                          [F490L, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         1.5                        0.5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         2                            1                         0
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         than 30. c Illustration of RBD 2 co-mutations with a frequency greater than 20. Here, the
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               [E484Q, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                52 54
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               [L452R, E484Q]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               51                              [R346S, L452R]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Total BFE change
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               [Q493R, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        47 49                                  [P384S, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Natural log of frequency
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                47                             [K417T, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         10
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Beta
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       46                                      [R403K, N501Y]                                      9           8     7     6        5         4
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Delta plus
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              46               [N439K, E484K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         x-axis lists RBD co-mutations and the y-axis represents the predicted total BFE change
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  42                                           [P384L, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           90                         [K417N, E484K, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         41                                    [L452R, A522P]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   82                                                 [K417N, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   41                                          [R408I, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                79 81
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [K417T, E484K, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                41                             [G496S, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [K417R, E484K, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Gamma
                                                                                                                                                                                                                                                                                                                                                                                                                                                                     40                                                                                        [L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               70                                                                     [V401L, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       38 39
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               [L452R, V503I]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         69
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Delta
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [L452R, T478K, Q493E]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               [N440K, E484K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           64                         [R346K, E484K, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         between S RBD and ACE2 of each set of RBD co-mutations. The number on the top of each
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    38         [S477N, E484K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       35                                                      [A411S, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  61                                                  [R408I, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  60
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       35                      [L455F, N501Y]                                                              Mu                         [V367A, E484K, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           148
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       33                                      [G476S, N501Y]                            59                                                           [L452R, T478K, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             33                                                [N450K, N501Y]                                          59                                             [L452R, T478K, S494L]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       32                                      [G446V, N501Y]                                 56                                                      [P384L, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    31                         [L441R, N501Y]                                            56                                           [L452R, L455F, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  31                                                           [L452M, N501Y]                                                                    53                   [V382L, L452R, E484Q]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         bar is the AI-predicted number of antibody and RBD complexes that may be significantly
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               30                              [I468V, N501Y]                                           53                                            [L452R, T478K, E484Q]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              30                                               [A475V, N501Y]                                                                   51                    [R346S, L452R, S477N]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 30                            [T385N, N501Y]                                       49                                                [K444N, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   29                                                          [V483F, N501Y]                            49                                                           [A411S, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       27                                                      [A348S, N501Y]                                     47                                                  [I434M, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       27                                      [N501Y, A520S]                                                                                   47    [Y449H, S477I, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 27                            [K458N, N501Y]                                      47                                                 [T376I, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            27                                                 [P479S, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         disrupted by the set of RBD co-mutations, and the color of each bar represents the natural
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          47                                                          [K444M, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        26                                     [N501Y, A522S]                                      46                                                 [G446V, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   26                                          [N354K, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               44                                                                     [L452R, T478K, Y508H]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        26                                     [T478I, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         44                                           [L452R, S477G, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   26                                                          [E471Q, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          43 43
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [A419S, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 26                            [S477I, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [L452R, T478K, P479S]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            26                                 [S371T, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         43                                                           [A348S, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               26                              [V382L, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               42 43
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [A352S, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         log of frequency for each set of RBD co-mutations. (Please check the interactive HTML
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         25                    [D427N, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [L452R, T478K, P479L]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                25                             [S359T, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         42                                                           [L452R, T478K, V483F]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               25                              [N501Y, A522V]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                a 2 co-mutations                                                                                                  b 3 co-mutations
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       25                                                                                                 42                                          [Q414H, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               [N440S, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          24                                   [N501Y, A520V]                                 42                                                      [L452R, T478K, S514F]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       24                                      [V367F, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   42                                                 [L452R, T478K, A522S]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         42                                                           [L452R, E471Q, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            24 24 24
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               [S477R, N501Y]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               [V367L, N501Y]                                           41                                            [L452R, S477I, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         files in the Supporting Information S2.2.4 for a better view of these plots.)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    40 41 41
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               [L335F, N501Y]                                                                                         [V362F, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       13                      [Q414K, N450K]                                                                                         [L452R, T478K, A522V]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   2           [S477N, A522S]                                                                                         [V367L, L452R, T478K]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               1.5                      0.5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        1                                                                                                 2                            1                        0
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   1.5                                                 0.5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         0
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Total BFE change (kcal/mol)                                                                            Total BFE change (kcal/mol)


[S494P, N501Y] are all considered to be the emerging variants that have the potential
to escape vaccines. From Figure 6.16, three 3 co-mutation sets [R345K, E484K, N501Y]
(Mu), [K417T, E484K, N501Y] (Gamma), and [K417N, E484K, N501Y] (Beta) draw our
attention. They are all the prevailing three co-mutations with moderate BFE changes but
very high antibody disruption count (more than 60). With a BFE change of 1.4 kcal/mol
and antibody disruption count of 82, co-mutation set [K417N, L452R, T478K] (Delta plus)
appears to be more dangerous than all of the current VOCs and VOIs.
    For 4 co-mutations in Figure 6.16 c, [P384L, K417N, E484K, N501Y] (Beta plus) could
penetrate all vaccines due to its highest antibody disruption count of 101. We would
like to address that all of the co-mutations sets, except for [Y449S, N501Y] in Figure 6.16
have positive BFE changes, following natural selection. We anticipate that although co-
mutation sets [V401L, L452R, T478K], [L452R, T478K, N501Y], [A411S, L452R, T478K],
and [L452R, T478K, E484K, N501Y] have relatively low frequencies at this point, they may
become dangerous variants soon due to their large BFE changes and antibody disruption
counts.
    It is important to understand the general trend of SARS-CoV-2 evolution. To this end,
we carry out the statistical analysis of RBD co-mutations. Among 1,489,884 SARS-CoV-2
genome isolates, a total of 1,113 distinctive 2 co-mutations, 612 distinctive 3 co-mutations,
and 217 distinctive 4 co-mutations are found. Figures 6.17 a, b, and c illustrate the 2D
histograms of 2, 3, and 4 co-mutations, respectively. The x-axis is the number of antibody
disruption counts, and the y-axis shows the total BFE change. Figure 6.17 a shows that
there are 82 RBD 2 co-mutations that have BFE changes in the range of [0.600, 0.799]
kcal/mol and will disruptive 40 to 49 antibodies. According to Figure 6.17 b, there are
170 unique 3 co-mutations that have large BFE changes of S protein and ACE2 in the range
of [1.500, 1.999] kcal/mol. In Figure 6.17 c, it is seen that almost all of the 4 co-mutations
on RBD have the BFE changes greater than 0.5 kcal/mol and weaken the binding of S
protein with at least 60 antibodies. Figures 6.17d, e, and f are the histograms of total BFE
                                              149


changes, natural log of frequencies, and antibody disruption counts for RBD 2, 3, and 4
co-mutations. It can be found that most of the 2, 3, and 4 RBD co-mutations have positive
total BFE changes, and the larger number of RBD co-mutations is, the higher number of
antibody disruption count will be. In summary, co-mutations with a larger number of
antibody disruptive counts and high BFE changes will grow faster. We anticipate that
when most of the population is vaccinated, vaccine-resistant mutations will become a
more viable mechanism for viral evolution.
                                      a            count              d
              Total BFE change
                                                                   200
                                 2                   80                            2 co-mutations
                                                                   150
                                                           count
                                                                                   3 co-mutations
                                 0                                 100
                                                     40                            4 co-mutations
                                 -2                                50
                                                     0              0
                                     0 40 80 100                              -3      -2     -1   0     1       2
                                                   count                 e             Total BFE change
                                     b               160
              Total BFE change
                                                                   300                              2 co-mutations
                                 2
                                                                   200                              3 co-mutations
                                                           count
                                                    80
                                 0                                                                  4 co-mutations
                                                               100
                             -2                     0
                               0 40 80                              0
                                                                     0         2       4       6      8    10       12
                                c                  count              f             Natural log of frequency
                                                     80            60
              Total BFE change
                                 2                                                                  2 co-mutations
                                                                   40
                                                                                                    3 co-mutations
                                                           count
                                 0                  40
                                                                   20                               4 co-mutations
                             -2
                      -4                  0                         0
                        0 40 80                                         0      20      40    60    80     100 120
                     Antibody disruption count                                      Antibody disruption count
Figure 6.17: a 2D histograms of antibody disruption count and total BFE changes for RBD
2 co-mutations (unit: kcal/mol). b 2D histograms of antibody disruption count and to-
tal BFE changes (unit: kcal/mol) for RBD 3 co-mutations. c 2D histograms of antibody
disruption count and total BFE changes (unit: kcal/mol) for RBD 4 co-mutations. d The
histograms of total BFE changes (unit: kcal/mol) for RBD co-mutations. e The histograms
of the natural log of frequency for RBD co-mutations. f The histograms of antibody dis-
ruption count for RBD co-mutations. In figures a, b, and c, the color bar represents the
number of co-mutations that fall into the restriction of x-axis and y-axis. The reader is
referred to the web version of these plots in the Supporting Information S2.2.2 and S2.2.3.
                                                                             150


6.5         Validation
Here, we present a validation of our BFE change prediction for mutations on S protein
RBD compared to the experimental deep mutational enrichment data [3]. Figure 6.18
presents a comparison between experimental deep mutational enrichment data and BFE
change predictions on SARS-CoV-2 RBD binding to ACE2. In the heatmap of Figure 6.18,
both BFE changes and enrichment ratios describe the affinity changes of the S protein
RBD-ACE2 complex induced by mutations. It is obvious that the predicted BFE changes
are highly correlated to the enrichment ratio data. Pearson correlation is 0.70. It should
be noticed that the deep mutational scanning data from different labs might vary dramat-
ically due to different experimental conditions. For example, the RBD deep mutational
scanning data of the SARS-CoV-2 RBD binding to ACE2 reported by two teams [98, 3]
have a relatively small Pearson correlation of 0.666.
   V401
   I402
   R403
   G404
   D405
   E406
   R408
   Q409
   I410
   G416
   K417
   I418
   A419
   D420
   Y421
   N422
   N437
   N439
   N440
   D442
   S443
   K444
   V445
   G446
   G447
   N448
   Y449
   N450
   Y451
   L452
   Y453
   R454
   L455
   F456
   R457
   K458
   S459
   T470
   E471
   I472
   Y473
   Q474
   A475
   G476
   S477
   T478
   P479
   C480
   G482
   V483
   E484
   G485
   F486
   N487
   C488
   Y489
   F490
   P491
   L492
   Q493
   S494
   Y495
   G496
   F497
   Q498
   P499
   T500
   N501
   G502
   V503
   G504
   Y505
   Q506
   P507
   Y508
A                         A                                                  A
C                                                                                      C             C                                         1
D         D                 D        D
E           E                                                        E                       E
F                                                          F                                     F       F             F
                                                                                                                                                    log10 enrichment ratio
                                                                                                                                               0
G       G           G                        GG                                G         G     G                     G           G G
H
 I  I             I     I                                              I
K                     K                  K                     K                                                                               -1
L                                                   L    L                                                   L
M
N                               NNNN            N N                                                N                           N               -2
P                                                                                    P                     P               P             P
Q               Q                                                          Q                                   Q         Q             Q
R     R       R                                        R     R                                                                                 -3
S                                      S                         S               S                               S
T                                                                  T               T                                         T
V V                                        V                                               V                                      V            -4
W
Y                             Y                  Y Y Y                   Y                             Y           Y                 Y     Y
   V401
   I402
   R403
   G404
   D405
   E406
   R408
   Q409
   I410
   G416
   K417
   I418
   A419
   D420
   Y421
   N422
   N437
   N439
   N440
   D442
   S443
   K444
   V445
   G446
   G447
   N448
   Y449
   N450
   Y451
   L452
   Y453
   R454
   L455
   F456
   R457
   K458
   S459
   T470
   E471
   I472
   Y473
   Q474
   A475
   G476
   S477
   T478
   P479
   C480
   G482
   V483
   E484
   G485
   F486
   N487
   C488
   Y489
   F490
   P491
   L492
   Q493
   S494
   Y495
   G496
   F497
   Q498
   P499
   T500
   N501
   G502
   V503
   G504
   Y505
   Q506
   P507
   Y508
A                         A                                                  A
C                                                                                      C             C                                         1
D         D                 D        D
E           E                                                        E                       E
                                                                                                                                                    BFE changes (kcal/mol)
F                                                          F                                     F       F             F
G       G           G                        GG                                G         G     G                     G           G G           0
H
 I  I             I     I                                              I
K                     K                  K                     K
L                                                   L    L                                                   L                                 -1
M
N                               NNNN            N N                                                N                           N
P                                                                                    P                     P               P             P
Q               Q                                                          Q                                   Q         Q             Q       -2
R     R       R                                        R     R                                                                                                               hACE2 binding to SARS-CoV-2 RBD
S                                      S                         S               S                               S
T                                                                  T               T                                         T
V V                                        V                                               V                                      V            -3
W
Y                             Y                  Y Y Y                   Y                             Y           Y                 Y     Y
Figure 6.18: A comparison between experimental RBD deep mutation enrichment data
and predicted BFE changes for SARS-CoV-2 RBD binding to ACE2 (6M0J) [3]. Top left:
deep mutational scanning heatmap showing the average effect on the enrichment for
single-site mutants of RBD when assayed by yeast display for binding to the S protein
RBD [3]. Right: RBD colored by average enrichment at each residue position bound to
the S protein RBD. Bottom left: machine learning predicted BFE changes for single-site
mutants of the S protein RBD.
                                                                                           151


    The validation of our machine learning predictions for mutation-induced BFE changes
compared to experimental data has been demonstrated in recently published papers [102,
99]. Firstly, we showed high correlations of experimental deep mutational enrichment
data and predictions for the binding complex of SARS-CoV-2 S protein RBD and pro-
tein CTC-445.2 [102] and the binding complex of SARS-CoV-2 RBD and ACE2 [99]. In
comparison with experimental data on the impacts of emerging variants on antibodies
in clinical trials, our predictions achieve a Pearson correlation at 0.80 [99]. Considering
the BFE changes induced by RBD mutations for ACE2 and RBD complex, predictions
on mutations L452R and N501Y have a highly similar trend with experimental data [99].
Meanwhile, as we presented in [2], high-frequency mutations are all having positive BFE
changes. Moreover, for multi-mutation tests, our BFE change predictions have the same
pattern with experimental data of the impact of SARS-CoV-2 variants on major antibody
therapeutic candidates, where the BFE changes are accumulative for co-mutations [99].
    Recent studies on potency of mAb CT-P59 in vitro and in vivo against Delta variants[219]
show that the neutralization of CT-P59 is reduced by L452R (13.22 ng/mL) and is re-
tained against T478K (0.213 ng/mL). In our predictions [99], L452R induces a negative
BFE change (-2.39 kcal/mol), and T478K produces a positive BFE change (0.36 kcal/mol).
In Figure 3.2b, the fold changes for experimental and predicted values are presented.
Additional, Figure 3.2c shows a comparison of the experimental pseudovirus infection
changes and predicted BFE changes of ACE2 and S protein complex induced by muta-
tions L452R and N501Y. The experimental data is obtained in a reference to D614G and
reported in relative luciferase units [220]. It indicates that the binding of RBD and ACE2
dominates the infectivity of SARS-CoV-2. More details can be found in Section S6 of Sup-
porting information.
                                              152


6.6                  Websites Designed
6.6.1                Mutation Tracker
Since the initial outbreak of the COVID-19, the raging pandemic caused by SARS-CoV-2
has lasted over two years. We do have many promising vaccines, but they might have side
effects and their full side effects, particularly, long-term side effects, remain unknown. To
make things worse, near 28734 unique mutations have been recorded for SARS-CoV-2 as
shown by Mutation Tracker (See Figure 6.19). All of these reveal the sad reality that our
current understanding of life science, virology, epidemiology, and medicine is severely
limited.
                16
                                                                               28734 Single Mutations in 1876327 hCoV-19 Genomes
                                                                                           Relevant link: Analysis of S protein RBD mutations                        enabled by data from
                14
                12
                10
ln(Frequency)   8
                6
                4
                2
                 0
                     NSP1                       NPS3   NSP4 3CL NSP6                            RdRp           Helicase                                        S   ORF3a E                  N
                                Date=20210910          GISAID data provided on this website is subject to GISAID’s Terms and Conditions <[Download Summary]>
                            20200101                                                                        20201027                                                                            20210910
Figure 6.19: Illustration of SARS-CoV-2 mutations given by Mutation Tracker. Interactive
version is available at Mutation Tracker.
6.6.2                Mutation Analyzer
The most observed SARS-CoV-2 RBD mutations are available at Mutation Analyzer (See
Figure 6.20).
                                                                                            153


                                            Analysis of observed S protein RBD mutations
                                                      Relevant link: Mutation Tracker for genome-wide analysis
              ^
  Show 10        entries                                                                                                         Search
                                    Worldwide observed                BFE change (kcal/mol)*                           Antibody disruption
      Mutation
                                                            ^                    ^                 ^                 ^                     ^              ^
                                 Counts    ▼      Rank      ^         Value      ^       Rank      ^         Counts♦ ^       Ratio(%)♠     ^      Rank    ^
        N501Y                    778190             1                 0.5499               32                  24               18.46             179
        L452R                    492276             2                 0.5752                30                 39                30.0             112
        T478K                    467835             3                 0.9994                 2                 2                 1.54              597
        E484K                     97264             4                 0.0946               285                 38               29.23              118
        K417T                    47315              5                 0.0116               453                 37               28.46              122
        S477N                    33170              6                 0.0180               442                  0                 0.0              677
        N439K                    16505              7                 0.1792               168                 11                8.46              292
        K417N                      9415             8                 0.1661               185                 53                40.77              74
         F490S                     5971             9                 0.4406                55                 51                39.23              81
         S494P                     5263            10                 0.0902               296                 62                47.69              57
  Showing 1 to 10 of 724 entries                                                                                           Previous 1 2 3 4 5 ...   73 Next
Figure 6.20: Illustration of the analysis of SARS-CoV-2 mutations given by interactive
Mutation Analyzer that is available at Mutation Analyzer.
6.7       Discussion and Conclusion
Since the first COVID-19 case was reported in December 2019, this pandemic has led
to four waves of infections, over 400 million reported cases globally, and near 6 million
deaths. Despite the exciting progress in the developments of vaccines and monoclonal
antibodies, their potential side effects, such as allergy reactions to COVID-19 vaccines, are
not very clear. Additionally, the latest Omicron variant is able to evade current vaccines
and compromise essentially all monoclonal antibodies. Although the Omicron variant
may be less deadly than the original virus, there is no guarantee that future variants will
be less virulent. Our present understanding of SARS-CoV-2 and COVID-19 is still quite
poor.
    Molecular modeling, simulation, and prediction of SARS-CoV-2 has contributed tremen-
dously to the development of effective vaccines, drugs, and antibody therapies. Their role
in combating COVID-19 is indispensable. For example, thank to an approach that inte-
grates genotyping, biophysics, artificial intelligence, advanced mathematics, and experi-
ment data, it is now well-understood that the SARS-CoV-2 evolution and transmission are
                                                                                 154


governed by natural selection [41]. This indicates the next SARS-CoV-2 variant will be in-
creasingly more transmissible through high infectivity, robust vaccine breakthrough, and
strong antibody resistance [221, 222]. This understanding cannot be achieved through
individual experiments. Therefore, it is imperative to provide a literature review for the
study of the molecular modeling, simulation, and prediction of SARS-CoV-2. Since the
related literature is huge and varies in quality, we cannot collect all of the existing liter-
ature for the topic. However, we try to put forward a methodology-centered review in
which we emphasize the methods used in various studies. To this end, we gather the ex-
isting theoretical and computational studies of SARS-CoV-2 concerning the aspects such
as molecular modeling, biophysics, bioinformatics, cheminformatics, machine learning
including deep learning, and mathematical approaches, aiming to provide a comprehen-
sive, systematic, and indispensable component for the understanding of the molecular
mechanism of SARS-CoV-2 and their interactions with host cells. Our review provides a
methodology-centered description of the status of the molecular model, simulation, and
prediction of SARS-CoV-2. We discuss both the traditional molecular theories, models,
and methods and emergent machine learning algorithms and mathematical approaches.
    Although various vaccines have been approved and in use, vaccine-breakthrough mu-
tations have become a serious problem. Even with the promising news of new vaccines,
COVID-19 as a global health crisis may still last for years before it is fully stopped glob-
ally. The research on SARS-CoV-2 will also last for many years. It will take researchers
many more years to fully understand the molecular mechanism of coronaviruses, such
as RNA proofreading, virus-host cell interactions, antibody-antigen interactions, protein-
protein interactions, protein-drug interactions, viral regulation of host cell functions, and
immune response. Even if we could control the transmission of SARS-CoV-2 in the future,
newly emergent coronaviruses may still cause similar pandemic outbreaks. Therefore, the
coronaviral studies will continue even after the current pandemic is fully under control.
    Currently, epidemiologists, virologists, biologists, medical scientists, pharmacists, phar-
                                             155


macologists, chemists, biophysicists, mathematicians, computer scientists, and many oth-
ers are called to investigate various aspects of COVID-19 and SARS-CoV-2. This trend of a
joint effort on COVID-19 investigations will continue beyond the present pandemic. The
urgent need for the molecular mechanistic understanding of SARS-CoV-2 and COVID-
19 will further stimulate the development of computational biophysical, artificial intelli-
gence, and advanced mathematical methods. The theoretical, computational, and mathe-
matical communities will benefit from this endeavor against the pandemic.
    The year 2020 has witnessed the birth of human mRNA vaccines for the first time —
a remarkable accomplishment in science and technology. Although there are more dark
days ahead of us, humanity will prevail in a post-COVID-19 world. Science will emerge
stronger against all pathogens and diseases in the future.
                                             156


                                       CHAPTER 7
                           DISSERTATION CONTRIBUTION
The main contributions of this dissertation are listed as follows:
  • In Chapter 2, we propose two topological Laplacians: persistent Laplacians and
     persistent path Laplacians for the multiscale analysis of a given point-cloud dataset.
     The detailed construction process of persistent Laplacians and persistent path Lapla-
     cians are also included in Chapter 2. Notably, persistent Laplacians can extract rich
     topological and geometric information during filtration, and persistent path Lapla-
     cians are proposed to deal with asymmetric structures such as digraphs and net-
     works.
  • In Chapter 3, we set up a standard procedure to systematically decode nearly 30k
     unique single mutations from more than 2 million complete SARS-CoV-2 genome
     sequences in the GISAID database. In addition, we build a mathematical model
     called TopNetmAb, to detect the impact of single and co-mutations on the SARS-
     CoV-2 variants.
  • In Chapter 4, we discuss applications of two new topological Laplacians in several
     systems, such as benzene, tetrahedron, pyramid, fullerene, curcurbit[n]urils sys-
     tems, etc.
  • In Chapter 5, we develop an open-source software package, called highly efficient
     robust multidimensional evolutionary spectra (HERMES), to enable broad applica-
     tions of persistent Laplacians in science, engineering, and technology. To ensure the
     reliability and robustness of HERMES, we also validate the software with simple
     geometric shapes and complex datasets from three-dimensional (3D) protein struc-
     tures.
                                            157


  • Chapter 6 shows our findings in the study of SARS-CoV-2, including the mecha-
       nisms of SARS-CoV-2 evolution, the mutational impacts on the infectivity, diagnos-
       tic targets, vaccines, and antibodies of SARS-CoV-2. Our standard procedures re-
       garding date collection, pre-possessing, and model training integrate multiple tech-
       niques in computational biophysical, artificial intelligence, and advanced mathe-
       matics, which may facilitate the development of next-generation vaccines and anti-
       body therapies against future SARS-CoV-2 variants.
   The contents of this dissertation are mostly adopted from the following publications
and preprints1 :
  • Wang, R., Wei, G., Persistent Path Laplacian, arXiv, (2022)
  • Gao, K.∗ , Wang, R.∗ , Chen, J., Cheng, L., Frishcosy, J., Huzumi, Y., Qiu, Y., Schluck-
       bier, T., Wei, X., and Wei, G., Methodology-centered review of molecular modeling,
       simulation, and prediction of SARS-CoV-2, Chemical Reviews, in press, (2022).
  • Wang, R., Chen, J., Hozumi, Y., Yin, C., and Wei, G., Emerging vaccine-breakthrough
       SARS-CoV-2 variants, ACS Infectious Diseases, 8(3), 546-556, (2022).
  • Chen, J., Wang, R., and Wei, G., Review of the mechanisms of SARS-CoV-2 evolution
       and transmission, (2021).
  • Wang, R., Chen, J., and Wei, G., Mechanisms of SARS-CoV-2 evolution revealing
       vaccine-resistant mutations in Europe and America, The Journal of Physical Chem-
       istry Letters, 12, 11850-11857, (2021)
  • Chen, J., Gao, K., Wang, R., and Wei, G., Revealing the threat of emerging SARS-
       CoV-2 mutations to antibody therapies, Journal of Molecular Biology, 433(18), (2021)
   1 ∗
    ( co-first author)
                                              158


    • Wang, R., Gao, K., Chen, J., and Wei, G., Vaccine-escape and fast-growing mutations
        in the United Kingdom, the United States, Singapore, Spain, South Africa, and other
        COVID-19-devastated countries, Genomics, 113(4), 2158-2170, (2021).
    • Chen, J.∗ , Gao, K.∗ , Wang, R.∗ , and Wei, G., Prediction and mitigation of mutation
        threats to COVID-19 vaccines and antibody therapies, Chemical Science, (2021).
    • Wang, R., Zhao, R., Ribando-Gros, Emily., Chen, J., Tong, Y., and Wei, G., HERMES:
        Persistent spectral graph software, Foundations of Data Science, 3(1), 67-97, (2021).
    • Wang, R., Hozumi, Y., Yin, C., and Wei, G., Decoding SARS-CoV-2 transmission,
        evolution and ramification on COVID-19 diagnosis, vaccine, and medicine, Journal
        of Chemical Information and Modeling, 60, 5853-5865 (2020).
    • Wang, R., Duc D Nguyen and Wei, G., Persistent spectral graph, International Jour-
        nal for Numerical Methods in Biomedical Engineering, 36(9), e3376 (2020).
    This work led to the following publications/preprints are not discussed in this disser-
tation2 :
    • Chen, J., Wang, R., Gilby, N.B., and Wei, G., Omicron (B.1.1.529): Infectivity, vac-
        cine breakthrough, and antibody resistance, Journal of Chemical Information and
        Modeling, 62(2), 412-422, (2022).
    • Gao, K., Wang, R., Chen, J., Huang, F., and Wei, G., Perspectives on SARS-CoV-
        2 Main Protease Inhibitors, Journal of Medicinal Chemistry, 64(23), 16922-16955,
        (2021).
    • Jiang, J., Wang, R., and Wei, G., GGL-Tox: Geometric graph learning for toxicity
        prediction, Journal of Chemical Information and Modeling, 61(4), (2021).
    2 ∗
     ( co-first author)
                                              159


• Hozumi, Y., Wang, R., Yin, C., and Wei, G., UMAP-assisted K-means clustering
  of large-scale SARS-CoV-2 mutation datasets, Computers in Biology and Medicine,
  131, p.104264, (2021).
• Chen, J.∗ , Gao, K.∗ , Wang, R., Duc Nguyen, and Wei, G., Review of COVID-19 anti-
  body therapies, Annual Review of Biophysics, 50, 1-30 (2021).
• Wang, R., Chen, J., Gao, K., Hozumi, Y., Yin, C., and Wei, G., Analysis of SARS-
  CoV-2 mutations in the United States suggests presence of four substrains and novel
  variants, Communications Biology, 4,228 (2021).
• Chen, J., Wang, R., and Wei, G., SARS-CoV-2 becoming more infectious as revealed
  by algebraic topology and deep learning. Communications in Information and Sys-
  tems 21(1), 31-36 (2021).
• Wang, R., Chen, J., Hozumi, Y., Yin, C., and Wei, G., Decoding Asymptomatic
  COVID-19 infection and transmission, The Journal of Physical Chemistry Letters,
  11, 10007-10015 (2020).
• Nguyen, D. D., Gao, K., Chen, J., Wang, R., and Wei, G., Unveiling the molecu-
  lar mechanism of SARS-CoV-2 main protease inhibition from 137 crystal structures
  using algebraic topology and deep learning, Chemical Sciences, 11, 12036 - 12046
  (2020).
• Wang, R., Hozumi, Y., Zheng, Y., Yin, C., and Wei, G., Host immune response driv-
  ing SARS-CoV-2 evolution, Viruses, 12, 1095 (2020).
• Wang, R., Hozumi, Y., Yin, C., Wei, G., Mutations on COVID-19 diagnostic targets,
  Genomics, 112, 5204-5213 (2020).
• Chen, J., Wang, R., Wang, M., and Wei, G., Mutations strengthened SARS-CoV-2
  infectivity, Journal of Molecular Biology, 432, 5212-5226 (2020).
                                         160


• Jiang, J., Wang, R., Menglun Wang, Gao, K., Nguyen, D. D., and Wei, G., Boosting
  tree-assisted multitask deep learning for small scientific datasets. Journal of Chem-
  ical Information and Modeling, 60 (3), 1235-1244 (2020).
                                       161


APPENDICES
    162


                                      APPENDIX A
            SUPPLEMENTARY MATERIALS IN PERSISTENT LAPLACIAN
A.1     Additional Laplacian matrices and their properties
In this section, we give a further description of additional boundary and Laplacian ma-
trices and their properties involved in the filtration process in Figure 2.6.
                                   Table A.1: K1 → K1 .
                       q                   q=0               q=1        q=2
                       1+0
                     Bq+1                     /                 /          /
                                     0 1 2 3 4 
                       Bq1                                      /          /
                                   [ 0 0 0 0 0
                                                       
                                      0   0   0   0   0
                                     0   0   0   0   0 
                                                                /          /
                                                       
                     L1+0
                       q
                                    
                                     0   0   0   0   0 
                                                        
                                     0   0   0   0   0 
                                      0   0   0   0   0
                     βq1+0                    5                 /          /
                  dim(L1+0 q )                5                 /          /
                  rank(L1+0q )                0                 /          /
                 nullity(L1+0
                            q )               5                 /          /
                Spectra(L1+0 q )       {0, 0, 0, 0, 0}          /          /
                                             163


                      Table A.2: K2 → K2 .
      q                  q=0                  q=1    q=2
                         01 
                     0     −1
                     1   1 
      2+0
    Bq+1                                      /     /
                     2   0 
                               
                     3   0 
                     4       0
                                              01 
                                           0    −1
                    0 1 2 3 4              1  1 
      Bq2                                           /
                  [ 0 0 0 0 0 ]            2  0 
                                                  
                                           3  0 
                                           4     0
                                    
                   1 −1 0 0 0
                 −1 1 0 0 0         
                                               [2]    /
                                    
    L2+0
      q
                
                  0     0 0 0 0     
                                     
                  0     0 0 0 0     
                   0     0 0 0 0
    βq2+0                   4                   0     /
 dim(L2+0 q )               5                   1     /
 rank(L2+0q )               1                   1     /
nullity(L2+0
           q )              4                   0     /
Spectra(L2+0q )      {0, 0, 0, 0, 2}            2     /
                                 164


                                 Table A.3: K3 → K3 .
      q                     q=0                            q=1             q=2
                   01 12 23 03 
                0    −1 0          0 −1
                1  1 −1 0              0 
      3+0
    Bq+1                                                    /             /
                2  0
                           1 −1 0        
                3  0       0      1    1 
                4     0     0      0    0
                                                    01 12 23 03 
                                                 0    −1 0        0 −1
                       0 1 2 3 4                 1  1 −1 0           0 
      Bq3                                                                 /
                    [ 0 0 0 0 0 ]                2  0
                                                          1     −1   0  
                                                                         
                                                 3  0     0      1   1 
                                                 4    0    0      0   0
                                           
                   2 −1 0 −1 0                                         
                                                       2 −1 0 1
                 −1 2 −1 0 0               
                                                     −1 2 −1 0 
                                                                            /
                                           
    L3+0
      q
                
                  0 −1 2 −1 0              
                                            
                                                    
                                                     0 −1 2 1 
                                                                        
                 −1 0 −1 2 0               
                                                       1    0      1 2
                   0     0      0     0 0
    βq3+0                      2                              1             /
 dim(L3+0 q )                  5                              4             /
 rank(L3+0q )                  3                              3             /
nullity(L3+0
           q )                 2                              1             /
Spectra(L3+0q )         {0, 0, 2, 2, 4}                  {0, 2, 2, 4}       /
                                          165


                                                    Table A.4: K5 → K5 .
       q                            q=0                                     q=1                             q=2
                                                                         012 023 
                     01    12     23    03     24    02           01       1       0
                  0   −1     0      0    −1      0    −1
                                                                    12   1          0                       0123 
      5+0         1  1     −1      0     0      0    0                               
    Bq+1                                                          23   0          1                 012     −1
                  2  0      1    −1      0     −1    1                
                                                                         0 −1 
                                                                                        
                    
                     0
                                                                   03                                  023      1
                  3          0      1     1      0    0                               
                                                                    24   0          0 
                  4    0     0      0     0      1    0
                                                                    02      −1 1
                                                                                                          012 023 
                                                                01  12     23     03     24    02  01      1     0
                                                            0    −1  0      0      −1      0    −1
                                                                                                     12   1       0 
                             0    1   2   3    4            1   1  −1      0       0      0    0                   
      Bq5                                                                                          23   0       1 
                           [ 0    0   0   0    0 ]          2   0   1     −1       0    −1     1       
                                                                                                          0 −1 
                                                                                                                      
                                                               
                                                                0
                                                                                                    03
                                                            3        0      1       1      0    0                   
                                                                                                     24   0       0 
                                                            4     0  0      0       0      1    0
                                                                                                     02     −1 1
                                                                                                
                                                                3  0      0     1 0        0
                         3     −1    −1     −1      0
                       −1
                                                                 0  3    −1      0 −1        0  
                                2    −1      0      0                                                          
                                                               0 −1      3     0 1        0             4    0
    L5+0
      q
                       −1     −1     4     −1     −1                                          
                      
                       −1
                                                                1  0      0     3 0        0             0    4
                                0    −1      2      0                                          
                                                                 0 −1      1     0 2        −1  
                         0      0    −1      0      1
                                                                  0  0      0     0 −1        4
    βq5+0                             1                                        0                               0
 dim(L5+0q    )                       5                                        6                               2
 rank(L5+0q   )                       4                                        6                               2
nullity(L5+0q   )                     1                                        0                               0
Spectra(L5+0q   )              {0, 1, 2, 4, 5}                         {1, 2, 2, 4, 4, 5}                   {4, 4}
                                                            166


                 Table A.5: K1 → K2 .
      q                   q=0           q=1 q=2
                          01 
                      0     −1
                      1   1 
      1+1
    Bq+1                               /   /
                      2   0 
                                
                      3   0 
                      4      0
                     0 1 2 3 4
      Bq1                                /   /
                   [ 0 0 0 0 0 ]
                                     
                    1 −1 0 0 0
                 −1 1 0 0 0          
                                         /   /
                                     
    L1+1
      q
                
                   0     0 0 0 0     
                                      
                   0     0 0 0 0     
                    0     0 0 0 0
    βq1+1                    4           /   /
 dim(L1+1 q )                5           /   /
 rank(L1+1q )                1           /   /
nullity(L1+1
           q )               4           /   /
Spectra(L1+1q )       {0, 0, 0, 0, 2}    /   /
                           167


                      Table A.6: K1 → K4 .
      q                        q=0               q=1 q=2
                    01 12 23 03 24 
                0     −1 0        0 −1 0
                1   1 −1 0            0     0 
      1+3
    Bq+1                                        /   /
                2   0
                            1   −1    0    −1 
                                               
                3   0       0    1    1     0 
                4      0     0    0    0     1
                           0 1 2 3 4
      Bq1                                         /   /
                         [ 0 0 0 0 0 ]
                                              
                      2 −1 0 −1 0
                    −1 2 −1 0              0 
                                                  /   /
                                              
    L1+3
      q
                  
                     0 −1 3 −1 −1            
                    −1 0 −1 2              0 
                      0     0 −1 0          1
    βq1+3                        1                /   /
 dim(L1+3 q )                    5                /   /
 rank(L1+3q )                    4                /   /
nullity(L1+3
           q )                   1                /   /
Spectra(L1+3q )   {0, 0.8299, 2, 2.6889, 4.4812}  /   /
                                168


                       Table A.7: K1 → K5 .
      q                         q=0                  q=1 q=2
                   01 12 23 03 24 02 
                0   −1 0        0 −1 0 −1
                1  1 −1 0             0     0   0 
      1+4
    Bq+1                                            /   /
                2  0
                         1    −1      0    −1   1 
                                                   
                3  0     0     1      1     0   0 
                4    0    0     0      0     1   0
                           0 1 2 3 4
      Bq1                                             /   /
                         [ 0 0 0 0 0 ]
                                                
                       3 −1 −1 −1 0
                     −1 2 −1 0                0 
                                                      /   /
                                                
    L1+4
      q
                    
                     −1 −1 4 −1 −1             
                     −1 0 −1 2                0 
                       0     0 −1 0            1
    βq1+4                          1                  /   /
 dim(L1+4 q )                      5                  /   /
 rank(L1+4q )                      4                  /   /
nullity(L1+4
           q )                     1                  /   /
Spectra(L1+4q )             {0, 1, 2, 4, 5}           /   /
                                 169


                         Table A.8: K1 → K6 .
      q                           q=0                  q=1 q=2
                   01 12 23 03 24 02 13 
                0   −1 0        0 −1 0 −1 0
                1  1 −1 0           0      0  0 −1 
      1+5
    Bq+1                                              /   /
                2  0
                        1    −1     0     −1  1   0 
                                                     
                3  0    0      1    1      0  0   1 
                4    0   0      0    0      1  0   0
                             0 1 2 3 4
      Bq1                                               /   /
                           [ 0 0 0 0 0 ]
                                                
                         3 −1 −1 −1 0
                       −1 3 −1 −1 0 
                                                        /   /
                                                
    L1+5
      q
                      
                       −1 −1 4 −1 −1           
                       −1 −1 −1 3            0 
                         0     0 −1 0         1
    βq1+5                            1                  /   /
 dim(L1+5 q )                        5                  /   /
 rank(L1+5q )                        4                  /   /
nullity(L1+5
           q )                       1                  /   /
Spectra(L1+5q )               {0, 1, 4, 4, 5}           /   /
                                   170


                       Table A.9: K2 → K3 .
      q                     q=0                 q=1    q=2
                   01 12 23 03 
                0    −1 0          0 −1
                1  1 −1 0              0 
      2+1
    Bq+1                                        /     /
                2  0
                           1     −1    0 
                                          
                3  0       0      1    1 
                4     0     0      0    0
                                                01 
                                             0    −1
                       0 1 2 3 4             1  1 
      Bq2                                             /
                    [ 0 0 0 0 0 ]            2  0 
                                                    
                                             3  0 
                                             4    0
                                          
                   2 −1 0 −1 0
                 −1 2 −1 0 0              
                                                 [2]    /
                                          
    L2+1
      q
                
                  0 −1 2 −1 0             
                                           
                 −1 0 −1 2 0              
                   0     0      0     0 0
    βq2+1                      2                  0     /
 dim(L2+1 q )                  5                  1     /
 rank(L2+1q )                  3                  1     /
nullity(L2+1
           q )                 2                  0     /
Spectra(L2+1q )         {0, 0, 2, 2, 4}           2     /
                                  171


                           Table A.10: K2 → K4 .
      q                        q=0                  q=1    q=2
                    01 12 23 03 24 
                0     −1 0        0 −1 0
                1   1 −1 0            0     0 
      2+2
    Bq+1                                            /     /
                2   0
                            1   −1    0    −1 
                                               
                3   0       0    1    1     0 
                4      0     0    0    0     1
                                                    01 
                                                 0    −1
                           0 1 2 3 4             1  1 
      Bq2                                                 /
                         [ 0 0 0 0 0 ]           2  0 
                                                        
                                                 3  0 
                                                 4    0
                                              
                      2 −1 0 −1 0
                    −1 2 −1 0              0 
                                                     [2]    /
                                              
    L2+2
      q
                  
                     0 −1 3 −1 −1            
                    −1 0 −1 2              0 
                      0     0 −1 0          1
    βq2+2                        1                    0     /
 dim(L2+2 q )                    5                    1     /
 rank(L2+2q )                    4                    1     /
nullity(L2+2
           q )                   1                    0     /
Spectra(L2+2q )   {0, 0.8299, 2, 2.6889, 4.4812}      2     /
                                    172


                             Table A.11: K2 → K5 .
      q                         q=0                       q=1     q=2
                   01 12 23 03 24 02 
                0   −1 0        0 −1 0 −1
                   1 −1 0
      2+3
    Bq+1
                1                     0    0   0 
                                                       012 023  /
                2  0
                         1    −1      0    −1  1 
                                                   01 1       0
                3  0     0     1      1    0   0 
                4    0    0     0      0    1   0
                                                          01 
                                                      0     −1
                           0 1 2 3 4                  1   1 
      Bq2                                                        /
                         [ 0 0 0 0 0 ]                2   0 
                                                              
                                                      3   0 
                                                      4      0
                                               
                       3 −1 −1 −1 0
                     −1 3 −1 −1 0 
                                                           [3]     /
                                               
    L2+3
      q
                    
                     −1 −1 4 −1 −1            
                     −1 −1 −1 3              0 
                       0     0 −1 0           1
    βq2+3                          1                        0      /
 dim(L2+3 q )                      5                        1      /
 rank(L2+3q )                      4                        1      /
nullity(L2+3
           q )                     1                        0      /
Spectra(L2+3q )             {0, 1, 2, 4, 5}                 3      /
                                         173


                                   Table A.12: K2 → K6 .
      q                           q=0                              q=1        q=2
                   01 12 23 03 24 02 13 
                0   −1 0        0 −1 0 −1 0
                   1 −1 0                     0 −1 
      2+4
    Bq+1
                1                   0      0              012 023 013 123  /
                2  0
                        1 −1 0 −1 1               0 
                                                        01   1    0    1 0
                3  0    0      1    1      0  0   1 
                4    0   0      0    0      1  0   0
                                                                   01 
                                                                0    −1
                             0 1 2 3 4                          1  1 
      Bq2                                                                    /
                           [ 0 0 0 0 0 ]                        2  0 
                                                                       
                                                                3  0 
                                                                4    0
                                                
                         3 −1 −1 −1 0
                       −1 2 −1 0             0 
                                                                    [4]        /
                                                
    L2+4
      q
                      
                       −1 −1 4 −1 −1           
                       −1 0 −1 2             0 
                         0     0 −1 0         1
    βq2+4                            1                               0         /
 dim(L2+4 q )                        5                               1         /
 rank(L2+4q )                        4                               1         /
nullity(L2+4
           q )                       1                               0         /
Spectra(L2+4q )               {0, 1, 4, 4, 5}                        4         /
                                              174


                                   Table A.13: K3 → K5 .
      q                         q=0                                q=1               q=2
                   01 12 23 03 24 02 
                0   −1 0        0 −1 0 −1                        012 023     
                   1 −1 0                                    01    1 0
                1                      0    0   0 
      3+2
    Bq+1                                                    12 
                                                                  1 0
                                                                                     /
                2  0     1 −1 0 −1 1                                        
                                                            23  0 1        
                3  0     0     1      1    0   0 
                                                              03    0 −1
                4    0    0     0      0    1   0
                                                            01 12 23 03 
                                                         0   −1 0         0 −1
                           0 1 2 3 4                     1  1 −1 0             0 
      Bq3                                                                           /
                         [ 0 0 0 0 0 ]                   2  0
                                                                  1 −1 0         
                                                         3  0     0      1     1 
                                                         4    0    0      0     0
                                               
                       3 −1 −1 −1 0                                             
                                                               3 0       0 1
                     −1 2 −1 0               0             0 3 −1 0 
                                                                                      /
                                               
    L3+2
      q
                    
                     −1 −1 4 −1 −1            
                                                            
                                                             0 −1 3 0 
                                                                                 
                     −1 0 −1 2               0 
                                                               1 0       0 3
                       0     0 −1 0           1
    βq3+2                          1                                  0               /
 dim(L3+2 q )                      5                                  4               /
 rank(L3+2q )                      4                                  4               /
nullity(L3+2
           q )                     1                                  0               /
Spectra(L3+2q )             {0, 1, 2, 4, 5}                      {2, 2, 4, 4}         /
                                              175


                                   Table A.14: K3 → K6 .
      q                           q=0                              q=1              q=2
                   01 12 23 03 24 02 13 
                0   −1 0        0 −1 0 −1 0                  012 023 013 123
                   1 −1 0                                01    1 0        1 0
                1                    0      0  0 −1 
      3+3
    Bq+1                                                12  1 0
                                                                          0 1      /
                2  0    1 −1 0 −1 1               0                            
                                                        23   0 1        0 1 
                3  0    0      1    1      0  0   1 
                                                          03    0 −1 −1 0
                4    0   0      0    0      1  0   0
                                                            01 12 23 03 
                                                         0   −1 0         0 −1
                             0 1 2 3 4                   1  1 −1 0            0 
      Bq3                                                                          /
                           [ 0 0 0 0 0 ]                 2  0
                                                                  1     −1    0 
                                                         3  0     0      1    1 
                                                         4    0    0      0    0
                                                
                         3 −1 −1 −1 0                                        
                                                                 4  0   0   0
                       −1 3 −1 −1 0                          0   4   0   0 
                                                                                     /
                                                
    L3+3
      q
                       −1 −1 4 −1 −1                        
                                                               0
                                                                              
                                                                  0   4   0 
                       −1 −1 −1 3            0 
                                                                 0  0   0   4
                         0     0 −1 0         1
    βq3+3                            1                                0              /
 dim(L3+3 q )                        5                                4              /
 rank(L3+3q )                        4                                4              /
nullity(L3+3
           q )                       1                                0              /
Spectra(L3+3q )               {0, 1, 4, 4, 5}                    {4, 4, 4, 4}        /
                                              176


                                        Table A.15: K4 → K6 .
        q                          q=0                                    q=1                q=2
                    01 12 23 03 24 02 13                             012 023 013 123
                                                                                      
                0   −1 0        0 −1 0 −1 0                     01    1    0      1 0
                1    1 −1 0           0      0  0 −1           12   1    0      0 1 
       4+2
                                                                                              /
                  
     Bq+1                                                                             
                2 
                    0    1 −1 0 −1 1                0 
                                                               23   0
                                                                          1      0 1   
                3   0    0     1     1      0  0    1         03   0   −1 −1 0 
                4    0    0     0     0      1  0    0          24    0    0      0 0
                                                                  01 12 23 03 24
                                                                                          
                                                             0   −1 0         0 −1 0
                              0 1 2 3 4                      1    1 −1 0            0    0 
       Bq4                                                                                    /
                                                               
                                                                                          
                            [ 0 0 0 0 0 ]                    2 
                                                                 0    1 −1 0 −1          
                                                             3   0    0      1     1    0 
                                                             4    0    0      0     0    1
                                                                                      
                         3 −1 −1 −1 0                               4 0      0   0 0
                       −1 3 −1 −1 0                              0 4      0   0 −1 
                                                                                              /
                                                                                      
     L4+2
       q
                      
                       −1 −1 4 −1 −1             
                                                                 
                                                                   0 0      4   0 1    
                       −1 −1 −1 3             0                  0 0      0   4 0 
                         0     0 −1 0          1                    0 −1     1   0 2
     βq4+2                            1                                      0                /
  dim(L4+2 q )                        5                                      5                /
  rank(L4+2q )                        4                                      5                /
 nullity(L4+2
            q )                       1                                      0                /
 Spectra(L4+2
            q )                {0, 1, 4, 4, 5}                   {1.2679, 4, 4, 4, 4.7321}    /
A.2    Parameters in the protein B-factor prediction
                      Table A.16: Fitting parameters from w0 to w5 .
                  r      0             1          2        3       4           5
                 wr 10.6102 0.2026 −0.0031 0.2169 0.3127 0.2815
                     Table A.17: Fitting parameters from w6 to w11 .
                r       6             7        8         9         10           11
                wr −0.4623 1.0203 0.6110 −0.6872 −1.0695 4.4257
                                                  177


                                       APPENDIX B
       SUPPLEMENTARY MATERIALS IN PERSISTENT PATH LAPLACIAN
Table B.1 - Table B.14, we present the detailed matrix constructions, Betti numbers, and
spectra for various digraphs as shown in Figure 4.10 top and bottom panels
Table B.1: Matrix construction of graph G1 (with isolated points included) in the top panel
of Figure 4.10.
                         n                  n=0                  n=1 n=2
                         Ωn        span{e1 , e2 , e3 , e4 , e5 }  {0} {0}
                        Bn+1        5 × 0 empty matrix             /   /
                         Ln          5 × 5 zero matrix             /   /
                         βn                    5                   /   /
                    Spectra(Ln )        {0, 0, 0, 0, 0}            /   /
Table B.2: Matrix construction of graph G1 (without isolated points) in the top panel of
Figure 4.10.
                                 n          n=0 n=1 n=2
                                 Ωn          {0}         {0}      {0}
                               Bn+1            /           /       /
                                 Ln            /           /       /
                                 βn            /           /       /
                            Spectra(Ln )       /           /       /
                                              178


         Table B.3: Matrix construction of graph G2 in the top panel of Figure 4.10.
        n                                n=0                                                      n=1                            n=2
       Ωn                  span{e1 , e2 , e3 , e4 , e5 }                     span{e13 , e25 , e32 , e34 , e45 }                    {0}
                        e13 e25 e32 e34 e45                       
                   e1      −1 0                 0         0      0
                   e2   0 −1 1                           0      0
                                                                                   5 × 0 empty matrix
                                                                                                                                       
     Bn+1                                                                                                                         /
                   e3   1
                                   0         −1        −1       0 
                                                                   
                   e4   0          0           0         1      1 
                   e5       0       1           0         0 −1
                                                                                                                       
                           1       0 −1 0                       0                 2         0 −1 −1 0
                      
                          0       2 −1 0 −1                      
                                                                           
                                                                                 0         2 −1 0 −1                                 
       Ln             
                         −1 −1 3 −1 0                            
                                                                           
                                                                               −1 −1 2                          1   0            /
                          0       0 −1 2 −1                                  −1 0                  1          2   1 
                           0 −1 0 −1 2                                            0 −1 0                         1   2
       βn                                     1                                                       1                             0
Spectra(Ln )           {0, 0.8299, 2, 2.6889, 4.4812}                      {0, 0.8299, 2, 2.6889, 4.4812}                           /
         Table B.4: Matrix construction of graph G3 in the top panel of Figure 4.10.
     n                            n=0                                                  n=1                                  n=2
    Ωn                   span{e1 , e2 , e3 , e4 , e5 }                span{e12 , e13 , e14 , e25 , e32 , e34 , e54 }   span{e132 , e134 }
                                                                                   e132 e134 
                  e12  e13   e14   e25       e32     e34   e54             e12        −1       0
                   −1   −1    −1
                                                                 
              e1                      0        0       0     0              e13    1           1 
              e2  1     0     0    −1         1       0     0              e14    0          −1 
                                                                                                    
   Bn+1                                                                                                              2 × 0 empty matrix
                                                                 
              e3  0     1     0      0       −1      −1     0              e25    0           0 
                                                                                                  
                                                                 
              e4  0     0     1      0        0       1     1             e32    1
                                                                                               0    
                                                                                                     
              e5    0    0     0      1        0       0    −1              e34    0           1 
                                                                            e54         0       0
                                                                                                                  
                                                                         3   0    1     −1       0      0      0
                             −1    −1       −1
                                                         
                        3                             0                 0   4    0      0       0      0      0   
                     −1      3    −1        0      −1                   1   0    3      0       0      0      0
                                                                                                                                 
                                                                                                                            3    1
                                                                                                                 
    Ln               −1     −1     3       −1        0                 −1   0    0      2     −1       0    −1
                                                                                                                
                     −1
                                                                                                                         1    3
                              0    −1        3      −1              
                                                                        0   0    0     −1       3      1      0   
                                                                                                                   
                        0    −1     0       −1        2                 0   0    0      0       1      3      1   
                                                                         0   0    1     −1       0      1      2
    βn                              1                                                     1                                   0
Spectra(Ln )                 {0, 2, 3, 4, 5}                                   {0, 2, 2, 3, 4, 4, 5}                        {2, 4}
                                                                 179


         Table B.5: Matrix construction of graph G4 in the top panel of Figure 4.10.
     n                            n=0                                               n=1                                         n=2
    Ωn                   span{e1 , e2 , e3 , e4 , e5 }           span{e12 , e13 , e14 , e15 , e25 , e32 , e34 , e54 } span{e125 , e132 , e134 , e154 }
                                                                          e125 e132 e134 e154 
                                                                    e12        1 −1 0                 0
                 e12 e13 e14 e15 e25 e32 e34 e54                  e13   0           1       1      0 
             e1   −1 −1 −1 −1 0                    0   0  0                                               
                                                                    e14   0           0     −1      −1    
             e2  1    0    0      0 −1 1              0  0
                                                                                                                         4 × 0 empty matrix
                                                                                                         
   Bn+1                                                           e15   −1 0                0      1    
             e3  0    1    0      0      0 −1 −1 0                                                      
                                                                  e25   1           0       0      0    
             e4  0    0    1      0      0        0   1  1                                              
                                                                    e32   0           1       0      0 
             e5    0   0    0      1      1        0   0 −1                                               
                                                                    e34   0           0       1      0 
                                                                    e54        0       0       0      1
                                                                                                                
                                                                    4   0    1    0     0       0     0 0
                                                                 0   4    0    1     0       0     0 0 
                         4   −1     −1     −1       −1                                                                                         
                      −1
                                                                   1   0    4    0     0       0     0 0                 3    −1     0 −1
                              3     −1       0      −1                                                         
                                                                                                                        −1
                     
                      −1
                                                                  0   1    0    4     0       0     0 0                       3     1 0 
    Ln                       −1      3     −1        0                                                                                        
                     
                      −1
                                                                  0   0    0    0     3      −1     0 −1             0       1     3 1 
                              0     −1       3      −1                                                         
                                                                   0   0    0    0    −1       3     1 0                −1     0     1 3
                        −1   −1      0     −1        3                                                          
                                                                   0   0    0    0     0       1     3 1 
                                                                    0   0    0    0    −1       0     1 3
    βn                               1                                                  1                                          0
Spectra(Ln )                  {0, 3, 3, 5, 5}                               {1, 3, 3, 3, 3, 5, 5, 5}                          {1, 3, 3, 5}
         Table B.6: Matrix construction of graph G5 in the top panel of Figure 4.10.
     n                            n=0                                               n=1                                         n=2
    Ωn                   span{e1 , e2 , e3 , e4 , e5 }           span{e12 , e13 , e14 , e15 , e25 , e32 , e34 , e54 } span{e125 , e132 , e134 , e154 }
                                                                          e125 e132 e134 e154 
                                                                    e12        1 −1 0                 0
                 e12 e13 e14 e15 e25 e32 e34 e54                  e13   0           1       1      0 
             e1   −1 −1 −1 −1 0                    0   0  0                                               
                                                                    e14   0           0     −1      −1    
             e2  1    0    0      0 −1 1              0  0
                                                                                                                         4 × 0 empty matrix
                                                                                                         
   Bn+1                                                           e15   −1 0                0      1    
             e3  0    1    0      0      0 −1 −1 0                                                      
                                                                  e25   1           0       0      0 
             e4  0    0    1      0      0        0   1  1                                              
                                                                    e32   0           1       0      0 
             e5    0   0    0      1      1        0   0 −1                                               
                                                                    e34   0           0       1      0 
                                                                    e54        0       0       0      1
                                                                                                                
                                                                    4   0    1    0     0       0     0 0
                                                                 0   4    0    1     0       0     0 0 
                         4   −1     −1     −1       −1                                                                                         
                      −1
                                                                   1   0    4    0     0       0     0 0                 3    −1     0 −1
                              3     −1       0      −1                                                         
                                                                                                                        −1
                     
                      −1
                                                                  0   1    0    4     0       0     0 0                       3     1 0 
    Ln                       −1      3     −1        0                                                                                        
                     
                      −1
                                                                  0   0    0    0     3      −1     0 −1             0       1     3 1 
                              0     −1       3      −1                                                         
                                                                   0   0    0    0    −1       3     1 0                −1     0     1 3
                        −1   −1      0     −1        3                                                          
                                                                   0   0    0    0     0       1     3 1 
                                                                    0   0    0    0    −1       0     1 3
    βn                               1                                                  0                                          0
Spectra(Ln )                  {0, 3, 3, 5, 5}                               {1, 3, 3, 3, 3, 5, 5, 5}                          {1, 3, 3, 5}
                                                              180


Table B.7: Matrix construction of graph G1 (with isolated points included) in the bottom
panel of Figure 4.10.
                         n                  n=0                  n=1 n=2
                        Ωn         span{e1 , e2 , e3 , e4 , e5 }   /  /
                       Bn+1         5 × 0 empty matrix             /  /
                        Ln           5 × 5 zero matrix             /  /
                        βn                     5                   /  /
                    Spectra(Ln )        {0, 0, 0, 0, 0}            /  /
Table B.8: Matrix construction of graph G1 (without isolated points) in the bottom panel
of Figure 4.10.
                                 n          n=0 n=1 n=2
                                 Ωn          {0}         {0}      {0}
                               Bn+1            /           /       /
                                 Ln            /           /       /
                                 βn            /           /       /
                           Spectra(Ln )        /           /       /
                                              181


Table B.9: Matrix construction of graph G2 (with isolated points included) in the bottom
panel of Figure 4.10.
           n                       n=0                               n=1                 n=2
          Ωn              span{e1 , e2 , e3 , e4 , e5 }     span{e25 , e32 , e34 , e54 }   {0}
                           e25 e32 e34 e54             
                      e1       0    0     0        0
                      e2   −1 1          0        0
                                                             4 × 0 empty matrix
                                                                                              
         Bn+1             
                           0 −1 −1 0
                                                                                           /
                      e3  
                                                        
                                                        
                      e4   0       0     1        1    
                      e5       1    0     0 −1
                                                     
                             0 0      0   0 0                                         
                                                                2   0     1 −2
                            0 2      0   0 −2 
                                                          0      2 −1 0                    
          Ln             
                            0 0      1   1 0        
                                                           
                                                            1 −1 2 −1 
                                                                                           /
                            0 0      1   2 1 
                                                               −2 0 −1 2
                             0 −2     0   1 3
           βn                         2                                 1                   0
      Spectra(Ln )    {0, 0, 0.6571, 2.5293, 4.8136}      {0, 0.6571, 2.5293, 4.8136}       /
Table B.10: Matrix construction of graph G2 (without isolated points) in the bottom panel
of Figure 4.10.
            n                       n=0                              n=1                 n=2
            Ωn               span{e2 , e3 , e4 , e5 }       span{e25 , e32 , e34 , e54 }  {0}
                            e25 e32 e34 e54            
                       e2      −1 1          0      0
                                                             4 × 0 empty matrix
                                                                                              
          Bn+1         e3  0 −1 −1 0
                                                       
                                                                                          /
                       e4  0        0       1      1   
                       e5       1    0       0 −1
                                                                                    
                               2 −1 0 −1                        2 −1 0 −1
                           −1 2 −1 0                      −1 2 −1 0                       
            Ln            
                           0 −1 2 −1 
                                                          
                                                            0
                                                                                          /
                                                                    1     2        1 
                              −1 0 −1 2                        −1 0       1        2
            βn                         1                                1                  0
       Spectra(Ln )               {0, 2, 2, 4}                    {0, 2, 2, 4}             /
                                                    182


Table B.11: Matrix construction of graph G3 (with isolated points included) in the bottom
panel of Figure 4.10.
           n                       n=0                               n=1                 n=2
          Ωn              span{e1 , e2 , e3 , e4 , e5 }     span{e25 , e32 , e34 , e54 }   {0}
                           e25 e32 e34 e54             
                      e1       0    0     0        0
                      e2   −1 1          0        0
                                                             4 × 0 empty matrix
                                                                                              
         Bn+1             
                           0 −1 −1 0
                                                                                           /
                      e3  
                                                        
                                                        
                      e4   0       0     1        1    
                      e5       1    0     0 −1
                                                     
                             0 0      0   0 0                                         
                                                                2   0     1 −2
                            0 2      0   0 −2 
                                                          0      2 −1 0                    
          Ln             
                            0 0      1   1 0        
                                                           
                                                            1 −1 2 −1 
                                                                                           /
                            0 0      1   2 1 
                                                               −2 0 −1 2
                             0 −2     0   1 3
          βn                          2                                 1                   0
      Spectra(Ln )    {0, 0, 0.6571, 2.5293, 4.8136}      {0, 0.6571, 2.5293, 4.8136}       /
Table B.12: Matrix construction of graph G3 (without isolated points) in the bottom panel
of Figure 4.10.
            n                       n=0                              n=1                 n=2
           Ωn                span{e2 , e3 , e4 , e5 }       span{e25 , e32 , e34 , e54 }  {0}
                            e25 e32 e34 e54            
                       e2      −1 1          0      0
                                                             4 × 0 empty matrix
                                                                                              
          Bn+1         e3  0 −1 −1 0
                                                       
                                                                                          /
                       e4  0        0       1      1   
                       e5       1    0       0 −1
                                                                                    
                               2 −1 0 −1                        2 −1 0 −1
                           −1 2 −1 0                      −1 2 −1 0                       
           Ln             
                           0 −1 2 −1 
                                                          
                                                            0
                                                                                          /
                                                                    1     2        1 
                              −1 0 −1 2                        −1 0       1        2
           βn                          1                                1                  0
       Spectra(Ln )               {0, 2, 2, 4}                    {0, 2, 2, 4}             /
                                                    183


      Table B.13: Matrix construction of graph G4 in the bottom panel of Figure 4.10.
        n                                     n=0                                                     n=1                                   n=2
       Ωn                    span{e1 , e2 , e3 , e4 , e5 }                     span{e13 , e25 , e32 , e34 , e45 }                             {0}
                          e13 e25 e32 e34 e45                    
                     e1      −1 0                    0    0    0
                     e2   0 −1 1                         0    0
                                                                                      5 × 0 empty matrix
                                                                                                                                                    
     Bn+1                                                                                                                                      /
                     e3   1
                                        0 −1 −1 0                
                                                                  
                     e4   0             0           0    1    1  
                     e5        0         1           0    0 −1
                                                                                                                               
                             1        0 −1 0                  0                     2           0 −1 −1 0
                        
                            0        2 −1 0 −1                 
                                                                            
                                                                                   0           2 −1 0 −1                                          
       Ln               
                           −1 −1 3 −1 0                        
                                                                            
                                                                                  −1 −1 2                            1      0                 /
                            0        0 −1 2 −1                                  −1 0                    1          2      1 
                             0 −1 0 −1 2                                            0 −1 0                            1      2
       βn                                          1                                                       1                                     0
Spectra(Ln )            {0, 0.8299, 2, 2.6889, 4.4812}                       {0, 0.8299, 2, 2.6889, 4.4812}                                     /
      Table B.14: Matrix construction of graph G5 in the bottom panel of Figure 4.10.
     n                            n=0                                               n=1                                           n=2
    Ωn                    span{e1 , e2 , e3 , e4 , e5 }          span{e12 , e13 , e14 , e15 , e25 , e32 , e34 , e54 }   span{e125 , e132 , e134 , e154 }
                                                                             e125 e132 e134 e154
                                                                     e12       1 −1 0
                                                                                                          
                                                                                                      0
                  e12 e13 e14 e15 e25 e32 e34 e54
                                                                   e13   0          1      1       0 
             e1   −1 −1 −1 −1 0                    0    0  0                                              
                                                                     e14   0          0 −1 −1 
             e2    1    0    0     0 −1 1               0  0
                                                                                                                           4 × 0 empty matrix
                                                                                                        
   Bn+1                                                            e15   −1 0              0       1 
             e3   0    1    0     0       0 −1 −1 0                                                     
                                                                   e25   1          0      0       0 
             e4   0    0    1     0       0       0    1  1                                             
                                                                     e32   0          1      0       0 
             e5    0    0    0     1       1       0    0 −1                                              
                                                                     e34   0          0      1       0 
                                                                     e54       0       0      0       1
                                                                                                                
                                                                     4   0   1    0 0          0 0 0
                                                                  0   4   0    1 0          0 0 0 
                          4 −1 −1 −1 −1                                                                                                         
                       −1 3 −1 0 −1 
                                                                  
                                                                    1   0   4    0 0          0 0 0                       3 −1 0 −1
                                                                  0   1   0    4 0          0 0 0                     −1 3 1 0 
    Ln                 −1 −1 3 −1 0                                                                                                           
                      
                       −1 0 −1 3 −1 
                                                                 
                                                                    0   0   0    0 3 −1 0 −1                   
                                                                                                                          0       1 3 1 
                                                                    0   0   0    0 −1 3 1 0                                −1 0 1 3
                         −1 −1 0 −1 3                                                                           
                                                                    0   0   0    0 0          1 3 1 
                                                                     0   0   0    0 −1 0 1 3
    βn                               1                                                  0                                            0
Spectra(Ln )                  {0, 3, 3, 5, 5}                              {1, 3, 3, 3, 3, 5, 5, 5}                             {1, 3, 3, 5}
                                                               184


BIBLIOGRAPHY
      185


                                     BIBLIOGRAPHY
[1]  B. L. Zhang, C. H. Xu, C. Z. Wang, C. T. Chan, and K. M. Ho. Systematic study of
     structures and stabilities of fullerenes. Physical Review B, 46(11):7333–7336, 1992.
[2]  Rui Wang, Jiahui Chen, Kaifu Gao, and Guo-Wei Wei. Vaccine-escape and fast-
     growing mutations in the united kingdom, the united states, singapore, spain, in-
     dia, and other covid-19-devastated countries. Genomics, 113(4):2158–2170, 2021.
[3]  Thomas W Linsky, Renan Vergara, Nuria Codina, Jorgen W Nelson, Matthew J
     Walker, Wen Su, Christopher O Barnes, Tien-Ying Hsiang, Katharina Esser-Nobis,
     Kevin Yu, et al. De novo design of potent and resilient hACE2 decoys to neutralize
     SARS-CoV-2. Science, 370(6521):1208–1214, 2020.
[4]  Kelin Xia and Guo-Wei Wei. Persistent homology analysis of protein structure,
     flexibility, and folding. International journal for numerical methods in biomedical engi-
     neering, 30(8):814–844, 2014.
[5]  Jacob Townsend, Cassie Putman Micucci, John H Hymel, Vasileios Maroulas, and
     Konstantinos D Vogiatzis. Representation of molecular structures with persistent
     homology for machine learning applications in chemistry. Nature communications,
     11(1):1–9, 2020.
[6]  Duc Duy Nguyen, Zixuan Cang, Kedi Wu, Menglun Wang, Yin Cao, and Guo-
     Wei Wei. Mathematical deep learning for pose and binding affinity prediction and
     ranking in d3r grand challenges. Journal of computer-aided molecular design, 33(1):71–
     82, 2019.
[7]  Primoz Skraba, Maks Ovsjanikov, Frederic Chazal, and Leonidas Guibas.
     Persistence-based segmentation of deformable shapes. In 2010 IEEE Computer So-
     ciety Conference on Computer Vision and Pattern Recognition-Workshops, pages 45–52.
     IEEE, 2010.
[8]  Jozef Dodziuk. de Rham-Hodge theory for L2-cohomology of infinite coverings.
     Topology, 16(2):157–165, 1977.
[9]  Jiahui Chen, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Evolutionary de
     rham-hodge method. Discrete and continuous dynamical systems. Series B, 26(7):3785,
     2021.
[10] Mark Kac. Can one hear the shape of a drum? The american mathematical monthly,
     73(4P2):1–23, 1966.
[11] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Persistent spectral graph. Interna-
     tional Journal for Numerical Methods in Biomedical Engineering, page e3376, 2020.
                                             186


[12] Alexander Grigor’yan, Yong Lin, Yuri Muranov, and Shing-Tung Yau. Homologies
     of path complexes and digraphs. arXiv preprint arXiv:1207.2834, 2012.
[13] Samir Chowdhury and Facundo Mémoli. Persistent path homology of directed net-
     works. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete
     Algorithms, pages 1152–1169. SIAM, 2018.
[14] Dafydd R Owen, Charlotte MN Allerton, Annaliesa S Anderson, Lisa Aschenbren-
     ner, Melissa Avery, Simon Berritt, Britton Boras, Rhonda D Cardin, Anthony Carlo,
     Karen J Coffman, et al. An oral sars-cov-2 mpro inhibitor clinical candidate for the
     treatment of covid-19. Science, 374(6575):1586–1593, 2021.
[15] Kaifu Gao, Rui Wang, Jiahui Chen, Jetze J Tepe, Faqing Huang, and Guo-Wei Wei.
     Perspectives on sars-cov-2 main protease inhibitors. Journal of medicinal chemistry,
     64(23):16922–16955, 2021.
[16] Matthew D Shin, Sourabh Shukla, Young Hun Chung, Veronique Beiss, Soo Khim
     Chan, Oscar A Ortega-Rivera, David M Wirth, Angela Chen, Markus Sack,
     Jonathan K Pokorski, et al. COVID-19 vaccine development and a potential nano-
     material path forward. Nature Nanotechnology, pages 1–10, 2020.
[17] Michael Day. COVID-19: four fifths of cases are asymptomatic, China figures indi-
     cate. BMJ, 369, 2020.
[18] Quan-Xin Long, Xiao-Jun Tang, Qiu-Lin Shi, Qin Li, Hai-Jun Deng, Jun Yuan, Jie-Li
     Hu, Wei Xu, Yong Zhang, Fa-Jin Lv, et al. Clinical and immunological assessment
     of asymptomatic SARS-CoV-2 infections. Nature medicine, 26(8):1200–1204, 2020.
[19] Rui Wang, Jiahui Chen, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. De-
     coding asymptomatic COVID-19 infection and transmission. The journal of physical
     chemistry letters, 11(23):10007–10015, 2020.
[20] Stephen M Kissler, Christine Tedijanto, Edward Goldstein, Yonatan H Grad, and
     Marc Lipsitch. Projecting the transmission dynamics of SARS-CoV-2 through the
     postpandemic period. Science, 368(6493):860–868, 2020.
[21] Changchuan Yin. Genotyping coronavirus SARS-CoV-2: methods and implica-
     tions. Genomics, 112(5):3588–3596, 2020.
[22] Rui Wang, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei.               Mutations on
     COVID-19 diagnostic targets. Genomics, 112(6):5204–5213, 2020.
[23] Rui Wang, Jiahui Chen, Kaifu Gao, Yuta Hozumi, Changchuan Yin, and Guo-Wei
     Wei. Analysis of SARS-CoV-2 mutations in the united states suggests presence of
     four substrains and novel variants. Communications biology, 4(1):1–14, 2021.
[24] Rui Wang, Yuta Hozumi, Yong-Hui Zheng, Changchuan Yin, and Guo-Wei Wei.
     Host immune response driving SARS-CoV-2 evolution. Viruses, 12(10):1095, 2020.
                                            187


[25] Rui Wang, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Decoding SARS-
     CoV-2 Transmission and Evolution and Ramifications for COVID-19 Diagnosis,
     Vaccine, and Medicine. Journal of Chemical Information and Modeling, 2020. PMID:
     32530284.
[26] Nanshan Chen, Min Zhou, Xuan Dong, Jieming Qu, Fengyun Gong, Yang Han,
     Yang Qiu, Jingli Wang, Ying Liu, Yuan Wei, et al. Epidemiological and clinical
     characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China:
     a descriptive study. The Lancet, 395(10223):507–513, 2020.
[27] Roujian Lu, Xiang Zhao, Juan Li, Peihua Niu, Bo Yang, Honglong Wu, Wenling
     Wang, Hao Song, Baoying Huang, Na Zhu, et al. Genomic characterisation and
     epidemiology of 2019 novel coronavirus: implications for virus origins and receptor
     binding. The Lancet, 395(10224):565–574, 2020.
[28] Alexandra C Walls, Young-Jun Park, M Alejandra Tortorici, Abigail Wall, Andrew T
     McGuire, and David Veesler. Structure, function, and antigenicity of the SARS-
     CoV-2 spike glycoprotein. Cell, 181(2):281–292, 2020.
[29] Daniel Wrapp, Nianshuang Wang, Kizzmekia S Corbett, Jory A Goldsmith, Ching-
     Lin Hsieh, Olubukola Abiona, Barney S Graham, and Jason S McLellan. Cryo-
     EM structure of the 2019-nCoV spike in the prefusion conformation. Science,
     367(6483):1260–1263, 2020.
[30] Christian Jean Michel, Claudine Mayer, Olivier Poch, and Julie Dawn Thomp-
     son. Characterization of accessory genes in coronavirus genomes. Virology journal,
     17(1):1–13, 2020.
[31] Yosra A Helmy, Mohamed Fawzy, Ahmed Elaswad, Ahmed Sobieh, Scott P Kenney,
     and Awad A Shehata. The COVID-19 pandemic: a comprehensive review of tax-
     onomy, genetics, epidemiology, diagnosis, treatment, and control. Journal of Clinical
     Medicine, 9(4):1225, 2020.
[32] Ahmad Abu Turab Naqvi, Kisa Fatima, Taj Mohammad, Urooj Fatima, Indrakant K
     Singh, Archana Singh, Shaikh Muhammad Atif, Gururao Hariprasad, Gu-
     lam Mustafa Hasan, and Md Imtaiyaz Hassan. Insights into SARS-CoV-2 genome,
     structure, evolution, pathogenesis and therapies: Structural genomics approach.
     Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease, 1866(10):165878, 2020.
[33] Jingfang Mu, Yaohui Fang, Qi Yang, Ting Shu, An Wang, Muhan Huang, Liang
     Jin, Fei Deng, Yang Qiu, and Xi Zhou. SARS-CoV-2 N protein antagonizes type I
     interferon signaling by suppressing phosphorylation and nuclear translocation of
     STAT1 and STAT2. Cell discovery, 6(1):1–4, 2020.
[34] Canrong Wu, Yang Liu, Yueying Yang, Peng Zhang, Wu Zhong, Yali Wang, Qiqi
     Wang, Yang Xu, Mingxue Li, Xingzhou Li, et al. Analysis of therapeutic targets
     for SARS-CoV-2 and discovery of potential drugs by computational methods. Acta
     Pharmaceutica Sinica B, 10(5):766–788, 2020.
                                           188


[35] Dongwan Kim, Joo-Yeon Lee, Jeong-Sun Yang, Jun Won Kim, V Narry Kim, and
     Hyeshik Chang. The architecture of SARS-CoV-2 transcriptome. Cell, 181(4):914–
     921, 2020.
[36] Shutoku Matsuyama, Naganori Nao, Kazuya Shirato, Miyuki Kawase, Shinji Saito,
     Ikuyo Takayama, Noriyo Nagata, Tsuyoshi Sekizuka, Hiroshi Katoh, Fumihiro
     Kato, et al. Enhanced isolation of SARS-CoV-2 by TMPRSS2-expressing cells. Pro-
     ceedings of the National Academy of Sciences, 117(13):7001–7003, 2020.
[37] Markus Hoffmann, Hannah Kleine-Weber, Simon Schroeder, Nadine Krüger, Tanja
     Herrler, Sandra Erichsen, Tobias S Schiergens, Georg Herrler, Nai-Huei Wu, An-
     dreas Nitsche, et al. SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is
     blocked by a clinically proven protease inhibitor. cell, 181(2):271–280, 2020.
[38] Philip V’kovski, Annika Kratzel, Silvio Steiner, Hanspeter Stalder, and Volker Thiel.
     Coronavirus biology and replication: implications for SARS-CoV-2. Nature Reviews
     Microbiology, pages 1–16, 2020.
[39] Jiahui Chen, Kaifu Gao, Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Review of
     covid-19 antibody therapies. Annual review of biophysics, 50:1–30, 2021.
[40] Menglun Wang, Zixuan Cang, and Guo-Wei Wei. A topology-based network tree
     for the prediction of protein–protein binding affinity changes following mutation.
     Nature Machine Intelligence, 2(2):116–123, 2020.
[41] Jiahui Chen, Rui Wang, Menglun Wang, and Guo-Wei Wei. Mutations strengthened
     SARS-CoV-2 infectivity. Journal of molecular biology, 432(19):5212–5226, 2020.
[42] Peter Richardson, Ivan Griffin, Catherine Tucker, Dan Smith, Olly Oechsle, Anne
     Phelan, Michael Rawling, Edward Savory, and Justin Stebbing. Baricitinib as po-
     tential treatment for 2019-ncov acute respiratory disease. Lancet (London, England),
     395(10223):e30, 2020.
[43] Herbert Edelsbrunner and John Harer. Persistent homology-a survey. Contemporary
     mathematics, 453:257–282, 2008.
[44] Zixuan Cang and Guo-Wei Wei. Topologynet: Topology based deep convolutional
     and multi-task neural networks for biomolecular property predictions. PLoS com-
     putational biology, 13(7):e1005690, 2017.
[45] Daniel Hernández Serrano and Darío Sánchez Gómez. Centrality measures in
     simplicial complexes: applications of tda to network science. arXiv preprint
     arXiv:1908.02967, 2019.
[46] Slobodan Maletić and Milan Rajković. Consensus formation on a simplicial com-
     plex of opinions. Physica A: Statistical Mechanics and its Applications, 397(March):111–
     120, 2014.
                                             189


[47] Herbert Edelsbrunner. Alpha shapes—a survey. Tessellations in the Sciences, 27:1–25,
     2010.
[48] Georges Voronoi. Nouvelles applications des paramètres continus à la théorie
     des formes quadratiques. premier mémoire. sur quelques propriétés des formes
     quadratiques positives parfaites. Journal für die reine und angewandte Mathematik,
     1908(133):97–102, 1908.
[49] Boris Delaunay et al. Sur la sphere vide. Izv. Akad. Nauk SSSR, Otdelenie Matem-
     aticheskii i Estestvennyka Nauk, 7(793-800):1–2, 1934.
[50] Franz Aurenhammer, Rolf Klein, and Der-Tsai Lee. Voronoi diagrams and Delaunay
     triangulations. World Scientific Publishing Company, 2013.
[51] Jude May. Multivariate analysis. Scientific e-Resources, 2018.
[52] Beno Eckmann. Harmonische funktionen und randwertaufgaben in einem kom-
     plex. Commentarii Mathematici Helvetici, 17(1):240–255, 1944.
[53] Daniel Hernández Serrano and Darío Sánchez Gómez. Higher order degree in sim-
     plicial complexes, multi combinatorial laplacian and applications of tda to complex
     networks. arXiv preprint arXiv:1908.02583, 2019.
[54] Franz W Kamber and Philippe Tondeur. de rham-hodge theory for riemannian
     foliations. Mathematische Annalen, 277(3):415–431, 1987.
[55] Rundong Zhao, Menglun Wang, Jiahui Chen, Yiying Tong, and Guo-Wei Wei. The
     de Rham–Hodge Analysis and Modeling of Biomolecules. Bulletin of Mathematical
     Biology, 82(8):1–38, 2020.
[56] Jiahui Chen, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Evolutionary de
     Rham-hodge method. Discrete & Continuous Dynamical Systems-B, 2020.
[57] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale gaussian network
     model (mgnm) and multiscale anisotropic network model (manm). The Journal of
     chemical physics, 143(20):11B616_1, 2015.
[58] Marcel Berger. Geometry i. Springer Science & Business Media, 2009.
[59] AA Grigor’yan, Yong Lin, Yu V Muranov, and Shing-Tung Yau. Path complexes
     and their homologies. Journal of Mathematical Sciences, 248(5):564–599, 2020.
[60] Alexander Grigor’yan, Yong Lin, Yuri Muranov, and Shing-Tung Yau. Cohomology
     of digraphs and (undirected) graphs. Asian Journal of Mathematics, 19(5):887–932,
     2015.
[61] Gary Chartrand. Introductory graph theory. Courier Corporation, 1977.
[62] André Gomes and Daniel Miranda. Path cohomology of locally finite digraphs,
     hodge’s theorem and the p-lazy random walk. arXiv preprint arXiv:1906.04781, 2019.
                                            190


[63] Alexander Grigor’yan, Yong Lin, Yuri Muranov, and Shing-Tung Yau. Homotopy
     theory for digraphs. arXiv preprint arXiv:1407.0234, 2014.
[64] Danijela Horak and Jürgen Jost. Spectra of combinatorial laplace operators on sim-
     plicial complexes. Advances in Mathematics, 244:303–336, 2013.
[65] Martin Gollery. Bioinformatics: Sequence and genome analysis, david w. mount.
     cold spring harbor, ny: Cold spring harbor laboratory press, 2004, 692 pp. isbn 0-
     87969-712-1. Clinical Chemistry, 51(11):2219–2219, 2005.
[66] W John Wilbur and David J Lipman. Rapid similarity searches of nucleic acid and
     protein data banks. Proceedings of the National Academy of Sciences, 80(3):726–730,
     1983.
[67] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lip-
     man. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410,
     1990.
[68] Jian Ye, Scott McGinnis, and Thomas L Madden. Blast: improvements for better
     sequence analysis. Nucleic acids research, 34(suppl_2):W6–W9, 2006.
[69] David W Mount. Using the basic local alignment search tool (blast). Cold Spring
     Harbor Protocols, 2007(7):pdb–top17, 2007.
[70] Tao Zhang, Qunfu Wu, and Zhigang Zhang. Probable pangolin origin of sars-cov-2
     associated with the covid-19 outbreak. Current Biology, 2020.
[71] Kangpeng Xiao, Junqiong Zhai, Yaoyu Feng, Niu Zhou, Xu Zhang, Jie-Jian Zou,
     Na Li, Yaqiong Guo, Xiaobing Li, Xuejuan Shen, et al. Isolation of sars-cov-2-related
     coronavirus from malayan pangolins. Nature, pages 1–4, 2020.
[72] Hongru Wang, Lenore Pipes, and Rasmus Nielsen. Synonymous mutations and the
     molecular evolution of sars-cov-2 origins. Virus evolution, 7(1):veaa098, 2021.
[73] Giuseppina La Rosa, Pamela Mancini, Giusy Bonanno Ferraro, Carolina Veneri,
     Marcello Iaconelli, Lucia Bonadonna, Luca Lucentini, and Elisabetta Suffredini.
     Sars-cov-2 has been circulating in northern italy since december 2019: Evidence
     from environmental monitoring. Science of the total environment, 750:141711, 2021.
[74] Ranjit Sah, Alfonso J Rodriguez-Morales, Runa Jha, Daniel KW Chu, Haogao Gu,
     Malik Peiris, Anup Bastola, Bibek Kumar Lal, Hemant Chanda Ojha, Ali A Rabaan,
     et al. Complete genome sequence of a 2019 novel coronavirus (sars-cov-2) strain
     isolated in nepal. Microbiology resource announcements, 9(11):e00169–20, 2020.
[75] Giuseppina La Rosa, Marcello Iaconelli, Pamela Mancini, Giusy Bonanno Ferraro,
     Carolina Veneri, Lucia Bonadonna, Luca Lucentini, and Elisabetta Suffredini. First
     detection of sars-cov-2 in untreated wastewaters in italy. Science of The Total Envi-
     ronment, 736:139652, 2020.
                                           191


[76] Sandra Westhaus, Frank-Andreas Weber, Sabrina Schiwy, Volker Linnemann,
     Markus Brinkmann, Marek Widera, Carola Greve, Axel Janke, Henner Hollert,
     Thomas Wintgens, et al. Detection of sars-cov-2 in raw and treated wastewater
     in germany–suitability for covid-19 surveillance and potential transmission risks.
     Science of The Total Environment, 751:141750, 2021.
[77] Coronaviridae Study Group of the International et al. The species severe acute res-
     piratory syndrome-related coronavirus: classifying 2019-ncov and naming it sars-
     cov-2. Nature Microbiology, 5(4):536, 2020.
[78] Desmond G Higgins and Paul M Sharp. Clustal: a package for performing multiple
     sequence alignment on a microcomputer. Gene, 73(1):237–244, 1988.
[79] Robert C Edgar. Muscle: a multiple sequence alignment method with reduced time
     and space complexity. BMC bioinformatics, 5(1):113, 2004.
[80] Kazutaka Katoh, George Asimenos, and Hiroyuki Toh. Multiple alignment of dna
     sequences with mafft. In Bioinformatics for DNA sequence analysis, pages 39–64.
     Springer, 2009.
[81] Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata. Mafft: a
     novel method for rapid multiple sequence alignment based on fast fourier trans-
     form. Nucleic acids research, 30(14):3059–3066, 2002.
[82] Julie D Thompson, Desmond G Higgins, and Toby J Gibson. Clustal w: improv-
     ing the sensitivity of progressive multiple sequence alignment through sequence
     weighting, position-specific gap penalties and weight matrix choice. Nucleic acids
     research, 22(22):4673–4680, 1994.
[83] Mark A Larkin, Gordon Blackshields, Nigel P Brown, R Chenna, Paul A McGetti-
     gan, Hamish McWilliam, Franck Valentin, Iain M Wallace, Andreas Wilm, Rodrigo
     Lopez, et al. Clustal w and clustal x version 2.0. bioinformatics, 23(21):2947–2948,
     2007.
[84] Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method
     for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4):406–425,
     1987.
[85] Gordon Blackshields, Fabian Sievers, Weifeng Shi, Andreas Wilm, and Desmond G
     Higgins. Sequence embedding for fast construction of guide trees for multiple se-
     quence alignment. Algorithms for Molecular Biology, 5(1):21, 2010.
[86] Johannes Söding. Protein homology detection by hmm–hmm comparison. Bioin-
     formatics, 21(7):951–960, 2005.
[87] Michael Levandowsky and David Winter.             Distance between sets.      Nature,
     234(5323):34–35, 1971.
                                            192


[88] Thomas CoVer and Peter Hart. Nearest neighbor pattern classification. IEEE trans-
      actions on information theory, 13(1):21–27, 1967.
[89] Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric
      regression. The American Statistician, 46(3):175–185, 1992.
[90] Yuelong Shu and John McCauley. GISAID: Global initiative on sharing all influenza
      data–from vision to reality. Eurosurveillance, 22(13):30494, 2017.
[91] Fan Wu, Su Zhao, Bin Yu, Yan-Mei Chen, Wen Wang, Zhi-Gang Song, Yi Hu, Zhao-
      Wu Tao, Jun-Hua Tian, Yuan-Yuan Pei, et al. A new coronavirus associated with
      human respiratory disease in China. Nature, 579(7798):265–269, 2020.
[92] Sobin Kim and Ashish Misra. Snp genotyping: technologies and biomedical appli-
      cations. Annu. Rev. Biomed. Eng., 9:289–320, 2007.
[93] Justina Jankauskaitė, Brian Jiménez-García, Justas Dapkūnas, Juan Fernández-
      Recio, and Iain H Moal. SKEMPI 2.0: an updated benchmark of changes in protein–
      protein binding energy, kinetics and thermodynamics upon mutation. Bioinformat-
      ics, 35(3):462–469, 2019.
[94] Sarah Sirin, James R Apgar, Eric M Bennett, and Amy E Keating. AB-Bind: antibody
      binding mutational database for computational affinity predictions. Protein Science,
      25(2):393–409, 2016.
[95] Sherlyn Jemimah, K Yugandhar, and M Michael Gromiha. Proximate: a database
      of mutant protein–protein complex thermodynamics and kinetics. Bioinformatics,
      33(17):2787–2788, 2017.
[96] Quanya Liu, Peng Chen, Bing Wang, Jun Zhang, and Jinyan Li. dbmpikt: a database
      of kinetic and thermodynamic mutant protein interactions. Bmc Bioinformatics,
      19(1):1–7, 2018.
[97] Erik Procko. The sequence of human ace2 is suboptimal for binding the s spike
      protein of sars coronavirus 2. BioRxiv, 2020.
[98] Tyler N Starr, Allison J Greaney, Sarah K Hilton, Daniel Ellis, Katharine HD Craw-
      ford, Adam S Dingens, Mary Jane Navarro, John E Bowen, M Alejandra Tortorici,
      Alexandra C Walls, et al. Deep mutational scanning of SARS-CoV-2 receptor bind-
      ing domain reveals constraints on folding and ACE2 binding. Cell, 182(5):1295–
      1310, 2020.
[99] Jiahui Chen, Kaifu Gao, Rui Wang, and Guo-Wei Wei. Revealing the threat of
      emerging sars-cov-2 mutations to antibody therapies. Journal of molecular biology,
      433(18):167155, 2021.
[100] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society,
      46(2):255–308, 2009.
                                              193


[101] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persis-
      tence and simplification. In Proceedings 41st Annual Symposium on Foundations of
      Computer Science, pages 454–463. IEEE, 2000.
[102] Jiahui Chen, Kaifu Gao, Rui Wang, and Guo-Wei Wei. Prediction and mitigation
      of mutation threats to covid-19 vaccines and antibody therapies. Chemical science,
      12(20):6929–6948, 2021.
[103] Delphine C Bas, David M Rogers, and Jan H Jensen. Very fast prediction and ratio-
      nalization of pka values for protein–ligand complexes. Proteins: Structure, Function,
      and Bioinformatics, 73(3):765–783, 2008.
[104] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang,
      Zheng Zhang, Webb Miller, and David J Lipman. Gapped blast and psi-blast: a new
      generation of protein database search programs. Nucleic acids research, 25(17):3389–
      3402, 1997.
[105] Yuedong Yang, Rhys Heffernan, Kuldip Paliwal, James Lyons, Abdollah Dehzangi,
      Alok Sharma, Jihua Wang, Abdul Sattar, and Yaoqi Zhou. Spider2: A package
      to predict secondary structure, accessible surface area, and main-chain torsional
      angles by deep neural networks. In Prediction of protein secondary structure, pages
      55–63. Springer, 2017.
[106] Beibei Liu, Bao Wang, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. Eses: soft-
      ware for e ulerian solvent excluded surface, 2017.
[107] Todd J Dolinsky, Jens E Nielsen, J Andrew McCammon, and Nathan A Baker.
      Pdb2pqr: an automated pipeline for the setup of poisson–boltzmann electrostatics
      calculations. Nucleic acids research, 32(suppl_2):W665–W667, 2004.
[108] David A Case, Tom A Darden, Thomas E Cheatham, Carlos L Simmerling, Junmei
      Wang, Robert E Duke, Ray Luo, MRCW Crowley, Ross C Walker, Wei Zhang, et al.
      Amber 10. Technical report, University of California, 2008.
[109] Bernard R Brooks, Charles L Brooks III, Alexander D Mackerell Jr, Lennart Nilsson,
      Robert J Petrella, Benoît Roux, Youngdo Won, Georgios Archontis, Christian Bar-
      tels, Stefan Boresch, et al. Charmm: the biomolecular simulation program. Journal
      of computational chemistry, 30(10):1545–1614, 2009.
[110] Duan Chen, Zhan Chen, Changjun Chen, Weihua Geng, and Guo-Wei Wei. Mibpb:
      a software package for electrostatic analysis. Journal of computational chemistry,
      32(4):756–770, 2011.
[111] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand
      Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent
      Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine
      Learning research, 12:2825–2830, 2011.
                                              194


[112] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
      Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
      The journal of machine learning research, 15(1):1929–1958, 2014.
[113] Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their ap-
      plications. Bulletin of the American Mathematical Society, 43(4):439–561, 2006.
[114] Fan Chung. Laplacians and the cheeger inequality for directed graphs. Annals of
      Combinatorics, 9(1):1–19, 2005.
[115] Fan R. K. Chung. Spectral Graph Theory. AMS, 1997.
[116] Robert Grone, Russell Merris, and V S_ Sunder. The laplacian spectrum of a graph.
      SIAM Journal on Matrix Analysis and Applications, 11(2):218–238, 1990.
[117] Stephen J. Kirkland, Jason J. Molitierno, Michael Neumann, and Bryan L. Shader.
      On graphs with equal algebraic and vertex connectivity. Linear Algebra and its Ap-
      plications, 341(1-3):45–56, 2002.
[118] Xiao-Dong Zhang. The laplacian eigenvalues of graphs: a survey. arXiv preprint
      arXiv:1111.2897, 2011.
[119] Chengyuan Wu, Shiquan Ren, Jie Wu, and Kelin Xia. Weighted (co) homology and
      weighted laplacian. arXiv preprint arXiv:1804.06990, 2018.
[120] Timothy E Goldberg. Combinatorial laplacians of simplicial complexes. Senior The-
      sis, Bard College, 2002.
[121] Patrizio Frosini. Measuring shapes by size functions. In Intelligent Robots and Com-
      puter Vision X: Algorithms and Techniques, volume 1607, pages 122–133. International
      Society for Optics and Photonics, 1992.
[122] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete
      & Computational Geometry, 33(2):249–274, 2005.
[123] Konstantin Mischaikow and Vidit Nanda. Morse theory for filtrations and efficient
      computation of persistent homology. Discrete & Computational Geometry, 50(2):330–
      353, 2013.
[124] Gunnar Carlsson, Vin De Silva, and Dmitriy Morozov. Zigzag persistent homology
      and real-valued functions. In Proceedings of the twenty-fifth annual symposium on
      Computational geometry, pages 247–256. ACM, 2009.
[125] Vin De Silva and Robert Ghrist. Coverage in sensor networks via persistent homol-
      ogy. Algebraic & Geometric Topology, 7(1):339–358, 2007.
[126] Y. Yao, J. Sun, X. H. Huang, G. R. Bowman, G. Singh, M. Lesnick, L. J. Guibas, V. S.
      Pande, and G. Carlsson. Topological methods for exploring low-density states in
      biomolecular folding pathways. The Journal of Chemical Physics, 130:144115, 2009.
                                             195


[127] Peter Bubenik and Jonathan A Scott. Categorification of persistent homology. Dis-
      crete & Computational Geometry, 51(3):600–627, 2014.
[128] Tamal K Dey, Fengtao Fan, and Yusu Wang. Computing topological persistence
      for simplicial maps. In Proceedings of the thirtieth annual symposium on Computational
      geometry, page 345. ACM, 2014.
[129] K. L. Xia and G. W. Wei. Persistent homology analysis of protein structure, flexibil-
      ity and folding. International Journal for Numerical Methods in Biomedical Engineering,
      30:814–844, 2014.
[130] Ramón García-Domenech, Jorge Gálvez, Jesus V. de Julián-Ortiz, and Lionello
      Pogliani. Some new trends in chemical graph theory. Chemical Reviews, 108(3):1127–
      1169, 2008.
[131] K. Balasubramanian. Applications of Combinatorics and Graph Theory to Spec-
      troscopy and Quantum Chemistry. Chemical Reviews, 85(6):599–618, 1985.
[132] Ivan Gutman and Nenad Trinajstić. Graph theory and molecular orbitals. total φ-
      electron energy of alternant hydrocarbons. Chemical Physics Letters, 17(4):535–538,
      1972.
[133] Ivet Bahar, Ali Rana Atilgan, and Burak Erman. Direct evaluation of thermal fluctu-
      ations in proteins using a single-parameter harmonic potential. Folding and Design,
      2(3):173–181, 1997.
[134] A. R. Atilgan, S. R. Durell, R. L. Jernigan, M. C. Demirel, O. Keskin, and I. Bahar.
      Anisotropy of fluctuation dynamics of proteins with an elastic network model. Bio-
      physical Journal, 80(1):505–515, 2001.
[135] Ivet Bahar, Ali Rana Atilgan, Melik C. Demirel, and Burak Erman. Vibrational dy-
      namics of folded proteins: Significance of slow and fast motions in relation to func-
      tion and stability. Physical Review Letters, 80(12):2733–2736, 1998.
[136] Kristopher Opron, Kelin Xia, and Guo Wei Wei. Communication: Capturing protein
      multiscale thermal fluctuations, 2015.
[137] David Bramer and Guo-Wei Wei. Multiscale weighted colored graphs for protein
      flexibility and rigidity analysis. The Journal of chemical physics, 148(5):054103, 2018.
[138] Duc Nguyen and Guo-Wei Wei. Agl-score: Algebraic graph learning score for
      protein-ligand binding scoring, ranking, docking, and screening. Journal of Chemical
      Information and Modeling, 2019.
[139] H.W. Kroto, J.R. Heath, S.C. O’Brien, R.F. Curl, and R E Smalley. C<sub>60</sub>:
      Buckminsterfullerene. Nature, 318(14):162–163, 1985.
[140] W. Krätschmer, Lowell D. Lamb, K. Fostiropoulos, and Donald R. Huffman. Solid
      C60: a new form of carbon. Nature, 347(6291):354–358, 1990.
                                              196


[141] B C Yadav and Ritesh Kumar. Structure , properties and applications of fullerenes.
      International Journal of Nanotechnology and Applications ISSN, 0973(1):15–24, 2008.
[142] Kelin Xia, Xin Feng, Yiying Tong, and Guo Wei Wei. Persistent homology for
      the quantitative prediction of fullerene stability. Journal of computational chemistry,
      36(6):408–422, 2015.
[143] Kelin Xia and Guo-Wei Wei. Persistent homology analysis of protein structure,
      flexibility, and folding. International Journal for Numerical Methods in Biomedical En-
      gineering, (June):814–844, 2014.
[144] B. L. Zhang, C. Z. Wang, K. M. Ho, C. H. Xu, and C. T. Chan. The geometry of small
      fullerene cages: C20 to C70. The Journal of Chemical Physics, 97(7):5007–5011, 1992.
[145] David Bramer and Guo-Wei Wei. Blind prediction of protein b-factor and flexibility.
      The Journal of chemical physics, 149(13):134107, 2018.
[146] Kristopher Opron, Kelin Xia, and Guo-Wei Wei. Fast and anisotropic flexibility-
      rigidity index for protein flexibility and fluctuation analysis. The Journal of chemical
      physics, 140(23):06B617_1, 2014.
[147] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale multiphysics and
      multidomain models—flexibility and rigidity. The Journal of chemical physics,
      139(19):11B614_1, 2013.
[148] Jelena Grbic, Jie Wu, Kelin Xia, and Guo-Wei Wei. Aspects of topological ap-
      proaches for data science. Foundations of Data Science, 2022.
[149] Yiying Tong, Santiago Lombeyda, Anil N Hirani, and Mathieu Desbrun. Dis-
      crete multiscale vector field decomposition. ACM transactions on graphics (TOG),
      22(3):445–452, 2003.
[150] Yoshihiko Mochizuki and Atsushi Imiya. Spatial reasoning for robot navigation
      using the helmholtz-hodge decomposition of omnidirectional optical flow. In 2009
      24th International Conference Image and Vision Computing New Zealand, pages 1–6.
      IEEE, 2009.
[151] Facundo Mémoli, Zhengchao Wan, and Yusu Wang. Persistent Laplacians: proper-
      ties, algorithms and implications. 42nd Conference on Very Important Topics, Digital
      Object Identifier: 10.4230/LIPIcs.CVIT.2016.23, 2020.
[152] Zhenyu Meng and Kelin Xia. Persistent spectral–based machine learning (perspect
      ml) for protein-ligand binding affinity prediction. Science Advances, 7(19):eabc5329,
      2021.
[153] Jiahui Chen, Yuchi Qiu, Rui Wang, and Guo-Wei Wei. Persistent laplacian pro-
      jected omicron ba. 4 and ba. 5 to become new dominating variants. arXiv preprint
      arXiv:2205.00532, 2022.
                                              197


[154] Rui Wang, Rundong Zhao, Emily Ribando-Gros, Jiahui Chen, Yiying Tong, and
      Guo-Wei Wei. Hermes: Persistent spectral graph software. Foundations of data science
      (Springfield, Mo.), 3(1):67, 2021.
[155] Allen Dudley Shepard. A cellular description of the derived category of a stratified space.
      PhD thesis, Brown University, 1985.
[156] Jakob Hansen and Robert Ghrist. Toward a spectral theory of cellular sheaves.
      Journal of Applied and Computational Topology, 3(4):315–358, 2019.
[157] Xiaoqi Wei and Guo-Wei Wei.           Persistent sheaf laplacians.        arXiv preprint
      arXiv:2112.10906, 2021.
[158] Bernardo Ameneyro, Vasileios Maroulas, and George Siopsis. Quantum persistent
      homology. arXiv preprint arXiv:2202.12965, 2022.
[159] Terri A Long, Siobhan M Brady, and Philip N Benfey. Systems approaches to iden-
      tifying gene regulatory networks in plants. Annual review of cell and developmental
      biology, 24:81–103, 2008.
[160] Alexander Grigor’yan, Yuri Muranov, Vladimir Vershinin, and Shing-Tung Yau.
      Path homology theory of multigraphs and quivers. In Forum mathematicum, vol-
      ume 30, pages 1319–1337. De Gruyter, 2018.
[161] Alexander Grigor’yan, Rolando Jimenez, Yuri Muranov, and Shing-Tung Yau. On
      the path homology theory of digraphs and eilenberg–steenrod axioms. Homology,
      Homotopy and Applications, 20(2):179–205, 2018.
[162] Alexander Grigor’yan, Rolando Jimenez, Yuri Muranov, and Shing-Tung Yau.
      Homology of path complexes and hypergraphs. Topology and its Applications,
      267:106877, 2019.
[163] Yong Lin, Shiquan Ren, Chong Wang, and Jie Wu. Weighted path homology of
      weighted digraphs and persistence. arXiv preprint arXiv:1910.09891, 2019.
[164] Tamal K Dey, Tianqi Li, and Yusu Wang. An efficient algorithm for 1-dimensional
      (persistent) path homology. arXiv preprint arXiv:2001.09549, 2020.
[165] Kaifu Gao, Jian Yin, Niel M Henriksen, Andrew T Fenley, and Michael K Gilson.
      Binding enthalpy calculations for a neutral host–guest pair yield widely diver-
      gent salt effects across water models. Journal of chemical theory and computation,
      11(10):4555–4564, 2015.
[166] Linus Pauling. The nature of the chemical bond. iv. the energy of single bonds
      and the relative electronegativity of atoms. Journal of the American Chemical Society,
      54(9):3570–3582, 1932.
[167] Sinan G Aksoy, Cliff Joslyn, Carlos Ortiz Marrero, Brenda Praggastis, and Emilie
      Purvine. Hypernetwork science via high-order hypergraph walks. EPJ Data Science,
      9(1):16, 2020.
                                            198


[168] Stephane Bressan, Jingyan Li, Shiquan Ren, and Jie Wu. The embedded homology
      of hypergraphs and applications. arXiv preprint arXiv:1610.00890, 2016.
[169] Daniel A Spielman. Spectral graph theory and its applications. In 48th Annual IEEE
      Symposium on Foundations of Computer Science (FOCS’07), pages 29–38. IEEE, 2007.
[170] Jeff Cheeger. A lower bound for the smallest eigenvalue of the Laplacian. In Pro-
      ceedings of the Princeton conference in honor of Professor S. Bochner, pages 195–199,
      1969.
[171] Fan RK Chung and Fan Chung Graham. Spectral graph theory, volume 92. American
      Mathematical Soc., 1997.
[172] Joel Friedman. Computing betti numbers via combinatorial laplacians. Algorith-
      mica, 21(4):331–346, 1998.
[173] Tomasz Kaczynski, Konstantin Mischaikow, and Marian Mrozek. Computational
      homology, volume 157. Springer Science & Business Media, 2006.
[174] Peter Bubenik, Peter T Kim, et al. A statistical approach to persistent homology.
      Homology, homotopy and Applications, 9(2):337–362, 2007.
[175] Yongjin Lee, Senja D Barthel, Paweł Dłotko, S Mohamad Moosavi, Kathryn Hess,
      and Berend Smit. Quantifying similarity of pore-geometry in nanoporous materials.
      Nature communications, 8(1):1–8, 2017.
[176] Vasileios Maroulas, Cassie Putman Micucci, and Farzana Nasrin. Bayesian Topo-
      logical Learning for Classifying the Structure of Biological networks. arXiv preprint
      arXiv:2009.11974, 2020.
[177] Maria-Veronica Ciocanel, Riley Juenemann, Adriana T Dawes, and Scott A McKin-
      ley. Topological data analysis approaches to uncovering the timing of ring structure
      onset in filamentous networks. Bulletin of Mathematical Biology, 83(3):1–25, 2021.
[178] Ioannis Sgouralis, Andreas Nebenfuhr, and Vasileios Maroulas. A bayesian topo-
      logical framework for the identification and reconstruction of subcellular motion.
      SIAM Journal on Imaging Sciences, 10(2):871–899, 2017.
[179] Zhenyu Meng, D Vijay Anand, Yunpeng Lu, Jie Wu, and Kelin Xia. Weighted per-
      sistent homology for biomolecular data analysis. Scientific reports, 10(1):1–15, 2020.
[180] Gunnar Carlsson, Afra Zomorodian, Anne Collins, and Leonidas J Guibas. Persis-
      tence barcodes for shapes. International Journal of Shape Modeling, 11(02):149–187,
      2005.
[181] Zhenyu Meng and Kelin Xia. Persistent spectral based machine learning (perspect
      ml) for drug design. arXiv:2002.00582, 2020.
[182] Ulrich Bauer. Ripser: a lean C++ code for the computation of Vietoris–Rips persis-
      tence barcodes. Software available at https://github. com/Ripser/ripser, 436, 2017.
                                             199


[183] Dmitriy Morozov. Dionysus Software, 2012.
[184] GUDHI Project. GUDHI User and reference manual, 2015.
[185] Ulrich Bauer, Michael Kerber, and Jan Reininghaus. Dipha (a distributed persistent
      homology algorithm). Software available at https://github. com/DIPHA/dipha, 2014.
[186] Henry Adams, Andrew Tausz, and Mikael Vejdemo-Johansson. JavaPlex: A re-
      search software package for persistent (co) homology. In International Congress on
      Mathematical Software, pages 129–136. Springer, 2014.
[187] Chad Giusti, Eva Pastalkova, Carina Curto, and Vladimir Itskov. Clique topology
      reveals intrinsic geometric structure in neural correlations. Proceedings of the Na-
      tional Academy of Sciences, 112(44):13455–13460, 2015.
[188] Dmitriy Morozov and Primoz Skraba. DioDe Software.
[189] Brittany T Fasy, Jisu Kim, Fabrizio Lecci, Clement Maria, David L Millman, and
      Maintainer Jisu Kim. Package ’TDA’, 2019.
[190] Michael Kerber and Herbert Edelsbrunner. The medusa of spatial sorting: 3D ki-
      netic alpha complexes and implementation. arXiv preprint arXiv:1209.5434, 2012.
[191] Rundong Zhao, Mathieu Desbrun, Guo-Wei Wei, and Yiying Tong. 3D hodge de-
      compositions of edge-and face-based vector fields. ACM Transactions on Graphics
      (TOG), 38(6):1–13, 2019.
[192] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei.                 Persistent spectral graph.
      arXiv:1912.04135, 2019.
[193] WHO. Coronavirus disease 2019 (COVID-19) situation report – 172. Coronavirus
      Disease (COVID-2019) Situation Reports, 2020.
[194] Jasper Fuk-Woo Chan, Cyril Chik-Yan Yip, Kelvin Kai-Wang To, Tommy Hing-
      Cheung Tang, Sally Cheuk-Ying Wong, Kit-Hang Leung, Agnes Yim-Fong Fung,
      Anthony Chin-Ki Ng, Zijiao Zou, Hoi-Wah Tsoi, et al. Improved molecular diag-
      nosis of covid-19 by the novel, highly sensitive and specific covid-19-rdrp/hel real-
      time reverse transcription-pcr assay validated in vitro and with clinical specimens.
      Journal of clinical microbiology, 58(5):e00310–20, 2020.
[195] Victor M Corman, Olfert Landt, Marco Kaiser, Richard Molenkamp, Adam Meijer,
      Daniel KW Chu, Tobias Bleicker, Sebastian Brünink, Julia Schneider, Marie Luisa
      Schmidt, et al. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-
      PCR. Eurosurveillance, 25(3):2000045, 2020.
[196] Buddhisha Udugama, Pranav Kadhiresan, Hannah N Kozlowski, Ayden Malek-
      jahani, Matthew Osborne, Vanessa YC Li, Hongmin Chen, Samira Mubareka,
      Jonathan Gubbay, and Warren CW Chan. Diagnosing COVID-19: The disease and
      tools for ddtection. ACS nano, 2020.
                                               200


[197] Yujin Jung, Gun-Soo Park, Jun Hye Moon, Keunbon Ku, Seung-Hwa Beak, Chang-
      Seop Lee, Seil Kim, Edmond Changkyun Park, Daeui Park, Jong-Hwan Lee, et al.
      Comparative analysis of primer–probe sets for rt-qpcr of covid-19 causative virus
      (sars-cov-2). ACS infectious diseases, 6(9):2513–2523, 2020.
[198] Susanne Pfefferle, Svenja Reucher, Dominic Nörz, and Marc Lütgehetmann. Eval-
      uation of a quantitative rt-pcr assay for the detection of the emerging coronavirus
      SARS-CoV-2 using a high throughput system. Eurosurveillance, 25(9):2000152, 2020.
[199] Chantal BF Vogels, Anderson F Brito, Anne Louise Wyllie, Joseph R Fauver, Is-
      abel M Ott, Chaney C Kalinich, Mary E Petrone, Marie-Louise Landry, Ellen F Fox-
      man, and Nathan D Grubaugh. Analytical sensitivity and efficiency comparisons
      of SARS-CoV-2 qrt-pcr assays. medRxiv, 2020.
[200] Arun K Nalla, Amanda M Casto, Meei-Li W Huang, Garrett A Perchetti, Reigran
      Sampoleo, Lasata Shrestha, Yulun Wei, Haiying Zhu, Keith R Jerome, and Alexan-
      der L Greninger. Comparative performance of SARS-CoV-2 detection assays using
      seven different primer/probe sets and one assay kit. Journal of Clinical Microbiology,
      2020.
[201] Kazuya Shirato, Naganori Nao, Harutaka Katano, Ikuyo Takayama, Shinji Saito,
      Fumihiro Kato, Hiroshi Katoh, Masafumi Sakata, Yuichiro Nakatsu, Yoshio Mori,
      et al. Development of genetic diagnostic methods for novel coronavirus 2019 (ncov-
      2019) in japan. Japanese journal of infectious diseases, pages JJID–2020, 2020.
[202] Kate N Bishop, Rebecca K Holmes, Ann M Sheehy, and Michael H Malim.
      APOBEC-mediated editing of viral RNA. Science, 305(5684):645–645, 2004.
[203] Rafael Sanjuán and Pilar Domingo-Calap. Mechanisms of viral mutation. Cellular
      and Molecular Life Sciences, 73(23):4433–4448, 2016.
[204] Nathan D Grubaugh, William P Hanage, and Angela L Rasmussen. Making sense
      of mutation: what D614G means for the COVID-19 pandemic remains unclear. Cell,
      182(4):794–795, 2020.
[205] Marion Sevajol, Lorenzo Subissi, Etienne Decroly, Bruno Canard, and Isabelle Im-
      bert. Insights into RNA synthesis, capping, and proofreading mechanisms of SARS-
      coronavirus. Virus Research, 194:90–99, 2014.
[206] Hatim T Allawi and John SantaLucia. Thermodynamics and nmr of internal g.t
      mismatches in dna. Biochemistry, 36(34):10581–10594, 1997.
[207] Tugba G Kucukkal, Marharyta Petukh, Lin Li, and Emil Alexov. Structural and
      physico-chemical effects of disease and non-disease nsSNPs on proteins. Current
      Opinion in Structural Biology, 32:18–24, 2015.
[208] Peng Yue, Zhaolong Li, and John Moult. Loss of protein structure stability as a
      major causative factor in monogenic disease. Journal of molecular biology, 353(2):459–
      473, 2005.
                                              201


[209] Jiahui Chen, Kaifu Gao, Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. Review of
      COVID-19 antibody therapies. Annual Review of Biophysics, 50:1–30, 2020.
[210] Peter Chen, Ajay Nirula, Barry Heller, Robert L Gottlieb, Joseph Boscia, Jason Mor-
      ris, Gregory Huhn, Jose Cardona, Bharat Mocherla, Valentina Stosor, et al. SARS-
      CoV-2 neutralizing antibody LY-CoV555 in outpatients with COVID-19. New Eng-
      land Journal of Medicine, 384(3):229–237, 2021.
[211] Wanbo Tai, Lei He, Xiujuan Zhang, Jing Pu, Denis Voronin, Shibo Jiang, Yusen
      Zhou, and Lanying Du. Characterization of the receptor-binding domain (RBD)
      of 2019 novel coronavirus: implication for development of RBD protein as a viral
      attachment inhibitor and vaccine. Cellular & molecular immunology, 17(6):613–620,
      2020.
[212] Wendong Li, Zhengli Shi, Meng Yu, Wuze Ren, Craig Smith, Jonathan H Epstein,
      Hanzhong Wang, Gary Crameri, Zhihong Hu, Huajun Zhang, et al. Bats are natural
      reservoirs of SARS-like coronaviruses. Science, 310(5748):676–679, 2005.
[213] Xiu-Xia Qu, Pei Hao, Xi-Jun Song, Si-Ming Jiang, Yan-Xia Liu, Pei-Gang Wang,
      Xi Rao, Huai-Dong Song, Sheng-Yue Wang, Yu Zuo, et al. Identification of two
      critical amino acid residues of the severe acute respiratory syndrome coronavirus
      spike protein for its variation in zoonotic tropism transition via a double substitu-
      tion strategy. Journal of Biological Chemistry, 280(33):29588–29595, 2005.
[214] Huai-Dong Song, Chang-Chun Tu, Guo-Wei Zhang, Sheng-Yue Wang, Kui Zheng,
      Lian-Cheng Lei, Qiu-Xia Chen, Yu-Wei Gao, Hui-Qiong Zhou, Hua Xiang, et al.
      Cross-host evolution of severe acute respiratory syndrome coronavirus in palm
      civet and human. Proceedings of the National Academy of Sciences, 102(7):2430–2435,
      2005.
[215] Rui Wang, Jiahui Chen, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Emerg-
      ing vaccine-breakthrough SARS-CoV-2 variants. arXiv preprint arXiv:2103.08023,
      2021.
[216] Sarah A Clark, Lars E Clark, Junhua Pan, Adrian Coscia, Lindsay GA McKay, Sun-
      daresh Shankar, Rebecca I Johnson, Vesna Brusic, Manish C Choudhary, James Re-
      gan, et al. SARS-CoV-2 evolution in an immunocompromised host reveals shared
      neutralization escape mechanisms. Cell, 184(10):2605–2617, 2021.
[217] Jiahui Chen, Kaifu Gao, Rui Wang, and Guowei Wei. Prediction and mitigation
      of mutation threats to COVID-19 vaccines and antibody therapies. arXiv preprint
      arXiv:2010.06357, 2020.
[218] Sarah Cherian, Varsha Potdar, Santosh Jadhav, Pragya Yadav, Nivedita Gupta,
      Mousumi Das, Partha Rakshit, Sujeet Singh, Priya Abraham, Samiran Panda, et al.
      SARS-CoV-2 Spike Mutations, L452R, T478K, E484Q and P681R, in the Second
      Wave of COVID-19 in Maharashtra, India. Microorganisms, 9(7):1542, 2021.
                                              202


[219] Soo-Young Lee, Dong-Kyun Ryu, Hanmi Noh, Jongin Kim, Ji-Min Seo, Cheolmin
      Kim, Carel van Baalen, Aloys SL Tijsma, Hyo-Young Chung, Min-Ho Lee, et al.
      Therapeutic efficacy of CT-p59 against P. 1 variant of SARS-CoV-2. bioRxiv, 2021.
[220] Xianding Deng, Miguel A Garcia-Knight, Mir M Khalid, Venice Servellita, Candace
      Wang, Mary Kate Morris, Alicia Sotomayor-González, Dustin R Glasner, Kevin R
      Reyes, Amelia S Gliwa, et al. Transmission, infectivity, and antibody neutralization
      of an emerging SARS-CoV-2 variant in California carrying a L452R spike protein
      mutation. MedRxiv, 2021.
[221] Rui Wang, Jiahui Chen, and Guo-Wei Wei. Mechanisms of sars-cov-2 evolution
      revealing vaccine-resistant mutations in europe and america. The journal of physical
      chemistry letters, 12(49):11850–11857, 2021.
[222] Rui Wang, Jiahui Chen, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Emerg-
      ing vaccine-breakthrough SARS-CoV-2 variants. ACS Infectious Diseases, 2022.
                                             203