METABOLIC MODELING: INTEGRATION WITH MULTI-OMIC DATASETS, STATISTICAL EVALUATION, AND APPLICATION TO OUR UNDERSTANDING OF PHOTOSYNTHETIC CARBON ASSIMILATION

By

Joshua Akito Matthew Kaste

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Biochemistry and Molecular Biology – Doctor of Philosophy

2024

ABSTRACT

Many biotechnological efforts in plants have been proposed to address issues around climate change, sustainability, and food security. These range from the modification of the oilseed crop Camelina sativa to improve the efficiency with which it produces oil from captured carbon to the suppression of photorespiration and enhancement of yield in crops by engineering a Carbon Concentrating Mechanism into them. However, research and development efforts to use biotechnological interventions to improve plants and microbes have been hampered by the extreme complexity of these organisms. Many of these bioengineering efforts seek to modify the rates of in vivo biochemical reactions – referred to hereafter as fluxes – in order to improve the efficiency or yield with which a desired product is produced. Therefore, these efforts require the characterization and modification of the organism’s metabolic activity.

As in other areas of engineering, these processes can be aided by the use of quantitative modeling. In the case of metabolic modeling, multiple approaches already exist. These include the use of simplified compartmental models (Chapter 2) and enzyme-based modeling (Chapter 4), as well as constraint-based approaches such as the linear-optimization-based Flux Balance Analysis (Chapter 3) and the nonlinear-regression-based Metabolic Flux Analysis (Chapter 2). Although these methods are rarely used in tandem, they are mathematically interrelated and can be used to validate and/or corroborate one another’s findings.
Indeed, the metabolic modeling literature has frequently given short shrift to validation and model selection, calling into question the biological relevance and accuracy of many modeling studies. I begin with a discussion of challenges and prospects for future development in the statistical evaluation of metabolic models (Chapter 1), emphasizing the need for cross-comparison of multiple techniques, which I put into practice in later chapters. Emphasis is put on the need for careful validation of 13C-Metabolic Flux Analysis findings using multiple lines of evidence and on the usefulness of validating Flux Balance Analysis flux predictions using estimates from 13C-Metabolic Flux Analysis.

I apply these methodological insights to studies of Camelina sativa and its relative Arabidopsis thaliana. I start with a 13C-Metabolic Flux Analysis study of the metabolism of photosynthesizing leaves of Camelina sativa (Chapter 2). By modeling the stable isotopic labeling levels of Calvin-Benson intermediates with a series of polyexponential models, I corroborate the study’s 13C-Metabolic Flux Analysis findings, resulting in a more detailed model of C. sativa’s leaf metabolism and resolving a decades-old mystery in the labeling of these metabolites. Following this, I present a novel method of incorporating multi-omic datasets into FBA predictions of metabolic fluxes in the closely related organism A. thaliana (Chapter 3). I demonstrate that this new method successfully improves agreement between FBA and 13C-MFA flux maps of A. thaliana, setting the stage for improved FBA and metabolic engineering insight into the related C. sativa.

Finally, I turn my attention to reaction-diffusion modeling. In Chapter 4, I apply enzyme-based and spatial modeling techniques to understand the net CO2 fixation and light-use efficiency implications of incorporating a biophysical Carbon Concentrating Mechanism into a C3 plant.
After some concluding remarks in Chapter 5, I present two additional studies that are related to the quantitative modeling of either plants or metabolic networks, but which are not directly related to the rest of the investigations in this thesis. First, I investigate the extent to which the kinds of tissue-specific gene expression patterns utilized in Chapter 3 are conserved across all flowering plants, providing evidence that such an approach may be broadly usable in plant metabolic modeling. Finally, I present interactive educational materials that teach the underlying theory for all of the metabolic modeling approaches used in the above studies. I implemented these materials in an intensive workshop series held at Michigan State University and demonstrate that participants’ self-assessed confidence in the techniques taught increased significantly.

Copyright by
JOSHUA AKITO MATTHEW KASTE
2024

To those who ask one more question when they have already been given an answer

ACKNOWLEDGEMENTS

Throughout my time at Michigan State University, I have had the exceedingly good luck to study under, learn from, and work and collaborate with a great number of kind, hard-working, and knowledgeable friends and colleagues. I also would not be here without the support of people in my life who have believed in me over the years. I cannot hope to name everyone who has helped me in my journey. Having said that, here’s my best attempt:

Dr. Yair Shachar-Hill has been a steadfast supporter of my work and a wonderful advisor. He has provided me with valuable scientific and professional advice while at the same time allowing me the freedom to develop as an independent scholar. He is a model of what a scientific mentor should be, and I can only hope to provide my own mentees the same combination of patience, insight, and guidance in the future.

My committee – Dr. Shachar-Hill, Dr. Tom Sharkey, Dr. Erich Grotewold, Dr. Michaela TerAvest, and Dr.
Chih-Li Sung – have provided invaluable guidance and suggestions throughout my Ph.D. and I am immensely grateful for their support. Many members of my committee, and others here at Michigan State University, have also been wonderful collaborators. Dr. Tom Sharkey, Dr. Michaela TerAvest, Dr. Chih-Li Sung, Dr. Berkley Walker, Dr. Dan Chitwood, Dr. Bob VanBuren, Dr. Yuan Xu, Dr. Kathryne Ford, Miles Roberts, Kenia Segura-Aba, Dr. Saurabh Palande, Anne Steensma, and many more trainees from the NRT-IMPACTS program have played key roles in the studies I present in this thesis, and I could not have done all of this without them.

The current and former members of the Shachar-Hill lab made me feel welcome in Michigan and helped keep me sane during the pandemic. I’d like to sincerely thank Dr. Danielle Hoffmann, Dr. Shawna Rowe, Dr. Yuan Xu, Dr. Na Pang, Anne Steensma, Peter Koroma, and Antwan Green for their camaraderie and good humor. I would like to also thank Antwan Green, who worked with me as an undergraduate researcher, for his patience as I learned how best to serve as a research mentor.

I would also like to acknowledge people whose support and passion for science inspired me to keep going and without whom I would not have ended up getting my Ph.D. here at MSU: Dr. Michael Milgroom, Dr. Mickey Drott, Dr. Carolyn Young, Dr. Mihwa Yi, Dr. Farhad Ghavami, and Brenda Johnson. I also would not be here today if it were not for my parents, Matthew and Miki Kaste, as well as the support of my late grandfather Hubert Kaste.

Finally, I would like to acknowledge the unwavering support of my lovely wife Veronica, who has stuck with me through the many ups and downs of my personal and professional life for close to a decade. And, of course, our wonderful pets Duchess, Gabriel, and Makoto, who have been sources of joy through good times and bad.
PREFACE

“Two things that in my opinion reinforce one another and remain eternally true are: Do not quench your inspiration and your imagination, do not become the slave of your model; and again: Take the model and study it, otherwise your inspiration will never become plastically concrete.”
– Vincent van Gogh, in a letter to Theo van Gogh

There are many ways of conceptualizing what it is we do as “scientists” and what it is about the scientific process – nebulous and ill-defined as it is – that makes it so unusually effective at making predictions about the natural world and enabling new technologies. In my mind, its power derives in large part from the continuous back-and-forth between empirical measurement and model-making. That the interplay between theory and observation is key to the scientific method is by no means a new observation, but I think we should take a moment to consider why we need both theory and observation and not just one or the other to make sense of the world.

When we endeavor to understand something using reason alone, we end up writing something like Plato’s Timaeus: admirably creative and logically constructed, but unmoored entirely from the inconvenient details of how things actually work. On the other hand, if we were to try to understand the world using observation alone, we would accomplish nothing but an accumulation of miscellaneous measurements with no ability to build out of them an understanding of natural phenomena. To summarize: there are an infinite number of “reasonable” explanations of how the world works and an infinite number of observations that could be made about it, but it is only by combining these explanations and observations that we can develop a robust understanding.

In reality, this intricate dance between theory and observation is much harder than we usually like to admit and, as scientists, we are prone to missteps.
Here, I will focus on those areas of science where we gather quantitative data about natural phenomena and then build and evaluate mechanistic, quantitative models to try to explain these data. We may fit a model to a dataset, but find that this model generalizes poorly. We might conclude, based on a model fit, that our pet hypothesis about an important characteristic of a system was correct, not realizing that a simpler model that does not invoke that characteristic at all actually fits our observations even better. We might rigorously validate some aspect of a model, but then use it in ways that have not been validated. Or, we might fixate on using one specific lens to examine our observations, when the use of multiple lenses to look at the system from different angles and at different scales might reveal more about it and, through consilience, make us more confident in our findings.

But where there are problems, there are also opportunities, so I have worked throughout my Ph.D. to identify and address such challenges and, in doing so, generate new insights in the area of the quantitative modeling of photosynthetic metabolism. Towards this end, in this thesis I present work on (i) the development and exploration of different modeling techniques, (ii) the use of these different techniques as lenses through which to contextualize data, and (iii) the statistical interrogation of model fits and, more generally, the use of modeling in the study of metabolism. The admittedly abstract motivations that inspired much of the work in this thesis have resulted in it being composed of an unusually diverse set of studies, unified loosely by a focus on rigorous development and analysis of models describing photosynthetic metabolism. Despite these abstract motivations, however, the products of the work are concrete, and include:

1. A resolution to a many-decades-old mystery in the labeling patterns of photosynthetic intermediates, pointing towards a previously underappreciated cycling between vacuolar and cytosolic sugars in photosynthesizing leaves.
2. The first demonstration of an algorithm that successfully improves the accuracy of Flux Balance Analysis predictions in a whole-plant model using transcriptomic or proteomic data.
3. Spatially resolved reaction-diffusion models that refine our understanding of the metabolic tradeoffs involved in the use of Carbon Concentrating Mechanisms.

I felt very privileged during my time here at Michigan State University to have been allowed – by my research mentor and my funding sources – to pursue a rather unorthodox set of questions. It is my hope that you, the reader, will also appreciate how the studies described in the following chapters interrelate and contribute to our descriptions of photosynthesis and how we arrive at these descriptions in the first place.

TABLE OF CONTENTS

Chapter 1 Model validation and selection in metabolic flux analysis and flux balance analysis
    1.1. Preface
    1.2. Abstract
    1.3. Introduction
    1.4. Validation techniques in FBA and 13C-MFA
    1.5. Model selection in 13C-MFA
    1.6. Future directions
    1.7. Acknowledgments
    1.8. Author contributions
    REFERENCES

Chapter 2 Reimport of carbon from cytosolic and vacuolar sugar pools into the Calvin–Benson cycle explains photosynthesis labeling anomalies
    2.1. Preface
    2.2. Abstract
    2.3. Significance statement
    2.4. Introduction
    2.5. Results
    2.6. Discussion
    2.7. Methods
    2.8. Acknowledgments
    2.9. Author contributions
    REFERENCES
    APPENDIX A: Supplemental Material for Chapter 2

Chapter 3 Accurate flux predictions using tissue-specific gene expression in plant metabolic modeling
    3.1. Preface
    3.2. Abstract
    3.3. Introduction
    3.4. Methods
    3.5. Results
    3.6. Discussion
    3.7. Acknowledgments
    3.8. Funding
    REFERENCES
    APPENDIX B: Supplemental Material for Chapter 3

Chapter 4 Biophysical carbon concentrating mechanisms in land plants: insights from reaction-diffusion modeling
    4.1. Preface
    4.2. Abstract
    4.3. Introduction
    4.4. Methods
    4.5. Results
    4.6. Discussion
    4.7. Data and Code Availability
    4.8. Acknowledgments
    4.9. Author Contributions
    REFERENCES
    APPENDIX C: Supplemental Material for Chapter 4

Chapter 5 Concluding Remarks
    5.1. Introduction
    5.2. Takeaway messages and future work
    REFERENCES

Chapter 6 Additional Studies: Integrative Teaching of Metabolic Modeling and Flux Analysis with Interactive Python Modules
    6.1. Preface
    6.2. Abstract
    6.3. Introduction
    6.4. Methods
    6.5. Results and discussion
    6.6. Conclusions
    6.7. Data and code availability statement
    6.8. Acknowledgements
    REFERENCES
    APPENDIX D: Supplemental Material for Chapter 6

Chapter 7 Additional Studies: Topological data analysis reveals a core gene expression backbone that defines form and function across flowering plants
    7.1. Preface
    7.2. Abstract
    7.3. Introduction
    7.4. Results
    7.5. Discussion
    7.6. Methods
    7.7. Data availability statement
    7.8. Funding
    7.9. Author contributions
    REFERENCES
    APPENDIX E: Supplemental Material for Chapter 7

Chapter 1 Model validation and selection in metabolic flux analysis and flux balance analysis

This research was published in: J. A. M. Kaste, Y. Shachar-Hill, Model validation and selection in metabolic flux analysis and flux balance analysis. Biotechnology Progress, e3413 (2023).

1.1. Preface

The conversations and ideas that gave rise to this paper started between me, Dr. Shachar-Hill, Dr. Xu, and Dr. Sharkey while we were conducting the study that ultimately became Chapter 2. During the course of that study, I became increasingly interested in the general questions surrounding the statistical evaluation, validation, and model selection practices of 13C-MFA flux maps.
The idea of using a simpler compartmental modeling strategy to corroborate the gross architecture of a 13C-MFA model, in the absence of a well-worked-out general model selection strategy for 13C-MFA, did make it into the paper presented in Chapter 2. However, the conversations that Dr. Shachar-Hill and I continued to have on this topic ended up ranging far beyond the context of that study. Moreover, my work on validating the predictions of a novel FBA implementation using MFA flux estimates, which I present in Chapter 3, gave me insight into validation practices on the FBA side of constraint-based modeling. After some preliminary literature review, it became quite clear that despite the importance of validation and model selection to the area of constraint-based metabolic modeling, shockingly little had been said on the topic. Due to the paucity of discussion and analysis in this area, a straightforward review paper would have been of little use. So, we decided to write a perspective article that summarizes common practices and their drawbacks, while also presenting our original insights and perspectives on the future of this area. The paper presented in this chapter has been published in the journal Biotechnology Progress. I am first and corresponding author on the study.

1.2. Abstract

13C-Metabolic Flux Analysis (13C-MFA) and Flux Balance Analysis (FBA) are widely used to investigate the operation of biochemical networks in both biological and biotechnological research. Both methods use reaction network models of metabolism operating at steady state so that reaction rates (fluxes) and the levels of metabolic intermediates are constrained to be invariant. They provide estimated (MFA) or predicted (FBA) values of the fluxes through the network in vivo, which cannot be measured directly. These fluxes can shed light on basic biology and have been successfully used to inform metabolic engineering strategies.
Several approaches have been taken to test the reliability of estimates and predictions from constraint-based methods and to compare alternative model architectures. Despite advances in other areas of the statistical evaluation of metabolic models, such as the quantification of flux estimate uncertainty, validation and model selection methods have been underappreciated and underexplored. We review the history and state of the art in constraint-based metabolic model validation and model selection. Applications and limitations of the χ2-test of goodness-of-fit, the most widely used quantitative validation and selection approach in 13C-MFA, are discussed, and complementary and alternative forms of validation and selection are proposed. A combined model validation and selection framework for 13C-MFA incorporating metabolite pool size information that leverages new developments in the field is presented and advocated for. Finally, we discuss how adopting robust validation and selection procedures can enhance confidence in constraint-based modeling as a whole and ultimately facilitate more widespread use of FBA in biotechnology.

1.3. Introduction

The set of biochemical reaction rates in the metabolic network of a living system (its flux map) represents an integrated functional phenotype that emerges from multiple layers of biological organization and regulation, including the genome, transcriptome, and proteome (Nielsen, 2003). The study of metabolic fluxes is therefore important for systems biology, rational metabolic engineering, and synthetic biology. A grand challenge of systems biology is building an integrated mechanistic understanding of the operation of living organisms across these levels of regulation (Spivey, 2004) – an understanding that goes beyond statistical or correlative descriptions, however useful these can be.
Meeting this challenge requires fluxes to be accurately predicted from network structure using explicit rules or hypotheses and reliably estimated using experimental data. Fluxes are also critical to many biotechnological and metabolic engineering applications. Examples such as the development of lysine hyper-producing strains of Corynebacterium glutamicum (Koffas et al., 2003; Koffas and Stephanopoulos, 2005; Becker et al., 2011) and the rewiring of E. coli’s metabolism to make it grow chemoautotrophically (Gleizer et al., 2019) attest to the usefulness of these techniques. As the scale and complexity of integrative systems biology and biological engineering efforts increase, so too will the need for reliable and robust estimates of fluxes.

In vivo fluxes cannot be directly measured, necessitating modeling approaches to estimate or predict them. The most commonly used approaches for metabolic modeling are the constraint-based modeling frameworks of 13C-Metabolic Flux Analysis (13C-MFA) and Flux Balance Analysis (FBA). Both require a metabolic network consisting of metabolites linked by biochemical reactions to be defined using the biochemical literature, knowledge of the enzymes and transporters expressed from the genome, and physico-chemical rules. In 13C-MFA, atom mappings describing the positions and interconversions of the carbon atoms in reactants and products are also included in the model. These methods assume that the system is at metabolic steady-state, such that the concentrations of all metabolic intermediates and reaction rates are constant (Antoniewicz, 2015). External fluxes, such as the uptake of a substrate or the rate of production of new cells or a product, are also measured and used to constrain the possible flux ranges. These assumptions and constraints define a “solution space” containing all flux maps consistent with them but are typically insufficient to pinpoint a unique flux map.
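The steady-state assumption and the resulting solution space can be illustrated with a minimal Python sketch. The four-reaction toy network, the reaction labels, and all flux values below are hypothetical and chosen only for illustration; they are not taken from any model discussed in this thesis. The sketch checks whether a candidate flux map leaves every metabolite pool unchanged, and shows that more than one flux map can satisfy that condition for the same measured uptake flux.

```python
# Hypothetical toy network, for illustration only.
# Rows are metabolites (A, B); columns are reactions:
#   v1: -> A (substrate uptake)     v2: A -> B
#   v3: A -> (byproduct export)     v4: B -> (biomass/product)
S = [
    [1, -1, -1, 0],   # mass balance on A
    [0, 1, 0, -1],    # mass balance on B
]


def net_production(S, v):
    """Return S*v, the net rate of change of each metabolite pool."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def is_steady_state(S, v, tol=1e-9):
    """True if every metabolite pool is constant for flux map v."""
    return all(abs(rate) <= tol for rate in net_production(S, v))


# Three candidate flux maps sharing the same "measured" uptake flux v1 = 10.
map_a = [10.0, 10.0, 0.0, 10.0]  # all carbon goes to biomass
map_b = [10.0, 4.0, 6.0, 4.0]    # most carbon is exported as byproduct
map_c = [10.0, 4.0, 4.0, 4.0]    # pool A accumulates: not at steady state

for name, v in [("map_a", map_a), ("map_b", map_b), ("map_c", map_c)]:
    print(name, net_production(S, v), is_steady_state(S, v))
```

Both `map_a` and `map_b` are consistent with the steady-state and uptake constraints, which is why additional information (labeling data in 13C-MFA, an objective function in FBA) is needed to choose among them.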
In 13C-MFA, isotopic labeling data are used to identify a particular solution within the solution space. 13C-labeled substrates are fed to the system under investigation and the endpoint labeling, or time-course labeling in Isotopically Nonstationary Metabolic Flux Analysis (INST-MFA), of metabolites is measured using mass spectrometry and/or NMR techniques (Antoniewicz, 2015; Cheah and Young, 2018). Given a metabolic network, a flux map, and information about the labeled substrate fed into the system, the label distribution through all the metabolites in a network can be solved analytically. However, 13C-MFA works backwards from measured label distributions to flux maps by minimizing the differences between measured and estimated Mass Isotopomer Distribution (MID) values by varying flux estimates (Jazmin et al., 2014). For INST-MFA, pool size measurements can also be included in the minimization process.

In FBA, linear optimization is used to identify a flux map (or set of flux maps) from the solution space (Orth et al., 2010b). This is the map, or set of maps, for which the sum of one or more fluxes (the objective function) is maximized or minimized. Objective functions frequently represent measures of efficiency, including the maximization of growth rate or product formation or the minimization of total flux (Holzhütter, 2004). Such functions may embody hypotheses about what the in vivo system has been evolutionarily tuned to optimize, or questions about the operational capacity of that system under particular conditions. Since the objective function, together with the network architecture and empirical and/or theoretical constraints introduced by the modeler, is a key determinant of the flux maps generated by FBA, careful selection, justification, and, ideally, validation of objective functions is crucial.
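The linear optimization at the heart of FBA can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in any study cited here: the toy network, the flux bounds, and the choice of the "biomass" flux as the objective are all hypothetical, and the general-purpose solver `scipy.optimize.linprog` stands in for the dedicated constraint-based modeling software that is normally used for genome-scale models.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (rows: metabolites A, B; columns: reactions).
#   v1: -> A (uptake)   v2: A -> B   v3: A -> (byproduct)   v4: B -> (biomass)
S = np.array([
    [1, -1, -1, 0],
    [0, 1, 0, -1],
], dtype=float)

# Assumed constraints: uptake limited to 10 units, reaction v2 capped at 8.
bounds = [(0, 10), (0, 8), (0, None), (0, None)]

# Objective: maximize the biomass flux v4. linprog minimizes, so negate it.
c = np.array([0.0, 0.0, 0.0, -1.0])

# Steady state is imposed as the equality constraint S v = 0.
res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)

max_biomass = -res.fun
print("solver succeeded:", res.success)
print("maximum biomass flux:", max_biomass)
print("one optimal flux map:", res.x)
```

In this toy case the optimal objective value is fixed by the cap on v2, but the uptake and byproduct fluxes are not uniquely determined at the optimum, so the flux map returned by the solver is only one of several equally optimal maps.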
As shown in Schnitzer et al. (2022), alternative objective functions can, and should, be evaluated to identify those that result in the best agreement with experimental data. In many cases, the constraints – typically on external fluxes – imposed during an FBA optimization result in a set of viable flux maps (a solution space) rather than a single map. In such cases, related techniques, including Flux Variability Analysis (Mahadevan and Schilling, 2003) and random sampling (Schellenberger and Palsson, 2009; Bordel et al., 2010; Megchelenbrink et al., 2014; Haraldsdóttir et al., 2017) can be used to characterize the set of flux maps consistent with the set constraints.

The computational tractability and small amount of experimental data necessary to perform FBA allow the analysis of Genome-Scale Stoichiometric Models (GSSMs). These models incorporate all known reactions believed to occur in an organism based on a combination of genome annotation and manual curation. Additional linear-optimization-based methods for solving GSSMs using the FBA framework have been developed and are sometimes used together with FBA. These include Minimization of Metabolic Adjustment (MOMA) (Segrè et al., 2002), and Regulatory On/Off Minimization (ROOM) (Shlomi et al., 2005), as well as a host of methods that incorporate omic data into the optimization process [e.g., (Åkesson et al., 2004; Becker and Palsson, 2008; Tian and Reed, 2018; Pandey et al., 2019; Ravi and Gunawan, 2021)]. FBA and its related methods are sometimes used to analyze models other than true GSSMs, such as “core” models that focus on central metabolic processes that conduct the large majority of flux (Orth et al., 2010a). When discussing validation, however, the same principles apply to all of these linear optimization methods and across the different model scales.
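Flux Variability Analysis can be sketched as a pair of linear programs per reaction: the objective is first optimized, then held at (or near) its optimum while each flux is minimized and maximized in turn. The Python sketch below is illustrative only. The toy network, bounds, objective, and the small numerical slack on the objective constraint are assumptions made for this example, and `scipy.optimize.linprog` is used in place of dedicated constraint-based modeling software.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (rows: metabolites A, B; columns: reactions).
#   v1: -> A (uptake)   v2: A -> B   v3: A -> (byproduct)   v4: B -> (biomass)
S = np.array([
    [1, -1, -1, 0],
    [0, 1, 0, -1],
], dtype=float)
bounds = [(0, 10), (0, 8), (0, None), (0, None)]  # assumed flux bounds
objective_index = 3                               # maximize biomass flux v4
n_reactions = S.shape[1]


def solve(c, flux_bounds):
    """Minimize c.v subject to S v = 0 and the given flux bounds."""
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=flux_bounds)
    if not res.success:
        raise RuntimeError(res.message)
    return res


# Step 1: ordinary FBA to find the optimal objective value.
c_obj = np.zeros(n_reactions)
c_obj[objective_index] = -1.0
optimum = -solve(c_obj, bounds).fun

# Step 2: require the objective flux to stay at its optimum
# (minus a tiny slack to avoid numerical infeasibility).
fva_bounds = list(bounds)
fva_bounds[objective_index] = (optimum - 1e-9, bounds[objective_index][1])

# Step 3: minimize and maximize each flux under that requirement.
ranges = []
for i in range(n_reactions):
    c = np.zeros(n_reactions)
    c[i] = 1.0
    v_min = solve(c, fva_bounds).fun
    v_max = -solve(-c, fva_bounds).fun
    ranges.append((v_min, v_max))

for i, (v_min, v_max) in enumerate(ranges, start=1):
    print(f"v{i}: [{v_min:.3f}, {v_max:.3f}]")
```

Fluxes whose minimum and maximum coincide are fully determined at the optimum, whereas fluxes with a wide range (here the uptake and byproduct fluxes) are not pinned down by the objective and constraints alone.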
For the sake of simplicity, we use “FBA” to refer to this family of methods generally and refer to the medium- to large-scale models used with these methods as “FBA models.”

Progress has been made in improving the statistical rigor and reliability of flux estimates and characterizing uncertainty in estimates and predictions. For example, in MFA, the development of effective methods for flux uncertainty estimation (Antoniewicz et al., 2006) allows researchers to better quantify confidence in flux predictions and, where appropriate, to gather additional data to better support their conclusions. Bayesian techniques for the characterization of uncertainties in flux estimates derived from isotopic labeling have also been presented (Theorell et al., 2017). On the experimental side of MFA, there have been advances in designing and implementing parallel labeling experiments, wherein the labeling patterns obtained using multiple tracers are simultaneously fit to generate a single 13C-MFA flux map. This enables more precise estimation of fluxes than experiments with individual tracers or tracer combinations allow (Chang et al., 2008; Crown et al., 2012; Crown and Antoniewicz, 2012; Leighty and Antoniewicz, 2013; Millard et al., 2014; Crown et al., 2015; Crown et al., 2016; Beyß et al., 2021). Greater resolution in isotopic labeling data through the use of tandem mass spectrometry techniques, which allow for the quantification of positional labeling, can also improve the precision of modeled fluxes, as described in Choi and Antoniewicz (2019) and Wang et al., (2021). Recent years have also seen developments in FBA meant to improve the reliability of its predictions. For example, studies have characterized the impact of departures from metabolic steady state and devised methods to account for uncertainties in biomass compositions [e.g., (Dinh et al., 2022; Choi et al., 2023)].
The many sources of uncertainty when working with FBA and genome-scale models, and attempts to characterize and mitigate this uncertainty, have been reviewed elsewhere (Bernstein et al., 2021). In this review, we specifically focus on the validation of flux predictions and estimates from constraint-based modeling studies and the selection of well-supported model architectures, which have received less attention and specific treatment in the literature. How can MFA and FBA researchers validate the accuracy of their estimates and predictions? These flux analysis methods also require researchers to make choices about the network structure of the model to be used. This leads to questions of model selection; that is, how do we select the most statistically justified model from among the alternatives? Validation and model selection are key to improving the fidelity of model-derived fluxes to the real in vivo ones. The fields of systems and synthetic biology have seen substantial development of model selection and validation practices (Kirk et al., 2013; Gross and MacLeod, 2017), but these topics are not frequently discussed in the metabolic modeling literature. Previous reviews and methods papers have touched on the use of tools like the χ2-test of goodness-of-fit for the validation of MFA models (Antoniewicz, 2018; Long and Antoniewicz, 2019a). However, to our knowledge, no reviews covering the various methods for validating FBA predictions exist, nor have previous reviews discussed the various limitations of the χ2-test. Moreover, previous reviews have not addressed the most recent improvements in model selection in 13C-MFA, which have not been adequately incorporated into routine practice. Addressing these topics explicitly is important for practitioners as they carry out their work. It is also important for readers of the flux analysis literature, who must understand the assumptions, tests of validity, and model selection techniques underlying what they are reading. 
Although only a subset of research groups conduct both FBA and MFA modeling, we believe most metabolic modeling practitioners and consumers read literature containing both modeling paradigms. As we highlight in this review, some similar themes emerge when examining the validation of both FBA and MFA flux maps. Finally, one of the most robust validations that can be conducted for FBA predictions is comparison against MFA-estimated fluxes, which makes simultaneously considering the validity of both FBA and MFA flux maps crucial. For these reasons, we consider both modeling approaches in this review. We review and provide our perspective on these areas and prospects for future development, highlighting: (1) validation methods applicable to FBA flux maps; (2) approaches for validating 13C-MFA flux maps; (3) developments and prospects for model selection in 13C-MFA; (4) how validation and model selection practices in 13C-MFA could benefit from a greater emphasis on the isolation of training and validation datasets; and (5) the importance of corroborating flux mapping results using independent modeling and experimental techniques.

1.4. Validation techniques in FBA and 13C-MFA

FBA and 13C-MFA studies commonly validate the model(s) used, though the nature and extent of these validations vary greatly. We summarize these validation strategies in Figure 1.1.

1.4.1. Validation in FBA

Figure 1.1: Graphical summary of validation strategies in (A) FBA and (B) 13C-MFA. Dotted lines connect inputs with the associated validation technique(s). (A) FBA predictions can be validated by comparing growth rate or growth/no-growth phenotypes across different substrates, growth conditions, or sets of gene knockouts in silico and in vivo. Values can be calculated from flux maps and compared with experimental measurements. FBA internal flux predictions can be compared with 13C-MFA fluxes.
(B) Values can be calculated from 13C-MFA flux maps and compared with an independent experimental measurement from the in vivo system. Goodness-of-fit can be assessed between simulated and measured MIDs, and simulated and measured metabolite pool sizes in INST-MFA. Flux maps can be compared with the results of independent modeling exercises. Molecules are schematically shown as connected circles of atomic positions: open circles are unlabeled, and filled circles are isotopically labeled. Abbreviations: Mn – metabolites in the metabolic network; Sn – exogenous substrates; Vi – fluxes; [Mn] – metabolite concentrations.

The COnstraint-Based Reconstruction and Analysis (COBRA) framework, implemented in software solutions such as the COBRA Toolbox (Heirendt et al., 2019) and cobrapy (Ebrahim et al., 2013) and widely used for FBA studies, features functions and pipelines that can be used to ensure basic functionality of models including balancing of charge, pH, and cofactors/cosubstrates, thermodynamic feasibility, and connectivity of all metabolites. Model characteristics evaluated include the inability to generate ATP without an external source of energy and the inability to synthesize biomass without adding substrates not known to be needed. Additionally, the MEMOTE (MEtabolic MOdel TEsts) pipeline contains tests to ensure, for example, that biomass precursors can be successfully synthesized in a model in a variety of growth media (Lieven et al., 2020). MEMOTE has been used to ensure appropriate stoichiometry and consistency with accepted format standards in models entered into the BiGG (Norsigian et al., 2020) model database. These forms of quality control are an important first step in ensuring that models are behaving appropriately and generating useful predictions. However, following these initial checks on functionality, the techniques used to validate actual model predictions are varied and not standardized.
Indeed, even in the BiGG database, which is highly curated and focuses primarily on models of microbial systems, models vary in the type and extent of validation performed. Given the variety of validation procedures that appear in the literature, it is important when using an FBA model to be aware of what specific validations were used, what their limitations are, and consequently, what inferences or downstream applications are appropriate (summarized in Table 1.1).

Table 1.1: The most common model validation strategies in Flux Balance Analysis: what these methods tell us, their limitations, important considerations for researchers and/or readers, and examples of these methods’ implementation in the literature.

Method: Comparison of growth/no-growth on one or more substrates
Information content: Presence/absence of reactions necessary for substrate utilization and biomass synthesis.
Limitations: Validation is qualitative, only indicating the existence of metabolic routes. Does not test the accuracy of predicted internal flux values.
Use case: Useful when viability/nonviability of different growth conditions is of interest. Unlike a growth-rate comparison, does not indicate whether the efficiency of biomass synthesis is realistic.
Examples: (Pinchuk et al., 2010; Ong et al., 2014; Arion et al., 2023; Coppens et al., 2023; Tec-Campos et al., 2023)

Method: Comparison of growth rates on one or more substrates
Information content: Consistency of metabolic network, biomass composition, and maintenance costs with observed efficiency of substrate-to-biomass conversion.
Limitations: Provides quantitative information on the overall efficiency of substrate conversion to biomass, but is uninformative with respect to the accuracy of internal flux predictions.
Use case: When done across multiple substrates and conditions, this validation gives confidence in the predicted efficiency with which the model produces biomass. Useful when identifying growth-limiting factors.
Examples: (Oftadeh et al., 2021; Arion et al., 2023; Coppens et al., 2023; Tec-Campos et al., 2023)

Method: Comparison of in vivo and in silico knockout lethality
Information content: Presence/absence of biosynthetic reactions necessary for substrate use and growth.
Limitations: Care is needed to reduce incorrect predictions from many different factors, including optimization method and biomass composition changes in response to knockout.
Use case: Critically important to perform when designing growth-coupled knockout strategies (Burgard et al., 2003; Tepper and Shlomi, 2009; Stanford et al., 2015).
Examples: (Gatto et al., 2015; Alzoubi et al., 2019; Oftadeh et al., 2021; Santos-Merino et al., 2023)

Method: Comparison of FBA predictions with MFA fluxes
Information content: Accuracy of internal flux predictions.
Limitations: Few MFA flux maps exist for most organisms, making this validation impossible or requiring comparison with an MFA flux map taken for very different experimental conditions.
Use case: Important when the intended use of FBA modeling requires that the predictions of specific internal flux values be accurate.
Examples: (Shinfuku et al., 2009; Machado and Herrgård, 2014; Broddrick et al., 2019; Coppens et al., 2023)

Perhaps the most common validation in FBA is comparison between FBA-predicted and empirically measured rates of growth [e.g., (Varma and Palsson, 1994; Schroeder and Saha, 2020; Feierabend et al., 2021; Arion et al., 2023; Blázquez et al., 2023; Noecker et al., 2023; Tec-Campos et al., 2023)]. One may similarly evaluate growth/no-growth in different media and/or with different carbon sources [e.g., (Ong et al., 2014; Arion et al., 2023; Blázquez et al., 2023; Heinken et al., 2023; Tec-Campos et al., 2023)]. A related approach is the comparison of in silico metabolite uptake/secretion with experimental measurements (Heinken et al., 2021; Blázquez et al., 2023; Heinken et al., 2023). Such evaluations give confidence in the model’s basic predictions.
To ensure that the accuracy of growth-rate predictions generalizes well, we strongly recommend validating growth rates on substrates or in media conditions from which biomass composition and parameters like Growth-Associated Maintenance (GAM) and Non-Growth Associated Maintenance (NGAM) costs were not experimentally derived, as done in Arion et al., (2023). GAM represents the energy expenditure needed to support a certain rate of biomass growth and NGAM represents the energy expenditure required for a cell or organism to survive without any net growth (Thiele and Palsson, 2010). These values may vary depending on growth conditions, so testing whether the values measured in one set of conditions generalize to others is important. Otherwise, future users may use a model with, for example, another common media composition and find – or worse yet, simply not notice – that the resulting predictions do not accurately reflect essential characteristics of the organism’s actual metabolism.

A related approach involves comparing growth/no-growth of gene knockout strains to FBA predictions to address whether the metabolic pathways used in the model mirror the biological system. Experimentally verified lethal knockouts that appear nonlethal in silico point to alternative routes the model can use to grow. Conversely, in silico lethality predictions not confirmed by experiment suggest the model is missing isoforms or alternative reaction routes. Collecting the true positive, true negative, false positive, and false negative predictions from the in silico vs. in vivo lethality predictions into a confusion matrix allows for an at-a-glance evaluation of overall model accuracy and for the comparison of alternative model architectures (Santos-Merino et al., 2023).
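The confusion-matrix bookkeeping described above can be sketched in a few lines. The following is a minimal illustration, not the procedure of any cited study: the gene names and lethality calls are hypothetical, "positive" is assumed to mean "lethal", and the Matthews correlation coefficient is added here as one common summary statistic for unbalanced classes.

```python
from math import sqrt


def lethality_confusion(predicted_lethal, observed_lethal):
    """Tally in silico vs. in vivo knockout lethality calls.

    Both arguments map a gene identifier to True (lethal) or False (viable).
    Only genes present in both mappings are scored.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for gene in predicted_lethal.keys() & observed_lethal.keys():
        pred, obs = predicted_lethal[gene], observed_lethal[gene]
        if pred and obs:
            counts["TP"] += 1
        elif not pred and not obs:
            counts["TN"] += 1
        elif pred and not obs:
            counts["FP"] += 1  # model predicts lethality, organism grows
        else:
            counts["FN"] += 1  # model grows, organism does not
    return counts


def summarize(counts):
    """Accuracy and Matthews correlation coefficient from the four tallies."""
    tp, tn, fp, fn = (counts[k] for k in ("TP", "TN", "FP", "FN"))
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else float("nan")
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return {"accuracy": accuracy, "mcc": mcc}


# Hypothetical calls, for illustration only.
in_silico = {"geneA": True, "geneB": False, "geneC": True, "geneD": False, "geneE": False}
in_vivo = {"geneA": True, "geneB": False, "geneC": False, "geneD": True, "geneE": False}

counts = lethality_confusion(in_silico, in_vivo)
scores = summarize(counts)
print(counts, scores)
```

Under this convention, false negatives correspond to the alternative in silico routes mentioned above, and false positives to potentially missing isoforms or reaction routes. Computing the same tallies for each candidate architecture gives a like-for-like basis for comparing them.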
Researchers sometimes use algorithms to identify knockouts that couple biomass accumulation to flux through a reaction for biotechnological applications (Burgard et al., 2003; Tepper and Shlomi, 2009; Stanford et al., 2015). This requires that models accurately predict growth/no-growth phenotypes for gene knockouts, but previous work in a model of Saccharomyces cerevisiae, for example, shows that FBA performs poorly at predicting the synthetic lethality of double-knockouts, making this a serious concern (Alzoubi et al., 2019). When performing such validations, one must keep in mind that imposed constraints and decisions made during the model construction or optimization process may implicitly or explicitly add the predictions one is trying to validate into the model, rendering the exercise meaningless. This makes clear and transparent documentation of the assumptions used in the modeling process key for reviewers and readers to assess the epistemic value of the validations that are reported.

It is crucial to note that the methods discussed above do not validate the internal flux predictions made by FBA. Due to the underdetermined nature of FBA, many radically different flux maps may be compatible with, for example, the optimization of growth rate (Mahadevan and Schilling, 2003), making validations using growth rate or any other individual external flux uninformative with respect to internal flux distributions. In well-characterized systems, there may be a wealth of known metabolic functionalities that an organism can carry out, and evaluating whether the model can reproduce them can give some assurance of realistic model behavior. In Duarte et al., (2007) and Sigurdsson et al., (2010), 288 metabolic processes known to take place in mammalian cells were evaluated in human and mouse models, though it was only the ability to carry out the processes at all, and not the actual flux values, that was evaluated.
In favorable cases, individual internal fluxes can be quantitatively estimated in vivo using independent methods and compared directly to ones from a predicted flux map to provide a powerful form of validation. For example, in a study from our group (Kaste and Shachar-Hill, 2023) the ratio of the cyclic electron flow (CEF) to linear electron flow (LEF) fluxes in photosynthesis predicted by FBA was evaluated against CEF/LEF ratios from fluorescence measurements for validation purposes. Though less specific, the sum of FBA-predicted values for fluxes that produce and/or consume a product (such as CO2) can also be compared to experimental measurements. In addition to these approaches, there is the possibility going forward of integrating metabolomics data into the FBA prediction process [e.g., (Lee et al., 2006)] and/or comparing FBA results against metabolomic datasets. Note, though, that metabolite levels and changes in those levels at steady state cannot be directly interpreted in terms of fluxes, so any attempt to validate FBA results using observations in metabolomics datasets should be made with caution.

However, validations of internal flux predictions across the network require comparing FBA flux maps with high-quality ones from 13C-MFA. Such validations are the most information-rich of all the methods surveyed so far and tell us the most about how well the FBA flux maps generated by a particular combination of network architecture, constraints, and objective function line up with experimental data. Unfortunately, 13C-MFA flux maps are time-consuming to generate, making this “gold-standard” validation rare. To compare FBA-predicted and MFA-estimated fluxes, the model architectures must be the same, or the MFA network must at least be a subnetwork of the model used for the FBA. Additionally, the empirical constraints (e.g., substrate uptake and biomass accumulation) must be the same in both cases.
In cases where the growth rates predicted or constrained for an FBA flux map do not perfectly line up with those from an MFA flux map, normalization of fluxes to account for this discrepancy can be used to get an apples-to-apples comparison (Broddrick et al., 2019). The imposition of identical external flux constraints on both the FBA and MFA models may preclude validation of the accuracy of certain external flux predictions by the FBA. However, such comparisons can be done afterwards by removing the relevant constraints. Comparison is also complicated by the underdetermined nature of most FBA optimizations, which can result in large feasible ranges for the individual fluxes being compared against the corresponding flux values obtained from 13C-MFA, making the validation less stringent. FBA optimizations that assume parsimony (Holzhütter, 2004; Lewis et al., 2010) tend to yield narrower flux ranges, but this advantage may come at the cost of neglecting other plausible objective functions that might be more accurate. Finally, when FBA-predicted and MFA-estimated flux maps disagree, assuming the experimental constraints are consistent between the two and that the person doing the comparison is confident in the MFA estimates, either the FBA network architecture or objective function could be to blame. There is not, to our knowledge, a consistent strategy for disambiguating disagreements due to architecture or objective function. If the biological/biochemical accuracy of the objective function is in question, methods for inferring objective functions using isotopic labeling data can be employed [e.g., (Gianchandani et al., 2008)], the resulting objective functions can be compared with the one being used, and discrepancies can be considered.
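As an illustration of the normalization step discussed above, the sketch below rescales an FBA flux map and an MFA flux map to a shared reference flux and then checks whether each FBA prediction falls inside the corresponding MFA confidence interval. Normalizing to substrate uptake is an assumption made here for simplicity and is not necessarily the scheme used in the cited work; the reaction names and numbers are hypothetical.

```python
def normalize(fluxes, reference):
    """Express each flux relative to a shared reference flux (e.g. substrate uptake)."""
    ref = fluxes[reference]
    if ref <= 0:
        raise ValueError("reference flux must be positive")
    return {rxn: value / ref for rxn, value in fluxes.items()}


def compare_fba_to_mfa(fba, mfa_best, mfa_bounds, reference):
    """Report, per shared reaction, whether the normalized FBA flux lies
    inside the normalized MFA confidence interval."""
    fba_n = normalize(fba, reference)
    mfa_n = normalize(mfa_best, reference)
    mfa_ref = mfa_best[reference]
    report = {}
    for rxn in sorted(fba_n.keys() & mfa_n.keys()):
        lower, upper = (bound / mfa_ref for bound in mfa_bounds[rxn])
        report[rxn] = {
            "fba": fba_n[rxn],
            "mfa": mfa_n[rxn],
            "within_mfa_interval": lower <= fba_n[rxn] <= upper,
        }
    return report


# Hypothetical flux maps in arbitrary but internally consistent units.
fba = {"uptake": 10.0, "glycolysis": 7.0, "oxPPP": 3.0}
mfa_best = {"uptake": 8.0, "glycolysis": 6.0, "oxPPP": 1.6}
mfa_bounds = {"uptake": (8.0, 8.0), "glycolysis": (5.2, 6.4), "oxPPP": (1.2, 2.0)}

report = compare_fba_to_mfa(fba, mfa_best, mfa_bounds, reference="uptake")
for rxn, row in report.items():
    print(rxn, row)
```

When the FBA solution is not unique, the single FBA value per reaction could be replaced by a feasible range (for example from Flux Variability Analysis), and the check becomes whether the two intervals overlap. As noted above, that is a less stringent test.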
All objective functions that relate to growth will be affected by the accuracy of the biomass composition used in the model, although in some systems central metabolic fluxes may be relatively robust to variability in the exact values of this composition (Yuan et al., 2016). In systems for which extensive biomass composition data is available, known variability in biomass composition can be incorporated during the optimization process (Choi et al., 2023). Despite these various limitations and difficulties when validating FBA using 13C-MFA fluxes, some studies have evaluated the accuracy of FBA against 13C-MFA-estimated flux maps [e.g., (Schuetz et al., 2007; Chen et al., 2011; Machado and Herrgård, 2014; Tian and Reed, 2018; Long and Antoniewicz, 2019b; Blázquez et al., 2023; Coppens et al., 2023)], with mixed results.

A consistent challenge when validating FBA fluxes using any method is the need to compare the FBA flux map against empirical fluxes or other measurements that were generated under similar conditions to those being simulated. For organisms or systems whose metabolic models are undergoing continual refinement, thus requiring repeated validation, community-curated and updated validation datasets generated under well-defined and carefully reported conditions may be useful. Standards on what metabolic phenotypes and responses need to be captured by these models [e.g., the 288 known metabolic functions in human cells used in (Duarte et al., 2007)] may also help ensure that reconstructions maintain essential biological features as they grow larger and more detailed.

To summarize, we make the following recommendations for the validation of FBA-predicted flux maps:

1. When possible, comparisons between FBA-predicted and 13C-MFA-estimated flux maps should be performed to validate the accuracy of FBA-predicted internal fluxes.
This provides a greater wealth of information about where and to what extent the model is, and is not, lining up with experimental evidence. When performing such validations, care should be taken to ensure that the conditions under which the FBA predictions and MFA estimates are generated are as similar as possible and that any necessary normalizations to account for differences have been made. For an example of thorough FBA-to-MFA comparisons, see Broddrick et al., (2019) and Roell et al., (2023).
• Note: FBA-predicted flux maps require definition not just of the network architecture and constraints, but also an objective function for optimization. Validation of the FBA-predicted flux maps is therefore also a validation of the selected objective function. It is possible for a poorly selected objective function to generate flux predictions that do not align with MFA-estimated fluxes; in such cases, alternative objective functions can be explored.

2. As highlighted in Table 1.1, different validation methods evaluate different aspects of the model’s predictions. Therefore, employing a number of different validations allows for a fuller and more detailed analysis of model performance and increases the likelihood that other users of the model may be able to appropriately apply it to their research question. For an example of a study employing several different validation techniques, see Heinken et al., (2023).

3. Validations of model predictions are only valuable when the data the predictions are validated against has not already been used in the training or construction of the model. The complexity of the metabolic model reconstruction and analysis process can make it difficult to notice when contamination of the validation dataset by training data has occurred. In order to identify contamination, one must consider the source of all data used for validation and consider whether it or a value derived from it was used at any stage of the FBA modeling process.
For an example of a study that clearly and systematically validates FBA predictions while avoiding such contamination, see Arion et al., (2023).

Improving confidence in the accuracy of FBA flux maps is valuable because generating validated 13C-MFA flux maps for all systems and conditions of interest is impractical. 13C-MFA requires substantial experimental work for each set of conditions and is unsuitable for many multicellular tissues and organisms where the required combination of extended periods of metabolic steady state, controlled provision of informative, non-perturbing labeled substrates, and sufficient labeling data cannot be achieved. This FBA-empowered future for systems biology and biotechnology requires well-validated MFA flux maps, so we turn our attention to model validation and selection in MFA.

1.4.2. Validation in 13C-MFA

13C-MFA flux estimates are typically validated based on the goodness-of-fit between measured labeling data and the corresponding values generated by the network model after the optimization of model parameters. The goodness-of-fit is represented by the sum of squared residuals (SSR), where each squared residual is weighted by dividing it by its experimental variance. The χ2-test of goodness-of-fit, which is built into commonly used 13C-MFA software (Weitzel et al., 2013; Shupletsov et al., 2014; Young, 2014), is then used to test whether the SSR falls within the 95% confidence interval expected for the defined number of degrees of freedom (DOF). Since its development as a validation method in 13C-MFA (Antoniewicz et al., 2006), the χ2-test has been widely used and has been useful in the validation of 13C-MFA metabolic models inferred from genome annotations (Au et al., 2014; Cordova and Antoniewicz, 2016; Cordova et al., 2017; Yu King Hing et al., 2021; Dahle et al., 2022; Imada et al., 2023; Mitosch et al., 2023).
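The test just described can be sketched as follows. This is an illustrative stand-alone version, not the implementation found in any 13C-MFA package. To stay within the Python standard library, the χ2 quantiles are computed with the Wilson-Hilferty approximation, which is an assumption made here; a statistics library with exact quantiles would normally be preferred. The function names and example numbers are hypothetical.

```python
from statistics import NormalDist


def weighted_ssr(measured, simulated, sd):
    """Variance-weighted sum of squared residuals.

    Each residual is divided by the measurement standard deviation before
    squaring, i.e. each squared residual is divided by its variance.
    """
    return sum(((m - s) / e) ** 2 for m, s, e in zip(measured, simulated, sd))


def chi2_quantile(p, dof):
    """Approximate chi-square quantile (Wilson-Hilferty approximation).

    Used only to avoid an external dependency; accuracy improves with dof.
    """
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3


def chi2_gof(ssr, n_measurements, n_free_params, alpha=0.05):
    """Two-sided check of whether the SSR lies in the expected chi-square range."""
    dof = n_measurements - n_free_params
    if dof <= 0:
        raise ValueError("need more independent measurements than free parameters")
    lower = chi2_quantile(alpha / 2.0, dof)
    upper = chi2_quantile(1.0 - alpha / 2.0, dof)
    return lower, upper, lower <= ssr <= upper


# Hypothetical mass isotopomer fractions with a 0.01 measurement SD.
ssr_example = weighted_ssr([0.50, 0.30, 0.20], [0.48, 0.33, 0.19], [0.01, 0.01, 0.01])

# Hypothetical fit: 40 measurements, 10 free parameters, SSR of 25.
lower, upper, accepted = chi2_gof(25.0, n_measurements=40, n_free_params=10)
print(ssr_example, lower, upper, accepted)
```

The outcome depends directly on the assumed measurement standard deviations and on the DOF count, so an error in either shifts the SSR relative to the acceptance range.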
However, as described in Sundqvist et al., (2022) and Theorell et al., (2017), the use of the χ2-test can be problematic in 13C-MFA for several reasons. Imposing upper and lower bounds on estimated flux parameter values makes accurate estimation of the effective DOF for the χ2-test difficult (Theorell et al., 2017). It can also be difficult to accurately determine errors in the MID measurements made for 13C-MFA, resulting in distortion of the variance-weighted SSR values that are being compared against the 95% confidence interval (Sundqvist et al., 2022).

In addition to these technical difficulties with properly applying the χ2-test, problems arise from how the test is implemented into the model development process during a typical 13C-MFA study. Especially for eukaryotic systems, 13C-MFA flux modeling generally involves making iterative changes to the model based on how well it can explain the data – as assessed informally and by the χ2-test – followed by refinement and assessment of the data based on this agreement. For example, if the data do not allow the fluxes between the same metabolite in different compartments to be determined, they may be merged in the model or additional measurements may be made to resolve them. Metabolites may also be excluded from the model due to inconsistency between their simulated vs. measured MIDs causing the model to fail the χ2-test, on the assumption that biological, model-structural, or analytical uncertainties underlie these unexplained divergences (Xu et al., 2022)¹. The difficulty of accurately quantifying MID measurement errors, mentioned earlier, may be addressed by arbitrarily increasing the assumed measurement error, which reduces the deduced precision of flux estimates to take into account the potential for error sources not accounted for by experimentally observed scatter (Xu et al., 2021b; Sundqvist et al., 2022; Xu et al., 2022)¹.
This process is a natural consequence of the diversity and uncertainty of the metabolic architecture of different systems and is a valid form of exploratory data analysis and model building. However, altering the model by excluding specific data points and adding additional fluxes or metabolites until the χ2-test passes, and then relying on this very same test as validation, is statistically dubious. As in the case of an FBA model validation in which the prediction being validated has been implicitly introduced to the model itself, a final validation of a 13C-MFA model with the same data used to make it acceptable, as quantified by the χ2-test, does not constitute a real validation. It also can naturally lead to over- or under-fit models, which we discuss below in the section on model selection.

Due to these difficulties, we propose that the χ2-test, as it is currently used, should be treated as one of multiple lines of evidence to consider when validating a 13C-MFA model, especially for less defined and/or more complex eukaryotic systems such as plants. One way to address the issue of using the χ2-test for both model development and validation is to reserve a portion of the dataset only for final model validation. This practice of holding out a subset of the data to be used exclusively for validation is standard statistical practice (Gross and MacLeod, 2017) in other areas of systems biology and, conveniently, can also be used for model selection (Sundqvist et al., 2022).

¹ Here we primarily cite our own work because, as discussed, there are a number of sound reasons for leaving out metabolites and/or increasing MID measurement errors. We have chosen not to highlight other studies that have employed the same practices since we do not know all of the experimental and analytical details underlying them and would not want their inclusion here to be interpreted as implicit criticism.
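A hold-out scheme of the kind described above might be organized as in the following sketch. It is only an outline under stated assumptions: measurements are held out as whole groups (for example, all mass isotopomers of one metabolite, or one tracer experiment) because values within a group are not independent, the metabolite and model names are hypothetical, and the validation SSR values are placeholders standing in for the result of fitting each candidate model to the estimation data and scoring it on the held-out data.

```python
import random


def split_holdout(measurement_groups, holdout_fraction=0.25, seed=0):
    """Partition measurement groups into estimation and validation sets.

    Whole groups are held out together so that correlated measurements do
    not end up on both sides of the split.
    """
    groups = sorted(measurement_groups)
    if len(groups) < 2:
        raise ValueError("need at least two measurement groups to split")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(groups)
    n_holdout = round(len(groups) * holdout_fraction)
    n_holdout = min(len(groups) - 1, max(1, n_holdout))
    validation = sorted(groups[:n_holdout])
    estimation = sorted(groups[n_holdout:])
    return estimation, validation


def select_by_validation(validation_ssr):
    """Return the candidate model with the smallest SSR on held-out data."""
    return min(validation_ssr, key=validation_ssr.get)


# Hypothetical metabolite groups with measured MIDs.
all_groups = ["3PGA", "RuBP", "S7P", "F6P", "G6P", "PEP", "Ser", "Gly"]
estimation, validation = split_holdout(all_groups, holdout_fraction=0.25, seed=1)

# Placeholder validation SSRs for three hypothetical candidate architectures.
validation_ssr = {"core": 41.2, "core_plus_vacuole": 18.7, "core_plus_vacuole_extra": 19.5}
best_model = select_by_validation(validation_ssr)
print(estimation, validation, best_model)
```

The essential discipline is that the validation groups are fixed before model development begins and are never consulted while the model is being revised.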
In the absence of direct experimentally measurable fluxes, independent measurements that can be measured or inferred from empirical measurements in vivo provide an important ground-truth value to compare with flux estimates and can complement the use of the χ2-test for validation. An example of this can be found in the plant 13C-MFA literature, where independent measurements of the relative rates of oxygenation and carboxylation by the enzyme rubisco can be compared with 13C-MFA flux estimates (Ma et al., 2014; Xu et al., 2021b; Xu et al., 2022). In Xu et al., (2021b) for example, our group compared predicted values for the relative rates of oxygenation and carboxylation by the enzyme rubisco in photosynthesis versus inferred values from stomatal conductance and other empirical measurements. This led us to conclude that labeling data from whole tissue extracts was insufficient to accurately estimate photorespiratory fluxes without information on the compartmentation of certain metabolites. Despite the strength of this form of validation, it is infrequently practiced. Another little-used but potentially valuable approach to validation is the corroboration of key features of 13C-MFA models using independent modeling methods. In Xu et al., (2022), simplified compartmental kinetic models yielded analytical solutions predicting that overall labeling time courses should take the form of sums of exponential rate components. Fitting labeling data to these exponential models and applying statistical model selection techniques provided independent corroboration of the overall architecture of the 13C-MFA model that was used to obtain a detailed flux map. Returning to goodness-of-fit, one must also keep in mind what information is taken into consideration and the effect of the assumed network architecture. In INST-MFA, where time- course labeling data is used, metabolite pool sizes are both estimable parameters and constrainable modeling inputs. 
When pool sizes are not provided as empirical measurements, pool size estimates are typically imprecise and inaccurate (Zheng et al., 2022). The inaccuracy of these estimates is not usually interpreted as an impediment to publishing 13C-MFA results and, according to Zheng et al., (2022), leaving out pool size information does not adversely affect flux estimate accuracy. Flux estimates are not, however, always robust against misspecifications of the network model (Sundqvist et al., 2022). The exclusion of pool size information provides greater flexibility in fitting experimental data, allowing robustness against model misspecifications at the expense of not detecting them (Zheng et al., 2022). A useful next step for this field would be to routinely measure and include pool size estimates to improve the detection of incorrect model architectures. Measurement of all metabolites in a way that allows discrimination of pools for identical metabolites in different cellular compartments requires a method like Non-Aqueous Fractionation [e.g., (Krueger et al., 2011)], which may be prohibitively difficult to implement in many studies. In such cases, a strategically selected set of metabolite levels may be used to allow for improved detection of incorrect model architectures. This introduces the matter of model selection.

1.5. Model selection in 13C-MFA

As discussed earlier, model development in 13C-MFA is an iterative process. Alternate models developed during this process may differ in their numbers of reactions and metabolites, resulting in different DOF. Adding model parameters can result in overfitting when these extra DOF lead the 13C-MFA optimization to fit noise rather than biological signal. Model selection techniques can be used to avoid this overfitting and to select the most statistically supported model among alternatives. The development of FBA models can also involve deciding between alternative architectures.
However, comparison and selection of such models from sets of alternatives based on their predictions’ deviations from empirical measurements is uncommon, so we focus our attention on 13C-MFA.

Model misspecification can result in missing important fluxes, incorrectly estimating the rates of modeled fluxes, or incorrectly estimating the precision of flux estimates. In a study our group performed of central metabolic fluxes in the oilseed crop Camelina sativa (Xu et al., 2022), previously published model architectures that passed the χ2-test of goodness-of-fit (Xu et al., 2021b) were nonetheless shown to be missing an important set of metabolic reactions involving the movement of carbohydrates to and from the vacuole. Sundqvist et al., (2022) provide in silico examples in which sub-optimal model selection results in flux estimates that fall outside the 95% confidence intervals obtained for those same fluxes using the correct model architecture, showing the potential for biased flux estimates when model selection is not properly performed. Finally, the literature on “Genome-scale-13C-MFA” has provided evidence that the exclusion of many reactions peripheral to the metabolic network under consideration (typically core metabolism) in 13C-MFA can result in artificially narrow confidence intervals. Genome-scale-13C-MFA involves estimating a flux map by minimizing deviation between predicted and measured isotopic labeling but using the kind of genome-scale metabolic network more typically used for FBA analyses (Gopalakrishnan and Maranas, 2015; Hendry et al., 2020).
In studies on the cyanobacteria Synechocystis sp. PCC 6803 (Gopalakrishnan et al., 2018) and Synechococcus elongatus UTEX 2973 (Hendry et al., 2019), it has been shown that the substantially larger genome-scale 13C-MFA models achieved better fits to the labeling data, that these reductions in SSR were statistically justified, and that the original models of core metabolism underestimated the uncertainty in a number of flux estimates by ignoring alternative metabolic pathways that could also explain patterns in the labeling data (Hendry et al., 2020).

The examples above demonstrate that rather than being a statistical curiosity, model selection (or the lack thereof) can have serious implications for the accuracy and reliability of flux modeling results.

Several approaches to model selection can be found in the 13C-MFA literature, with different approaches being taken in different studies. The simplest is selecting the model with the smallest SSR. This method does not work when the DOF of the compared models are different, as increasing the DOF in a model allows it to fit a given data set at least as well, whether or not the added complexity is justified. This may be accounted for informally by noting the change in DOF [e.g., (Xu et al., 2022)], or in a more statistically rigorous way using the extra-sum-of-squares test (Draper and Smith, 1998; Boyle et al., 2017) or information criteria (Schwarz, 1978; Akaike, 1998). The most common model selection approach used in 13C-MFA is an informal method using the χ2-test, wherein models are iteratively modified until a model and dataset pass the test, or where several alternative models are evaluated and the one that passes the test by the widest margin is selected (Dalman et al., 2016; Antoniewicz, 2018; Long and Antoniewicz, 2019a; Sundqvist et al., 2022).
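As a minimal sketch of the more formal SSR-based comparisons mentioned above, the Python snippet below computes the extra-sum-of-squares F statistic for two nested models together with AIC- and BIC-style scores. It assumes that each SSR is variance-weighted using known measurement standard deviations, so that the SSR equals −2 × log-likelihood up to an additive constant. The function names and all numerical values are hypothetical illustrations and are not taken from any of the cited studies.

```python
import math


def extra_sum_of_squares_f(ssr_simple, p_simple, ssr_complex, p_complex, n_obs):
    """Return (F, df_numerator, df_denominator) for two nested models.

    The simple model must be a special case of the complex model.
    """
    df_num = p_complex - p_simple   # number of extra free parameters
    df_den = n_obs - p_complex      # residual DOF of the complex model
    if df_num <= 0 or df_den <= 0:
        raise ValueError("need p_simple < p_complex < n_obs")
    f_stat = ((ssr_simple - ssr_complex) / df_num) / (ssr_complex / df_den)
    return f_stat, df_num, df_den


def aic(ssr, n_params):
    """AIC up to an additive constant (assumes known measurement SDs)."""
    return ssr + 2 * n_params


def bic(ssr, n_params, n_obs):
    """BIC up to an additive constant (assumes known measurement SDs)."""
    return ssr + n_params * math.log(n_obs)


# Hypothetical fits of two nested network models to the same 60 measurements.
n_obs = 60
simple_model = {"ssr": 75.0, "p": 10}
complex_model = {"ssr": 48.0, "p": 13}

f_stat, df_num, df_den = extra_sum_of_squares_f(
    simple_model["ssr"], simple_model["p"],
    complex_model["ssr"], complex_model["p"], n_obs,
)
print(f"F = {f_stat:.2f} on ({df_num}, {df_den}) DOF")
for name, m in (("simple", simple_model), ("complex", complex_model)):
    print(name, "AIC:", aic(m["ssr"], m["p"]),
          "BIC:", round(bic(m["ssr"], m["p"], n_obs), 2))
```

The F statistic would then be compared against an F distribution with the returned DOF (for example with `scipy.stats.f.sf(f_stat, df_num, df_den)` if SciPy is available), while lower AIC or BIC values indicate the preferred model. Unlike a raw SSR comparison, all three quantities penalize the additional parameters of the more complex model.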
These χ2-test-based approaches have been used, for example, to demonstrate that the isotopic labeling data of co-culture systems cannot be adequately described by modeling with a single-culture 13C-MFA model (Gebreselassie and Antoniewicz, 2015; Wolfsberg et al., 2018), to provide evidence for the operation of previously undescribed fluxes in mammalian cells (Ahn et al., 2016), and to detect missing reactions in metabolic network reconstructions from genome annotations or that are needed to describe the metabolism of mutant E. coli strains (Au et al., 2014; Long and Antoniewicz, 2019b).

However, the previously mentioned limitations of the χ2-test for model validation also affect its usefulness for model selection, and models failing the test due to these limitations can lead to the addition of statistically unjustified metabolites or reactions to the model until it passes (Sundqvist et al., 2022). We refer to the χ2-test-based methods as “informal” model selection because when multiple models are evaluated, they are not directly or formally compared to determine whether the additional parameters in more complex models are statistically justified, which can naturally lead to the selection of overfit models.

The general approach of avoiding overfitting by evaluating models based on their performance on a set of data not used during the fitting process is widely used in statistics [e.g., cross-validation techniques (Hastie et al., 2017)]. The validation-based approach taken in Sundqvist et al., (2022) implements this best practice, separating fitting and testing data sets to avoid the pitfalls discussed above. In our view, this represents a substantial advancement in model selection in 13C-MFA. This method divides the labeling dataset into training and validation subsets and then estimates fluxes in alternative models using the training data.
These alternative models’ flux maps, and their accompanying predicted MIDs, are then compared based on their agreement with the validation MID data. The model whose flux map results in the smallest SSR when compared with this validation data is selected. The authors generated synthetic labeling data from a predefined “correct” model and assessed the ability of their new method and other model selection techniques to identify this correct model from a set of alternatives. The validation-based approach accomplishes this more consistently than existing model selection methods, including χ2-test-based methods, and does so irrespective of the value of the measurement error in the labeling datasets. The incorrect models selected by other methods contain flux estimates that fall outside the 95% confidence intervals of the fluxes from the correct model, highlighting the importance of model selection for obtaining accurate flux estimates (Sundqvist et al., 2022). The generation of MID data in additional labeling experiments to precisely measure all fluxes in a network (Chang et al., 2008; Crown et al., 2012; Crown and Antoniewicz, 2012; Leighty and Antoniewicz, 2013; Millard et al., 2014; Crown et al., 2015; Crown et al., 2016; Beyß et al., 2021) provides the reserved validation datasets needed for the method of Sundqvist et al., (2022). This means that for 13C-MFA studies that already require a parallel labeling approach, implementation of this more rigorous model selection approach is simply a matter of setting aside a subset of data to evaluate alternative model architectures.

This approach can be extended in INST-MFA by using metabolite pool size measurements in the selection process. Individual pool sizes are sensitive to the local kinetic parameters and will fit poorly when reaction networks are incompletely specified (Zheng et al., 2022).
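The validation-based procedure described above can be illustrated with a small, self-contained Python sketch. It is not the implementation of Sundqvist et al., (2022): polynomials of increasing degree stand in for alternative network models, ordinary least squares stands in for flux estimation, and held-out data points stand in for MIDs from a reserved labeling experiment. All names, data values, and the assumed measurement standard deviation are hypothetical.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (toy problems only)."""
    m = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, m):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, m):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back-substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(ata[i][k] * coeffs[k] for k in range(i + 1, m))
        coeffs[i] = (aty[i] - tail) / ata[i][i]
    return coeffs


def ssr(coeffs, xs, ys, sd):
    """Variance-weighted sum of squared residuals."""
    def predict(x):
        return sum(c * x ** i for i, c in enumerate(coeffs))
    return sum(((y - predict(x)) / sd) ** 2 for x, y in zip(xs, ys))


def select_by_validation(candidates, train, valid, sd):
    """Fit every candidate on `train`; pick the lowest SSR on `valid`."""
    scores = {}
    for name, degree in candidates.items():
        coeffs = fit_polynomial(train[0], train[1], degree)
        scores[name] = ssr(coeffs, valid[0], valid[1], sd)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical data: a quadratic "true" model plus a fixed perturbation.
xs = [0.2 * i for i in range(12)]
noise = [0.05, -0.04, 0.03, -0.06, 0.02, 0.04,
         -0.05, 0.01, -0.02, 0.06, -0.03, 0.02]
ys = [1.0 + 2.0 * x - 0.5 * x * x + e for x, e in zip(xs, noise)]

train = (xs[0::2], ys[0::2])   # used for parameter estimation only
valid = (xs[1::2], ys[1::2])   # reserved for model selection only
sd = 0.05                      # assumed measurement standard deviation
candidates = {"model 1 (linear)": 1,
              "model 2 (quadratic)": 2,
              "model 3 (quartic)": 4}

best, scores = select_by_validation(candidates, train, valid, sd)
print("selected:", best)
for name, score in scores.items():
    print(f"{name}: validation SSR = {score:.1f}")
```

Because the training SSR can only decrease as parameters are added to nested models, it cannot by itself discriminate between them; the validation SSR, computed on data the fit never saw, can rise again when the extra parameters have fitted noise. In an actual 13C-MFA setting, the validation subset would more plausibly come from a separate tracer experiment than from interleaved points of a single dataset.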
Given this sensitivity of pool sizes to network misspecification, we suggest that validation-based model selection using pool size measurements as inputs is a promising prospective model selection approach for INST-MFA (Figure 1.2). Indeed, although not referred to explicitly as model selection, in Zheng et al., (2022) the authors show that inclusion of pool size information results in an incorrectly specified network architecture failing to pass the χ2-test of goodness-of-fit, whereas a correctly specified network does pass. This corresponds to the “first to pass χ2” method of model selection discussed by Sundqvist et al., (2022) and is subject to the various limitations of the χ2-test as a model selection technique covered earlier. By incorporating these metabolite pool sizes into the formalized model selection framework described by Sundqvist et al., (2022), we may arrive at a more robust form of model selection that is better at detecting misspecified networks.

As Sundqvist et al., (2022) note, the optimal model selected by their method should be subjected to a final validation to assess model quality. A model architecture may be selected by the model selection process but result in a substantial deviation of some metric from independently measured values. For this final validation, a combination of the χ2-test, independent experimental measurements, and alternative modeling approaches can be used. Keeping in mind both the trade-off between goodness-of-fit and model complexity and the multiple ways in which 13C-MFA model predictions can be validated will ensure that flux estimates are as accurate and robust as possible.

Model validation and selection are an integral part of the 13C-MFA process.
Notably, model selection practices like the use of validation-based model selection (Sundqvist et al., 2022) and the use of the extra-sum-of-squares test (Boyle et al., 2017) to compare alternative model architectures represent, in our view, a major improvement over exclusive use of the χ2-test of goodness-of-fit for both purposes, but are seldom practiced in the literature. We encourage the use of these techniques and believe they hold promise for improving confidence in both the fluxes and network architectures reported in studies.

With respect to validation and model selection in MFA, we recommend the following:

1. As highlighted by Antoniewicz, (2018), transparency is key in 13C-MFA, given the assumptions that must be satisfied for 13C-MFA modeling as well as the sensitivity of flux estimates to model architecture. As an example of a transparently reported 13C-MFA study, see Nicolae et al., (2014).
2. The validation and selection of MFA-estimated fluxes, like the validation of any model output, benefits from multiple lines of corroborating evidence. When possible, the use of alternative modeling approaches of isotopic labeling data can be a powerful tool for arriving at well-supported model architectures, as in Xu et al., (2022).
3. In INST-MFA, metabolite pool size measurements can be used to provide additional confidence in model validity and tighten flux confidence intervals (Nöh et al., 2007), as well as provide additional measurements for validation-based model selection. However, practitioners should be aware that these measurements can make model fits highly sensitive to incorrectly specified network models in ways that may or may not affect the accuracy of flux estimates (Zheng et al., 2022). Additionally, determination of subcellular compartmentation of certain metabolites may be prohibitively difficult in some cases. In such cases, key metabolites with known subcellular compartmentation may be measured.
4. We recommend the use of a proper model selection framework to compare alternative, biochemically reasonable model architectures when performing 13C-MFA modeling. The framework outlined in Sundqvist et al., (2022) represents the state-of-the-art in this area. Barring the application of that method, a more traditional model selection approach, such as the extra-sum-of-squares approach used in Boyle et al., (2017), can be employed.

Figure 1.2: Approaches to model selection for 13C-MFA. Metabolic network models 1-3 having increasing complexity are compared. Model 2 in this example is the correct description of the network. (A) Labeling data (MID1 & MID2) are gathered and, for each model, agreement between model output and these data is optimized. The χ2-test of goodness-of-fit is used to assess each model fit and these model fits are ranked 1st, 2nd, or 3rd, with the 1st passing the test by the widest margin and being selected as the most statistically well-supported model. (B) Labeling data are split into “training” and “testing” subsets and agreement between model output and the “training” data is optimized. The Sum-of-Squared Residuals (SSR) is then calculated for each model from the deviation between its output and the “testing” data. The model fits are then ranked 1st, 2nd, and 3rd, with the 1st having the lowest SSR and being selected. (C) Labeling data and metabolite pool data (C1 and C2) are gathered and split into “training” and “testing” subsets. For each model, agreement between model output and these data is optimized. The Sum-of-Squared Residuals (SSR) is then calculated for each model from the deviation between its output and the “testing” data. The model fits are then ranked 1st, 2nd, and 3rd, with the 1st having the lowest SSR and being selected.
The inclusion of metabolite pool size data into both the “fitting” and “testing” datasets provides more data on which to evaluate goodness-of-fit, potentially increasing the likelihood of identifying the correct model from a set of alternatives.

1.6. Future directions

We believe that validation and selection deserve greater attention from the flux analysis community and suggest that implementing the approaches highlighted in this perspective will improve the accuracy and reliability of constraint-based metabolic modeling and flux estimates. However, we also recognize that some approaches suggested here, such as the use of pool size measurements, can be extremely difficult to implement in practice. A recent publication on isotopically non-stationary MFA of Arabidopsis thaliana heterotrophic cell culture metabolism highlighted that although pool size data could potentially be used to improve the accuracy and precision of flux predictions, the experimental difficulty of measuring the concentrations of metabolites distributed across multiple subcellular compartments made this prohibitively difficult (Smith et al., 2022). As in all areas of science, then, the development of consensus best practices in the evaluation of and inference from data and models must arise at the intersection of rigorous statistical theory and experimental practicalities. However, we believe that researchers engaged in constraint-based metabolic modeling as well as readers of modeling studies benefit when the limitations of present validation and selection practices are clarified.

Several matters call for investigation before definitive recommendations can be made on best practices.
At present, it is not clear how to appropriately weight the contributions to flux estimation of unambiguous direct flux measurements like substrate uptake, which typically have relatively large standard deviations, against MIDs, which frequently have much smaller standard deviations but whose relationship to fluxes depends on model structure and whose measured values may be offset by unknown analytical effects. Likewise, it is unclear how best to deal with those not infrequent MID measurements that have extremely small, but imprecisely measured, standard deviations, which can exert too much control over the fitting process.

Finally, we would like to conclude by emphasizing that the process of careful validation and model selection can lead to the generation of models that are not only more quantitatively sound, but that yield exciting scientific insights [e.g., (Ahn et al., 2016; Wolfsberg et al., 2018)].

1.7. Acknowledgments

This research was supported by the Office of Science (BER), U.S. Department of Energy, Grant no. DE-SC0018269 (J.A.M.K., Y.S-H.). This work is supported, in part, by the NSF Research Traineeship Program (Grant DGE-1828149) to J.A.M.K. This publication was also made possible by a predoctoral training award to J.A.M.K. from Grant T32-GM110523 from National Institute of General Medical Sciences (NIGMS) of the NIH. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIGMS or NIH. Figures made using BioRender.com.

1.8. Author contributions

Conceptualization: J.A.M.K. and Y.S-H. Investigation: J.A.M.K. Visualization: J.A.M.K. Writing – original draft: J.A.M.K. Writing – review and editing: J.A.M.K. and Y.S-H.

REFERENCES

Ahn WS, Crown SB, Antoniewicz MR (2016) Evidence for transketolase-like TKTL1 flux in CHO cells based on parallel labeling experiments and (13)C-metabolic flux analysis.
Metabolic Engineering 37: 72–78
Akaike H (1998) Information Theory and an Extension of the Maximum Likelihood Principle. In E Parzen, K Tanabe, G Kitagawa, eds, Selected Papers of Hirotugu Akaike. Springer New York, New York, NY, pp 199–213
Åkesson M, Förster J, Nielsen J (2004) Integration of gene expression data into genome-scale metabolic models. Metabolic Engineering 6: 285–293
Alzoubi D, Desouki AA, Lercher MJ (2019) Flux balance analysis with or without molecular crowding fails to predict two thirds of experimentally observed epistasis in yeast. Scientific Reports 9: 1–9
Antoniewicz MR (2015) Methods and advances in metabolic flux analysis: a mini-review. Journal of Industrial Microbiology and Biotechnology 42: 317–325
Antoniewicz MR (2018) A guide to 13C metabolic flux analysis for the cancer biologist. Experimental and Molecular Medicine. doi: 10.1038/s12276-018-0060-y
Antoniewicz MR, Kelleher JK, Stephanopoulos G (2006) Determination of confidence intervals of metabolic fluxes estimated from stable isotope measurements. Metabolic Engineering 8: 324–337
Arion I-S, Hiroyuki O, Matti G, Ghita G, Kapil A, X. CO, Terence H, Sebastian B (2023) A Genome-Scale Metabolic Model of Marine Heterotroph Vibrio splendidus Strain 1A01. mSystems 0: e00377-22
Au J, Choi J, Jones SW, Venkataramanan KP, Antoniewicz MR (2014) Parallel labeling experiments validate Clostridium acetobutylicum metabolic network model for (13)C metabolic flux analysis. Metabolic Engineering 26: 23–33
Becker J, Zelder O, Häfner S, Schröder H, Wittmann C (2011) From zero to hero-Design-based systems metabolic engineering of Corynebacterium glutamicum for l-lysine production. Metabolic Engineering 13: 159–168
Becker SA, Palsson BO (2008) Context-specific metabolic networks are consistent with experiments. PLoS Computational Biology. doi: 10.1371/journal.pcbi.1000082
Bernstein DB, Sulheim S, Almaas E, Segrè D (2021) Addressing uncertainty in genome-scale metabolic model reconstruction and analysis.
Genome Biology 22: 64
Beyß M, Parra-Peña VD, Ramirez-Malule H, Nöh K (2021) Robustifying Experimental Tracer Design for 13C-Metabolic Flux Analysis. Frontiers in Bioengineering and Biotechnology. doi: 10.3389/fbioe.2021.685323
Blázquez B, San León D, Rojas A, Tortajada M, Nogales J (2023) New Insights on Metabolic Features of Bacillus subtilis Based on Multistrain Genome-Scale Metabolic Modeling. International Journal of Molecular Sciences. doi: 10.3390/ijms24087091
Bordel S, Agren R, Nielsen J (2010) Sampling the solution space in genome-scale metabolic networks reveals transcriptional regulation in key enzymes. PLoS Computational Biology 6: 16
Boyle NR, Sengupta N, Morgan JA (2017) Metabolic flux analysis of heterotrophic growth in Chlamydomonas reinhardtii. PLoS ONE 12: 1–23
Broddrick JT, Welkie DG, Jallet D, Golden SS, Peers G, Palsson BO (2019) Predicting the metabolic capabilities of Synechococcus elongatus PCC 7942 adapted to different light regimes. Metabolic Engineering 52: 42–56
Burgard AP, Pharkya P, Maranas CD (2003) OptKnock: A Bilevel Programming Framework for Identifying Gene Knockout Strategies for Microbial Strain Optimization. Biotechnology and Bioengineering 84: 647–657
Chang Y, Suthers PF, Maranas CD (2008) Identification of optimal measurement sets for complete flux elucidation in metabolic flux analysis experiments. Biotechnology and Bioengineering 100: 1039–1049
Cheah YE, Young JD (2018) Isotopically nonstationary metabolic flux analysis (INST-MFA): putting theory into practice. Current Opinion in Biotechnology 54: 80–87
Chen X, Alonso AP, Allen DK, Reed JL, Shachar-Hill Y (2011) Synergy between 13C-metabolic flux analysis and flux balance analysis for understanding metabolic adaption to anaerobiosis in E. coli. Metabolic Engineering 13: 38–48
Choi J, Antoniewicz MR (2019) Tandem mass spectrometry for 13C metabolic flux analysis: Methods and algorithms based on EMU framework.
Frontiers in Microbiology 10: 31
Choi Y-M, Choi D-H, Lee YQ, Koduru L, Lewis NE, Lakshmanan M, Lee D-Y (2023) Mitigating biomass composition uncertainties in flux balance analysis using ensemble representations. Computational and Structural Biotechnology Journal 21: 3736–3745
Coppens L, Tschirhart T, Leary DH, Colston SM, Compton JR, Hervey IV WJ, Dana KL, Vora GJ, Bordel S, Ledesma-Amaro R (2023) Vibrio natriegens genome-scale modeling reveals insights into halophilic adaptations and resource allocation. Molecular Systems Biology 19: e10523
Cordova LT, Antoniewicz MR (2016) (13)C metabolic flux analysis of the extremely thermophilic, fast growing, xylose-utilizing Geobacillus strain LC300. Metabolic Engineering 33: 148–157
Cordova LT, Cipolla RM, Swarup A, Long CP, Antoniewicz MR (2017) (13)C metabolic flux analysis of three divergent extremely thermophilic bacteria: Geobacillus sp. LC300, Thermus thermophilus HB8, and Rhodothermus marinus DSM 4252. Metabolic Engineering 44: 182–190
Crown SB, Ahn WS, Antoniewicz MR (2012) Rational design of 13C-labeling experiments for metabolic flux analysis in mammalian cells. BMC Systems Biology 6: 43
Crown SB, Antoniewicz MR (2012) Selection of tracers for 13C-Metabolic Flux Analysis using Elementary Metabolite Units (EMU) basis vector methodology. Metabolic Engineering 14: 150–161
Crown SB, Long CP, Antoniewicz MR (2016) Optimal tracers for parallel labeling experiments and 13C metabolic flux analysis: A new precision and synergy scoring system. Metabolic Engineering 38: 10–18
Crown SB, Long CP, Antoniewicz MR (2015) Integrated 13C-metabolic flux analysis of 14 parallel labeling experiments in Escherichia coli. Metabolic Engineering 28: 151–158
Dahle ML, Papoutsakis ET, Antoniewicz MR (2022) 13C-metabolic flux analysis of Clostridium ljungdahlii illuminates its core metabolism under mixotrophic culture conditions.
Metabolic Engineering 72: 161–170
Dalman T, Wiechert W, Nöh K (2016) A scientific workflow framework for 13C metabolic flux analysis. Journal of Biotechnology 232: 12–24
Dinh HV, Sarkar D, Maranas CD (2022) Quantifying the propagation of parametric uncertainty on flux balance analysis. Metabolic Engineering 69: 26–39
Draper NR, Smith H (1998) Extra Sums of Squares and Tests for Several Parameters Being Zero. Applied Regression Analysis. John Wiley & Sons, Ltd, pp 149–177
Duarte NC, Becker SA, Jamshidi N, Thiele I, Mo ML, Vo TD, Srivas R, Palsson BØ (2007) Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proceedings of the National Academy of Sciences of the United States of America 104: 1777–1782
Ebrahim A, Lerman JA, Palsson BO, Hyduke DR (2013) COBRApy: COnstraints-Based Reconstruction and Analysis for Python. BMC Systems Biology. doi: 10.1186/1752-0509-7-74
Feierabend M, Renz A, Zelle E, Nöh K, Wiechert W, Dräger A (2021) High-Quality Genome-Scale Reconstruction of Corynebacterium glutamicum ATCC 13032. Frontiers in Microbiology 12: 750206
Gatto F, Miess H, Schulze A, Nielsen J (2015) Flux balance analysis predicts essential genes in clear cell renal cell carcinoma metabolism. Scientific Reports 5: 10738
Gebreselassie NA, Antoniewicz MR (2015) (13)C-metabolic flux analysis of co-cultures: A novel approach. Metabolic Engineering 31: 132–139
Gianchandani EP, Oberhardt MA, Burgard AP, Maranas CD, Papin JA (2008) Predicting biological system objectives de novo from internal state measurements. BMC Bioinformatics 9: 43
Gleizer S, Ben-Nissan R, Bar-On YM, Antonovsky N, Noor E, Zohar Y, Jona G, Krieger E, Shamshoum M, Bar-Even A, et al (2019) Conversion of Escherichia coli to Generate All Biomass Carbon from CO2. Cell 179: 1255-1263.e12
Gopalakrishnan S, Maranas CD (2015) 13C metabolic flux analysis at a genome-scale.
Metabolic Engineering 32: 12–22
Gopalakrishnan S, Pakrasi HB, Maranas CD (2018) Elucidation of photoautotrophic carbon flux topology in Synechocystis PCC 6803 using genome-scale carbon mapping models. Metabolic Engineering 47: 190–199
Gross F, MacLeod M (2017) Prospects and problems for standardizing model validation in systems biology. Progress in Biophysics and Molecular Biology 129: 3–12
Haraldsdóttir HS, Cousins B, Thiele I, Fleming RMT, Vempala S (2017) CHRR: Coordinate hit-and-run with rounding for uniform sampling of constraint-based models. Bioinformatics 33: 1741–1743
Hastie T, Tibshirani R, Friedman J (2017) Model Assessment and Selection. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, 2nd ed. Springer, New York, NY, pp 219–260
Heinken A, Hertel J, Acharya G, Ravcheev DA, Nyga M, Okpala OE, Hogan M, Magnúsdóttir S, Martinelli F, Nap B, et al (2023) Genome-scale metabolic reconstruction of 7,302 human microorganisms for personalized medicine. Nature Biotechnology. doi: 10.1038/s41587-022-01628-0
Heinken A, Magnúsdóttir S, Fleming RMT, Thiele I (2021) DEMETER: efficient simultaneous curation of genome-scale reconstructions guided by experimental data and refined gene annotations. Bioinformatics 37: 3974–3975
Heirendt L, Arreckx S, Pfau T, Mendoza SN, Richelle A, Heinken A, Haraldsdóttir HS, Wachowiak J, Keating SM, Vlasov V, et al (2019) Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v.3.0. Nature Protocols 14: 639–702
Hendry JI, Dinh HV, Foster C, Gopalakrishnan S, Wang L, Maranas CD (2020) Metabolic flux analysis reaching genome wide coverage: lessons learned and future perspectives. Current Opinion in Chemical Engineering 30: 17–25
Hendry JI, Gopalakrishnan S, Ungerer J, Pakrasi HB, Tang YJ, Maranas CD (2019) Genome-Scale Fluxome of Synechococcus elongatus UTEX 2973 Using Transient 13C-Labeling Data.
Plant Physiology 179: 761–769
Holzhütter HG (2004) The principle of flux minimization and its application to estimate stationary fluxes in metabolic networks. European Journal of Biochemistry 271: 2905–2922
Imada T, Yamamoto C, Toyoshima M, Toya Y, Shimizu H (2023) Effect of light fluctuations on photosynthesis and metabolic flux in Synechocystis sp. PCC 6803. Biotechnology Progress 39: e3326
Jazmin LJ, Beckers V, Young JD (2014) User Manual for INCA.
Kaste JAM, Shachar-Hill Y (2023) Accurate flux predictions using tissue-specific gene expression in plant metabolic modeling. Bioinformatics: btad186
Kirk P, Thorne T, Stumpf MPH (2013) Model selection in systems and synthetic biology. Current Opinion in Biotechnology 24: 767–774
Koffas MAG, Jung GY, Stephanopoulos G (2003) Engineering metabolism and product formation in Corynebacterium glutamicum by coordinated gene overexpression. Metabolic Engineering 5: 32–41
Koffas MAG, Stephanopoulos G (2005) Strain improvement by metabolic engineering: Lysine production as a case study for systems biology. Current Opinion in Biotechnology 16: 361–366
Krueger S, Giavalisco P, Krall L, Steinhauser M-C, Büssis D, Usadel B, Flügge U-I, Fernie AR, Willmitzer L, Steinhauser D (2011) A topological map of the compartmentalized Arabidopsis thaliana leaf metabolome. PLoS ONE 6: e17806
Lee JM, Gianchandani EP, Papin JA (2006) Flux balance analysis in the era of metabolomics. Briefings in Bioinformatics 7: 140–150
Leighty RW, Antoniewicz MR (2013) COMPLETE-MFA: Complementary parallel labeling experiments technique for metabolic flux analysis. Metabolic Engineering 20: 49–55
Lewis NE, Hixson KK, Conrad TM, Lerman JA, Charusanti P, Polpitiya AD, Adkins JN, Schramm G, Purvine SO, Lopez-Ferrer D, et al (2010) Omic data from evolved E. coli are consistent with computed optimal growth from genome-scale models. Molecular Systems Biology.
doi: 10.1038/msb.2010.47
Lieven C, Beber ME, Olivier BG, Bergmann FT, Ataman M, Babaei P, Bartell JA, Blank LM, Chauhan S, Correia K, et al (2020) MEMOTE for standardized genome-scale metabolic model testing. Nature Biotechnology 38: 272–276
Long CP, Antoniewicz MR (2019a) High-resolution (13)C metabolic flux analysis. Nature Protocols 14: 2856–2877
Long CP, Antoniewicz MR (2019b) Metabolic flux responses to deletion of 20 core enzymes reveal flexibility and limits of E. coli metabolism. Metabolic Engineering 55: 249–257
Ma F, Jazmin LJ, Young JD, Allen DK (2014) Isotopically nonstationary 13C flux analysis of changes in Arabidopsis thaliana leaf metabolism due to high light acclimation. Proceedings of the National Academy of Sciences of the United States of America 111: 16967–16972
Machado D, Herrgård M (2014) Systematic Evaluation of Methods for Integration of Transcriptomic Data into Constraint-Based Models of Metabolism. PLoS Computational Biology. doi: 10.1371/journal.pcbi.1003580
Mahadevan R, Schilling CH (2003) The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metabolic Engineering 5: 264–276
Megchelenbrink W, Huynen M, Marchiori E (2014) optGpSampler: An improved tool for uniformly sampling the solution-space of genome-scale metabolic networks. PLoS ONE. doi: 10.1371/journal.pone.0086587
Millard P, Sokol S, Letisse F, Portais J-C (2014) IsoDesign: A software for optimizing the design of 13C-metabolic flux analysis experiments. Biotechnology and Bioengineering 111: 202–208
Mitosch K, Beyß M, Phapale P, Drotleff B, Nöh K, Alexandrov T, Patil KR, Typas A (2023) A pathogen-specific isotope tracing approach reveals metabolic activities and fluxes of intracellular Salmonella.
PLOS Biology 21: e3002198
Nicolae A, Wahrheit J, Bahnemann J, Zeng A-P, Heinzle E (2014) Non-stationary 13C metabolic flux analysis of Chinese hamster ovary cells in batch culture using extracellular labeling highlights metabolic reversibility and compartmentation. BMC Systems Biology 8: 50
Nielsen J (2003) It Is All about Metabolic Fluxes. Journal of Bacteriology 185: 7031–7035
Noecker C, Sanchez J, Bisanz JE, Escalante V, Alexander M, Trepka K, Heinken A, Liu Y, Dodd D, Thiele I, et al (2023) Systems biology elucidates the distinctive metabolic niche filled by the human gut microbe Eggerthella lenta. PLOS Biology 21: e3002125
Nöh K, Grönke K, Luo B, Takors R, Oldiges M, Wiechert W (2007) Metabolic flux analysis at ultra short time scale: Isotopically non-stationary 13C labeling experiments. Journal of Biotechnology 129: 249–267
Norsigian CJ, Pusarla N, McConn JL, Yurkovich JT, Dräger A, Palsson BO, King Z (2020) BiGG Models 2020: multi-strain genome-scale models and expansion across the phylogenetic tree. Nucleic Acids Research 48: D402–D406
Oftadeh O, Salvy P, Masid M, Curvat M, Miskovic L, Hatzimanikatis V (2021) A genome-scale metabolic model of Saccharomyces cerevisiae that integrates expression constraints and reaction thermodynamics. Nature Communications 12: 4790
Ong W, Vu TT, Lovendahl KN, Llull JM, Serres MH, Romine MF, Reed JL (2014) Comparisons of Shewanella strains based on genome annotations, modeling, and experiments. BMC Systems Biology 8: 31
Orth JD, Fleming RMT, Palsson BØ (2010a) Reconstruction and Use of Microbial Metabolic Networks: the Core Escherichia coli Metabolic Model as an Educational Guide. EcoSal Plus. doi: 10.1128/ecosalplus.10.2.1
Orth JD, Thiele I, Palsson BO (2010b) What is flux balance analysis? Nature Biotechnology 28: 245–248
Pandey V, Hadadi N, Hatzimanikatis V (2019) Enhanced flux prediction by integrating relative expression and relative metabolite abundance into thermodynamically consistent metabolic models.
PLOS Computational Biology 15: 1–23
Pinchuk GE, Hill EA, Geydebrekht OV, de Ingeniis J, Zhang X, Osterman A, Scott JH, Reed SB, Romine MF, Konopka AE, et al (2010) Constraint-based model of Shewanella oneidensis MR-1 metabolism: A tool for data analysis and hypothesis generation. PLoS Computational Biology 6: 1–8
Ravi S, Gunawan R (2021) ΔFBA—Predicting metabolic flux alterations using genome-scale metabolic models and differential transcriptomic data. PLOS Computational Biology 17: e1009589
Roell GW, Schenk C, Anthony WE, Carr RR, Ponukumati A, Kim J, Akhmatskaya E, Foston M, Dantas G, Moon TS, et al (2023) A High-Quality Genome-Scale Model for Rhodococcus opacus Metabolism. ACS Synth Biol 12: 1632–1644
Santos-Merino M, Gargantilla-Becerra Á, de la Cruz F, Nogales J (2023) Highlighting the potential of Synechococcus elongatus PCC 7942 as platform to produce α-linolenic acid through an updated genome-scale metabolic modeling. Frontiers in Microbiology. doi: 10.3389/fmicb.2023.1126030
Schellenberger J, Palsson B (2009) Use of randomized sampling for analysis of metabolic networks. Journal of Biological Chemistry 284: 5457–5461
Schnitzer B, Österberg L, Cvijovic M (2022) The choice of the objective function in flux balance analysis is crucial for predicting replicative lifespans in yeast. PLOS ONE 17: 1–15
Schroeder WL, Saha R (2020) Introducing an Optimization- and explicit Runge-Kutta-based Approach to Perform Dynamic Flux Balance Analysis. Scientific Reports 10: 1–28
Schuetz R, Kuepfer L, Sauer U (2007) Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli. Molecular Systems Biology 3: 199. doi: 10.1038/msb4100162
Schwarz G (1978) Estimating the Dimension of a Model. The Annals of Statistics 6: 461–464
Segrè D, Vitkup D, Church GM (2002) Analysis of optimality in natural and perturbed metabolic networks.
Proceedings of the National Academy of Sciences of the United States of America 99: 15112–15117
Shinfuku Y, Sorpitiporn N, Sono M, Furusawa C, Hirasawa T, Shimizu H (2009) Development and experimental verification of a genome-scale metabolic model for Corynebacterium glutamicum. Microbial Cell Factories 8: 43
Shlomi T, Berkman O, Ruppin E (2005) Regulatory on/off minimization of metabolic flux changes after genetic perturbations. Proceedings of the National Academy of Sciences 102: 7695–7700
Shupletsov MS, Golubeva LI, Rubina SS, Podvyaznikov DA, Iwatani S, Mashko SV (2014) OpenFLUX2: 13C-MFA modeling software package adjusted for the comprehensive analysis of single and parallel labeling experiments. Microbial Cell Factories 13: 1–25
Sigurdsson MI, Jamshidi N, Steingrimsson E, Thiele I, Palsson BØ (2010) A detailed genome-wide reconstruction of mouse metabolism based on human Recon 1. BMC systems biology 4: 140
Smith EN, Ratcliffe RG, Kruger NJ (2022) Isotopically non-stationary metabolic flux analysis of heterotrophic Arabidopsis thaliana cell cultures. Frontiers in plant science 13: 1049559
Spivey A (2004) Systems biology: the big picture. Environmental health perspectives 112: 938–943
Stanford NJ, Millard P, Swainston N (2015) RobOKoD: Microbial strain design for (over)production of target compounds. Frontiers in Cell and Developmental Biology 3: 1–12
Sundqvist N, Grankvist N, Watrous J, Mohit J, Nilsson R, Cedersund G (2022) Validation-based model selection for 13C metabolic flux analysis with uncertain measurement errors. PLOS Computational Biology 18: e1009999
Tec-Campos D, Posadas C, Tibocha-Bonilla JD, Thiruppathy D, Glonek N, Zuñiga C, Zepeda A, Zengler K (2023) The genome-scale metabolic model for the purple non-sulfur bacterium Rhodopseudomonas palustris Bis A53 accurately predicts phenotypes under chemoheterotrophic, chemoautotrophic, photoheterotrophic, and photoautotrophic growth conditions.
PLOS Computational Biology 19: e1011371
Tepper N, Shlomi T (2009) Predicting metabolic engineering knockout strategies for chemical production: Accounting for competing pathways. Bioinformatics 26: 536–543
Theorell A, Leweke S, Wiechert W, Nöh K (2017) To be certain about the uncertainty: Bayesian statistics for 13C metabolic flux analysis. Biotechnology and Bioengineering 114: 2668–2684
Thiele I, Palsson B (2010) A protocol for generating a high-quality genome-scale metabolic reconstruction. Nature Protocols 5: 93–121
Tian M, Reed JL (2018) Integrating proteomic or transcriptomic data into metabolic models using linear bound flux balance analysis. Bioinformatics 34: 3882–3888
Varma A, Palsson BO (1994) Stoichiometric flux balance models quantitatively predict growth and metabolic by-product secretion in wild-type Escherichia coli W3110. Applied and Environmental Microbiology 60: 3724–3731
Wang Y, Hui S, Wondisford FE, Su X (2021) Utilizing tandem mass spectrometry for metabolic flux analysis. Laboratory Investigation 101: 423–429
Weitzel M, Nöh K, Dalman T, Niedenführ S, Stute B, Wiechert W (2013) 13CFLUX2 - High-performance software suite for 13C-metabolic flux analysis. Bioinformatics 29: 143–145
Wolfsberg E, Long CP, Antoniewicz MR (2018) Metabolism in dense microbial colonies: (13)C metabolic flux analysis of E. coli grown on agar identifies two distinct cell populations with acetate cross-feeding. Metabolic engineering 49: 242–247
Xu Y, Fu X, Sharkey TD, Shachar-Hill Y, Walker BJ (2021) The metabolic origins of non-photorespiratory CO2 release during photosynthesis: A metabolic flux analysis. Plant Physiology 186: 297–314
Xu Y, Wieloch T, Kaste JAM, Shachar-Hill Y, Sharkey TD (2022) Reimport of carbon from cytosolic and vacuolar sugar pools into the Calvin-Benson cycle explains photosynthesis labeling anomalies.
Proceedings of the National Academy of Sciences 119: e2121531119
Young JD (2014) INCA: A computational platform for isotopically non-stationary metabolic flux analysis. Bioinformatics 30: 1333–1335
Yu King Hing N, Aryal UK, Morgan JA (2021) Probing Light-Dependent Regulation of the Calvin Cycle Using a Multi-Omics Approach. Frontiers in Plant Science. doi: 10.3389/fpls.2021.733122
Yuan H, Cheung CYM, Hilbers PAJ, van Riel NAW (2016) Flux Balance Analysis of Plant Metabolism: The Effect of Biomass Composition and Model Structure on Model Predictions. Frontiers in Plant Science. doi: 10.3389/fpls.2016.00537
Zheng AO, Sher A, Fridman D, Musante CJ, Young JD (2022) Pool size measurements improve precision of flux estimates but increase sensitivity to unmodeled reactions outside the core network in isotopically nonstationary metabolic flux analysis (INST-MFA). Biotechnology Journal 17: 1–17

Chapter 2 Reimport of carbon from cytosolic and vacuolar sugar pools into the Calvin–Benson cycle explains photosynthesis labeling anomalies

This research was published in: Y. Xu*, T. Wieloch*, J. A. M. Kaste*, Y. Shachar-Hill, T. D. Sharkey, Reimport of carbon from cytosolic and vacuolar sugar pools into the Calvin-Benson cycle explains photosynthesis labeling anomalies. Proc. Natl. Acad. Sci. 119, e2121531119 (2022).

* indicates co-first authorship.

2.1. Preface

My involvement in this study began as a simple request from Dr. Shachar-Hill and Dr. Xu to model the decrease in the proportion of 12C over all carbon in the CBC intermediates of a C. sativa leaf in a 13C labeling experiment. Dr. Sharkey had previously shown using a semi-log plot of labeling over time that the incorporation of 13C label into the CBC intermediates could be fit to two distinct-looking lines, which he interpreted, as per old biochemical convention, to correspond to two processes acting on distinct time scales.
These two processes were seen as mapping to the contributions of two distinct pools to the labeling time course of CBC intermediates – the contribution of the CBC intermediate pool back to itself, and the contribution of the cytosolic sugar pool via the glucose-6-phosphate shunt. This provided a convincing independent line of evidence in favor of the glucose-6-phosphate shunt’s contribution to CBC labeling in prior studies, and so he, along with Dr. Shachar-Hill and Dr. Xu, was interested in demonstrating something similar in this study of C. sativa they were working on. Dr. Shachar-Hill noted that, mathematically speaking, the practice of fitting straight lines to a semi-log plot does not seem to map one-to-one with the idea of identifying processes acting over distinct time scales. Because of this, he was interested in using nonlinear modeling to fit the untransformed % 12C data, for which he enlisted my help. In the course of modeling the labeling data gathered by Dr. Xu, Dr. Shachar-Hill and I refined his suspicions about the problems with the semi-log plot approach. Indeed, in the area of pharmacokinetic modeling, a couple of facts had long since been established:

1. The movement of a metabolite or label through a system of compartments interlinked by pseudo-first-order kinetic processes (which, to a first approximation, describes the labeling of the CBC intermediate pool as influenced by cytosolic sugars) can be described by sum-of-exponential, or polyexponential, models.
2. Curve-stripping, which is a more refined version of the practice of fitting straight lines to a semi-log plot, only gives a rough approximation of the fits obtained by performing a nonlinear regression using such polyexponential models.

Taken together, these suggested that in order to make an inference about the number of pools contributing to CBC labeling, we really did need to use a nonlinear regression approach.
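The gap between curve stripping and the underlying polyexponential can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the amplitudes, rate constants, and time points are hypothetical and noise-free, not values from the study, and the helper names `polyexp` and `log_linear_fit` are invented here. It builds a biexponential washout curve, strips it in two log-linear steps, and lets the stripped estimates be compared with the parameters that generated the data.

```python
import math

def polyexp(t, amplitudes, rates, constant=0.0):
    """Sum-of-exponentials model: constant + sum_i A_i * exp(-k_i * t)."""
    return constant + sum(a * math.exp(-k * t) for a, k in zip(amplitudes, rates))

def log_linear_fit(times, values):
    """Least-squares line through (t, ln y); returns (amplitude, rate)."""
    logs = [math.log(v) for v in values]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    slope = (sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
             / sum((t - t_mean) ** 2 for t in times))
    intercept = l_mean - slope * t_mean
    return math.exp(intercept), -slope

# Hypothetical, noise-free biexponential washout (illustrative values only).
TRUE_AMPLITUDES = (70.0, 30.0)  # % 12C carried by the fast and slow terms
TRUE_RATES = (0.5, 0.1)         # per minute

# Step 1: treat the late phase as if it were a single exponential.
late_times = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
late_values = [polyexp(t, TRUE_AMPLITUDES, TRUE_RATES) for t in late_times]
slow_amp_est, slow_rate_est = log_linear_fit(late_times, late_values)

# Step 2: subtract the fitted slow term from early points, fit what remains.
early_times = [0.0, 0.5, 1.0, 1.5, 2.0]
early_residuals = [
    polyexp(t, TRUE_AMPLITUDES, TRUE_RATES)
    - slow_amp_est * math.exp(-slow_rate_est * t)
    for t in early_times
]
fast_amp_est, fast_rate_est = log_linear_fit(early_times, early_residuals)

print("slow term: A = %.2f, k = %.4f" % (slow_amp_est, slow_rate_est))
print("fast term: A = %.2f, k = %.4f" % (fast_amp_est, fast_rate_est))
```

Because the fast term has not fully decayed in the window used for the late-phase line, the stripped slow rate comes out somewhat above the true value of 0.1 and the error propagates into the fast term, even without measurement noise. A nonlinear regression on the untransformed data fits all terms simultaneously and avoids this sequential bias.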
But, upon implementing such an approach – and in the course of doing so, dealing with issues surrounding the heteroskedasticity of the dataset – it became apparent that one could reasonably fit several different models (ones assuming the presence of two, three, or four compartments, and with or without metabolically inactive pools). It was at this point that I stopped viewing this all as merely a small side-project and realized that there was a substantial intellectual contribution that I could make to the work. Dr. Shachar-Hill and others had been bothered by a previous 13C-MFA study of Arabidopsis thaliana’s central carbon metabolism due to its inclusion of metabolically inactive pools that both allowed the mathematical solver too much flexibility in fitting the data and did not cohere with our understanding of leaf metabolism. Moreover, there had always been a substantial degree of fuzziness in the selection and justification of a specific 13C-MFA network architecture, given that there are always alternative models one could posit and reasonably fit to a dataset (as discussed in Chapter 1). The polyexponential fitting using nonlinear regression and application of model selection techniques to pick a best-supported model, which could then be mapped back to a broader compartmental model of metabolism that may or may not include metabolically inactive pools, seemed like a clever way of tackling these concerns. As I carried out the study, it became apparent that Dr. Xu’s data actually fit best to a polyexponential model with three terms, corresponding to a three-pool model, as opposed to the two-pool model we had initially expected. Further data gathering showed convincing evidence of a third pool, the vacuolar sugars, whose inclusion in the 13C-MFA network used in this study substantially improved model fits. I corroborated this finding by fitting Nicotiana tabacum CBC intermediate data gathered by Dr. Xinyu Fu in Dr.
Berkley Walker’s lab here at MSU, which they kindly provided to us prior to its inclusion in one of their publications, which I cite later on in this chapter. These data were not published at the time that we were submitting our paper to the Proceedings of the National Academy of Sciences, and had to be excluded from the final publication as a result. However, now that those data are published, I have restored the N. tabacum data analysis to the study featured in this chapter. This study was published in the Proceedings of the National Academy of Sciences. Dr. Yuan Xu, Dr. Thomas Wieloch, and I share co-first authorship on the study. I carried out all of the regression analyses and contributed significantly to the theory and formulation of questions around the regression portion of the study. I also contributed to the writing of the methods, results, and discussion pertinent to my section of the study, as well as editing, proofreading, and formatting of the rest of the main text and the supplement. Finally, I assisted Dr. Xu in the computational implementation of uncertainty estimation in the study.

2.2. Abstract

When isotopes of carbon are fed to photosynthesizing leaves, metabolites of the Calvin–Benson cycle (CBC) are rapidly labeled initially, but then the rate of labeling slows considerably, raising questions about the integration of the CBC within leaf metabolism. We have used 2-h time courses of labeling of Camelina sativa leaf metabolites to test models of 12C washout when the CO2 source is rapidly switched to 13CO2. Fitting exponential functions to the time course of CBC metabolites, we found evidence for three temporally distinct processes contributing to the labeling but none for metabolically inactive pools. We next modeled the data of all metabolites by 13C isotopically nonstationary metabolic flux analysis, testing a variety of flux networks. In the model that best explains measured data, three processes determine CBC metabolite labeling.
First is fixation of incoming 13CO2; second is dilution by weakly labeled carbon in cytosolic glucose reentering the CBC following oxidative pentose phosphate pathway reactions, which forms a shunt bypassing much of the CBC. Third, very weakly labeled carbon from the vacuole further dilutes the labeling. This model predicts the shunt proceeds at about 5% of the rate of net CO2 fixation and explains the three phases of labeling. In showing the interconnection of three compartments, we have drawn a more complete picture of how carbon moves through photosynthetic metabolism in a way that integrates the CBC, cytosolic sugar pools, glucose-6-phosphate shunt, and vacuolar sugars into a single system.

2.3. Significance statement

Photosynthesis metabolites are quickly labeled when 13CO2 is fed to leaves, but the time course of labeling reveals additional contributing processes involved in the metabolic dynamics of photosynthesis. The existence of three such processes is demonstrated, and a metabolic flux model is developed to explore and characterize them. The model is consistent with a slow return of carbon from cytosolic and vacuolar sugars into the Calvin–Benson cycle through the oxidative pentose phosphate pathway. Our results provide insight into how carbon assimilation is integrated into the metabolic network of photosynthetic cells with implications for global carbon fluxes.

2.4. Introduction

The Calvin–Benson cycle (CBC) of photosynthesis is the source of nearly all carbon in the biosphere. CO2 is used in a carboxylation reaction catalyzed by rubisco, and the resulting carboxylic acid, 3-phosphoglycerate (PGA), is reduced to a sugar using NADPH and helped by adenosine 5′-triphosphate (ATP) made by light-driven photosynthetic electron transport. The reactions involve both gluconeogenesis and the nonoxidative reactions of the pentose phosphate pathway (PPP) (Sharkey, 2019). Since the first description of the CBC by Bassham et al.
(Bassham et al., 1954), the core reactions have been confirmed many times. However, this metabolism is embedded in the metabolic network of photosynthesizing cells. Carbon leaves the cycle primarily by export of triose phosphate from the chloroplast for sucrose synthesis in the cytosol (Fliege et al., 1978; Flugge and Heldt, 1991), and conversion of fructose 6-phosphate (F6P) to glucose 6-phosphate (G6P) for synthesis of starch inside the chloroplast (Dietz, 1985; Sharkey et al., 1985; Dietz, 1987; Preiser et al., 2020). Many other exports from the cycle occur, especially erythrose 4-phosphate (E4P) for the phenylpropanoid pathway, pyruvate for fatty acid synthesis, and pyruvate and glyceraldehyde 3-phosphate (GAP) for the methyl erythritol 4-phosphate pathway that leads to isoprenoid synthesis (Sharkey et al., 2020). Another set of reactions comprises the photorespiration pathway. When rubisco fixes oxygen instead of CO2, a series of reactions involving three organelles and amino acid metabolism is initiated that results in 3/4 of the carbon first lost to 2-phosphoglycolate being returned to the CBC. In addition to photorespiratory production of CO2, CO2 is released by a process originally called dark respiration in the light (Farquhar et al., 1980) but now called day respiration (Tcherkez et al., 2017), or light respiration (RL) (Xu et al., 2021a). A static analysis of label in metabolites following 13CO2 feeding (Sharkey et al., 2020) pointed to the oxidative PPP (OPPP) as the source of the bulk of RL, for which our recent metabolic flux analysis (MFA) work provides detailed support (Xu et al., 2021a). However, there remain several puzzling observations on CBC kinetics that date back to early quantitative tracer studies (Mahon et al., 1974; McVetty and Canvin, 1981) and are reinforced by recent 13CO2-based MFA studies.
• CBC intermediates label very quickly up to 80 to 90% of 13C, but the last 10 to 20% of labeling is much slower (Hasunuma et al., 2010; Szecowka et al., 2013; Ma et al., 2014; Arrivault et al., 2017; Arrivault et al., 2019).
• The proportion of fully unlabeled molecules remains anomalously high well after most molecules are highly labeled [see Szecowka et al. (Szecowka et al., 2013) and Appendix A, Table S2.4, where M0 is greater than M1].
• To achieve acceptable fits, previous MFA studies assumed large metabolically inactive pools of central metabolites including metabolites of the CBC (Szecowka et al., 2013; Ma et al., 2014; Arrivault et al., 2017; Arrivault et al., 2019). However, there is little biochemical evidence for their existence.
• Previous studies fixed numerous fluxes, including starch and sucrose biosynthesis, according to independently measured experimental values (Ma et al., 2014; Xu et al., 2021a). Recently, it was recommended to minimize fixed fluxes and imposed constraints in MFA analyses and compare independent experimental values with model outputs rather than using them as model inputs (Wieloch, 2021).
• Estimates of the relative rate of photorespiration, that is, the ratio of velocities of oxygenation/carboxylation (vo/vc), in MFA, are low (Xu et al., 2021a) or light dependent (Ma et al., 2014).

These anomalies indicate that we do not fully understand how the CBC is integrated into the metabolic network of photosynthetic cells. To explore them, we have extended a previously published dataset of leaf isotope labeling (Xu et al., 2021a) to 2 h, added data for neutral sugars, and examined the processes underlying labeling behavior. We applied several statistical tests of the interpretation of three linked processes. We also have made modifications to the isotopically nonstationary (INST)-MFA of photosynthetic metabolism (Ma et al., 2014; Young, 2014; Allen and Young, 2020).
We find that three kinetic processes of labeling in CBC metabolites can be defined, and we propose pathways for each. The proposed network of carbon flow eliminates the need to hypothesize metabolically inactive pools and explains both the observed labeling of neutral sugars due to slow dynamic turnover of these products and the high ratio of unlabeled molecules (M0 isotopologue) to singly labeled ones (M1 isotopologues).

2.5. Results

2.5.1. The CBC shows three kinetic components

Following a switch from 12CO2 to 13CO2, a semilog plot of 12C levels for the CBC intermediates RUBP, PGA, E4P, S7P, GAP, dihydroxyacetone phosphate (DHAP), and FBP (Appendix A, Dataset S1) shows three straight lines (Fig. 2.1A). This practice of fitting straight lines on a semilog plot and/or curve stripping is borrowed from pharmacokinetics and serves as an approximation of a polyexponential model with N terms, where N is the number of decay processes acting on distinct time scales (Gibaldi and Perrier, 1982; Dunne, 1985). Interestingly, if a metabolic network is represented as a kinetic model with first-order or pseudo-first-order kinetics and M compartments or pools, the analytical solutions for the isotopic labeling in the different compartments correspond to polyexponentials containing M terms (Appendix A, Supplementary Text T1 and Fig. S2.1).

Figure 2.1: Modeling of exponential decay of 12C in photosynthesis metabolites. (A) A semilog plot showing the log transformed %12C remaining in a time course dataset of aggregated CBC intermediates (DHAP, E4P, FBP, GAP, PGA, RUBP, and S7P) (n = 254). Error bars represent mean ±2 SE in A and B. Measured time points of labeling levels fitted by alternative models in the early, middle, and late periods of the labeling time course show evidence for three distinct processes. (B) The exponential, biexponential, and triexponential model fits to the %12C remaining time course for CBC intermediates in the linear domain.
The orange shaded area represents the 95% CI of the regression line obtained via bootstrap resampling (resampling n = 1,000). (C) A table summarizing the nested models we fitted to our data using nonlinear regression and model selection results. K: number of model parameters; SoS: extra sum of squares; CV: cross-validation. Green cells indicate that the model selection criterion results for a given model support it as statistically superior to the previous model, orange cells indicate that they do not support it as superior to the previous model, and gray cells indicate that the criterion cannot be evaluated. Details about the calculation and interpretation of these model selection criteria can be found in Appendix A, Supplemental Methods T2. These results uniformly point to the triexponential model without a constant reflecting an inactive pool as the best supported description of our aggregated CBC labeling dataset.

This indicates that we can fit our metabolite labeling data directly to polyexponential models and, by using model selection techniques to find the model that best describes our data, relate this to an underlying network architecture.

Nonlinear regression and model selection strongly support the existence of three distinct processes controlling the labeling of CBC intermediates but do not support inactive pools. Fig. 2.1B (see also Appendix A, Fig. S2.2) shows the results of fitting the measured 12C levels in the aforementioned aggregated CBC intermediates of Camelina sativa to models with one, two, or three exponential components (corresponding to one to three processes controlling labeling kinetics). We evaluated which model best describes the data, using four statistical model selection criteria. Each represents a different measure of overfitting and approach to model selection. All four statistical criteria support the existence of three exponential components in the CBC labeling time course (Fig.
2.1C), corresponding to an overall metabolic network involving fluxes among three compartments/pools. The model did not show statistically significant improvement in the fit by including constant terms, which would correspond to metabolically inactive pools. Labeling of the aggregate and individual CBC intermediates, as well as ADP glucose (ADPG), shares similar kinetic parameter values (Appendix A, Dataset S2), consistent with their high rates of interconversion and turnover during photosynthesis, resulting in rapid “mixing” of carbon between them. Our data are best described by a triexponential model without constants (Fig. 2.1C; model 5 approximates the data significantly better than model 4; model 6 provides no statistically significant improvement). This corresponds to a network in which three interlinked pools contribute to 13C labeling and argues against inactive metabolite pools. We do note that the model selection results for model fits to individual metabolite datasets (Figures S2.3 – 2.11) do not uniformly support a triexponential model. However, we believe that the aggregated dataset, with its substantially larger sample size, is a stronger indicator of the overall behavior and structure of the system, which is what we are interested in. Is this network architecture unique to the CBC in Camelina sativa or does it generalize to other plant species? We performed the same exercise of fitting the 13C labeling of aggregated and individual CBC intermediates using a labeling dataset gathered from Nicotiana tabacum (Fu et al., 2023). We find that the model selection results for the aggregated CBC intermediates as well as some of the individual metabolites (RUBP, PGA, and GAP/DHAP) also support the triexponential model (Figure 2.2). This suggests that the gross architecture of three interlinked pools contributing to 13C labeling is a general feature of CBC activity in higher plants, rather than a quirk of C. sativa’s metabolism.
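As a rough illustration of how nested polyexponential fits can be compared, the sketch below computes an extra-sum-of-squares F statistic together with two widely used information criteria (AIC and BIC, in their least-squares forms). Only the sample size of 254 is taken from the text; the residual sums of squares and parameter counts are hypothetical placeholders, the function names are invented here, and AIC and BIC are stand-ins that may differ from the exact criteria and formulas used in the study (those are described in Appendix A).

```python
import math

def aic_least_squares(ssr, n_points, n_params):
    """AIC for a least-squares fit with Gaussian errors, up to a constant."""
    return n_points * math.log(ssr / n_points) + 2 * n_params

def bic_least_squares(ssr, n_points, n_params):
    """BIC for a least-squares fit with Gaussian errors, up to a constant."""
    return n_points * math.log(ssr / n_points) + n_params * math.log(n_points)

def extra_sum_of_squares_f(ssr_simple, k_simple, ssr_complex, k_complex, n_points):
    """F statistic for nested models.

    Compare against an F distribution with (k_complex - k_simple) and
    (n_points - k_complex) degrees of freedom to obtain a p-value.
    """
    numerator = (ssr_simple - ssr_complex) / (k_complex - k_simple)
    denominator = ssr_complex / (n_points - k_complex)
    return numerator / denominator

N_POINTS = 254  # size of the aggregated CBC dataset reported in Figure 2.1

# Hypothetical (parameter count, residual sum of squares) pairs.
candidates = {
    "biexponential": (4, 5200.0),
    "triexponential": (6, 4000.0),
    "triexponential + constant": (7, 3995.0),
}

best_by_aic = min(
    candidates,
    key=lambda m: aic_least_squares(candidates[m][1], N_POINTS, candidates[m][0]),
)
best_by_bic = min(
    candidates,
    key=lambda m: bic_least_squares(candidates[m][1], N_POINTS, candidates[m][0]),
)
f_tri_vs_bi = extra_sum_of_squares_f(5200.0, 4, 4000.0, 6, N_POINTS)
f_const_vs_tri = extra_sum_of_squares_f(4000.0, 6, 3995.0, 7, N_POINTS)

print(best_by_aic, best_by_bic, f_tri_vs_bi, f_const_vs_tri)
```

With these placeholder numbers, the third exponential term lowers the residual sum of squares by far more than its two extra parameters cost, whereas the constant term buys almost nothing. That is the qualitative pattern the text describes for the triexponential model with and without an inactive-pool constant.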
To elucidate these pools and their interconnectivity, we now model carbon metabolism by 13C isotopically nonstationary MFA.

Figure 2.2: Nonlinear regression fits for single, biexponential, and triexponential models fitted to an aggregated CBC intermediate dataset from Nicotiana tabacum along with a summary of model selection results for the aggregated and individual metabolite datasets. (a) Nonlinear regression fits for polyexponential models with the aggregated CBC intermediate dataset from N. tabacum. The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling. (b) Model selection results for the aggregated CBC dataset. Green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row. (c) Model selection results in the same format as that in panel (b) for individual CBC intermediates as well as ADP-glucose.

2.5.2. Network model of three pools of metabolites connected to photosynthesis.

Since we found evidence for three phases of exponential decay and against the contribution of inactive metabolite pools, we looked for processes that might account for the three phases. We began with the hypothesis that unlabeled carbon enters photosynthetic metabolism (Sharkey et al., 2020).
We tested four alternatives: 1) entry of 12C glucose into the cytosolic hexose-phosphate pool, which can reach the chloroplast via the cytosolic OPPP shunt and pentose phosphate transmembrane transport on either the xylulose phosphate/phosphate transporter or the triose phosphate/phosphate transporter (Hilgers et al., 2018a); 2) entry of 12C glucose into the chloroplastic hexose phosphate pool to look at the possible contribution of starch turnover; 3) injection of 12CO2 into the internal CO2 pool to simulate an unknown source of older C being broken down; and 4) addition of 12C triose phosphate to the plastid triose phosphate pool to simulate entry via the triose phosphate transporter from an unknown source in the cytoplasm (Table 2.1 and Appendix A, Table S2.3). To do so, we increased the time span and range of metabolites over which labeling was measured and updated our previously developed metabolic model to include reversibility of several reactions for which there is biochemical evidence (Appendix A, Dataset S3). We assessed these alternatives by comparing the sum of squared residuals (SSR), a measure of the goodness of fit between modeled and measured data (Young, 2014). However, SSR is affected by how many data points are used and by other factors. For this reason, we do not compare SSRs found in this study with those from our previous studies but only look for large reductions in SSRs when datasets and degrees of freedom are similar.

Table 2.1: Comparison of goodness of fit between data and best-fit simulations from alternative models.

The data were consistent with 12C entry from intact unlabeled glucose via the OPPP shunt at a rate of 1.9 μmol⋅g−1 FW⋅h−1, with the best SSR improvement from 1,340 to 1,126. The second possible 12C entry flux is 0.3 μmol⋅g−1 FW⋅h−1 from the triose phosphate transporter from an unknown source in the cytosol, with a decrease in SSR from 1,340 to 1,273.
The third possible 12C entry is from starch turnover flux of 0.5 μmol⋅g−1 FW⋅h−1, with a decrease in SSR from 1,340 to 1,300. We also tested other models with variations in starch metabolism to test 1) whether addition of reactions representing starch turnover to the metabolic model meaningfully improves the agreement between the measured and simulated labeling and other flux data and 2) whether the fitting of such models indicates biologically significant fluxes through starch turnover. We tested six such models with different representations of how starch turnover might act to influence the carbon fluxes and expected labeling patterns. Other models were tested in which either the whole starch pool or an intermediate pool (such as might represent either oligoglucans or short- versus long-term starch pools) can turn over while maintaining the measured net starch accumulation rates. No unknown source of older C being broken down is indicated, with 12CO2 entry flux of 0 with no change of SSR (Table 2.1). This result is consistent with the M0 abundance results (see below), as the assimilation of 12CO2 would not selectively increase the proportion of unlabeled molecules, because it does not inject intact carbon skeletons. The starch model with the largest improvement in the fit, as defined by the SSR, was no more than a 1% improvement, with a best fit value for a starch turnover flux of no more than 11% of the G6P dehydrogenase activity (Appendix A, Table S2.3).

2.5.3. Examination of labeling in key CBC intermediates supports the hypothesis that intact unlabeled molecules enter the CBC.

In our study’s later time points, anomalously high values for fully unlabeled isotopologues (M0) were found well after the singly labeled (M1) isotopologues had decayed to very low levels (Fig. 2.3). Since the percentage of M2 was always bigger than M1, the percentage of M1 should also be bigger than M0. However, we found the reverse pattern.
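One simple way to see why M2 exceeding M1 implies M1 exceeding M0 is to assume, purely for illustration, that each of the n carbons of a metabolite is labeled independently with the same probability p. This binomial assumption is introduced here and is not necessarily the calculation behind the predicted values in Appendix A, Table S2.4. Under it:

```latex
\begin{align*}
M_k &= \binom{n}{k}\, p^{k} (1-p)^{n-k}, \\
\frac{M_1}{M_0} &= \frac{n\,p}{1-p}, \qquad
\frac{M_2}{M_1} = \frac{(n-1)\,p}{2\,(1-p)}, \\
M_2 > M_1 \;&\Longrightarrow\; \frac{p}{1-p} > \frac{2}{n-1}
\;\Longrightarrow\; \frac{M_1}{M_0} > \frac{2n}{n-1} > 2 .
\end{align*}
```

As a worked instance under the same assumption, a three-carbon metabolite at p = 0.95 would have M1/M0 = 3(0.95)/0.05 = 57, so a uniformly mixed pool at high enrichment should contain far more M1 than M0.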
The ratio of the measured percentages of M1 to M0 ranged from 0.1 to 0.4, much smaller than the predicted ratio range of 48 to 175 (Appendix A, Table S2.4). If inactive metabolite pools cause the lack of complete labeling, then, at later time points, for example, in G6P, only M0 and M6 should be observed. However, the amount of M0 could account for only one-third of the 12C in G6P at 2 h (Appendix A, Table S2.5). We suggest that the high amount of M0 comes from a large metabolic pool, such as fully unlabeled glucose that enters the CBC at a low rate.

Figure 2.3: Mass isotopologue distributions of CBC intermediates showing the overabundance of M0 isotopologue at the latest time points. Percentages of relative abundance of each isotopologue for key CBC intermediates at 1 h are shown, with different colors corresponding to different isotopologues (figure legend). The size of each pie chart corresponds to the pool size of that metabolite. An expanded bar next to each pie chart shows proportions of M0, M1, and M2 isotopologues, highlighting the overabundance of the M0 relative to the M1 isotopologue. Abbreviations (see also Table S2.4): GAP, glyceraldehyde 3-phosphate; DHAP, dihydroxyacetone phosphate; PGA, 3-phosphoglyceric acid; R5P, ribose 5-phosphate; RU5P, ribulose 5-phosphate; XU5P, xylulose 5-phosphate; RUBP, ribulose 1,5-bisphosphate; F6P, fructose 6-phosphate; G6P, glucose 6-phosphate; S7P, sedoheptulose 7-phosphate.

We also observed slow turnover of neutral sugars, which suggests that a dilution flux of largely or wholly unlabeled hexose enters the CBC over an extended period during labeling experiments (Appendix A, Fig. S2.12; labeling kinetics for other metabolites are shown in Appendix A, Figs. S2.13 and S2.14). At 60 min, the glucosyl and fructosyl moieties of sucrose contained 49% and 46% 13C, respectively (Appendix A, Fig. S2.12).
Sucrose recycling through invertase and fructokinase yields F6P that would distribute between sucrose resynthesis and G6P, but this alone is insufficient to account for a prolonged dilution flux. By contrast, at 60 min, glucose and fructose were only 12% and 20% labeled with 13C, respectively (Appendix A, Fig. S2.12), consistent with previous evidence that vacuolar sucrose turns over due to invertase activity (Uys et al., 2007; Nägele et al., 2010; Patrick et al., 2013). If a modest proportion of cytosolic G6P originates from the action of hexokinase on glucose leaving the vacuole, then there would be an additional source of unlabeled carbon in the cytosolic G6P pool. Sucrose recycling and turnover of vacuolar sugars could therefore slow the 12C decay in CBC intermediates and correspond to the additional carbon pool attested to by the polyexponential modeling.

2.5.4. An integrated flux model with three compartments.

In light of the above results, we included sucrose recycling and sugar vacuole pool transport reactions in the model with known biochemical reactions that can mediate such slow turnover processes of sucrose/glucose/fructose. Inclusion of sucrose recycling and sugar vacuole pool reactions markedly reduced overall SSR from 1,340 to 968 and reduced individual SSRs for labeling in the least well-fitted metabolites F6P, ribose 5-phosphate (R5P), G6P, ADPG, and UDPglucose (UDPG) (Table 2.1 and Appendix A, Table S2.3). Fig. 2.4 shows the flux map for photosynthetic carbon metabolism for the model, with sucrose recycling reactions and sugar vacuole pool reactions in orange. The nonphotorespiratory CO2 release during photosynthesis from the cytosolic G6P shunt was estimated at 5% of net CO2 fixation compared to a photorespiratory CO2 release of 18% of net CO2 fixation (Appendix A, Table S2.6).
While intermediates of the CBC, photorespiration, and starch and sucrose biosynthesis pathways showed substantial 13C labeling, the tricarboxylic acid (TCA) cycle intermediates, and most amino acids derived from them, showed very little labeling after 120 min. The flux map is consistent with previous reports of low TCA fluxes and operation of the OPPP shunt as a source of RL (Xu et al., 2021a). The 95% CIs of the flux values were estimated by both parameter continuation and Monte Carlo methods. These CI estimates showed that the net fluxes whose magnitude approaches or exceeds 1% of the rate of photosynthesis are well defined, with ranges less than ±5% of their values (Appendix A, Dataset S2.4). Exchange fluxes are less well defined, especially for reactions with modest net fluxes.

Figure 2.4: Central carbon metabolic fluxes in photosynthetic C. sativa leaves. Fluxes are shown as numbers and depicted by the variable width of arrows. Orange arrows highlight the carbon flow from neutral sugars through the G6P shunt, entering the CBC. Fluxes were estimated by 13C INST-MFA using the INCA software suite constrained by the metabolic network model and experimental inputs including mass isotopologue distributions of measured metabolites, net CO2 assimilation, sucrose and amino acid export rate, and measured vo/vc ratio. Flux units are expressed as micromoles metabolite per gram FW per hour. The model network is compartmentalized into cytosol (“.c”), chloroplast (“.p”), mitochondrion (“.m”), and vacuole (“.v”).
Abbreviations: ACA, acetyl-CoA; AKG, α-ketoglutarate; ALA, alanine; ASN, asparagine; ASP, aspartate; CIT, citrate; DHAP, dihydroxyacetone phosphate; EC2, transketolase-bound 2-carbon fragment; FBP, fructose-1,6-bisphosphate; FUM, fumarate; GA, glycerate; GLN, glutamine; GLY, glycine; ICI, isocitrate; MAL, malate; OAA, oxaloacetate; PEP, phosphoenolpyruvate; PYR, pyruvate; RU5P, ribulose-5-phosphate; RUBP, ribulose-1,5-bisphosphate; S7P, sedoheptulose-7-phosphate; SBP, sedoheptulose-1,7-bisphosphate; SER, serine; SUC, succinate; THR, threonine.

2.5.5. Model prediction of photorespiration.

The estimation of the photorespiration rate in leaves by 13C MFA is complicated by the presence of multiple subcellular pools of serine and glycine and the multiple reactions and interconversions that they can undergo in different compartments (Hanson et al., 2000), and by the challenges in obtaining reliable measurements of levels, compartmentation, and labeling of other photorespiratory metabolites (Ma et al., 2017). Here we measured labeling in 2-phosphoglycolate (2PG) but were not able to reliably measure labeling in glycolate, glyoxylate, hydroxypyruvate, or glycerate. In the absence of such additional measurements, the reliability of photorespiratory flux estimates is low, with a substantial range of possible rates, which increases if realistic compartmentation of glycine and serine is included. We therefore estimated vo/vc using gas exchange measurements (Appendix A, Supplementary Information Text T2.3). The value obtained (0.31) was used as input to the MFA model instead of relying on fitting the labeling measured in glycine and serine (Appendix A, Table S2.7). Using measurements of serine, glycine, 2PG, and glycerate without compartmentation, Ma et al. (2014) obtained a vo/vc ratio of 0.28 to 0.43 in Arabidopsis under low and high light levels, which is consistent with the value estimated here.

2.5.6. No metabolically inactive CBC metabolites.
Previous studies included inactive metabolite pools to account for the persistence of unlabeled carbon in CBC intermediates (Szecowka et al., 2013; Ma et al., 2014; Arrivault et al., 2019; Xu et al., 2021a). Whole shoots may include enough photosynthetically inactive tissues to account for significant inactive pools, while single mature leaves used here should have very little photosynthetically inactive tissue. We therefore eliminated model terms accounting for inactive metabolite pools included in previous studies (Szecowka et al., 2013; Ma et al., 2014; Xu et al., 2021a) for all metabolites except glycine, serine, and alanine, for which significant vacuolar pools with long turnover times are plausible (Fürtauer et al., 2019). The model without inactive pools failed to adequately explain the labeling dataset, with particularly poor agreement for F6P, R5P, G6P, ADPG, and UDPG (Appendix A, Table S2.3). To test the model shown in Fig. 2.4, we added the inactive pools removed earlier back into the model to see whether introducing our mechanistic explanations for the labeling dynamics of the metabolites in this network changed the inactive pool size estimates. We found that this model substantially lowered the estimated inactive pool sizes in the best-fit simulations (Appendix A, Fig. S2.15) compared to previous studies (Szecowka et al., 2013; Ma et al., 2014; Xu et al., 2021a). Among them, the inactive pools for RUBP, PGA, hexose 6-phosphates, RU5P, 2PG, ADPG, and UDPG were decreased to almost zero, indicating that the turnover of sugars better explains the proportion of unlabeled molecules in these metabolites than the idea of inactive pools.

2.6. Discussion

A key finding from this study is that the kinetics of the CBC is best described as a function of three interconnected processes, as indicated both by our modeling analysis of the time course of 12C decay during 13CO2 labeling experiments (Fig.
2.1) and by our MFA modeling results (Appendix A, Fig. S2.15). Our model included three inputs of carbon into the CBC: 1) 172 μmol⋅g−1 FW⋅h−1 by carboxylation by rubisco, 2) 75 (25 × 3 carbons per glycerate) μmol⋅g−1 FW⋅h−1 returned from photorespiration, and 3) 35 (7 × 5) μmol⋅g−1 FW⋅h−1 returned from the G6P shunt. The carbon paths in photorespiration and the G6P shunt require an extra 110 (75 + 35) carbon atoms to be processed for 172 carboxylations, adding more than 50% (110/172 ≈ 64%) to the required flux through reactions in the CBC. In previous work, we allowed only a stromal shunt (Xu et al., 2021a). When we allowed both a stromal and a cytosolic shunt with our expanded dataset, all shunt carbon flow was assigned to the cytosolic shunt, and other work based on label in 6-phosphogluconate indicated that, in unstressed plants, only the cytosolic shunt operates (Sharkey et al., 2020). Therefore, we left the stromal shunt out of the final model. The model includes release of CO2 in photorespiration at a rate of 25 μmol⋅g−1 FW⋅h−1 and, from the G6P shunt, at a rate of 7 μmol⋅g−1 FW⋅h−1. The rate of glucose entry into the shunt was estimated to be about 5% of the rate of net CO2 fixation. The cost of the shunt is three ATP per glucose. Therefore, this shunt would increase the energy requirement for CO2 fixation from three ATP and two NADPH to ∼3.15 ATP and two NADPH (photorespiration also affects the energy cost of CO2 fixation) (Sharkey and Weise, 2016; Sharkey et al., 2020). The cost of the G6P shunt may be offset by benefits of refilling the CBC when intermediates fall during transients in light or other factors (Sharkey and Weise, 2016). This has also been proposed by Makowka et al. (2020) for glycolytic pathways in cyanobacteria.

2.6.1. MFA model fits.

The use of multiple statistical tests specifically designed for model selection and the comparison of nested model series shows the potential for improvement of statistical rigor in this important aspect of MFA modeling.
Although MFA software packages like INCA (Young, 2014) can report 95% CIs for SSRs, allowing researchers to flag overfit or underfit models, these expected ranges are not appropriate for comparing alternative model architectures. This study demonstrated that, by directly modeling 13C labeling time course data, we can test models of the general structure of the underlying network and corroborate or contradict assumed or proposed MFA models. This attests to the possible utility of these kinds of statistical tools in constraint-based modeling, and we believe advancement in this area could encourage use of MFA models to gain insight into photosynthetic metabolism.

2.6.2. Reaction network improvement.

This new model improves on previous efforts on several fronts. Comparisons of the model in this work with previous models (Ma et al., 2014; Xu et al., 2021a) are shown in Appendix A, Dataset S3.3. The reversibility of reactions in the CBC has been corrected. Reactions present in the previous models, representing inactive pools for all the CBC intermediates, ADPG, UDPG, 2PG, phosphoenolpyruvate, and glycerate have been removed. The inactive pools for alanine, glycine, and serine have been retained because of their compartmentation complexity. Reactions newly added in this study, including cytosolic OPPP shunt, sucrose recycling reactions, and sugar vacuole pool reactions, explain the longstanding puzzle of the slow labeling phase of CBC intermediates and the overabundance of fully unlabeled isotopologues. These improvements to the metabolic network have largely eliminated the need for hypothesizing inactive pools. In showing the interconnection of these three compartments, we have drawn a picture of how carbon moves through photosynthetic carbon assimilation in a way that integrates the CBC, cytosolic sugar pools, the glucose-6-phosphate shunt, and vacuolar sugars into a single system. The data are consistent with a cytosolic G6P shunt.
A stromal shunt would be undetectable, since the carbon source for a stromal shunt would have the same labeling kinetics as the rest of the CBC, as indicated by the similarity of labeling of ADPG and CBC intermediates. Measurements of the label in 6-phosphogluconate indicated that, in unstressed conditions, only the cytosolic shunt was active, while, in high temperature stress, a stromal shunt also occurs (Sharkey et al., 2020). When models that included both shunts were tested, no flux was assigned to the stromal shunt. The modified model used here predicts that the cytosolic shunt would proceed at a rate that is consistent with measurements of RL made using 12CO2 emission into a 13CO2-containing atmosphere (Loreto et al., 2001).

2.6.3. Sources of unlabeled carbon.

Our conclusion is that the source of unlabeled carbon that reenters the CBC is sucrose, glucose, and fructose in the cytosol and vacuole. It has been shown that SUC4-type sucrose transporters can allow sucrose release from vacuoles (Payyavula et al., 2011; Schneider et al., 2012; Anaokar et al., 2021), and SWEET17 can mediate fructose transport across the tonoplast in leaves, although its primary activity may be in roots (Guo et al., 2014). Our model allows chloroplasts to take up pentose phosphates. A xylulose 5-phosphate transporter has been described (Eicks et al., 2002), but we found that plants lacking this gene have no growth or photosynthetic phenotype. The xylulose 5-phosphate transporter will also transport triose phosphates, and it is very possible that the triose phosphate/phosphate transporter is also bifunctional. Plants lacking both the xylulose phosphate-phosphate transporter (XPT) and triose phosphate transporter (TPT) accumulate pentose phosphates and show a much stronger reduction in growth than plants lacking the TPT alone (Hilgers et al., 2018a). In the past, starch recycling was proposed as a possible source (Sharkey, 2019).
We have abandoned that idea, because a source in starch recycling would require that 36% of the carbon going to starch comes back into metabolism, but without any label. This is unrealistic. The results of various models described above (Appendix A, Table S2.3) provided clear-cut evidence against a biologically significant contribution of starch turnover to labeling patterns or carbon balances in central metabolism.

2.6.4. Previously puzzling observations explained.

With the insight gained here, we address the metabolism issues raised in the Introduction.

• The CBC intermediates label very quickly up to 80 to 90% of 13C, but the last 10 to 20% of labeling is much slower (Hasunuma et al., 2010; Szecowka et al., 2013; Ma et al., 2014; Arrivault et al., 2017; Arrivault et al., 2019). The CBC in leaves shows three phases, indicating three components. The slower two components account for the apparent slow-to-label pool. This is well-explained by carbon in unlabeled pools of glucose, fructose, and sucrose reentering the CBC by way of the glucose-6-phosphate shunt in the cytosol. No evidence was found for separate active and inactive pools. Hendry et al. (2017) proposed that glycogen could supply unlabeled carbon back to the CBC intermediates in Synechococcus to explain a similar lack of complete labeling.
• The proportion of fully unlabeled molecules remains anomalously high well after most molecules are highly labeled [see Szecowka et al. (2013) and Appendix A, Table S2.4, where M0 is greater than M1]. The abundance of M0 over M1 isotopologues was confirmed here. If metabolically inactive pools explained the lack of complete labeling, then the M0 isotopologues should account for all the unlabeled carbon atoms. However, using G6P as an example, 2.9% of the molecules were fully unlabeled, but this accounts for only about one-third of the missing label (Appendix A, Table S2.5).
Entry of carbon from relatively unlabeled free sugars into active pools accounts for the preponderance of M0 isotopologues.
• To achieve acceptable fits, previous MFA studies assumed large metabolically inactive pools of central metabolites including metabolites of the CBC (Szecowka et al., 2013; Ma et al., 2014; Arrivault et al., 2017; Arrivault et al., 2019). However, there is little biochemical evidence for their existence. The new model of metabolism does not predict inactive pools. For all the CBC intermediates, the data fit well assuming carbon reentry through the shunt, eliminating any need to invoke inactive pools (Appendix A, Fig. S2.15 and Table S2.3). The exception is SBP as reported in Arrivault et al. (2017). High levels of M0 were found. This could result from E4P export on the XPT transporter (Hilgers et al., 2018b) followed by attachment of DHAP catalyzed by aldolase. Since there is no SBPase in the cytosol, this would be a metabolic dead end and result in a significant inactive pool of SBP.
• Previous studies fixed numerous fluxes, including starch and sucrose biosynthesis, according to independently measured experimental values (Ma et al., 2014; Xu et al., 2021a). Recently, it was recommended to minimize fixed fluxes and impose constraints in MFA analyses and compare independent experimental values with model outputs rather than using them as model inputs (Wieloch, 2021). The final model had no fixed fluxes, although the ratio of vo/vc was constrained (but not fixed) based on gas exchange data (Appendix A, Supplementary Information Text T3.3). Fatty acid synthesis and RL were constrained (but not fixed) based on previous measurements (Xu et al., 2021a). The model returned physiologically reasonable values for starch and sucrose synthesis (Sharkey et al., 1985).
• Estimates of the relative rate of photorespiration, that is, the ratio of the rates of oxygenation to carboxylation (vo/vc), in MFA are low (Xu et al., 2021a) or light dependent (Ma et al., 2014). We found that vo/vc is not well estimated by the model, requiring use of other estimates. Use of MFA to estimate photorespiration rates is less reliable than other methods (Sharkey, 1988; Busch, 2013).

Several models of plant behavior, including isotopic disequilibrium methods for measuring RL (Gong et al., 2018) and isoprene studies [reviewed in (Sharkey et al., 2020)], assume that carbon in photosynthesis is fully labeled after 10 min of feeding air with a different carbon isotopic composition and that other processes contribute “old” carbon that does not become labeled. The results presented here will allow more-refined models that include both the lack of complete labeling of CBC intermediates and the occurrence of some label in the sources for these other processes. Our results indicate that isotopic methods for measuring RL underestimate its rate because the source carbon (G6P in the cytosol) has some label at the time RL is assessed. The results presented here provide a framework for more detailed RL measurements. Measuring RL is a very difficult task but very important for understanding global carbon cycles (Tcherkez et al., 2017).

In summary, labeling of CBC intermediates by fixation of incoming 13CO2 is diluted by weakly labeled carbon in glucose reentering the CBC. We predict that reentry of weakly labeled molecules occurs at a rate of 5% of the rate of net CO2 fixation. The model explains three phases of labeling. In showing the interconnection of three compartments, this model provides a more complete picture of how carbon moves through photosynthetic metabolism in a way that integrates the CBC, cytosolic sugar pools, the glucose-6-phosphate shunt, and vacuolar sugars into a single system.

2.7. Methods

2.7.1. Plant growth, gas exchange, and 13CO2 labeling.
Plant growth and gas exchange methods were used as described previously (Xu et al., 2021a). The 13CO2-labeled leaf samples were collected at time points of 0, 0.5, 1, 2, 2.5, 3, 5, 7, 10, 15, 30, 60, 90, and 120 min as described in detail in Appendix A, Supplemental Methods T3.4 and T3.5.

2.7.2. Mass spectrometry.

Mass spectrometry for anion exchange LC-MS/MS and GC-EI-MS was carried out using the methods described in ref. (Xu et al., 2021a) and detailed in Appendix A, Dataset S5. Reverse-phase LC-MS/MS and GC-chemical ionization (CI)-MS had the following changes: Samples for reverse-phase liquid chromatography-tandem mass spectrometry were analyzed by an ACQUITY UPLC pump system (Waters) coupled with Waters XEVO TQ-S ultra-performance liquid chromatography tandem mass spectrometry (Waters) by the method described in ref. (Xu et al., 2021a). Samples for gas chromatography-electron ionization-mass spectrometry were analyzed by an Agilent 7890B GC system (Agilent) coupled to an Agilent 7010B triple quadrupole gas chromatography-electron ionization-mass spectrometer with an autosampler (CTC PAL) (Agilent). An Agilent VF5ms GC column, 30 m × 0.25 mm × 0.25 μm with 10-m guard column was used. One microliter of the derivatized sample was injected with helium carrier gas at a flow rate of 1.2 mL⋅min−1. The oven temperature gradient was: 40 °C (1-min hold), increased at 40 °C/min to 150 °C, then at 10 °C/min to 250 °C, then at 40 °C/min to 320 °C, and finally held at 320 °C for 4.5 min. CI was used, and the mass scan range was 150 amu to 650 amu with step size 0.1 amu. The ionization source temperature was set at 300 °C, and the transfer line temperature was 300 °C.

2.7.3. Nonlinear regression and model selection.

A nonlinear ordinary least-squares algorithm implemented in the Python package SciPy was used to fit models 1 to 7 (Fig. 2.1C) to our dataset (Virtanen et al., 2020).
Briefly, best-fit lines for each model were generated by initializing and estimating model parameters 100 times with randomly selected initial parameters and then selecting the fit with the smallest SSR. CIs for parameters and fitted values were determined using bootstrap resampling (n = 1,000). Extra sum-of-squares, cross-validation, Akaike information criterion (AIC), and Bayesian information criterion (BIC) model selection criteria were evaluated for all models and model comparisons, and the Bonferroni–Holm multiple testing correction was applied for the P values generated by the extra sum-of-squares hypothesis testing (Holm, 1978; Schwarz, 1978; Akaike, 1998; Draper and Smith, 1998; Hastie et al., 2017). Further details can be found in Appendix A, Supplemental Information Text T3.1 and Supplemental Methods T3.2.

2.8. Acknowledgments

This work was supported by the Division of Chemical Sciences, Geosciences and Biosciences, Office of Basic Energy Sciences of the US Department of Energy, Grant DE-FOA-0001650 (Y.X. and Y.S.-H.) and Grant DE-FG02-91ER20021 (T.W. and T.D.S.). T.D.S. receives partial salary support from Michigan State University (MSU) AgBioResearch. This work is supported, in part, by the NSF Research Traineeship Program (Grant DGE-1828149) to J.A.M.K. This publication was also made possible by a predoctoral training award to J.A.M.K. from Grant T32-GM110523 from National Institute of General Medical Sciences (NIGMS) of the NIH. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIGMS or NIH. We thank the staff of the MSU Research Technology Support Facility Mass Spectrometry Core for excellent support of mass spectrometric analysis. We thank Chih-Li Sung for statistical advice.

2.9. Author contributions

Y.X., T.W., Y.S-H., and T.D.S. conceived the project and designed the experiments; T.W. and T.D.S.
conducted exploratory analyses using published labeling data and developed a preliminary INST-MFA model accounting for photosynthesis labeling lags; Y.X. performed the metabolic flux analysis experiments; Y.X. and T.W. modified the kinetic MFA model used here; J.A.M.K. performed the nonlinear modeling and associated statistical tests of 12C labeling data; Y.S.-H. obtained analytical solutions for simple models and provided guidance for the experimental and computational analyses; and all authors contributed to writing the manuscript.

REFERENCES

Akaike H (1998) Information Theory and an Extension of the Maximum Likelihood Principle. In E Parzen, K Tanabe, G Kitagawa, eds, Selected Papers of Hirotugu Akaike. Springer New York, New York, NY, pp 199–213
Allen DK, Young JD (2020) Tracing metabolic flux through time and space with isotope labeling experiments. Curr Opin Biotechnol 64: 92–100
Anaokar S, Liu H, Keereetaweep J, Zhai Z, Shanklin J (2021) Mobilizing Vacuolar Sugar Increases Vegetative Triacylglycerol Accumulation. Frontiers in Plant Science. doi: 10.3389/fpls.2021.708902
Arrivault S, Alexandre Moraes T, Obata T, Medeiros DB, Fernie AR, Boulouis A, Ludwig M, Lunn JE, Borghi GL, Schlereth A, et al (2019) Metabolite profiles reveal interspecific variation in operation of the Calvin-Benson cycle in both C4 and C3 plants. J Exp Bot 70: 1843–1858
Arrivault S, Obata T, Szecówka M, Mengin V, Guenther M, Hoehne M, Fernie AR, Stitt M (2017) Metabolite pools and carbon flow during C4 photosynthesis in maize: 13CO2 labeling kinetics and cell type fractionation. J Exp Bot 68: 283–298
Bassham JA, Benson AA, Kay LD, Harris AZ, Wilson AT, Calvin M (1954) The Path of Carbon in Photosynthesis. XXI. The Cyclic Regeneration of Carbon Dioxide Acceptor. J Am Chem Soc 76: 1760–1770
Busch FA (2013) Current methods for estimating the rate of photorespiration in leaves.
Plant Biol (Stuttg) 15: 648–655
Dietz K-J (1985) A possible rate-limiting function of chloroplast hexosemonophosphate isomerase in starch synthesis of leaves. Biochimica et Biophysica Acta (BBA) - General Subjects 839: 240–248
Dietz K-J (1987) Control Function of Hexosemonophosphate Isomerase and Phosphoglucomutase in Starch Synthesis of Leaves. In J Biggins, ed, Progress in Photosynthesis Research: Volume 3 Proceedings of the VIIth International Congress on Photosynthesis Providence, Rhode Island, USA, August 10–15, 1986. Springer Netherlands, Dordrecht, pp 329–332
Draper NR, Smith H (1998) Extra Sums of Squares and Tests for Several Parameters Being Zero. Applied Regression Analysis. John Wiley & Sons, Ltd, pp 149–177
Dunne A (1985) JANA: A new iterative polyexponential curve stripping program. Computer Methods and Programs in Biomedicine 20: 269–275
Eicks M, Maurino V, Knappe S, Flügge U-I, Fischer K (2002) The plastidic pentose phosphate translocator represents a link between the cytosolic and the plastidic pentose phosphate pathways in plants. Plant Physiol 128: 512–522
Farquhar GD, Caemmerer S, Berry JA (1980) A biochemical model of photosynthetic CO2 assimilation in leaves of C3 species. Planta 149: 78–90
Fliege R, Flügge UI, Werdan K, Heldt HW (1978) Specific transport of inorganic phosphate, 3-phosphoglycerate and triosephosphates across the inner membrane of the envelope in spinach chloroplasts. Biochim Biophys Acta 502: 232–247
Flugge U, Heldt HW (1991) Metabolite Translocators of the Chloroplast Envelope. Annu Rev Plant Physiol Plant Mol Biol 42: 129–144
Fu X, Gregory LM, Weise SE, Walker BJ (2023) Integrated flux and pool size analysis in plant central metabolism reveals unique roles of glycine and serine during photorespiration. Nat Plants 9: 169–178
Fürtauer L, Küstner L, Weckwerth W, Heyer AG, Nägele T (2019) Resolving subcellular plant metabolism. The Plant Journal 100: 438–455
Gibaldi M, Perrier D (1982) Pharmacokinetics. M.
Dekker, New York, NY
Gong XY, Tcherkez G, Wenig J, Schäufele R, Schnyder H (2018) Determination of leaf respiration in the light: comparison between an isotopic disequilibrium method and the Laisk method. New Phytologist 218: 1371–1382
Guo W-J, Nagy R, Chen H-Y, Pfrunder S, Yu Y-C, Santelia D, Frommer WB, Martinoia E (2014) SWEET17, a Facilitative Transporter, Mediates Fructose Transport across the Tonoplast of Arabidopsis Roots and Leaves. Plant Physiology 164: 777–789
Hanson AD, Gage DA, Shachar-Hill Y (2000) Plant one-carbon metabolism and its engineering. Trends Plant Sci 5: 206–213
Hastie T, Tibshirani R, Friedman J (2017) Model Assessment and Selection. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, 2nd ed. Springer, New York, NY, pp 219–260
Hasunuma T, Harada K, Miyazawa S-I, Kondo A, Fukusaki E, Miyake C (2010) Metabolic turnover analysis by a combination of in vivo 13C-labelling from 13CO2 and metabolic profiling with CE-MS/MS reveals rate-limiting steps of the C3 photosynthetic pathway in Nicotiana tabacum leaves. J Exp Bot 61: 1041–1051
Hendry JI, Prasannan C, Ma F, Möllers KB, Jaiswal D, Digmurti M, Allen DK, Frigaard N-U, Dasgupta S, Wangikar PP (2017) Rerouting of carbon flux in a glycogen mutant of cyanobacteria assessed via isotopically non-stationary 13C metabolic flux analysis. Biotechnol Bioeng 114: 2298–2308
Hilgers EJA, Schöttler MA, Mettler-Altmann T, Krueger S, Dörmann P, Eicks M, Flügge U-I, Häusler RE (2018a) The Combined Loss of Triose Phosphate and Xylulose 5-Phosphate/Phosphate Translocators Leads to Severe Growth Retardation and Impaired Photosynthesis in Arabidopsis thaliana tpt/xpt Double Mutants. Front Plant Sci 9: 1331
Hilgers EJA, Staehr P, Flügge U-I, Häusler RE (2018b) The Xylulose 5-Phosphate/Phosphate Translocator Supports Triose Phosphate, but Not Phosphoenolpyruvate Transport Across the Inner Envelope Membrane of Plastids in Arabidopsis thaliana Mutant Plants.
Front Plant Sci 9: 1461
Holm S (1978) A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 6: 65–70
Loreto F, Velikova V, Di Marco G (2001) Respiration in the light measured by 12CO2 emission in 13CO2 atmosphere in maize leaves. Functional Plant Biol 28: 1103–1108
Ma F, Jazmin LJ, Young JD, Allen DK (2014) Isotopically nonstationary 13C flux analysis of changes in Arabidopsis thaliana leaf metabolism due to high light acclimation. Proceedings of the National Academy of Sciences of the United States of America 111: 16967–16972
Ma F, Jazmin LJ, Young JD, Allen DK (2017) Isotopically Nonstationary Metabolic Flux Analysis (INST-MFA) of Photosynthesis and Photorespiration in Plants. In AR Fernie, H Bauwe, APM Weber, eds, Photorespiration: Methods and Protocols. Springer New York, New York, NY, pp 167–194
Mahon JD, Fock H, Canvin DT (1974) Changes in specific radioactivity of sunflower leaf metabolites during photosynthesis in 14CO2 and 12CO2 at three concentrations of CO2. Planta 120: 245–254
Makowka A, Nichelmann L, Schulze D, Spengler K, Wittmann C, Forchhammer K, Gutekunst K (2020) Glycolytic Shunts Replenish the Calvin–Benson–Bassham Cycle as Anaplerotic Reactions in Cyanobacteria. Molecular Plant 13: 471–482
McVetty PBE, Canvin DT (1981) Inhibition of photosynthesis by low oxygen concentrations. Can J Bot 59: 721–725
Nägele T, Henkel S, Hörmiller I, Sauter T, Sawodny O, Ederer M, Heyer AG (2010) Mathematical Modeling of the Central Carbohydrate Metabolism in Arabidopsis Reveals a Substantial Regulatory Influence of Vacuolar Invertase on Whole Plant Carbon Metabolism. Plant Physiology 153: 260–272
Patrick JW, Botha FC, Birch RG (2013) Metabolic engineering of sugars and simple sugar derivatives in plants.
Plant Biotechnol J 11: 142–156
Payyavula RS, Tay KHC, Tsai C-J, Harding SA (2011) The sucrose transporter family in Populus: the importance of a tonoplast PtaSUT4 to biomass and carbon partitioning. The Plant Journal 65: 757–770
Preiser AL, Banerjee A, Weise SE, Renna L, Brandizzi F, Sharkey TD (2020) Phosphoglucoisomerase Is an Important Regulatory Enzyme in Partitioning Carbon out of the Calvin-Benson Cycle. Frontiers in Plant Science. doi: 10.3389/fpls.2020.580726
Schneider S, Hulpke S, Schulz A, Yaron I, Höll J, Imlau A, Schmitt B, Batz S, Wolf S, Hedrich R, et al (2012) Vacuoles release sucrose via tonoplast-localised SUC4-type transporters. Plant Biol (Stuttg) 14: 325–336
Schwarz G (1978) Estimating the Dimension of a Model. The Annals of Statistics 6: 461–464
Sharkey T, Preiser AL, Weraduwage SM, Gog L (2020) Source of 12C in Calvin Benson cycle intermediates and isoprene emitted from plant leaves fed with 13CO2. Biochemical Journal 477: 3237–3252
Sharkey TD (2019) Discovery of the canonical Calvin–Benson cycle. Photosynthesis Research 140: 235–252
Sharkey TD (1988) Estimating the rate of photorespiration in leaves. Physiologia Plantarum 73: 147–152
Sharkey TD, Berry JA, Raschke K (1985) Starch and Sucrose Synthesis in Phaseolus vulgaris as Affected by Light, CO2, and Abscisic Acid. Plant Physiology 77: 617–620
Sharkey TD, Weise SE (2016) The glucose 6-phosphate shunt around the Calvin-Benson cycle. J Exp Bot 67: 4067–4077
Szecowka M, Heise R, Tohge T, Nunes-Nesi A, Vosloh D, Huege J, Feil R, Lunn J, Nikoloski Z, Stitt M, et al (2013) Metabolic fluxes in an illuminated Arabidopsis rosette. Plant Cell 25: 694–714
Tcherkez G, Gauthier P, Buckley TN, Busch FA, Barbour MM, Bruhn D, Heskel MA, Gong XY, Crous KY, Griffin K, et al (2017) Leaf day respiration: low CO2 flux but high significance for metabolism and carbon balance.
New Phytologist 216: 986–1001
Uys L, Botha FC, Hofmeyr J-HS, Rohwer JM (2007) Kinetic model of sucrose accumulation in maturing sugarcane culm tissue. Phytochemistry 68: 2375–2392
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17: 261–272
Wieloch T (2021) The next phase in the development of 13C isotopically non-stationary metabolic flux analysis. Journal of Experimental Botany 72: 6087–6090
Xu Y, Fu X, Sharkey TD, Shachar-Hill Y, Walker BJ (2021) The metabolic origins of non-photorespiratory CO2 release during photosynthesis: a metabolic flux analysis. Plant Physiology 1–18
Young JD (2014) INCA: A computational platform for isotopically non-stationary metabolic flux analysis. Bioinformatics 30: 1333–1335

APPENDIX A: Supplemental Material for Chapter 2

SUPPLEMENTAL TEXT

T1. Derivation of Polyexponential Models from Analytical Solutions of Compartmental Models

In Figure S2.1, we show a simplified model of photosynthetic carbon assimilation with three compartments and rates V1-V5.
Assuming first-order or pseudo-first-order kinetics:

$$V_1 = k_1[X] \tag{E1}$$
$$V_2 = k_2[Y] \tag{E2}$$
$$V_3 = k_3[Y] \tag{E3}$$
$$V_4 = k_4[Z] \tag{E4}$$
$$V_5 = k_5[Z] \tag{E5}$$

We define the differential operator D, where [F] is a stand-in for any compartment’s concentration:

$$\frac{d^n[F]}{dt^n} = D^n[F] \tag{E6}$$

Given definitions (E1-E5) and notation from E6, the rates of change of compartments X, Y, and Z are:

$$D[X] = -k_1[X] + k_2[Y] \tag{E7}$$
$$D[Y] = k_1[X] - k_2[Y] - k_3[Y] + k_4[Z] \tag{E8}$$
$$D[Z] = k_3[Y] - k_4[Z] - k_5[Z] \tag{E9}$$

Through a series of substitutions, it can be shown that this system of differential equations simplifies to a linear homogeneous differential equation of the 3rd order:

$$D^3[X] + aD^2[X] + bD[X] + c[X] = 0 \tag{E10}$$

where the coefficients a, b, and c are combinations of rate constants, and are therefore themselves constants, such that:

$$a = k_1 + k_2 + k_3 + k_4 + k_5 \tag{E11}$$
$$b = k_1(k_3 + k_4 + k_5) + k_2(k_4 + k_5) + k_3k_5 \tag{E12}$$
$$c = k_1k_3k_5 \tag{E13}$$

The general solution to a linear homogeneous differential equation with constant coefficients is of the form:

$$[X](t) = e^{rt} \tag{E14}$$

where r is some constant. From E10 and E14, we get the characteristic polynomial:

$$r^3 + ar^2 + br + c = 0 \tag{E15}$$

This cubic polynomial has three roots, counting repeated and complex roots. Due to the linearity of the system, its general solution (for distinct roots) is a linear combination of the exponential terms defined by these roots, such that:

$$[X](t) = c_1e^{r_1t} + c_2e^{r_2t} + c_3e^{r_3t} \tag{E16}$$

Solving this cubic polynomial for biochemically reasonable estimates of k1 through k5 results in three real and negatively valued roots, making the general result of an identical form as the triexponential decay models we fit our data to in this study. The analytical solutions to the differential equations or systems of differential equations describing single and two-compartment models, likewise, correspond to single exponential and biexponential functions, respectively.

T2. Supplemental Methods

Nonlinear regression and bootstrapping

Fitting of % 12C remaining data to polyexponential models was performed in Python using the curve_fit() function implemented in the SciPy package (Virtanen et al., 2020). We performed all regressions 100 times with uniformly sampled initial parameter values and selected the fit with the lowest sum-of-squared residuals (SSR) for further analysis. Bootstrap resampling with replacement (N = 1000) was performed using functions from the Python package recombinator. Due to the time-course structure of the data, circular block bootstrapping was used to preserve some of the dependence structure between subsequent measurements (Politis and Romano, 1991). Bootstrap samples were fitted using the same general procedure as that used to generate the best-fit lines, with the exception that the initial guesses for the parameter values for the regression of the bootstrap samples were set to the best-fit parameter values. 95% confidence intervals for each parameter were derived by taking the 2.5th and 97.5th percentile values of the resulting distributions of all successful fits.

Data treatment for heteroskedastic residuals and outlier identification

Heteroskedastic residuals from our nonlinear regressions were corrected using a logit transformation (Johnson, 1949). Specifically, we performed nonlinear regression on models of the form:

$$\operatorname{logit}\left(\frac{f(t)}{100}\right) = \operatorname{logit}\left(\frac{Ae^{bt} + \cdots}{100}\right) \tag{E17}$$

This preserves the relationship between our response, independent variables, and estimated parameters, allowing for straightforward interpretation while substantially reining in the heteroskedasticity of the residuals. Due to the presence of % 12C remaining values very close to 100% in the tobacco dataset, a constant value of 0.1 was subtracted to avoid inflated values in the first time point exerting too much influence over the nonlinear regression results (and therefore resulting in heteroskedastic residuals).
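The multi-start regression on the logit scale described above can be illustrated with a minimal sketch. This is not the code used in this study: the biexponential model, the synthetic data, the parameter bounds, the number of starts, and the names `biexp`, `logit_biexp`, and `multistart_fit` are all assumptions made for illustration, and the bootstrap step is not shown.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import logit


def biexp(t, a1, b1, a2, b2):
    """Percent 12C remaining modeled as a sum of two exponential decays."""
    return a1 * np.exp(b1 * t) + a2 * np.exp(b2 * t)


def logit_biexp(t, a1, b1, a2, b2):
    """The same model on the logit scale (right-hand side of E17)."""
    frac = biexp(t, a1, b1, a2, b2) / 100.0
    # Clipping keeps the logit finite if a trial parameter set leaves (0, 1).
    return logit(np.clip(frac, 1e-9, 1.0 - 1e-9))


def multistart_fit(t, y_pct, n_starts=100, seed=0):
    """Fit from uniformly sampled starting points; keep the lowest-SSR fit."""
    rng = np.random.default_rng(seed)
    y_logit = logit(y_pct / 100.0)
    # Assumed bounds: amplitudes in [0, 100] %, decay constants in [-10, 0] min^-1.
    lower, upper = [0.0, -10.0, 0.0, -10.0], [100.0, 0.0, 100.0, 0.0]
    best_popt, best_ssr = None, np.inf
    for _ in range(n_starts):
        p0 = [rng.uniform(0, 100), rng.uniform(-2, 0),
              rng.uniform(0, 100), rng.uniform(-2, 0)]
        try:
            popt, _ = curve_fit(logit_biexp, t, y_logit, p0=p0,
                                bounds=(lower, upper))
        except RuntimeError:
            continue  # this start did not converge; try the next one
        ssr = float(np.sum((logit_biexp(t, *popt) - y_logit) ** 2))
        if np.isfinite(ssr) and ssr < best_ssr:
            best_popt, best_ssr = popt, ssr
    return best_popt, best_ssr


# Synthetic, noise-free example data (hypothetical parameter values).
t = np.array([0.5, 1, 2, 2.5, 3, 5, 7, 10, 15, 30, 60], dtype=float)
y_true = biexp(t, 60.0, -1.0, 35.0, -0.05)
best_popt, best_ssr = multistart_fit(t, y_true)
```

Because the SSR is evaluated on the logit scale, data points near 0% or 100% are stretched relative to mid-range points, which is the property the transformation is used for here.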
Specifically, the following model was fitted for the tobacco datasets:

$$\operatorname{logit}\left(\frac{f(t) - 0.1}{100}\right) = \operatorname{logit}\left(\frac{(Ae^{bt} + \cdots) - 0.1}{100}\right) \tag{E18}$$

Studentized residuals were calculated for all model fits, and datapoints whose studentized residuals exceeded an absolute value of 3 were excluded (N = 5). Due to the substantial impure heteroskedasticity in the Model 1 fits, studentized residuals greater than 3 in Model 1 fits were ignored for the purposes of outlier removal.

Model selection criteria

Extra-sum-of-squares: For each nested pair of models we calculated the probability that, given the null hypothesis that the simpler of the two models is true, we would see the observed improvement in model fit as measured by the sum-of-squared residuals (SSR) (Draper and Smith, 1998). We calculate an F statistic as follows:

$$F = \frac{\left(SSR_{simple} - SSR_{complex}\right) / \left(DF_{simple} - DF_{complex}\right)}{SSR_{complex} / DF_{complex}} \tag{E19}$$

where SSRsimple and SSRcomplex are the SSR values for the simpler and more complex models (i.e., those with fewer and more parameters), respectively, and DFsimple and DFcomplex are the degrees of freedom for the two models. The F statistic resulting from E19 was then compared to the F-distribution to derive a p-value representing the probability of observing this F statistic given our null hypothesis, which is that our simpler model is correct. For this study, we set α = 0.05 and used the Holm-Bonferroni correction (Holm, 1978) to adjust our p-value cutoff to one that corresponds to a family-wise α of 0.05. For each p-value Pk in the family of hypothesis tests being tested, we evaluate the following expression:

$$P_k < \frac{\alpha}{m + 1 - k} \tag{E20}$$

where α is the family-wise α we are adjusting to, m is the number of hypothesis tests being conducted, and k is the rank of the p-value Pk in a ranked list of increasing p-values.
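As a worked illustration of E19 and E20, the following sketch computes the extra-sum-of-squares F statistic and its p-value for one nested pair and then applies the Holm-Bonferroni rule to a small family of p-values. The function names and all numerical inputs are hypothetical. The step-down stopping rule (no further rejections after the first test that fails E20) is part of the standard Holm procedure and is assumed here.

```python
from scipy.stats import f as f_dist


def extra_sum_of_squares_p(ssr_simple, df_simple, ssr_complex, df_complex):
    """Return (F, p) for a nested model comparison, following E19."""
    df_num = df_simple - df_complex
    f_stat = ((ssr_simple - ssr_complex) / df_num) / (ssr_complex / df_complex)
    # Survival function = P(F' >= f_stat) under the null (simpler model true).
    p_value = f_dist.sf(f_stat, df_num, df_complex)
    return f_stat, p_value


def holm_reject(p_values, alpha=0.05):
    """Apply E20 to p-values ranked in increasing order (step-down)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] < alpha / (m + 1 - rank):
            reject[idx] = True
        else:
            break  # Holm: once one test fails, all larger p-values also fail
    return reject


# Hypothetical example: 14 data points, 2- vs 4-parameter model.
f_stat, p_value = extra_sum_of_squares_p(ssr_simple=10.0, df_simple=12,
                                         ssr_complex=4.0, df_complex=10)
decisions = holm_reject([0.01, 0.04, 0.03], alpha=0.05)
```

In this example the smallest p-value is compared against α/3, the next against α/2, and so on, so the returned list marks which null hypotheses (simpler model is correct) would be rejected at a family-wise α of 0.05.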
We selected the best-supported model for a given dataset by starting with the single exponential model and adding more parameters until we got to a model comparison that did not meet our adjusted p-value cutoff, in which case we went with the simpler model in the comparison. In cases where a comparison between two models more complex than the one arrived at by this procedure yielded a low p-value, we calculated the p-value for the F-statistic comparison between the more complex of those two and the accepted model. If we were justified in rejecting the null hypothesis that the simpler model is better in this case, we went with the more complex model.

Cross-validation: For this study we used the cross_validate() function from the scikit-learn package to perform between 5 and 10 iterations of 5-fold cross-validation on our datasets (Pedregosa et al., 2011; Hastie et al., 2017). The same non-linear ordinary least squares fitting procedure used for our best-fit parameter estimation on the full datasets was used for our cross-validation, with the only difference being that the fitting was done 5 times with different randomly selected bins of data for training and testing, resulting in 5 estimates of prediction error for each alternative model at each iteration. After 5-10 iterations, we took all the negative mean squared error estimates for each model for a given metabolite or aggregated metabolite dataset and then calculated their mean value and 95% confidence interval (± 1.96 SE). The model with the lowest average error and whose 95% CI does not overlap with the next simplest model in terms of the number of fitted parameters was chosen as the best-performing model for each dataset.

AIC/BIC: For each best-fit of Models 1-7, the AIC (Akaike, 1998) and BIC (Schwarz, 1978) were calculated as follows:

$$AIC = 2k + n \ln SSR \tag{E21}$$
$$BIC = k \ln n + n \ln SSR \tag{E22}$$

where k is the number of estimated parameters in the model, and n is the sample size.
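E21 and E22 are simple enough to evaluate directly. The sketch below does so for three hypothetical nested models fitted to the same n = 40 data points; the SSR values, parameter counts, and function names are illustrative assumptions, not results from this study. Because n is the same for every model fitted to a given dataset, using ln SSR rather than ln(SSR/n) shifts all scores by the same constant and does not change the ranking.

```python
import math


def aic(ssr, n, k):
    """AIC as written in E21: 2k + n ln(SSR)."""
    return 2 * k + n * math.log(ssr)


def bic(ssr, n, k):
    """BIC as written in E22: k ln(n) + n ln(SSR)."""
    return k * math.log(n) + n * math.log(ssr)


# Hypothetical (k, SSR) pairs for models fitted to the same 40 points.
n_points = 40
candidates = {
    "single exponential": (2, 12.0),
    "biexponential": (4, 3.0),
    "triexponential": (6, 2.9),
}

scores = {
    name: (aic(ssr, n_points, k), bic(ssr, n_points, k))
    for name, (k, ssr) in candidates.items()
}
for name, (aic_value, bic_value) in scores.items():
    print(f"{name}: AIC = {aic_value:.2f}, BIC = {bic_value:.2f}")
```

In this made-up example the small SSR improvement from the biexponential to the triexponential model does not offset the penalty for two extra parameters, so both criteria favor the biexponential model.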
The best-supported model for each dataset was chosen by identifying the model with the lowest AIC/BIC value that is not within two absolute units of a simpler (i.e., fewer parameters) model.

T3. Calculation of vo/vc

We begin with the equation from Farquhar et al. (1980):

$$A = v_C - 0.5v_O - R_L \tag{E23}$$

where A is the net rate of CO2 assimilation (uptake), vC is the velocity of carboxylation, vO is the velocity of oxygenation, and RL is all other sources of CO2 release in the light, possibly primarily CO2 released by the glucose 6-phosphate shunt (Xu et al., 2021a). Next, we define

$$\Phi = \frac{v_O}{v_C} \tag{E24}$$

and so

$$A = v_C(1 - 0.5\Phi) - R_L \tag{E25}$$

Rearranging,

$$v_C = \frac{A + R_L}{1 - 0.5\Phi} \tag{E26}$$

We can also estimate vO:

$$A = v_O\left(\frac{1}{\Phi} - 0.5\right) - R_L \tag{E27}$$

and so

$$v_O = \frac{A + R_L}{\frac{1}{\Phi} - 0.5} \tag{E28}$$

Taking the ratio of E28 to E26 and canceling (A + RL):

$$\frac{v_O}{v_C} = \frac{1 - 0.5\Phi}{\frac{1}{\Phi} - 0.5} \tag{E29}$$

We can expand Φ as in Farquhar et al. (1980):

$$\Phi = \frac{2\Gamma^*}{C} \tag{E30}$$

where Γ* is the CO2 compensation point in the absence of RL. Therefore,

$$\frac{v_O}{v_C} = \frac{1 - \Gamma^*/C}{0.5\left(C/\Gamma^* - 1\right)} \tag{E31}$$

where C is the CO2 partial pressure equivalent at the sites of carboxylation. This is determined by

$$C = C_i - \frac{A}{g_m} \tag{E32}$$

where Ci is the partial pressure of CO2 in the intercellular air spaces of the leaf (estimated from gas exchange) and gm is the mesophyll conductance for CO2 diffusion. In the absence of a direct measurement, gm can be estimated as

$$g_m = 0.3 + 0.11 \cdot A \tag{E33}$$

based on multiple measurements reported in Caemmerer and Evans (1991). We can parameterize as follows, based on measured gas exchange of the leaves used for this data set:

A = 17.4 ± 1.9 µmol m-2 s-1 (avg ± SD) (measured)
Γ* = 3.18 Pa (for tobacco, from Sharkey (2016), adjusted to 22°C)
C = 20.5 Pa (measured Ci and corrected for gm using E32)

$$\frac{v_O}{v_C} = \frac{1 - \frac{3.18}{20.5}}{0.5\left(\frac{20.5}{3.18} - 1\right)} = 0.31 \tag{E34}$$

T4. Plant Growth, Gas Exchange, and 13CO2 Labeling.
Wild-type Camelina sativa ecotype Suneson was grown under 8/16-h day/night cycles, under a light intensity of 500 μmol m−2 s−1, temperature of 22°C, and 50% relative humidity for 4 weeks. The youngest fully expanded leaves were used for gas exchange and labeling experiments. A LI-COR 6800 portable photosynthesis system (LI-COR Biosciences, Lincoln, NE, USA) was used to measure carbon assimilation. The reference [CO2] was set to 400 ppm, light intensity was 500 μmol m−2 s−1, temperature was 22°C, and relative humidity was 70% to ensure that the leaf vapor pressure deficit was ~0.85 kPa. After 10-15 min acclimation, net CO2 assimilation rate was logged and then the CO2 source was switched to 13CO2 with all other parameters held constant. Gases were mixed with mass flow controllers (Alicat Scientific, Tucson AZ, USA) controlled by a custom-programmed Raspberry Pi touchscreen monitor (Raspberry Pi Foundation, code available upon request). Labeled leaf samples were collected at time points of 0, 0.5, 1, 2, 2.5, 3, 5, 7, 10, 15, 30, 60, 90, and 120 min. Liquid nitrogen was directly sprayed on the leaf surface via a customized fast quenching (0.1-0.5 s to <0°C) labeling system (13). Leaf temperature fell below 0°C between. The frozen leaf sample was stored at -80°C. There were three biological replicates for data points from 0-90 min, and two biological replicates at 120 min.

T5. Analysis of Mass Spectrometry Data.

Data from LC-MS/MS were acquired with MassLynx 4.0 (Agilent, Santa Clara, CA, USA). Data from GC-EI-MS were acquired with Agilent GC/MSD Chemstation (Agilent, Santa Clara, CA, USA). Data from GC-CI-MS were acquired with Agilent MassHunter Workstation (Agilent, Santa Clara, CA, USA). Metabolites were identified by retention time and mass to charge ratio (m/z), in comparison with authentic standards. Both LC-MS and GC-MS data were converted to MassLynx format and processed with QuanLynx software for peak detection and quantification.
Parameters for transitions of measured metabolites in multiple reaction monitoring (MRM) with LC-MS/MS and selected ion monitoring (SIM) with GC-MS are shown in Appendix A, Dataset S5. Experimentally measured mass isotopomer distributions of measured metabolites are shown in Appendix A, Dataset S1.

Isotopologue Network and Flux Determination.

The metabolic network model with all reactions and their respective carbon atom transitions describing photosynthetic central metabolism in Camelina sativa was constructed based on previous studies (Ma et al., 2014; Xu et al., 2021a) and the KEGG database. A list of the reactions and abbreviations is provided in Table S2.2. INST-MFA was performed to estimate metabolic fluxes using the Isotopomer Network Compartmental Analysis software package (INCA2.0, http://mfa.vueinnovations.com, Vanderbilt University) (Young, 2014) implemented in MATLAB 2018b. The fits of all the tested models were accepted based on a χ2 test of the sum-of-squared residuals (SSR). Global best-fit SSRs were calculated by parameter continuation analysis. The fatty acid synthesis rate is constrained to 0.0329-0.4405 μmol CO2 g–1FW h–1 by combining the previous measurements of 0.049-0.067 μmol CO2 m−2 s-1 with 0.005-0.012 μmol CO2 m−2 s-1 (Tcherkez et al., 2005; Xu et al., 2021a). RL is constrained to 8.1-10.7 μmol CO2 g–1FW h–1 based on a previous measurement (Xu et al., 2021a). vo/vc is constrained to 0.3-0.32 based on measurement in this study.

Assessment of Flux Precision

The parameter continuation method and the Monte Carlo method were each used independently to estimate the 95% confidence intervals of the estimated flux values, as shown in Appendix A, Dataset S4. 10,000 sets of perturbed data were used for Monte Carlo analysis. The resulting distribution of flux values enabled the estimation of confidence intervals.
The computation-intensive parameter continuation and Monte Carlo simulations were computed in parallel using a SLURM job scheduler to distribute jobs to hundreds of compute nodes within a high-performance computing cluster provided by the Institute for Cyber-Enabled Research at Michigan State University. The two approaches gave similar confidence intervals for each flux.

Calculation of predicted percentage of isotopologues (fmn)

The predicted percentage of isotopologues (fmn) is calculated by the equation:

$$f_{mn} = \left(p_{^{13}C}\right)^{n} \cdot \left(p_{^{12}C}\right)^{m-n} \cdot {}^{m}C_{n} \tag{E35}$$

where p13C is the measured 13C enrichment; p12C is the measured 12C enrichment; n is the number of 13C carbons; m is the total number of carbons; and mCn is the number of combinations of n objects chosen from a total of m objects.

FIGURES

Figure S2.1: Simplified compartmental model used in “T1: Derivation of Polyexponential Models from Analytical Solutions of Compartmental Models” showing the metabolite compartments and rates interconnecting them. Note that we are modeling the depletion of 12C here, not the enrichment of 13C, hence the lack of external input to the CBC under the assumption that we are working with pure 13CO2.

Figure S2.2: Nonlinear regression fits for all polyexponential models fitted to the aggregated Calvin-Benson Cycle intermediate dataset along with a summary of model selection results. The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling. In the bottom-right table, green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row. Figure 2.1 is a subset of these data.
Figure S2.3: Nonlinear regression fits for all polyexponential models fitted to the fructose 1,6-bisphosphate dataset along with a summary of model selection results. The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling. In the bottom-right table, green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row.

Figure S2.4: Nonlinear regression fits for all polyexponential models fitted to the 3-phosphoglycerate dataset along with a summary of model selection results. The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling. In the bottom-right table, green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row.

Figure S2.5: Nonlinear regression fits for all polyexponential models fitted to the glyceraldehyde-3-phosphate dataset along with a summary of model selection results. The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling. In the bottom-right table, green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row.

Figure S2.6: Nonlinear regression fits for all polyexponential models fitted to the dihydroxyacetone phosphate dataset along with a summary of model selection results.
The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling. In the bottom-right table, green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row.

Figure S2.7: Nonlinear regression fits for all polyexponential models fitted to the erythrose-4-phosphate dataset along with a summary of model selection results. The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling. In the bottom-right table, green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row.

Figure S2.8: Nonlinear regression fits for all polyexponential models fitted to the sedoheptulose-7-phosphate dataset along with a summary of model selection results. The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling. In the bottom-right table, green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row.

Figure S2.9: Nonlinear regression fits for all polyexponential models fitted to the ribulose 1,5-bisphosphate dataset along with a summary of model selection results. The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling.
In the bottom-right table, green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row.

Figure S2.10: Nonlinear regression fits for all polyexponential models fitted to the ADP-glucose dataset along with a summary of model selection results. The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling. In the bottom-right table, green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row.

Figure S2.11: Nonlinear regression fits for all polyexponential models fitted to the UDP-glucose dataset along with a summary of model selection results. The orange line represents the best-fit line and the shaded region represents the 95% CI estimated by bootstrap resampling. In the bottom-right table, green squares represent model selection results supporting the model indicated by that row representing a statistical improvement over a simpler model. Orange squares represent model selection results that do not support adding the additional parameters needed for the model in that row.

Figure S2.12: Transient 13CO2 labeling in glucose, fructose, sucrose glucosyl moiety, and sucrose fructosyl moiety. Experimentally determined isotope labeling measurements are shown as points with error bars (n=3, ± stdev). INST-MFA fitted mass isotopologue distributions are shown as solid lines. Nominal masses of M0 mass isotopologues are shown in parentheses. Error bars represent standard errors.

Figure S2.13: Transient 13CO2 labeling in measured ions.
Experimentally determined isotope labeling measurements are shown as points with error bars (n=3, ± stdev). INST-MFA fitted mass isotopologue distributions are shown as solid lines. Nominal masses of M0 mass isotopologues are shown in parentheses. Error bars represent standard errors. (A) C3 and Glycolysis related metabolites. Core C3-only intermediates [labeled in red]; intermediates shared with glycolysis [purple]; core glycolysis metabolites and products [green]; photorespiratory intermediates [blue]; then carbohydrate building substrates [black]. (B) TCA cycle related metabolites. OAA derived AA’s [labeled in red]; and more slowly Thr which is made from Asp at a slower rate than Asn [purple]; Citrate [green]; Glu and Gln ions [labeled Glx in blue]; Malate, Fumarate, and Succinate [black].

Figure S2.14: Transient 13CO2 labeling in measured ions. Experimentally determined isotope labeling measurements are shown as points with error bars (n=3, ± stdev). INST-MFA fitted mass isotopologue distributions are shown as solid lines. Nominal masses of M0 mass isotopologues are shown in parentheses. Error bars represent standard errors. (A) C3 and Glycolysis related metabolites. Core C3-only intermediates [labeled in red]; intermediates shared with glycolysis [purple]; core glycolysis metabolites and products [green]; photorespiratory intermediates [blue]; then carbohydrate building substrates [black]. (B) TCA cycle related metabolites. OAA derived AA’s [labeled in red]; and more slowly Thr which is made from Asp at a slower rate than Asn [purple]; Citrate [green]; Glu and Gln ions [labeled Glx in blue]; Malate, Fumarate, and Succinate [black].

Figure S2.15: The INST-MFA estimated inactive pools for serine, glycine, R5P, RUBP, 3-PGA, H6P, FBP, RU5P, S7P, 2PG, ADPG, UDPG, and alanine were compared with Xu et al., (2021a) and Ma et al., (2014). The MSU model lowered the inactive pool sizes for all the above metabolites.
Among them, the inactive pools for RUBP, 3-PGA, H6P, RU5P, 2PG, ADPG, and UDPG were lowered dramatically, to almost zero.

TABLES

Table S2.1: Rate parameters for CBC intermediates, ADPG, and UDPG. The top row is derived from the average of all the individual metabolites. It is followed by the data for each metabolite, then the average of the individual metabolites' time constants (CBC average not included) and their standard deviation.

| Metabolite(s) | Fast slope (min-1) | Middle slope (min-1) | Slow slope (min-1) |
|---|---|---|---|
| CBC average | -1.071 | -0.203 | -0.007 |
| PGA | -1.007 | -0.161 | -0.003 |
| S7P | -1.078 | -0.163 | -0.003 |
| GAP | -1.050 | -0.196 | -0.008 |
| DHAP | -0.950 | -0.140 | -0.005 |
| FBP | -1.690 | -0.371 | -0.018 |
| RUBP | -0.802 | -0.141 | 0.002 |
| ADPG | -0.665 | -0.179 | -0.013 |
| Average - CBC | -1.04 | -0.194 | -0.007 |
| Std Dev | 0.30 | 0.075 | 0.006 |

Table S2.2: Abbreviations for metabolites and reactions.

| Abbreviation | Full name |
|---|---|
| 2PG | 2-phosphoglycolate |
| 3PGA | 3-phosphoglycerate |
| ACA | acetyl-CoA |
| acetyl-CoA | acetyl-coenzyme A |
| ADPG | adenosine diphosphate glucose |
| AGP | ADP-glucose phosphorylase |
| AKG | α-ketoglutarate |
| ALA | alanine |
| ALD | aldolase |
| ALT | alanine transaminase |
| AS | asparagine synthase |
| ASN | asparagine |
| ASP | aspartate |
| ASPT | aspartate transaminase |
| C3 cycle | Calvin–Benson–Bassham cycle |
| CIT | citrate |
| CO2 | carbon dioxide |
| CS | citrate synthase |
| DOF | degrees of freedom |
| E4P | erythrose-4-phosphate |
| EC2 | transketolase-bound-2-carbon-fragment |
| ESI | electrospray ionization |
| F6P | fructose-6-phosphate |
| FBA | fructose-bisphosphate aldolase |
| FBP | fructose-1,6-bisphosphatase |
| Fru | fructose |
| FUM | fumarate |
| FVCB | Farquhar, von Caemmerer and Berry |
| G1P | glucose-1-phosphate |
| G6P | glucose-6-phosphate |
| G6PDH | glucose-6-phosphate dehydrogenase |
| GA | glycerate |
| GAPDH | glyceraldehyde-3-phosphate dehydrogenase |
| GC-MS | gas chromatography-mass spectrometry |
| GDC | glycine decarboxylase |
| GK | glycerate kinase |
| Glc | glucose |
| GLN | glutamine |
| GLY | glycine |
| GPU | UDP-glucose pyrophosphorylase |
| GS | glutamine synthetase |
| ICI | isocitrate |
| IDH | isocitrate dehydrogenase |
| INST-MFA | isotopically nonstationary metabolic flux analysis |
| LC-MS/MS | liquid chromatography-tandem mass spectrometry |
| MAL | malate |
| M1P | mannose 1-phosphate |
| MDH | malate dehydrogenase |
| ME | malic enzyme |
| MFA | metabolic flux analysis |
| MID | mass isotopologue distribution |
| MRM | multiple reaction monitoring |
| netA | net CO2 assimilation |
| OAA | oxaloacetate |
| OPP | oxidative pentose phosphate |
| PCR | pyrroline-5-carboxylate reductase |
| PDH | pyruvate dehydrogenase |
| PEP | phosphoenolpyruvate |
| PFP | phosphofructokinase pyrophosphate |
| PGAM | phosphoglycerate mutase |
| PGI | phosphoglucose isomerase |
| PGM | phosphoglucomutase |
| PGP | phosphoglycolate phosphatase |
| PK | pyruvate kinase |
| PPC | phosphoenolpyruvate carboxylase |
| PPE | phosphopentose epimerase |
| PPI | phosphopentose isomerase |
| PRK | phosphoribulokinase |
| PRO | proline |
| PYR | pyruvate |
| R5P | ribose-5-phosphate |
| RL | respiration in the light |
| RU5P | ribulose-5-phosphate |
| RUBISCO_CO2 | ribulose-1,5-bisphosphate carboxylase (oxygenase) |
| RUBISCO_O2 | ribulose-1,5-bisphosphate (carboxylase) oxygenase |
| RUBP | ribulose-1,5-bisphosphate |
| S6P | sucrose-6-phosphate |
| S7P | sedoheptulose-7-phosphate |
| SBP | sedoheptulose-1,7-bisphosphate |
| SBPase | sedoheptulose-1,7-bisphosphatase |
| SCA | succinyl-CoA |
| SER | serine |
| SFrc | sucrose fructosyl moiety |
| SGA1 | serine:glyoxylate aminotransferase |
| SGlc | sucrose glucosyl moiety |
| SIM | selected ion monitoring |
| SPS | sucrose-phosphate synthase |
| SRES | squared residual |
| SS | starch synthase |
| SSR | sum-of-squared residuals |
| Suc | sucrose |
| SUC | succinate |
| T_3PGA | 3PGA transporter |
| T_TP | TP transporter |
| TBDMS | tert-butyldimethylsilyl |
| TCA | tricarboxylic acid |
| THR | threonine |
| TK1 | transketolase |
| TMS | trimethylsilyl |
| TP | triose phosphate |
| TS | threonine synthase |
| UDPG | uridine diphosphate glucose |
| vc | velocity of carboxylation |
| vo | velocity of oxygenation |
| Vpr | photorespiratory CO2 release |
| X5P | xylulose-5-phosphate |

Table S2.3: A comparison of the goodness of fit between data and best-fit simulations from alternative models.
The starting model with no inactive pools, the model with an unlabeled glucose source, and the model with sucrose recycling reactions and sucrose vacuole pool reactions were compared in terms of fluxes for key reactions, SSR, the five most different individual SSRs, and DOF. 5*: DOF in terms of fluxes. The lowest value of SSR is shown in blue, the 50th percentile of SSR is shown in yellow, and the highest value of SSR is shown in red. The starting model with no pools had the biggest overall SSR (1340) and highest individual SSR for R5P, FBP, UDPG, G6P, and F6P. The model with an unlabeled glucose source had both lower overall SSR and individual SSR for R5P, FBP, UDPG, G6P, and F6P. The model with sucrose recycling reactions and sucrose vacuole pool reactions had both lowest overall SSR and individual SSR for R5P, FBP, UDPG, G6P, and F6P. All abbreviations are shown in Table S2.2.

| Model | Reactions (flux) | SSR | UDPG SSR | R5P SSR | FBP SSR | G6P SSR | F6P SSR | ΔDOF |
|---|---|---|---|---|---|---|---|---|
| No inactive pools | | 1340 | 215 | 115 | 112 | 123 | 109 | 0 |
| No inactive pools + unlabeled glucose source | CO2.u -> CO2 (0) | 1340 | 218 | 118 | 114 | 112 | 98 | 1 |
| No inactive pools + unlabeled glucose source | Glucose.u -> G6P.p (0.5) | 1300 | 216 | 116 | 113 | 85 | 102 | 1 |
| No inactive pools + unlabeled glucose source | TP.u -> TP.p (0.3) | 1273 | 209 | 112 | 62 | 109 | 96 | 1 |
| No inactive pools + unlabeled glucose source | Glucose.u -> G6P.c (1.9) | 1126 | 109 | 117 | 101 | 62 | 59 | 1 |
| No inactive pools + sucrose recycling reactions + sucrose vacuole pool reactions | Suc.v <-> Suc.c (2.11); Glc.v <-> Glc.c (2.11); Suc.c -> Glc.c + Fru.c (0.05); Glc.c -> G6P.c (2.16); Fru.c -> F6P.c (2.16) | 968 | 53 | 101 | 76 | 32 | 19 | 5* |

Table S2.4: Predicted and measured ratios between M1 to M0 of CBC intermediates based on their predicted and measured percentage of isotopologues.
Values within each cell are listed in order of increasing isotopologue mass (M0 first).

| Metabolites | Isotopologues | Predicted percentage of isotopologue | Measured percentage of isotopologue | Predicted M1/M0 | Measured M1/M0 |
|---|---|---|---|---|---|
| GAP/DHAP | M0-M3 | 0.01, 0.6, 12.1, 87.4 | 2.4, 0.5, 5.0, 92.1 | 65 | 0.2 |
| PGA | M0-M3 | 0.01, 0.5, 11.7, 87.7 | 1.6, 0.7, 6.5, 91.1 | 67 | 0.4 |
| R5P | M0-M5 | 0.001, 0.04, 0.7, 6.6, 31.7, 61.0 | 2.4, 0.4, 4.6, 4.0, 11.2, 77.3 | 48 | 0.2 |
| RU5P/XU5P | M0-M5 | 0.001, 0.03, 0.6, 6.1, 30.9, 62.4 | 2.2, 0.4, 4.7, 3.1, 12.2, 77.4 | 51 | 0.2 |
| RUBP | M0-M5 | 0.0001, 0.005, 0.2, 2.6, 22.2, 75.1 | 1.6, 0.4, 1.3, 1.5, 11.5, 83.7 | 85 | 0.2 |
| F6P | M0-M6 | 0.000004, 0.0004, 0.02, 0.3, 4.0, 25.9, 69.7 | 2.3, 0.3, 0.4, 1.2, 1.9, 11.4, 82.6 | 97 | 0.1 |
| G6P | M0-M6 | 0.0001, 0.003, 0.1, 1.1, 8.3, 33.6, 57.0 | 2.9, 0.4, 0.3, 2.1, 8.3, 10.4, 75.6 | 61 | 0.1 |
| S7P | M0-M7 | 0.00000001, 0.000002, 0.0002, 0.01, 0.2, 2.6, 21.3, 76.0 | 1.2, 0.3, 0.2, 1.6, 2.1, 2.3, 12.1, 82.1 | 175 | 0.2 |

Table S2.5: Contributions of fully unlabeled and partially labeled isotopologues to the lack of complete labeling in glucose 6-phosphate after two hours of labeling with 13CO2. Relative abundances are from Table S5. Fully unlabeled G6P accounts for only 0.174 / (0.174+0.365) = 32% of the labeling deficit.

| Isotopologue | Relative abundance | 12C in M0 | 12C in M1 to M6 |
|---|---|---|---|
| M0 | 0.029 | 0.174 | - |
| M1 | 0.004 | - | 0.02 |
| M2 | 0.003 | - | 0.012 |
| M3 | 0.021 | - | 0.063 |
| M4 | 0.083 | - | 0.166 |
| M5 | 0.104 | - | 0.104 |
| M6 | 0.756 | - | 0 |
| Sum | 1 | 0.174 | 0.365 |

Table S2.6: Carbon accounting for the model. Values in the absolute columns are fluxes from the model (Fig. 2.3) converted to a carbon basis. The last two columns are absolute values divided by the net rate of CO2 assimilation.
Calvin-Benson cycle carbon inputs and outputs

| | Absolute, In (μmol g-1 FW hr−1) | Absolute, Out (μmol g-1 FW hr−1) | Relative to net assimilation, In (%) | Relative to net assimilation, Out (%) |
|---|---|---|---|---|
| Rubisco | 172 | | 123% | |
| Photorespiration | 75 | 102 | 54% | 73% |
| TPT | | 117 | | 84% |
| Starch synthesis | | 63 | | 45% |
| G6P shunt | 35 | | 25% | |
| Total | 282 | 282 | 202% | 202% |

CO2 budget

| | Absolute (μmol g-1 FW hr−1) | Relative to net assimilation (%) |
|---|---|---|
| Rubisco | 172 | 123% |
| Photorespiration | 25 | 18% |
| G6P shunt | 7 | 5% |
| Fatty acids | 0.4 | 0.3% |
| In minus out | 139.6 | 100% |

End Products

| | Absolute (μmol g-1 FW hr−1) | Relative to net assimilation (%) |
|---|---|---|
| Starch | 63.0 | 45% |
| Sucrose | 68.4 | 49% |
| Other cytosolic | 6.5 | 5% |
| Fatty acids | 0.8 | 1% |
| Total end products | 138.7 | 99% |

Table S2.7: vo/vc for models with and without labeling input for serine and glycine, with and without constraints of vo/vc. Four scenarios were tested: 1) with serine and glycine labeling input, unconstrained vo/vc; 2) with serine and glycine labeling input, constrained vo/vc = 0.31 +/- 5%; 3) without serine and glycine labeling input, unconstrained vo/vc; 4) without serine and glycine labeling input, constrained vo/vc = 0.31 +/- 5%.

| | With serine and glycine, unconstrained vo/vc | With serine and glycine, constrained vo/vc | Without serine and glycine, unconstrained vo/vc | Without serine and glycine, constrained vo/vc |
|---|---|---|---|---|
| vc | 161.7 | 215.1 | 167.0 | 167.0 |
| vo | 33.2 | 65.0 | 50.6 | 50.6 |
| vo/vc | 0.21 | 0.30 | 0.30 | 0.30 |

DATASET LEGENDS

All supplemental datasets can be found at the following link: https://doi.org/10.1073/pnas.2121531119.

Dataset S1 (separate file). Experimentally measured mass isotopologue distributions of measured metabolites.

Dataset S2 (separate file). Parameter value estimates and model selection results for aggregated CBC intermediate datasets and individual metabolites. Parameters in exponential terms are sorted in terms of the absolute magnitude of their decay term.

Dataset S3 (separate file). Comparisons of the model in this work with previous models (Ma et al., 2014; Xu et al., 2021a). Reactions that are different from Ma et al., (2014) are labeled in red. Reactions from Xu et al., (2021a) are shown in yellow. Reactions newly added in this publication are shown in blue.
Reactions that have been removed from Ma et al., (2014) and Xu et al., (2021a) are shown in green. Note that the parameters for alanine, glycine, and serine have been kept in the model because of their compartmentation complexity.

Dataset S4 (separate file). Estimated flux values and 95% confidence intervals by parameter continuation. Values are absolute fluxes (µmol metabolites gFW-1 hr-1) based on the measured net CO2 uptake rate. The net flux is the difference between influx and efflux of metabolites moved in or out of the cell. The exchange flux is the minimum of the forward and backward fluxes of a reversible reaction. Some confidence intervals of exchange fluxes are unidentifiable or infinite. Subcellular fluxes are shown by metabolites spatially separated in the plastid (.p) and cytosol (.c).

Dataset S5 (separate file). Parameters for transitions of measured metabolites in multiple reaction monitoring (MRM) with LC-MS/MS and selected ion monitoring (SIM) with GC-MS. LC-MS/MS dwell time was set at 20 ms for each transition. Q1, m/z of the precursor ion; Q3, m/z of the product ion. Cone and collision energy were optimized by direct infusion of standards. Amino and organic acids were measured by GC-MS by tert-butyldimethylsilyl (TBDMS) derivatization whereas glucose, fructose, and sucrose were derivatized by trimethylsilyl (TMS).

REFERENCES

Akaike H (1998) Information Theory and an Extension of the Maximum Likelihood Principle. In E Parzen, K Tanabe, G Kitagawa, eds, Selected Papers of Hirotugu Akaike. Springer New York, New York, NY, pp 199–213
Caemmerer S, Evans J (1991) Determination of the Average Partial Pressure of CO2 in Chloroplasts From Leaves of Several C3 Plants. Functional Plant Biology 18: 287
Draper NR, Smith H (1998) Extra Sums of Squares and Tests for Several Parameters Being Zero. Applied Regression Analysis.
John Wiley & Sons, Ltd, pp 149–177
Farquhar GD, Caemmerer S, Berry JA (1980) A biochemical model of photosynthetic CO2 assimilation in leaves of C3 species. Planta 149: 78–90
Hastie T, Tibshirani R, Friedman J (2017) Model Assessment and Selection. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York, NY, pp 219–260
Holm S (1978) A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 6: 65–70
Johnson NL (1949) Systems of frequency curves generated by methods of translation. Biometrika 36: 149–176
Ma F, Jazmin LJ, Young JD, Allen DK (2014) Isotopically nonstationary 13C flux analysis of changes in Arabidopsis thaliana leaf metabolism due to high light acclimation. Proceedings of the National Academy of Sciences of the United States of America 111: 16967–16972
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12: 2825–2830
Politis DN, Romano JP (1991) A circular block-resampling procedure for stationary data. Purdue University. Department of Statistics
Schwarz G (1978) Estimating the Dimension of a Model. The Annals of Statistics 6: 461–464
Sharkey TD (2016) What gas exchange data can tell us about photosynthesis. Plant Cell and Environment 39: 1161–1163
Tcherkez G, Cornic G, Bligny R, Gout E, Ghashghaie J (2005) In vivo respiratory metabolism of illuminated leaves. Plant Physiology 138: 1596–1606
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python.
Nature Methods 17: 261–272 Xu Y, Fu X, Sharkey TD, Shachar-Hill Y, Walker and BJ (2021) The metabolic origins of non-photorespiratory CO2 release during photosynthesis: a metabolic flux analysis. Plant Physiology 1–18 Young JD (2014) INCA: A computational platform for isotopically non-stationary metabolic flux analysis. Bioinformatics 30: 1333–1335 94 Chapter 3 Accurate flux predictions using tissue-specific gene expression in plant metabolic modeling This research was published in: J. A. M. Kaste, Y. Shachar-Hill. Accurate flux predictions using tissue-specific gene expression in plant metabolic modeling. Bioinformatics 39(5), btad186 (2023). 95 3.1. Preface As discussed in Chapter 1, FBA flux predictions are often some combination of imprecise and inaccurate owing to the small amount of empirical data brought to bear in most FBA studies. One attractive method of improving FBA flux accuracy is to come up with methods of incorporating omic – particularly transcriptomic – data into the prediction process. Unfortunately, this is easier said than done, with many previous attempts failing to consistently generate more accurate predictions than parsimonious FBA. These benchmarking studies were done in unicellular systems where the methods evaluated are generally some variation on the idea “If gene A encoding reaction A* is expressed more highly than gene B encoding reaction B*, reaction A* should, all else being equal, have higher flux than reaction B*.” For what should be obvious reasons, this is not a very good assumption to make. Early on in my Ph. D., it occurred to me that when modeling a multi-tissue system, it seems likely that the relationship between transcript abundance and flux between tissues of the same organism is probably more consistent than this same relationship across different gene-to- reaction pairings within a tissue or organism. This was merely a hunch, but it was compelling enough for me to convince Dr. 
Shachar-Hill to pursue the idea of making this into an algorithm and assessing whether it can make our flux predictions more accurate, as benchmarked by comparison against 13C-MFA (the gold standard, as we argue in Chapter 1). Although I was ultimately interested in doing this in C. sativa, I decided to do this study initially with A. thaliana since it has multiple tissue-atlas RNA-seq datasets, a publicly available quantitative proteome dataset, and a lot of prior genome-scale models. As it turns out, the method I developed proved quite effective, as we show in this chapter. Due to the similarity between A. thaliana and C. sativa’s genomes, we believe this represents a significant step in the direction of more accurate and useful flux modeling in C. sativa for the purposes of engineering this organism for improved biofuel production.

I carried out all of the model building, computational analysis, and manuscript writing for this study in consultation with Dr. Shachar-Hill. I am first author on the manuscript featured in this chapter, which has been published in the journal Bioinformatics.

3.2. Abstract

Motivation: The accurate prediction of complex phenotypes such as metabolic fluxes in living systems is a grand challenge for systems biology and central to efficiently identifying biotechnological interventions that can address pressing industrial needs. The application of gene expression data to improve the accuracy of metabolic flux predictions using mechanistic modeling methods such as Flux Balance Analysis (FBA) has not been previously demonstrated in multi-tissue systems, despite their biotechnological importance. We hypothesized that a method for generating metabolic flux predictions informed by relative expression levels between tissues would improve prediction accuracy.
Results: Relative gene expression levels derived from multiple transcriptomic and proteomic datasets were integrated into Flux Balance Analysis predictions of a multi-tissue, diel model of Arabidopsis thaliana’s central metabolism. This integration dramatically improved the agreement of flux predictions with experimentally based flux maps from 13C Metabolic Flux Analysis (13C-MFA) compared with a standard parsimonious FBA approach. Disagreement between FBA predictions and MFA flux maps, as measured by weighted average percent error values, dropped from between 169-180% and 94-103% in high light and low light conditions, respectively, to between 10-13% and 9-11%, depending on the gene expression dataset used. The incorporation of gene expression data into the modeling process also substantially altered the predicted carbon and energy economy of the plant.

Availability: Code is available from https://github.com/Gibberella/ArabidopsisGeneExpressionWeights.

3.3. Introduction

A grand challenge for systems biology is the ability to accurately predict complex phenotypes from omic datasets based on functional principles and mechanisms. Patterns of cellular metabolism – flux maps – are one such complex phenotype (Ratcliffe and Shachar-Hill, 2006), for which tools to predict phenotypes from basic assumptions have proven useful in exploring and designing metabolic capabilities (Burgard et al., 2003; Orth et al., 2010b; Chen et al., 2011). Methods to quantify flux maps from labeling data now allow the testing of such predictions in both simpler and multicellular systems. However, the integration of omic data to improve the accuracy of flux predictions is still at an early stage.

Metabolic flux predictions are also important for real world applications since modifying an organism’s metabolic activity in order to achieve some practical aim, such as overproducing a specific metabolite, is central to many biotechnology projects.
As in other areas of engineering, metabolic engineering can benefit from mathematical models that describe and predict the behavior of the relevant system(s). Researchers have developed two major modeling approaches to address this need: (1) 13C-Metabolic Flux Analysis (13C-MFA) and (2) Flux Balance Analysis (FBA) (Orth et al., 2010b; Antoniewicz, 2015).

With 13C-MFA, steady-state or kinetic isotopic labeling data for metabolites in a small- to medium-sized network are used to obtain estimates of the net and exchange fluxes through that network (Antoniewicz, 2015). These metabolic flux maps are regarded as the most reliable measures of in vivo metabolic fluxes; however, the throughput of this technique is limited by the large amounts of isotopic labeling data and other measurements needed to generate each flux map. FBA, which is based on applying conservation principles to a network of reactions using one or more assumptions about the functional objective(s) driving biological organization, requires substantially less experimental input data, and is therefore an attractive and commonly used metabolic modeling technique.

FBA and related metabolic modeling methods in microbial systems, together with Genome-Scale Models (GEMs) that represent the biochemical reactions encoded in an organism’s genome, have enabled radical modification of microbial central metabolism (Gleizer et al., 2019) and substantial improvements in bioproduct yields (Lee et al., 2007; Park et al., 2007). These methods can, for example, allow bioengineers to predict the behavior of their system and identify interventions, such as gene knock-outs or knock-ins, that will help them modify the organism’s phenotype (Burgard et al., 2003; Tepper and Shlomi, 2009b). However, many metabolic engineering applications require the modification not of microorganisms, but of multicellular eukaryotes like plants.
Most GEMs of plants to date [e.g., (Poolman et al., 2009; Dal’Molin et al., 2010a; Saha et al., 2011; Arnold and Nikoloski, 2014)] have treated plants, which are composed of multiple tissues with substantial functional diversity, as single-tissue aggregated metabolic networks. This has motivated the creation of “multi-tissue” GEMs to investigate source-sink dynamics and resource allocation, with the earliest efforts in this space focusing on the interplay between mesophyll and bundle-sheath cells in C4 photosynthesis (Dal’Molin et al., 2010b; Shaw and Cheung, 2020).

Re-engineering of plant metabolism on the scale seen in microbial systems has not, to date, been possible, and predictive modeling has been neither validated in detail nor applied to successful plant metabolic engineering. This is partly due to the ease and high throughput of microbial transformation relative even to model plant systems. In addition to the greater experimental demands, the metabolic modeling of these systems is also substantially harder. There is, consequently, a relative lack of MFA datasets with which to compare the predicted flux maps from FBA in plants. This contrasts with the availability of rich multi-omic datasets combining flux estimates with transcript and protein data for a number of different genotypes and growth conditions in systems like E. coli (Ishii et al., 2007).

The challenges involved in generating 13C-MFA flux maps for plants make improvement of plant FBA flux predictions an attractive path towards replicating the biotechnological successes seen in microbes. An appealing approach to improving the quality of plant FBA predictions is the integration of additional network-wide data from transcriptomic and proteomic datasets. Gene expression data – particularly transcript data – is substantially easier to generate than 13C-MFA flux maps.
Previous attempts at integrating gene expression datasets into metabolic flux predictions have been reviewed elsewhere (Machado and Herrgård, 2014; Vijayakumar et al., 2017). Methods developed before 2014 were evaluated on the basis of their ability to improve upon parsimonious FBA (pFBA) (Lewis et al., 2010) in terms of their predictions’ agreement with MFA-estimated fluxes in microorganisms and were found to not do so reliably (Machado and Herrgård, 2014). A key limitation of these studies was a lack of comparison of FBA predictions against 13C-MFA derived flux estimates. This lack of comparison against 13C-MFA is shared by the plant FBA literature, in which we are aware of only a small number of evaluations under heterotrophic conditions in green algae (Boyle et al., 2017), Arabidopsis cell cultures (Williams et al., 2010; Cheung et al., 2013), and Brassica napus embryos (Hay and Schwender, 2011).

Since then, several studies have developed algorithms benchmarked by their ability to make predictions in agreement with empirical flux maps derived from MFA studies (Tian and Reed, 2018; Pandey et al., 2019; Ravi and Gunawan, 2021). These studies have focused on unicellular organisms or animal tissues modeled in isolation. Their application to FBA in more complex systems is limited by the large number of resource-intensive MFA datasets needed to calibrate them (Tian and Reed, 2018) or their need for a reference expression dataset paired with an assumed-correct flux map (Pandey et al., 2019; Ravi and Gunawan, 2021).

To improve the accuracy of FBA in multicellular systems, particularly plants with their complex metabolic networks, we developed a method that integrates tissue-atlas data from multi-tissue systems into the flux-minimization procedure employed in pFBA.
This method incorporates evidence from gene expression datasets into FBA metabolic flux predictions by applying weights to individual reactions according to the relative transcript or protein expression of the gene(s) assigned to those reactions between different modeled tissues. The method is evaluated on its ability to make predictions in agreement with MFA flux maps. We demonstrate substantial improvements in the agreement of our FBA predicted fluxes with flux estimates from a 13C-MFA study on Arabidopsis thaliana rosette leaf central metabolism (Ma et al., 2014). Finally, we show that multiple gene expression datasets, when used as inputs, result in similar improvements in agreement and that this result generalizes across different MFA flux maps. This approach has particular potential for plant and animal systems for which there are only a limited number of experimental flux maps.

3.4. Methods

3.4.1. Overview of approach

Our method makes two key assumptions: (1) Metabolic flux maps predicted from pFBA (Lewis et al., 2010), minimizing the sum total of flux through the network, are more likely to reflect real flux maps than ones not subject to this constraint, and (2) A reaction present in two tissues A and B catalyzed by an enzyme encoded by a gene that is highly expressed in A and poorly expressed in B is likely to carry higher flux in tissue A. We incorporate assumption 1 by making the objective function of our FBA optimization the minimization of total flux, the same as pFBA (Lewis et al., 2010).
This is represented mathematically as finding the minimum value of the linear combination of all fluxes in the network, with each flux vj multiplied by a corresponding coefficient cj:

$$\min_{v_j} \sum_{j \in Reactions} c_j \cdot v_j \qquad (1)$$

Where Reactions is the list of all reactions j in the network, vj is the flux through a reaction j, and cj is the coefficient, hereafter referred to as a penalty weight since it represents a penalization of the likelihood of using a reaction j to carry flux. When cj is 1 for all reactions, our method reduces to pFBA, which can be seen as the limiting case of gene expression having no influence in predicting network flux patterns.

We incorporate assumption 2 by calculating, for each reaction in our network model, a coefficient derived from the relative expression of genes encoding the enzyme(s) that catalyze that reaction between the different tissues in the gene expression dataset. Using the coefficient vector to account for relative expression resembles the approach taken by Jenior et al., (2020). However, our method compares gene expression across tissues within a multi-tissue model to generate more accurate flux predictions, rather than comparing the expression of genes to the most expressed gene in a dataset as a proxy for transcriptional investment as a way of generating context-specific models. Reactions and genes are associated by the Gene-Protein-Reaction (GPR) terms in the model. This results in reactions mapped to relatively highly expressed genes receiving small values of cj and reactions mapped to minimally expressed genes receiving large ones.

3.4.2. Model construction and dataset selection

The Arabidopsis thaliana core metabolism model developed by Arnold and Nikoloski, (2014) was used as the basis for a multi-tissue diel model. This model was chosen due to its rich GPR annotation and focus on central metabolism.
The core model was duplicated six times to create leaf, stem, and root versions of the model for both day and night, which were interconnected by transporters allowing the movement of specific compounds and metabolites. The substrates, products, and constraints applied to the model can be found in Appendix B, Supplementary Methods. The model used in this study can be found in Appendix B, Dataset S2.

13C-MFA flux maps were obtained in planta in Arabidopsis thaliana by Ma et al., (2014), and these were used as the empirical best estimates of flux distributions. Although there are no other 13C-MFA flux maps available for autotrophic A. thaliana leaves, Szecowka et al., (2013) provide estimates of select fluxes in autotrophic A. thaliana leaf central metabolism, which we used for additional confirmation of our method’s efficacy. The pairing of fluxes in both flux studies to the FBA network is described in Appendix B, Dataset S1.

We searched the literature for high-quality, high-coverage RNA-seq and quantitative proteomic tissue atlases and found two suitable datasets meeting these criteria: Mergner et al., (2020) and Klepikova et al., (2016). The proteomic dataset from Mergner et al., (2020) is a mass-spectrometry-based quantitative proteome that reports IBAQ values, which are an accurate measure of protein abundances (Krey et al., 2014). For bioinformatic processing details, see Appendix B, Supplementary Methods. For dataset IDs, growth conditions and key parameters from each study, see Tables S3.4-S3.5.

3.4.3. Penalty weight vector calculation

We calculated the expression weight for each gene in each tissue on the basis of how the expression of that gene in a particular tissue, as measured by transcriptomic or proteomic abundance, compared to the expression of that same gene in the other tissues.
$$W_{it} = \frac{\mathrm{Max}(E_i)}{E_{it}} \qquad (2)$$

Where Wit is the expression weight for a given gene i in a tissue t, Ei is the list of expression values of gene i for each tissue, Eit is the expression of gene i in tissue t, and Max() is the maximum value from a set of one or more elements.

Many GPRs in the model consist of multiple genes that represent isozymes or members of protein complexes. The former are denoted by OR terms and the latter by AND terms in the GPR formulation. This results in many reactions having more than one expression weight due to being mapped to multiple genes. We combine these multiple weights into a single penalty weight value for each reaction by averaging the expression weights of isozymes and taking the “worst” (i.e., largest, most penalizing value) when genes form subunits of a protein complex. As an example, the penalty weight for a reaction R in the leaf subnetwork of our model with a GPR of the form (Gene1 OR Gene2) AND (Gene3), corresponding to a protein complex made of the product of Gene 3 and the product of Gene 1 or Gene 2, would be represented by:

$$c_{R,lf} = SF \left( \mathrm{Max}\left( \frac{W_{gene1,lf} + W_{gene2,lf}}{2},\; W_{gene3,lf} \right) - 1 \right) + 1$$

Where cR,lf represents the overall penalty weight in the leaf (lf) for reaction R, SF (or the scaling factor) is a coefficient that modulates the magnitude of the calculated penalty weights, and Wgene1,lf, Wgene2,lf, and Wgene3,lf are the expression weights for the individual genes Gene1, Gene2, and Gene3. Note that in the present implementation of this method, stoichiometric coefficients in GPR terms are ignored. When the one or more genes contained in a GPR for a reaction/tissue combination are all more highly expressed than the same genes in the other tissues, the penalty weight for that reaction/tissue combination will be 1.
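The calculation described above can be illustrated with a minimal Python sketch. This is not the code used in this study: the function names `expression_weights` and `reaction_penalty`, the nested-list encoding of the GPR, and the expression values are all hypothetical and chosen only for illustration. The sketch assumes strictly positive expression values and does not address how zero or missing values should be handled.

```python
from statistics import mean


def expression_weights(expression_by_tissue):
    """Eq. 2: W_it = Max(E_i) / E_it for one gene across all tissues.

    Assumes every expression value is strictly positive.
    """
    highest = max(expression_by_tissue.values())
    return {tissue: highest / value for tissue, value in expression_by_tissue.items()}


def reaction_penalty(gpr, gene_weights, scaling_factor=1.0):
    """Combine per-gene expression weights into one penalty weight c.

    `gpr` is a hypothetical encoding: a list of AND-ed groups, each group a
    list of OR-ed gene IDs. Isozymes (OR) are averaged; complex subunits
    (AND) take the largest, most penalizing value.
    """
    combined = max(mean(gene_weights[g] for g in group) for group in gpr)
    return scaling_factor * (combined - 1.0) + 1.0


# Made-up expression values (arbitrary units) for three genes in three tissues.
expression = {
    "Gene1": {"leaf": 10.0, "stem": 5.0, "root": 20.0},
    "Gene2": {"leaf": 8.0, "stem": 8.0, "root": 2.0},
    "Gene3": {"leaf": 4.0, "stem": 1.0, "root": 2.0},
}

# Expression weight of every gene in the leaf: Gene1 = 2.0, Gene2 = 1.0, Gene3 = 1.0.
leaf_weights = {
    gene: expression_weights(values)["leaf"] for gene, values in expression.items()
}

# GPR of the form (Gene1 OR Gene2) AND (Gene3).
gpr = [["Gene1", "Gene2"], ["Gene3"]]

print(reaction_penalty(gpr, leaf_weights, scaling_factor=1.0))  # 1.5
```

In this made-up example the isozyme pair averages to 1.5, which exceeds the weight of Gene3, so the leaf penalty weight is 1.5 at a scaling factor of 1. Setting the scaling factor to 0 returns 1 for any reaction, which corresponds to the unweighted pFBA case.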
For reaction/tissue combinations that have no corresponding GPR, we explored setting the penalty weights to 1 or a value calculated from the median penalty weight assigned to reactions in the same tissue (for details, see Appendix B, Supplementary Methods).

3.4.4. Optimization

The optimization done in this paper is a variation on pFBA, which finds the flux map(s) satisfying imposed constraints with minimum total flux through the network (Lewis et al., 2010). The minimization of total flux (Eq. 1) is subject to the following constraints:

$$S v = 0 \qquad (5)$$

$$LB_j \le v_j \le UB_j \qquad (6)$$

$$v_{biomass(tissue)} = v_{fixed\ biomass(tissue)} \qquad (7)$$

Where S is the stoichiometric matrix of the metabolic network being modeled, v is the vector of all fluxes, LB and UB are the vectors of all lower and upper bound constraints, and vbiomass(tissue) and vfixed biomass(tissue) are the biomass flux for a given tissue and the defined biomass constraint for that tissue, respectively. Eq. 5 represents the steady state of all internal metabolites, Eq. 6 represents the bounds and reversibility constraints, and Eq. 7 represents the definition of biomass accumulation rates. All optimizations were done in the COnstraint-Based Reconstruction and Analysis (COBRA) Toolbox in MATLAB (Heirendt et al., 2019) using the Gurobi™ optimizer version 8.1.1 (Gurobi Optimization, LLC, 2021).

3.4.5. Error evaluation

We make the assumption that the 13C-MFA fluxes reported by Ma et al., (2014b) are the true in vivo metabolic fluxes and therefore regard the discrepancy between FBA-predicted fluxes and these 13C-MFA fluxes as a measure of error. Biomass accumulation (i.e., the difference in dry weight between a timepoint t and another timepoint t-1) was not reported by Ma et al., (2014b), but is the basis for the flux through the biomass equation in FBA.
To allow a comparison between our FBA-predicted fluxes and the MFA-estimated fluxes in Ma et al., (2014b), we set an arbitrary biomass flux of 0.01 g/hr through the leaf, stem, and root biomass reactions in both the day and night, similar to the approach taken by de Oliveira Dal’Molin et al., (2015). We then normalized our fluxes by multiplying them by a factor A calculated as the ratio of the measured leaf CO2 uptake from Ma et al., (2014b) and the net leaf CO2 uptake in our FBA flux map. A weighted average error for each FBA-predicted flux map was then obtained using the following expression:

$$\sum_{j \in Measured} \left( \frac{\left| (v_j^p \cdot A) - v_j^m \right|}{\left| v_j^m \right|} \cdot \frac{\left| v_j^m \right|}{\sum_{j \in Measured} \left| v_j^m \right|} \right) \qquad (8)$$

Where vjp and vjm are the FBA-predicted and MFA-estimated fluxes of a flux j and A is the normalization factor previously described. We calculated weighted average errors rather than just average errors because small absolute differences between FBA-predicted and MFA-estimated flux values can correspond to extremely large % error values when the MFA-estimated fluxes are small. Additional details on the error evaluation can be found in Appendix B, Supplementary Methods.

We quantified the maximum/minimum weighted average errors of each flux map using Flux Variability Analysis (FVA) (Mahadevan and Schilling, 2003). For details, see Appendix B, Supplementary Methods.

3.5. Results

3.5.1. The application of gene expression penalty weights reliably reduces discrepancies between FBA-predicted and MFA-estimated fluxes

Table 3.1. Weighted average % error values calculated from weighted vs. unweighted flux maps for transcriptomic and proteomic datasets from Mergner et al., (2020) and Klepikova et al., (2016). Values are ranges spanning the lowest and highest possible error values given the results of Flux Variability Analysis. Weighted average error values were calculated from flux maps generated using a scaling factor of 1.

| Dataset | Light level | Weighted average error (%), no gene expression weights | Weighted average error (%), with penalty weights |
|---|---|---|---|
| Mergner et al. Transcriptome | High | 169 – 180 | 14.7 – 17.1 |
| Mergner et al. Transcriptome | Low | 93.8 – 103 | 14.9 – 18.1 |
| Mergner et al. Proteome | High | 169 – 180 | 10.9 – 13.4 |
| Mergner et al. Proteome | Low | 93.8 – 103 | 8.74 – 10.9 |
| Klepikova et al. Transcriptome | High | 169 – 180 | 14.8 – 17.4 |
| Klepikova et al. Transcriptome | Low | 93.8 – 103 | 19.3 – 21.7 |

Predicted flux maps were generated for a multi-tissue diel model of Arabidopsis thaliana’s central metabolism using flux balance analysis in which the sum of all the metabolic and transport fluxes required for steady state growth is minimized, with each flux being multiplied by a penalty weight that was derived from the relative expression of the gene(s) involved in conducting that flux (see Methods). Penalty weights for each reaction were calculated from RNA-seq (Klepikova et al., 2016; Mergner et al., 2020) and proteomic (Mergner et al., 2020) datasets using the relative expression of each gene in the different tissues. The weighted average % error between these flux maps and 13C-MFA estimates from Ma et al., (2014b) was used to quantify the accuracy of these FBA predictions, as compared to the accuracy of flux maps generated by pFBA (Lewis et al., 2010) alone.

The flux maps arrived at after the application of either transcriptomic or proteomic penalty weights show greater agreement, as measured by the weighted average % error, with 13C-MFA estimates than the results from pFBA alone (Table 3.1). These reductions in error are substantial and statistically significant at α = 0.01; they are consistent across comparisons against two different flux maps (high-light and low-light conditions) and are sustained across a range of assumed ratios of starch to sucrose production and oxygenase to carboxylase fluxes through rubisco (vo/vc).
Marked reductions in error are seen whether one uses the transcriptomic or proteomic tissue-atlas datasets from Mergner et al., (2020) or Klepikova et al., (2016), so the improvement in flux predictions does not depend on the values obtained in a specific gene expression dataset or type.

We wanted to confirm that these reductions in error are in fact dependent on penalty weights calculated from gene expression data and not an artifact of the weighting procedure itself. Indeed, previous studies have used the application of randomized weights as a method of exploring different possible flux modes in a plant metabolic network (Cheung et al., 2015). We found that substituting the leaf for the root proteomic dataset, and vice-versa, resulted in no reduction in weighted average error (Appendix B, Table S3.1) compared to pFBA. Neither did randomization of the penalty weight vector and subsequent optimization. The mean of the weighted average errors of 50 high-light condition flux maps generated with independent randomized penalty weight vectors at a scaling factor of 1 was 201%, versus the unweighted error value of 169-180% for that condition.

Figure 3.1: Percent errors of specific reactions in central metabolism before (A) and after (B) gene expression weight application. The error values in (A) are the lowest possible given FVA results and the values in (B) are the highest possible given FVA results. We see substantial decreases in errors associated with central carbon assimilation, as well as starch and sucrose synthesis. Since the 13C-MFA estimated fluxes from Ma et al., (2014) do not feature the flux from ADPG to Starch, this flux lacks an estimated error and is therefore shown in black. Flux values are relative to the lowest flux in the network.

3.5.2. Increases in agreement between FBA-predicted and MFA-estimated fluxes are broadly distributed across central metabolism

Although there is variation among individual fluxes in the degree to which omic data integration improves agreement between predicted and experimentally derived values, the reduction in weighted error as a result of penalty weight application is distributed broadly across the fluxes for which 13C-MFA estimates are available (Fig. 3.1). If, for example, the improvement were due to a substantial decrease in error for one or a small number of high-flux reactions and a negligible decrease or even increase in error for other reactions, the overall finding would be less striking and potentially less broadly applicable. The reductions in error are consistent not only across metabolic subsystems within a single FBA flux map, but also across alternative stoichiometric network structures. Initial pFBA-derived solutions for a model identical to that used to generate the other predictions, except with unconstrained uptake and discharge of protons from root tissue, show similar reductions in error (Appendix B, Table S3.2). Upon application of penalty weights, this model converges to a similar value of weighted average error and linear correlation as other model configurations.

Table 3.2: Measures of carbon and energy utilization derived from the predicted flux maps with and without penalty weights applied. Reference values: a, (Ma et al., 2014b); b, (Kramer et al., 2004); c, (Weraduwage et al., 2015). (b) and (c) reference values are not associated with a particular light level.

| Dataset used for weighting | Light | Rubisco flux ÷ net CO2 assimilation | Photorespiratory CO2 loss / net CO2 assimilation (%) | Cyclic/Linear Electron Flow | % of leaf daytime CO2 assimilation going to biomass |
|---|---|---|---|---|---|
| None | High | 2.86 | 62 | 24% | 43 |
| None | Low | 1.85 | 26 | 31% | 54 |
| Mergner et al. Protein | High | 1.29 | 26 | 20% | 18 |
| Mergner et al. Protein | Low | 1.17 | 14 | 15% | 26 |
| Mergner et al. Transcripts | High | 1.20 | 25 | 21% | 18 |
| Mergner et al. Transcripts | Low | 1.15 | 14 | 27% | 33 |
| Klepikova et al. Transcripts | High | 1.30 | 27 | 17% | 19 |
| Klepikova et al. Transcripts | Low | 1.25 | 15 | 14% | 31 |
| Reference values | High | 1.28a | 28a | | |
| Reference values | Low | 1.17a | 16a | | |
| Reference values | Not light-specific | | | 13%b | 56%c |

3.5.3. Error reductions are a function of the scaling factor parameter and are improved by the application of a tissue-specific median weight for reactions lacking Gene Protein Reaction terms

The magnitude of the penalty weights calculated and applied by the present method depends on the magnitude of the scaling factor term (Eq. 2). The increased agreement between the FBA-predicted and MFA-estimated flux maps only manifests in the majority of cases for scaling factors of 0.05-0.1 or greater (Fig 3.2). We also note that the relationship between the scaling factor value and the improved agreement is monotonic – that is, we do not see erratic increases and decreases as we increase the scaling factor value and, by extension, the strength of the assumed relationship between flux and gene expression. The necessity of a non-negligible scaling factor, the consistency of error improvement as the scaling factor is increased, and the similarity in the pattern of error improvement across multiple datasets as seen in Fig 3.2 all suggest that real biological signal related to the partitioning of metabolic activity across the plant’s tissues is being extracted from the gene expression datasets. Finally, we observe that the flux maps generated using penalty weights derived from the Mergner et al., (2020) proteomic dataset have noticeably better weighted average errors than flux maps generated using the transcriptomic datasets (Table 3.1; Fig. 3.2). This is consistent with the closer relationship between measured protein levels and metabolic fluxes than between transcripts and fluxes.
It is also consistent with at least one other study’s attempts at integrating gene expression data into FBA in E. coli (Tian and Reed, 2018).

Although the presented method does not involve fitting the scaling factor parameter using goodness-of-fit to the 13C-MFA fluxes, in Fig 3.1 and Tables 3.1-3.2, we show results from a scaling factor of 1 because it falls in the plateau of low average error values we see in Fig 3.2. There are no independent 13C-MFA datasets of this system against which to evaluate whether a scaling factor value of 1 generalizes well outside of the datasets considered in the present study. However, Szecowka et al., (2013) do report fluxes from illuminated A. thaliana leaves estimated by kinetic flux profiling. The flux map generated using vo/vc and starch:sucrose synthesis constraints from that study without any omic weighting has a weighted average error of 108%; this error drops to between 6-9% when penalty weights generated with a scaling factor of 1 are applied (Appendix B, Table S3.6; Dataset S5).

In our initial formulation of the algorithm for generating gene expression derived penalty weights, the weight of all reactions with no associated GPR was set to 1, since this is the implicit value of the coefficient for all reactions in a standard pFBA optimization. Since this runs the risk of introducing a systematic bias against using reactions that have associated GPRs, we attempted to counteract this risk by assigning all reactions lacking a GPR a penalty weight corresponding to the median penalty weight of all weighted reactions in the tissue in which those reactions are found. Comparing the results with and without the tissue-specific median penalty weights for reactions without GPRs, we see modest improvements in the weighted average errors from a scaling factor of 1 onwards when using the transcriptomic and proteomic datasets from Mergner et al., (2020) (Fig. 3.2), though the effect is not large, indicating that our method is robust to including or omitting the tissue-specific median weight correction.

Figure 3.2: Weighted average errors of FBA predictions compared with MFA-estimated flux maps as a function of scaling factor value, light-level, and application of a tissue-specific median weight correction. Panels show weighted average errors of flux maps generated using (A) low-light constraints and with a tissue-specific median correction applied, (B) low-light constraints and without a tissue-specific median correction applied, (C) high-light constraints and with a tissue-specific median correction applied, and (D) high-light constraints and without a tissue-specific median correction applied. “M Protein” and “M Transcripts” refer to flux maps generated using proteomic- and RNA-seq-derived weights from Mergner et al., (2020). “K Transcripts” refers to flux maps generated using RNA-seq derived weights from Klepikova et al., (2016). Upper and lower bars on each point represent the highest and lowest possible weighted average errors given FVA results, and the points themselves represent the average of these values.

3.5.4. Changes in the carbon and energy economy upon application of gene expression weights

In addition to improving quantitative agreement between the FBA-predicted and MFA-estimated flux maps, the gene expression weighting procedure also generates flux maps that present a substantially different picture of carbon and energy metabolism in Arabidopsis leaves. In both high and low light FBA-predicted fluxes there is a substantial decrease in leaf mitochondrial Electron Transport Chain (ETC) activity and overall flux in mitochondria-localized reactions in the light relative to nighttime ETC activity and overall flux (Appendix B, Table S3.3). MFA and other recent work further point to low TCA cycle fluxes in photosynthesizing leaves (Tcherkez et al., 2005; Xu et al., 2021a; Xu et al., 2022).
This decrease in mitochondrial activity goes hand-in-hand with a predicted decrease in the use of unusually high fluxes related to proline metabolism to indirectly support the consumption of excess reductant produced via the light reactions of photosynthesis. Alongside this decrease in mitochondrial activity is a decrease in the ratio of cyclic electron flow (CEF) to linear electron flow (LEF) in the chloroplast (Table 3.2). Although reliable empirical measurements of this CEF/LEF ratio are difficult to obtain, previous studies have shown that a C3 plant like Arabidopsis relying on cyclic electron flow to bring the ratio of ATP/NADPH produced up to that needed for normal growth would have a CEF amounting to ~13% of LEF (Kramer et al., 2004). Due to the presence of other balancing mechanisms, such as the malate valve (Selinski and Scheibe, 2019), this 13% value would represent an upper bound on stoichiometrically predicted values for CEF/LEF. Application of gene expression data decreases the CEF/LEF ratios in all but one FBA-predicted flux map to values much closer to the expected ~13% upper bound than are predicted using conventional pFBA (Table 3.2). Ma et al., (2014) reported MFA-derived estimates of %vpr, or the rate of photorespiratory CO2 release via glyoxylate decarboxylation as a % of CO2 assimilation, as well as the ratio of rubisco carboxylation flux to net CO2 assimilation in the leaf. The unweighted flux predictions for the high and low light conditions disagree substantially with these estimates (Table 3.2). However, application of gene expression weights consistently brings estimates of these parameters into close agreement with MFA-derived values. The integration of gene expression also changes the predicted efficiency with which Arabidopsis converts atmospheric CO2 into biomass (Table 3.2). For comparison with these predicted efficiencies, we used the empirical A. 
thaliana biomass, leaf area, and gas exchange data reported by Weraduwage et al., (2015) to calculate that approximately 56% of the net CO2 assimilation in illuminated leaves ends up incorporated into biomass. This is closer to the value in our unweighted flux predictions than to that in our weighted flux predictions, although it should be noted that these data were gathered from a hydroponic system.

3.6. Discussion

13C-MFA is broadly accepted as being the most reliable method for estimating metabolic flux maps in vivo due to its ability to make use of substantial amounts of isotopic labeling data to arrive at well-supported flux maps in small- to medium-scale networks (Antoniewicz, 2015). However, the technique’s utility is limited by the substantial experimental effort that goes into the generation of each individual flux map. FBA, with its requirement of much less experimental data, has become the method of choice for more exploratory or predictive metabolic modeling studies. The implicit assumption is usually that the predictions of FBA (or at least the range of its predictions in cases where a unique solution is not provided) agree with those we would arrive at if we were able to conduct a 13C-MFA study. This makes both the choice of optimization procedure in FBA and the validation of FBA models against MFA results vitally important. The method presented here, by bringing FBA-predicted fluxes into line with MFA estimates, represents a step in the direction of higher-confidence FBA flux maps. One limitation of, as well as motivation for, the present study is the lack of a large set of 13C-MFA datasets in plants and other multi-tissue eukaryotic systems. Systems like E. coli have multi-omic datasets consisting of transcriptomic, proteomic, and fluxomic measurements (Ishii et al., 2007) that have been utilized to empirically infer the relationship between gene expression and metabolic fluxes.
This empirical training can then be used to more accurately predict fluxes in new contexts (Tian and Reed, 2018). The sparsity of 13C-MFA data in more complex systems makes such an approach currently impossible. A noteworthy theoretical aspect of the present approach is its simplicity, the only variable parameter being a single scaling factor that controls the magnitude of the penalty weights. The method assumes a consistent value relating the relative abundances of transcripts or proteins in different tissues to the “preference” of an organism to partition flux among particular reactions. That this assumption can result in substantial improvement in error is of great interest in light of the complexity of the relationship between measures of gene expression (transcriptomic and proteomic abundances) and flux. Particularly when making biotechnological interventions in a system to modify its metabolism, there is often an assumed strong linear relationship between transcription, translation, and, ultimately, metabolic flux, but the reality is rarely so simple. Although moderate correlations between transcript and protein abundances have been demonstrated across many systems, the degree of correlation varies across systems and experimental contexts (Maier et al., 2009; Liu et al., 2016). The correlation between these data types and rates of central metabolic reactions, which carry the large majority of total metabolic flux, is weaker still (Kuile and Westerhoff, 2001). Some previous studies found that changes in the gene expression related to individual reactions typically do not correlate well with changes in fluxes (Schwender et al., 2014; Tian and Reed, 2018), with some central metabolic fluxes in particular showing a negative correlation between changes in gene expression and flux.
In both cases, gene expression data related to reactions were compared within the same cell type or tissue; in our study, we instead compare inter-tissue abundances, mirroring the long-standing practice in the literature of inferring relative metabolic activity in different tissues by their transcript and protein investment in relevant pathway steps. It may be that only by considering gene expression on an inter-tissue basis in the context of the entire complex stoichiometric network underlying metabolism can predictive gains from including gene expression evidence be properly realized. Future work should aim to expand the number of available datasets, and the experimental conditions and genotypes for which they are gathered, in order to enable more thorough evaluation of methods like the one presented in this paper. Indeed, evaluating the presented method requires 13C-MFA fluxes, multi-tissue omic data, and a genome-scale model all for the same biological system, which, to our knowledge, is only possible for A. thaliana. Building on the work of Ma et al., (2014), experimental improvements and refinements of the underlying network architecture of central carbon metabolism have been introduced in the context of 13C-MFA in Camelina sativa (Xu et al., 2021a; Xu et al., 2022) and Nicotiana tabacum (Chu et al., 2022). In the present study the Ma et al., (2014) flux maps are used without change, and we adopted a highly curated A. thaliana genome-scale model from which to construct the whole-plant model. This approach precluded the possibility of our reanalyzing the MFA-estimated flux map, or biasing the construction of a purpose-built genome-scale model, in ways that would make the MFA-to-FBA comparison more favorable. However, in future studies a combination of MFA network refinements, expanded datasets, and further improvements in the flux estimation procedures holds promise for improving the fidelity of the 13C-MFA comparison data.
On the FBA side, the use of more detailed growth and composition measurements for FBA along with more detailed representation of different tissue types will potentially allow for more biologically accurate and representative FBA flux map predictions. These improvements in both MFA-estimation and FBA-prediction of flux maps, along with an expansion in the number of available 13C-MFA datasets against which to compare FBA predictions, will allow for more extensive validation of the method described in this paper as well as other methods aiming to incorporate omic datasets into flux prediction. A distinct aspect of the proposed method is its demonstrated ability to bring FBA-predicted fluxes in line with MFA-estimated fluxes across multiple input datasets, model architectures, and using multiple independent gene expression datasets. Our hope is that methods for incorporating transcriptomic and proteomic data may advance this field to the point where FBA-predicted flux maps can be used with high confidence for practical engineering goals. This, combined with the automated reconstruction of GEMs from genomic and biochemical databases (Saha et al., 2014), suggests a future with rapid turnaround from the initial identification of an organism of interest to metabolic flux predictions and rational genetic engineering to achieve biotechnological aims.

3.7 Acknowledgments

We would like to thank Dr. Doug Allen for permission to adapt Figure 3 from Ma et al., (2014) for use in Figure 3.1 in this publication. Figure 3.1 was created with BioRender.com.

3.8 Funding

This research was supported by the Office of Science (BER), U.S. Department of Energy, Grant no. DE-SC0018269 (J.A.M.K. and Y.S-H.). This work is supported, in part, by the NSF Research Traineeship Program (Grant DGE-1828149) to J.A.M.K. This publication was also made possible by a predoctoral training award to J.A.M.K. from Grant T32-GM110523 from National Institute of General Medical Sciences (NIGMS) of the NIH.
Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIGMS or NIH.

REFERENCES

Antoniewicz MR (2015) Methods and advances in metabolic flux analysis: a mini-review. Journal of Industrial Microbiology and Biotechnology 42: 317–325
Arnold A, Nikoloski Z (2014) Bottom-up metabolic reconstruction of arabidopsis and its application to determining the metabolic costs of enzyme production. Plant Physiology 165: 1380–1391
Boyle NR, Sengupta N, Morgan JA (2017) Metabolic flux analysis of heterotrophic growth in Chlamydomonas reinhardtii. PLoS ONE 12: 1–23
Burgard AP, Pharkya P, Maranas CD (2003) OptKnock: A Bilevel Programming Framework for Identifying Gene Knockout Strategies for Microbial Strain Optimization. Biotechnology and Bioengineering 84: 647–657
Chen X, Alonso AP, Allen DK, Reed JL, Shachar-Hill Y (2011) Synergy between 13C-metabolic flux analysis and flux balance analysis for understanding metabolic adaption to anaerobiosis in E. coli. Metabolic Engineering 13: 38–48
Cheung CYM, Ratcliffe RG, Sweetlove LJ (2015) A Method of Accounting for Enzyme Costs in Flux Balance Analysis Reveals Alternative Pathways and Metabolite Stores in an Illuminated Arabidopsis Leaf. Plant Physiology 169: 1671–82
Cheung CYM, Williams TCR, Poolman MG, Fell DA, Ratcliffe RG, Sweetlove LJ (2013) A method for accounting for maintenance costs in flux balance analysis improves the prediction of plant cell metabolic phenotypes under stress conditions. Plant Journal 75: 1050–1061
Chu KL, Koley S, Jenkins LM, Bailey SR, Kambhampati S, Foley K, Arp JJ, Morley SA, Czymmek KJ, Bates PD, et al (2022) Metabolic flux analysis of the non-transitory starch tradeoff for lipid production in mature tobacco leaves. Metabolic Engineering 69: 231–248
Dal’Molin CG de O, Quek LE, Palfreyman RW, Brumbley SM, Nielsen LK (2010a) AraGEM, a genome-scale reconstruction of the primary metabolic network in Arabidopsis.
Plant Physiology 152: 579–589
Dal’Molin CG de O, Quek LE, Palfreyman RW, Brumbley SM, Nielsen LK (2010b) C4GEM, a genome-scale metabolic model to study C4 plant metabolism. Plant Physiology 154: 1871–1885
Gleizer S, Ben-Nissan R, Bar-On YM, Antonovsky N, Noor E, Zohar Y, Jona G, Krieger E, Shamshoum M, Bar-Even A, et al (2019) Conversion of Escherichia coli to Generate All Biomass Carbon from CO2. Cell 179: 1255-1263.e12
Gurobi Optimization, LLC (2021) Gurobi Optimizer Reference Manual.
Hay J, Schwender J (2011) Metabolic network reconstruction and flux variability analysis of storage synthesis in developing oilseed rape (Brassica napus L.) embryos. Plant Journal 67: 526–541
Heirendt L, Arreckx S, Pfau T, Mendoza SN, Richelle A, Heinken A, Haraldsdóttir HS, Wachowiak J, Keating SM, Vlasov V, et al (2019) Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v.3.0. Nature Protocols 14: 639–702
Ishii N, Nakahigashi K, Baba T, Robert M, Soga T, Kanai A, Hirasawa T, Naba M, Hirai K, Hoque A, et al (2007) Multiple high-throughput analyses monitor the response of E. coli to perturbations. Science 316: 593–597
Jenior ML, Moutinho TJ, Dougherty BV, Papin JA (2020) Transcriptome-guided parsimonious flux analysis improves predictions with metabolic networks in complex environments. PLoS Computational Biology. doi: 10.1371/journal.pcbi.1007099
Klepikova AV, Kasianov AS, Gerasimov ES, Logacheva MD, Penin AA (2016) A high resolution map of the Arabidopsis thaliana developmental transcriptome based on RNA-seq profiling. Plant Journal 88: 1058–1070
Kramer DM, Avenson TJ, Edwards GE (2004) Dynamic flexibility in the light reactions of photosynthesis governed by both electron and proton transfer reactions. Trends in Plant Science 9: 349–357
Krey JF, Wilmarth PA, Shin J-B, Klimek J, Sherman NE, Jeffery ED, Choi D, David LL, Barr-Gillespie PG (2014) Accurate label-free protein quantitation with high- and low-resolution mass spectrometers.
Journal of Proteome Research 13: 1034–1044
Kuile BH, Westerhoff HV (2001) Transcriptome meets metabolome: Hierarchical and metabolic regulation of the glycolytic pathway. FEBS Letters 500: 169–171
Lee KH, Park JH, Kim TY, Kim HU, Lee SY (2007) Systems metabolic engineering of Escherichia coli for L-threonine production. Molecular Systems Biology. doi: 10.1038/msb4100196
Lewis NE, Hixson KK, Conrad TM, Lerman JA, Charusanti P, Polpitiya AD, Adkins JN, Schramm G, Purvine SO, Lopez-Ferrer D, et al (2010) Omic data from evolved E. coli are consistent with computed optimal growth from genome-scale models. Molecular Systems Biology. doi: 10.1038/msb.2010.47
Liu Y, Beyer A, Aebersold R (2016) On the Dependency of Cellular Protein Levels on mRNA Abundance. Cell 165: 535–550
Ma F, Jazmin LJ, Young JD, Allen DK (2014a) Isotopically nonstationary 13C flux analysis of changes in Arabidopsis thaliana leaf metabolism due to high light acclimation. Proceedings of the National Academy of Sciences of the United States of America 111: 16967–16972
Ma F, Jazmin LJ, Young JD, Allen DK (2014b) Isotopically nonstationary 13C flux analysis of changes in Arabidopsis thaliana leaf metabolism due to high light acclimation. Proceedings of the National Academy of Sciences of the United States of America 111: 16967–16972
Machado D, Herrgård M (2014) Systematic Evaluation of Methods for Integration of Transcriptomic Data into Constraint-Based Models of Metabolism. PLoS Computational Biology. doi: 10.1371/journal.pcbi.1003580
Mahadevan R, Schilling CH (2003) The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metabolic Engineering 5: 264–276
Maier T, Güell M, Serrano L (2009) Correlation of mRNA and protein in complex biological samples. FEBS Letters 583: 3966–3973
Mergner J, Frejno M, List M, Papacek M, Chen X, Chaudhary A, Samaras P, Richter S, Shikata H, Messerer M, et al (2020) Mass-spectrometry-based draft of the Arabidopsis proteome.
Nature 579: 409–414
de Oliveira Dal’Molin CG, Quek LE, Saa PA, Nielsen LK (2015) A multi-tissue genome-scale metabolic modeling framework for the analysis of whole plant systems. Frontiers in Plant Science 6: 1–12
Orth JD, Thiele I, Palsson BO (2010) What is flux balance analysis? Nature Biotechnology 28: 245–248
Pandey V, Hadadi N, Hatzimanikatis V (2019) Enhanced flux prediction by integrating relative expression and relative metabolite abundance into thermodynamically consistent metabolic models. PLOS Computational Biology 15: 1–23
Park JH, Lee KH, Kim TY, Lee SY (2007) Metabolic engineering of Escherichia coli for the production of L-valine based on transcriptome analysis and in silico gene knockout simulation. Proceedings of the National Academy of Sciences of the United States of America 104: 7797–7802
Poolman MG, Miguet L, Sweetlove LJ, Fell DA (2009) A genome-scale metabolic model of Arabidopsis and some of its properties. Plant Physiology 151: 1570–1581
Ratcliffe RG, Shachar-Hill Y (2006) Measuring multiple fluxes through plant metabolic networks. Plant Journal 45: 490–511
Ravi S, Gunawan R (2021) ΔFBA—Predicting metabolic flux alterations using genome-scale metabolic models and differential transcriptomic data. PLOS Computational Biology 17: e1009589
Saha R, Chowdhury A, Maranas CD (2014) Recent advances in the reconstruction of metabolic models and integration of omics data. Current Opinion in Biotechnology 29: 39–45
Saha R, Suthers PF, Maranas CD (2011) Zea mays irs1563: A comprehensive genome-scale metabolic reconstruction of maize metabolism. PLoS ONE. doi: 10.1371/journal.pone.0021784
Schwender J, König C, Klapperstück M, Heinzel N, Munz E, Hebbelmann I, Hay JO, Denolf P, De Bodt S, Redestig H, et al (2014) Transcript abundance on its own cannot be used to infer fluxes in central metabolism. Frontiers in Plant Science 5: 1–16
Selinski J, Scheibe R (2019) Malate valves: old shuttles with new perspectives.
Plant Biology 21: 21–30
Shaw R, Cheung CYM (2020) Multi-tissue to whole plant metabolic modelling. Cellular and Molecular Life Sciences 77: 489–495
Szecowka M, Heise R, Tohge T, Nunes-Nesi A, Vosloh D, Huege J, Feil R, Lunn J, Nikoloski Z, Stitt M, et al (2013) Metabolic fluxes in an illuminated Arabidopsis rosette. Plant Cell 25: 694–714
Tcherkez G, Cornic G, Bligny R, Gout E, Ghashghaie J (2005) In vivo respiratory metabolism of illuminated leaves. Plant Physiology 138: 1596–1606
Tepper N, Shlomi T (2009) Predicting metabolic engineering knockout strategies for chemical production: Accounting for competing pathways. Bioinformatics 26: 536–543
Tian M, Reed JL (2018) Integrating proteomic or transcriptomic data into metabolic models using linear bound flux balance analysis. Bioinformatics 34: 3882–3888
Vijayakumar S, Conway M, Lio P, Angione C (2017) Seeing the wood for the trees: A forest of methods for optimization and omic-network integration in metabolic modelling. Briefings in Bioinformatics 19: 1218–1235
Weraduwage SM, Chen J, Anozie FC, Morales A, Weise SE, Sharkey TD (2015) The relationship between leaf area growth and biomass accumulation in Arabidopsis thaliana. 6: 1–21
Williams TCR, Poolman MG, Howden AJM, Schwarzlander M, Fell DA, Ratcliffe RG, Sweetlove LJ (2010) A Genome-scale metabolic model accurately predicts fluxes in central carbon metabolism under stress conditions. Plant Physiology 154: 311–323
Xu Y, Fu X, Sharkey TD, Shachar-Hill Y, Walker BJ (2021) The metabolic origins of non-photorespiratory CO2 release during photosynthesis: a metabolic flux analysis. Plant Physiology 1–18
Xu Y, Wieloch T, Kaste JAM, Shachar-Hill Y, Sharkey TD (2022) Reimport of carbon from cytosolic and vacuolar sugar pools into the Calvin-Benson cycle explains photosynthesis labeling anomalies.
Proceedings of the National Academy of Sciences 119: e2121531119

APPENDIX B: Supplemental Material for Chapter 3

SUPPLEMENTARY METHODS

Datasets and Omic Data Processing

Sample IDs and SRR numbers for all transcriptomic and proteomic datasets used in this study can be found in Appendix B, Table S3.5. The raw RNA-seq data for all 137 samples in Klepikova et al., (2016) were trimmed using fastp (Chen et al., 2018) and then aligned to the TAIR10 Arabidopsis thaliana genome obtained from Ensembl Plants using salmon (Lamesch et al., 2012; Patro et al., 2017; Howe et al., 2020). RNA-seq reads from Mergner et al., (2020) were taken directly from the published supplemental material. Library normalization was performed on the RNA-seq datasets from both Klepikova et al., (2016) and Mergner et al., (2020) using the DESeq2 procedure (Love et al., 2014), and averages of the normalized transcript abundance values across replicates from Klepikova et al., (2016) were used. Intensity-based Absolute Quantification (IBAQ) (Schwanhäusser et al., 2011) values for the proteomic data from Mergner et al., (2020) were normalized by dividing them by the sum total intensity across all protein signals measured for a given sample.

Error Evaluation and Statistical Analysis

To evaluate whether the error values for measured reactions in individual flux maps generated using gene expression weights were statistically significantly different from the errors without application of these weights, the Wilcoxon signed-rank test was used (Wilcoxon, 1992).
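As an illustration of the paired comparison described above, the following minimal Python sketch computes the signed-rank sums for two sets of per-reaction errors. It is not the code used in this study: the function name and inputs are hypothetical, zero differences are simply dropped, tied absolute differences receive their average rank, and no p-value is calculated (a statistics package would normally supply that).

```python
from typing import List, Tuple


def signed_rank_statistic(errors_a: List[float],
                          errors_b: List[float]) -> Tuple[float, float]:
    """Return (W_plus, W_minus) for paired samples (hypothetical helper).

    errors_a and errors_b hold the errors of the same reactions under two
    conditions, e.g. without and with gene expression weights.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("paired samples must have the same length")
    # Pairs with a zero difference carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    # Visit the differences in order of increasing absolute value.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        # Extend the block over tied absolute differences.
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus
```

A large imbalance between the two sums indicates that one condition gives systematically larger errors than the other across the measured reactions.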
The Bonferroni-Holm multiple testing correction (Holm, 1978) was used to correct the family-wise α of all hypothesis tests to 0.01, where each hypothesis test is asking, by the Wilcoxon signed-rank test, whether the differences between a given FBA-predicted flux map (e.g., the flux map generated using protein-derived gene expression weights and a scaling factor of 1) and our MFA-estimated flux map could be attributed to random chance. In the high light condition 13C-MFA flux map from Ma et al., (2014b), the flux through the malate dehydrogenase reaction was reported as exactly 0. Because this made its error undefined, it was excluded from the high light condition’s error calculation. Reactions carrying zero flux were likewise excluded from error calculations when comparing FBA predictions against fluxes from the Szecowka et al., (2013) dataset.

Flux Variability Analysis

Flux Variability Analysis (FVA) (Mahadevan and Schilling, 2003) was used to determine the maximum and minimum values possible for each of the fluxes included in the error calculation, subject to the following constraint:

$$ c \cdot v = opt \qquad (1) $$

where c is the vector of all weight coefficients, v is the vector of all fluxes, and opt is the value of the objective function determined by the initial optimization procedure. As shown in Appendix B, Dataset S3.3, some of the fluxes included in the error function are not uniquely defined, such that they can vary up and down without violating Eq. 1. To account for this variation, maximum and minimum weighted average errors were calculated, where the minimum and maximum errors correspond to the smallest and largest weighted average errors possible for a flux map given the maximum/minimum values for all evaluated fluxes. FVA was performed in MATLAB using the COBRA Toolbox (Heirendt et al., 2019) and Gurobi™ 8.1.1 (Gurobi Optimization, LLC, 2021).
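The minimum and maximum error calculation can be sketched as follows. This is an illustrative Python sketch rather than the MATLAB code used in this study. It assumes that each reaction's error is an absolute percent deviation from its MFA estimate, that each flux can sit anywhere within its FVA range independently of the others, and that the per-reaction weights (a hypothetical argument here) are supplied by the caller.

```python
from typing import List, Sequence, Tuple


def error_range(measured: float, v_min: float, v_max: float) -> Tuple[float, float]:
    """Smallest and largest percent error of a flux confined to [v_min, v_max]."""
    if measured == 0:
        # Mirrors the exclusion of zero reference fluxes described in the text.
        raise ValueError("percent error is undefined for a zero reference flux")
    # The smallest deviation is zero when the reference lies inside the range.
    if v_min <= measured <= v_max:
        closest = 0.0
    else:
        closest = min(abs(v_min - measured), abs(v_max - measured))
    farthest = max(abs(v_min - measured), abs(v_max - measured))
    scale = 100.0 / abs(measured)
    return closest * scale, farthest * scale


def weighted_error_bounds(fluxes: Sequence[Tuple[float, float, float]],
                          weights: List[float]) -> Tuple[float, float]:
    """Bounds on the weighted average error.

    fluxes  : (measured, fva_min, fva_max) for each evaluated reaction
    weights : assumed, caller-supplied weight for each reaction
    """
    if len(fluxes) != len(weights):
        raise ValueError("one weight is required per evaluated flux")
    total = sum(weights)
    lows, highs = zip(*(error_range(*f) for f in fluxes))
    low = sum(w * e for w, e in zip(weights, lows)) / total
    high = sum(w * e for w, e in zip(weights, highs)) / total
    return low, high
```

Treating each flux independently gives the widest possible interval; the fluxes in a real network are coupled, so the attainable range may be narrower.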
Note that we encountered infeasible solutions in some cases when using a scaling factor of 1000 and omic datasets with the leaf and root data swapped; in such cases, the corresponding columns of the FVA results have been left blank.

Model Constraints

Light uptake and photosynthetic activity were restricted to the leaf tissue and mineral uptake was restricted to root tissue. Inter-tissue transport and day/night continuity of metabolites were defined and constrained as in Shaw and Cheung (2018), as were ATP and NADPH maintenance flux values. Biomass compositions of leaf, stem, and root were taken from de Oliveira Dal’Molin et al., (2015), based on Poorter and Bergkotte, (1992). Reactions were added to produce biomass components that appear in the de Oliveira Dal’Molin et al., (2015) biomass equations but not in the core metabolic model (Arnold and Nikoloski, 2014); this involved adding subnetworks of missing reactions for several components and single summary reactions for others. Cytosolic pentose phosphate pathway reactions were also added to the model. All reactions were converted to irreversible form, wherein all reversible reactions were converted to independent forward and reverse reactions, prior to solving. This ensures that all fluxes, including those representing the reverse flux of a reversible reaction, take values that are zero or positive. In order to generate predictions corresponding to the high-light and low-light flux maps reported by Ma et al., (2014), the vo/vc, or ratio of ribulose 1,5-bisphosphate carboxylase/oxygenase (rubisco) oxygenation activity to its carboxylation activity, and the ratio of starch to sucrose synthesis were both constrained to the values estimated in that study. vo/vc and starch to sucrose synthesis values were likewise constrained to the values estimated by Szecowka et al., (2013) when comparing FBA fluxes against that study.
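Two of the preprocessing steps above, the conversion to irreversible form and the imposition of a fixed flux ratio, can be illustrated with a short Python sketch. The data structures, reaction names, and function names are hypothetical and are not taken from the COBRA Toolbox or from the model used in this study; the sketch only shows the bookkeeping involved.

```python
from typing import Dict, Tuple

# A reaction is (stoichiometry, lower_bound, upper_bound), where stoichiometry
# maps metabolite name -> coefficient. This representation is an assumption
# made for illustration only.
Reaction = Tuple[Dict[str, float], float, float]


def split_reversible(reactions: Dict[str, Reaction]) -> Dict[str, Reaction]:
    """Return an equivalent model in which every flux is >= 0."""
    irreversible: Dict[str, Reaction] = {}
    for name, (stoich, lb, ub) in reactions.items():
        if lb >= 0:
            # Already irreversible in the forward direction; copy unchanged.
            irreversible[name] = (dict(stoich), lb, ub)
            continue
        # Forward half keeps the original direction, bounded by [0, ub].
        irreversible[name + "_f"] = (dict(stoich), 0.0, max(ub, 0.0))
        # Reverse half negates every coefficient; its bounds are the mirror
        # image of the negative part of the original range.
        reversed_stoich = {met: -coef for met, coef in stoich.items()}
        irreversible[name + "_r"] = (reversed_stoich, max(-ub, 0.0), -lb)
    return irreversible


def ratio_constraint(numerator: str, denominator: str,
                     ratio: float) -> Dict[str, float]:
    """Coefficients of the linear row  v_numerator - ratio * v_denominator = 0."""
    return {numerator: 1.0, denominator: -ratio}
```

For example, `ratio_constraint("rubisco_oxygenation", "rubisco_carboxylation", 0.29)` would encode a vo/vc of 0.29, the value listed for the low-light condition in Table S3.4, as one additional equality row in the linear program.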
TABLES

Table S3.1: Reductions in weighted average error with application of gene expression weights derived from gene expression data with incorrect tissue specification or randomized values.

Dataset | Light Level | Weighted Average Error Without Gene Expression Weights (%) | Weighted Average Error With Leaf/Root Flipped Gene Expression Weights (%)^a
Mergner et al. Transcriptome | High | 168 – 180 | 181 – 215
Mergner et al. Transcriptome | Low | 93.8 – 103 | 155 – 185
Mergner et al. Proteome | High | 168 – 180 | 249 – 295
Mergner et al. Proteome | Low | 93.8 – 103 | 131 – 160
Klepikova et al. Transcriptome | High | 168 – 180 | 87.9 – 109
Klepikova et al. Transcriptome | Low | 93.8 – 103 | 97.2 – 120

^a Weighted average errors are calculated from flux maps generated using a scaling factor of 1.

Table S3.2: Reductions in weighted average errors with an alternate model architecture allowing free uptake and discharge of protons from the root tissue compartment.

Dataset | Light Level | Weighted Average Error Without Gene Expression Weights (%) | Weighted Average Error With Gene Expression Weights (%)^a
Mergner et al. Transcriptome | High | 127 – 135 | 11.8 – 14.2
Mergner et al. Transcriptome | Low | 66.1 – 73.8 | 16.8 – 19.4
Mergner et al. Proteome | High | 127 – 135 | 10.5 – 12.8
Mergner et al. Proteome | Low | 66.1 – 73.8 | 8.88 – 11.1
Klepikova et al. Transcriptome | High | 127 – 135 | 13.7 – 16.5
Klepikova et al. Transcriptome | Low | 66.1 – 73.8 | 20.8 – 23.2

^a Weighted average errors are calculated from flux maps generated using a scaling factor of 1.

Table S3.3: Ratios of day vs. night leaf mitochondrial fluxes and Electron Transport Chain fluxes in flux maps with and without integration of gene expression evidence.

Dataset | Light Level | Total leaf mitochondrial flux, day vs. night, Without Gene Expression Weights | Total leaf mitochondrial flux, day vs. night, With Gene Expression Weights^a | Leaf mitochondrial ATP synthase flux, day vs. night, Without Gene Expression Weights | Leaf mitochondrial ATP synthase flux, day vs. night, With Gene Expression Weights^a
Mergner et al. Transcriptome | High | 1.17 | 0.144 | 1.15 | 0.0888
Mergner et al. Transcriptome | Low | 1.23 | 0.0625 | 1.20 | 1.94 × 10^-5
Mergner et al. Proteome | High | 1.17 | 0.0693 | 1.15 | 6.17 × 10^-5
Mergner et al. Proteome | Low | 1.23 | 0.0981 | 1.20 | 1.87 × 10^-5
Klepikova et al. Transcriptome | High | 1.16 | 0.179 | 1.15 | 0.121
Klepikova et al. Transcriptome | Low | 1.23 | 0.615 | 1.20 | 0.503

^a Values for weighted cases calculated from flux maps generated using a scaling factor of 1.

Table S3.4: Growth conditions and key constraints from transcriptomic, proteomic, 13C-MFA, and kinetic flux profiling datasets used in the present study.

Dataset | Type | Growth conditions | Tissue | Age | vo/vc | Starch/sucrose biosynthesis rate
Klepikova et al. (Klepikova et al., 2016) | Transcriptomic | Philips Master TL5 HO 54 W/840 lamps light source at a 27 cm distance; 22°C; 50% relative humidity; 16/8-h day/night cycle | Leaf | Anthesis of first flower | N/A | N/A
Klepikova et al. (Klepikova et al., 2016) | Transcriptomic | As above | Stem | Anthesis of first flower | N/A | N/A
Klepikova et al. (Klepikova et al., 2016) | Transcriptomic | As above | Root | 7th day after germination | N/A | N/A
Mergner et al. (Mergner et al., 2020) | Transcriptomic | Continuous white light; 22°C | Leaf | 22 days old | N/A | N/A
Mergner et al. (Mergner et al., 2020) | Transcriptomic | Continuous white light; 22°C | Stem | 30 days old | N/A | N/A
Mergner et al. (Mergner et al., 2020) | Transcriptomic | Continuous white light; 22°C | Root | 22 days old | N/A | N/A
Mergner et al. (Mergner et al., 2020) | Proteomic | Continuous white light; 22°C | Leaf | 22 days old | N/A | N/A
Mergner et al. (Mergner et al., 2020) | Proteomic | Continuous white light; 22°C | Stem | 30 days old | N/A | N/A
Mergner et al. (Mergner et al., 2020) | Proteomic | Continuous white light; 22°C | Root | 22 days old | N/A | N/A
Ma et al. Low Light Condition (Ma et al., 2014) | 13C-MFA | 200 µmol m-2 s-1; 22/18°C; 50% relative humidity; 16/8-h day/night cycles | Leaf | 28 days old | 0.29 | 0.26
Ma et al. High Light Condition (Ma et al., 2014) | 13C-MFA | 500 µmol m-2 s-1; 22/18°C; 50% relative humidity; 16/8-h day/night cycles | Leaf | 28 days old | 0.43 | 0.16
Szecowka et al. (Szecowka et al., 2013) | Kinetic flux profiling | 120 µmol m-2 s-1 irradiance; 22/20°C; 50% relative humidity; 8/16-h day/night cycles | Leaf | 35 days old | 0.4 | 0.45

Table S3.5: Reductions in weighted average errors when using constraints from the Szecowka et al. 2013 dataset and gene expression weights from different sources.

Gene expression dataset | Weighted Average Error Without Gene Expression Weights (%) | Weighted Average Error With Gene Expression Weights (%)^a
Mergner et al. Transcriptome | 107.6 – 107.8 | 6.09
Mergner et al. Proteome | 107.6 – 107.8 | 7.55
Klepikova et al. Transcriptome | 107.6 – 107.8 | 8.65

REFERENCES

Arnold A, Nikoloski Z (2014) Bottom-up metabolic reconstruction of arabidopsis and its application to determining the metabolic costs of enzyme production. Plant Physiology 165: 1380–1391
Chen S, Zhou Y, Chen Y, Gu J (2018) Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34: i884–i890
Gurobi Optimization, LLC (2021) Gurobi Optimizer Reference Manual.
Heirendt L, Arreckx S, Pfau T, Mendoza SN, Richelle A, Heinken A, Haraldsdóttir HS, Wachowiak J, Keating SM, Vlasov V, et al (2019) Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v.3.0. Nature Protocols 14: 639–702
Holm S (1978) A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 6: 65–70
Howe KL, Contreras-Moreira B, De Silva N, Maslen G, Akanni W, Allen J, Alvarez-Jarreta J, Barba M, Bolser DM, Cambell L, et al (2020) Ensembl Genomes 2020-enabling non-vertebrate genomic research. Nucleic Acids Research 48: D689–D695
Klepikova AV, Kasianov AS, Gerasimov ES, Logacheva MD, Penin AA (2016) A high resolution map of the Arabidopsis thaliana developmental transcriptome based on RNA-seq profiling. Plant Journal 88: 1058–1070
Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, et al (2012) The Arabidopsis Information Resource (TAIR): Improved gene annotation and new tools. Nucleic Acids Research 40: 1202–1210
Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
Genome Biology 15: 1–21
Ma F, Jazmin LJ, Young JD, Allen DK (2014a) Isotopically nonstationary 13C flux analysis of changes in Arabidopsis thaliana leaf metabolism due to high light acclimation. Proceedings of the National Academy of Sciences of the United States of America 111: 16967–16972
Ma F, Jazmin LJ, Young JD, Allen DK (2014b) Isotopically nonstationary 13C flux analysis of changes in Arabidopsis thaliana leaf metabolism due to high light acclimation. Proceedings of the National Academy of Sciences of the United States of America 111: 16967–16972
Mahadevan R, Schilling CH (2003) The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metabolic Engineering 5: 264–276
Mergner J, Frejno M, List M, Papacek M, Chen X, Chaudhary A, Samaras P, Richter S, Shikata H, Messerer M, et al (2020) Mass-spectrometry-based draft of the Arabidopsis proteome. Nature 579: 409–414
de Oliveira Dal’Molin CG, Quek LE, Saa PA, Nielsen LK (2015) A multi-tissue genome-scale metabolic modeling framework for the analysis of whole plant systems. Frontiers in Plant Science 6: 1–12
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14: 417–419
Poorter H, Bergkotte M (1992) Chemical composition of 24 wild species differing in relative growth rate. Plant, Cell & Environment 15: 221–229
Schwanhäusser B, Busse D, Li N, Dittmar G, Schuchhardt J, Wolf J, Chen W, Selbach M (2011) Global quantification of mammalian gene expression control. Nature 473: 337–342
Shaw R, Cheung CYM (2018) A dynamic multi-tissue flux balance model captures carbon and nitrogen metabolism and optimal resource partitioning during arabidopsis growth. Frontiers in Plant Science 9: 1–15
Szecowka M, Heise R, Tohge T, Nunes-Nesi A, Vosloh D, Huege J, Feil R, Lunn J, Nikoloski Z, Stitt M, et al (2013) Metabolic fluxes in an illuminated Arabidopsis rosette.
Plant Cell 25: 694–714 Wilcoxon F (1992) Individual comparisons by ranking methods. Breakthroughs in statistics. Springer, pp 196–202 127 Chapter 4 Biophysical carbon concentrating mechanisms in land plants: insights from reaction-diffusion modeling A preprint of this study is available: J. A. M. Kaste, B.J. Walker, Y. Shachar-Hill. Biophysical carbon concentrating mechanisms in land plants: insights from reaction-diffusion modeling. bioRxiv (2023). doi: https://doi.org/10.1101/2024.01.04.574220 128 4.1. Preface The project described in this chapter was born out of conversations with Anne Steensma, a fellow graduate student in the Shachar-Hill laboratory, and Dr. Berkley Walker. Anne was interested in using metabolic modeling as a way of exploring a hypothetical setup for a Carbon- Concentrating Mechanism (CCM) in the red alga Cyanidioschyzon merolae. I provided substantial assistance in the early stages of the project, including writing the code for the initial versions of the model architecture, coming up with a computational approach for estimating CO2 compensation points in silico, and setting up the code for distribution to MSU’s High- Performance Computing Cluster (HPCC) for parameter exploration and analysis. The analysis for this project is proceeding steadily and I plan on writing it up as a manuscript before the end of the year, on which I am anticipated to be co-first author. The conversations our group was having about modeling biophysical CCMs led us to asking some broader questions about the efficiency of such mechanisms in land plants, such as: 1. Previous modeling studies make the addition of a carboxysome to a land plant seem very energetically favorable, but is this finding robust? 2. Why do land plants not pump any bicarbonate from apoplastic water when there is a substantial concentration of bicarbonate available? 3. 
Is there something about the physiology and/or morphology of hornworts that has caused them to repeatedly evolve a pyrenoid-based biophysical CCM when such biophysical CCMs are entirely absent in all other land plant lineages?

I took up answering these questions by building spatially-resolved reaction-diffusion models of inorganic carbon and O2 movement in land plant and algal systems using the Virtual Cell platform. The results of this analysis have been written up as a manuscript that has been deposited as a preprint and is currently under peer review.

4.2. Abstract

Carbon Concentrating Mechanisms (CCMs) have evolved numerous times in photosynthetic organisms. They elevate the concentration of CO2 around the carbon-fixing enzyme rubisco, thereby increasing CO2 assimilatory flux and reducing photorespiration. Biophysical CCMs, like the pyrenoid-based CCM of Chlamydomonas reinhardtii or carboxysome systems of cyanobacteria, are common in aquatic photosynthetic microbes, but in land plants appear only among the hornworts. To predict the likely efficiency of biophysical CCMs in C3 plants, we used spatially resolved reaction-diffusion models to predict rubisco saturation and light use efficiency. We find that the energy efficiency of adding individual CCM components to a C3 land plant is highly dependent on the permeability of lipid membranes to CO2: permeability values that lie within the range reported in the literature, but are higher than those used in previous modeling studies, result in low light use efficiency. Adding a complete pyrenoid-based CCM into the leaf cells of a C3 land plant is predicted to boost net CO2 fixation, but at higher energetic costs than those incurred by photorespiratory losses without a CCM.
Two notable exceptions are when substomatal CO2 levels are as low as those found in land plants that already employ biochemical CCMs and when gas exchange is limited, as in hornworts, making the use of a biophysical CCM necessary to achieve net positive CO2 fixation under atmospheric CO2 levels. This provides an explanation for the uniqueness of hornworts’ CCM among land plants and for the evolution of pyrenoids multiple times.

4.3. Introduction

Ribulose-1,5-bisphosphate carboxylase/oxygenase (rubisco) catalyzes the fixation of CO2 as part of the Calvin-Benson Cycle (CBC) but is also capable of fixing O2. The fixation of O2 results in the formation of 2-phosphoglycolate (2PG), with the photorespiratory pathway being necessary to detoxify and recover the carbon in 2PG and recycle it back into the CBC. Although rubisco shows selectivity for CO2 relative to O2, significant photorespiratory flux still occurs in photosynthetic systems due to the much higher partial pressure of O2 in the earth’s atmosphere relative to CO2. Photorespiratory flux lowers net carbon assimilation and incurs substantial energetic costs, in the form of ATP, redox equivalents, and ultimately photons. Although the costs associated with photorespiration vary between plant species and environmental conditions, it has been estimated that photorespiration accounts for crop yield decreases of 20 and 36% for soybean and wheat, respectively, under current climate conditions (Walker et al., 2016).

Carbon Concentrating Mechanisms (CCMs) increase the concentration of CO2 around rubisco, competitively inhibiting the oxygenation reaction, suppressing photorespiration, and increasing carboxylation flux (Raven et al., 2017). In biochemical CCMs, such as C4 and CAM photosynthesis, inorganic carbon is fixed into an intermediate form of organic carbon, before eventually being released around rubisco (Ludwig, 2013; Bräutigam et al., 2017).
Biophysical or “inorganic” CCMs, on the other hand, do not rely on any additional intermediate organic carbon species, but instead use pumps, diffusional barriers, carbonic anhydrases, and pH differences between cellular compartments to increase the CO2 concentration near rubisco (Raven et al., 2008). Such CCMs are common in cyanobacteria and algae (Raven et al., 2008), but are conspicuously absent in C3 plants, including almost all land plants. This has motivated researchers to look into the possibility of introducing a CCM, either in its entirety or individual components, into these plants to improve carbon fixation, reduce photorespiratory CO2 and energy losses, and ultimately boost yields (Ermakova et al., 2020; Hennacy and Jonikas, 2020).

The seemingly substantial benefits of CCMs raise the question of why they are not already more widespread in land plants. Despite their lack of a CCM, C3 plants are still the most abundant group of land plants in terms of vegetation coverage and gross photosynthetic productivity (Still et al., 2003; Raven et al., 2017). In the case of C4 photosynthesis, the large number of anatomical and biochemical features required has been invoked as a reason why, rather than being universally adopted in land plants, it has instead repeatedly evolved only in lineages exposed to the kinds of hot, arid conditions that limit water availability and exacerbate the losses associated with photorespiration (Sage et al., 2018). However, such an explanation is less satisfactory in the case of biophysical CCMs because they are present in the hornworts. It also raises the question of why biophysical CCMs are uniformly absent in all land plant lineages except for the hornworts (Villarreal and Renner, 2012). Have inefficiencies associated with biophysical CCMs precluded their successful emergence in C3 plants, and can these inefficiencies account for the presence and absence of biophysical CCMs in different groups of organisms?
The efficiency of intermediate photosynthetic configurations, featuring some but not all of the essential parts of a CCM, may also represent a barrier to the emergence of CCMs in land plant lineages.

Anatomical and life history details of hornworts may explain why, among the land plants, only hornworts have evolved pyrenoid-based biophysical CCMs (PCCMs), and have done so repeatedly (Villarreal and Renner, 2012). The poikilohydric life history of hornworts makes it necessary for them to have highly desiccation-tolerant cell walls which, together with bryophytes’ generally thicker cell walls (Flexas et al., 2021) and hornworts’ simpler tissue architecture, may explain their extremely low gas conductance (Meyer et al., 2008; Carriquí et al., 2019). We hypothesized that the distinct morphological characteristics and habitat of hornworts may explain why they, uniquely among the land plants, evolved biophysical CCMs. It is possible that the different paths that inorganic carbon has to take from the environment into a C3 land plant cell versus an algal cell can similarly explain why the former never uses pyrenoids to concentrate carbon and the latter frequently does.

A closer examination of the costs of a CCM may also inform the viability and strategy of biotechnological projects focused on introducing them to C3 crops. Prior quantitative modeling work argues that incorporating individual CCM components – in particular, bicarbonate transporters at the chloroplast membrane – and entire CCMs into land plant systems may boost net CO2 fixation as well as improve the efficiency of photosynthetic carbon assimilation by reducing the energetic costs associated with photorespiration (McGrath and Long, 2014; Fei et al., 2022). Similar arguments have been made in favor of engineering biochemical – e.g. C4 – photosynthesis into C3 plants (Walker et al., 2016). These models represent sophisticated, integrative descriptions of photosynthetic carbon assimilation.
For the purposes of the questions we are interested in, however, we needed models of both land plant and algal systems that represent photo-assimilatory processes at the whole-cell level. We also needed models that allow us to explore substantial uncertainties in certain key parameters, and that include energy costs associated with carbonic anhydrase (CA) activity in the thylakoid lumen.

Here we developed spatially-resolved reaction-diffusion models of land plants and green algae with and without PCCMs in the Virtual Cell platform (Schaff et al., 1997; Cowan et al., 2012). These models represent, to our knowledge, the first such models of C3 land plants containing pyrenoid-based biophysical CCMs, as well as the first models of algal systems containing biophysical CCMs going beyond the scale of the chloroplast and including the whole cell in an aqueous environment. We highlight the substantial uncertainty in reported or predicted values of the permeability of lipid membranes to CO2 and explore how this uncertainty can give rise to qualitatively different conclusions as to the efficiency and effectiveness of adding chloroplast envelope bicarbonate pumps in particular. Finally, we find that despite the near-ubiquity of biophysical CCMs in algae, modeling suggests that lower levels of external dissolved inorganic carbon (DIC) are needed to make CCMs energetically favorable for land plants.

4.4. Methods

4.4.1. Model details

Spatially-resolved reaction-diffusion models of carbon assimilation were developed in the Virtual Cell platform, a software suite that allows for the creation and analysis of chemical reaction-diffusion dynamics in the context of 3D models (Schaff et al., 1997; Cowan et al., 2012).
Baseline parameters for simulations can be found in Table 4.1. Diagrams of two of the models used in this study, showing the representative features of the land plant and algal models as well as the differences between the with- and without-PCCM models, can be seen in Figure 4.1. Systems were represented as spatially symmetrical, with spherical concentric compartments that were converted into volumetric pixels (voxels) according to the simulations’ spatial resolution. All results presented are from simulations containing either 9,261 voxels or 12,167 voxels. Due to the large parameter explorations done in this study, minor geometrical modifications were made to make efficient numerical simulation feasible. Specifically, the radius of the apoplast water layer in the land plant models was extended to 10 µm from the 9.41 µm implied by a cell wall thickness of 0.32 µm plus an apoplast water layer of equivalent thickness. We also modeled the thylakoid tubules of with-PCCM models as a set of six cylinders of radius 0.5 µm extending into the pyrenoid, with exchange between the tubules and the pyrenoid occurring at the end of these cylinders, in contrast to the larger number of finer tubules used in Fei et al., (2022).

Figure 4.1: Diagrammatic representations of (A) a model of photosynthetic carbon assimilation in a land plant mesophyll cell containing a C. reinhardtii style PCCM, and (B) a model of an algal cell that does not contain a pyrenoid. CA refers to carbonic anhydrase, BLP refers to bestrophin-like proteins that serve as membrane channels for passive bicarbonate transport, and BicA is a cyanobacterial active bicarbonate transporter. In the VCell implementation of the model, some strongly linked steps are combined for the sake of numerical computability. Exact specifications for all flux equations used can be found in the publicly shared model implementations in VCell (see code and data availability statement).
Note that for the sake of numerical tractability, the carbonic-anhydrase catalyzed interconversion of CO2 and HCO3- in the thylakoid in models featuring a CCM (v29) is localized to the pyrenoid but uses the pH value of the thylakoid; in the real biological system, the carbonic anhydrase is inside the thylakoid tubules that penetrate into the pyrenoid.

Table 4.1: Model parameter definitions with source references and, where applicable, with notes on derivation. When parameters were derived from parameterization of a previous modeling study, both the modeling study and the original literature reference for the parameter are cited. References in “Ref.” column: (1) Mazarei & Sandall 1980; (2) Fei et al. 2022; (3) Xiang & Anderson 1994; (4) Walker, Smith & Cathers 1980; (5) Bentley & Pittman 1997; (6) Gutknecht, Bisson & Tosteson 1977; (7) Missner et al. 2008; (8) Hopkinson et al. 2011; (9) Widomska, Raguz & Subczynski 2007; (10) Mitchell et al. 2010; (11) Larsson et al. 1997; (12) Pocker & Ng 1973; (13) Pocker & Miksch 1978; (14) McGrath & Long 2014; (15) Bernacchi et al. 2005; (16) Badger & Andrews 1974; (17) Farquhar, von Caemmerer & Berry 1980; (18) von Caemmerer 2000; (19) Price et al. 2004; (20) Bernacchi et al. 2006; (21) Kump 2008; (22) Pritchard, Grout & Short 1986; (23) Flexas et al. 2021; (24) Ouk, Oi & Taniguchi 2020; (25) Slaton & Smith 2002; (26) Yu, Tang & Kuo 2000; (27) Feely, Doney & Cooley 2009; (28) Felle 2001; (29) Kramer, Sacksteder & Cruz 1999.
Name | Value(s) | Units | Notes | Ref.
Diffusion coefficient of CO2 in water | 1.88 x 10^3 | µm^2 s^-1 | | (1, 2)
Diffusion coefficient of H2CO3 in water | 1.2 x 10^3 | µm^2 s^-1 | Assumed in Fei et al., (2022) to be identical to diffusion coefficient of acetic acid | (2, 3)
Diffusion coefficient of HCO3- in water | 1.15 x 10^3 | µm^2 s^-1 | | (2, 4)
Diffusion coefficient of O2 in water | 2.42 x 10^3 | µm^2 s^-1 | | (5)
Membrane permeability to CO2 | 3.50e-03; 3.20e-02 | m s^-1 | Parameter scanned between reported values | (6, 7)
Membrane permeability to H2CO3 | 30 | µm s^-1 | | (2)
Membrane permeability to HCO3- | 0.05 | µm s^-1 | | (8)
Membrane permeability to O2 | 75 | cm s^-1 | | (9)
Bestrophin-like channel mediated permeability of thylakoid to HCO3- | 1 x 10^-2 | m s^-1 | | (2)
Chloroplast membrane permeability to HCO3- mediated by LCIA | 1 x 10^-8 | m s^-1 | | (2)
Rate constant of spontaneous hydration of CO2 | 6 x 10^-2 | s^-1 | | (10)
Rate constant of spontaneous dehydration of H2CO3 | 2 x 10^1 | s^-1 | | (10)
Rate constant of spontaneous deprotonation of H2CO3 | 1 x 10^7 | s^-1 | | (10)
Rate constant of spontaneous protonation of HCO3- | 5 x 10^10 | M^-1 s^-1 | | (10)
Carbonic anhydrase kcat | 0.3 x 10^6 | s^-1 | | (11)
Carbonic anhydrase Km for CO2 | 1.5 | mol m^-3 | | (12)
Carbonic anhydrase Keq | 0.56 x 10^-6 | mol m^-3 | | (13)
Carbonic anhydrase Km for HCO3- | 34 | mol m^-3 | | (13)
Carbonic anhydrase concentration in stroma | 270 | µM | | (14)
Carbonic anhydrase concentration in cytosol | 135 | µM | Assumed to be half the stroma value | (14)

Table 4.1 (cont’d) Carbonic anhydrase concentration in 135 µM Assumed to be half the stroma value (14) 7600 1596 8.6 215 µmol L-1 s-1 µmol L-1 s-1 Calculated from ratio of kcat values of carboxylation and oxygenation.
µmol L-1 µmol L-1 1.85 x 10-4 mol m-2 s-1 Parameter scanned lumen Rubisco Vmax of carboxylation Rubisco Vmax of oxygenation Rubisco Km O2 Rubisco Km CO2 BicA Vmax BicA Km HCO3 - Stomatal conductance Atmospheric concentration of CO2 Atmospheric concentration of O2 Thickness of cell wall in angiosperms 0.32 Thickness of cell wall in bryophytes Effective porosity of C3 plant cell wall 1.6 0.2 0.217 0.4375 412 0.21 mol m-3 mol m-2 s-1 ppm Partial pressure µm µm Unitless (15) (16, 17) (18) (18) (19) (19) (20) Assumed (21) (14, 22) (23) (14) Effective porosity of hornwort cell wall 0.0001 Parameter scanned Calculated (23) Thickness of unstirred boundary layer 0.32 µm Assumed to be the same as cell wall thickness Assumed in algal model Thickness of unstirred apoplast water 0.32 µm Assumed to be the same as cell wall thickness Assumed layer in land plant models Permeability of pyrenoid starch sheath 0.1 * PCO2 µm s-1 From range of permeabilities that allow (2) to dissolved inorganic carbon effective carbon concentration in modeling done Permeability of pyrenoid starch sheath 0.1 * PO2 µm s-1 Assumed to behave similarly to dissolved Assumed by (Hopkinson et al., 2011) to oxygen Radius of pyrenoid Radius of thylakoid Height of thylakoid Radius of chloroplast Radius of cytosol Radius of plasmalemma surface 1.0 0.5 4 4.63 8.77 9.23 µm µm µm µm µm µm Radius of substomatal space in land 11.63 µm plant model Proportion of cell wall adjacent to 0.5 Unitless intercellular airspace in land plant pH of land plant apoplast pH of ocean water pH of cytosol pH of stroma pH of lumen 6.0 8.1 7.2 8.0 6.0 pH pH pH pH pH inorganic carbon Multiplied by 10X to account for simpler thylakoid architecture (2) (2) Assumed Calculated from stroma volume fraction and (14) assuming spherical geometry Assuming spherical geometry (24) Calculated from radius of cytosol, cell wall Calculated thickness, and assumed apoplast water thickness (14) (25) (26) (27) Calculated (28) (29) (29) 136 4.4.2. 
Reaction equations

Carboxylation flux by rubisco is calculated as in Farquhar et al., (1980) (E1). The rate of carboxylation by rubisco is normally taken to be the minimum of vc and J, where J describes the rate of ribulose-1,5-bisphosphate (RuBP) regeneration enabled by photosynthetic electron transport and is a function of Jmax, a maximum rate of RuBP regeneration, among other parameters (Farquhar et al., 1980). Estimates of the relevant parameters are available for land plants but, to our knowledge, not for algae. We are also specifically examining CO2-limiting conditions where rubisco reaction rate limitations dominate. For these reasons, we calculate the carboxylation and oxygenation rates assuming that the system is not limited by RuBP regeneration, as in Fei et al., (2022).

$$v_c = \frac{V_{max}^{carboxylation} \cdot [CO_2]}{[CO_2] + K_m^{CO_2}\left(1 + \frac{[O_2]}{K_m^{O_2}}\right)} \tag{E1}$$

The ratio of oxygenation to carboxylation Vmax is:

$$\frac{V_{max}^{oxygenation}}{V_{max}^{carboxylation}} = \frac{k_{cat}^{oxygenation}}{k_{cat}^{carboxylation}} \tag{E2}$$

Using a $k_{cat}^{oxygenation} / k_{cat}^{carboxylation}$ value of 0.21 as in Farquhar et al., (1980), we can thereby calculate the $V_{max}^{oxygenation}$ of our systems. The oxygenation flux by rubisco is then calculated as:

$$v_o = \frac{V_{max}^{oxygenation} \cdot [O_2]}{[O_2] + K_m^{O_2}\left(1 + \frac{[CO_2]}{K_m^{CO_2}}\right)} \tag{E3}$$

Interconversion of CO2 with bicarbonate via carbonic anhydrase is described as in McGrath and Long, (2014):

$$\frac{[CA] \cdot CA_{k_{cat}} \cdot \left([CO_2] - \frac{[HCO_3^-][H^+]}{K_{eq}}\right)}{K_m^{CO_2} + [HCO_3^-]\left(\frac{K_m^{CO_2}}{K_m^{HCO_3}}\right) + [CO_2]} \tag{E4}$$

In the land plant models, the flux density of dissolution of gaseous CO2 or O2 into the water layer is as in Hemond and Fechner, (2022):

$$\text{FluxDensity}_{\text{WaterLayer}} = -\frac{D_w}{\delta_w}\left(C_w - \frac{C_a}{H}\right) \tag{E5}$$

Where $D_w$ is the diffusion rate of the dissolving species in water, $C_w$ and $C_a$ are the concentrations of that species in the water layer and in the air, respectively, $H$ is the dimensionless Henry’s Law constant, and $\delta_w$ is the length of the unstirred water layer into which the gas is dissolving.
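As a minimal sketch of how E1–E3 behave, the Python snippet below evaluates the carboxylation and oxygenation rates for a single well-mixed compartment. The function name `rubisco_rates` and the stromal concentrations in the usage lines are hypothetical and are not part of the VCell models. The Vmax and kcat ratio are the values quoted in this section and Table 4.1, while the Km values (8.6 µM for CO2 and 215 µM for O2) are assumed here purely for illustration.

```python
def rubisco_rates(co2, o2, vmax_c, kcat_ratio, km_co2, km_o2):
    """Return (vc, vo) following E1-E3, with RuBP regeneration assumed non-limiting.

    Concentrations and Km values share one unit (here µM); the rates are in
    the units of vmax_c (here µmol L-1 s-1).
    """
    vmax_o = kcat_ratio * vmax_c                              # E2
    vc = vmax_c * co2 / (co2 + km_co2 * (1.0 + o2 / km_o2))   # E1
    vo = vmax_o * o2 / (o2 + km_o2 * (1.0 + co2 / km_co2))    # E3
    return vc, vo


VMAX_C = 7600.0      # µmol L-1 s-1, rubisco Vmax of carboxylation
KCAT_RATIO = 0.21    # oxygenation kcat / carboxylation kcat
KM_CO2 = 8.6         # µM, assumed for illustration
KM_O2 = 215.0        # µM, assumed for illustration

# Hypothetical stromal concentrations (µM), chosen only to exercise the function.
vc, vo = rubisco_rates(co2=10.0, o2=250.0, vmax_c=VMAX_C,
                       kcat_ratio=KCAT_RATIO, km_co2=KM_CO2, km_o2=KM_O2)
print(f"vc = {vc:.1f}, vo = {vo:.1f} µmol L-1 s-1")
```

Because each substrate appears as a competitive inhibitor in the other reaction's denominator, raising [CO2] in this sketch increases vc and lowers vo at the same time, which is the effect a CCM exploits.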
In our models, we assume the presence of a thin layer of water on top of the plant’s cell wall, of the same thickness as the cell wall itself, into which CO2 is dissolving. Permeation of aqueous species through the cell wall is given by the following equation, as in McGrath and Long, (2014):

$$\text{FluxDensity}_{\text{CellWall}} = \frac{D_w}{\delta_{CellWall}} \cdot \text{EffectivePorosity} \tag{E6}$$

Where EffectivePorosity is the porosity of the cell wall divided by the tortuosity of the cell wall. For computational tractability, we combine the processes of gases dissolving into water and the aqueous species passing through the cell wall. Note that in the above equations, $D_w / \delta_w$ and $D_w \cdot \text{EffectivePorosity} / \delta_{CellWall}$ give the permeabilities (in units of µm/s) of the water layer and the cell wall, respectively. Multiplying these values by surface area (SA) gives conductivities (in units of µm3/s). The inverses of these values are resistances, which can be summed to give the total resistance of the water layer plus the cell wall. The inverse of this, again, will be the conductivity of the overall system, which can be multiplied by the concentration gradient from the air to the surface of the plasmalemma to give the total flux.

$$J = \left(\left(\frac{D_w \cdot SA}{\delta_w}\right)^{-1} + \left(\frac{D_w \cdot \text{EffectivePorosity} \cdot SA}{\delta_{CellWall}}\right)^{-1}\right)^{-1} \cdot \left(\frac{C_a}{H} - C_w\right) \tag{E7}$$

Permeation through lipid membranes is given by:

$$P \cdot \left([\text{Outside}] - [\text{Inside}]\right) \tag{E8}$$

Active transport by bicarbonate transporter BicA is described using Michaelis-Menten kinetics:

$$\frac{V_{max}^{BicA} \cdot [HCO_3^-]}{K_m + [HCO_3^-]} \tag{E9}$$

4.4.3. Efficiency calculations

Net CO2 fixation is described as:

$$\text{NetFixation} = \text{Flux}_{carboxylation} - \frac{\text{Flux}_{oxygenation}}{2} \tag{E10}$$

2 NADPH equivalents are expended per carboxylation or oxygenation reaction based on the stoichiometry of the CBC and photorespiration. 3 ATP and 3.5 ATP are used for a single carboxylation or oxygenation event, respectively (Edwards and Walker, 1983).
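The series-resistance calculation in E7 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the VCell implementation: the function name `series_flux`, the surface area, the water-side concentration, and the dimensionless Henry's constant of 1.2 are all hypothetical values chosen for the example, while the diffusion coefficient, layer thicknesses, and effective porosity are the angiosperm values quoted in this chapter.

```python
def series_flux(d_w, delta_w, delta_wall, eff_porosity, area, c_air, c_water, henry):
    """Total flux (E7) through a water layer and a cell wall arranged in series."""
    g_water = d_w * area / delta_w                    # water-layer conductivity, µm^3/s
    g_wall = d_w * eff_porosity * area / delta_wall   # cell-wall conductivity, µm^3/s
    g_total = 1.0 / (1.0 / g_water + 1.0 / g_wall)    # resistances add in series
    return g_total * (c_air / henry - c_water)


D_CO2 = 1.88e3    # µm^2 s-1, diffusion coefficient of CO2 in water
DELTA = 0.32      # µm, used for both the water layer and the cell wall
POROSITY = 0.2    # unitless effective porosity of a C3 plant cell wall

# Hypothetical values for illustration only.
AREA = 100.0      # µm^2
HENRY = 1.2       # dimensionless air/water ratio, assumed
C_AIR = 16.62     # µM CO2 in air
C_WATER = 5.0     # µM CO2 at the plasmalemma side

flux = series_flux(D_CO2, DELTA, DELTA, POROSITY, AREA, C_AIR, C_WATER, HENRY)
print(f"flux = {flux:.3e} (µm^3 s-1 x µM)")
```

With equal layer thicknesses and an effective porosity of 0.2, the cell wall contributes five times the resistance of the water layer, so the combined conductivity in this sketch is one sixth of the water layer's conductivity on its own.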
In models featuring a PCCM, there is a lumenal carbonic anhydrase that catalyzes the following reaction:

$$CO_2 + H_2O \rightleftharpoons HCO_3^- + H^+ \tag{E11}$$

Due to the acidic pH of the lumen (Kramer et al., 1999), the net flux of this reaction is overwhelmingly in the direction of CO2 and H2O, so that entry of bicarbonate depletes the proton motive force (pmf) that is maintained by the light reactions of photosynthesis, which imposes an indirect ATP cost on CCM activity by requiring additional proton pumping to maintain the pmf (Mukherjee et al., 2019). Based on a 14:3 ratio of pumped protons to ATP synthesis via the thylakoid membrane ATP synthase, inferred from the number of c-subunits in such ATP synthases (Seelert et al., 2000), we can calculate the indirect ATP cost of this lumen CA activity as:

$$ATP_{cost} = J_{CA_{lumen}} \cdot \frac{3}{14} \tag{E12}$$

This is added to the other ATP consumption in the model (due to the metabolic costs of carboxylation and oxygenation) to give total ATP use. This can be compared with NADPH use due to carboxylation and oxygenation to get an estimate of the total ATP, NADPH, and the ATP:NADPH ratio needed to support the activity in the model. From the values provided in Walker et al., (2020) we estimate the amount of either Cyclic Electron Flow (CEF) or Malate Valve activity needed to rebalance the ATP/NADPH ratio needed for a particular model, which we can then convert into an additional demand for photons and, therefore, the number of photons needed on a per reaction (carboxylation or oxygenation) basis (Appendix C, Figure S4.1). From this, we can calculate the number of photons needed to support model fluxes and then compare this to the net fixation achieved by a model to get an estimate of light use efficiency.
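The ATP and NADPH bookkeeping described above (E10, E12, and the per-reaction stoichiometries) can be sketched as follows. The function name `energy_budget` and the example fluxes are hypothetical; only the stoichiometric constants come from the text.

```python
def energy_budget(vc, vo, j_ca_lumen=0.0):
    """Return (net_fixation, atp, nadph, atp_per_nadph) for the given fluxes.

    vc and vo are carboxylation and oxygenation fluxes; j_ca_lumen is the
    flux through the lumenal carbonic anhydrase (same units as vc and vo).
    """
    net_fixation = vc - vo / 2.0            # E10
    atp = 3.0 * vc + 3.5 * vo               # 3 ATP per carboxylation, 3.5 per oxygenation
    atp += j_ca_lumen * 3.0 / 14.0          # E12: indirect ATP cost of lumenal CA activity
    nadph = 2.0 * (vc + vo)                 # 2 NADPH per carboxylation or oxygenation
    return net_fixation, atp, nadph, atp / nadph


# Hypothetical fluxes (arbitrary but consistent units), with and without lumenal CA flux.
print(energy_budget(vc=10.0, vo=3.0))
print(energy_budget(vc=10.0, vo=3.0, j_ca_lumen=14.0))
```

In this sketch, any lumenal CA flux raises the ATP:NADPH demand without changing NADPH use, which is why the text goes on to estimate how much cyclic electron flow or malate valve activity would be needed to rebalance the ratio.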
$$\varphi_{CEF} = \frac{V_c - \frac{1}{2}V_o}{(V_c + V_o)\left(\text{Photons}_{base} + \left((\text{Ratio} \cdot NADPH_{base}) - ATP_{base}\right) \cdot 0.43\,\frac{photons}{ATP}\right)} \tag{E13}$$

$$\varphi_{malate} = \frac{V_c - \frac{1}{2}V_o}{(V_c + V_o) \cdot \left(\text{Photons}_{base} + \frac{(\text{Ratio} \cdot NADPH_{base}) - ATP_{base}}{5.45\,\frac{ATP}{2\,NADPH}} \cdot 4\,\frac{photons}{NADPH}\right)} \tag{E14}$$

Where $V_c$ and $V_o$ are the modeled rates of carboxylation and oxygenation, Ratio refers to the modeled ATP/NADPH ratio necessary to support the fluxes in the model, and $\text{Photons}_{base}$, $ATP_{base}$ and $NADPH_{base}$ refer to the photons used and the ATP and NADPH generated in the process of making two NADPH molecules via Linear Electron Flow (LEF) (Walker et al., 2020).

4.4.4. Concentration calculations

All concentrations in the models used in this study are in units of µM. To calculate the µM concentrations of CO2 and O2 in the atmosphere, we used the following conversion:

$$412\,\frac{\mu mol\ CO_2}{mol\ air} \cdot \frac{1\ mol\ air}{24.79\ L\ air} \approx 16.62\,\frac{\mu mol\ CO_2}{L\ air}$$

$$0.2095\,\frac{mol\ O_2}{mol\ air} \cdot \frac{1\ mol\ air}{24.79\ L\ air} \cdot \frac{10^6\ \mu mol}{1\ mol} \approx 8450.98\,\frac{\mu mol\ O_2}{L\ air}$$

4.5. Results

4.5.1. Validation of compensation point predictions and sensitivity analysis

The land plant and algal carbon assimilation models were validated by comparing a key estimated result (CO2 compensation point) with experimentally measured values from the literature. The CO2 compensation point is the external CO2 level at which net CO2 assimilation by a photosynthesizing organism is zero (i.e., carbon assimilation by rubisco is balanced out by CO2 losses to photorespiration and respiration in the light, denoted as RL). Low compensation points are also a defining feature of organisms with CCMs since they maintain net positive carbon assimilation at lower CO2 concentrations, making this a useful indicator of whether land plant and algal models with and without CCMs reasonably recreate the carbon assimilation dynamics of real systems. As shown in Figure 4.2 and Table 4.2, the models with CCMs have substantially lower compensation points than the models lacking CCMs.
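The unit conversion in section 4.4.4, which sets the external gas concentrations against which compensation points are evaluated, can be reproduced with a short Python sketch. The function name `mole_fraction_to_umol_per_l` is hypothetical; the molar volume of 24.79 L per mol of air and the two mole fractions are the values used in the text.

```python
MOLAR_VOLUME_L = 24.79  # L of air per mol of air, as used in the conversion above


def mole_fraction_to_umol_per_l(mole_fraction):
    """Convert a gas-phase mole fraction (mol per mol air) to µmol per L of air."""
    return mole_fraction * 1e6 / MOLAR_VOLUME_L


co2_um = mole_fraction_to_umol_per_l(412e-6)   # 412 ppm CO2
o2_um = mole_fraction_to_umol_per_l(0.2095)    # 20.95% O2

print(f"CO2: {co2_um:.2f} µmol per L air")
print(f"O2:  {o2_um:.2f} µmol per L air")
```

Expressing both gases in µmol per litre of air puts them on the same footing as the aqueous µM concentrations used elsewhere in the models, before Henry's Law partitioning is applied at the air-water interface.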
Moreover, these estimated compensation point values fall within the ranges of values reported in the literature for angiosperm land plants and algae with and without CCMs (Table 4.2). Note that the reported compensation points of hornworts with pyrenoids (11-13 ppm) are lower than those of closely related C3 liverworts, but higher than typical estimates for C4 plants and pyrenoid-containing algae (Villarreal and Renner, 2012).

Figure 4.2: Net CO2 assimilation versus external CO2 concentrations in carbon assimilation models. The point at which net CO2 assimilation is zero defines the compensation point. (A) The full range of saturation and external CO2 concentrations, and (B) a zoomed-in panel showing the point at which each curve reaches 0% rubisco saturation (i.e., the compensation point).

Table 4.2: Predicted compensation points for different models from the present study compared with reference values from the literature. Reference column numbers refer to their numbering in the bibliography.

Model | Compensation point (ppm CO2) | Reference Values (ppm CO2) | Measurement References
Land plant with CCM | 6.2 | 1.3; 4.3; 0.7 – 9.0 | (Fladung and Hesselbach, 1987; Lee et al., 2022)
Land plant without CCM | 52.7 | 48; 57; 48.2 – 53.4; 65-100 | (Fladung and Hesselbach, 1987; Tolbert et al., 1995; Peixoto et al., 2021)
Algal model with CCM | 2.7 | 0.75 – 2.5; 6.0 | (Coleman and Colman, 1980; Raven et al., 1982)
Algal model without CCM | 44.6 | 43.5 – 58; 64.5 | (Raven et al., 1982; Steensma et al., 2023)

The sensitivity analysis results shown in Figure 4.3 show that simulated net CO2 assimilation and quantum yield values from the land plant models are relatively robust to local variations in all parameters, providing us with confidence that these results are not merely the result of a very particular selection of parameters.
In both the land plant and algal models without PCCMs, rubisco Vmax, cell and chloroplast radii, and membrane permeability to CO2 are the most influential determinants of net CO2 assimilation and quantum yield. In the land plant model, stomatal conductance also stands out. The addition of a PCCM reduces the sensitivity of net CO2 assimilation to changes in any input parameter but increases the sensitivity of the predicted quantum yield to input parameter values. The local stability of our results to perturbations in key parameters is comparable with previous studies, being more variable than the models presented in Fei et al., (2022), which spatially modeled a smaller system (algal chloroplasts), and significantly less variable than the models presented in McGrath and Long, (2014), which modeled land plant CO2 assimilation at a similar scale.

We also characterized the sensitivity of our modeling results to the spatial resolution of the numerical simulations. Our results (Appendix C, Figures S4.2-3) show that rubisco saturation – the percentage of maximum rubisco activity achieved – and quantum yield in an algal model lacking a CCM are robust to the simulation resolution. Increasing the resolution all the way down to 0.32 µm, well beyond what could feasibly be done given the amount of parameter exploration done in this study, does result in noticeable changes in pyrenoid [CO2] and [HCO3-], resulting in small increases in rubisco saturation and small decreases in quantum yield (Appendix C, Figures S4.4-5).

Figure 4.3: Sensitivity analysis results for (A) the land plant model lacking a CCM, (B) the algal model lacking a CCM, (C) the land plant model with a CCM, and (D) the algal model with a CCM. Orange bars indicate the absolute % change of quantum yield resulting from a 10% change in the indicated parameter, and blue bars represent the same for rubisco saturation.
For both of the land plant models, increasing the cytosol radius by 10% resulted in problems with solving the systems numerically, so the cytosol radius was increased by 1% instead and, assuming a linear relationship between the size of radius increase and the change in rubisco saturation and quantum yield, multiplied by 10 to get the values shown in (A-B).

4.5.2. Efficiency of chloroplast membrane bicarbonate channel is strongly dependent on assumed permeability of chloroplast membrane to CO2

Previous studies (Price et al., 2010; McGrath and Long, 2014) have suggested that the incorporation of bicarbonate transporters into the chloroplast membrane of a land plant could improve net fixation and/or the efficiency of carbon assimilation, and that this could represent a reasonable intermediate stage in a broader biotechnological effort to implement a full CCM in a land plant. Modeling studies on CCM systems typically assume a lipid membrane permeability to CO2 of 0.35 cm/s, which was experimentally measured and reported in Gutknecht et al., (1977). However, there is substantial uncertainty as to the value of this parameter, with experimental estimates ranging over many orders of magnitude (Evans et al., 2009). The permeability may be as much as an order of magnitude higher than the Gutknecht et al., (1977) value, as reported by Missner et al., (2008). We hypothesized that the apparent favorability of employing a chloroplast membrane bicarbonate pump may be highly sensitive to the assumed chloroplast membrane CO2 permeability. To test this hypothesis, we performed a parameter exploration from an order of magnitude lower than the widely cited Gutknecht et al., (1977) value up to the Missner et al., (2008) value in both land plant and algal systems, calculating net fixation as well as ATP/CO2 and light-use efficiency, as shown in Figure 4.4.
These results show that the light use efficiency of a chloroplast membrane bicarbonate transporter is highly sensitive to the value of the chloroplast envelope’s permeability to CO2, with a large range of permeabilities resulting in 2X more ATP usage per unit of CO2 fixed. In the land plant model, we see increases in both rubisco saturation and quantum yield as BicA pumping activity increases when lipid membrane permeability values are equivalent to, or below, that reported in Gutknecht et al., (1977) (Figure 4.4A-B). At permeabilities higher than this, increased BicA activity actually decreases quantum yield, though net fixation still increases (Figure 4.4A-B). We see a similar picture in the algal model (Figure 4.4E-F), suggesting that the differences in DIC form, concentration, and diffusivity do not greatly impact the sensitivity of this strategy to the specific value of lipid membrane permeability to CO2.

The decrease in quantum yield in models with high lipid membrane permeability to CO2 is driven by increased leakage of CO2 from the chloroplast back into the cytosol after it interconverts with the bicarbonate just pumped by BicA (shown as flux V15 in Figure 4.1). As lipid membranes become more permeable to CO2, its tendency to escape the chloroplast before being fixed by rubisco increases. Lowering the external CO2 concentration does, however, change the energy efficiency penalty of increased BicA activity significantly (Figure 4.4C-D;G-H). Even at higher lipid membrane permeability values, we see only minimal decreases in quantum yield with increased BicA bicarbonate pumping.

Figure 4.4: Rubisco saturation and quantum yield of land plant and algal models of CO2 assimilation under 100% and 50% external CO2 levels, as a function of lipid membrane permeability to CO2 and BicA bicarbonate transporter Vmax. Fold change of lipid membrane permeability is relative to the value reported in Gutknecht et al., (1977).
(A) Predicted rubisco saturation of a land plant model under 100% external CO2. (B) Predicted quantum yield of a land plant model under 100% external CO2. (C) Predicted rubisco saturation of a land plant model under 50% external CO2. (D) Predicted quantum yield of a land plant model under 50% external CO2. (E) Predicted rubisco saturation of an algal model under 100% external CO2. (F) Predicted quantum yield of an algal model under 100% external CO2. (G) Predicted rubisco saturation of an algal model under 50% external CO2. (H) Predicted quantum yield of an algal model under 50% external CO2. The black lines in each plot indicate the Gutknecht et al., (1977) value for lipid bilayer permeability to CO2 as well as a transition in the y-axis from increments of 0.1X to 1X fold changes.

4.5.3. Efficiency of a plasmalemma bicarbonate channel is strongly dependent on external DIC levels and limited by the rate of equilibration between CO2 and bicarbonate

We found that although the strategy of pumping bicarbonate from the cytosol to the chloroplast may incur substantial energy costs, implementing a bicarbonate pump at the plasmalemma may be more effective. This makes sense considering that in aqueous systems at near-neutral pH, most of the DIC in the system is in the form of bicarbonate. We incorporated a plasmalemma bicarbonate transporter and explored the efficiency of such a system across different external DIC concentrations and activities of the transporter in both algal and land plant systems (Figure 4.5).

Figure 4.5: Predicted rubisco saturation and quantum yield in land plant and algal models with a BicA bicarbonate pump present in the plasmalemma, as a function of assumed lipid membrane permeability to CO2 and BicA Vmax. Fold change of lipid membrane permeability is relative to the value reported in Gutknecht et al., (1977). (A) Predicted rubisco saturation of a land plant model lacking an apoplastic carbonic anhydrase.
(B) Predicted rubisco saturation of a land plant model with an apoplastic carbonic anhydrase. (C) Predicted rubisco saturation of an algal model. (D) Predicted quantum yield of an algal model. (E) Predicted rubisco saturation of a land plant model with an apoplastic carbonic anhydrase and an apoplast pH of 8. (F) Predicted quantum yield of a land plant model with an apoplastic carbonic anhydrase and an apoplast pH of 8. The black lines in each plot indicate the Gutknecht et al., (1977) value for lipid bilayer permeability to CO2 as well as a transition in the y-axis from increments of 0.1X to 1X fold changes.

In the land plant model, the plasmalemma bicarbonate pump is not an effective means of increasing either net fixation or energy efficiency. As anticipated, the pump does work in the algal case (Figure 4.5). The key difference appears to be that the external environment in the algal system, which is suffused with bicarbonate ions, can maintain reasonably high steady-state concentrations in the vicinity of the cell to support the bicarbonate pumping activity (Figure 4.5C-D). In contrast, in the land plant system all dissolved bicarbonate available to the cell must first enter the system as CO2 in the intercellular airspace, dissolve into the water in the apoplast, and then spontaneously hydrate to H2CO3 and deprotonate into bicarbonate. Although the protonation/deprotonation between H2CO3 and HCO3- is extremely fast, the hydration/dehydration is not [the first-order rate constant of hydration of CO2 to H2CO3 is 6 x 10^-2 s^-1 (Mitchell et al., 2010)]. The result is an almost instantaneous depletion of the HCO3- concentration in the apoplast space, with insufficient spontaneous hydration flux to replenish it (Figure 4.5G). Adding carbonic anhydrase activity to the apoplast allows for much faster regeneration of the external HCO3- concentration, allowing BicA to impact rubisco saturation (Figure 4.5A-B).
However, the pH of the apoplast, although variable, tends to be slightly to moderately acidic (Yu et al., 2000), resulting in low HCO3- concentrations in the land plant model even with the apoplast carbonic anhydrase included (Figure 4.5G). It is only when the apoplast pH is made substantially more basic (pH of 8) and a carbonic anhydrase is included that the land plant model can replicate the algal model’s rubisco saturation and quantum yield gains by using a plasmalemma bicarbonate pump (Figure 4.5E-F).

4.5.4. PCCM integration results in greater marginal cost of CO2 fixation improvements in land plants vs. algal systems and switches from decreasing to increasing light-use efficiency around a Ci typical of C4 plants

Figure 4.6: The ratio of the marginal cost in photons of one unit of net CO2 fixation in land plant (A) and algal (B) models resulting from adding a PCCM relative to the average cost of fixing one molecule of CO2 in those same models without CCMs, as a function of lipid membrane permeability and external CO2 concentrations. Fold change of lipid membrane permeability is relative to the value reported in Gutknecht et al., (1977). Blue indicates that for a given lipid membrane permeability / external CO2 concentration combination, the model containing a CCM has a lower marginal cost of CO2 fixation – i.e., is more light-efficient – than the average cost of CO2 fixation in the model lacking a CCM. Red indicates that for a given parameterization, the model containing a CCM has a higher marginal cost of CO2 fixation than the average cost of CO2 fixation in its CCM-lacking counterpart. The black lines in each plot indicate the Gutknecht et al., (1977) value for lipid bilayer permeability to CO2 as well as a transition in the y-axis from increments of 0.1 to 1 in the X-fold changes.
We assessed the energy-use efficiency of PCCM integration by comparing the predicted cost in photons of fixing a CO2 molecule in four different models: (i) a land plant model with a PCCM, (ii) a land plant model without a PCCM, (iii) an algal model with a PCCM, and (iv) an algal model without a PCCM. By dividing the increase in photon cost by the increase in net CO2 fixation in models (i) and (iii) relative to models (ii) and (iv), we estimated the marginal cost in photons of fixing an additional CO2 molecule using a PCCM in our land plant and algal models (Figure 4.6). As we observed when examining the efficiency of the plasmalemma and chloroplast envelope BicA bicarbonate pumps, the assumed permeability of lipid membranes can have an impact on efficiency; in this case, however, the relative marginal cost values do not change dramatically between an assumed permeability equivalent to that used in previous studies (1.0 in Figure 4.6A-B) and the higher value closer to that reported in Missner et al., (2008). In the algal models, the use of the PCCM appears to only become marginally efficient with respect to light usage below an external [CO2] of 4.38 µM. In contrast, the CCM is efficient in the land plant model below a substomatal [CO2] of 243 ppm.

4.5.5. As cell wall thickness increases and cell wall effective porosity decreases, PCCMs become more favorable in land plant models

Given the findings regarding PCCMs in land plants highlighted above, it is interesting that many species of hornworts have pyrenoids. Are there any meaningful biophysical differences between hornworts and other land plants that could explain this? As highlighted in Meyer et al., (2008) and Flexas et al., (2021), hornworts and other bryophytes have cell walls that are both substantially thicker and less porous compared to other land plants.
From the mesophyll conductance values for angiosperms and bryophytes reported in Meyer et al., (2008) and Flexas et al., (2021), and with the assumption that other internal resistances to CO2 diffusion are similar between bryophytes and angiosperms, we can estimate that the effective porosity of a bryophyte like a hornwort must be roughly four orders of magnitude smaller than in a typical C3 angiosperm. We explored effective porosity values within this range and across multiple external CO2 concentrations (Figure 4.7).

Figure 4.7: Rubisco saturation (A) and quantum yield (B) of a land plant model with varying effective porosity values. Blue points/lines represent predicted rubisco saturation or quantum yield in models including a PCCM; orange points/lines represent predicted saturation or quantum yield in models not including a PCCM.

Below effective porosities on the order of 10^-1, which fall in the range we would expect of angiosperms, our model shows that the plant struggles to fix CO2 without a CCM. With a PCCM, however, the model can achieve some level of net CO2 fixation all the way down to effective porosities of 10^-3. Below porosities of 10^-3, we do not observe net CO2 fixation in the model without a PCCM, and at a porosity of 10^-4, both models with and without PCCMs struggle to fix carbon. In terms of light-use efficiency, the model with a PCCM achieves a greater quantum yield of photosynthesis than the model without a PCCM below effective porosities of 10^-2.

4.6. Discussion

We initially hypothesized that the conspicuous absence of biophysical CCMs in almost all land plant lineages, in contrast to algae where they are widespread (Raven et al., 2005), may be the result of lower efficiency of such systems in land plants relative to algae, and that this results from their different biophysical contexts.
To our surprise, we found that PCCMs appear to result in qualitatively similar improvements in quantum yield and net CO2 assimilation in land plant and algal models. In the algal model, the fact that addition of a PCCM does not result in efficiency gains until relatively low external DIC levels are reached is surprising, given that Chlamydomonas reinhardtii cells appear to concentrate carbon even at recent “air-level” – approximately 330 ppm – CO2 concentrations (Badger et al., 1980). This implies that algae may routinely run their CCMs even when this incurs a quantum yield penalty. In contrast, the intercellular CO2 concentration at which the CCM improves quantum yield in the land plant model (~243 ppm) is higher than reported estimates of Ci in C4 plants under laboratory, greenhouse, and field conditions (Bunce, 2005). Previous work has described the evolutionary history of C4 photosynthesis (Sage et al., 2018) and identified certain anatomical features – namely Kranz anatomy – and environmental factors such as hot, arid conditions that lead to increased transpirational water loss and factors such as Water-Use Efficiency (WUE) as key predictors of C4 emergence. If the estimated quantum yield gains resulting from the introduction of a biophysical CCM to a land plant in this study apply to biochemical CCMs like C4 and CAM photosynthesis, this may represent an additional evolutionary driver towards such systems. Hornworts are the only land plant lineage that has evolved a biophysical CCM and they have done so multiple times (Villarreal and Renner, 2012). Hornworts, as well as some other bryophytes, are noteworthy for having substantially slower gas exchange between their surroundings and their photosynthetic tissues when compared with vascular land plants (Meyer et al., 2008).
Our results show that in a land plant with the low effective cell wall porosities we might expect given the extremely poor gas exchange characteristics of hornworts, the use of a CCM becomes necessary to achieve net CO2 fixation, which would impose a strong selective pressure for adopting one. The fact that hornworts represent the earliest-diverging extant branch of the land plants, and therefore may have maintained the genes and regulatory networks necessary to adopt a PCCM, may explain why this biophysical CCM strategy has been adopted by hornworts and not by other land plants growing in conditions where biochemical CCMs have been selected for. We should note that in the models presented in this study, at effective porosities below 10^-3, only single-digit values of rubisco saturation are achieved even with a biophysical CCM present and active, which may not be sufficient for viability, especially since we do not have or include estimates of respiration in the light in the models. This is despite the fact that mesophyll conductance to CO2 in hornworts, for which we use effective porosity as a proxy in this study, has been measured to be four-to-five orders of magnitude lower than in angiosperms (Flexas et al., 2021). This suggests that our model underestimates the strength of the hornwort CCM or otherwise does not properly describe some aspect of hornwort CO2 assimilation. The ratio of chloroplast-to-thallus surface area has not been explored in our modeling, but was found in a previous study to be a potentially important determinant of hornwort mesophyll conductance (Carriquí et al., 2019). Future work might aim to incorporate an exploration of chloroplast position and surface area to better account for this in the modeling.

These results shed light on potential challenges associated with improving crop productivity via the introduction of biophysical CCMs.
The specific value chosen for the permeability of lipid bilayers to CO2 has a large effect on the predicted energy efficiency of our models, with values higher than those used in previous modeling studies (McGrath and Long, 2014; Fei et al., 2022) but within the range of previously reported literature values (Gutknecht et al., 1977; Missner et al., 2008) resulting in qualitatively different conclusions. We see this in our consideration of BicA-mediated HCO3- pumping, which had been previously flagged as a promising intermediate step in introducing a biophysical CCM to a C3 plant (Price et al., 2010; McGrath and Long, 2014). As noted in Fei et al., (2022), barriers to CO2 diffusion form a key component of known functional CCMs, so the finding that the chloroplast membrane may provide enough of a diffusion barrier for the transport of HCO3 - into the stroma and subsequent conversion to CO2 to meaningfully improve net fixation and carbon assimilatory efficiency was surprising. Our results show that at or below the permeability reported in Gutknecht et al., (1977), which is used in other modeling studies, increasing BicA pumping activity leads to improvements in quantum yield, indicating more efficient CO2 fixation with respect to light use. However, above this value, we see uniform decreases in quantum yield with increased BicA activity. Net CO2 fixation increases with BicA pumping in all cases; therefore, in situations where light is abundant relative to CO2, this decrease in efficiency may not impact plant fitness. However, recent modeling work suggests that Jmax, the maximum rate of ribulose-1,5- bisphosphate (RuBP) regeneration enabled by photosynthetic electron transport, is more limiting to crop yield than limits to the maximum rate of carboxylation (Vmax of rubisco carboxylation) under the projected elevated atmospheric CO2 levels of 2050 and 2100 (He and Matthews, 2023). 
In this study, improved quantum yields correspond to a combination of (i) lower expenditures of ATP for each CO2 molecule fixed, and (ii) a more favorable ATP/NADPH ratio needed for fixation, resulting in less energy loss from the use of Cyclic Electron Flow during ATP/NADPH rebalancing (Walker et al., 2020). Under conditions of Jmax limitation, differences in quantum yield may become a critical factor in determining yield, making the sensitivity of quantum yield in this and other studies to the assumed lipid bilayer permeability to CO2 a matter of critical importance. Interestingly, previous studies in this area (McGrath and Long, 2014; Fei et al., 2022) have performed sensitivity analyses that include this permeability as a surveyed parameter, and its modeled effect is small compared to other parameters. These small local sensitivity values are estimated by observing the change in an output value like light-saturated CO2 assimilation with a ±10% change in the permeability parameter. This ignores the fact that the uncertainty in this value spans at least an order of magnitude (Evans et al., 2009), and so despite low local sensitivity, the overall change that can result from varying it within reasonable bounds is substantial. The substantial uncertainty in this critical parameter could be reined in by future experimental measurements, though this will still be complicated by factors such as the potentially large variation between different plant systems and the dynamic remodeling of lipid bilayers in response to developmental and environmental cues. In the absence of well-defined values for this parameter, we encourage future groups modeling such systems to explore a range of values and to characterize the robustness of their conclusions to its variation.
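The contrast between a small local sensitivity and a large change across the plausible range can be illustrated with a short sketch. The response function below is a purely hypothetical saturating curve chosen for illustration; it is not the reaction-diffusion model used in this study, and the 0.1X to 10X range is an assumed stand-in for the uncertainty discussed above.

```python
def toy_output(perm_fold):
    """Hypothetical model output as a function of permeability fold change.

    This is NOT the model used in this study; it only mimics an output that
    declines smoothly as membranes become more permeable to CO2.
    """
    return 1.0 / (1.0 + 0.2 * perm_fold)


def local_sensitivity(f, x0, rel_step=0.10):
    """Normalized sensitivity from a +/-10% perturbation around x0."""
    up = f(x0 * (1.0 + rel_step))
    down = f(x0 * (1.0 - rel_step))
    return ((up - down) / f(x0)) / (2.0 * rel_step)


def range_effect(f, x_low, x_high, x0):
    """Relative change in output when the parameter spans its plausible range."""
    return (f(x_high) - f(x_low)) / f(x0)


local = local_sensitivity(toy_output, 1.0)
full = range_effect(toy_output, 0.1, 10.0, 1.0)
print(f"local sensitivity: {local:.2f}; change over 0.1X to 10X: {full:.0%}")
```

In this toy case the output responds only weakly to a 10% perturbation, yet changes by most of its baseline value over two orders of magnitude, which is the pattern a purely local analysis would miss.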
In the near-neutral or slightly basic conditions that most photosynthetic organisms in aqueous environments find themselves in, HCO3- represents the primary form of Dissolved Inorganic Carbon (DIC) in their surroundings. Due to the impermeability of lipid bilayers to passive diffusion of HCO3-, the use of this pool of DIC requires organisms to employ an active transport mechanism [e.g., cyanobacterial HCO3- pumps like BicA (Price et al., 2004)] to move it from the extracellular to the intracellular space, which may often make sense due to the sheer quantity of DIC that is present in the environment. Although land plants ultimately obtain CO2 from the atmosphere, this CO2 must dissolve into water prior to entering photosynthesizing cells, at which point this aqueous CO2 interconverts with other DIC species. This raises the possibility that a similar strategy – pumping HCO3- from a land plant’s apoplast water into the intracellular environment to increase net CO2 fixation – is potentially viable. However, our results indicate that the limited spontaneous rate of CO2 and HCO3- interconversion without the activity of carbonic anhydrase means that this strategy does not work. Of note here is the fact that a quantitatively very similar system arises in algae growing in acidic environments where external HCO3- levels are negligible, such as the red alga Cyanidioschyzon merolae (De Luca et al., 1978). In such systems, all DIC must first enter the cell passively as aqueous CO2, at which point it will interconvert primarily between CO2 and HCO3-, with the ratio of CO2:HCO3- determined by the cytosolic pH. There is strong evidence that C. merolae has a non-pyrenoid based CCM (Steensma et al., 2023).
Such a system could use HCO3- pumping across the chloroplast envelope as a method of concentrating carbon, but our results suggest that this system would require maintenance of a near-neutral cytosolic pH along with the presence of carbonic anhydrases in the cytosol to be viable. The maintenance of this near-neutral pH in an acidic environment may, in turn, represent a substantial energetic cost to the organism.

4.7. Data and Code Availability

All results generated as part of this study can be found in the Supplemental Material. Models used for generating the results can all be found under the account kastejos in the Virtual Cell interface. Specific model names can be found for each dataset in the corresponding Supplemental Material tables.

4.8. Acknowledgments

The Virtual Cell, the software platform used for the reaction-diffusion simulations in this study, is supported by NIH Grant R24 GM137787. This research was supported by the U.S. Department of Energy, Office of Science Biological and Environmental Research Grant no. DE-SC0018269 (J.A.M.K., Y.S-H.) and Basic Energy Sciences Grant no. DE-FG02-91ER20021 (B.J.W.). This work is supported, in part, by the NSF Research Traineeship Program (Grant DGE-1828149) to J.A.M.K. This publication was also made possible by a predoctoral training award to J.A.M.K. from Grant T32-GM110523 from National Institute of General Medical Sciences (NIGMS) of the NIH. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIGMS or NIH.

4.9. Author Contributions

J.A.M.K., B.J.W., and Y.S-H. conceptualized the study. J.A.M.K. developed the models, ran the simulations, and analyzed the results. J.A.M.K. wrote the first draft of the manuscript. All authors contributed to revising and editing the final manuscript.

REFERENCES

Badger MR, Kaplan A, Berry JA (1980) Internal inorganic carbon pool of Chlamydomonas reinhardtii: evidence for a carbon dioxide-concentrating mechanism.
Plant physiology 66: 407–413
Bräutigam A, Schlüter U, Eisenhut M, Gowik U (2017) On the Evolutionary Origin of CAM Photosynthesis. Plant Physiol 174: 473–477
Bunce J (2005) What is the usual internal carbon dioxide concentration in C4 species under midday field conditions? Photosynthetica 43: 603–608
Carriquí M, Roig-Oliver M, Brodribb TJ, Coopman R, Gill W, Mark K, Niinemets Ü, Perera-Castro AV, Ribas-Carbó M, Sack L, et al (2019) Anatomical constraints to nonstomatal diffusion conductance and photosynthesis in lycophytes and bryophytes. New Phytologist 222: 1256–1270
Coleman JR, Colman B (1980) Effect of oxygen and temperature on the efficiency of photosynthetic carbon assimilation in two microscopic algae. Plant Physiol 65: 980–983
Cowan AE, Moraru II, Schaff JC, Slepchenko BM, Loew LM (2012) Spatial modeling of cell signaling networks. Methods in cell biology 110: 195–221
De Luca P, Taddei R, Varano L (1978) Cyanidioschyzon merolae: a new alga of thermal acidic environments. Webbia 33: 37–44
Edwards G, Walker D (1983) C3, C4: Mechanisms, Cellular and Environmental Regulation of Photosynthesis. Univ of California Press
Ermakova M, Danila FR, Furbank RT, von Caemmerer S (2020) On the road to C4 rice: advances and perspectives. Plant J 101: 940–950
Evans JR, Kaldenhoff R, Genty B, Terashima I (2009) Resistances along the CO2 diffusion pathway inside leaves. Journal of Experimental Botany 60: 2235–2248
Farquhar GD, von Caemmerer S, Berry JA (1980) A biochemical model of photosynthetic CO2 assimilation in leaves of C3 species. Planta 149: 78–90
Fei C, Wilson AT, Mangan NM, Wingreen NS, Jonikas MC (2022) Modelling the pyrenoid-based CO2-concentrating mechanism provides insights into its operating principles and a roadmap for its engineering into crops. Nature Plants 8: 583–595
Fladung M, Hesselbach J (1987) Developmental Studies on Photosynthetic Parameters in C3, C3-C4 and C4 Plants of Panicum. Journal of Plant Physiology 130: 461–470
Flexas J, Clemente-Moreno MJ, Bota J, Brodribb TJ, Gago J, Mizokami Y, Nadal M, Perera-Castro AV, Roig-Oliver M, Sugiura D, et al (2021) Cell wall thickness and composition are involved in photosynthetic limitation. Journal of experimental botany 72: 3971–3986
Gutknecht J, Bisson MA, Tosteson FC (1977) Diffusion of carbon dioxide through lipid bilayer membranes: effects of carbonic anhydrase, bicarbonate, and unstirred layers. The Journal of general physiology 69: 779–794
He Y, Matthews ML (2023) Seasonal climate conditions impact the effectiveness of improving photosynthesis to increase soybean yield. Field Crops Research 296: 108907
Hemond HF, Fechner EJ (2022) Chemical fate and transport in the environment. Academic Press
Hennacy JH, Jonikas MC (2020) Prospects for Engineering Biophysical CO2 Concentrating Mechanisms into Land Plants to Enhance Yields. Annu Rev Plant Biol 71: 461–485
Hopkinson BM, Dupont CL, Allen AE, Morel FMM (2011) Efficiency of the CO2-concentrating mechanism of diatoms. Proceedings of the National Academy of Sciences 108: 3830–3837
Kramer DM, Sacksteder CA, Cruz JA (1999) How acidic is the lumen? Photosynthesis Research 60: 151–163
Lee M-S, Boyd RA, Ort DR (2022) The photosynthetic response of C3 and C4 bioenergy grass species to fluctuating light. GCB Bioenergy 14: 37–53
Ludwig M (2013) Evolution of the C4 photosynthetic pathway: events at the cellular and molecular levels. Photosynth Res 117: 147–161
McGrath JM, Long SP (2014) Can the cyanobacterial carbon-concentrating mechanism increase photosynthesis in crop species? A theoretical analysis. Plant physiology 164: 2247–2261
Meyer M, Seibt U, Griffiths H (2008) To concentrate or ventilate? Carbon acquisition, isotope discrimination and physiological ecology of early land plant life forms. Philosophical transactions of the Royal Society of London Series B, Biological sciences 363: 2767–2778
Missner A, Kügler P, Saparov SM, Sommer K, Mathai JC, Zeidel ML, Pohl P (2008) Carbon dioxide transport through membranes. The Journal of biological chemistry 283: 25340–25347
Mitchell MJ, Jensen OE, Cliffe KA, Maroto-Valer MM (2010) A model of carbon dioxide dissolution and mineral carbonation kinetics. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 466: 1265–1290
Mukherjee A, Lau CS, Walker CE, Rai AK, Prejean CI, Yates G, Emrich-Mills T, Lemoine SG, Vinyard DJ, Mackinder LCM, et al (2019) Thylakoid localized bestrophin-like proteins are essential for the CO2 concentrating mechanism of Chlamydomonas reinhardtii. Proc Natl Acad Sci U S A 116: 16915–16920
Peixoto MM, Sage TL, Busch FA, Pacheco HDN, Moraes MG, Portes TA, Almeida RA, Graciano-Ribeiro D, Sage RF (2021) Elevated efficiency of C3 photosynthesis in bamboo grasses: A possible consequence of enhanced refixation of photorespired CO2. GCB Bioenergy 13: 941–954
Price GD, Badger MR, von Caemmerer S (2010) The Prospect of Using Cyanobacterial Bicarbonate Transporters to Improve Leaf Photosynthesis in C3 Crop Plants. Plant Physiology 155: 20–26
Price GD, Woodger FJ, Badger MR, Howitt SM, Tucker L (2004) Identification of a SulP-type bicarbonate transporter in marine cyanobacteria. Proceedings of the National Academy of Sciences 101: 18228–18233
Raven JA, Ball LA, Beardall J, Giordano M, Maberly SC (2005) Algae lacking carbon-concentrating mechanisms. Can J Bot 83: 879–890
Raven JA, Beardall J, Johnston AM (1982) Inorganic Carbon Transport in Relation to H+ Transport at the Plasmalemma of Photosynthetic Cells. Plasmalemma and Tonoplast: Their Functions in the Plant Cell. Elsevier Biomedical Press, Amsterdam, pp 41–47
Raven JA, Beardall J, Sánchez-Baracaldo P (2017) The possible evolution and future of CO2-concentrating mechanisms. Journal of Experimental Botany 68: 3701–3716
Raven JA, Cockell CS, De La Rocha CL (2008) The evolution of inorganic carbon concentrating mechanisms in photosynthesis. Philos Trans R Soc Lond B Biol Sci 363: 2641–2650
Sage RF, Monson RK, Ehleringer JR, Adachi S, Pearcy RW (2018) Some like it hot: the physiological ecology of C4 plant evolution. Oecologia 187: 941–966
Schaff J, Fink CC, Slepchenko B, Carson JH, Loew LM (1997) A general computational framework for modeling cellular structure and function. Biophysical journal 73: 1135–1146
Seelert H, Poetsch A, Dencher NA, Engel A, Stahlberg H, Müller DJ (2000) Proton-powered turbine of a plant motor. Nature 405: 418–419
Steensma AK, Shachar-Hill Y, Walker BJ (2023) The carbon-concentrating mechanism of the extremophilic red microalga Cyanidioschyzon merolae. Photosynth Res 156: 247–264
Still CJ, Berry JA, Collatz GJ, DeFries RS (2003) Global distribution of C3 and C4 vegetation: Carbon cycle implications. Global Biogeochemical Cycles 17: 6–1
Tolbert NE, Benker C, Beck E (1995) The oxygen and carbon dioxide compensation points of C3 plants: possible role in regulating atmospheric oxygen. Proceedings of the National Academy of Sciences 92: 11230–11233
Villarreal JC, Renner SS (2012) Hornwort pyrenoids, carbon-concentrating structures, evolved and were lost at least five times during the last 100 million years. Proceedings of the National Academy of Sciences 109: 18873–18878
Walker BJ, Kramer DM, Fisher N, Fu X (2020) Flexibility in the Energy Balancing Network of Photosynthesis Enables Safe Operation under Changing Environmental Conditions. Plants (Basel, Switzerland). doi: 10.3390/plants9030301
Walker BJ, VanLoocke A, Bernacchi CJ, Ort DR (2016) The Costs of Photorespiration to Food Production Now and in the Future. Annu Rev Plant Biol 67: 107–129
Yu Q, Tang C, Kuo J (2000) A critical review on methods to measure apoplastic pH in plants.
Plant and Soil 219: 29–40

APPENDIX C: Supplemental Material for Chapter 4

FIGURES

Figure S4.1: Carboxylation / oxygenation events per photon as a function of varying ATP:NADPH ratios. Costs associated with using either Cyclic Electron Flow or the malate valve for increasing ATP:NADPH ratio from the products of the light reactions are taken from Walker et al., (2020).

Figure S4.2: Effect of simulation spatial resolution on rubisco saturation. Simulation results are taken from the model of an algal cell without a CCM.

Figure S4.3: Effect of simulation spatial resolution on quantum yield. Simulation results are taken from the model of an algal cell without a CCM.

Figure S4.4: Effect of simulation spatial resolution on rubisco saturation. Simulation results are taken from the model of an algal cell with a CCM.

Figure S4.5: Effect of simulation spatial resolution on quantum yield. Simulation results are taken from the model of an algal cell with a CCM.

Supplementary Datasets Descriptions

All supplemental datasets can be found at the following link: https://doi.org/10.1101/2024.01.04.574220

Supplemental Tables.xlsx: Contains results from VCell simulations discussed and analyzed in the manuscript.

Chapter 5 Concluding Remarks

5.1. Introduction

Taken as a whole, the studies presented in this thesis represent an attempt to improve and refine our understanding of photosynthetic metabolism by interrogating and improving the techniques we use to model it. I articulate the importance of careful statistical evaluation of metabolic models and the utility of using multiple, independent modeling approaches in Chapter 1 (Kaste and Shachar-Hill, 2023a), and provide an example of putting these ideas into practice in Chapter 2 (Xu et al., 2022).
Again in Chapter 3, I make use of validation principles and implement the “gold-standard” validation of an FBA flux map – comparison against an MFA flux map – that I argue for in Chapter 1 (Kaste and Shachar-Hill, 2023b). Finally, in Chapter 4, I look at reaction-diffusion modeling of photosynthetic systems, identifying sources of model uncertainty that affect prior work’s conclusions with regard to the efficiency of using Carbon-Concentrating Mechanisms. In this chapter, I will review some of the takeaway results and conclusions from the studies that I have presented. I will also highlight limitations and future directions of this overall research program.

5.2. Takeaway messages and future work

5.2.1. Analyzing systems using multiple modeling paradigms can help reveal new aspects of these systems, but may be redundant with refinements or extensions of existing paradigms

In Chapter 2, I showed that complementing our 13C-MFA study with a pharmacokinetics-derived polyexponential modeling approach allowed us to further refine the 13C-MFA model and discover new properties of the system under consideration. However, inasmuch as both approaches are fundamentally just different mathematical formalisms describing the same biological phenomena, there should exist underlying mathematical connections between the two that, with sufficient exploration, unify them or render one or the other redundant. As referenced in Chapter 1 and described in Nöh and Wiechert, (2011) and Zheng et al., (2022), fitting time-course isotopic labeling data to incomplete metabolic network model specifications can lead to unmodeled reactions contributing unaccounted-for labeled/unlabeled atoms (referred to in Zheng et al., (2022) as “time constants”). When fitting the isotopic labeling data to generate a flux map without any metabolite pool sizes constrained by experimental measurements, the pool size estimates essentially capture the error introduced by these time constants (Zheng et al., 2022).
In Xu et al., (2022), I use the polyexponential modeling approach to essentially reveal these unmodeled processes without the use of pool size data. It is possible that the routine inclusion of pool size measurements as a way of detecting model misspecifications, which I advocate for in Chapter 1, would make the polyexponential modeling approach redundant as a way of detecting these unmodeled factors, but further investigation will be needed to confirm whether this is the case. Future work should also look into whether this polyexponential modeling method can be fruitfully applied to other systems. I use the polyexponential modeling approach in Chapter 2 to characterize how many processes acting over different time scales are influencing the labeling of CBC intermediates. I then corroborate the findings from the polyexponential modeling with our 13C-MFA results. In that same study, I applied the same polyexponential modeling approach to Nicotiana tabacum data gathered by Xinyu Fu and colleagues in Fu et al., (2023) and found similar patterns. This corroborated our findings by suggesting that the cycling of cytosolic and vacuolar sugars may occur in N. tabacum as well. However, since I did not do a 13C-MFA incorporating the vacuolar sugar exchange and demonstrate that this resulted in a statistically significantly better model fit, I have not yet provided equivalent evidence of the operation of such a cytosolic-to-vacuolar sugar recycling pathway in any plant other than C. sativa. Future work should attempt to provide such evidence, with N. tabacum and A. thaliana representing the obvious candidates for such follow-up studies due to the presence of leaf CBC intermediate isotopic labeling datasets in these systems. In order to gauge how conserved this recycling phenomenon is, though, N. tabacum would be the system of greater biological significance. This is because A. thaliana and C. sativa are very closely related, so if the phenomenon was found in A.
thaliana as well, it would raise the question of whether it is conserved broadly in land plants, or just in this very specific lineage of the Brassicaceae.

5.2.2. Considering metabolic network structure in addition to omic datasets can result in drastically improved predictive power, but further work is necessary to demonstrate general applicability of this principle

As noted by Schwender et al. (2014), there is a very poor correlation between changes in transcript abundance and changes in flux when comparing a particular tissue – in the case of Schwender et al. (2014), plant embryo tissues in different growth media – under different conditions. Indeed, the large number of biochemical and regulatory processes that intervene between transcript, or even protein, accumulation and flux makes the use of transcripts or proteins as an input data type for predicting fluxes questionable. Despite this, the work presented in Kaste and Shachar-Hill (2023b) and Chapter 3 suggests that it is possible to extract signal from transcriptomic and proteomic datasets for flux prediction. One key difference between the approaches taken by Schwender et al. (2014) and Kaste and Shachar-Hill (2023b) is that the latter places omic abundances in the context of the entire metabolic network, constraining the influence of the omic data by other metabolic necessities like the accumulation of measured amounts of biomass. However, the datasets analyzed by Schwender et al. (2014) and Kaste and Shachar-Hill (2023b) are also quite different. In order to provide stronger evidence that it is the consideration of the omic data in the context of a network that allows these data, despite low correlation with differences in flux, to generate accuracy improvements, it should be demonstrated that there is actually a poor correlation between flux differences and omic differences in Kaste and Shachar-Hill (2023b).
This could be done with the FBA predictions in the multiple modeled tissues alone, or as part of a broader study where MFA flux maps are generated for the non-photosynthetic tissues as well.

Although successful, this method has thus far only been shown to work in A. thaliana, and a crucial next step to demonstrate its utility would be to show efficacy in another system. Moreover, the MFA-to-FBA flux map comparison was only possible for leaf tissues using values reported by Ma et al. (2014). If MFA flux maps of the non-photosynthetic stem and root tissues of A. thaliana could be generated under similar conditions, I might be able to evaluate the FBA flux maps for those tissues as well.

A number of simplifications were employed when developing and evaluating the algorithm described in Chapter 3 and Kaste and Shachar-Hill (2023b), which could benefit from further evaluation. Although the base model of A. thaliana used to build the multi-tissue model evaluated in the study (Arnold and Nikoloski, 2014) contained GPR terms with detailed enzyme complex stoichiometries, this stoichiometric detail was ignored by the algorithm. Rewriting the algorithm to incorporate these stoichiometric ratios and evaluating whether it results in improved accuracy could be a fruitful future research project. Additionally, I ran into computational constraints when attempting to perform uniform random sampling of the flux solution spaces generated by our FBA optimizations. In short, the multi-tissue models I was optimizing were too large to efficiently sample. Because of this, when reporting weighted average error values, I calculated best- and worst-case values (i.e., minimum and maximum possible errors) using the maximum and minimum possible values for each flux, given the model constraints, obtained using FVA (Mahadevan and Schilling, 2003).
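The difference between error bounds built from per-flux FVA ranges and the errors actually attainable by a single feasible flux map can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions: the branch-point network, the reference values standing in for MFA estimates, the unweighted absolute error, and all function names are hypothetical and are not taken from Kaste and Shachar-Hill (2023b). Because the toy feasible set is a line segment, it is enumerated on a grid rather than solved with a linear programming package.

```python
# Toy branch point (hypothetical numbers): 10 units of uptake split into v1 and v2,
# so every feasible flux map satisfies v1 + v2 = 10 with v1, v2 >= 0.
TOTAL = 10.0
REFERENCE = {"v1": 2.0, "v2": 3.0}  # stand-in "MFA" estimates, invented for illustration


def feasible_flux_maps(n_points=1001):
    """Enumerate feasible flux maps along the line v1 + v2 = TOTAL."""
    for i in range(n_points):
        v1 = TOTAL * i / (n_points - 1)
        yield {"v1": v1, "v2": TOTAL - v1}


def total_abs_error(fluxes, reference):
    """Unweighted sum of absolute differences from the reference fluxes."""
    return sum(abs(fluxes[name] - reference[name]) for name in reference)


def box_error_bounds(ranges, reference):
    """Best/worst-case error when each flux is set independently within its range."""
    best = worst = 0.0
    for name, (low, high) in ranges.items():
        ref = reference[name]
        closest = min(max(ref, low), high)  # nearest in-range value to the reference
        best += abs(closest - ref)
        worst += max(abs(low - ref), abs(high - ref))
    return best, worst


flux_maps = list(feasible_flux_maps())

# FVA-style ranges: the minimum and maximum of each flux over the feasible set.
fva_ranges = {
    name: (min(m[name] for m in flux_maps), max(m[name] for m in flux_maps))
    for name in REFERENCE
}

# Bounds from treating fluxes independently versus bounds over whole flux maps.
box_best, box_worst = box_error_bounds(fva_ranges, REFERENCE)
true_best = min(total_abs_error(m, REFERENCE) for m in flux_maps)
true_worst = max(total_abs_error(m, REFERENCE) for m in flux_maps)

print(f"box bounds:  [{box_best:.1f}, {box_worst:.1f}]")
print(f"true bounds: [{true_best:.1f}, {true_worst:.1f}]")
```

In this toy case the box-derived interval is [0, 15] while the errors attainable by feasible flux maps span only [5, 11], because the flux combinations that achieve the box extremes violate the mass balance.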
The FVA-based approach is suboptimal because not every combination of maximal and minimal values for each flux from FVA will represent a valid solution to the optimization problem. As a result, the real range of possible weighted average error values is almost certainly narrower than what is reported in Kaste and Shachar-Hill (2023b). The use of more powerful computational resources to overcome the numerical difficulties posed by the uniform random sampling method, or an improved process by which the FVA-derived upper and lower bounds for each value are iteratively perturbed until a valid flux solution is generated, could be used in a future study to overcome this limitation.

5.2.3. Spatially-resolved reaction-diffusion modeling allows for powerful investigations of photosynthetic metabolism, but is limited by computational power

Along similar lines, computational limitations also affected the depth of analysis possible in the work reported in Chapter 4 and Kaste et al. (2024). As described in that chapter, a number of geometric simplifications were made to make solving the spatial reaction-diffusion models in that study numerically tractable, given the large parameter explorations I performed. In addition, these parameter explorations were limited to two-dimensional, or at most very coarse three-dimensional, spaces. This stands in contrast with an extensive 10-plus-dimensional parameter exploration I am currently performing together with my colleague Anne Steensma on a forthcoming study on which I am co-first author. There, the relative computational simplicity of compartmental models allowed for a substantially more thorough parameter exploration. In order to achieve something similar, a follow-up on the work presented in Chapter 4 could derive analytical solutions for the models.
One limitation of this approach is that derivation of such analytical solutions often requires some mathematical simplifications, as demonstrated in Fei et al. (2022), where the spontaneous (i.e., not CA-mediated) interconversion of CO2 and HCO3− was omitted because it was incompatible with obtaining an analytical solution. An alternative, if too many such simplifications would be necessary, would be to distribute the Virtual Cell platform’s calculations to a local High Performance Computing Cluster (HPCC). By default, the Virtual Cell distributes simulation jobs to the HPCC at the University of Connecticut. However, the software imposes a limit of forty jobs per user. By implementing the Virtual Cell on a local university cluster, substantially larger numbers of jobs could be run simultaneously, opening the door for larger parameter explorations and deeper analysis of the model.

Even relatively sophisticated models of similar systems have had to employ simplifications and idealizations (McGrath and Long, 2014; Fei et al., 2022) and the parameter explorations and sensitivity analyses they employ are heavily hypothesis-driven, with only a small number of parameters varied and investigated. As the computational power available to research groups continues to grow, there may be great value in revisiting these existing models and rerunning analyses to better characterize the robustness of previous results and conclusions. Such investigations may reveal surprising interactions between geometric or enzymatic parameters and deepen our understanding of what factors contribute to photosynthetic efficiency and productivity.

REFERENCES

Arnold A, Nikoloski Z (2014) Bottom-up metabolic reconstruction of Arabidopsis and its application to determining the metabolic costs of enzyme production.
Plant Physiology 165: 1380–1391
Fei C, Wilson AT, Mangan NM, Wingreen NS, Jonikas MC (2022) Modelling the pyrenoid-based CO2-concentrating mechanism provides insights into its operating principles and a roadmap for its engineering into crops. Nature Plants 8: 583–595
Fu X, Gregory LM, Weise SE, Walker BJ (2023) Integrated flux and pool size analysis in plant central metabolism reveals unique roles of glycine and serine during photorespiration. Nature Plants 9: 169–178
Kaste JAM, Walker BJ, Shachar-Hill Y (2024) Biophysical carbon concentrating mechanisms in land plants: insights from reaction-diffusion modeling. bioRxiv 2024.01.04.574220
Kaste JAM, Shachar-Hill Y (2023a) Model validation and selection in metabolic flux analysis and flux balance analysis. Biotechnology Progress: e3413
Kaste JAM, Shachar-Hill Y (2023b) Accurate flux predictions using tissue-specific gene expression in plant metabolic modeling. Bioinformatics: btad186
Ma F, Jazmin LJ, Young JD, Allen DK (2014) Isotopically nonstationary 13C flux analysis of changes in Arabidopsis thaliana leaf metabolism due to high light acclimation. Proceedings of the National Academy of Sciences of the United States of America 111: 16967–16972
Mahadevan R, Schilling CH (2003) The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metabolic Engineering 5: 264–276
McGrath JM, Long SP (2014) Can the cyanobacterial carbon-concentrating mechanism increase photosynthesis in crop species? A theoretical analysis. Plant Physiology 164: 2247–2261
Nöh K, Wiechert W (2011) The benefits of being transient: isotope-based metabolic flux analysis at the short time scale. Applied Microbiology and Biotechnology 91: 1247–1265
Schwender J, König C, Klapperstück M, Heinzel N, Munz E, Hebbelmann I, Hay JO, Denolf P, De Bodt S, Redestig H, et al (2014) Transcript abundance on its own cannot be used to infer fluxes in central metabolism.
Frontiers in Plant Science 5: 1–16
Xu Y, Wieloch T, Kaste JAM, Shachar-Hill Y, Sharkey TD (2022) Reimport of carbon from cytosolic and vacuolar sugar pools into the Calvin-Benson cycle explains photosynthesis labeling anomalies. Proceedings of the National Academy of Sciences 119: e2121531119
Zheng AO, Sher A, Fridman D, Musante CJ, Young JD (2022) Pool size measurements improve precision of flux estimates but increase sensitivity to unmodeled reactions outside the core network in isotopically nonstationary metabolic flux analysis (INST-MFA). Biotechnology Journal 17: 1–17

Chapter 6 Additional Studies: Integrative Teaching of Metabolic Modeling and Flux Analysis with Interactive Python Modules

This research was published in: J. A. M. Kaste, A. Green, Y. Shachar-Hill, Integrative Teaching of Metabolic Modeling and Flux Analysis with Interactive Python Modules. Biochemistry and Molecular Biology Education 51(6): 653-661 (2023).

6.1. Preface

My interest in teaching metabolic modeling – the subject of this chapter – stems from two sources, one practical and one theoretical. The practical reason is that I found learning the underlying theory quite difficult and many concepts of central importance to metabolic modeling can take a great deal of time and effort to properly grasp. The theoretical reason is that the interrelationship between different modeling approaches – enzyme-based simulations, FBA, and MFA – is not emphasized or discussed very strongly in the literature, and the communities that have formed around these different techniques do not seem to interact extensively. As I have discussed in earlier chapters, there is a lot to be gained from comparing and integrating these different ways of looking at the same systems, so I saw great value in developing educational resources that put this idea front and center for learners right from the get-go. Dr.
Shachar-Hill has run an annual intensive workshop series on metabolic modeling at Michigan State University for many years now. Previous iterations have used Excel spreadsheets and proprietary programs like INCA for demonstrating enzyme-based kinetic and constraint-based modeling to learners. I decided to develop interactive Python notebooks that allowed learners to more easily interface with and manipulate their metabolic modeling simulations. I enlisted the help of a bright undergraduate researcher in our lab, Antwan Green, in doing this work. I set up pre- and post-workshop surveys to assess learners’ experiences and we found that the combination of the workshop lecture material and these interactive simulations resulted in positive outcomes. We decided to package these simulations together with lesson plans and lecture notes into a freely available GitHub repository for teachers and learners to access, and also wrote a manuscript describing what we put together, which I present in this chapter.

I came up with the concept for these simulations and wrote most of the code for the project, with assistance from Antwan Green. I also ran all logistics related to the survey component of the study, including getting our IRB exemption for the study approved, and wrote up all of the lecture notes and lesson plans. I wrote the manuscript with input and editing from Antwan Green and Dr. Shachar-Hill. The manuscript presented in this chapter has been published in the journal Biochemistry and Molecular Biology Education (Kaste et al., 2023).

Rather than a one-and-done study, I see this as a first step towards building a robust set of metabolic modeling learning resources. In future iterations of the workshop and in these learning materials, Dr. Shachar-Hill and I plan on interweaving the lecture and interactive sections more seamlessly so that learners are running examples and engaging in active learning, reinforcing concepts they were just exposed to.
Although the materials, as presented, already represent a big step up from existing learning resources for these topics, there is always room for improvement.

6.2. Abstract

The modeling of rates of biochemical reactions – fluxes – in metabolic networks is widely used for both basic biological research and biotechnological applications. A number of different modeling methods have been developed to estimate and predict fluxes, including kinetic and constraint-based (Metabolic Flux Analysis and Flux Balance Analysis) approaches. Although different resources exist for teaching these methods individually, to date no resources have been developed to teach these approaches in an integrative way that equips learners with an understanding of each modeling paradigm, how they relate to one another, and the information that can be gleaned from each. We have developed a series of modeling simulations in Python to teach kinetic modeling, Metabolic Control Analysis, 13C-Metabolic Flux Analysis and Flux Balance Analysis. These simulations are presented in a series of interactive notebooks with guided lesson plans and associated lecture notes. Learners assimilate key principles using models of simple metabolic networks by running simulations, generating and using data, and making and validating predictions about the effects of modifying model parameters. We used these simulations as the hands-on computer laboratory component of a four-day metabolic modeling workshop and participant survey results showed improvements in learners’ self-assessed competence and confidence in understanding and applying metabolic modeling techniques after having attended the workshop. The resources provided can be incorporated in their entirety or individually into courses and workshops on bioengineering and metabolic modeling at the undergraduate, graduate, or postgraduate level.

6.3. Introduction

Metabolic modeling provides scientists with a quantitative description of the in vivo rates of biochemical reactions in biological networks. These rates of biochemical reactions – fluxes – are a function of many layers of cellular regulation (transcriptional, translational, post-translational, etc.) and relate directly to the living system’s functional phenotype. Understanding metabolic flux thus provides important insights into biological systems and underlies efforts to rationally modify their metabolism to suit our biotechnological needs (Nielsen, 2003).

Fluxes in metabolic pathways and networks cannot be directly measured, necessitating the use of mathematical modeling approaches to estimate or predict them. These approaches can be broadly categorized into kinetic and constraint-based methods. Within both categories, methods exist both for predicting fluxes and for estimating them from experimental data. Kinetic methods involve simulating the dynamically changing fluxes and metabolite concentrations in a metabolic network over time (Saa and Nielsen, 2017), whereas constraint-based methods like Flux Balance Analysis (FBA) (Orth et al., 2010b) and Metabolic Flux Analysis (MFA) (Antoniewicz, 2015) estimate steady-state fluxes using linear optimization principles or experimentally-measured isotopic labeling data.

Metabolic modeling, and particularly constraint-based modeling, has been used productively to aid in biotechnological applications. For example, Metabolic Flux Analysis techniques using isotopic labeling informed the engineering of the bacterium Corynebacterium glutamicum to produce high concentrations of lysine (Koffas et al., 2003; Koffas and Stephanopoulos, 2005; Becker et al., 2011).
Flux Balance Analysis has been deployed to improve the microbial production of a number of bioproducts, including threonine (Lee et al., 2007) and valine (Park et al., 2007), and in ambitious reengineering efforts like that described in Gleizer et al. (2019), where FBA and related methods, including that of Burgard et al. (2003), were used to enable engineering of normally heterotrophic Escherichia coli to incorporate CO2 into its biomass using a heterologously expressed Calvin-Benson Cycle. These and an increasing number of other metabolic modeling applications indicate that this is an area that is of great value to learners and practitioners in biology, biochemistry, and chemical engineering.

Related to kinetic metabolic analysis, Metabolic Control Analysis (MCA) provides mathematical tools for understanding how control over flux and internal metabolite concentrations is distributed between the enzymes in a biochemical network (Fell, 1992; Moreno-Sánchez et al., 2008). Like metabolic flux modeling and mapping, the questions addressed by Metabolic Control Analysis have major biotechnological implications. We believe it therefore makes sense to introduce and teach concepts in MCA along with kinetic and constraint-based metabolic modeling techniques.

Although previous studies have described and provided resources for teaching kinetic metabolic modeling (Armando et al., 2009), FBA (Orth et al., 2010b; Chaves et al., 2022), MFA (Wong et al., 2004; Wong and Barford, 2010), and MCA (Snoep et al., 1999; Rodríguez-Caso et al., 2002; Angelani et al., 2018), there are not any published and freely available instructional resources for introducing these toolsets to learners in an integrative and interactive fashion.
Moreover, although papers and books exist describing how to experimentally approach 13C-MFA (Crown et al., 2012; Dieuaide-Noubhani and Alonso, 2014; Krömer et al., 2014; Antoniewicz, 2018) or the theoretical background behind the technique (Stephanopoulos et al., 1998; Ratcliffe and Shachar-Hill, 2006), we are not aware of any dedicated and published educationally focused resources for introducing learners to the theoretical background behind label-assisted MFA. We believe introducing learners to all of these major areas of metabolic modeling together allows them to appreciate their interconnections and better evaluate what approach(es) may be useful to their own research and/or engineering goals than if they encounter them in isolation.

To address this gap in the biochemistry education literature, we developed a series of interactive Python-based Jupyter notebooks featuring exercises that give learners hands-on experience with kinetic modeling, FBA, MFA, and MCA. These notebooks were used as the hands-on laboratory exercises for the 2022 iteration of an annual metabolic modeling workshop at Michigan State University. To assess the efficacy of the workshop and the interactive exercises, surveys were distributed to participants – a mix of graduate students and postdoctoral researchers – before, immediately after, and four months after the workshop to measure self-assessed competence and confidence in metabolic modeling techniques and in the application of these techniques to learners’ own research questions. Although the materials are structured with a particular sequence and timeline, the individual notebooks, paired with appropriate lecture material, contain sufficient explanation to be flexibly incorporated into different course or workshop structures.

6.4. Methods

6.4.1. Exercise development

All simulation code was written in Python and packaged and presented in Jupyter notebooks (Kluyver et al., 2016).
NumPy (Harris et al., 2020) and SciPy (Virtanen et al., 2020) were used to handle data import and export and to calculate control coefficients for MCA. Interactive elements were incorporated into the notebooks using the ipywidgets package. MFA simulations were run in Python using the package mfapy (Matsuda et al., 2021) and FBA simulations were run using cobrapy (Ebrahim et al., 2013). For the FBA exercises, the genome-scale model of E. coli’s metabolic network iJO1366 (Orth et al., 2011) was used along with a smaller “core” model of E. coli’s metabolic network (Orth et al., 2010a). Several example networks from Ratcliffe and Shachar-Hill (2006) were adopted for demonstration purposes throughout the notebooks.

Time-courses of metabolite concentrations, fluxes, and labeling were generated in kinetic simulations featuring reversible or irreversible first-order and Michaelis-Menten kinetics. Euler’s method was used to generate all concentration, flux, and labeling values. In most of the simulations that feature labeling, all metabolites are treated as having only one labelable position, so the proportion of labeled and unlabeled metabolite is tracked. In the simulations in the notebook for Day 4 (see Table 6.1), both one- and two-carbon molecules are present, so the quantities of unlabeled, half-labeled, and fully-labeled species for each metabolite are calculated and tracked independently to allow for comparison with 13C-MFA flux map results.

6.4.2. Survey ethics and analysis

The survey component of this study was deemed exempt by the Michigan State University Office of Research Regulatory Support. Survey respondents were asked to self-assess their confidence in and understanding of kinetic and constraint-based metabolic modeling methods and the application of these methods to their own research goals on a Likert scale (Likert, 1932). Survey responses were gathered from workshop participants before, immediately after, and four months after the workshop.
The survey instruments can be found in the supplemental materials. One-sided Mann-Whitney U tests (Neuhäuser, 2011) were used to compare pre- and post-workshop responses, where our null hypothesis was that there is no difference between the pre- and post-workshop responses and our alternative hypothesis was that the post-workshop responses were higher than the pre-workshop responses. We evaluated each question with α = 0.05.

6.5. Results and discussion

6.5.1. Educational Jupyter notebooks

We developed a series of four Jupyter notebooks covering various aspects of kinetic and constraint-based metabolic modeling and metabolic control analysis. A graphical summary of the different areas of metabolic modeling covered and their relationships is shown in Figure 6.1. In addition to learning the theory behind these methods, learners are exposed to the key concepts for successful applications of flux modeling listed below. We also note in Table 6.1 and the lesson plans when an exercise can be used to teach one of these concepts.

• Concept 1: The relationship between the noise and time resolution of experimental data and the confidence one can have in parameter estimates and assumed model architectures.
• Concept 2: The uniqueness and identifiability of flux estimates in FBA and 13C-MFA and their relationship to model complexity.
• Concept 3: The distribution of control over fluxes and concentrations in a network across the reactions of that network.

Figure 6.1: Metabolic modeling topics covered in the resources presented in this study. A majority of the techniques covered – kinetic modeling, FBA, and 13C-MFA – are used to estimate or predict fluxes through a metabolic network. MCA, on the other hand, is used to analyze the effects of enzyme activities/concentrations and metabolite or regulator concentrations on the flux through the network.
Within the flux estimation/prediction techniques, kinetic modeling can be used to estimate fluxes and metabolite concentrations in systems whether they are in steady-state or not (dynamic systems where concentrations are still changing). The constraint-based modeling techniques of FBA and 13C-MFA, on the other hand, rely on an assumption of metabolic steady-state, as does MCA.

These concepts are necessary both to effectively conduct any experiment or study involving flux analysis and to understand the primary metabolic modeling literature. They are often not intuitively obvious, and the first two also receive rather little attention in the teaching or research literature. The concepts are therefore explained in the lecture notes, revisited throughout the Jupyter notebooks and demonstrated with hands-on exercises. For example, in Exercises 4.0 – 4.2 in the Day 4 Jupyter Notebook, learners gain insight into Concept 1 by first using a kinetic model to generate simulated labeling data and then attempting to fit it using both correctly and incorrectly specified network models using 13C-MFA. By doing so, the learners can observe the difference in 13C-MFA fits when using the correct or incorrect model specification and how this difference can be obscured even by low levels of experimental noise. This allows instructors to highlight important issues concerning data quality and to discuss model selection, which is rarely addressed in the literature (Sundqvist et al., 2022).

Table 6.1: A table describing the contents of the interactive exercises presented in this publication. Descriptions of key concepts are outlined in the text.

Day | Section(s) | Contents
1 | 1.0 – 1.2 | Introduction to the JupyterLab Interface.
1 | 2.0 – 2.2 | Exploration of a simulation demonstrating first-order kinetics.
1 | 3.0 – 3.1 | Exercise on inferring kinetic parameters from example datasets.
1 | 4.0 | Exercise demonstrating the relationship between model architecture and the information contained in each datapoint.
1 | 5.0 – 5.1 | Introduction to metabolic steady-state and the utility of labeling data.
2 | 1.0 | Introduction to reversible first-order kinetic models.
2 | 2.0 | Exercise on inferring model parameters in the presence of reversibility.
2 | 3.0 – 3.4 | Exploration of metabolic control analysis, including calculation of flux and concentration control coefficients as well as elasticities.
2 | 4.0 | Comparison of results gathered in 3.0 – 3.4 “by hand” with results from an automated MCA script.
3 | 1.0 – 1.2 | Metabolic control analysis with branching networks, negative control coefficients, and modeling a system with an incomplete network description.
3 | 2.0 – 2.2 | Kinetic modeling with Michaelis-Menten kinetics.
3 | 3.0 | Fitting a dataset using either first-order or Michaelis-Menten kinetics in the presence or absence of noise.
3 | 4.0 | Kinetic modeling with reversible Michaelis-Menten kinetics.
3 | 5.0 | Using MCA to calculate response coefficients.
4 | 1.0 – 1.2 | A kinetic simulation that incorporates labeling dynamics, for comparison with 13C-MFA and FBA.
4 | 2.0 | Introduction to FBA modeling.
4 | 3.0 – 3.3 | Introduction to FVA and randomized sampling methods in FBA.
4 | 4.0 – 4.2 | Introduction to 13C-MFA and comparison with results from 1.0 – 1.2.
4 | 5.0 | Discussion about incorporating metabolic modeling into one’s own work and/or research.
Concept (values in listed order): 1 1 1 3 1 3 1 3 2 2 2, 1

The subjects covered in the sections of each notebook with the timeline for a four-day workshop are given in Table 6.1. On the first and second days, learners are given an extensive introduction to kinetic modeling theory and exercises before learning about MCA, FBA, and 13C-MFA. We do this to allow learners to gain both a theoretical and practical understanding of the dynamic ways that matter moves through biochemical networks. The hands-on experience exposes learners to the sometimes surprisingly complex behavior of even simple networks governed by systems of Ordinary Differential Equations (ODEs).
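As a rough illustration of the kind of simple ODE-governed network referred to here, the following minimal sketch integrates a hypothetical pathway A ⇌ B → C with first-order kinetics using Euler's method. The pathway, the rate constants, the step size, and the function name `simulate_pathway` are assumptions chosen for illustration and are not taken from the workshop notebooks.

```python
def simulate_pathway(k_forward, k_reverse, k_out, a0=1.0, dt=0.01, t_end=20.0):
    """Euler integration of A <-> B -> C with first-order kinetics.

    Returns the list of time points and the list of (A, B, C) concentrations.
    """
    n_steps = int(round(t_end / dt))
    a, b, c = a0, 0.0, 0.0
    times = [0.0]
    trajectory = [(a, b, c)]
    for step in range(1, n_steps + 1):
        # Fluxes are evaluated at the start of the step (explicit Euler).
        net_ab = k_forward * a - k_reverse * b  # net flux from A to B
        flux_bc = k_out * b                     # irreversible flux from B to C
        a += -net_ab * dt
        b += (net_ab - flux_bc) * dt
        c += flux_bc * dt
        times.append(step * dt)
        trajectory.append((a, b, c))
    return times, trajectory


# Illustrative (hypothetical) rate constants in arbitrary units.
times, trajectory = simulate_pathway(k_forward=1.0, k_reverse=0.5, k_out=0.3)
b_values = [state[1] for state in trajectory]
peak_index = b_values.index(max(b_values))
print(f"B peaks at t = {times[peak_index]:.2f} and then declines")
```

Even in this three-metabolite case the intermediate B rises and then falls, so its time course cannot be described by a single exponential. Explicit Euler steps are only reliable when `dt` is small relative to the fastest rate constant, which is one reason the step size is exposed as a parameter.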
The hands-on kinetic work is aimed at giving learners a strong sense of the dynamics of metabolic systems before learning about steady-state approaches, in which simplifications of the kinetic state allow powerful analyses in 13C-MFA and FBA. MCA is explored in the second and third days and MCA calculations of flux and concentration control coefficients are discussed. Control coefficients are connected to the understanding of reversible first-order kinetics participants gained from the preceding kinetic modeling exercises. Lastly, participants are introduced to constraint-based methods by analyzing the same network structure using kinetic modeling, FBA, and 13C-MFA. This highlights the different inputs needed and the resulting outputs from each technique. To our knowledge, this is the first such cross-comparison of different metabolic modeling techniques presented in the teaching literature, and we believe this will be of value to instructors introducing this material to their students and trainees.

Interactive sliders and drop-down menus were incorporated into all of the notebooks to allow learners to modify parameters, run simulations and visualize their results. The notebooks also allow learners to expose the underlying simulation code, so that those with a modest background in Python or general coding can see how the simulations function and potentially modify the model structures. By default the code is not visible, making the notebooks approachable for participants interested in using metabolic modeling without engaging with the underlying code. We believe that the incorporation of these interactive modules into the notebooks will make the resources presented in this publication usable by learners with little to no coding knowledge. In writing the notebooks, special attention was given to commenting the Python code used to run the simulations and interactive interface elements.
We believe the extensive commenting used in these notebooks, together with the use of intuitive and easy-to-understand methods for implementing the simulations, will make the notebooks easy for instructors to adopt and easy for learners interested in the underlying code to understand. This is in contrast to software like COPASI that, while very powerful, obscures the underlying simulation logic (Hoops et al., 2006).

Installation and compatibility issues are commonplace when using computational resources, particularly when workshop or class participants are asked to run code or software on their own computers. To further ensure maximal useability of these resources by instructors, detailed installation instructions for Windows, MacOS, and Linux systems, with the specific version numbers needed to successfully run all of the notebooks, are provided with the notebooks.

Table 6.2: Quantitative pre- and post-workshop survey results evaluating learners’ self-assessed confidence and competence in metabolic modeling techniques.

Question  Pre-workshop Median  Post-workshop Median  Significant improvement?a
4 3 3 4 2 2 2 3 4
Significant Significant Significant Not Significant
I feel confident in applying and incorporating metabolic modeling techniques to my research question(s).
I feel confident in evaluating the results of a metabolic modeling study or a study that incorporates metabolic modeling.
I feel confident in identifying metabolic modeling techniques and software that I can apply to my research question(s).
I understand the purpose(s) of metabolic modeling.
I can describe kinetic metabolic modeling, what information it can provide, and its limitations.
I can describe Metabolic Flux Analysis, what information it can provide, and its limitations.
I can describe Flux Balance Analysis, what information it can provide, and its limitations.
I understand the data types I would need to carry out kinetic metabolic modeling.
I understand the data types I would need to carry out Metabolic Flux Analysis.
I understand the data types I would need to carry out Flux Balance Analysis.
I can name the language(s) or software package(s) I would use to incorporate metabolic modeling into my own research.
I can critically evaluate the application and results of metabolic modeling in publications and presentations relevant to my area of research.
a Statistically significant improvement was defined by rejection of the null hypothesis by the one-sided Mann-Whitney U test (Neuhäuser, 2011) at α = 0.05.
Significant Significant Significant Significant Significant Significant Significant Significant
2.5 2.5 3 2 2 3 2 4 4 4 4 4 4 4 4

6.5.2. Implementation in workshop and survey results

The Jupyter notebooks were incorporated into a four-day workshop held at Michigan State University in May 2022. Participants in the workshop included graduate students and postdoctoral researchers. Each day of the workshop consisted of three hours of lecture in the morning and a three-hour hands-on period for computational exercises. Due to time constraints and interest among the participants in constraint-based modeling approaches – particularly label-assisted flux mapping using MFA – the third day’s notebook exercises were omitted and replaced with the fourth day’s exercises on constraint-based modeling. The last day of the workshop was used for an open-ended discussion about participants’ research aims and how they could incorporate what they learned in the workshop into their own work. For instructors interested in incorporating not only the computational resources developed for the workshop, but also all or portions of the lecture material, detailed lecture notes have been provided online at https://github.com/Gibberella/Metabolic-Modeling-Lessons.
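The one-sided Mann-Whitney U test cited in the footnote to Table 6.2 can be illustrated with a short, self-contained sketch. The responses below are hypothetical five-point Likert scores (1 = Strongly Disagree, 5 = Strongly Agree) and are not the workshop data; the function names are likewise illustrative. The exact permutation p-value computed here is one standard way to evaluate the test for small samples with ties, and the software actually used for the analysis is not stated in the text.

```python
from itertools import combinations


def midranks(values):
    """Rank pooled values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_greater(pre, post):
    """One-sided test of H1: post-workshop scores tend to exceed pre-workshop scores.

    Returns (U statistic for the post group, exact permutation p-value).
    """
    pooled = list(pre) + list(post)
    ranks = midranks(pooled)
    n_post = len(post)
    observed = sum(ranks[len(pre):])  # rank sum of the post group
    as_extreme = total = 0
    # Enumerate every way of labelling n_post of the pooled responses as "post".
    for labels in combinations(range(len(pooled)), n_post):
        total += 1
        if sum(ranks[i] for i in labels) >= observed:
            as_extreme += 1
    u_post = observed - n_post * (n_post + 1) / 2
    return u_post, as_extreme / total


# Hypothetical responses to a single survey question (NOT the study data).
pre_scores = [2, 2, 3, 1, 2, 4, 3, 2, 2, 3, 1, 2]  # N = 12
post_scores = [4, 3, 4, 5, 4, 3, 4]                # N = 7
u_stat, p_value = mann_whitney_greater(pre_scores, post_scores)
print(f"U = {u_stat}, one-sided p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

Because the pre- and post-workshop groups are treated as independent samples of unequal size, an unpaired rank test of this kind fits the survey design, in which respondents were not matched between the two surveys.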
The pre- and post-workshop survey results suggest that participants felt they gained greater confidence in and knowledge of metabolic modeling over the course of the workshop (Table 6.2). Our survey evaluated participants’ self-assessed confidence and competence but did not ask participants to attribute their comprehension gains to the lecture or hands-on components. In a free-response question (“What did you find useful about the workshop?”), one participant responded, “Understanding what goes into metabolic modeling, learning how to critically appraise these models in published literature, and beginning to learn how to implement them into our own projects.” In response to that same question, another participant focused more specifically on FBA: “The hands-on use of cobrapy was very helpful. This helped me understand how one goes about metabolic modeling.” It should be noted, however, that the sample sizes for the study were small and that we had fewer respondents in the post-workshop survey than in the pre-workshop survey (N = 12 in the pre-workshop survey and N = 7 in the post-workshop survey). Because of this, the results may be skewed by survivorship bias if learners who were no longer interested in the topic, or who were unhappy with the presentation of the material, left the workshop and did not participate in the post-workshop survey.

Multiple respondents noted that they would have liked to have worked with real datasets in the exercises rather than simulated ones. Given the modifiability and extensive annotation of the notebooks provided, we encourage instructors using the provided resources to add analyses of real datasets that are relevant to their specific audience. We believe this will help provide real-world context for learners as they carry out the exercises.

Although we have packaged and used the materials presented in the context of an intensive workshop, we believe the materials can be adapted to a variety of teaching circumstances.
The Jupyter-based simulations could be used for computer lab sessions in a semester-long course, for example, or used as an interactive demonstration in a lecture setting. With the relevant theory taught beforehand, these resources may also be appropriate for undergraduate learning. As noted, the extensive annotation of the code, paired with the easy-to-use graphical interface for the exercises, also makes them suitable both for learners with extensive prior knowledge of programming and for those with none.

6.6. Conclusions

Recognizing the absence of resources for teaching the major areas and techniques of metabolic modeling and flux analysis in an integrative fashion, we have developed a set of resources that should be readily adoptable by instructors, students, and researchers alike for teaching and learning. By emphasizing the legibility and cross-platform useability of our code, we hope the resources presented in this study can be used and incorporated by the broader teaching community into other workshop and class settings.

6.7. Data and code availability statement

All code, documentation, lecture notes, and lesson plans developed for the present study can be found at https://github.com/Gibberella/Metabolic-Modeling-Lessons.

6.8. Acknowledgements

This research was supported by the Office of Science (BER), U.S. Department of Energy, Grant no. DE-SC0018269 (J.A.M.K., A.G., Y.S-H.). This work is supported, in part, by the NSF Research Traineeship Program (Grant DGE-1828149) to J.A.M.K. This publication was also made possible by a predoctoral training award to J.A.M.K. from Grant T32-GM110523 from National Institute of General Medical Sciences (NIGMS) of the NIH. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIGMS or NIH. We would like to acknowledge the Pathways to Research program at Michigan State University for its support of A.G.
We would also like to acknowledge Veronica Greve for her input on the survey component of this study.

REFERENCES

Angelani CR, Carabias P, Cruz KM, Delfino JM, de Sautu M, Espelt MV, Ferreira-Gomes MS, Gómez GE, Mangialavori IC, Manzi M, et al (2018) A metabolic control analysis approach to introduce the study of systems in biochemistry: the glycolytic pathway in the red blood cell. Biochemistry and Molecular Biology Education 46: 502–515

Antoniewicz MR (2015) Methods and advances in metabolic flux analysis: a mini-review. Journal of Industrial Microbiology and Biotechnology 42: 317–325

Antoniewicz MR (2018) A guide to 13C metabolic flux analysis for the cancer biologist. Experimental and Molecular Medicine. doi: 10.1038/s12276-018-0060-y

Armando RP, Francisca SJ, Medina MÁ (2009) First steps in computational systems biology: A practical session in metabolic modeling and simulation. Biochemistry and Molecular Biology Education 37: 178–181

Becker J, Zelder O, Häfner S, Schröder H, Wittmann C (2011) From zero to hero-Design-based systems metabolic engineering of Corynebacterium glutamicum for l-lysine production. Metabolic Engineering 13: 159–168

Burgard AP, Pharkya P, Maranas CD (2003) OptKnock: A Bilevel Programming Framework for Identifying Gene Knockout Strategies for Microbial Strain Optimization. Biotechnology and Bioengineering 84: 647–657

Chaves GL, Batista RS, de Sousa Cunha J, Altmann DL, da Silva AJ (2022) Teaching cellular metabolism using metabolic model simulations. Education for Chemical Engineers 38: 97–109

Crown SB, Ahn WS, Antoniewicz MR (2012) Rational design of 13C-labeling experiments for metabolic flux analysis in mammalian cells. BMC Systems Biology 6: 43

Dieuaide-Noubhani M, Alonso AP (2014) Plant metabolic flux analysis. Springer

Ebrahim A, Lerman JA, Palsson BO, Hyduke DR (2013) COBRApy: COnstraints-Based Reconstruction and Analysis for Python. BMC Systems Biology. doi: 10.1186/1752-0509-7-74

Fell DA (1992) Metabolic control analysis: A survey of its theoretical and experimental development. Biochemical Journal 286: 313–330

Gleizer S, Ben-Nissan R, Bar-On YM, Antonovsky N, Noor E, Zohar Y, Jona G, Krieger E, Shamshoum M, Bar-Even A, et al (2019) Conversion of Escherichia coli to Generate All Biomass Carbon from CO2. Cell 179: 1255-1263.e12

Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, et al (2020) Array programming with NumPy. Nature 585: 357–362

Hoops S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendes P, Kummer U (2006) COPASI - A COmplex PAthway SImulator. Bioinformatics 22: 3067–3074

Kaste JAM, Green A, Shachar-Hill Y (2023) Integrative teaching of metabolic modeling and flux analysis with interactive python modules. Biochemistry and Molecular Biology Education 51: 653–661

Kluyver T, Ragan-Kelley B, Pérez F, Granger B, Bussonnier M, Frederic J, Kelley K, Hamrick J, Grout J, Corlay S, et al (2016) Jupyter Notebooks – a publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas - Proceedings of the 20th International Conference on Electronic Publishing, ELPUB 2016 87–90

Koffas MAG, Jung GY, Stephanopoulos G (2003) Engineering metabolism and product formation in Corynebacterium glutamicum by coordinated gene overexpression. Metabolic Engineering 5: 32–41

Koffas MAG, Stephanopoulos G (2005) Strain improvement by metabolic engineering: Lysine production as a case study for systems biology. Current Opinion in Biotechnology 16: 361–366

Krömer JO, Nielsen LK, Blank LM (2014) Metabolic Flux Analysis. Methods in Molecular Biology. New York, NY

Lee KH, Park JH, Kim TY, Kim HU, Lee SY (2007) Systems metabolic engineering of Escherichia coli for L-threonine production. Molecular Systems Biology. doi: 10.1038/msb4100196

Likert R (1932) A technique for the measurement of attitudes. Archives of Psychology 140: 1–55

Matsuda F, Maeda K, Taniguchi T, Kondo Y, Yatabe F, Okahashi N, Shimizu H (2021) mfapy: An open-source Python package for 13C-based metabolic flux analysis. Metabolic Engineering Communications 13: e00177

Moreno-Sánchez R, Saavedra E, Rodríguez-Enríquez S, Olín-Sandoval V (2008) Metabolic Control Analysis: A tool for designing strategies to manipulate metabolic pathways. Journal of Biomedicine and Biotechnology. doi: 10.1155/2008/597913

Neuhäuser M (2011) Wilcoxon–Mann–Whitney Test. In M Lovric, ed, International Encyclopedia of Statistical Science. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 1656–1658

Nielsen J (2003) It Is All about Metabolic Fluxes. Journal of Bacteriology 185: 7031–7035

Orth JD, Conrad TM, Na J, Lerman JA, Nam H, Feist AM, Palsson B (2011) A comprehensive genome-scale reconstruction of Escherichia coli metabolism-2011. Molecular Systems Biology 7: 1–9

Orth JD, Fleming RMT, Palsson BØ (2010a) Reconstruction and Use of Microbial Metabolic Networks: the Core Escherichia coli Metabolic Model as an Educational Guide. EcoSal Plus. doi: 10.1128/ecosalplus.10.2.1

Orth JD, Thiele I, Palsson BO (2010b) What is flux balance analysis? Nature Biotechnology 28: 245–248

Park JH, Lee KH, Kim TY, Lee SY (2007) Metabolic engineering of Escherichia coli for the production of L-valine based on transcriptome analysis and in silico gene knockout simulation. Proceedings of the National Academy of Sciences of the United States of America 104: 7797–7802

Ratcliffe RG, Shachar-Hill Y (2006) Measuring multiple fluxes through plant metabolic networks. Plant Journal 45: 490–511

Rodríguez-Caso C, Sánchez-Jiménez F, Medina MÁ (2002) A modeling and simulation approach to the study of metabolic control analysis. Biochemistry and Molecular Biology Education 30: 169–171

Saa PA, Nielsen LK (2017) Formulation, construction and analysis of kinetic models of metabolism: A review of modelling frameworks. Biotechnology Advances 35: 981–1003

Snoep JL, Mendes P, Westerhoff HV (1999) Teaching Metabolic Control Analysis and kinetic modelling: Towards a portable teaching module. The Biochemist 25–28

Stephanopoulos G, Aristidou AA, Nielsen J (1998) Metabolic engineering: principles and methodologies. Academic Press.

Sundqvist N, Grankvist N, Watrous J, Mohit J, Nilsson R, Cedersund G (2022) Validation-based model selection for 13C metabolic flux analysis with uncertain measurement errors. PLOS Computational Biology 18: e1009999

Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17: 261–272

Wong KW, Barford JP, Porter JF (2004) Understanding the practical consequences of metabolic interactions - A software package for teaching and research. IFAC Proceedings Volumes (IFAC-PapersOnline) 37: 315–320

Wong KWW, Barford JP (2010) Metstoich: Teaching quantitative metabolism and energetics in biochemical engineering. Chemical Engineering Education 44: 147–156

APPENDIX D: Supplemental Material for Chapter 6

SURVEY INSTRUMENTS

Pre-workshop survey

1. I am a …
   a. Undergraduate
   b. Graduate student
   c. Postdoctoral researcher
   d. Faculty member
   e. None of the above
2. I would describe myself as a …
   a. Biologist
   b. Biochemist
   c. Computational Scientist
   d. None of the above
3. I feel confident in applying and incorporating metabolic modeling techniques to my research question(s)
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
4. I feel confident in evaluating the results of a metabolic modeling study or exercise.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
5. I feel confident in identifying metabolic modeling software and techniques that I can apply to my research question(s)
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
6.
I understand the purpose(s) of metabolic modeling.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
7. I can describe kinetic metabolic modeling and its limitations.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
8. I can describe Metabolic Flux Analysis and its limitations.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
9. I can describe Flux Balance Analysis and its limitations.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
10. I understand the data types I would need to carry out kinetic metabolic modeling.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
11. I understand the data types I would need to carry out Metabolic Flux Analysis.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
12. I understand the data types I would need to carry out Flux Balance Analysis.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
13. I can name the language(s) or software package(s) I would use to incorporate metabolic modeling into my own research
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
14. I can critically evaluate the application and results of metabolic modeling in publications and presentations relevant to my area of research.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)

Post-workshop survey

1. I feel confident in applying and incorporating metabolic modeling techniques to my research question(s)
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
2. I feel confident in evaluating the results of a metabolic modeling study or exercise.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
3. I feel confident in identifying metabolic modeling software and techniques that I can apply to my research question(s)
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
4. I understand the purpose(s) of metabolic modeling.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
5. I can describe kinetic metabolic modeling and its limitations.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
6. I can describe Metabolic Flux Analysis and its limitations.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
7. I can describe Flux Balance Analysis and its limitations.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
8. I understand the data types I would need to carry out kinetic metabolic modeling.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
9. I understand the data types I would need to carry out Metabolic Flux Analysis.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
10. I understand the data types I would need to carry out Flux Balance Analysis.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
11. I can name the language(s) or software package(s) I would use to incorporate metabolic modeling into my own research
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
12. I can critically evaluate the application and results of metabolic modeling in publications and presentations relevant to my area of research.
   a. (Strongly Disagree), (Disagree), (Neutral), (Agree), (Strongly Agree)
13. What did you find useful about the workshop?
   a. Free response
14. What did you not find useful about the workshop?
   a. Free response
15. What changes to the workshop do you think would improve it in future iterations?
   a. Free response

4-months after survey

1. If you have had one or more opportunities to apply any of the knowledge you gained in the metabolic modeling workshop, please share your experience(s). If you have not applied any of the knowledge you gained in the metabolic modeling workshop and there are specific reasons why, please share.
   i.
Free response

Chapter 7 Additional Studies: Topological data analysis reveals a core gene expression backbone that defines form and function across flowering plants

This research was published in: S. Palande, J. A. M. Kaste, M. D. Roberts, K. S. Aba, C. Claucherty, J. Dacon, R. Doko, T. B. Jayakody, H. R. Jeffery, N. Kelly, A. Manousidaki, H. M. Parks, E. M. Roggenkamp, A. M. Schumacher, J. Yang, S. Percival, J. Pardo, A. Y. Husbands, A. Krishnan, B. L. Montgomery, E. Munch, A. M. Thompson, A. Rougon-Cardoso, D. H. Chitwood, R. VanBuren, Topological data analysis reveals a core gene expression backbone that defines form and function across flowering plants. PLoS Biology, 21(12): e3002397 (2023).

7.1. Preface

The work described in this chapter was born out of a project initially conceived by Dr. Bob VanBuren and Dr. Dan Chitwood, who introduced it as a class project to the students of HRT841: Foundations in Computational and Plant Sciences, the first of a two-part course series given to students in the NRT-IMPACTS fellowship program at Michigan State University. The basic premise put forward to us students was that Topological Data Analysis (TDA) techniques could be used to analyze gene expression patterns in publicly available plant RNA-seq datasets. After the whole class deliberated, we decided to limit our study to flowering plants. What followed was extensive data collection from the NCBI SRA followed by alignment, quantification, curation, and metadata organization to put together a coherent and expansive dataset. For the TDA portion of the study, which was helmed by postdoctoral researcher Dr. Saurabh Palande, we modeled our approach after a previous study that looked at organ-specific gene expression patterns in diverse animal lineages. That study used a statistical technique called Surrogate Variable Analysis (SVA) to help minimize the impact of unmodeled technical variables on its analyses.
Three other graduate students in the class – Miles Roberts, Kenia Segura-Aba, and Andriana Manousidaki – and I took on the task of applying SVA to our dataset. SVA turned out not to be a suitable technique for the dataset gathered for this study, although the process of attempting to use it was nonetheless highly informative. Despite the failure to use SVA, TDA applied to the “uncorrected” expression dataset we had gathered yielded some very interesting results. After the conclusion of both HRT841: Foundations in Computational and Plant Sciences and CSS844: Frontiers in Computational and Plant Sciences, I continued to work on interpreting and contextualizing the specific genes identified by the TDA. This analysis ended up comprising a substantial portion of the results section of the study, which has been published in PLoS Biology and on which I am co-first author (Palande et al., 2023).

Although this chapter may seem like a non sequitur from the rest of my work and came out of a class project, as you will see, the study concerns itself greatly with conserved patterns of gene expression across different plant tissues and stresses. This aspect of the study was of great interest to me due to the importance of tissue-specific expression patterns to the method I developed and presented in Chapter 3. Indeed, if such patterns can be consistently identified and then incorporated into metabolic modeling predictions using the method from Chapter 3 or something similar, multi-tissue FBA flux predictions in novel plant systems could potentially be improved. Although the work presented in this chapter does not go so far as to identify any such patterns, it represents a first step in this direction, and as such relates to the broader aims of this thesis.

7.2. Abstract

Since they emerged approximately 125 million years ago, flowering plants have evolved to dominate the terrestrial landscape and survive in the most inhospitable environments on earth.
At their core, these adaptations have been shaped by changes in numerous, interconnected pathways and genes that collectively give rise to emergent biological phenomena. Linking gene expression to morphological outcomes remains a grand challenge in biology, and new approaches are needed to begin to address this gap. Here, we implemented topological data analysis (TDA) to summarize the high dimensionality and noisiness of gene expression data using lens functions that delineate plant tissue and stress responses. Using this framework, we created a topological representation of the shape of gene expression across plant evolution, development, and environment for the phylogenetically diverse flowering plants. The TDA-based Mapper graphs form a well-defined gradient of tissues from leaves to seeds, or from healthy to stressed samples, depending on the lens function. This suggests that there are distinct and conserved expression patterns across angiosperms that delineate different tissue types or responses to biotic and abiotic stresses. Genes that correlate with the tissue lens function are enriched in central processes such as photosynthesis, growth and development, housekeeping, or stress responses. Together, our results highlight the power of TDA for analyzing complex biological data and reveal a core expression backbone that defines plant form and function.

7.3. Introduction

Over 300,000 gene expression datasets have been collected for thousands of diverse plant species spanning over 900 million years of divergence (Lim et al., 2022). This wealth of publicly available datasets spans ecological niches, species, developmental stages, tissues, stresses, and even single cells, providing a largely untapped reservoir of biological information.
These diverse datasets provide an opportunity to link insights from various biological disciplines, including ecology, development, physiology, genetics, evolution, biochemistry, and cell biology through a common computational and mathematical framework. These gene expression datasets have been analyzed individually for specific experiments and hypotheses, but large-scale meta-analyses across the publicly available expression datasets are largely nonexistent for plants.

Beyond a common currency that links the subdisciplines of biology, gene expression links its emergent levels. Below gene expression, the genome gives rise to transcriptional networks and protein interactions that are directly responsible for the complexity of gene expression. Above it, gene expression orchestrates cell-specific expression and the development of the organism itself, impacting phenotypes ranging from physiology to plasticity that propagate further to the population, community, and ecological levels. These features, from molecular (DNA, promoter sequences, -omics datasets) to the organismal, population, and ecological levels (life history traits, climatic data from species distributions, etc.) have been used in the past as labels and predicted outputs of machine learning models (Washburn et al., 2019; Azodi et al., 2020). The structure, or shape, of gene expression in flowering plants is therefore a constraint that is formed by and impacts biological phenomena below and above it, respectively.

Data visualization lies at the heart of exploratory data analysis and provides us with a powerful tool for generating hypotheses that can later be examined using standard statistical techniques. In the era of Big Data, the development of new data visualization pipelines has become increasingly important due to the high dimensionality of the datasets generated and the need to identify patterns and structures that can then become targets for more focused studies.
Just as we can look upon the shape of a leaf and derive insights into how it functions from multiple perspectives (developmental, physiological, and evolutionary), we can visualize the shape of any type of data using a Mapper graph (Singh et al., 2007). The Mapper algorithm takes as input a filter function that describes a biological aspect of the data and uses mathematical ideas of shape to return a graph that reveals the underlying structure of the data. Even abstract data types like gene expression datasets, therefore, have a shape that we can visualize and derive insights from. For example, Nicolau and colleagues visualized the structure of breast cancer gene expression, identifying 2 distinct branches with differing underlying genotypes and prognostic outcomes that traditional statistical and bioinformatic approaches fail to resolve (Nicolau et al., 2011). This structure was revealed using a pairwise correlation distance matrix as input and modeling of the residuals of each sample from a vector of healthy gene expression as a measure of disease severity. In a second example, using a lens of developmental stage on single-cell RNA-seq data, Rizvi and colleagues visualized the underlying structure of gene expression during murine embryonic stem cell differentiation, revealing transient states as well as asynchronous and continuous transitions between cell types (Rizvi et al., 2017). In both examples, Mapper allowed the shape of data, through a selected lens, to be visualized. The resulting topology of the graph, in the form of loops, branch points, or flares, allowed previously hidden structures to be seen and novel insights to be derived.

Loops, branch points, and flares in topological data analysis (TDA)-based Mapper graphs are visual representations of patterns, transitions, and outliers in the data. They provide insights into the topological structure and organization of the data, helping to identify clusters, subgroups, and potential anomalies.
Loops represent recurring patterns or relationships in the data, branch points occur when different subsets of data points exhibit distinct topological characteristics, and flares typically indicate outliers or subgroups within a larger cluster and can help identify regions of interest or anomalous behavior in the data.

Surveys of gene expression capture tens of thousands of data points per sample, and this high dimensionality can be represented by a unique shape that underlies emergent biological features. This shape explains gene expression along evolutionary, developmental, and environmental trajectories, leading to innovations that have marked the successful adaptation and proliferation of plant species. To visualize this shape is to better understand what transcriptional profiles are possible and to know the boundaries or constraints that permit or limit gene expression. Here, we analyzed publicly available gene expression profiles across diverse flowering plant families and visualized the underlying structure of gene expression in plants as a graph using the Mapper algorithm. We identified unique topological shapes of plant gene expression when viewed through lenses that delineate different tissue or stress responses. These complex, emergent patterns were largely hidden by biological complexity and sample heterogeneity. Our results demonstrate the ability of Mapper to uncover these patterns in high-dimensional plant gene expression datasets and its potential as a powerful tool for biological hypothesis generation.

7.4. Results

7.4.1. A representative catalog of flowering plant gene expression

The vast number of gene expression datasets in plants provides a unique opportunity to search for patterns of conservation and divergence throughout angiosperm evolution, across developmental time, tissues, and stress response axes. Previous studies have tried to find
Previous studies have tried to find 197 common signatures that define different plant tissues or responses to abiotic/biotic stresses, but these have been limited in species breadth (Proost and Mutwil, 2018), depth (Julca et al., 2021), or had limited downstream analyses (Zhang et al., 2020a). Here, we reanalyzed public expression data on the NCBI sequence read archive (SRA) and applied a topological data analysis method to map the shape of gene expression in plants. We included 54 species that captured the broadest phylogenetic diversity within angiosperms while maximizing the breadth of expression at the tissue and stress levels (Fig 7.1A). This includes 44 eudicots across 13 families and 9 monocot species across 2 families, as well as Amborella trichocarpa, which is sister to the rest of angiosperms. Raw reads were downloaded, cleaned, and reprocessed through a common RNAseq pipeline to remove artifacts related to the different algorithms and downstream analyses used by each group. After filtering datasets with low read mapping, our final set of expression data includes 2,671 samples across 7 distinct developmental tissues and 9 stress classifications for 54 species. 198 Figure 7.1: Dimensional space of plant gene expression across evolution, development, and stress. (A) Representative phylogeny of the 54 plant species included in this study. Nodes (species) are colored by plant family as denoted in Fig 7.1C. Dimensionality reduction of all samples by principal components (left) and t-SNE (right) are shown for tissue type (B), plant family (C), and abiotic/biotic stress (D). Individual samples are quantified and colored by tissue, family, and stress as shown in the respective bar plots. (E) Hierarchical clustering of samples with various biological features highlighted (stress, family, and tissue). 
Raw expression data underlying the graphs in this figure can be found in S7 Dataset, and code to regenerate analyses can be found in https://zenodo.org/records/8428609 (Palande, 2023).

To facilitate comparisons of gene expression across species, we limited our analysis to a set of 6,328 orthologous low-copy genes, identified using OrthoFinder (Emms and Kelly, 2015), that were conserved across all 54 plant species. These sets of orthologous genes, or orthogroups, are mostly single copy in our diploid species and scale with ploidy in polyploid species. The orthogroups are conserved across a diverse selection of angiosperm lineages and correspond to well-conserved biological processes. Gene ontology (GO) term enrichment analysis on the Arabidopsis thaliana loci associated with these orthogroups shows enrichment for basic biological functions like “DNA replication initiation” and “tRNA methylation” at the top of the list of enriched GO terms, as well as functions specific to photosynthetic organisms like “photosystem II assembly” and “tetraterpenoid metabolic process.” Although the remaining orthogroups contain significant biological information, they were excluded from analysis because multigene families typically have diverse functions with divergent expression profiles that would confound downstream comparative analyses. The transcript per million (TPM) counts were summed for all genes within an orthogroup for a given species and merged into a single dataframe to create a final matrix of 6,335 orthologs by 2,671 samples.

Principal component analysis (PCA) (Pearson, 1901) and t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten and Hinton, 2008) based dimensionality reduction show some separation of samples by different biological factors (Fig 7.1).
The sample space is most clearly delineated by tissue, where both PC1 (explaining 25.4% of variation) and t-SNE1 separate the samples into a gradient from root to leaf tissues with other plant tissues sandwiched in between (Fig 7.1B and 7.1D). This distribution largely correlates with tissue function, as the sink tissues of flowers, seeds, and fruits resolve closer to the root samples along t-SNE1 and PC1. No tissue type is separated fully by either dimensionality reduction approach. Samples from the 16 plant families are distributed throughout the dimensional space, suggesting that family- or species-level traits are not masking emergent features of distinct tissues (Fig 7.1C). Interestingly, abiotic and biotic stresses are similarly distributed throughout the dimensional space, with no clear grouping of the same stress across species or individual experiments. This could be due to intrinsic differences in how individual species respond to stress or to differences in the way stress experiments are carried out by different research groups.

To account for batch effects and the influence of unmodeled factors, we applied surrogate variable analysis (SVA) to generate estimates of surrogate variables and their effects on our expression matrices. We identified 24 surrogate variables within the dataset, but these latent variables were intrinsically linked to the primary factors in our study (e.g., stress, tissue, and family). Removing surrogate variables would have masked much of the biology we were attempting to quantify, so we chose not to use these “data cleaning” approaches (see Appendix D, Text S7.2.A for more details).

7.4.2. Topological data analysis and the shape of plant gene expression

Traditional dimensionality reduction and hierarchical clustering provided some degree of separation, but they were unable to delineate samples by stress or to identify expression patterns related to biological function.
This may be related to residual heterogeneity, to noise, or to the inherent biological complexity that underlies plant evolution and function. To test these possibilities, we used a topological data analysis (TDA) approach to map the shape of our data. TDA was implemented using Mapper (Tauzin et al., 2021), which provides a compact, multiscale representation of the data that is well suited for visual exploration and analysis. Mapper is particularly well suited for genomics data as these datasets typically have extremely high dimensionality and sparsity (Nicolau et al., 2011). To construct Mapper graphs from our gene expression data, we created 2 different lenses of tissue and stress, adopting an approach similar to Nicolau and colleagues’ (Fig 7.2A–2E). To create the stress lens, we first identified all the healthy samples from the dataset and fit a linear model to them (Fig 7.2; see Methods). This model serves as the idealized healthy orthogroup expression. We then projected all the samples onto this linear model and obtained the residuals. These residuals measure the deviation of the sample gene expression from the modeled healthy expression, and the lens function is simply the length of the residual vector. The obvious separation between leaf and root samples in the dimension reduction plots supports a strong photosynthetic versus nonphotosynthetic divide. We used this observation to create a binary tissue lens in the same way as the stress lens. We identified all the photosynthetic samples (i.e., leaf tissue) and created an idealized expression profile by fitting a linear model to these expression profiles (Fig 7.2). We then projected all the samples onto this linear model and obtained the residuals to establish the lens function by tissue. To define the cover for each lens, we divided the range of the lens function into intervals of uniform length, with the same amount of overlap between adjacent intervals.
We experimented with a range of interval lengths and overlap sizes to identify the values that produced relatively stable Mapper graphs. The clustering was performed using DBSCAN, a commonly used clustering algorithm in Mapper (Pathak et al., 2021).

Figure 7.2: Topology-based Mapper graphs and the shape of gene expression in plants. Overview of Mapper graph construction and lens functions (A-E). The lens function value of each sample is shown in the principal component (top) and t-SNE (bottom) based dimensional reduction from Fig 7.1 for the tissue (F) and stress lens (G). Mapper graphs across variable cover intervals and interval number for the tissue (H) and stress (I) lens function. The Mapper graph constructions we chose for further analysis are enclosed within a box. Raw expression data underlying the graphs in this figure can be found in S7 Dataset, and code to regenerate analyses can be found at https://zenodo.org/records/8428609 (Palande, 2023).

Overlaying the tissue lens value of each sample over the PCA and t-SNE dimensional space reveals a clear gradient across PC1 and t-SNE1, with the highest lens function values found in seed, fruit, and flower tissues (Fig 7.2F). For the stress lens function, samples are distributed across the dimensional space, with no obvious correlation between healthy and stressed lens values, similar to the observation from individual abiotic/biotic stresses (Figs 7.1D and 7.2G). Mapper graphs for the tissue and stress lens functions reflect an emergent and striking topological shape of plant expression (Fig 7.2H and 7.2I). Each node in the Mapper graphs corresponds to a bin of similar RNAseq samples with color representing the average lens value of samples within each node. Edges (connections) show common samples between overlapping bins.
Changing the cover interval overlap and interval number has marginal effects on the core graph structure but changes the shape and connectivity of sparse nodes on the outskirts of the graphs (Fig 7.2H and 7.2I). This central stability highlights the robustness of our input data and significance of the underlying features defining the graph shape (Carriere and Oudot, 2018). The Mapper graphs for both the tissue and stress lens functions show a backbone structure with numerous embedded nodes and flares that form a well-defined gradient from leaf to seed or healthy to stressed, respectively. This suggests that there are distinct and conserved expression patterns across angiosperms that delineate different tissues or responses to biotic and abiotic stresses.

Our input dataset is unbalanced, with large discrepancies in the number of input samples for different species, stresses, or tissue types. We tested if biases in the distribution of samples could explain the topological shape we observed. We downsampled the most frequent factor combinations and surveyed the effect it had on the Mapper graph topology. Our study has 3 factors (family, tissue, and stress), with 16 families, 8 tissue types, and 10 stresses. In total, 1,280 unique 3-way combinations are possible (family + tissue + stress), but in our dataset, only 195 unique combinations are present and they have a heavily skewed distribution (Appendix E, Fig S7.1). Based on this distribution, we chose a cutoff of 30 and downsampled the 30 most common factor combinations. This significantly reduced the sampling bias for family, tissue, and stress, but did not eliminate it (Appendix E, Fig 7.2.B). We then reran the Mapper algorithm using this downsampled dataset. The topology is quite similar, suggesting that biases in sample representation are not the major factor underlying the patterns we observed (Appendix E, Fig 7.2.C).

7.4.3.
Topological shape reflects the underlying biological features of gene expression

To identify and characterize these conserved biological patterns, we first simplified the Mapper graphs into 18 nodes for both the tissue and stress lens functions (Figs 7.3 and 7.4). The core tissue-based Mapper graph has discrete nodes for each surveyed plant tissue with a gradual transition of leaves (node 1), to roots (2), fruits (11 and 13), and, finally, seeds (14, 15, and 16; Fig 7.3A). At the fourth node, the Mapper graph proliferates into terminal branches of flower (node 9), stem (10), fruit (12), and mixtures of uncategorized tissue types (5 and 8). RNAseq samples from the 16 angiosperm families are largely dispersed across nodes by tissue, with some notable exceptions (Fig 7.3B). Most fruit samples are found along the gradient of the core graph structure, but fruits from the rose (Rosaceae) family form a separate node (node 12). Flowers from the eudicot species are mixed with fruit tissues in nodes along the core graph structure, but monocot flowers from the grass family (Poaceae) are found in discrete, branching nodes (9 and 17). The biotic and abiotic stress RNAseq samples are dispersed by tissue across the Mapper graph (Fig 7.3C), supporting the complexity and heterogeneity of these samples.

Figure 7.3: Simplified Mapper graphs detailing the distribution of samples along the tissue lens. Nodes along the full Mapper graphs (left) are clustered into simplified Mapper graphs (right), and samples are colored by tissue (A), family (B), and stress category (C). Photosynthetic and nonphotosynthetic ends of the Mapper graph are indicated.

Figure 7.4: Simplified Mapper graphs detailing the distribution of samples along the stress lens. Nodes along the full Mapper graphs (left) are clustered into simplified Mapper graphs (right), and samples are colored by tissue (A), family (B), and stress category (C). Healthy and stressed ends of the Mapper graph are indicated.
Mapper graphs clearly distinguish tissues across plant taxa, but what are the biological features that underlie this topology? We surveyed the expression patterns of the 6,328 orthogroups used to generate our Mapper graphs to see if they are enriched in certain biological processes related to evolutionarily conserved, tissue-specific functions. We classified genes as positively or negatively correlated with the tissue lens and conducted GO enrichment in these groups of genes. We expect negatively correlated genes to be characteristic of leaf gene expression and positively correlated genes to be characteristic of non-leaf gene expression. Supporting this, Mapper graphs and GO terms associated with the tissue lens–correlated genes point to photosynthetic versus nonphotosynthetic metabolism as a key factor in the overall gene expression patterns of plant tissues (Fig 7.3 and Appendix E, S1 Dataset). Enriched negatively correlated GO terms are mostly related to photosynthesis and include response to red and blue light, chloroplast and thylakoid organization, carotenoid metabolic process, and regulation of photosynthesis, among others (Appendix E, S1 Dataset). Plants and green algae are characterized by a set of well-conserved genes, termed “the GreenCut2 inventory,” that are not found in nonphotosynthetic organisms (Karpowicz et al., 2011). Most of the GreenCut2 genes (421 out of 677) are found within the 6,328 orthogroups in our analysis, and we tested if these are enriched among correlated genes. Genes from the GreenCut2 inventory are overrepresented in this set of genes, with 26.7% of the tissue-correlated (positively or negatively) genes being in the GreenCut2 resource versus 6.7% of the entire set of orthogroups (Appendix E, Table 7.1). This overrepresentation is even more stark if we limit our analysis to only the genes negatively correlated with the tissue lens, of which 50.3% are in the GreenCut2 inventory.
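The enrichment comparison above amounts to a one-sided binomial test: given the background rate of GreenCut2 loci among all orthogroups, how likely is it to see at least the observed number of GreenCut2 loci in a selected gene set? The sketch below computes that upper-tail probability with the standard library only, whereas the study itself used SciPy (see Methods). The background rate comes from the text, but the selected-set counts are hypothetical values chosen for illustration and are not the study's numbers.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), a one-sided enrichment p-value.

    Direct summation is fine for small n. For very large n the integer
    binomial coefficients can exceed float range, so a log-space
    implementation or a statistics library would be a better choice.
    """
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# Background rate from the text: 421 GreenCut2 loci among 6,328 orthogroups.
background = 421 / 6328

# Hypothetical counts, for illustration only.
n_selected = 100   # size of a lens-correlated gene set
n_greencut = 25    # members of that set found in the GreenCut2 inventory

p_value = binomial_upper_tail(n_greencut, n_selected, background)
```

With a background rate near 6.7%, about 7 GreenCut2 loci would be expected in a set of 100, so observing 25 yields a very small p-value.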
The overlapping loci between the 2 sets contain genes encoding protein products involved in various aspects of photosynthesis, including pigment biosynthesis and binding (e.g., AT4G10340, AT1G04620, AT1G44446) (Murray and Kohorn, 1991; Andersson et al., 2001; Meguro et al., 2011), the operation of the photosynthetic light reactions (e.g., AT4G05180, AT5G44650, AT3G17930) (Schubert et al., 2002; Albus et al., 2010; Xiao et al., 2012), or the operation of the Calvin–Benson Cycle (AT1G32060) (Harmon et al., 2001). Enriched GO terms that are positively correlated with the tissue lens are largely related to housekeeping and core metabolic processes including ubiquitination, macromolecule catabolism, the electron transport chain, peptide biosynthesis, and Golgi vesicle–mediated transport, among many others (Appendix E, S2 Dataset). Enriched genes encode proteins involved in the TCA cycle and respiration (e.g., AT1G47420, AT2G18450, AT4G26910) (Kruft et al., 2001; Millar et al., 2001; Menges et al., 2002) and in the development of specific nonphotosynthetic tissue types like seeds (e.g., AT2G40170, AT2G38560) (Leon-Kloosterziel et al., 1996; Wang et al., 2008) and pollen/pollen tubes (e.g., AT2G03120, AT2G41630) (Han et al., 2009; Zhou et al., 2013). However, many of the tissue lens–correlated genes do not intuitively relate to the photosynthetic versus nonphotosynthetic tissue distinction, and further examination of these loci on a gene-by-gene basis may shed light on conserved differences between plant tissues.

The simplified Mapper graph from the stress lens has 18 nodes that form a continuous gradation of healthy to stressed tissues (Fig 7.4). Individual tissue types, regardless of stress condition, are enriched in certain nodes but are less defined than under the tissue lens (Fig 7.4A).
RNAseq samples related to light and heat stress are found in discrete nodes (1 and 2, respectively) at the terminus of the Mapper graph across all species where these data were available (Fig 7.4C). Other stress RNAseq samples are found in nodes with healthy tissues but are generally concentrated toward the stress end of the Mapper graph. An interesting exception is a group of cold stressed root samples from the grass (Poaceae) family (node 15). Clustering of distinct stresses within the same node suggests a core stress response conserved across Angiosperms for all abiotic and biotic factors. The gradient of sample distribution from healthy to stressed across the Mapper graph may be related to the severity of stress experienced by plants in each individual experiment. To explore what constitutes these conserved stress-related expression patterns, we searched for GO enrichment of genes that are positively correlated with the stress lens. This group of genes is heavily enriched in functions related to stress, including responses to water deprivation, chitin, reactive oxygen species, fungi, wounding, bacteria, and general defense mechanisms (Appendix E, S3 Dataset). Genes positively correlated with the stress lens include loci related to the biosynthesis of compounds with diverse stress-related activities like jasmonic acid and jasmonic acid derivatives (AT2G35690, AT2G46370) (Staswick and Tiryaki, 2004; Schilmiller et al., 2007) and ascorbic acid (AT3G09940) (Lisenbee et al., 2005). Negatively correlated genes are enriched in functions related to growth and reproduction such as DNA replication, mitosis, and rRNA processing, among others (Appendix E, S4 Dataset). 
This includes genes involved in regulation of the cell cycle (AT3G54650, AT4G12620, AT2G01120) (Collinge et al., 2004; Masuda et al., 2004; Kim et al., 2008), chromatin organization (AT1G15660, AT1G65470) (Kaya et al., 2001; Ogura et al., 2004), and the development of reproductive structures (AT1G34350, AT2G41670, AT4G27640, AT3G52940) (Broadhvest et al., 2000; Dou et al., 2016; Huang et al., 2017; Liu et al., 2019). This pattern points towards an intuitive distinction between the stressed and unstressed samples in our dataset in terms of their investment in cell proliferation and reproduction. Most of these genes are involved in core biological functions with conserved roles across eukaryotes, and their coordinated perturbation could be predictive of stress responses in diverse lineages.

7.5. Discussion

Genome-scale datasets have high dimensionality, and even the simplest pairwise experiment has hundreds or thousands of complex and interconnected cellular pathways in dynamic flux between conditions. Comparisons across plant lineages are similarly complex, as each species has its own evolutionary history with thousands of duplicated, lost, or new genes enabling its unique and elegant biology. This complexity presents major challenges for characterizing underlying biological mechanisms and identifying shared and distinct properties across evolutionary timescales. Here, we leveraged the wealth of public gene expression datasets across diverse flowering plants and used a set of deeply conserved genes to search for patterns of conservation across tissue types, stress responses, and evolution. We first tested traditional dimensionality reduction and clustering-based approaches but found that they were largely ineffective and unable to clearly resolve samples. Instead, we used a novel topological framework to compare samples and test for evolutionary conservation.
Topological data analysis has been applied to complex, high-dimensionality biological datasets including gene expression profiles correlated with human cancers and other diseases (Nicolau et al., 2011; Mandal et al., 2020; Rabadán et al., 2020). To our knowledge, TDA has not been used for plant science datasets outside of analyses of shape (Li et al., 2018; Zeng et al., 2021; Amézquita et al., 2022). Flowering plants have tremendous phylogenetic, developmental, phenotypic, and genomic scale diversity, creating additional layers of complexity compared to other lineages. Despite this, Mapper was able to capture hidden and emergent signatures of gene expression at the tissue and stress scales that were missed using traditional approaches.

Most developmental tissues or stress responses are not perfectly separated but instead fall within a gradient along a central shape. The central shape of the tissue lens Mapper graph represents the life cycle of a plant with transitions from the vegetative tissues of leaves and roots to reproductive flowers, fruit, and, eventually, seeds. Nodes along the Mapper graphs that contain mixtures of tissues such as fruits and flowers, leaves and stems, or even leaves and roots reflect developmental plasticity, heterogeneity, and overlapping functions between different organs. Flowers give rise to fruits, and the complex processes of fertilization, seed, and fruit development blur the lines between distinct tissue types. This complexity and interconnectivity is central to biological processes but is masked by traditional dimensionality reduction approaches, which can oversimplify nonlinear datasets.

The stressed and healthy samples are less clearly delineated in the Mapper graphs than samples from different plant tissues. This may reflect artifacts stemming from variation in the severity, duration, or method of applying stresses across different experiments and species.
For example, mildly stressed samples might have expression signatures that mirror healthy tissues with comparatively few differentially expressed genes. Despite this issue, we observed a strong gradient of sample distribution from healthy to stressed across the graph. Distinct stresses were generally found within the same nodes, and genes that were positively correlated with the stress lens show enrichment in classical stress pathways. This includes the core stress-responsive hormones jasmonic acid and abscisic acid and their corresponding transcriptional network as well as broader shifts in metabolic processes geared toward defense. Taken together, this suggests that plants have deeply conserved expression signatures across evolution and for different stresses. Abiotic and biotic stress responses have been mostly studied in isolation, but they typically co-occur in natural environments, and they have overlapping signaling, hormonal, and network responses in plants (reviewed in Rejeb et al., 2014). The topological shape of gene expression points to a shared set of pathways or perturbations that define if a tissue is healthy or stressed. Environmental stresses broadly disrupt photosynthesis and core metabolic and cellular functions either as a direct response to physical trauma or in preparation for defense or resilience. These changes may serve as the backbone of the topological shape we observed for the stress lens.

Although we observed a deeply conserved pattern of gene expression underlying plant form and function, our analyses capture a snapshot of the evolutionary innovations found in flowering plants. We used a set of low-copy, conserved genes to enable comparisons of expression across species, and we had to exclude approximately 70% of all plant genes.
This includes most enzymes, transcription factors, and regulatory elements, which are mostly found in large, rapidly evolving, or lineage-specific gene families that cannot be resolved to high-confidence orthologs across eudicots and monocots. Duplication and subsequent sub- or neofunctionalization of these genes drive the evolution of new plant traits and developmental differences of plant organs. Single-copy genes by contrast have deeply conserved functions in core metabolism, photosynthesis, and housekeeping processes that typically transcend tissue, species, and environmental changes. Given these limitations, it is somewhat surprising that our analyses were able to clearly separate tissue types and stresses despite missing information from most of the genes that should underlie these biological differences. Applying TDA with a full set of genes in a single species with well-curated gene expression profiles could uncover complex or emergent biological signatures that were previously hidden. Here, we provide a proof of concept for studying complex biological traits using TDA, and a similar analytical framework could be applied to numerous areas of plant science research and beyond. Compared to the approximately 300,000 published plant gene expression datasets (Lim et al., 2022), our study has a somewhat sparse sampling of species and a subset of expressed genes, yet we were able to detect a number of hidden trends. TDA of high-resolution sampling over narrower phenotypic spaces such as drought responses in a single species or tissue divergence across 900 million years of plant evolution could yield transformative insights that were previously overlooked. However, researchers should exercise caution when applying TDA to gene expression data as the lack of a robust hyperparameter tuning procedure could potentially result in misleading conclusions.
This reflects a broader problem in machine learning and data science, but hyperparameter search, cross-validation, and feature selection can enable data-driven tuning of the appropriate hyperparameters. With the appropriate datasets and sufficient sampling, TDA can be widely applicable for developing a deeper understanding of complex, emergent biological phenomena.

7.6. Methods

7.6.1. Assembling a representative catalog of flowering plant expression data

We selected species that captured the broadest phylogenetic diversity within angiosperms and species that had a breadth of expression at the tissue and stress levels. We also selected only species with a high-quality reference genome to enable accurate read mapping and downstream comparative genomics. Metadata including species, accession, tissue type, experimental treatments, replicate number, and sequencing platform were collected manually for each sample using the NCBI BioProject and SRA, as well as the primary data publications (Appendix E, S6 Dataset). Raw RNAseq reads were downloaded from the NCBI SRA and quantified using a pipeline developed in the VanBuren lab to trim, quantify, and identify differentially expressed genes (https://github.com/pardojer23/RNAseqV2). Using a common analytical pipeline helped reduce noise between experiments that used different algorithms in the original publications. Raw Illumina reads from various platforms were first quality trimmed using fastp (v0.23) (Chen et al., 2018) with default parameters. The quality-filtered reads were pseudoaligned to the corresponding transcripts (gene models) for each species using Salmon (v1.6.0) (Patro et al., 2017) in quasi-mapping mode. Transcript-level estimates were converted to gene-level transcripts per million counts using the R package tximport (Soneson et al., 2015).

7.6.2.
Comparing expression across species

To facilitate detailed cross-species comparisons, we first clustered proteins from all 54 species into orthogroups using Orthofinder (v2.3.8) (Emms and Kelly, 2015). Genomes and proteomes were downloaded for each species from Phytozome v13 (Goodstein et al., 2012). Orthofinder was run using default parameters, with the reciprocal DIAMOND search (v2.0.11) (Buchfink et al., 2021) used for sequence alignment, and groups of similar proteins were clustered using the Markov Cluster Algorithm. In total, 2,317,289 genes (94% of input genes) were clustered into 86,185 orthogroups across the 54 species. Of these, 33,585 orthogroups are found in only a single species and 7,742 are found in at least 52 out of 54 species. This set of broadly conserved orthogroups was further refined by filtering out orthogroups with an average of >2 genes per ortholog for the diploid species to avoid including multigene families with diverse functions in the analysis. This set of 6,335 orthogroups was used as a common framework to allow comparison of expression across species. For orthogroups where a species had more than one gene, the total TPM for all genes in that orthogroup was summed, and the raw TPM was used for single-copy genes. Expression data for each sample across all species were combined into a single expression matrix (Appendix E, S7 Dataset), and SVA was used to characterize the potential impacts of unmodeled technical variables on the dataset (see Appendix E, Text 7.A). PCA was performed using built-in functions in Scikit-learn (Pedregosa et al., 2011) on the log2(x + 1) or z-score transformed gene expression data (raw TPMs) to reduce dimensionality and capture the main sources of variation within the datasets.

7.6.3.
Surrogate variable analysis

To account for batch effects and the influence of unmodeled factors on the expression matrix used for the present study, we applied SVA to generate estimates of surrogate variables and their effects on our expression matrices (Leek and Storey, 2007; Leek et al., 2012). Briefly, SVA assumes that the expression of a particular gene $i$ in RNA-seq experiment $j$ (of $n$ independent experiments) can be described by the following linear equation:

$$x_{ij} = u_i + f_i(y_j) + e_{ij} \tag{1}$$

where $u_i$ is the baseline expression level of gene $i$, $f_i(y_j)$ represents the effect of a measured variable $y_j$, and $e_{ij}$ is the error term (Leek and Storey, 2007). However, if there are $L$ unmodeled factors affecting the expression of gene $i$, then the error term $e_{ij}$ contains both randomly distributed experimental error and the effects of the unmodeled factors. That is:

$$e_{ij} = \sum_{l=1}^{L} \gamma_{li}\, g_{lj} + e'_{ij} \tag{2}$$

where $g_l = (g_{l1}, \ldots, g_{ln})$ is a function describing the effect of unmodeled factor $l$ (for $l = 1, \ldots, L$) across the $n$ experiments, $\gamma_{li}$ is the coefficient describing the influence of unmodeled factor $l$ on the expression of gene $i$, and $e'_{ij}$ is the true randomly distributed noise term (Leek and Storey, 2007). Combining (1) and (2) yields:

$$x_{ij} = u_i + f_i(y_j) + \sum_{l=1}^{L} \gamma_{li}\, g_{lj} + e'_{ij} \tag{3}$$

By using the svaseq() method implemented in the R package sva (v. 3.36.0) (Leek et al., 2012; Leek, 2014), we identified and estimated the values of 24 separate surrogate variables. These surrogate variables are vectors of values that correspond, for each expression value $x_{ij}$, to the $\sum_{l=1}^{L} \gamma_{li}\, g_{lj} + e'_{ij}$ term in (3). To determine how much of the variation in each surrogate variable is explained by a proxy batch variable (bioproject), by the 3 primary biological variables (stress, tissue, and family), and by their pairwise interactions, we regressed each estimated surrogate variable on each variable (either batch or biological) or on a pairwise interaction. McNemar’s formula was used to calculate the adjusted R2 values for each surrogate variable.
7.6.4. Mathematical basis of topological data analysis

The flexibility of Mapper allows us to apply it to various types of data. Here, we will describe the Mapper construction in the simplest setting of point cloud data and then explain how it was applied to the gene expression data. Consider a point cloud $X \subset \mathbb{R}^d$ equipped with a function $f: X \to \mathbb{R}$. An open cover of $X$ is a collection $U = \{U_i\}_{i \in I}$ of open sets in $\mathbb{R}^d$, such that $X \subset \bigcup_{i \in I} U_i$, where $I$ is an index set. The 1-dimensional nerve of the cover $U$, denoted as $M := N_1(U)$, is called the Mapper graph of $(X, f)$. In this graph, each open set $U_i$ is represented as a vertex $i$, and 2 vertices, $i$ and $j$, are connected by an edge if and only if the intersection of $U_i$ and $U_j$ is nonempty.

To construct a Mapper graph, we start by defining a cover $V = \{V_j\}_{j \in J}$ of the image $f(X) \subset \mathbb{R}$ of $f$, where $J$ is a finite index set, by splitting the range of $f(X)$ into a collection of overlapping intervals. Next, for each $V_j$, we identify the subset of points $X_j$ in $X$ such that $f(X_j) \subset V_j$ and apply a clustering algorithm to identify clusters of points in $X_j$. The cover $U$ of $X$ is the collection of such clusters induced by $f^{-1}(V_j)$ for each $j$. Once we have the cover $U$, we compute its 1-dimensional nerve $M$ and visualize it in the form of a weighted graph.

For example, consider Fig 7.2A–2E. The point cloud $X$ in this case consists of points in the 2-dimensional plane, in the shape of a “Y”. The function $f$ simply maps each point to its y-coordinate. We divide the range of $f$ into 4 overlapping intervals, represented by the 4 colored segments along the y-axis in Fig 7.2. For each interval $V_j$, the colored rectangles in the center panel of the figure show the subsets of points $X_j \subset X$ such that $X_j = f^{-1}(V_j)$. Then, we apply clustering to each $X_j$ separately to obtain the cover $U$ of $X$. The 1-dimensional nerve of $U$, i.e., the Mapper graph $M$, is shown in the rightmost panel. The color of each vertex corresponds to the cover interval it belongs to.
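The construction above can be traced end to end on a small synthetic “Y”. The following is a minimal sketch under stated assumptions, not the implementation used in the study: the point coordinates, the function names, and the parameter values are hypothetical, closed intervals with a small tolerance stand in for open ones, and clustering is done by taking connected components of an eps-neighborhood graph, a simplified stand-in for DBSCAN that behaves like it when every point counts as a core point.

```python
import math


def uniform_cover(lo, hi, n_intervals, overlap):
    """Split [lo, hi] into equal-length intervals; `overlap` is the fraction
    of an interval's length that it shares with each neighbor."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1.0 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def eps_clusters(indices, points, eps):
    """Connected components of the eps-neighborhood graph on the given points."""
    unvisited, clusters = set(indices), []
    while unvisited:
        seed = unvisited.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            p = frontier.pop()
            near = {q for q in unvisited
                    if math.dist(points[p], points[q]) <= eps}
            unvisited -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper_graph(points, lens, n_intervals, overlap, eps, tol=1e-9):
    """Return (nodes, edges): nodes are clusters of point indices, and an edge
    joins two nodes whenever they share at least one point."""
    values = [lens(p) for p in points]
    nodes = []
    for a, b in uniform_cover(min(values), max(values), n_intervals, overlap):
        preimage = [i for i, v in enumerate(values) if a - tol <= v <= b + tol]
        nodes.extend(eps_clusters(preimage, points, eps))
    edges = {(i, j)
             for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]}
    return nodes, edges


# Hypothetical Y-shaped point cloud: a vertical stem and two diagonal arms.
stem = [(0.0, i / 20) for i in range(21)]
left_arm = [(-i / 20, 1 + i / 20) for i in range(1, 21)]
right_arm = [(i / 20, 1 + i / 20) for i in range(1, 21)]
points = stem + left_arm + right_arm

# Lens = y-coordinate, 4 overlapping intervals, as in the worked example.
nodes, edges = mapper_graph(points, lens=lambda p: p[1],
                            n_intervals=4, overlap=0.25, eps=0.15)
```

With these settings the three lower intervals each yield a single cluster and the top interval yields one cluster per arm, so the resulting graph has five vertices and four edges and is itself Y-shaped.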
Fig 7.2A–2E illustrates Mapper graph construction from the same set of points, but with the x-coordinate used as the lens. We can observe that the 2 lens functions produce 2 slightly different Mapper graphs.

7.6.5. Constructing Mapper graphs and lens functions

To construct Mapper graphs from our gene expression data, we created 2 different lenses, adopting an approach similar to the one used in Nicolau and colleagues’ paper (Nicolau et al., 2011). We refer to these lenses as the tissue lens and the stress lens, respectively. To create the stress lens, we first identified all the healthy samples from the dataset and fit a linear model to them. This model serves as the idealized healthy orthogroup expression. Then, we projected all the samples (healthy as well as stressed) onto this linear model and obtained the residuals. These residuals measure the deviation of the sample gene expression from the modeled healthy expression. The lens function is simply the length of the residual vector. To define the cover, we divided the range of the lens function into intervals of uniform length, with the same amount of overlap between adjacent intervals. We experimented with a range of interval lengths and overlap sizes to identify the values that produced relatively stable Mapper graphs. The clustering was performed using DBSCAN, a commonly used clustering algorithm for Mapper.

The construction of a Mapper graph relies on several user-defined parameters: the lens function f, the cover V, and the clustering algorithm. Optimizing these parameters is an interesting open problem in TDA research (Chalapathi et al., 2021). The function f plays the role of a lens, through which we look at the data, and different lenses provide different insights (Singh et al., 2007). The choice of f is typically driven by domain knowledge and the data under consideration.
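One way to read the lens definition above is as an orthogonal projection: the reference (healthy or photosynthetic) samples define a linear subspace, and the lens value of any sample is the length of the part of its expression vector lying outside that subspace. The sketch below follows that reading, which is an assumption about what “fit a linear model” means here rather than a description of the authors' code. The function names and toy vectors are hypothetical, and an analysis with thousands of reference samples may need a reduced-rank model instead of the full span.

```python
import math


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = sum(a * b for a, b in zip(w, q))
            w = [a - coeff * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > tol:  # skip vectors that add no new direction
            basis.append([a / norm for a in w])
    return basis


def lens_value(sample, basis):
    """Length of the residual left after projecting `sample` onto the basis."""
    residual = list(sample)
    for q in basis:
        coeff = sum(a * b for a, b in zip(residual, q))
        residual = [a - coeff * b for a, b in zip(residual, q)]
    return math.sqrt(sum(a * a for a in residual))


# Hypothetical toy data: 2 reference ("healthy") profiles over 3 orthogroups.
healthy = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(healthy)

inside = lens_value([2.0, 5.0, 0.0], basis)   # lies in the healthy subspace
outside = lens_value([1.0, 2.0, 3.0], basis)  # deviates along the third axis
```

A sample that is a linear combination of the reference profiles gets a lens value of zero, and the value grows with the component of expression that the reference model cannot explain.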
In this study, the data under consideration are very similar to the dataset studied by Nicolau and colleagues (Nicolau et al., 2011). Therefore, we followed similar methods to define the lenses. Our choice of lenses is further justified by the observations from the dimension reduction plots.

The cover V = {V_j}_{j∈J} of f(X) consists of a finite number of open intervals as cover elements. To define V, we use the simple strategy of defining intervals of uniform length and overlap. Adjusting the interval length and the overlap increases or decreases the amount of aggregation provided by the Mapper graph. The choice was made by visually inspecting Mapper graphs over a range of parameter values and selecting the parameters that resulted in the most stable structure.

Any clustering algorithm can be employed to obtain the cover U. We use the density-based clustering algorithm DBSCAN (Ester et al., 1996), which is commonly used in Mapper because it does not require a priori knowledge of the number of clusters. Instead, DBSCAN requires 2 input parameters: the minimum number of samples in a neighborhood for a point to be considered a core point, and the maximum distance between 2 samples for one to be considered in the neighborhood of the other.

7.6.6. Functional annotation of orthogroups

The correlation between expression values and tissue lens and stress lens values was calculated for each orthogroup. The top 2.5% most positively and negatively correlated orthogroups for each lens were selected to represent the tissue lens or stress lens correlated orthogroups. Arabidopsis gene IDs were used to identify the overlap between the GreenCut2 (Karpowicz et al., 2011) inventory and the Arabidopsis orthologs in our overall set of orthogroups, as well as in our sets of tissue lens and stress lens correlated orthogroups.
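The selection of lens-correlated orthogroups described above can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not specify the correlation coefficient or how the 2.5% cutoff was rounded, so Pearson correlation and rounding up are assumed here, and the function names, orthogroup labels, and toy values are hypothetical.

```python
from math import ceil, sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def top_correlated(expression, lens_values, fraction=0.025):
    """expression maps orthogroup -> per-sample expression; lens_values holds one lens value per sample."""
    r = {og: pearson(values, lens_values) for og, values in expression.items()}
    ranked = sorted(r, key=r.get)                # most negative correlation first
    k = max(1, ceil(fraction * len(ranked)))     # assumption: round up, keep at least one
    return ranked[-k:][::-1], ranked[:k]         # (most positive, most negative)

# Hypothetical toy data: 4 orthogroups measured in 4 samples.
lens_values = [1.0, 2.0, 3.0, 4.0]
expression = {
    "OG_up": [2.0, 4.0, 6.0, 8.0],
    "OG_down": [8.0, 6.0, 4.0, 2.0],
    "OG_mid": [1.0, 3.0, 2.0, 4.0],
    "OG_flat": [3.0, 1.0, 4.0, 2.0],
}

# fraction=0.25 is used only so that the toy example selects one orthogroup per tail.
positive, negative = top_correlated(expression, lens_values, fraction=0.25)
print(positive, negative)
```

With the default fraction of 0.025, the same function returns the top 2.5% of orthogroups in each tail for a given lens.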
The binom_test() function from SciPy (Virtanen et al., 2020) was used to apply one-sided binomial tests to check for enrichment of GreenCut2 loci in the overall, tissue lens, and stress lens correlated orthogroup sets. GO term enrichment of the sets of genes mapped to orthogroups and correlated with the tissue lens or stress lens was done using GOATOOLS (Klopfenstein et al., 2018). Data on gene function and biochemical reactions associated with specific loci were derived from TAIR (Lamesch et al., 2012), KEGG (Kanehisa and Goto, 2000), and the genome-scale metabolic model of Arabidopsis metabolism from de Oliveira Dal’Molin et al. (2015).

7.7. Data availability statement

The code, metadata, and raw datasets from this project are available on a dedicated GitHub page: https://github.com/PlantsAndPython/plant-evo-mapper and on Zenodo: https://zenodo.org/records/8428609.

7.8. Funding

This work was funded primarily by a National Science Foundation Research Traineeship training grant (NSF 1828149 to AMT, DHC, and RV), which established the Integrated training Model in Plant And CompuTational Sciences (IMPACTS) program at Michigan State University. This grant funded fellows within this program (JAMK, MDR, KSA, CC, JD, RD, TBJ, HRJ, AM, EMR, AMS, JY) as well as the project-based curriculum for the Plants and Python Course that formed the backbone of this manuscript. This work is also supported by NSF Plant Genome Research Program awards IOS-2310355 to EM, DHC, and RV, IOS-2310356 to AH, and IOS-2310357 to AK, NSF Developmental Mechanisms award IOS-2039489 to AH, and NSF Biological Integration Institute award DBI-2213983 to RV. Several students (JAMK, MDR, KSA, HMP, JP) were supported by a predoctoral training award (T32-GM110523 to RV) from the National Institute of General Medical Sciences of the NIH. This project was supported by the USDA National Institute of Food and Agriculture, and by Michigan State University AgBioResearch to AMT, DHC, and RV.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

7.9. Author contributions

Conceptualization: Sourabh Palande, Joshua A. M. Kaste, Miles D. Roberts, Kenia Segura Abá, Jamell Dacon, Aman Y. Husbands, Arjun Krishnan, Beronda L. Montgomery, Elizabeth Munch, Addie M. Thompson, Alejandra Rougon-Cardoso, Daniel H. Chitwood, Robert VanBuren.
Data curation: Sourabh Palande, Joshua A. M. Kaste, Miles D. Roberts, Kenia Segura Abá, Carly Claucherty, Jamell Dacon, Rei Doko, Thilani B. Jayakody, Hannah R. Jeffery, Nathan Kelly, Andriana Manousidaki, Hannah M. Parks, Emily M. Roggenkamp, Ally M. Schumacher, Jiaxin Yang, Sarah Percival, Jeremy Pardo, Alejandra Rougon-Cardoso, Daniel H. Chitwood, Robert VanBuren.
Formal analysis: Sourabh Palande, Joshua A. M. Kaste, Miles D. Roberts, Kenia Segura Abá, Carly Claucherty, Rei Doko, Thilani B. Jayakody, Hannah R. Jeffery, Nathan Kelly, Andriana Manousidaki, Hannah M. Parks, Emily M. Roggenkamp, Ally M. Schumacher, Jiaxin Yang, Sarah Percival, Jeremy Pardo, Alejandra Rougon-Cardoso, Daniel H. Chitwood, Robert VanBuren.
Project administration: Alejandra Rougon-Cardoso, Daniel H. Chitwood, Robert VanBuren.
Software: Jeremy Pardo.
Supervision: Daniel H. Chitwood, Robert VanBuren.
Visualization: Daniel H. Chitwood.
Writing – original draft: Joshua A. M. Kaste, Alejandra Rougon-Cardoso, Daniel H. Chitwood, Robert VanBuren.
Writing – review & editing: Aman Y. Husbands, Arjun Krishnan, Beronda L. Montgomery, Elizabeth Munch, Addie M. Thompson, Alejandra Rougon-Cardoso, Daniel H. Chitwood, Robert VanBuren.

REFERENCES

Albus CA, Ruf S, Schöttler MA, Lein W, Kehr J, Bock R (2010) Y3IP1, a nucleus-encoded thylakoid protein, cooperates with the plastid-encoded Ycf3 protein in photosystem I assembly of tobacco and Arabidopsis. Plant Cell 22: 2838–2855
Amézquita EJ, Quigley MY, Ophelders T, Landis JB, Koenig D, Munch E, Chitwood DH (2022) Measuring hidden phenotype: quantifying the shape of barley seeds using the Euler characteristic transform. in silico Plants 4: diab033
Andersson J, Walters RG, Horton P, Jansson S (2001) Antisense Inhibition of the Photosynthetic Antenna Proteins CP29 and CP26: Implications for the Mechanism of Protective Energy Dissipation. The Plant Cell 13: 1193–1204
Azodi CB, Pardo J, VanBuren R, de los Campos G, Shiu S-H (2020) Transcriptome-Based Prediction of Complex Traits in Maize. The Plant Cell 32: 139–151
Broadhvest J, Baker SC, Gasser CS (2000) SHORT INTEGUMENTS 2 Promotes Growth During Arabidopsis Reproductive Development. Genetics 155: 899–907
Buchfink B, Reuter K, Drost H-G (2021) Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods 18: 366–368
Carriere M, Oudot S (2018) Structure and stability of the one-dimensional mapper. Foundations of Computational Mathematics 18: 1333–1396
Chalapathi N, Zhou Y, Wang B (2021) Adaptive covers for mapper graphs using information criteria. 2021 IEEE International Conference on Big Data (Big Data). IEEE, pp 3789–3800
Chen S, Zhou Y, Chen Y, Gu J (2018) Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34: i884–i890
Collinge MA, Spillane C, Köhler C, Gheyselinck J, Grossniklaus U (2004) Genetic interaction of an origin recognition complex subunit and the Polycomb group gene MEDEA during seed development. Plant Cell 16: 1035–1046
Dou XY, Yang KZ, Ma ZX, Chen LQ, Zhang XQ, Bai JR, Ye D (2016) AtTMEM18 plays important roles in pollen tube and vegetative growth in Arabidopsis. Journal of Integrative Plant Biology 58: 679–692
Emms DM, Kelly S (2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biology 16: 157
Ester M, Kriegel H-P, Sander J, Xu X, others (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. kdd. pp 226–231
Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, et al (2012) Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res 40: D1178–1186
Han S, Green L, Schnell DJ (2009) The Signal Peptide Peptidase Is Required for Pollen Function in Arabidopsis. Plant Physiology 149: 1289–1301
Harmon AC, Gribskov M, Gubrium E, Harper JF (2001) The CDPK superfamily of protein kinases. New Phytologist 151: 175–183
Huang B, Qian P, Gao N, Shen J, Hou S (2017) Fackel interacts with gibberellic acid signaling and vernalization to mediate flowering in Arabidopsis. Planta 245: 939–950
Julca I, Ferrari C, Flores-Tornero M, Proost S, Lindner A-C, Hackenberg D, Steinbachová L, Michaelidis C, Gomes Pereira S, Misra CS, et al (2021) Comparative transcriptomic analysis reveals conserved programmes underpinning organogenesis and reproduction in land plants. Nature Plants 7: 1143–1159
Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28: 27–30
Karpowicz SJ, Prochnik SE, Grossman AR, Merchant SS (2011) The GreenCut2 resource, a phylogenomically derived inventory of proteins specific to the plant lineage. Journal of Biological Chemistry 286: 21427–21439
Kaya H, Shibahara K ichi, Taoka K ichiro, Iwabuchi M, Stillman B, Araki T (2001) FASCIATA genes for chromatin assembly factor-1 in Arabidopsis maintain the cellular organization of apical meristems. Cell 104: 131–142
Kim HJ, Oh SA, Brownfield L, Hong SH, Ryu H, Hwang I, Twell D, Nam HG (2008) Control of plant germline proliferation by SCFFBL17 degradation of cell cycle inhibitors. Nature 455: 1134–1137
Klopfenstein DV, Zhang L, Pedersen BS, Ramírez F, Warwick Vesztrocy A, Naldi A, Mungall CJ, Yunes JM, Botvinnik O, Weigel M, et al (2018) GOATOOLS: A Python library for Gene Ontology analyses. Scientific Reports 8: 10872
Kruft V, Eubel H, Jänsch L, Werhahn W, Braun HP (2001) Proteomic approach to identify novel mitochondrial proteins in Arabidopsis. Plant Physiology 127: 1694–1710
Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, et al (2012) The Arabidopsis Information Resource (TAIR): Improved gene annotation and new tools. Nucleic Acids Research 40: 1202–1210
Leek JT (2014) Svaseq: Removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Research 42: e161
Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD (2012) The SVA package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28: 882–883
Leek JT, Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3: 1724–1735
Leon-Kloosterziel KM, van de Bunt GA, Zeevaart JAD, Koornneef M (1996) Arabidopsis Mutants with a Reduced Seed Dormancy. Plant Physiology 110: 233–240
Li M, An H, Angelovici R, Bagaza C, Batushansky A, Clark L, Coneva V, Donoghue MJ, Edwards E, Fajardo D, et al (2018) Topological Data Analysis as a Morphometric Method: Using Persistent Homology to Demarcate a Leaf Morphospace. Frontiers in Plant Science 9:
Lim PK, Zheng X, Goh JC, Mutwil M (2022) Exploiting plant transcriptomic databases: Resources, tools, and approaches. Plant Communications 3: 100323
Lisenbee CS, Lingard MJ, Trelease RN (2005) Arabidopsis peroxisomes possess functionally redundant membrane and matrix isoforms of monodehydroascorbate reductase. Plant J 43: 900–914
Liu HH, Xiong F, Duan CY, Wu YN, Zhang Y, Li S (2019) Importinβ4 mediates nuclear import of grf-interacting factors to control ovule development in arabidopsis. Plant Physiology 179: 1080–1092
Mandal S, Guzmán-Sáenz A, Haiminen N, Basu S, Parida L (2020) A Topological Data Analysis Approach on Predicting Phenotypes from Gene Expression Data. In C Martín-Vide, MA Vega-Rodríguez, T Wheeler, eds, Algorithms for Computational Biology. Springer International Publishing, Cham, pp 178–187
Masuda HP, Ramos GBA, De Almeida-Engler J, Cabral LM, Coqueiro VM, Macrini CMT, Ferreira PCG, Hemerly AS (2004) Genome based identification and analysis of the pre-replicative complex of Arabidopsis thaliana. FEBS Letters 574: 192–202
Meguro M, Ito H, Takabayashi A, Tanaka R, Tanaka A (2011) Identification of the 7-hydroxymethyl chlorophyll a reductase of the chlorophyll cycle in arabidopsis. Plant Cell 23: 3442–3453
Menges M, Hennig L, Gruissem W, Murray JAH (2002) Cell cycle-regulated gene expression in Arabidopsis. Journal of Biological Chemistry 277: 41987–42002
Millar AH, Sweetlove LJ, Giegé P, Leaver CJ (2001) Analysis of the Arabidopsis Mitochondrial Proteome. Plant Physiology 127: 1711–1727
Murray DL, Kohorn BD (1991) Chloroplasts of Arabidopsis thaliana homozygous for the ch-1 locus lack chlorophyll b, lack stable LHCPII and have stacked thylakoids. Plant Molecular Biology 16: 71–79
Nicolau M, Levine AJ, Carlsson G (2011) Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences 108: 7265–7270
Ogura Y, Shibata F, Sato H, Murata M (2004) Characterization of a CENP-C homolog in Arabidopsis thaliana. Genes and Genetic Systems 79: 139–144
de Oliveira Dal’Molin CG, Quek LE, Saa PA, Nielsen LK (2015) A multi-tissue genome-scale metabolic modeling framework for the analysis of whole plant systems. Frontiers in Plant Science 6: 1–12
Palande S (2023) PlantsAndPython/plant-evo-mapper: plant-evo-mapper-first-release.
Palande S, Kaste JAM, Roberts MD, Segura Abá K, Claucherty C, Dacon J, Doko R, Jayakody TB, Jeffery HR, Kelly N, et al (2023) Topological data analysis reveals a core gene expression backbone that defines form and function across flowering plants. PLOS Biology 21: e3002397
Pathak S, Agarwal A, Ankita A, Gurve MK (2021) Restricted Randomness DBSCAN: A faster DBSCAN Algorithm. 2021 Thirteenth International Conference on Contemporary Computing (IC3-2021). pp 7–12
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14: 417–419
Pearson K (1901) LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2: 559–572
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al (2011) Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 12: 2825–2830
Proost S, Mutwil M (2018) CoNekT: an open-source framework for comparative genomic and transcriptomic network analyses. Nucleic Acids Research 46: W133–W140
Rabadán R, Mohamedi Y, Rubin U, Chu T, Alghalith AN, Elliott O, Arnés L, Cal S, Obaya ÁJ, Levine AJ, et al (2020) Identification of relevant genetic alterations in cancer using topological data analysis. Nature Communications 11: 3808
Rejeb IB, Pastor V, Mauch-Mani B (2014) Plant Responses to Simultaneous Biotic and Abiotic Stress: Molecular Mechanisms. Plants (Basel) 3: 458–475
Rizvi AH, Camara PG, Kandror EK, Roberts TJ, Schieren I, Maniatis T, Rabadan R (2017) Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nature Biotechnology 35: 551–560
Schilmiller AL, Koo AJK, Howe GA (2007) Functional diversification of acyl-coenzyme A oxidases in jasmonic acid biosynthesis and action. Plant Physiology 143: 812–824
Schubert M, Petersson UA, Haas BJ, Funk C, Schröder WP, Kieselbach T (2002) Proteome map of the chloroplast lumen of Arabidopsis thaliana. Journal of Biological Chemistry 277: 8354–8365
Singh G, Memoli F, Carlsson G (2007) Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Eurographics Symposium on Point-Based Graphics. doi: 10.2312/SPBG/SPBG07/091-100
Soneson C, Love MI, Robinson MD (2015) Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 4: 1521
Staswick PE, Tiryaki I (2004) The oxylipin signal jasmonic acid is activated by an enzyme that conjugates it to isoleucine in Arabidopsis. Plant Cell 16: 2117–2127
Tauzin G, Lupo U, Tunstall L, Pérez JB, Caorsi M, Medina-Mardones AM, Dassatti A, Hess K (2021) giotto-tda: A topological data analysis toolkit for machine learning and data exploration. The Journal of Machine Learning Research 22: 1834–1839
Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9:
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17: 261–272
Wang C, Wang H, Zhang J, Chen S (2008) A seed-specific AP2-domain transcription factor from soybean plays a certain role in regulation of seed germination. Science in China Series C: Life Sciences 51: 336–345
Washburn JD, Mejia-Guerra MK, Ramstein G, Kremling KA, Valluru R, Buckler ES, Wang H (2019) Evolutionarily informed deep learning methods for predicting relative transcript abundance from DNA sequence. Proceedings of the National Academy of Sciences 116: 5542–5549
Xiao J, Li J, Ouyang M, Yun T, He B, Ji D, Ma J, Chi W, Lu C, Zhang L (2012) DAC is involved in the accumulation of the cytochrome b6/f complex in Arabidopsis. Plant Physiology 160: 1911–1922
Zeng D, Li M, Jiang N, Ju Y, Schreiber H, Chambers E, Letscher D, Ju T, Topp CN (2021) TopoRoot: a method for computing hierarchy and fine-grained traits of maize roots from 3D imaging. Plant Methods 17: 127
Zhang H, Zhang F, Yu Y, Feng L, Jia J, Liu B, Li B, Guo H, Zhai J (2020) A Comprehensive Online Database for Exploring ∼20,000 Public Arabidopsis RNA-Seq Libraries. Mol Plant 13: 1231–1233
Zhou J-J, Liang Y, Niu Q-K, Chen L-Q, Zhang X-Q, Ye D (2013) The Arabidopsis general transcription factor TFIIB1 (AtTFIIB1) is required for pollen tube growth and endosperm development. Journal of Experimental Botany 64: 2205–2218

APPENDIX E: Supplemental Material for Chapter 7

Text S7.A: Confounder discussion from the Surrogate Variable Analysis

We used Surrogate Variable Analysis (SVA) (Leek et al., 2012) to explore the effects of confounding technical variables on the publicly available SRA data assembled for this study. Briefly, we identified three primary variables of interest (tissue, stress, and family), which were fixed in the model used to estimate “surrogate variables” in order to minimize the amount of variability attributable to these primary variables that is captured by the estimated surrogate variables. These surrogate variables represent unaccounted-for technical variables impacting the dataset. Due to the breadth of families, stresses, and tissues analyzed, we do not have a full factorial design (i.e., there are combinations of family, stress, and tissue factor values for which there are no expression datasets). Because of this, SVA would remove variability due to our primary variables and their interactions.
To get a sense of what kind of impact the surrogate variables might have on the dataset when removed, we estimated the correlation between the first-order interactions among our primary variables and the surrogate variables identified by SVA. We identified 24 surrogate variables, which individually captured between 53% and 98% of variation between BioProjects (Fig S7.4). We also estimated the interaction terms between the tissue, family, and stress factor combinations that were present in the dataset and estimated how much of their variation was captured by the surrogate variables. Individual surrogate variables captured up to 14% of variation between stress conditions, up to 66% of variation between tissue conditions, and up to 63% of variation between families. For the interaction terms between primary variables, individual surrogate variables captured up to 83% of the variation between tissue and family combinations, up to 65% of the variation between stress and family combinations, and up to 71% of the variation between tissue and stress combinations. This suggests that even though stress, tissue, and family are treated as protected primary variables, there are underlying latent variables related to our primary variables and their interactions that may be important sources of biological variation being captured by the surrogate variables. Although individual surrogate variables could be selectively accounted for in downstream analyses in such a way that minimizes the removal of biological signal, this would be a highly subjective process. Moreover, because many factor combinations are missing, we cannot precisely calculate the true correlation between our surrogate variables and the interaction terms, so this approach would be statistically dubious as well.
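To make the "variation captured" figures above concrete, the sketch below regresses one surrogate variable on a single categorical factor and reports the adjusted R², which is how the Fig S7.4 caption describes the calculation. This is a minimal sketch, not the analysis code: the function name and toy values are hypothetical, and it relies on the fact that for a single categorical predictor the least-squares fitted values are the group means.

```python
from collections import defaultdict

def adjusted_r2(response, groups):
    """Adjusted R^2 from regressing `response` on one categorical factor `groups`."""
    n = len(response)
    by_group = defaultdict(list)
    for y, g in zip(response, groups):
        by_group[g].append(y)
    grand_mean = sum(response) / n
    ss_total = sum((y - grand_mean) ** 2 for y in response)
    # Residuals are deviations from each group's own mean.
    ss_resid = sum((y - sum(ys) / len(ys)) ** 2 for ys in by_group.values() for y in ys)
    p = len(by_group) - 1  # number of dummy-coded predictors
    r2 = 1 - ss_resid / ss_total
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical surrogate variable that mostly tracks a two-level batch label.
surrogate = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
batch = ["A", "A", "A", "B", "B", "B"]
print(round(adjusted_r2(surrogate, batch), 3))
```

An interaction term can be handled the same way by using the observed combinations as group labels, for example tuples such as (tissue, family). Only combinations present in the data form groups, which is one reason a design with many missing combinations makes these estimates hard to interpret.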
Because the surrogate variables show substantial linear correlation with our primary variables and their interaction terms, the application of SVA would require eliminating substantial amounts of biological signal. Since the goal of our study is to identify heterogeneous patterns due to stress, tissue, and family within a high-dimensional gene expression dataset, SVA may not be appropriate for us to use. Alternatively, one could potentially minimize the loss of this signal by cherry-picking individual surrogate variables to include in downstream analysis, which would naturally introduce human bias. A third option would be to use an algorithm like ComBat-seq (Zhang et al., 2020b) that relies on explicitly defined batches. This is problematic for the present study because the closest available metadata for batch in the studies gathered from SRA are the BioProject IDs, which are, at best, a proxy for batches of samples and are not sufficient to assess the technical variability or noise in the data. More broadly, as discussed in Jaffe et al. (2015), such genomic data “cleaning” methods, by their very nature, delimit the observable features of the resulting datasets to those prespecified by the investigator. In our view, this limits their utility for broad exploratory analyses of the kind described in this study. For all the above reasons, we opted not to use SVA, ComBat, or related techniques.

FIGURES

Figure S7.1: Histogram of 3-way factors of the RNA-seq samples before and after downsampling. The distribution of 3-way factors for family, tissue, and stress are plotted. The 16 families, 8 tissue types, and 10 stresses equate to 1280 unique 3-way combinations, but we only observed 195 unique combinations in our dataset. The distribution of samples from the entire dataset is shown on the left and the distribution of samples when downsampling the 30 most common 3-way combinations is shown on the right.
Figure S7.2: Factor-wise frequency plots of RNA-seq samples before and after subsampling. The number of samples in each family, tissue type, or stress are plotted before (top) and after (bottom) subsampling.

Figure S7.3: Topology of Mapper graphs generated from the subsampled data. Samples from each node in the Mapper graph are colored by plant family (A), stress (B), or tissue type (C), using the subsampled data. The overall topology and sample distribution are similar to the Mapper graphs constructed with the full, unbalanced dataset, suggesting sample distribution is not a major factor in our analyses.

Figure S7.4: Linear regression analysis of association of surrogate variables to one batch variable (BioProject), our biological variables of interest (stress, tissue, family), and their pairwise interactions. All surrogate variables were regressed on each variable or interaction individually to calculate adjusted R² values.

TABLES

Table S7.1: Enrichment of GreenCut2 genes in orthogroup-mapped Arabidopsis thaliana genes and stress-/tissue-correlated orthogroup-mapped genes. The proportion of GreenCut2 genes in all the orthogroups used in this study was compared against the proportion of GreenCut2 genes in a list of all A. thaliana genes using a one-sided binomial test. The proportion of tissue-lens and stress-lens correlated orthogroup-mapped genes in GreenCut2 was compared against the proportion of GreenCut2 genes in the entire set of orthogroup-mapped genes using one-sided binomial tests. Tissue-correlated genes were hypothesized to be more likely to be in GreenCut2 than a random selection of orthogroup-mapped genes, and the stress-correlated genes were hypothesized to be less likely.
Dataset                             # of Genes in Dataset    # of Genes in GreenCut2    % GreenCut2    p-value
All Arabidopsis Genes               27662                    677                        2.45           -
All Orthogroup-Mapped Genes         6328                     421                        6.65           2.76 × 10^-96
All Tissue-lens Correlated Genes    318                      85                         26.7           9.18 × 10^-29
Stress-lens Correlated Genes        318                      7                          2.20           0.000252

Dataset Descriptions

All supplemental datasets can be found at the following link: https://doi.org/10.1371/journal.pbio.3002397

S1 Dataset. GO term enrichment results on genes negatively correlated with the tissue lens (XLSX)
S2 Dataset. GO term enrichment results on genes positively correlated with the tissue lens (XLSX)
S3 Dataset. GO term enrichment results on genes positively correlated with the stress lens (XLSX)
S4 Dataset. GO term enrichment results on genes positively correlated with the stress lens (XLSX)
S5 Dataset. Overlap between orthogroup-mapped genes and tissue lens and stress lens correlated genes with the GreenCut2 resource (Karpowicz) (XLSX)
S6 Dataset. Metadata of the raw data used in this experiment (CSV)
S7 Dataset. Expression matrix of TPMs for the normalized orthogroups (CSV)

REFERENCES

Jaffe AE, Hyde T, Kleinman J, Weinbergern DR, Chenoweth JG, McKay RD, Leek JT, Colantuoni C (2015) Practical impacts of genomic data “cleaning” on biological discovery using surrogate variable analysis. BMC Bioinformatics 16: 1–10
Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD (2012) The SVA package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28: 882–883
Zhang Y, Parmigiani G, Johnson WE (2020) ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genomics and Bioinformatics. doi: 10.1093/nargab/lqaa078