hwy; i. a dummk a £91, 122:: ,2 i 7:»: : ‘ .. 7 iw I. .....i.!w J 3.313... . 1:! la... uV- ‘I‘LV , al. ”iiiiv.‘ idqn. £211 1...: S. .s . . .ioni . its}. 1‘? 0:. :11}... |vl\ :2... LP: . .a . .5: av) Lzuz .u,.w...u..._.v.$ ”cats 1 f0 E7 This is to certify that the dissertation entitled INTEGRATING GENE EXPRESSION AND METABOLIC PROFILES TO OPTIMIZE CELLULAR FUNCTIONS presented by ZHENG LI has been accepted towards fulfillment of the requirements for the PhD degree in Chemical Enfleering Weir»; 01% Major Professor’s Signature A443, .23: 9004 Date MSU is an Affirmative Action/Equal Opportunity Institution LIBRARY MIChlgq. . State University -.—--.-I-c- 0--I-I-0-u-n-n-o-n-o--o-o-I-I--¢--o-I-u-o-o--u-u-I-I-O-I-a-v--o-o-n-I-.--> PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 6/01 c:/ClRC/DateDue.p65-p.15 INTEGRATING GENE EXPRESSION AND METABOLIC PROFILES TO OPTIMIZE CELLULAR FUNCTIONS By ZHENG LI A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Chemical Engineering and Materials Science 2006 ABSTRACT INTEGRATING GENE EXPRESSION AND METABOLIC PROFILES TO OPTIMIZE CELLULAR FUNCTIONS By ZHENG LI With advances in high throughput technology, profiles of gene expressions, proteins and metabolites can be acquired to help elucidate the network of pathways involved in producing a specific phenotype. This dissertation presents a systems approach that was developed to integrate gene expression, metabolic and phenotypic profiles to identify active pathways that confer a phenotype. The approach involves several separate components to i) identify genes that are relevant to a cellular or metabolic process, ii) integrate multi-source information, and iii) reconstruct pathways and networks. Approaches to identify genes relevant to a phenotype were developed using genetic algorithm coupled partial least squares analysis (GA/PLS) and discussed in Chapter 2. GA/PLS used a log linear model to identify subsets of genes that can best predict a phenotype e.g. a metabolic function. Next we applied Bayesian network analysis to infer network structures from metabolic data, discussed in Chapter 3. Metabolic data was chosen initially because if Bayesian network analysis is able to infer well-known metabolic structures, pathways e. g. TCA cycle and urea cycle, from experimental data, which provided confidence in the ability of this methodology to infer other networks, such as, genetic regulatory networks from gene data. In Chapter 4, we integrated both ideas fi'om the previous two chapters into a Three-stage Integrative Pathway Search (TIPSO), which combined methods to identify relevant genes with network reconstruction. Unlike other approaches, this approach identified the active pathways without requiring interaction measurements or libraries of genetic mutants, and with limited amount of data. The reconstructed network was validated through in silico perturbations studies with published results and further experiments. The framework provided very good predictions of the effect of some, but not all, of the perturbation studies. This may be due in part to Bayesian network analysis’ inability to handle transients, such as cycles and feedback loops. Uncovering the additional information from these transients would be valuable in elucidating the mechanism involved in producing a particular phenotype. Therefore, in Chapter 5, we developed a dynamic model using time-series gene expression data. To illustrate the ability of the model, we applied the model to Escherichia coli K12 (E. coli) and Saccharomyces cerevisiae (yeast), which were more readily arnendable to validatation. Finally, Chapter 6 discussed the improvements to the current T IPS© approach namely, to include multiple metabolites, incorporate multi-source information, and dynamic modeling for time-series data. Copyn'ght by ZHENG L1 2006 DEDICATION TO LULU FOR HER SUPPORT AND LOVE ACKNOWLEDGMENTS First I would like to thank Professor Christina Chan, who gave me guidance throughout my research. Without her knowledge, patience and support this dissertation would have not been possible. She is not only my academic advisor, but also a mentor and a good friend. The knowledge and friendship I gained from her will definitely influence the rest of my life. I would also thank Professor Sarat Dass, his keen insights and valuable discussions on statistics often gave me stimulating ideas in research. I am grateful to Professor Worden and Professor Dale for taldng time to serve on the guidance committee and overseeing this work. Their insightful comments and suggestions have enhanced the technical soundness of this dissertation. I am also grateful to all my friends and colleagues. The discussions with them substantially contiibuted to my work and broadened my knowledge. Last but not the least, many thanks go to my family: my wife, my Mom and Dad, and my brother. Without their encouragement and continuous support, I would not be who I am today. This research was supported by National Science Foundation, Whitaker Foundation, Environmental Protection Agency. I was also partially supported by QBMI award and DeVlieg fellowship for Computational Engineering at Michigan State University. TABLE OF CONTENTS LIST OF TABLES ............................................................................................................. xi LIST OF FIGURES ......................................................................................................... xiii LIST OF ACRONYMS .................................................................................................... xv CHAPTER 1 INTRODUCTION ........................................................................................ 1 1.1 Background ......................................................................................................... 1 1.2 Organization of the Dissertation ............................................................................... 5 1.3 Contribution of the Dissertation ................................................................................ 6 CHAPTER 2 IDENTIFYING GENES RELEVANT TO A CELLULAR FUNCTION 7 2.1 Introduction ............................................................................................................... 7 2.2 Methodology of GA/PLS .......................................................................................... 8 2.2.1 Assumption of GA/PLS ..................................................................................... 9 2.2.2 Partial Least Squares Analysis ......................................................................... 10 2.2.3 Genetic Algorithm ........................................................................................... 11 2.2.4 Gene Subset Selection Optimization ............................................................... 15 2.3 Application of GA/PLS to Rat Hepatocypte Culture System ................................. 16 2.3.] Experimental System ....................................................................................... 17 2.3.2 Data Preprocess ................................................................................................ 18 2.3.3 Results .............................................................................................................. 23 2.3.4 Model Robustness Evaluation .......................................................................... 34 2.4 Discussion ............................................................................................................... 4O vii CHAPTER 3 A BAYESIAN FRAMEWORK TO INFER BIOLOGICAL PATHWAYS AND NETWORK ............................................................................................................. 47 3.1 Introduction ............................................................................................................. 47 3.2 MATERIAL AND METHODS .............................................................................. 52 3.2.1 Data Collection ................................................................................................ 52 3.2.2 Bayesian networks ........................................................................................... 54 3.2.3 Inferring the Bayesian Network from information theory ............................... 56 3.2.4 Detecting latent variables and comparing alternative models based on a Bayesian scoring metric ............................................................................................ 58 3.2.5 Bayesian network predictor and target function prediction ............................. 60 3.2.6 Data discretization ........................................................................................... 61 3.2.7 Sensitivity analysis of Bayesian Networks ...................................................... 62 3.3 Results ..................................................................................................................... 64 3.3.1 Reverse engineering the sub-networks ............................................................ 65 3.3.2 Identifying relevant pathways linked to intracellular TG levels ...................... 68 3.3.3 Accuracy and sensitivity of the postulated networks ....................................... 71 3.4 Discussion ............................................................................................................... 74 CHAPTER 4 IDENTIFYING PHENOTYPE RELEVANT PAHTWAYS ..................... 84 4.1 Introduction ............................................................................................................. 84 4.2 Materials and Methods ............................................................................................ 87 4.2.1 Three Stage Integrative Pathway Search Framework (TIPSO) ........................ 87 4.2.2 Bayesian Network Inference ............................................................................ 90 4.3 Results and discussion ............................................................................................ 91 viii 4.3.1. Identifying genes relevant to cytotoxicity or LDH release using GA/PLS 91 4.3.2. Identifying genes involved in an independent pathway related to cytotoxicity using CICA ............................................................................................................... 92 4.3.3. Reconstruct pathways related to cytotoxicity using BN ................................. 93 CHAPTER 5 STATE SPACE MODEL TO IN FER TRANSCRIPTION FACTOR ACTIVITIES FROM DYNAMIC GENE EXPRESSION DATA ................................. 109 5.1 Introduction ........................................................................................................... 109 5.2 Materials and methods .......................................................................................... 113 5.3 Results ................................................................................................................... 120 5.3.1. Inferring TFA from simulated data ............................................................... 120 5.3.2. E. coli ............................................................................................................ 124 5.3.3. Saccharomyces Cerevisiae ............................................................................ 125 5.4 Discussion ............................................................................................................. 128 CHAPTER 6 EXTENSIONS TO IMPROVE THE HPSO FRAMEWORK ................ 134 6.1 Introduction ........................................................................................................... 134 6.2 A Hierarchical Approach Employing Metabolic and Gene Expression Profiles to Identify the Pathways that Confer Cytotoxicity in HepGZ Cells ................................ 134 6.2.1 Introduction .................................................................................................... 134 6.2.2 Materials and Methods ................................................................................... 136 6.2.3 Results ............................................................................................................ 139 6.2.4 Discussion ...................................................................................................... 148 6.2.5 Conclusion ..................................................................................................... 149 ix 6.3 Inferring active pathways that confer a phenotype involving multiple metabolites ..................................................................................................................................... 149 6.3.1 Introduction .................................................................................................... 149 6.3.2 Materials and Methods ................................................................................... 150 6.3.3 Results and discussion ................................................................................... 150 6.4 Using a Dynamic Bayesian Network (DBN) to identify active pathways from the dynamic profiles .......................................................................................................... 156 6.4.1 Introduction .................................................................................................... 156 6.4.2 Materials and Methods ................................................................................... 157 6.4.3 Results and Discussion .................................................................................. 157 6.5 Future work ........................................................................................................... 161 CHAPTER 7 CONCLUSIONS ...................................................................................... 163 Appendix Figure 1. Metabolic Reaction Map ............................................... 165 BIBLIOGRAPHY ....................................................................................................... 166 LIST OF TABLES Table 2.1: 111 genes selected after t-test ....................................................... 19 Table 2.2: 45 genes involved in the metabolic network ...................................... 22 Table 2.3 A. Genes selected for intracellular triglyceride .................................... 25 Table 2.3 B. Genes selected for urea synthesis ................................................. 27 Table 2.4 A. Genes selected for intracellular triglyceride after noise addition ............. 36 Table 2.4 B. Genes selected for urea synthesis afier noise addition .......................... 38 Table 2.5A. Genes selected for intracellular TG using fitness function in equation 2.10 ......................................................................................................... 42 Table 2.53. Gene selected for intracellular TG with fitness function in equation 2.10 after noise addition ........................................................................................ 43 Table 3.1. Learned TCA cycle relationship ...................................................... 54 Table 3.2. Learned urea cycle relationships .................................................... 68 Table 3.3. Relations recovered by Bayesian network analysis with IC* algorithm. . . . . ...71 Table 3.4. Bayesian score and prediction accuracy of the Figure 3.5 ........................ 73 Table 3.5 Sensitivity analysis of TG network shown in Figure 3.6 ......................... 74 Table 3.6. Relations recovered by Bayesian network analysis with MCMC algorithm, a search and score method. A value of 1 indicates a direct connection. ...................... 81 Table 4.1 Simulating genetic perturbation and its effects on LDH release ................. 95 Table 4.2 Simulating down-regulation of PKC-5 and its effects on NF-KB in medium culture (with 20 ng/ml TNF-a). ............................................................... 101 Table 5.1 (a) Parameters of the simulated network shown in Figure 4.2 .................. 121 Table 5.1 (b) Learned parameters of the simulated network shown in Figure 4.2. . . . . ...122 Table 5.1 (c) Simulated gene expression data ................................................. 123 Table 5.2 Eight genes regulated by ARCA and CRP ......................................... 124 xi Table 6.1: Gene sets used in the GSEA analysis .............................................. 138 Table 6.2 Correlation between the metabolic fluxes and the LDH release ................ 142 Table 6.3 Enriched Gene sets in GSEA analysis .............................................. 144 Table 6.4 Genes selected by MBPLS model ................................................... 145 Table 6.5 Important genes selected for two factors ofTG and LDH........................150 Table 6.6 Important genes selected for LDH alone ........................................... 153 Table 6.7 Experimental conditions for DBN learning ......................................... 158 xii LIST OF FIGURES Figure 2.1 An overview of the GA/PLS feature selection algorithm .......................... 9 Figure 2.2 Representative of the string in the GA/PLS algorithm ............................ 12 Figure 2.3 Determining w value for selecting TG genes ....................................... 14 Figure 2.4 Determining w for selecting urea synthesis genes .................................. 14 Figure 2.5 TG gene selection and metabolic function prediction ............................. 25 Figure 2.6 Pathway reconstruction for TG ....................................................... 31 Figure 2.7 PLS prediction of urea synthesis ..................................................... 32 Figure 2.8 Pathway reconstruction for urea synthesis .......................................... 34 Figure 3.1 Bayesian network based framework ................................................. 50 Figure 3.2 An example of a simple Bayesian network ......................................... 55 Figure 3.3 An illustration of the three phases of the Bayesian network learning process using mutual information based algorithm ...................................................... 58 Figure 3.4 Reverse Engineered TCA and urea cycle learned by mutual information based algorithm ............................................................................................. 67 Figure 3.5 Postulated sub-networks for intracellular TG accumulation, learned with IC* algorithm ............................................................................................. 70 Figure 3.6. PLS flux selection ..................................................................... 76 Figure 3.7 The effect of noise in the data on Bayesian network learning of the urea cycle. ......................................................................................................... 78 Figure 3.8 The effect of omitted measurements on Bayesian network learning. . . . . . . . ....79 Figure 4.1 Scheme of T IPS© ...................................................................... 85 Figure 4.2 Example of a Bayesian network ...................................................... 88 Figure 4.3 A representative sub-network related to cytotoxicity ............................. 91 Figure 4.4 Effect of TNFa and palmitate on stearoyl-CoA desaturase (SCD) measured by western blotting ....................................................................................... 94 xiii Figure 4.5 Effect of SCD activator, clofibrate and ciprofibrate, on LDH release in the palmitate cultures .................................................................................... 97 Figure 4.6 Effect of antioxidant N-acetyl-cysteine on SCD measured by western blotting. ......................................................................................................... 99 Figure 4.7 Effect of TNFor on PKG-8 expression measured by western blotting. . .. . ....101 Figure 4.8 Measurement of LDH release in the different culture mediums, with and without rottlerin .................................................................................... 103 Figure 4.9 Effect of rottlerin on NF -KB measured by western blotting .................... 103 Figure 4.10 Effects of TNFor on Bcl-2 .......................................................... 105 Figure 5.1 SSM representation of the gene regulatory network ............................. 112 Figure 5.3 An example and its SSM representation of the gene regulatory motifs ....... 118 Figure 5.3 The results of using a SSM to analyze an E. coli system ........................ 125 Figure 5.4 A SSM representation of the Saccharomyces cerevisiae (yeast) system studied(é ........................................................................................................ 12 Figure 5.5 The results of using a SSM to analyze a yeast system ........................... 128 Figure 6.1 An overview of the hierarchical approach ......................................... 135 Figure 6.2 Fisher’s discriminant analysis of metabolites ..................................... 141 Figure 6.3 DBN learned gene regulatory network ............................................. 162 xiv LIST OF ACRONYMS ACC: acetyl-CoA carboxylase ALDH: aldehyde dehydrogenase ANOVA: analysis of variance Bel-2: B-cell lymphoma 2 BSA: Bovine Serum Albumin BN: Bayesian network analysis ChIP: chromatin irnmunoprecipitation EM: expectation maximization ETC: electron transport chain FDA: Fischer’s discriminant analysis FFA: free fatty acid GA: genetic algorithm GSEA: gene set enrichment analysis GST: glutathione—s transferase IAP: inhibitor of apoptosis protein ICA: independent component analysis LDH: lactate dehydrogenase MBPLS: multi-block partial least squares NAC: N-acetyl-cysteine NF -KB: nuclear factor kappa B PLS: partial least squares analysis PKC-delta: protein kinase C delta isoforrn XV PKR: double stranded RNA dependent protein kinase ROS: reactive oxygen species SCD: stearoyl-CoA desaturase SOD: superoxide dismutase SSM: state-space model TIPS: three stage integrative pathway search system TNF: tumor necrosis factor TG: triglyceride TRAF: TNF receptor associated factor xvi CHAPTER 1 INTRODUCTION 1.1 Background A cell is a complex multi-scale biological system, with information flowing between genes, proteins, and metabolites. Genes transcribe to proteins and some proteins are employed as enzymes to catalyze metabolic reactions, while some other proteins are employed as transcription factors. Genes can be transcriptionally regulated by both proteins (transcription factors) and metabolites. Thus, a cell is a complex system regulated by interconnected gene regulatory and signal transduction networks. The development of high throughput technologies, such as cDNA mircoarray and mass spectroscopy, makes it possible to probe a cell system globally and measure their gene expression, protein and metabolite profiles. Due to the quantity of data, there is an increasing reliance on models to interpret the data and understand the underlying mechanisms. Cellular functions are regulated by networks of genes, proteins, and small molecules, such as metabolites and signaling molecules. When cellular functions are abnormally regulated, disease states may ensue. For example, apoptosis is abnormally regulated in cancer cells, while glucose level is abnormally regulated by insulin in type 2 diabetes. Therefore, to understand the mechanisms of disease development requires knowledge of how cellular function are regulated, which in turn would aid the development of therapies. There are two types of approaches that have been used to reconstruct pathways fiom high throughput data measurements. The first approach reconstructs pathways from measurements of molecular interactions such as protein- protein and protein-DNA interactions. The second approach infers the pathways from measurements of cellular states, such as gene expression and metabolic profiles. With the first approach, gene regulatory [Lee T.I. et al., 2002] and signal transduction [Steffen, M. et al., 2002] networks have been reconstructed using protein- DNA and protein-protein interactions, respectively. Protein-DNA interactions can be measured by chromatin irnmunoprecipitation (ChIP). Lee T.I. et al. (2002) integrated gene expression with protein-DNA interaction data to reconstruct the transcriptional regulatory network in yeast cell cycle. Similarly, gene expression profiles have been integrated with protein-protein interaction data to construct signal transduction pathways [Steffen M. et al., 2002] while a global molecular interaction network was reconstructed by integrating gene expression with protein-DNA and protein-protein interaction [Ideker T. 2002]. In that case, the gene expression data were used to identify the active pathways or sub-networks. The method assumes that genes involved in the same sub- network/pathway should show similar expression changes. Therefore, the active sub- network is identified by selecting a subset of genes that are connected based upon interaction data and whereby their expression profiles changed significantly. However the availability of genome-wide protein-protein and protein-DNA interaction measurements is more limited in mammalian systems. To address this problem, we applied the second approach to infer networks from measurements of cellular states, such as gene expression and metabolite profiles, which are observed cellular responses induced by the underlying regulatory networks. A typical approach is to identify co-regulated genes by clustering genes with similar expression. For example, [Sega], E. et al., 2003] built relational probabilistic models to combine gene expression data with DNA sequence profile or protein-protein interaction data to identify co- expressed genes that share common regulatory motifs or whose gene products interact with each other. [Tanay, A. et al. 2004] developed an approach, SAMBA, to cluster genes based upon the combined information from gene expression, protein-protein interaction, protein-DNA interaction, and growth phenotype data. However, cluster analysis can not identify potential pathways by which the cells are regulated to confer a specific phenotype. To infer the active pathways from gene expression data, Collins’ group has developed a model by assuming a log linear relationship between the expression of a gene, and the genes that regulate it, as well as the external perturbations [di Bernardo, D. et al. 2005]. The gene expression profiles of yeast, under different perturbed states, such as, gene deletion, promoter insertion and drug compound treatment, were obtained from the literature and used to train the model. Then, the gene expression response to a new drug (external perturbation) was obtained and the log linear model was applied to predict the effects induced by a new drug. This method requires a large amount of data for the network reconstruction [di Bernardo, D. et al. 2005]. Therefore, we propose an alternative approach that identifies the active pathways without requiring interaction measurements or libraries of genetic mutants, and with a limited amount of data, namely, by integrating gene expression and phenotypic profiles. The method assumes that information about the regulatory networks is contained within the profiles. Central to our approach is Bayesian Network (BN) analysis, a probabilistic graphic model, which detects implicit as well as explicit connections [Li and Chan, 2004b]. BN can detect indirect influences and unmeasured events and is not susceptible to the existence of unobserved variables. It has been applied to infer gene regulatory network of yeast cell cycle from gene expression data [Friedman 2004], metabolic subnetworks from metabolic data [Li and Chan, 2004b] and protein signaling pathways from protein activity data [Sachs et al. 2005]. Unlike previous studies, in which the nodes are fairly well established in the literature and the BN approach was applied to infer the connections between the nodes, this study aims to identify the relevant nodes, which are unknown a priori to be activated by the environment or important to a cellular process of interest. The BN approach is computationally inefficient when applied to large networks with thousands of genes. In other words, it is more difficult to infer the structure of large networks using BN analysis. Therefore, in our approach we address this shortcoming as follows: since only part of the network is typically perturbed (active) when the cells are subjected to an environmental stress, methods to reduce the set of genes to a smaller and more relevant subgroup will be valuable. BN analysis can then be used to reconstruct the active sub-network from the smaller group of genes to reveal which pathways are induced by the external stimuli or environment. As proof-of-concept, the framework was applied to identify pathways that are involved in palmitate and tumor necrosis factor (TNF)-or induced cytotoxicity in human liver cells. The level of cytotoxicity was measured by lactate dehybrogenase (LDH) release, which is a measure of cell death. Elevated levels of free fatty acids (FFAs) and TNFor have been shown to be involved in the pathogenesis of liver disorders, such as fatty liver disease and steatohepatitis [Felber 2002, Kobayashi 1998, Tilg 2001, Watada 2003]. Quantification of the genetic responses of hepatocytes to physiological actions of these factors helped to identify the pathways involved in conferring cytotoxicity (e.g., producing a cellular phenotype with a high level of LDH release). 1.2 Organization of the Dissertation In the following chapters, I will talk about techniques to select important genes relevant to a cellular function. In chapter 2, genetic algorithm coupled partial least square analysis (GA/PLS) is introduced and applied initially to experimental data to of primary rat hepatocyte culture system to select genes relevant to cellular functions such as urea production and TG accumulation. In chapter 3, Bayesian network analysis is introduced and applied to reconstruct regulatory networks from experimentally obtained metabolic data to illustrate its capabilities. Chapter 4 discusses the Three-stage Integrative Pathway Search (T IPSQ) framework, which combines several techniques e.g. GA/PLS, independent component analysis (ICA) and BN analysis, and its application to a HepG2 cell culture system to infer pathways important in regulating the cytotoxic phenotype. These pathways were subsequently validated through both a literature survey and further experiments. In chapter 5, multisource data including gene regulatory network structure information and gene expression data were integrated in a dynamic state space model to extract underlying information of transcription factor activities. Chapter 6 discusses several extensions to improve the T [PS© framework, namely, by incorporating multiple metabolites and multi-source information into a dynamic model 1.3 Contribution of the Dissertation This dissertation is one of the first endeavors to integrate phenotypic, metabolic and gene expression profiles with the objective to identify active pathways that regulate the cellular functions. It contributes in the following aspects: 1) Introduced GA/PLS to select a subset of genes relevant to cellular functions and was applied to experimental systems. 2) Proposed a Bayesian framework to reconstruct biological pathways. Metabolic fluxes and gene regulatory network were reconstructed from experimental data. 3) Independent component analysis was applied to extract pathways relevant a cellular function. 4) T IPSO framework integrated GA/PLS, ICA and BN to infer active pathways relevant to a phenotype. T IPS© approach was applied to a HepG2 cell system and meaningful pathways were extracted and verified. 5) A dynamic, state space model was introduced to infer transcription factor activities from gene expression data. CHAPTER 2 IDENTIFYING GENES RELEVANT TO A CELLULAR FUNCTION 2.1 Introduction As a step towards uncovering the interplay between genes and metabolic functions, the objective of this chapter is to develop a framework capable of selecting a subset of genes that can i) quantitatively predict metabolic functions based upon their expression levels; ii) be used to reconstruct regulatory pathways. Since metabolic fimctions are determined in part by the levels and activities of enzymes in the network, which in turn are regulated in part by their gene expression, it is reasonable to suggest that metabolic functions could be predicted by the expression level of a subset of relevant genes. To test this assumption, we developed a genetic algorithm coupled partial least squares analysis (GA/PLS) framework to identify a subset of genes that can predict metabolic function(s). The relevance of the gene subset is assessed by reconstructing their regulatory roles in the metabolic network with aid from the current literature knowledge. Gene expression and metabolic data were experimentally obtained for a hepatocellular system. This concept of identifying a subset of genes that can predict metabolic function(s) is analogous to selecting an optimal subset of wavelengths in constructing a spectroscopic calibration model to predict chemical concentrations from signal measurements, e.g. absorbance value [Banglore 1996]. Full spectrum data (signal measurements) consists of absorbance values at different wavelengths, which are easily collected in a matter of seconds, but only a subset of the spectral data is essential to building a calibration model. The optimal subset of wavelengths used to build the calibration model changes as the chemicals being measured change [Leardi 2000]. By analogy, a full range of gene expression data is also easily collected, but only a subset of these genes are relevant to a particular cellular function. This subset of genes changes as the cellular function being predicted changes. In the current study, genetic algorithm (GA) was used to identify a subset of genes that are most influential or relevant to a metabolic function. Then a partial least squares analysis (PLS) model was subsequently constructed to predict the metabolic function based upon the expression level of the selected genes. We reviewed the biological function of the selected genes to elucidate their potential role in regulating the metabolic function. Finally, we evaluated the model robustness by artificially adding noise to the gene expression data and comparing the subset of genes selected before and after noise addition. 2.2 Methodology of GA/PLS The GA/PLS framework is illustrated in Figure 2.1. In brief, the whole dataset is separated into three independent groups, namely training, monitoring, and testing data sets, in order to avoid over-fitting. First, GA randomly selects a subset of genes from the training data set. The expression levels of these genes are subsequently used to construct a PLS model to predict a metabolic function, e.g. intracellular triglyceride (TG) or urea synthesis. The accuracy of the PLS model is evaluated using both the training and monitoring datasets. The accuracy of the PLS model prediction is used to assess the fitness of the subset of genes selected. Genetic algorithm continues to search for a gene subset until the accuracy of the PLS model prediction is maximized. The testing data set is used to validate the prediction ability of the final GA/PLS model. FIGURE 2.1: An overview of the GA/PLS feature selection algorithm. The data was separated into three independent subsets: training set, monitoring set, and testing set. GA randomly selects a subset of genes from the training data set. A PLS model was used to predict metabolic function based upon the expression of the selected genes. The prediction accuracy of the PLS model was used as the fitness value of the gene subset The gene subsets with high fitness values were selected into the next iteration and subjected to crossover and mutation to introduce diversity into the population. The process terminates when the objective function reaches its maximum or when the termination condition (e.g., maximum number of iterations) is satisfied. The final model is validated with a testing data set. 2.2.1 Assumption of GA/PLS We approximate the relation between the metabolic firnction and expression level of this subset of genes with a log-linear model: Met(treatment) _ " Gene(treatment)i C(i) (21) Met(control) i=1 Gene(control)i I where Met(treatment) and Met(control) are the metabolic function for the treated and control cultures, respectively; Gene(treatment)i and Gene(control),- are the expression level of gene i for the treated and control cultures, respectively. G t t t - Denoting Y as log2(Met(treatment)) and X; as log,( ene( rea men )’ ), equation (2.1) Met(control) Gene(c0ntrol),- is transformed to: n Y = 2 com (2.2) i=1 Since DNA mircoarray data are typically measured with respect to a reference level, we applied a log-linear model, which works well when data are presented as relative levels. Furthermore, a log-linear model allows some of the nonlinear relationships between metabolic function and gene expression to be captured. Log-linear models have been applied to approximate nonlinear processes in biochemical systems [Ni and Savageau, 1996; Ni and Savageau, 1996]. In this study the coefficients C(i) in equation (2.2) are determined by partial least square (PLS) analysis and the genes, Genei, are selected by GA/PLS as described below. 2.2.2 Partial Least Squares Analysis PLS [Wold 1984, Geladi 1986] is a statistical approach capable of modeling a large number of variables using a relatively small set of observations. It circumvents typical problems associated with highly correlated and collinear nature of experimental data by projecting the data onto a few independent latent factors. These latent factors simplify the complex and diverse relationships by capturing the variable interactions contained in the original data into a new set of fewer unobserved/latent variables. We used this approach to map the levels of gene expression (X) to a metabolic function (Y), to gain an understanding of the interplay between a hepatic function and the gene expression profile. The PLS algorithm determines, based upon a nonlinear iterative partial least 10 squares (NIPALS) approach, a set of orthogonal projection axes W, henceforth called PLS-weights, and sample projections T. For direct projection of the samples, W* = (W (PT*W)'l) is used: T = XW* (2.3) Then, regression coefficients [5 in equation (2.4) are obtained by regressing Y onto the sample projections T: Y = TB (2.4) With a PLS factors, the PLS model is: A . a Y=XWafl=Tafl=thjflj (2.5) J: T ‘1 T B =(Ta T) Ta Y (2.6) 2.2.3 Genetic Algorithm GA is an optimization method based upon the theory of natural selection. A more complete discussion of GA can be found in [Goldberg 1989]. Here we demonstrate a feature selection algorithm that combines GA and PLS (see Figure 2.1). GA begins with an initial, randomly selected population. Each individual in the population is a potential solution to the optimization problem and its fitness to the optimization problem is evaluated by a fitness function. The population evolves over generation by way of three genetic operators: reproduction, crossover, and mutation. The process terminates when the objective function reaches its maximum or when the termination condition (e.g., maximum number of iterations) is satisfied. The algorithm includes four main steps: initialization, evaluation, reproduction and crossover, and mutation. 11 Step 1: Initialization In GA, an individual is a string of length N, schematically shown in Figure 2.2. The first N—I values of the string are binary with value l/0 representing the inclusion or exclusion of a gene in the PLS model and the N“ value represents the number of latent variables in the PLS model. A population is a collection of different individuals, i.e. a set of different strings (e.g., genes) of the same length. The initial population is created randomly in a user specified bounds of the N variables in the string. For example, the bounds of the first N—I bits of the string in this current study were [0,1], thus the individuals in the initial population were assigned randomly with values 0 or 1. The N“ bit of the string, i.e. number of latent variables, is randomly assigned a value between [1,10]. f Genel Gene2 Gene3 Gene 156 l 1 0 O 1 0 2 Num of latent variables FIGURE 2.2: Representative of the string in the GA/PLS algorithm. 156 genes were selected by t-test and addition of known relevant genes. The first 156 elements of the string contain a binary 0/1 data representing the exclusion or inclusion of the genes in the PLS model. The last bit of the chromosome represents the number of latent variables in the PLS model. Step 2: Evaluation Each individual in the population is evaluated with a fitness function to determine how well they fit or improved the optimization problem. The optimization problem in our case is to find a subset of genes that can be used to construct a PLS model to predict metabolic functions with a minimal prediction error and number of latent variables. The fitness 12 function is defined by equation 2.7. Similar fitness function has been used in [Bangalore 1996] to select wavelengths that optimize calibration models. 1 20,. —y,. )2 +LVW fitness = (2.7) A where y,- is the measured metabolic flux value, y, is the PLS predicted metabolic flux value, LV is the number of latent variables in the PLS model, and w is a weighting factor to establish an optimal balance between prediction accuracy and the model size (number of PLS latent variables). w normalizes the model size to the same scale as the prediction error terms. An Optimal value of w is critical to the success of the GA/PLS model. Too large a w results in a model with too few factors to explain the gene expression data. Too small a w causes the PLS model to include too many factors and become essentially a curve fit. An optimal w, guided by a GA search, allows one to find a model that will provide the most accurate prediction with a minimum number of latent variables. To determine an optimal w value, different w values, ranging from 0 to 0.5, were used to search for a PLS model. The different models were compared using two criteria, the mean square error of prediction and the number of PLS latent variables. An optimal value of w was determined by balancing these two criteria to obtain a reasonable prediction error with a minimum number of latent variables. The optimal value for w was determined to be 0.4 for predicting intracellular TG accumulation and 0.45 to predict urea synthesis, shown in Figures 2.3 and 2.4, respectively. 13 A 8.E-O4 Mean 6.E-04 2,”ng 4.304 predctiou 2304 0.13100 0 w 0.1 0.2 0.3 0.4 0.5 0.6 B 4 Numof 2 T T T A A PLSIatentI ' ' variables 0 0 W 0.1 0.2 0.3 0.4 0.5 0.6 FIGURE 2.3: Determining w value for selecting TG genes. An optimal value for w in equation 2.7 was determined by balancing the two criteria of GA/PLS performance: the mean square error of the PLS model prediction (shown in 3A), and the number of latent variables in the PLS model (shown in 38). Different values of w values ranging from 0 to 0.5 was tested while monitoring the above two criteria. An optimal value of 0.4 for w was found for intracellular TG. At this w value, both the number of latent variables and prediction error reach their minimum levels. A 8.E-04 6.E-04 4.E-04 Mm 23-04 8‘1““ on+oo error 0f 0 0 1 0 2 0 '3 0 4 0 5 O 6 “Cdcum ‘V . . . . .-. . 3 B 2 e e e Nrrm of 1 g g PLS latent variables 0 0 w 0.1 0.2 0.3 0.4 0.5. 0.6 FIGURE 2.4: Determining w for selecting urea synthesis genes. An optimal value for w in equation 2.7 was determined by balancing the two criteria in GA/PLS performance: the mean square error of the PLS model prediction (shown in 4A), and the number of latent variables in the PLS model (shown in 4B). Different values of w ranging from 0 to 0.5 were tested while monitoring the above two criteria. An optimal value of 0.45 for w was found for urea synthesis. At this w value, both the number of latent variables and prediction error reach their minimum levels. Step 3: Reproduction After the evaluation step, each individual in the population is assigned a probability value according to its fitness value, which determines whether it will be selected to the next iteration or generation in GA. We used roulette selection [Holland 1975] to determine the probability of selecting an individual into the next iteration: Pi = §‘ (2.8) where the fitness f,- ,i=1,2, ...,N of each individual is calculated using the fitness function defined in equation 2.7. Step 4: Crossover and Mutation Crossover and mutation was applied to introduce diversity into the population, which ensures the escape from local optimum in the GA search algorithm. In crossover, two distinct individuals were selected randomly from the population and some portions of the strings are exchanged with a probability PC. In mutation, one or more bits of the string are altered with probability Pm. In this study, we applied an arithmetic crossover with a probability of 0.6 and a boundary mutation with probability of 0.05 (see reference [Houck 1995] for details on the crossover and mutation methods applied in this chapter). 2.2.4 Gene Subset Selection Optimization GA/PLS selects different subsets of genes to predict the same metabolic flux given different initial populations. In other words, different subsets of genes can give the same or nearly same degree of accuracy in the prediction of the metabolic flux based upon their gene expression level. Therefore, a strategy was developed to identify a group of genes 15 from multiple possible solutions. The strategy involved running the GA/PLS model with different initial populations and counting the frequency of appearance of each gene in the multiple solutions. The initial population size ranged from 30 to 100 individuals and each individual contained a set of different genes. GA/PLS was run 14 times with different initial population of individuals. A gene was included in the final subset if it was selected by the GA/PLS model in more than half of the runs. Therefore, the genes that appeared more than 8 times as a solution in the GA/PLS model were selected into the final gene subset. Similar methods have been applied for wavelength selection in spectroscopic calibration model [Leardi 2000]. 2.3 Application of GA/PLS to Rat Hepatocypte Culture System To test the ability of GA/PLS in identifying relevant genes of a cellular function, GA/PLS was applied to the gene expression and metabolic profiles obtained fi'om primary rat liver cell cultures. GA/PLS selected a subset of genes relevant to cellular frmctions of urea production and triglyceride accumulation. Then a partial least squares analysis (PLS) model was subsequently constructed to predict the metabolic function based upon the expression level of the selected genes. We reviewed the biological function of the selected genes to elucidate their potential role in regulating the metabolic function. Finally, we evaluated the model robustness by artificially adding noise to the gene expression data and comparing the subset of genes selected before and after noise addition. 16 2.3.1 Experimental System The gene expression and metabolic data were obtained from cultured primary rat hepatocytes [Chan 2002, 2003 a,b,c]. Hepatocytes were isolated from adult female Lewis rats. The isolated hepatocytes were cultured in a collagen sandwich configuration and incubated in standard hepatocyte culture medium containing either 0.5 U/ml insulin (high insulin) or 50 uU/ml insulin (low insulin). The hepatocytes were cultured in this fashion for at least 6 days prior to plasma exposure. This interval is considered the pre- conditioning period. The six-day-old-sandwiched hepatocyte cultures were subsequently exposed to unsupplemented, or amino acid supplemented plasma solution for an additional seven days [Chan 2003 a,b]. The four combinations of pre-conditioning and plasma supplementation evaluated were as follows. They were i) low insulin pre- conditioned and unsupplemented plasma (denoted as [L,O] thereafter), ii) low insulin pre- conditioned and amino acid supplemented plasma (denoted as [L,A]), iii) high insulin pre-conditioned and unsupplemented plasma (denoted as [H, 0]), and iv) high insulin pre- conditioned and amino acid supplemented plasma (denoted as [H,A]) cultures. The metabolites were measured using HPLC and biochemical assays described elsewhere [Chan 2002, 2003 a,b,c]. A model for hepatocyte metabolism was created based on the known stoichiometry of the hepatic metabolic network. Metabolite measurements were coupled to Metabolic Flux Analysis (MFA) to obtain the intracellular fluxes [Chan 2002a]. The gene expression profiles of the samples were collected using Affymetrix chips. All expression profiles were generated using total RNA, with in vitro transcription yielding biotinylated cRNA for hybridization to Affymetrix Rat UG34A GeneChip array. 17 The chips were analyzed at the Brigham and Women’s Hospital in Boston. The expression data can be accessed at www.msu.edu/~lizhengl/microgrrav.zip. 2.3.2 Data Preprocess Expression levels of 8,799 genes were measured using Affyrnetrix chips and a subset of these genes were differentially expressed across the range of experimental conditions in this study. Student t-test was used to identify the subset of genes that were differentially expressed between the high insulin pre-conditioned (including [11, 0], [H,A]) and the low insulin pre—conditioned (including [L,O], [L,A]) groups. In t-test, normalized distance A between the high insulin and low insulin groups is calculated as follows: 2 2 ’51. 31h "h "1h where (mh,mu,) is the population mean value of the high insulin and low insulin groups, respectively, ($11,311,) is the population variances, and (nh, 1111,) is the number of the samples in the population. It is known that D follows approximately a student distribution. The two populations are considered different when D exceeds a certain threshold value, which depends on the confidence level selected. Using student t-test, 179 genes were identified as significantly different with a 95% confidence level. The 179 genes were checked manually to exclude irrelevant ones, for example, genes expressed specifically in other tissues, e.g., brain were removed. Finally, 111 genes remained for further GA/PLS analysis and are listed in Table 2.1. In addition to the genes selected by t-test, many other genes encode the enzymes that are involved in the metabolic network defined in reference [Chan 2003 a] and are relevant to heptocellular function, but were not selected by the t-test. We included these 18 informative genes, namely, 45 genes that encode the enzymes or transcription factors involved in the metabolic network detailed elsewhere [Chan 2003 a]. These genes are listed in Table 2.2. Table 2.1: 111 genes selected after t-test No. Accession number Gene Name Glyeeraldehydes-3-phosphate-dehydrogenase (GAPDH) (GAPDH, 1 AFFX_Rat_GAPDH_3 EC 1 .2.1.12) 2 U93306 U93306 Rattus norvegicus VEGF receptor-ZIFLK-I mRNA 3 AF000139 25—hydroxyvitamin D 1-hydroxylase (CYP1) mRNA, complete eds RATATPBZS Rat Na+, K+ -ATPase (EC 3.6.1.3) betaZ subunit 4 D90048 gene and 5 flank 5 E03344 E033440ds cDNA sequence of peroxisome forming factor 6 J01435 mitochondrial cytochrome oxidase subunits I,|l, Ill genes, cholesterol side-chain cleavage enzyme mRNA (P450800), 7 J05156 complete cds 8 K00750 cytochrome c nuclear-encoded mitochondrial gene and flanks 9 K00996 cytochrome p-450e (phenobarbital-induced) mRNA, 3 end RATCYP45A Rat P-450(1) variant (phenobarbital-inducible) 10 K01721 cytochrome 3 end, flank 11 L27129 stress activated protein kinase gamma isoform, mRNA, 5 end M20131eds RATCYP45| Rat cytochrome P450llE1 gene, complete 12 M201 31 eds M80784 RATTGFBET R.norvegicus type III TGF-beta receptor 13 M80784 mRNA, complete eds 14 J03960 J03960 Rat 5-lipoxygenase mRNA, complete eds M88592 Rattus rattus peroxisome proliferator activated receptor 15 M88592 (PPAR) mRNA, M9505800mplete$eq Rattus rattus steroid 5-alpha-reductase 2 16 M95058 Mrna 17 U12187 U12187 Rattus norvegicus ras-related protein (rad) mRNA 18 U39207 U39207 Rattus norvegicus cytochrome P450 4F5 (CYP4F5) mRNA U11681 Rattus norvegicus rapamycin and FKBP12 target-1 protein 19 U11681 (rRAFT1)mRNA U48596 Rattus norvegicus MAP kinase kinase kinase 1 (MEKK1) 20 U48596 mRNA 21 U68544 cyclophilin D mRNA, nuclear gene encoding mitochondrial protein U40004 Rattus norvegicus cytochrome P450 pseudogene 22 U40004 (CYP2J3P2) mRNA 23 083538 083538 Rat mRNA for 230kDa phosphatidylinositol 4-kinase 24 AF012891 AFO12891 Rattus norvegicus frizzled related protein frpAP mRNA $78284 bel-xshort=apoptosis inducer [rats, ovary, mRNA Partial, 25 878284 537 nt] 26 U18982 foe-related antigen 2 (fra-2) mRNA, complete eds 27 U39943 cytochrome P450 monooxygenase (CYP2J3) mRNA, complete eds mRNA for hepatic microsomal UDP—glucuronosyltransferase 28 Y00156 (UDPGT) 29 A09811 A09811eds R.norvegicus mRNA for BRL-3A binding protein 30 ABOO4278 ABOO4278 Rat mRNA for protocadherin 2, partial eds A8009372 Rattus norvegicus mRNA for Lysophospholipase, 31 A8009372 complete eds l9 Table 2.1 (continued) 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 ABO10154 A801 1068 A801 1530 AF00141 7 AF058795 AF0821 26 AF085693 AF093773 AJ000347 D10891 D1 3978 D161 01 D6341 1 D7921 5 J01 435 J04807 L04672 L1 31 51 L27061 M81 784 $48325 $52878 $59893 $68736 $76404 U04934 U37099 U75923 U78090 X06889 X54250 X62660 X65296 X74833 PKN mRNA for serinlthreonine protein kinase expressed in hippoeam pus A8011068 Rattus rattus mRNA for type II iodothyronine deiodinase, com plete eds ABO11530 Rattus norvegicus mRNA for MEGF4, complete eds AF00141? Rattus norvegicus zinc finger protein mRNA, complete eds AF058795 Rattus norvegicus GABA-B receptor gb2 mRNA, complete eds AF082126 Rattus norvegicus aryl hydrocarbon receptor (AHR) mRNA G protein-coupled receptor kinase-associated ADP ribosylation factor cytosolie malate dehydrogenase (Mdh) mRNA, complete eds mRNA for 3 (2 ),5 -bisphosphate nucleotidase mRNA for metabotropic glutamate receptor mGluR5, complete eds 013978 Rattus sp. mRNA for argininosuccinate lyase, complete eds 016101 RATSGLT1 RAT mRNA for sodium/glucose cotransporter 063411 RATMPR Rat mRNA for mitochondrial precursor receptor, complete eds 079215 Rattus norvegicus mRNA for FGF—10, complete eds mitochondrial cytochrome oxidase subunits |,||, Ill genes J04807mRNA RATINSIIA Rat insulin II gene mRNA, 3 end L04672 RATGPCRCTR Rattus rattus G protein-coupled receptor mRNA, com plete eds L1 31 51 eds RATGAPX Rat GTPase-activating protein (GAP) gene, com plete eds L27061 RATPHOSE Rattus norvegicus phosphodiesterase mRNA, 3 end M81784 RATKCAB Rattus norvegicus K+ channel mRNA, sequence $48325 diabetes-inducible cytochrome P450RLM6 [rats, liver, mRNA Partial, 1093 nt] intestinal 15 kda protein=fatty acid-binding protein homolog [rats, ileum, mRNA, 460 nt] $59893 La=autoantigen SS-B/La {3 region, clone K51} [rats, mRNA, 106 nt] myosin heavy chain [rats, CCl4-eirrhotie liver fat-storing cell line, mRNA, 2924 nt] beta -HKA=H,K-ATPase beta-subunit [rats, Genomic, 8983 nt, segment 2 of 2] (CD-1) clone Kc1 Na-Ca exchanger mRNA, partial eds small GTP-binding protein (rab3c) mRNA, partial eds U75923UTR#1 SEG_RNTRNAIS3 Rattus norvegicus isoleucyl tRNA synthetase mRNA potassium channel regulator 1 mRNA, complete eds X06889eds RNRAB3 Rat res-related mRNA rab3 X54250m RNA RRATBPZ Rat mRNA for zinc finger protein AT-BP2, partial eds X62660m RNA RRGTSB R.rattus mRNA for glutathione transferase subunit 8 X65296cds RRESHVEL R.rattus mRNA for carboxylesterase (Es- HVEL) X74833cds RNACRB1 R.norvegicus mRNA for acetylcholine 20 Table 2.1 (continued) 66 67 68 69 7O 71 79 81 82 83 85 87 89 91 92 93 95 97 98 100 101 102 103 104 105 106 107 108 109 Y09453 Y1 0258 21 1932 E02468 E01884 E01 789 M74494 M5921 1 M64033 L22558 M27433 M22253 U61 772 L10072 AF041 246 U44750 J04629 M6061 7 JO5031 J031 70 M60655 U35775 L46791 U39044 X1 2459 X66494 X80477 X86086 X62528 X04979 X91 234 U57362 U69673 D49980 U53859 D8441 8 M1 8467 078359 J04486 A8009636 M15883 AF003008 A80061 37 D901 09 receptor beta-subunit Y09453cds RNY09453 R.norvegicus mRNA for calcium channel gamma subunit Y10258eds RSKET Rattus sp. mRNA for DNA binding protein, KET Z11932cds RRVASOVZM R.rattus mRNA for vasopressin V2 receptor E02468cds DNA sequence of rat TNF E01884cds DNA sequence coding for rat lL-1-beta(interleukin-1 beta) E01789cds cDNA sequence coding for rat C-kinase type-ll (beta-2) M74494 Rat sodium/potassium ATPase alpha-1 subunit truncated isoform mRNA, M59211 Rat potassium channel Kv3.2b mRNA M64033 Rat secretin gene, complete eds adenylyl cyclase-activated serotonin receptor (5-HT7) mRNA M27433 Rattus norvegicus germinal histone H4 gene M22253 Rattus norvegicus sodium channel I mRNA U61772 Rattus norvegicus merlin (NF2) mRNA L10072 Rattus norvegicus 5-hydroxytryptamine receptor (5HT5a) mRNA AF041246 Rattus norvegicus orexin receptor-2 mRNA NAB-dependent 15-hydroxyprostaglandin dehydrogenase mRNA J04629 Rat (Na+, K+)-ATPase-beta-2 subunit mRNA M60617 Rat CCAAT binding transcription factor-B subunit (CBF- A1 1) mRNA J05031 Rat isovaleryl-CoA dehydrogenase (IVD) mRNA J03170 Rat liver specific transcription factor (LF-B1) gene M60655 Rat alpha-1 B adrenergic receptor mRNA U35775 Rattus norvegicus gamma-adducin mRNA L46791 Rattus norvegicus cholesterol esterase mRNA U39044 Rattus norvegicus cytoplasmic dynein intermediate chain 2A mRNA X12459 Rat mRNA for argininosuccinate synthetase (EC 6.3.4.5) X66494 R.norvegicus CHDT1 mRNA X80477 R.norvegicus P2X mRNA X86086 R.norvegicus RNA for annexin Vl X62528 R.rattus mRNA for ribonuclease inhibitor X04979 Rat gene for apolipoprotein E /cds=(23,958) X91234 R.norvegicus mRNA for 17-beta hydroxysteroid dehydrogenase type 2 U57362 Rattus norvegicus collagen XII alpha 1 (Col12a1) mRNA U69673 Rattus norvegicus protein tyrosine phosphatase 20 mRNA D49980 Rat gene for 8-oxo-dGTPase U53859 Rattus norvegicus calpain small subunit (css1) mRNA 084418 Rat mRNA for chromosomal protein HMGZ M18467 Rattus norvegicus mitochondrial aspartate aminotransferase mRNA 078359 Rat drs (a gene down-regulated by v-src) mRNA J04486 Rat insulin growth factor-binding protein mRNA A8009636 Rattus norvegicus mRNA for phosphoinositide 3-kinase M15883 Rat clathrin light chain (LCB2) mRNA AF003008 Rattus norvegicus Mxi1 protein (eri1) mRNA A8006137 Rattus norvegicus FTA mRNA for alpha 1,2- fucosyltransferase 090109 Rat mRNA for long-chain aeyI-CoA synthetase (EC 6.2.1.3) 21 Table 2.1 (continued) 1 10 J02646 111 M14053 J02646 Rat translational initiation factor (elF-2) alpha subunit mRNA M14053 Rat glucocorticoid receptor mRNA Table 2.2: 45 genes involved in the metabolic network Accession No. number Gene Name 1 L37333 glueose-6-phosphatase (G6Pase) mRNA, complete cds 2 X58865 X58865m RNA Rat PFK-L mRNA for liver phosphofructokinase 3 U32314 U32314 Rattus norvegicus pyruvate carboxylase mRNA 4 K03243 Phosphoenolpyruvate carboxykinase (GTP) gene, exons 1-3 U07177 Rattus norvegicus lactate dehydrogenase-C (LDH-C) 5 U071 77 mRNA 6 M54926 M54926 Rat lactate dehydrogenase A mRNA, 3 end U07181 Rattus norvegicus lactate dehydrogenase~B (LDH-B) 7 U071 81 mRNA 8 U75393 succinyl-CoA synthetase alpha subunit mRNA, 9 J04473 J04473 Rat mitochondrial fumarase mRNA, complete eds 10 K03041 ornithine carbamoyltransferase mRNA 11 J02720 J02720 Rat liver arginase mRNA, complete eds 12 227513 gene for carbamoylphosphate synthase 1, exon 38 13 010354 010354 RATAAT Rat mRNA for alanine aminotransferase J03865mRNA RATSDH22 Rat serine dehydratase (SDH2) gene, 3 14 J03865 flank X07467 Rat mRNA for glucose-6-phosphate dehydrogenase (Gd, 15 X07467 EC 1 .1 .1.49) 16 M16235 M16235 Rat hepatic lipase mRNA X78593 R.norvegicus mRNA for glycerol-3-phosphate 1 7 X78593 dehydrogenase L46791 Rattus norvegicus cholesterol esterase mRNA, complete 18 L46791 eds 19 J05446 J05446 Rat glycogen synthase mRNA, complete eds M76767 RATFASA Rattus norvegicus fatty acid synthase mRNA, 20 M76767 complete eds Rat long chain acyl-CoA dehydrogenase (LCAD) mRNA, complete 21 J05029 eds 038448 Rat mRNA for 88k0a-diacylglyeerol kinase (DGK-lll), 22 038448 com plete eds AF017251 Rattus norvegicus phospholipase 0 (PLDs) mRNA, 23 AF017251 complete cds 24 X12752 X12752 Rat gene for DNA binding protein C/EBP CCAAT binding transcription factor-B subunit (CBF-B) mRNA, 25 M34238 complete eds X54423cds RNHNF1 Rat mRNA for hepatic nuclear factor one 26 X54423 (HNF1) X571 33m RNA RNHNF4 Rat mRNA for hepatocyte nuclear factor 4 27 X57133 (HNF4) Y1 4933m RNA RNHNF6B R.norvegicus mRNA for hepatocyte 28 Y14933 nuclear factor 6 beta 29 X55955 X55955 Rat mRNA for hepatocyte nuclear factor 3A (HNF-3A) 22 Table 2.2 (continued) 30 L09647 hepatocyte nuclear factor 3a (HNF-3 beta) mRNA, complete eds 31 K03486 K03486 RATPKC32 Rat protein kinase C type III mRNA, 3 region 32 M15523 protein kinase C-family related mRNA, partial eds, clone RP16 33 M55417 protein kinase C-gamma (PRKC-gamma) gene, exon 1 34 X07286 X07286eds RNPKCAR Rat mRNA for protein kinase C alpha 35 X07287 X07287cds RNPKCG Rat mRNA for protein kinase C gamma X54096 Rat mRNA for lecithin-cholesterol acyltransferase (EC 36 X54096 2.3.1.43) (LCAT) X53588 Rat mRNA for glucokinase, alternatively spliced GK2 (EC 37 X53588 2.7.1.1) ABOO4329 Rattus norvegicus mRNA for acetyl-CoA carboxylase, 38 ABOO4329 complete eds 39 X55286 X55286 R.norvegicus mRNA for HMG-CoA reductase U03763eds RRU03763 Rattus rattus phospholipase mRNA, 40 U03763 complete eds 41 L03294 L03294 Rattus norvegicus Iipoprotein lipase mRNA, complete eds 42 L06040 L06040 Rattus norvegicus 12-lipoxygenase mRNA, complete eds type II pneumoeyte CD36-related class B scavenger receptor (SRB1 R) mRNA JO5460 RATCHOL7H Rat cholesterol 7-alpha-hydroxylase mRNA, 43 AF071495 44 J05460 complete eds S775280ds rNFIL-6=C/EBP-related transcription factor [rats, 45 877528 Genomic/mRNA, 1759 nt] 2.3.3 Results GA/PLS was applied to gene expression and metabolic data obtained from 4 culture conditions, [H, 0], [H,A], [L,0], [L,A], as described in the Methods. The data was separated into 3 groups, a training ([H,A], [L,Oj), monitoring ([H, 0]), and testing ([L,A]) set. We investigated the ability of GA/PLS to select a subset of genes able to predict hepatic frmctions (intracellular TG accumulation and urea synthesis). In addition, the role of the selected genes in regulating the aforementioned functions was reconstructed with the aid of a literature review. Finally, we evaluated the robustness of the GA/PLS model to noise in the gene expression data. Genes Selected for Intracellular TG The frequency of each gene selected by GA/PLS in the multiple runs for intracellular TG is plotted in Figure 2.5A. 59 genes were selected with a frequency 23 greater than 8 and are listed in Table 2.3. The selected genes can be categorized into the following groups: fatty acid and lipid metabolism, transcription factors, signal pathways, electron transport chain and oxidative phosphorylation, and ion transporter. A PLS model with two latent variables was constructed to predict intracellular TG based upon the expression level of the 59 genes. As shown in Figure 2.5B, intracellular T0 was predicted well using the GA/PLS model (mean square error is 0.0425). 14- Frequen cy ll|llll|' will] ‘1 " 1 9 17 25 33 41 49 57 65 73 81 89 97105113122130140150 Genes A FIGURE 2.5: TG gene selection and metabolic function prediction. 2.5A) TG gene frequency. GA/PLS was run 14 times with different initial populations. The fi'equency at which each gene appeared in the 14 solutions was counted and plotted. These genes that appeared with a fi'equency higher than 8 were selected into the final gene subset (shown in Table 2.3) for prediction and pathway reconstruction 24 0.8 . . . 5 GT6 observation 0.6 _* ____________________ ’ _____________________ .' _______________ ATG prediction ‘ i ,A i i 0.4 - ---------------------- 7 --------------------- :g-[H-"l ------------- y ---------------------- g ----------------------- 0: ' i i ”-1 ‘ """""""""""" ;1 mo] """"""""""""""""""" """"""""""" """""""""""" 4 0 ' . i i 2 . s a 3-0.2 ~+- ---------------------- 3' [LO] § -04 - -------------------------------------------------------------------- é --------------------- § ------ IEAI --------- E A no - ------------------------------------------------------------------- § ---------------------- -03 Experlmertd condition 3 FIGURE 2.5: TG gene selection and metabolic function prediction.. 2.5B) PLS prediction of TG. The level of TG accumulation was predicted based upon the expression level of the genes selected by GA/PLS. [H,A], [L,O] was used for training, [H,O] was used for monitoring, and [H,A] was used for testing and validating the PLS model. The mean square error of the PLS prediction model was 0.0425. Table 2.3 A. Genes selected for intracellular triglyceride Functional Accession Category Number Gene name Frequency Fatty acid and lipid metabolism 090109 mRNA for long-chain acyl-CoA synthetase (EC 6.2.1.3) 8 J03960 5-lipoxygenase mRNA, complete cds 8 X04979 gene for apolipoprotein E /cds=(23,958) 11 M16235 Rat hepatic lipase mRNA 8 X78593 mRNA for glycerol-3-phosphate dehydrogenase 13 long chain acyl-CoA dehydrogenase (LCAD) mRNA, J05029 complete cds 9 AF017251 phospholipase D (PLDs) mRNA, complete eds 11 ABOO4329 mRNA for acetyl-CoA carboxylase, complete eds 12 U03763 phospholipase mRNA, complete cds 8 L06040 12-lipoxygenase mRNA, complete cds 8 cholesterol side-chain cleavage enzyme mRNA J05156 (P4508CC), complete cds 8 25 Table 2.3A (continued) J05460 X54096 Transcription factor M88592 M6061 7 X54250 L09647 lon transporter 090048 016101 U78090 M74494 M5921 1 J04629 Cholesterol 7-alpha-hydroxylase mRNA, complete eds X54096 Rat mRNA for lecithin-cholesterol acyltransferase (EC 2.3.1.43) (LCAT) Peroxisome proliferator activated receptor(PPAR) mRNA Rat CCAAT binding transcription factor-B subunit mRNA for zinc finger protein AT-BP2 hepatocyte nuclear factor 3a (HNF-3 beta) mRNA, complete eds Na+,K+—ATPase(EC3.6.1.3) beta2 subunit gene mRNA for sodium/glucose cotransporter Potassium channel regulator 1 mRNA, complete eds sodium/potassium ATPase alpha-1 subunit truncated isofonn mRNA, Potassium channel Kv3.2b mRNA (Na+, K+)—ATPase-beta-2 subunit mRNA Electron transport chain Oxidative phosphoration K00996 K01721 M20131 U39943 AF0001 39 Signal pathways Others U93306 M80784 U1 1681 U48596 083538 D10891 D6341 1 U37099 X06889 Z1 1932 L10072 M60655 M1 5523 J04486 L27061 J02646 010354 J03865 X07467 RATCYP45E Rat cytochrome p-450e (phenobarbital- induced) mRNA, 3 end RATCYP45A Rat P-450(1) variant (phenobarbital-inducible) cytochrome 3 end RATCYP45| Rat cytochrome P450llE1 gene, complete cds cytochrome P450 monooxygenase (CYP2J3) mRNA, complete eds 25-hydroxyvitamin 0 1-hydroxylase (CYP1) mRNA, complete eds VEGF receptor-2/FLK-1 mRNA type III TGF-beta receptor mRNA, complete eds Rapamycin and FKBP12 target-1 protein (rRAFT1) mRNA MAP kinase kinase kinase 1 (MEKK1) mRNA mRNA for 230kDa phosphatidylinositol 4-kinase mRNA for metabotropic glutamate receptor mGluR5, complete eds mRNA for mitochondrial precursor receptor, complete cds small GTP-binding protein (rab3c) mRNA, partial cds Rat res-related mRNA rab3 mRNA for vasopressin V2 receptor 5-hydroxytryptamine receptor (5HT5a) mRNA alpha-1 B adrenergic receptor mRNA protein kinase C-family related mRNA, partial eds, clone RP16 insulin growth factor-binding protein mRNA Phosphodiesterase mRNA 3 end Rat translational initiation factor (elF-2) alpha subunit mRNA mRNA for alanine aminotransferase serine dehydratase (SDH2) gene, 3 flank mRNA for glucose-6-phosphate dehydrogenase (Gd, EC 26 A OD cocoon —l —b o°°o OCDOO 10 11 12 Table 2.3A (continued) 1.1.1.49) AB006137 FT A mRNA for alpha 1.2-fucosyltransferase 9 X12459 mRNA for argininosuccinate synthetase (EC 6.3.4.5) 8 U39044 cytoplasmic dynein intermediate chain 2A mRNA 9 M64033 secretin gene, complete cds 9 M27433 germinal histone H4 gene 9 NAD-dependent 15-hydroxyprostaglandin dehydrogenase U44750 mRNA 8 X62660 mRNA for glutathione transferase subunit 8 8 mRNA for hepatic mierosomal UDP- Y00156 glucuronosyltransferase (UDPGT) 10 AF012891 frizzled related protein frpAP mRNA 10 U18982 fos-related antigen 2 (fra-2) mRNA, complete cds 9 cyclophilin 0 mRNA, nuclear gene encoding mitochondrial U68544 protein 10 Table 2.3 B. Genes selected for urea synthesis Functional Accession Catggory Number Gene name Frequency Urea cycle 013978 m RNA for argininosuccinate lyase 12 X12459 argininosuccinate synthetase (EC 6.3.4.5) 11 K03041 Ornithine carbamoyltransferase mRNA 8 227513 gene for carbamoylphosphate synthase I, exon 38 1O M18467 mitochondrial aspartate aminotransferase mRNA 9 gluconeogensis U32314 Pyruvate carboxylase mRNA 9 phosphoenolpyruvate carboxykinase (GTP) gene, K03243 exons 1-3 9 L37333 Glucose-6-phosphatase (G6Pase) mRNA 1 1 X58865 PFK-L mRNA for liver phosphofructokinase 9 TCA cycle AF093773 Cytosolic malate dehydrogenase mRNA 9 Transcription factor X12752 gene for DNA binding protein C/EBP 12 877528 rNFIL-6=C/EBP-related transcription factor 12 X55955 Hepatocyte nuclear factor 3A (HNF-3A) 10 Electron transport chain Oxidative phosphoration cytochrome P450 monooxygenase (CYP2J3) mRNA, U39943 complete cds 8 RATCYP45A Rat P-450( 1) variant (phenobarbital- K01721 inducible) cytochrome 3 end, flank 10 mitochondrial cytochrome oxidase subunits l,ll, lll J01435 genes, 9 diabetes-inducible cytochrome P450RLM6 [rats, liver, $48325 mRNA Partial, 1093 nt] 8 E03344 cDNA sequence of peroxisome forming factor 10 27 Table 2.38 (continued) K00750 $48325 Signal pathways Others U1 21 87 L271 29 083538 AF085693 L04672 L1 31 51 L22558 AFO41 246 M60655 A8009636 M95058 AFOOO1 39 AF01 2891 878284 Y001 56 ABOO4278 N000347 06341 1 J04807 L27061 X62660 X65296 U61 772 U35775 X66494 X86086 X62528 U57362 D49980 U53859 084418 M1 5883 J02646 L46791 X55286 AFO71495 Cytochrome c nuclear-encoded mitochondrial gene diabetes-inducible cytochrome P450RLM6 ras-related protein (rad) mRNA Stress activated protein kinase mRNA mRNA for 230kDa phosphatidylinositol 4-kinase GTPase-activating protein (GIT1) mRNA G protein-coupled receptor mRNA, complete cds GTPase—activating protein (GAP) gene, complete cds adenylyl cyclase-activated serotonin receptor (5-HT7) mRNA orexin receptor-2 mRNA alpha-1 B adrenergic receptor mRNA mRNA for phosphoinositide 3-kinase steroid 5-alpha-reductase 2 mRNA 25-hydroxyvitamin D 1-hydroxylase (CYP1) mRNA, complete cds frizzled related protein frpAP mRNA bcl-xshort=apoptosis inducer [rats, ovary, mRNA Partial, 537 nt] mRNA for hepatic microsomal UDP- glucuronosyltransferase (UDPGT) mRNA for protocadherin 2, partial cds mRNA for 3 (2 ),5 -bisphosphate nucleotidase mRNA for mitochondrial precursor receptor, complete cds insulin ll gene mRNA, 3 end Phosphodiesterase mRNA, 3 end mRNA for glutathione transferase subunit 8 mRNA for carboxylesterase (Es-HVEL) U61772 Rattus norvegicus merlin (NF2) mRNA U35775 Rattus norvegicus gamma-adducin mRNA X66494 R.norvegicus CHOT1 mRNA X86086 R.norvegicus RNA for annexin VI X62528 R.rattus mRNA for ribonuclease inhibitor U57362 Rattus norvegicus collagen Xll alpha 1 (Col12a1) mRNA D49980 Rat gene for 8-oxo-dGTPase U53859 Rattus norvegicus calpain small subunit (css1) mRNA 084418 Rat mRNA for chromosomal protein HMGZ M15883 Rat clathrin light chain (LCBZ) mRNA J02646 Rat translational initiation factor (elF-2) alpha subunit mRNA L46791 Rattus norvegicus cholesterol esterase mRNA, complete cds X55286 R.norvegicus mRNA for HMG-CoA reductase type II pneumocyte CD36-related class B scavenger receptor (SRBi R) mRNA A oooco CDCDQCOCD cocoa“: 10 28 To assess the relevance of the genes that were selected by the GA/PLS framework to predict the specified metabolic function(s), we evaluated the role of these genes in connection with these functions by way of a literature review. Many of the genes selected by GA/PLS were closely related to enzymes that are involved in reactions that affect TG levels or its precursors. Gene M16235 encodes hepatic lipase enzyme which reduces TG to glycerol and fatty acids. Gene D90109 and J 05029 encode long-chain aeyl-CoA synthetase and dehydrogenase enzymes, respectively. These two enzymes catalyze the initial reactions of fatty acid B-oxidization and reduce the level of long chain aeyl-CoA which is an important precursor in TG synthesis. Similarly, genes involved in mierosomal oxidation and fatty acid synthesis enzymes were also identified. The P450 family of genes (such as M20131), encode mierosomal oxidase enzyme, which is involved in the oxidation of fatty acid to a-hydroxy fatty acids. Conversely, gene ABOO4329 encodes aeetyl-CoA carboxylase enzyme, which catalyzes the committed step in fatty acid synthesis to produce malonyl-CoA from acetyl- CoA. The level of fatty acid affects the level of fatty acyl-CoA, which in turn affects TG accumulation. Phosphatidie acid is another important precursor of TG. Gene AF 017251 and U03763 encode the phospholipase D and phospholipase A2 enzymes, respectively, which reduce phosphatidyleholine to phosphatidic acid. Another gene X78593 encodes the glycerol-3-P dehydrogenase enzyme which is involved in the first committed pathway to produce phophatidic acid from dihydroxyacetone-P. In addition, GA/PLS also identified genes involved in the signal pathways that are relevant to TG accumulation, notably, the phosphodiesterase (L27061) and protein kinase 29 C (M15523) genes. Insulin activates phosphodiesterase to reduce the CAMP level and thus counteracts glucagon's ability to activate TG lipase to hydrolyze TG. Protein kinase C (PKC) is involved in Ca2+ related signal transduction. A high level of intracellular Ca2+ activates PKC, which attenuates insulin's action and activates lipolysis [Idn's 2001]. Interestingly, three genes (D90048, M74494, 104629) encoding Na+, K+-ATPase were also identified. The activities of Na+, K+-ATPase are significantly decreased in plasma membranes of dietary obese rats [Izpisua 1989]. The reason for this down-regulation is currently not known. Although, Na+, K+-ATPase inhibition has been found to be involved in increasing Ca2+ level [Xie 2002]. The model also identified genes encoding transcription factors that regulate lipid metabolism. Gene M88592 encodes the transcription factor PPAR-a, which has been found to regulate the expression of many enzymes involved in fatty acid oxidation and synthesis, e. g. fatty acid synthetase, fatty acid oxidase, etc. [Smith 2002]. It acts as a lipid sensor to modulate cellular capacity for fatty acid oxidation and synthesis. Gene M60617 encodes the CCAAT binding transcription factor. C/EBP family of transcription factors are involved in acute phase and inflammatory response of the liver as well as lipid metabolism. This transcription factor regulates fatty acid synthase (FAS) expression [Roder 1999] and activates genes of cholesterol and fatty acid metabolism with sterol regulatory element-binding protein (SREBP) as a co-regulator [Magana 2000]. Indeed, many of the genes selected by the GA/PLS model are closely related to TG metabolism. Based upon the current knowledge from the literature, we reconstructed the role of these genes and gene products in the regulating TG metabolism. For illustration purposes, only the genes enumerated above and their role in regulating TG 3O accumulation are shown in Figure 2.6. This GA/PLS framework provides a mechanism to identify potential areas (pathways) for further study in optimizing, in the current study, minimizing cellular TG accumulation. Phosphatidie acid Choline AFO|725I ; x; PLDl p’ i K PPAR cm: NA K Diacyl-glycerol phosphatidyleholine . jw \ ATpase \ ’1 \ ,/ I Transcriptional ' reanlation N‘K ”" Intra TG Insulin : ATPase I g...) 1‘chch A l 31c lipase 1' fl diesterase >- ABOO4329 . __ glycm' Acyl-CoA Lc-ACS , Fatty acid F AS H Malonyl-CoA I l ’\ Lc-acyl-coa < Microsomal F- , G3pdh A D90109 oxidase dehydrog . .. __ X73533/ r T 631’ Trans enoyl CoA I a-hydroxy fatty acid O l: O Enzyme or transcription factor Metabolites Genes FIGURE 2.6: Pathway reconstruction for TG. The pathways were reconstructed by identifying the biological function of the genes selected by GA/PLS based upon current literature knowledge. The elements of the pathways include gene, enzyme or gene products, and metabolites. The genes include lipid and fatty acid metabolism genes, PPAR-a gene, P450 genes, and Nath-ATPase genes. These genes are known in literature to affect TG metabolism. Dotted line indicates an indirect relationship. Genes Selected for Urea Synthesis GA/PLS also was applied to select genes relevant to urea synthesis. It selected 57 genes which are listed in Table 2.3B. Based on the expression level of these 57 genes, the 31 GA/PLS model predicted urea synthesis very well (with a mean square error of 0.0125) as illustrated in Figure 2.7. The role of these genes in regulating urea synthesis was investigated by reviewing the published literature. 1 E E a E E i ' urea wwfim : : 5 : 5 A urea predichon 0.8 - ----------------------- ----------- :~ ---------- 3- ----------- 0.6 .1 ....................... ...................... é .......... 1 i i 5 M1 0.4 -* ----------- “‘4'“ E 5 01 ~ ------------------------ g ----------------------- ¢ ---------- . ----------- ----------------------------------- g 0 . . . . r g s : . z -0.2 _,t, ............................................... :. ................................... A .114 -1 ...................................................................... f. ................................... I 9 .. ........................................... [15,0] ............................... -06 ["10] § -0.8 ' ' Experimertal condtion FIGURE 2.7: PLS prediction of urea synthesis. The level of urea synthesis was predicted based upon the expression level of the genes selected by GA/PLS. [H,A], [L,O] was used for training, [H,O] was used for monitoring data, and [H,A] was used for testing and validating the PLS model. The mean square error of the PLS prediction model was 0.0125. Urea cycle is essential for proper disposal of ammonia in mammals. The cycle is mainly regulated by five enzymes: carbamoyl phosphate synthetase-I (CPS-I; EC 6.3.4.16), ornithine transcarbamylase (OTC; EC 2.1.3.3), argininosuccinate synthetase (AS; EC 6.3.4.5), argininosuccinate lyase (AL; EC 4.3.2.1), and arginase (EC 3.5.3.1). It is encouraging that the genes encoding four of the enzymes CPS-I (227513), OTC (K03041), AS (X12459) and AL (D13978) were selected by GA/PLS. 32 An upstream enhancer required for liver-specific expression and hormonal induction of the rat CPS-I gene has been identified. This enhancer region was revealed to contain DNA elements that bind the transcription factors such as hepatocyte nuclear factor 3 (HNF3) and C/EBP. Detailed transcriptional regulation of urea synthesis has been reviewed in reference (24-26). It is interesting that GA/PLS identified two genes (X12752 and 877528) that encode transcription factors C/EBP and C/EBP-related transcription factors, respectively, and one gene X55955 that encodes the transcription factor HNF30L. In addition, a C/EBP binding element has been found in the OTC and arginase genes [Meijer 2000, Morris 1992, 2002]. GA/PLS also uncovered the coupling between urea cycle, TCA cycle and glueoneogenesis. GA/PLS identified a group of genes involved in gluconeogenesis and TCA cycle that are relevant to urea synthesis. The genes include pyruvate carboxylase (U32314), phospheoenolpyruvate carboxykinase (K03243), g1ucose-6-phosphatase (G6Pase) (L37333), aspartate aminotransferase (M18467), cytosolie malate dehydrogenase (AF093773) and phosphofructokinase (X58865) genes. Parallel increases in gluconeogenesis and urea synthesis have been found in rat liver during prolonged starvation or by hormonal stimulation [Martin-requero 1992]. The anaplerotic effect of increasing the flux through pyruvate carboxylase may enhance the production of aspartate, which drives urea synthesis by producing more arginosuccinate. Using the genes highlighted above, we reconstructed the pathways involved in regulating urea synthesis and illustrated in Figure 2.8. Urea synthesis is regulated by genes encoding the enzymes of urea cycle, gluconeogenesis, TCA cycle and transcription factors, such as C/EBP and HNF families. 33 002+NH4+ATP < M18467 > f [Phosphoenobymvate ] ,_( PC 1 CPS-I ~69 LAspartate v i PEPCK t Carbamoyl < AST > ' ' Citruline Oxalaeetate / , As OTC _ K03041 x12459 I UREA argninosuccinate TCA Transcriptional ‘rgin'se CYCLE CYCLE regulation j_\ \ \ @ A. CEBP 5 HNF , , A ‘\ argmmase I \ ‘- \ \ fumarate O Enzyme or transcription factor FIGURE 2.8: Pathway reconstruction for urea synthesis. The pathways were reconstructed by identifying the biological function of the genes selected by GA/PLS. The elements of the pathways include gene, enzyme or gene products, and metabolites. The genes include urea cycle genes, C/EBP and HNF genes, glueoneogenie and TCA cycle genes. These genes are known in the literature to affect urea synthesis. Dotted line indicates an indirect relationship. 2.3.4 Model Robustness Evaluation Gene expression data is typically noisy, therefore, we evaluated the effect of noise C] Metabolites 0 Genes 1 addition to the gene expression data on our model results. To test the robustness of the model, 10% noise was artificially added to the gene expression data. Noise in the form of a Gaussian distribution was generated according to the following equation, Noise=N*o;,*R 34 where N is a vector of random numbers of length n following a normal distribution, the o; is the standard deviation of the data, and the R is any number between [0,1]. The robustness of the GA/PLS framework was evaluated by comparing the genes selected before and after noise addition. The genes selected after noise addition are enumerated in Table 2.4A and Table 2.43 for intracellular TG and urea production, respectively. 31 of the 59 genes remain selected for TG and 34 of the 57 genes remain selected for urea synthesis (highlighted in bold, Table 2.4). Of the 59 genes selected for TG before noise addition, 14 were from the supervised group (the 45 genes added afier t—test). It is notable that 11 of the 14 genes selected from the supervised group remained the same after noise addition. Similarly, 12 of the 13 genes selected from the supervised group remained for urea synthesis. In the case of TG, most of the fatty acid and lipid metabolism genes such as long-chain fatty aeyl-CoA synthetase and dehydrogenase, phospholipase, hepatic lipase, acetyl-CoA carboxylyase genes remain selected. In addition, the CCAAT binding transcription factor, Na+,K+ ATPase and P450 family genes were also selected after noise addition. In the case of urea synthesis, the urea cycle enzymes, e.g. AS, OTC, CPS-I, aspartate aminotransferase genes; and transcription factors genes, C/EBP and HNF; and glueoneogenesis enzyme genes, such as pyruvate carboxylase, G6Pase, phosphofructokinase and phosphoenolpyruvate carboxykinase genes remain selected after noise addition. This suggests that GA/PLS framework is robust to a certain level of noise in the gene expression data. Nevertheless, several genes such as the PPAR gene for TG and AL gene for urea synthesis disappeared after noise addition. 35 Table 2.4 A. Genes selected for intracellular triglyceride after noise addition Functional Accession Category Number Gene name Frequency Fatty acid and lipid metabolism cholesterol side-chain cleavage enzyme mRNA J05156 (P45OSCC) 9 mRNA for long-chain acyl-CoA synthetase (EC D90109 6.2.1.3) 8 M16235_at M16235 Rat hepatic lipase mRNA 11 X78593_at mRNA for glycerol-3-phosphate dehydrogenase 13 RATACOADA Rat long chain acyl-CoA J05029_s_at dehydrogenase (LCAD) mRNA 9 mRNA for 88kDa-diaeylglyeerol kinase (DGK- DS8448_at Ill), complete eds 10 AF017251_at phospholipase D (PLDs) mRNA, complete eds 11 mRNA for lecithin-cholesterol acyltransferase X54096_at (EC 2.3.1.43) (LCAT) 9 mRNA for aeetyl-CoA carboxylase, complete ABOO4329_at eds 1 2 L06040_s_at 12-lipoxygenase mRNA, complete cds 9 mRNA for glucose-6-phosphate dehydrogenase X07467_at (Gd, EC 1.1.1.49) 11 Transcription factor RRATBP2 Rat mRNA for zinc finger protein AT- X54250 BP2, partial eds 9 CCAAT binding transcription factor-B subunit M60617 (CBF-A1 1) mRNA 8 Ion transporter Na+, K+ -ATPase (EC 3.6.1.3) beta2 subunit 090048 gene and 5 flank 11 (CD-1) clone Ke1 Na-Ca exchanger mRNA, U04934 partial eds 8 Potassium channel regulator 1 mRNA, complete U78090 eds 10 Electron transport chain Oxidative phosphoration E03344 cDNA sequence of peroxisome forming factor 8 cytochrome c nuclear-encoded mitochondrial K00750 gene and flanks 9 P-450(1) variant (phenobarbital-indueible) K01721 cytochrome 3 end, flank 8 U39207 Cytochrome P450 4F5 (CYP4F5) mRNA 8 Signal pathways U93306 VEGF reeeptor-2/FLK-1 mRNA 9 M80784 type III TGF-beta receptor mRNA, complete eds 9 U48596 MAP kinase kinase kinase 1 (MEKK1) mRNA 8 083538 mRNA for 230kDa phosphatidylinositol 4-kinase 8 A09811 mRNA for BRL-3A binding protein 8 ADP ribosylation factor GTPase-activating AF085693 protein (GlT1) mRNA 10 36 Table 2.4A (continued) Others L04672 L22558 M60655 J04486 M1 4053 M1 5523_s_at AF0001 39 M95058 $78284 U1 8982 YOO‘I 56 ABO10154 ABO11530 D1 3978 D7921 5 X62660 X65296 M27433 U44750 J05031 U35775 X66494 X86086 X91234 U53859 J02646 G protein-coupled receptor mRNA, complete eds adenylyl eyelase-activated serotonin receptor (5-HT7) mRNA Rat alpha-1 B adrenergic receptor mRNA insulin growth factor-binding protein mRNA Rat glucocorticoid receptor mRNA Rat protein kinase C-family related mRNA, partial eds, clone RP16 25-hydroxyvitamin D 1-hydroxylase (CYP1) mRNA, complete eds steroid 5-alpha-reduetase 2 mRNA bel-xshort=apoptosis inducer [rats, ovary, mRNA Partial, 537 nt] fos-related antigen 2 (fra-2) mRNA, complete eds mRNA for hepatic mierosomal UDP- glucuronosyltransferase (UDPGT) PKN mRNA for serinlthreonine protein kinase expressed in hippocampus mRNA for MEGF4, complete eds mRNA for argininosuccinate lyase, complete eds mRNA for FGF—10, complete eds mRNA for glutathione transferase subunit 8 mRNA for carboxylesterase (Es-HVEL) germinal histone H4 gene MAD-dependent 15-hydroxyprostaglandin dehydrogenase mRNA Rat isovaleryl-CoA dehydrogenase (IVD) mRNA gamma-adducin mRNA CHOT1 mRNA RNA for annexin VI mRNA for 17-beta hydroxysteroid dehydrogenase type 2 calpain small subunit (css1) mRNA Rat translational initiation factor (elF-2) alpha subunit mRNA oooocooo (D (O A cococooooo coo A oeuro co once on 37 Table 2.4 B. Genes selected for urea synthesis after noise addition Functional Accession Category Number Gene name Frequency Urea cycle X12459 mRNA for argininosuccinate synthetase (EC 6.3.4.5) 8 M18467 mitochondrial aspartate aminotransferase mRNA 9 mRNA RATOTCB Rat (Sprague-Dawley) ornithine K03041 carbamoyltransferase mRNA 8 227513 gene for carbamoylphosphate synthase l, exon 38 11 gluconeogensis glucose-6—phosphatase (G6Pase) mRNA, complete L37333 eds 9 X58865 mRNA for liver phosphofructokinase 9 U32314 pyruvate carboxylase mRNA 12 Rat phosphoenolpyruvate carboxykinase (GTP) K03243 gene, exons 1-3 1 2 mRNA for glucokinase, alternatively spliced GK2 (EC X53588 2.7.1.1) 11 J05446 Rat glycogen synthase mRNA, complete cds 9 TCA cycle cytosolie malate dehydrogenase (Mdh) mRNA, AF093773 complete eds 9 Transcription factor peroxisome proliferator activated receptor (PPAR) M88592 mRNA, 8 J03170 liver specific transcription factor (LF-Bt) gene 9 X12752 gene for DNA binding protein C/EBP 11 X55955 mRNA for hepatocyte nuclear factor 3A (HNF-3A) 12 rNFIL-6=C/EBP-related transcription factor [rats, $77528 Genomic/mRNA, 1759 nt] 9 Electron transport chain Oxidative phosphoration RATCYP45E Rat cytochrome p~450e (phenobarbital- K00996 induced) mRNA, 3 end 10 RATCYP45A Rat P-450(1) variant (phenobarbital- K01721 inducible) cytochrome 3 end, flank 8 E03344cds cDNA sequence of peroxisome forming E03344 factor 9 cytochrome P450 monooxygenase (CYP2J3) mRNA, U39943 complete eds 10 Signal pathways U12187 ras-related protein (rad) mRNA 9 U48596 MAP kinase kinase kinase 1 (MEKK1) mRNA 10 083538 mRNA for 230kDa phosphatidylinositol 4-kinase 9 AF082126 aryl hydrocarbon receptor (AHR) mRNA 8 ADP ribosylation factor GTPase-activating protein (GlT1) AF085693 mRNA 12 mRNA for metabotropic glutamate receptor mGluR5, 010891 complete eds 12 38 Table 2.48 (continued) Others 06341 1 J04807 U37099 X06889 Z1 1932 L22558 L1 0072 J04486 A3009636 M14053 M55417 090048 M95058 U1 1681 878284 U1 8982 Y00156 ABOO4278 ABO10154 $52878 $76404 U04934 X62660 E02468 E01 884 M27433 X86086 X62528 U53859 ABOOG1 37 0901 09 J02646 L467 91 X55286 AF071495 Rat mRNA for mitochondrial precursor receptor, complete eds Rat insulin II gene mRNA, 3 end small GTP-binding protein (rab3c) mRNA, partial eds ras-related mRNA rab3 mRNA for vasopressin V2 receptor adenylyl cyclase-activated serotonin receptor (5-HT7) mRNA 5-hydroxytryptamine receptor (5HT5a) mRNA insulin growth factor-binding protein mRNA mRNA for phosphoinositide 3-kinase Rat glucocorticoid receptor mRNA Rat protein kinase C-gamma (PRKC-gamma) gene, exon 1 Rat Na+, K+ -ATPase (EC 3.6.1.3) beta2 subunit gene and 5 flank M9505800mpleteSeq Rattus rattus steroid 5-alpha- reductase 2 mRNA U11681 Rattus norvegicus rapamycin and FKBP12 target-1 protein (rRAFT1) mRNA apoptosis inducer [rats, ovary, mRNA Partial, 537 nt] fos-related antigen 2 (fra-2) mRNA, complete eds mRNA for hepatic mierosomal UDP- glucuronosyltransferase (UDPGT) mRNA for protocadherin 2, partial eds mRNA for serinlthreonine protein kinase expressed in hippocampus, partial eds fatty acid-binding protein homolog [rats, ileum, mRNA, 460 nt] 37640182 beta -HKA=H,K-ATPase beta-subunit [rats, Genomic, 8983 nt, segment 2 of 2] clone Kc1 Na-Ca exchanger mRNA, partial eds X62660m RNA RRGTSB R.rattus mRNA for glutathione transferase subunit 8 E02468cds DNA sequence of rat TNF E01884eds DNA sequence coding for rat IL-1- beta(interleukin-1 beta) M27433 Rattus norvegicus germinal histone H4 gene X86086 R.norvegicus RNA for annexin VI mRNA for ribonuclease inhibitor calpain small subunit (csst) mRNA mRNA for alpha 1,2-fucosyltransferase Rat mRNA for long-chain acyI-CoA synthetase (EC 6.2.1.3) J02646 Rat translational initiation factor (elF-2) alpha subunit mRNA L46791 Rattus norvegicus cholesterol esterase mRNA, complete eds mRNA for HMG-CoA reductase type II pneumocyte C036-related class B scavenger receptor (SRBt R) mRNA 39 10 11 cocooocoooco coco coco on (D co Table 2.48 (continued) cholesterol side-chain cleavage enzyme mRNA J05156 (P4SOSCC), complete cds 9 2.4 Discussion We applied a GA/PLS framework to gene expression and metabolic data to identify a subset of genes whose expression levels could predict metabolic functions. Subsequently, part of the regulatory pathways was reconstructed based upon the functional information of the genes. The selection of relevant genes was obtained using GA guided by a fitness function (equation 2.7) that was defined by the prediction error and the number of latent variables. No a priori information was considered in equation 2.7. The gene selection process was based solely upon the gene expression and metabolic data. This provides a framework to uncover information from the data itself. However, the information extracted by GA/PLS is redundant for both intracellular TG and urea synthesis, i.e., many of the selected genes are not relevant to these cellular functions. It was observed that the genes selected from the supervised group (45 genes added after t-test) were more relevant to these cellular functions than from the unsupervised group (1 11 genes selected by t-tcst). To reduce the redundant information, an approach that weighs the genes based on a priori knowledge may be better suited. If one knows that some genes are related and others are not related to a metabolic function, this information can be incorporated into the fitness function by adding a factor that awards the inclusion of relevant genes and punishes the inclusion of irrelevant genes. Each gene is assigned a score based upon its relevance to the metabolic firnction based upon the literature. An overall gene relevance score (Gs in equation 2.10) can be obtained 40 by summing up the score of each individual gene and integrating it into a fitness function as follows: 1 26,- -y,-)2 +LVW1 fitness = + Gs w2 (2.10) To illustrate the ability of this fitness fimetion (equation 2.10) to incorporate domain knowledge, we set up a hypothetical example. In this hypothetical example, we included only the genes selected by t-test and assumed we know a priori that genes (D90109, J04087, M88592, U37099) are relevant to intracellular TG accumulation and that the other genes are not relevant. Therefore, a score of one was assigned to genes (090109, 104087, M88592, U37099), and a score of zero was assigned to all the other genes listed in Table 2.1 for intracellular TG. The parameter w] and w2 were assigned values of 0.4 and 0.1, see methods for determination of the w] and w2, so that the prediction error, number of latent variables and gene score are normalized to the same scale. Using this new fitness function in the GA/PLS model with the same gene expression and metabolic data, we evaluated the robustness of the GA/PLS model to the addition of 10% noise in the gene data. The genes selected before and after noise addition are listed in Tables 2.5A and 2.53, respectively. From the results in Table 2.5, we note that the genes assigned a relevance score of one were selected before and after noise addition and at a high frequency (i.e., selected each time the model was run, regardless of the initial population), which suggests that incorporating a priori knowledge of the relevance of the genes into the fitness function (equation 2.10) ensures that their selection is not affected by the noise in the data. In addition to the genes that were defined a priori to be relevant, other genes such as P450 genes, Na+, K+-ATPase genes, which are not defined a priori as relevant, 41 were also selected by the GA/PLS model with this new fitness function, both before and after noise addition. Therefore, by defining an appropriate fitness function, the GA/PLS framework was able to incorporate available domain knowledge into the model as well as able to recognize genes relevant to a metabolic fimction without a prior knowledge. It is possible that the relevance of a number of genes to a cellular function may currently not be known. Therefore an advantage of not including a priori knowledge allows GA/PLS model to be naive to any biases in the user and in turn may allow the uncovering of genes that may be important and to be investigated further. Table 2.5A. Genes selected for intracellular TG using fitness function in equation 2.10 Function Accession Category Number Gene name Freq Score Lipid metabolism mRNA for long-chain aeyl-CoA synthetase (EC 090109 6.2.1.3) 14 1 L27061 Phosphodiesterase mRNA 11 0 Transcription factor peroxisome proliferator activated receptor (PPAR) M88592 mRNA 14 1 J0317O liver specific transcription factor (LF-Bt) gene 10 0 CCAAT binding transcription factor-B subunit (CBF- M60617 A1 1) mRNA 9 0 X54250m RNA RRATBPZ Rat mRNA for zinc finger X54250 protein AT-BP2 12 0 ion transporter J04629 (Na+, K+)—ATPase-beta-2 subunit mRNA 9 0 sodium/potassium ATPase alpha-1 subunit M74494 truncated isoform mRNA 12 0 Electron transport chain Oxidative phosphoration K00996 cytochrome p-450e (phenobarbital-induced) mRNA 8 0 cytochrome c nuclear-encoded mitochondrial gene K00750 and flanks 9 0 mitochondrial cytochrome oxidase subunits |,ll, lll J01435 genes, 11 0 U39207 cytochrome P450 4F5 (CYP4F5) mRNA 8 0 E03344 cDNA sequence of peroxisome forming factor 8 0 Signal pathways U93306 VEGF receptor-2/FLK-1 mRNA 14 0 M80784 type III TGF-beta receptor mRNA 13 0 42 Table 2.5A (continued; Others U68544 083538 J04807 U37099 AF085693 X74833 L22558 M60655 X04979 Y001 56 J03960 859893 U75923 M64033 M27433 J05031 X66494 X80477 X86086 X91234 U53859 J02646 clophilin 0 mRNA, nuclear gene encoding mitochondrial protein mRNA for 230kDa phosphatidylinositol 4-kinase RATINSIIA Rat insulin ll gene mRNA small GTP—binding protein (rab3c) mRNA G protein-coupled receptor kinase-associated ADP ribosylation factor GTPase-activating protein (GlT1) mRNA X mRNA for acetylcholine receptor beta-subunit adenylyl cyclase-activated serotonin receptor (5- HT7) mRNA alpha-1 B adrenergic receptor mRNA X04979 Rat gene for apolipoprotein E mRNA for hepatic mierosomal UDP- glucuronosyltransferase (UDPGT) 5-Iipoxygenase mRNA La=autoantigen SS-B/La isoleucyl tRNA synthetase mRNA Rat secretin gene Gerrninal histone H4 gene lsovaleryl-CoA dehydrogenase (IVD) mRNA X66494 R.norvegicus CHOT1 mRNA X80477 R.norvegicus P2X mRNA X86086 R.norvegicus RNA for annexin Vl X91234 R.norvegicus mRNA for 17-beta hydroxysteroid dehydrogenase type 2 calpain small subunit (css1) mRNA translational initiation factor (elF-2) alpha subunit mRNA 12 O 8 0 14 1 14 1 8 0 12 0 9 O 13 0 9 0 8 O 11 0 9 O 9 0 13 0 1O 0 9 0 9 0 9 0 12 0 12 0 8 0 9 0 Table 2.58. Gene selected for intracellular T6 with fitness function in equation 2.10 after noise addition. Function Accession Category Number Gene name Freq Score Lipid metabolism mRNA for long-chain acyl-CoA synthetase (EC 090109 6.2.1.3) 14 1 L27061 phosphodiesterase mRNA 11 0 Transcription factor peroxisome proliferator activated receptor (PPAR) M88592 mRNA 14 1 CCAAT binding transcription factor-B subunit (CBF- M60617 A11) mRNA 13 0 X54250m RNA RRATBP2 Rat mRNA for zinc finger X54250 protein AT-BP2 10 0 Ion transporter M22253 M22253 Rattus norvegicus sodium channel I mRNA 8 0 sodium/potassium ATPase alpha-1 subunit M74494 truncated isoform mRNA 10 0 43 Table 2.58 (continued) Electron transport chain Oxidative phosphoration M20131 Signal pathways Others U93306 M80784 U68544 J04807 U37099 X74833 L22558 M60655 X04979 U53859 Y001 56 J03960 A801 1530 010891 848325 E01 884 U75923 M64033 M27433 U61 772 U44750 J05031 X66494 X80477 X86086 X91 234 J02646 cytochrome P450llE1 gene VEGF receptor-2/FLK-1 mRNA type III TGF-beta receptor mRNA cyclophilin 0 mRNA, nuclear gene encoding mitochondrial protein RATINSIIA Rat insulin II gene mRNA small GTP—binding protein (rab3c) mRNA mRNA for aeetyleholine receptor beta-subunit adenylyl cyclase-aetivated serotonin receptor (5- HT7) mRNA alpha-1B adrenergic receptor mRNA X04979 Rat gene for apolipoprotein E calpain small subunit (css1) mRNA mRNA for hepatic mierosomal UDP- glucuronosyltransferase (UDPGT) 5-Iipoxygenase mRNA mRNA for MEGF4 mRNA for metabotropic glutamate receptor mGluR5 848325 diabetes-inducible cytochrome P450RLM6 DNA sequence coding for rat lL-1-beta(interleukin-1 beta) isoleucyl tRNA synthetase mRNA Rat secretin gene Gerrninal histone H4 gene merlin (NF2) mRNA NAD—dependent 15-hydroxyprostaglandin dehydrogenase mRNA lsovaleryl-CoA dehydrogenase (lVD) mRNA X66494 R.norvegicus CHOT1 mRNA X80477 R.norvegicus P2X mRNA X86086 R.norvegicus RNA for annexin Vl mRNA for 17-beta hydroxysteroid dehydrogenase type 2 translational initiation factor (elF-2) alpha subunit mRNA 13 14 14 14 12 12 10 10 10 06000 OOOOO 0000 0—3-30 00 ©0000 O In this chapter, GA/PLS was used to find a set of possible solutions rather than a single solution. With this method, multiple solutions of different genes give similar prediction accuracy. By selecting genes based upon their frequency of appearance in the multiple runs, we explored the sub-gene-space for high prediction accuracy. The probability of significant features (important genes) appearing in the sub-gene-space was 44 estimated based upon their frequency. The probabilistic nature of this method improved the robustness of the GA/PLS approach. Increasing the number of runs enhances the effectiveness of the probabilistic nature of gene selection. Although, GA can not guarantee global optimization, it is possible to improve the performance of GA by combining global optimization methods, such as simulated annealing. GA/PLS can be applied to group genes according to the metabolic function they are involved in. As opposed to gene clustering, GA/PLS used both metabolic profile and gene expression data to functionally group genes according to a metabolic function. In addition, one gene can be assigned to more than one group, which is not permitted with clustering methods. Typically, a gene may be involved in multiple pathways and may regulate different cellular functions. For example, transcription factors could affect more than one gene, which in turn may be involved in a number of different pathways. The grouping of genes according to metabolic functions facilitates the reduction of a complex biological system with thousands of genes to a system of functional subgroups. This could help facilitate the optimization of cellular function. A subsequent step after identifying the gene subsets is reconstructing the regulatory pathways based upon the gene expression and metabolic profiles. Data driven reverse engineering methods such as Bayesian network analysis, which has been used successfirlly in reconstructing metabolic sub-networks from flux data, [Li 2004] could be used to infer regulatory networks at both the genetic and metabolic levels as well as the interaction between the two levels. The resulting model could be used to suggest hypothetical pathways involved in regulating metabolic functions, and thus likely pathways for further experimental studies. 45 The current model does not incorporate translational and post-translational effects in the gene selection process and thus may be a possible reason why some of the important genes disappeared after noise addition. Incorporating other data types, such as protein expression profile, protein-protein and protein-DNA interactions would enhance the ability of modeling frameworks to predict cellular and physiological function more accurately. In conclusion, GA/PLS was applied for the first time to quantitatively integrate experimental gene expression and metabolic flux profiles for a cellular system. The results indicate that this method is able to select a subset of genes capable of predicting a metabolic function. In addition, using the selected genes, we were able to reconstruct the pathways involved in regulating a metabolic function. 46 CHAPTER 3 A BAYESIAN FRAMEWORK TO INFER BIOLOGICAL PATHWAYS AND NETWORK 3.1 Introduction The introduction of array technology has made readily attainable thousands of genes and proteins for a given metabolic or physiological state, providing a wealth of data from which regulatory and functional relationships may be unraveled. A major challenge limiting the application of these data is the ability to integrate this information, from the gene through the metabolic level, in such a way that it reconstructs the biological process. Moreover, environmental factors, in addition to the genetic wiring, contribute to the regulation and control of the connections within the biological system. Thus, new approaches and mathematical tools for analyzing this information are needed. The motivation behind reverse engineering of pathways and networks from experimental data is the belief that the data contain the structure of the biological system or phenotype. Therefore, the goal is to provide a framework to extract the network structure and connections from the experimental data, thus capturing the effects and changes due to external stimuli and its interactions with the genetic wiring. In chapter 2, we introduced GA/PLS to identify important genes relevant to a cellular function. We then used information from the literature to manually reconstruct the pathways for TG and urea functions, which depended upon the known information currently available. In order to discover pathways from the data itself, we introduce a Bayesian framework in this chapter. 47 As a first level attempt at reconstruction, numerous studies (Friedman et al., 2000; Pe'er et al., 2001) endeavor to infer gene regulatory pathways and networks from micro- array and simulated data from prokaryotes and yeasts. Typically networks are not known a priori, thus the validity and accuracy of these reconstructions cannot be accessed. Instead, the reconstructed networks are used to suggest biological hypotheses. Alternatively, our goal is to develop a mathematical framework that infers biochemical pathways and networks from metabolic data from mammalian cells (systems). The rationale for reconstructing metabolic networks first, prior to applying this framework to gene expression data, is to confirm the ability of the proposed approach to reconstruct known biological networks, such as the TCA and urea cycles. This approach provides a degree of confidence in the novel networks that may be reconstructed from experimental data. Nevertheless, experimental verification of these novel networks is ultimately still required. Numerous mathematical models, from Boolean or neural network models [Liang et al., 1998] to deterministic (differential equation based) models [Chen et al., 1999] have been applied to represent biological networks. Thus far, the former algorithms have been used predominantly to infer network structures from gene data, both experimental [Hakamada et al., 2001] and simulated [Akutsu et al., 1999; Maki et al., 2001]. The limitation of the Boolean network model is that it captures only the logical relations withinthe data. With this method, the genes take on binary (1/0) inputs and give binary outputs according to logical or Boolean rules. Boolean methodology provides a context to analyze logical rules such as A = B or C. The information captured by such a model would be, for example, whether gene A can be activated by either gene B or gene C. 48 Boolean models are limited to time series data and cannot be used to evaluate quantitatively the dependency between pathways. To infer networks and interactions requires knowledge, to a certain degree, of the cause-effect relationships among the variables. Typically, these data-driven approaches are based up correlation, and correlation between two variables does not guarantee causality [Sprites et al., 1993]. Data-driven methods (herein denoted as data-driven reverse engineering methods), infer relations between variables (pathways) from the data itself without any a priori assumptions of the mechanisms involved, precluding statistical hypothesis testing of possible regulatory models [Hartemink et al., 2001]. Similarly, deterministic models require detailed information that is presently unavailable for mammalian systems, certainly not in the detail required to model the system sufficiently to suggest or evaluate hypothetical environmental conditions. Therefore, the complexity of mammalian systems limits the applicability of deterministic models to relatively simple biological systems and not easily to gene expression data. In contrast, for certain microorganisms, metabolic networks and their control circuits have been reconstructed from currently available gene sequence data combined with a literature search of the known biochemical pathways and their related physiological fimctions [Schilling et al., 1999]. These networks are built from a knowledge-based methodology and permit hypothesis testing of possible regulatory models (herein denoted as model-driven hypothesis testing methods). These knowledge- based approaches are formulated by predefining or assuming a model and evaluating its likelihood with respect to existing sequence data. Therefore, these models cannot easily assess the effects of changes in the environment on the hypothetical networks, although, 49 the metabolic regulatory network of the E. coli has been redesigned by using metabolic control analysis from experimental data [Farmer and Liao, 2000]. MN": |:> Mud lrlonnalon based ' Reversed Metaiollc Profile Lea'ring network Biologcd aid Blochenied knowledge Identify causal related [80‘0” Hypothesis testing Hypothetlcd and optimize <:i m metric cellular fmctlon (Bayesian) ' Networks Figure 3.1. Bayesian network based framework. Mutual information based Bayesian network structure learning algorithm was first applied to the metabolic profile to infer the regulatory network Incorporating known biological knowledge into the formulation of the hypothetical networks further refined the structures. Bayesian metric score identified the mostly likely network structure. Alternatively, we developed a Bayesian-based framework, schematically shown in Ejgrgg 3_.1_, which combines data-driven reverse engineering and model-driven hypothesis testing methods to re-construct sub-network structures, for example, TCA and urea cycles, from hepatocellular metabolic data. The advantage of the Bayesian framework over other data- driven methods is the ability of the Bayesian approach to perform cause-and—effect analysis, thus providing a basis of identifying causal relationships. This is accomplished without a priori detailed knowledge or assumptions of the biological system and the governing equations but rather is based on the concept of conditional probability. Therefore, similar to model-driven methods, the Bayesian network approach permits the evaluation of several hypothetical networks by using quantitative measures, such as a Bayesian metric score, to access the likelihood of a proposed network structure. In 50 contrast to other model-driven methods, our framework incorporates experimental data, which allows one to evaluate the affects of environmental changes. We integrated several Bayesian based techniques into our framework. Data-driven Bayesian network learning method was first applied to learn the raw metabolic regulatory network from the data itself. Many algorithms exist that can infer the Bayesian network structure. Only a few are computationally efficient enough to deal with large datasets. One such method is the information theory-based learning algorithm, which we have applied to the data to infer the underlying metabolic regulatory network. The inferred network was then compared with the known metabolic network to evaluate the ability of our framework to learn for example the TCA and urea cycles. Next, a latent variable detection algorithm was applied to the inferred sub-network with the goal of finding possible latent variables in the network by discovering indirect pathways relevant to our objective function. Methods, such as inductive causation (10"), were applied to the sub- network composed of the identified pathways with the goal of finding possible latent variables in the network that influence biological (objective) functions, such as triglyceride accumulation and the rate of urea synthesis. Then, based on the pathways learned from the data and currently known biological associations, we postulated several metabolic regulatory network models. The utility and accuracy of these alternative models were evaluated by comparing 1) their Bayesian metric scores, calculated from the experimental data, to assess the model’s goodness of fit to the data; and 2) how well the model predicted the objective function(s). Thereupon an optimal model representing the metabolic regulatory network was selected based on these criteria. Finally, a sensitivity analysis was applied to evaluate the stability and robustness of the system or phenotype 51 with respect to changes, that is, noise, in the system parameters, namely, the environmental or metabolic variables. The ultimate goal of these models is to aid our understanding of the underlying mechanism that governs the biological systems and the transition of these systems to different states. In the current investigation, our system experiences a transition in its microenvironment going from culture medium to plasma. Although, for the purpose of the current investigation, we evaluated the system’s steady-state. The structure of the biological network can be obtained from the steady-state data, which represents the cellular, tissue, or organ phenotype or state. Once the structure is attained, one can evaluate, using transient data, the system dynamics and how it changes from the steady- state. Understanding the system dynamics would permit the management of transitions from a diseased to a more beneficial or normal phenotype. 3.2 MATERIAL AND METHODS 3.2.1 Data Collection The sources of the dataset used in this study were previously published elsewhere [Chan et al., 2003a,b,c], and what follows is a brief description of the experimental method and the metabolic flux analysis (MFA) used to obtain the dataset that was used in BN analysis. Primary isolated hepatocytes were cultured in a collagen sandwich configuration and incubated in standard hepatocyte culture medium containing either 0.5 U/ml insulin (high insulin) or 50 uU/ml insulin (low insulin). The hepatocytes were cultured in this fashion for at least 6 days prior to plasma exposure. This interval is considered the pre-conditioning period. The six-day-old-sandwiched hepatocyte cultures 52 were subsequently exposed to unsupplemented, amino acid, hormone, or amino acid plus hormone supplemented plasma solution for an additional seven days (see Chan et al., 2003a,b for details). Therefore, six combinations of pre-conditioning-plasma supplementation were evaluated. They were i) low-insulin, pre-conditioned, and unsupplemented plasma; ii) low-insulin, pre-conditioned, and amino acid-supplemented plasma; iii) high-insulin pre-conditioned and unsupplemented plasma; iv) high-insulin, pre-conditioned, and amino acid-supplemented plasma; v) high-insulin, pre-conditioned, and horrnone-supplemented plasma; and vi) high-insulin, pre-conditioned, and amino acid plus hormone-supplemented plasma. A model for hepatocyte metabolism was created based on the known stoichiometry of the hepatic metabolic network. The stoichiometry of each reaction in the MFA model is listed in Chan 2003a. A total of 33 metabolite measurements were coupled to the MFA to estimate the other 43 intracellular fluxes (see Chan et al., 2003a,b for details). The structure of the resulting data, namely, the measured metabolites (in bold) and the estimated intracellular flux values, with their corresponding flux numbers, are illustrated in Table 3.2. To help illustrate the overall model structure see appendix Figure 1, we applied our Bayesian-based methodology to the data in Chan 2003a, to initially infer the known biological networks, such as the TCA and urea cycles. After which, the methodology was applied to A infer the network structures relevant to intracellular triglyceride (TG) accumulation. 53 Table 3.1. Learned TCA cycle relationships Learned relation Biological explanation 7 9 9 Synthesis of oxaloacetate (pathway no. 7) controls the synthesis of citrate (pathway no. 9). 9 9 10 Citrate synthesis controls (pathway no. 9) the reaction from citrate to cr-ketoglutarate (pathway no. 10). 11 9 12 a-ketodehydrogenase (pathway no. 11) controls the formation of fumarate from succinyl-CoA (pathway no. 12). 13 9 14 Synthesis of malate from fumarate (pathway no. 13) controls the reaction from malate to oxaloacetate (pathway no. 14). 9 (- 7 9 8 Link between the TCA cycle and glueoneogenesis through pyruvate carboxylase (pathway no. 7). 3.2.2 Bayesian networks Bayesian networks are directed acyclic graphs (DAG) whose nodes correspond to variables and whose arcs represent the dependencies between variables. The dependencies are determined by the conditional probabilities of each node x,, given its parent node pa, Pr(x,- | pa(x,-)). A Bayesian network assumes conditional independence, such that each node is independent to its non-descendants, given its parents. In other words, x,- and x; are conditionally independent to each other given pa (see Figge 3.2), then Pr(Xr I 11319.: (xi?) = ”(xi lpa (Id) 00 54 Figure 3.2. An example of a simple Bayesian network. The network consists of three nodes P,, X,, X, and XI, in which P, is the parent node of X, and Xj. X, and X,- are conditionally independent to each other given the parent node P.. P. is the cause of or regulates X, and X,; X, and X,- are the effects of Pa. Also, a Bayesian network consists of the joint distribution defined by a set of variables {x,} as: N Pr(x1,....,x,,) = 11140:, | pa(x,.)) (3.2) In the example above: 1’de xi: Xj.. xk) = ”(M Pr(xz'lpa) Pr(lepa) 1’ka | xi) (3-3) The approaches developed to learn the Bayesian network structure from either experimental or simulated data generally fall into two categories: 1) search and scoring or 55 2) constraint-based approach (dependency analysis approach). With the search-and- scoring-based approach, the algorithm starts with a graph without arcs, and then new arcs are added and evaluated with a specific scoring function. A scoring function is calculated, and if the score improves, the new are is kept; otherwise it is eliminated. Thus, the Bayesian network is learned through optimizing this scoring function. Numerous scoring functions have been applied to this type of algorithm, such as Bayesian scoring method [Heckerman et al., 1994] and minimum description length method [Suzuki, 1996]. Given the immense computational time and cost associated with this approach, heuristic methods to reduce the search space, such as the hill climb, greedy searches, and simulated annealing, have been applied. The constraint-based approach applies the conditional independency (Cl) test to reconstruct the Bayesian network by uncovering the dependencies within the data. Several algorithms are available for this approach; one of the better-known algorithms is inductive causation (IC) [Sprites et al., 1996]. In the current investigation we combined the constraint-based method with a scoring function to guide the development of the metabolic network. 3.2.3 Inferring the Bayesian Network from information theory Metabolic flux data are much smaller in scale than micro-array data. Nevertheless, applying Bayesian network analysis to flux data is still computationally prohibitive. Many algorithms exist that can infer the Bayesian network structure, only a few are computationally efficient enough to deal with large datasets. The information theory- based learning algorithm is one such method, which we’ve applied to flux data to infer the underlying metabolic regulatory network. This algorithm has been applied to real- world data with hundreds of variables and records [Cheng et al., 2002] and is described 56 briefly below. This constraint-based algorithm is performed in three separate phases: drafting, thickening, and thinning. Phase 1 computes the mutual information contained in each pair of fluxes as a measure of closeness indicating the correlation between fluxes and creates a draft of the regulatory network based on this information. The mutual information I(X,-,X,-) for each pair of metabolic fluxes (x,,xj) is computed as: 1(X X ) Z Pr( )1 Pixi’x”) ., . :: x.,x. 0g ‘ J xixj l J Pr(x,.)P(xj) (3.4) Then each pair of metabolic fluxes with mutual information I(X,~,X,-) greater than a threshold value, T, is sorted in a list L from high to low. An arc is drawn for the first two pairs of nodes in L. The pointer is then moved to the next pair of nodes. If no path exists between them, an arc is added. To illustrate this point, we provide a modified example from [Cheng et al., 2002]. Suppose we have five metabolic fluxes A, B, C, D, E, and the mutual information sorted as follows: I(A,B)>I(B,C)>I(B,D)> T > (A,D)> I(A,C)>I(C,D) then L={[A.B],[B,C],[B,D]} and we obtain a draft shown in Figure. 3.3 A. 57 o 9 C 0‘9 C 0 9 C Phase 1 D Phase 2 0 Phase 3 0 Figure 3.3 An illustration of the three phases of the Bayesian network learning process using mutual information based algorithm. Phase 1: the arcs are added to pairs of nodes whose mutual information are larger than the threshold value E. Phase 2: the arcs are added to pairs of nodes that are not conditionally independent using mutual information based CI test. Phase 3: the arcs are removed between pairs of nodes that are conditionally independent using mutual information based CI test. In Phase 2, arcs are added when pairs of unconnected metabolic fluxes (i.e., nodes) are not independent as determined by C1 test. The CI test is based on conditional mutual information as defined by Eq. 5 below. For I(Xi,Xj|c) less than the specified threshold value T, (Xi,X]') are said to be independent given c, where c is a set of metabolic fluxes. Pr(x-,x-|c) I(X,,Xj|c)= Z Pr(xi,xj,c)log ’ J (3.5) .vx. x 1,6 Pr(x,- |c)P(xj 10) I For example, if the metabolic fluxes (A, D) in Figge. 3.3A are not independent based on the CI test, then the are between (A, D) is added, with the resulting network shown in Figgrc. 3.3B. In Phase 3, each arc is examined by using the CI test and arcs are removed if the two metabolic fluxes linked by the are are conditionally independent. For example, if (B, D) are found to be independent given (A) by the CI test, then the are between (B, D) is removed with the resulting network shown in Figge. 3.3C. 3.2.4 Detecting latent variables and comparing alternative models based on a Bayesian scoring metric 58 Many biological systems are not well defined due to our incomplete understanding of the underlying mechanism involved. As a result, certain relevant factors may be overlooked in the experimental design and measurements. From a mathematical standpoint, these factors would be considered latent variables. To detect the presence of these latent variables, algorithms such as 10" [Pearl, 2000], and the faster FCI [Sprites et al., 1993] have been used. Previously [Chan et al., 2003c], we found that the intracellular TG accumulation was not captured completely by the partial least-squares (PLS) method. A possible explanation may be due to insufficient details of the metabolic pathways pertinent to intracellular TG. In our current model development, the IC* algorithm was applied to a sub-network of pathways to detect possible latent variables relevant to intracellular TG. The alternative models containing these latent variables were then compared quantitatively by using a Bayesian scoring metric. A model’s score, BS(S), is defined as the log of the posterior probability of the model 8 given the observed data 0, p(S|D): BS(S) =10gp(SlD) =10gp(5) +10gp(DIS) —10gp(D) (3.6) where p(DlS) is the likelihood of the observed data D given S. For a given set of data D, all possible structures S are assumed a prior to have equal probability, namely p(S) is constant. p(D) is a normalizing constant, which is the a prior probability of a given dataset that is applied to all structures S being evaluated. p(DlS) is determined by Eq. 7, which was first derived by [Cooper and Herskovits,l992]. _ n qi may.) r.- F(x)= ””3 (3.13) }ar+l then S(x[y, e) can be expressed as S(x|y,e)=6p(ay|e)= “"372 (3.14) x (”(+1) To determine the value of a,B,y, we calculate p(y|e)(x) for three different values of x, for example, if we set x to values of 0, 0.5, 1, then (1,8,7 are determined by fl= P0 ___fl- po' 10 -p =p (7+1)- fl where p0, p0'5, pl denote the corresponding posterior probabilities of p(yle) given x = 0, 0.5, 1. We applied a sensitivity analysis to our Bayesian network by using a Matlab code 63 developed by [Wang et al., 2002]. The sensitivity analysis enabled us to identify the important or most sensitive variables (nodes) in the network with regard to the targeted response variable (objective function), that is, intracellular TG or urea. Therefore, carefully controlling or modulating the most sensitive parameters will facilitate the optimization of the response variable or network structure. 3.3 Results In this chapter, we illustrate the integration for the first time of several Bayesian-based techniques, such as Bayesian network structure inference [Cheng et al., 2002], latent variable detection algorithms [Pearl, 2000], hypothesis testing using Bayesian metric scoring [Hartemink et al., 2001], Bayesian network sensitivity analysis [Wang et al., 2002], and Bayesian network prediction [Friedman and Goldszrnidt, 1996], into one united framework. Moreover, we illustrate the utility of this unified framework to biological data, and the first application of this approach to metabolic data for the purpose of inferring metabolic network structure of hepatocellular metabolism from the data. We applied this framework to identify direct and indirect pathways that influence an objective fimction, such as TG accumulation and the rate of urea synthesis. Previously, we developed a methodology that combined the metabolic flux distribution obtained by MFA with PLS to reveal the underlying structure and correlation within the data. PLS could capture and predict urea synthesis very well, but the intracellular TG levels not as well. Alternatively, the Bayesian-based framework could accurately predict both the level of urea synthesis and TG accumulation for each of the cellular conditions evaluated. 64 Algorithms to detect latent variables to uncover non-obvious pathways relevant to intracellular TG accumulation were used to help construct the Bayesian network model, thus providing flexibility in the model building process. This contributed to the improved predictive capabilities of the Bayesian network model over the PLS method. To uncover these latent variables, both the measured and estimated intracellular fluxes, the latter obtained from a MFA model, were combined into one data matrix X (36X76). The matrix X was then subjected to the equal frequency discretization to obtain a discretized data matrix D. We applied Bayesian network analysis to the data matrix D to infer sub- networks within the larger metabolic network and to identify possible latent variables not included in 0. Based on the latent variables identified, we proposed several alternative metabolic sub-networks and evaluated their “accuracy” with a Bayesian scoring function and its ability to predict the obj ective function. 3.3.1 Reverse engineering the sub-networks Using Bayesian network analysis, we reversed engineered parts of the metabolic network from the flux data, namely, the TCA (Figure. 3.4A) and urea (Figure. 3.46) cycles, The TCA cycle is important for producing cellular energy (ATP) from the oxidation of firels and generating intermediates for the glueoneogenie pathway. The analysis enabled us to infer the relationships shown in Figure. 3.43 and listed in firm. The direct links between the TCA pathways 10—11 and 12—13 were not learned by Bayesian network analysis but rather were identified as being coupled through oxidative phosphorylation (flux nos. 51 to 53). This recognition is rather appropriate because oxidative phosphorylation, whereby ATP is formed by electron transfer from NADH and FADHZ to 02, is coupled to the TCA cycle. It is noteworthy that, in our model, pathway no. 10— 65 11 combined two pathways, namely, citrate—isocitrate and isocitrate—a-ketoglutarate, into one, citrate—a-ketoglutarate; likewise two pathways were combined into pathway no. 12— 13. If we add the missing links (10 —) 11 and 12 ——> 13) to the inferred network, the Bayesian metric score indeed improved from —198 to —172. Another important hepatic function is the removal of NI-I4+ derived from protein and amino acid metabolism via its conversion to urea by the liver through the urea cycle (Figure. 3.40. Bayesian network analysis was able to “learn” most of the pathways in the urea cycle (Figure. 3.4D, Table 3.2). A comparison of the actual pathways and the inferred network is shown in Figgre. 3.4. The link between pathways 16 and 15 was not learned by Bayesian network analysis. Adding the missing relation (15 —) l6) improved the Bayesian metric score from —116 to —102. The learned Bayesian-based network shown in F iggre. 3.4D was evaluated on how well it predicted urea production. We found the Bayesian model could predict urea production as well as the PLS and multi-block PLS models, with ~92% accuracy [Chan et al., 2003c; Hwang et al., 2004]. 66 Flux 52,53 Reversed Urea cycle Figure 3.4. Reverse Engineered TCA and urea cycle learned by mutual information based algorithm. A) Actual metabolic network of TCA cycle, each node represents a metabolite and connections (arcs) between nodes represent the metabolic fluxes. B) TCA cycle inferred by Bayesian network analysis, each node represents a flux in Figure 3.4A, each are between the nodes represent a causal relation between the fluxes, the solid connections were learned and the dashed connections were not learned by Bayesian network analysis. C) Actual metabolic network of urea cycle, each node represents a metabolite and connections between nodes represent the metabolic fluxes. D) Urea cycle inferred by Bayesian network analysis, each are between the nodes represent a causal relation between the fluxes, the solid connections were learned and the dashed connections were not learned by Bayesian network analysis. 67 Table 3.2. Learned urea cycle relationships Learned relation Biological explanation l6 9 17 Synthesis of arginine (pathway no. 17) is controlled by the formation of citrulline carbamoylate from ornithine (pathway no. 16), l6 9 43 Aspartate arninotranferase (pathway no.43) is controlled by the formation of citrulline carbamoylate from ornithine (pathway no. 16), 17 9 15 Urea production (hydrolysis of arginine) (pathway no.15) is controlled by pathway no. 17 17 9 18 Uptake of arginine (pathway no. 1 8) is controlled by pathway no. 1 7 3.3.2 Identifying relevant pathways linked to intracellular TG levels The ability of Bayesian network analysis to infer the known metabolic sub-networks from the flux data lends credence to the latent relationships it uncovers. Despite the tremendous amount of information known about the hepatic metabolic network, our understanding is incomplete. Therefore, due to our imperfect knowledge of the metabolic network, it is plausible that important latent variables that are not directly associated but revealed by the model to be linked to currently unassociated pathways may indeed be consequential, as was noted previously with the uncovering of the link between flux nos. 51—53 and the TCA cycle. Applying a constraint-based algorithm to infer a sub-network containing pathways linked to TG, we found that intracellular TG (flux no. 76) was influenced by B- hydroxybutyrate (BOH) dehydrogenase (flux no. 50), glutaminase (flux no. 38), and glutamate dehydrogenase I (flux no. 36), whereas extracellular TG (flux no. 70) was 68 affected by lactate dehydrogenase (LDH; flux no. 8), glyceraldehyde-3-P (G3P; flux no. 5), free fatty acid uptake (flux no. 72), glutamate dehydrogenase I (flux no. 36), and asparaginase (flux no. 45). Latent variable detection (IC*) algorithm was applied to the aforementioned pathways, along with several other pathways (flux nos. 2, 26, 54, 61, 64, 73, 74, 75), a total of 17 pathways, and the results suggest that latent variables may exist that influence intracellular TG and the glutamate related pathways (flux nos. 36 and 38). Those with direct connections learned through 10" are listed in Table 3.3. Although direct connections to flux no. 50 were not found with IC“, a direct connection between flux no. 50 and 76 was indicated by the constraint-based method; therefore, it was included in all subsequent analyses. From the currently known pathways related to TG metabolism and the inferred connections, we postulated several alternative networks, shown in Figure. 3.5. We tested the hypothesis that flux no. 50 may be a latent variable between flux no. 76 and flux no. 36 (Figure. 3.5A) and between flux no. 76 and flux no. 38 (Eiggre. 359. Similarly, we tested the hypothesis that flux no. 36 was a latent variable between flux nos. 76 and 38 (Figure. 3.53) and flux no. 38 was a latent variable between flux nos. 76 and 36 (not shown, but will be denoted as 5B alternate). Finally, in Figure. 3.5D, we tested the possibility that flux no. 50 was the latent variable influencing flux nos. 76 and 38, in combination with flux no. 38 as the latent variable for flux nos. 76 and 36. We evaluated the likelihood of these models with a Bayesian metric score and prediction accuracy of the inferred model. 69 Cholesterol esterf75) M ] Free fatty acld uptake(72) [Cholesterol Free fatty acid] [esterflw J [ uptake(72) [ibydroxybutyrate (50) Glutamate dehydrogenase(36) l Intracellular T6 (76) ] [Glutamlnasemw J A Cholesterol ester(75) [ uptake(72) ] [Free fatty aeld flhydroxybutyrate (50) Glutamate dehydrogenase(3 .1 lntracellular T (76) G] [ Glutamlnase(38) ] C j[,rflhydroxybutyrathe] ( 50) Glutamate dehydrogenase(3 .1 [mmcfls’ar 7'61 Glutamlnase(38)] 8 Cholesterol Free fatty acid ester(75) uptake(72) flhydroxybutyrate (50) Glutamate dehydrogenase(36) [mmcfléj’a’ TGH Glutamlnasef38) ] D Figure 3.5. Postulated sub-networks for intracellular TG accumulation, learned with IC* algorithm. A) Postulated network that assumes B-hydroxybutyrate as the latent variable regulating intracellular TG accumulation and glutamate dehydrogenase pathways. The network’s Bayesian metric score is -l68. B) Postulated network that assumes glutamate dehydrogenase pathway as the latent variable between intracellular TG accumulation and glutaminase. The network’s Bayesian metric score is -158. C) Postulated network that assumes B-hydroxybutyrate as latent variable regulating intracellular TG accumulation and glutaminase pathways. The network’s Bayesian metric score is --155. D) Postulated network that assumes B-hydroxybutyrate as the latent variable regulating intracellular TG accumulation and glutaminase pathways while glutaminase influences intracellular TG accumulation and glutamate dehydrogenase pathways. The network’s Bayesian metric score is 70 -152. Table 3.3. Relations recovered by Bayesian network analysis with IC* algorithm. A value of 0, 1, -1 indicates no direct link, a direct connection, and the existence of a possible latent variable, respectively. 3.3.3 Accuracy and sensitivity of the postulated networks Two methods were used to evaluate the performance of candidate networks, namely, the Bayesian metric score and the prediction accuracy. Linking the known pathways in the urea and TCA cycles that were not inferred by Bayesian network analysis improved the Bayesian metric score of the inferred network. This illustrates that our knowledge of the biological system can be used to help refine the inferred network and the Bayesian metric score can be used to guide the identification of relevant pathways within a network. The metric scores of the models in Figure. 3.5A, 3.53, 3.53 alterngte, 3.5C, and 3.5D were — 168, -—158, —153, —155, and —152, respectively. The scores indicate that the model in Figure. 3.53 is 22,000 (e'0=22,000) times more likely than the model in Figure. 3.5A in explaining the experimental data, and the model Figure. 3.53 altemafi is ~150 times more likely than the model in Figure. 3.53. Similarly, the model in Figgre. 3.5C is ~440,000 times more likely than the model in Figure. 3.5A. Combining Figure. 3.5C and 3.53 alternate into Figure. 3.5D provides a network model that is 20 times more likely than the model in Figure. 3.5C, 3 times more likely than the model in Figure. 3.53 alternate, and 8.9 million times more likely than the model in Figure. 3.5A. The program 71 used to obtain these scores is available at http://www.ai.mit.cdu/~murphyk/Sofiware/bnsoft.html. In the second method, we divided our 36 samples into a training set and a test set consisting of 24 and 12 samples, respectively, to evaluate the hypothetical network structures shown in Figgre. 3.5. We trained several Bayesian classifying models with our training dataset to predict intracellular TG. The ability of the postulated networks to predict intracellular TG for the test set are indicated in Table 3.4. The model shown in Figr_rr_'e. 3.5A classified the samples with a prediction accuracy of ~67%. However, the model in Figure. 3.53, 3.53 _altemgte. 3.5C. and 3.5D predicted with an accuracy of 92, 92, 83, and 92, respectively. The Bayesian metric score and the prediction accuracy method both indicated that the model most likely to be correct given our data is the model illustrated in Figure. 3.5D. Therefore, it is possible that B-hydroxybutyrate (flux no. 50) may be a latent variable affecting flux nos. 38 and 76, whereas the glutaminase pathway (flux no. 38) may be a latent variable participant in flux nos. 36 and 76 (see Table 3.4). Because glutamine is the amino acid that is supplied to the cells in the highest quantities (4 mM vs. less than 0.8 mM for the other amino acids), a possible role played by the glutaminase (no. 38) is to up-regulate the TCA cycle with the necessary intermediates, such as glutamate and in turn or-ketoglutarate and oxaloacetate. An up-regulated TCA cycle increases the demand on acetyl-CoA, which can be obtained from the breakdown of fatty acids via B-oxidation, with some portion directed to ketone body production, such as B-hydroxybutyrate. Therefore, it is encouraging that the model identified both flux nos. 50 and 38 as relevant influences on intracellular TG accumulation. The program used to determine the 72 prediction accuracy is Belief Network Power Predictor, available at http://www.cs.ualberticahicheng/bnpphtm. Table 3.4. Bayesian score and prediction accuracy of the Figure 3.5 Model Prediction Accuracy Bayesian Score 3.6 A 67% -l68 3.6 B 92% -158 3.6 B alternate 92% -153 3.6 C 83% -155 3.6 D 92% -152 A sensitivity analysis was applied to the intracellular TG sub-network shown in flguLe, i5, with intracellular TG (flux no. 76) as the response variable. The results in Table 3.5 indicate that the intracellular TG level is most sensitive to B-hydroxybutyrate (flux no. 50) and glutaminase (flux no. 38) and least sensitive to glutamate dehydrogenase (flux no. 36) and cholesterol ester uptake (flux no. 75). Although the sensitivity coefficients are all small, the coefficients for B-hydroxybutyrate (flux no. 50) and glutaminase (flux no. 38) are generally an order-of-magnitude higher than the other pathways, indicating that they have more impact on TG storage. Thus, changing flux nos. 50 and 38 would more likely alter the level of intracellular TG accumulation as oppose to changing the other pathways. 73 Table 3.5 Sensitivity analysis of TG network shown in Figure 3.6 Pathway Cholesterol FFA Beta- Glutaminase Glutamate uptake(75) uptake sensitivity (72) hydroxy(50) (38) dehydrogenase(36) sensitivitySA 0.0001725 0.0014 0.0094 0.0014 0 5B 0.000345 0.0013 0.0115 0.0048 0.0058 5B 0.0021 0.0023 0.0138 0.0134 0 alternative 5C 0.000469 0.000525 0.0093 0.0013 0 5D 0.0014 0.0016 0.0093 0.0134 0 3.4 Discussion We applied the Bayesian-based framework to infer known network structures from metabolic data and evaluated the hypothetical networks using quantitative measures, such as Bayesian metric scores. The framework was applied to hepatocellular systems and successfully inferred the metabolic sub-networks, TCA and urea cycles, from metabolic data. We identified both direct and indirect pathways that influence our objective function, taken to be either intracellular TG accumulation or rate of urea synthesis. It was found in [Chan et al., 2003c] that urea synthesis can be predicted well with a PLS model, but the intracellular TG accumulation was modeled only reasonably well with PLS. Thus one of the objectives in the current study was to develop a model that could capture the causal mechanism or system structure and in turn more accurately predict intracellular TG accumulation. 74 Compared with PLS analysis [Chan et al., 2003c], Bayesian network analysis offers the advantage/possibility of characterizing the underlying causal structure within the data. PLS is an approach based on correlation analysis; thus it identifies factors that are highly correlated to the target variable or objective function. Thus, PLS cannot distinguish situations in which an identified factor (x,) and the target variable (Xj) are highly correlated but are caused or regulated by a third variable, the common cause or parent variable (pa). Under those conditions, to achieve an optimal response in the target or objective firnction requires modulating the common cause variable as opposed to the variable(s) that are simply highly correlated. F or example, in our previous paper [Chan et al., 2003c], flux no. 43 (aspartate aminotransferase) was found to contribute most to optimizing or restoring urea synthesis. However in the sub-network inferred by the Bayesian network analysis, shown in Figure. 3.4D. flux nos. 43 and 17 (argininosuccinate synthetase and argininosuccinase) have a common cause, flux no.16 (carbamoyl-P- synthetase and ornithine transcabamylase). It suggested that flux no. 43 is relevant to urea synthesis but it is not the cause of urea synthesis. Thus, to systematically optimize urea production, the variable of common cause (flux no. 16) should be modified rather than the conditionally independent variable, flux no. 43. This suggests that ammonia, regardless of its source, generated in the liver mitochondria and combined with HCO3‘ to form carbamoyl phosphate, is the driving force for urea synthesis and is the source of the up-regulation in pathways 17 and 43, argininosuccinate synthetase, and aspartate aminotransferase, respectively. This is a reasonable finding by the model. The ability of the Bayesian framework to perform cause-and-effect analysis provides an advantage over PLS in optimizing the target variable. However, multivariate analysis, such as PLS, can 75 be used to pre-select a subset of variables that are highly correlated to the target variable, for example, intracellular TG, thereby decreasing the Bayesian network learning procedure. For example, we selected the variables with absolute regression coefficient value larger than 0.5 and applied Bayesian network analysis on this subset of variables. The results are shown in Figge. 3.6. Although this decreases the complexity of the learning process, it does not preclude the possibility that causal information may be less by evaluating only the highly correlated variables. Cysteine [Intracellular uptake(26) ] TG(76) [ 91:1“; j Glutamate Cholesterol up T“ ) dehydrogenase(36) ester(75) [Glutaminase(38) Free fatty add uptake(72) ] Figure 3.6. PLS flux selection. A subset of variables with absolute PLS regression coefficients larger than 0.5 were selected based on PLS analysis as being important to intracellular TG accumulation were subsequently subjected to the search and score based Bayesian network learning approach. In the reversed Bayesian network, some relations were missing, for example, pathways 12—>13 in Figure. 3.53 and 15—916 in Figure. 3.5D. Two possible explanations could account for this: 1) noise in the data; and 2) omitted measurements. We tested the first hypothesis by adding noise to our data and examined its effect on the resulting network 76 structure. Noise in the form of a Gaussian distribution was generated according to the following equation, Noise = randn(n) * std(signal) * ratio (3.16) where randn(n) is a vector of random numbers of length n following a normal distribution, the std(signal) is the standard deviation of the data, and the ratio is any number between [0,1]. We added three different levels of noise to the data by assigning a value of 0.05, 0.10, or 0.20 to the ratio, representing 5, 10, and 20% noise, respectively. The resulting urea cycles are shown in Figr_rre. 3.7. When 5% noise was added, the pathway 16—>17 disappeared but instead pathway 15—>16 was learned (Figure. 3.83). This indicated the possibility that the missing pathway 15—>16 may have been due to noise in the data. Adding 10% noise to the data resulted in pathway 17—918 being replaced by a new pathway 16-—)l 8 (Figure. 3.70, and adding 20% noise eliminated both pathways 15—>16 and 17—>18 (Figure. 3.7D). The results suggested that Bayesian network analysis can tolerate a certain degree (5—10%) of noise within the data to provide a satisfactory level of performance, represented by the number of relations inferred. Although when the noise increased to a higher level, for example, 20%, the performance of Bayesian network approach decreased as reflected in the higher number of missed relations. We tested the second explanation by omitting one measurement from the data and examining the pathways inferred. To illustrate this idea, we removed the lactate measurement from the dataset (see Figure. 3.8/1), which resulted in the omission of two additional pathways, pathways 8—>7 and 7—>9, which were no longer inferred by Bayesian network analysis (Figure. 3.83). Therefore, both noise and missing 77 measurements could contribute to an incomplete or incorrect reversed Bayesian network. Thus care must be taken in collecting the data and, if possible, a comprehensive set of measurements needs to be acquired in order to gain the maximum benefit from Bayesian network analysis. It is appropriate to note that incomplete data would be less of an issue with the availability of high-throughput micro-array technology for collecting gene data. Reversed Urea cycle Flux 16 O O Flux 15 Reversed Urea cycle Figure 3.7. The effect of noise in the data on Bayesian network learning of the urea cycle. 5%, 10% and 20% noise were added to experimental data and Bayesian network was applied to each set of data. A) Urea cycle learned from the experimental data without the addition of noise. B) Adding 5% noise resulted in pathway 16917 not inferred but 15916 inferred instead, suggesting the possibility that 15916 were not inferred in Figure 3.7A due to noise in the data. C) Adding 10% noise resulted in pathway 17918 not inferred but 16918 inferred instead, which indicates noise could result in incorrect relations being learned. 0) Adding 20% noise resulted in pathway 17918 not inferred as well as pathway 15916, which indicates noise could cause decrease performance in the Bayesian network learning methods. 78 Reversed TCA Cycle Figure 3.8. The effect of omitted measurements on Bayesian network learning. A) The current network B) Not including the lactate measurement resulted in the causal relations 897 and 799 not being inferred, indicating that omitted measurement also could account for the causal relations missed in the networks inferred by the Bayesian network analysis. According to the methodology developed in the proposed framework, several hypothetical networks describing intracellular TG accumulation were postulated and tested. Data-driven reverse engineering and model-driven hypothesis testing are two essential components of the proposed framework. Of the two, the data-driven portion of the analysis (i.e., the constraint based or mutual information Bayesian learning portion) used to infer the network structure is essentially automatic; thus, this part of the Bayesian network analysis is both fast and efficient. The analysis identifies hypothetical structures containing causal information that can be subsequently refined further by hypothesis testing. This is accomplished by incorporating existing biological knowledge in determining the pathways to include in the hypothetical structures. For instance, in our 79 intracellular TG example, we manually selected a subset of pathways to perform the IC* algorithm to determine which pathways influence TG accumulation. The method identified the existence of latent variables, which we determined through a process of elimination with the help of the metric score and the model’s prediction accuracy to guide us to the “final” model illustrated in Figgre. 3.5D. As an indirect confirmation of the model in Figure. 3.5D, we applied the Markov Chain Monte Carlo (MCMC) algorithm, which is a search-and-score method to the 17 pathways mentioned earlier. Because this method is computationally expensive, it is not feasible to apply this method to large networks. However, we obtained a solution for the sub-network after 500 iterations. This search-and-score method identified the connection between flux nos. 38 and 76 (see Table 3.6), which we postulated to be the latent variable between flux nos. 36 and 76 (Table 3.3). Thus it indirectly lends support to our hypothesis that flux no. 38 may be the latent variable linking flux no. 36 with flux no. 76. However, as illustrated by our example, the hypothesis testing has to be formulated and assessed manually, slowing the entire process. In our example, we manually postulated the four structures shown in Figgre. 3.6 for intracellular TG. Therefore, a reverse engineering framework that combines these two steps becomes time-consuming. If only one could automatically incorporate the existing and, as much as possible, comprehensive knowledge of the biological system into the hypothesis formulation of the framework, the effectiveness of this methodology would be greatly enhanced. 80 Table 3.6. Relations recovered by Bayesian network analysis with MCMC algorithm, a search and score method. A value of 1 indicates a direct connection. 'Flux nos. 36 3s 150 72 75 76 36 0 1 0 1 1 0 38 1 o 0 0 0 1 so 0 0 o 0 o 1 72 l 0 0 o 0 0 75 1 o 0 o o 0 76 0 1 0 0 0 0 The data in the current study were based upon a metabolic flux model, which provides a comprehensive overview of the intracellular metabolic state. MFA invokes certain assumptions in the development of the metabolic flux model, and these assumptions influence the results we obtain. Ideally, it would be advantageous to increase the number of measurements and reduce the number of assumptions. This may eventually come to pass as high-throughput technologies facilitate these measurements. Gene chip data have the advantage of more independent measurements. Unfortunately, transcriptional, translational, and post-translational effects currently limit our ability to infer cellular function from gene expression data. The measurements used in this study are at the functional level, with the disadvantage that MFA is needed to augment the data. That said, the Bayesian network analysis is naive with respect to the assumptions used in MFA. The Bayesian analysis uncovers the dependencies resulting from these assumptions without bias. The ability of the Bayesian-based framework to infer the TCA 81 and urea cycles, which are well-established cellular networks, provides confidence in this approach to uncover relationships from this, as well as, other types of data. In addition, the Bayesian methodology allowed us to infer the coupling of the oxidative phosphorylation pathways (flux no. 51—53) to the TCA cycle (flux nos. 9—14). This finding was encouraging, as oxygen uptake is an independent and direct measurement and is not linked directly to the TCA cycle in the flux model. Therefore, the ability of the Bayesian fiamework to infer this coupling provides a degree of confidence in the ability of this framework to infer, as well, implicit causal relationships from the data. MFA is performed under the assumption of pseudo-steady-state, which limits the evaluation of the dynamic behavior of the system. In the future, we plan to monitor the temporal metabolic profiles in order to evaluate the dynamic properties of the model. Sensitivity analysis will then be used to identify the most sensitive parameters and hence the most important variables in the network that affect the stability of a metabolic state. Furthermore, this analysis can be used to evaluate how to manage the transition from one steady-state to another and thus guide us in designing experiments to collect the requisite data for these transition studies. Because the costs for collecting experimental data are relatively high in both time and monetary expense, this type of analysis can provide a means to screen and thus optimize the number of experiments necessary to obtain the relevant information. In conclusion, we have shown that Bayesian network analysis could enable inference of network structures from metabolic data. From the resulting network, we can 82 identify the relevant variables that most affect a target function(s), thereby providing a framework to optimize the target fimction(s) and insight into the underlying mechanisms that govern the metabolic state. Finally, the ability of this methodology to infer well- known metabolic structures from experimental data provides confidence in the ability of this methodology to infer other networks, such as, genetic regulatory networks from gene data. 83 CHAPTER 4 IDENTIFYING PHENOTYPE RELEVANT PAHTWAYS 4.1 Introduction In chapter 2, GA/PLS was applied to identify important genes relevant to a metabolic function. In chapter 3, Bayesian Network analysis was applied to infer metabolic network from metabolites profiles. In this chapter, approaches, i.e. GA/PLS and Bayesian network analysis are integrated into a unified framework, which we denote as the Three Stage Integrative Pathway Search (T IPSO), to identify the genes and pathways that regulate a particular phenotype. The regulation of cellular function is achieved through the contribution and interactions of genetic, signaling, and metabolic pathways. Consequently, cellular processes may be singularly regulated, i.e., at the gene or transcription level, or controlled by a network of interactions of genes, proteins and metabolites. Therefore, understanding the biological function of the myriad of genes and how these genes and gene products interact and regulate each other to yield a flmctional cell would help to identify more appropriate pathways that should be targeted or studied for a given disease. An effective drug should be designed to regulate the targeted pathways, while leaving other pathways unaltered. Thus, it is important to identify the pathways in cells perturbed by external stimuli including the drug. With the advent of high throughput technologies, systems of genes, proteins and metabolites for a given cellular state may be readily obtained, requiring novel analytical frameworks to aid in unraveling the regulatory and functional relationships. Approaches 84 such as log linear regression [di Bernardo et al. 2005] and mutual information based ARACNe [Basso et a1 2005] have been applied to yeast and human B cells, respectively, to reverse engineering genome wide gene regulatory networks. Unfortunately, these approaches require large amount of data. Thus the challenge remains as to how to reconstruct active networks fi'om a small number of observations. Therefore, we developed an approach that identifies the active pathways without requiring interaction measurements or libraries of genetic mutants, and with a limited amount of data, namely, by integrating gene expression and phenotypic profiles. To identify the active pathways, we first developed approaches to identify phenotype relevant (active) genes with GA/PLS [Li and Chan 2004] and constrained independent component analysis (CICA) [Lin et al. 2002, Lieberrneister 2002]. BN analysis was then used to reconstruct the active sub- network from the smaller group of genes to reveal which pathways are induced by the external stimuli or environment. BN can detect indirect influences and unmeasured events and is not susceptible to the existence of unobserved variables. It has been applied to infer gene regulatory network of yeast cell cycle from gene expression data, metabolic subnetworks from metabolic data and protein signaling from protein activity data [Li and Chan 2004a; Friedman 2004; Sachs et al. 2005]. BN is computationally inefficient when applied to large network e. g. genome wide network. However, in our framework BN was applied to a reduced subset of relevant genes, which circumvented this limit. In addition, the reconstructed model can predict the effect of perturbing a gene (or several genes) on the other genes in the network, and thus provide insight into how the genes interact within the network. Mathematical frameworks that can reconstruct the active pathways 85 will permit better understanding of how the environment interacts with the genes to regulate cellular fimctions and phenotypes. As proof-of-concept, the framework was applied to identify pathways that regulate cytotoxicity in human liver cells. Elevated levels of free fatty acids (FFAs) and TNFcr have been shown to be involved in the pathogenesis of liver disorders, such as fatty liver disease and steatohepatitis. Saturated FFA, palmitate, was found to induce cytotoxicity and TNFcr exacerbated this effect [Srivastava 2006]. Quantification of the genetic responses of hepatocytes to physiologically elevated levels of FFAs and T'NFor may provide insight into the physiological actions of these factors and identify the pathways involved in conferring cytotoxicity. We developed a TIPSO framework whereby we assume, as a first approximation, 3 log linear relationship between gene expression and cytotoxicity. The genes selected by GA/PLS were corroborated with published results to identify known interactions and analyzed by gene ontology to identify known functional associations. Thus, GA/PLS was used to determine the relevance of the genes to LDH release. In order to extract an independent pathway related to a process, such as LDH release, from the gene expression profile, we propose a constrained ICA (CICA) approach. With the relevance as determined by GA/PLS and the LDH profile as constraints, CICA was applied to extract an independent component from the gene expression data. CICA assumes that the expression profile of thousands of genes can be represented by a reduced number of mutually independent processes. Biological meaningful gene groups have been successfully identified by ICA [Lieberrneister 2002; Martoglio et al. 2002]. CICA selects a subset of genes whose profiles are statistically independent to the other components and 86 corresponds closest to the profile of LDH release. This is achieved by minimizing the mutual information between the independent components and maximizing the correlation to the constraints. Finally, Bayesian network (BN) analysis was applied to reconstruct the pathways from the expression profile of the selected genes. The reconstructed network was perturbed to identify i) which genes, when perturbed, have the greatest impact on altering the phenotype (high LDH release) in the palmitate cultures, and ii) how does perturbing one or several genes (nodes) affect the other genes in the network, as well as the phenotype. The simulated perturbations of the reconstructed model were evaluated experimentally to assess the validity of the predictions. The reconstructed network of pathways provided potential explanation(s) on how TNFcr and palmitate induce LDH release. The model identified i) the involvement of stearoyl-CoA desaturase (SCD) in the palmitate-induced cytotoxicity, ii) the activation of nuclear factor kappa B (NF-KB) by TNF-or is mediated by protein kinase C (PKC)-8, , and iii) the role of Bel-2 and PKR in palmitate induced cytotoxicity. 4.2 Materials and Methods 4.2.1 Three Stage Integrative Pathway Search Framework (T IPS©) We applied a mathematical framework that first integrates genetic algorithm (GA) and partial least squares (PLS) analyses to identify the genes relevant to LDH release, but these genes may be involved in many independent pathways. Therefore, the framework then applies CICA to identify an independent pathway involved in LDH release. Finally the connections between these identified genes are reconstructed using BN analysis to infer how the genes interact with each other in the independent pathways. The 87 reconstructed network illustrates how the genes interact under the given enviromnental conditions to regulate LDH release. The framework is shown schematically in Figure 4.1. Metabolic function e. g. LDH \ GA/PLS / Genes involved in the cellular process [ Gene expression ] LI Genes involved in the Independent process \BN/ [ Pathways ] Figure 4.1: Scheme of TIPSe. GA/PLS, ICA and BN are integrated to infer the pathways thtat regulate a cellular function. The framework being developed involved two central tasks. The first task attempts to identify the factors involved in producing a phenotype and the second task attempts to uncover how these factors associated with each other in a network of pathways. The three stage framework illustrated in Figure 4.1 was developed to integrate gene expression and metabolic profiles to identify the pathways involved in producing a particular phenotype. In the TIPS© framework, GA/PLS and CICA were applied to achieve the first task and EN analysis was applied to achieve the second task. ICA was developed originally for blind source separation (BSS). Original sources S were mixed by matrix M resulting in observation Y. The goal of B88 is then to estimate the original source with the dc-mix matrix W resulting in Z. Thus the ICA model can be expressed as Z = WY (4- 1) 88 where W is the de—mix matrix and Y is the observations, Z is the estimated source signal. Z and W are simultaneously estimated by minimizing the statistical dependence (e.g. mutual information) between the columns in Z. The problem of ICA is then reduced to: Minimize (Z) (4.2) s.t. Z = WY. In some cases, a priori information about the signal source and mix matrix are available. This information can be used to constrain W and Z, and thereby incorporate the a priori knowledge. We denote the signal measurements Y to be the gene expression data, S to be the independent components (pathway), and A to be the mixing matrix. The ICA model is then expressed as Y = A s (4.3) The gene expression Y is supplied to the ICA model and the mixing matrix A can be uniquely estimated by assuming that the components in S are statistically independent to each other. Y(i, t) represents the expression of gene 1' in experiment t, A(i,j) represents weight of gene i in independent pathway j, S(/', t) represents the profile of independent pathway j in experiment I. Since the objective here is to identify LDH release related independent pathway, A was further constrained with the frequency learned by GA/PLS and S was further constrained with LDH release profile. Let F(i) be the frequency of gene i with respect to LDH release and a(i) be the weight of gene i in the pathway related to LDH release. Thus, a is constrained to have a correlation with F by equation (4.4), where pl is a threshold value. Corrl = aT*diag (F)*a/(aT*a)>pl , (4.4) 89 Let s(t) be the profile of LDH related pathway in experiment t and L(t) be the profile of LDH release in experiment 1. Similarly, s is constrained to have a correlation with L by equation (4.5), where p2 is a threshold value. Corr2 = sT*L*LT*s/(s*sT) >p2 (4.5) For further details please refer [Lin et al. 2002]. The advantage of TIPS© is that it identifies a subset of genes involved in the active pathways in response to the environmental stimuli and then reconstructs the pathways from this subset of genes using BN analysis. It reduces the number of samples required for pathway reconstruction and provides pathway information specific to a cellular function. 4.2.2 Bayesian Network Inference Bayesian network inference was used to predict the probabilities of a phenotype, e.g. LDH, or a gene within the network taking on a certain value upon changing a target gene. The posterior probability that the class node will take on a certain value given the values of the other nodes is determined based upon conditional probability. Suppose node A in Figure 4.2 is the target node and b1 and c1 are the known values of evidence nodes B and C, respectively, we can predict the posterior probability Pr(A| xi, xj) according to the Bayesian rule: Pr(A=a1|B=b1, C=c1)=Pr(B=b1,C=c, |A=a,)*Pr(A=a1)/Pr(B=b1,C=c1) (4.6) However, applying this exact inference to a large network is computationally expensive [Cooper 1990]. Therefore, we applied approximate inference algorithm such as logic sampling [Henrionl988] to infer the posterior probabilities. Briefly, logic sampling generates a case by randomly assigning values to each node weighted by the probability 90 of that value occurring. To estimate the posterior probability Pr(X|E) where X is the target node and E is the evidence node, we compute the ratio of the number of cases where both E and X are true to the number of cases where just E is true, i.e. Pr(X=x|E=e) = Pr(X=x, E=e)/Pr(E=e). Figure 4.2. Example of a Bayesian network. Gene A is the parent node of Genes B and C, Gene B is the parent node of gene 0. Genes B and gene C are conditionally independent given Gene A. 4.3 Results and discussion 4.3.1. Identifying genes relevant to cytotoxicity or LDH release using GA/PLS We found that exposing human hepatoblastoma cells (HepG2/C3A) to three different types of FFAs (palmitate, oleate, and linoleate at 0.7mM) in the presence and absence of TNFcr (0, 20, 100 ng/ml), only palmitate was cytotoxic to the cells and resulted in significantly higher LDH release. TNFcr alone was not toxic to the cells. The effect of TNFcr on cytotoxicity was observed only in the palmitate-treated cells. To obtain a global 91 view of the cellular processes, we capitalized upon high-throughput methods to quantify gene expression profiles of the HepG2 cells. We applied GA/PLS to the gene expression and LDH release data. The GA/PLS algorithm determined the relevancy of each gene to LDH release by counting the frequency with which each gene was selected by GA into a subset of genes used to predict LDH release. The genes identified by GA/PLS to be relevant to palmitate- induced cytotoxicity, as measured by LDH release, included oxidative stress related genes (e.g., glutathione peroxidaase (AA664180)), apoptosis related genes (e.g., caspase 8 (AA44868), Bel-2 (AA446839)), TNF- related genes (e.g., TNF super-family (AA77863)), mitochondria membrane permeability related genes (e.g., translocase of outer membrane (TOM)-34 (AA457118)), ceramide related genes (e.g., sphingomyelinase (AA676836), sphingosine kinase (AA630354)), translational related genes (e.g., eIFZB (AA027240)), and signal cascade related genes (e.g., protein phosphatase 2A (AA490473) and PKCS (H11054)). The identification by GA/PLS of oxidative stress related genes suggests the involvement of reactive oxygen species (ROS) in the palmitate-induced cytotoxicity. Likewise the model identified ceramide related genes, e.g., sphingomyelinase and sphingosine kinase, are involved. In addition, mitochondria potential and signal transduction are involved in the palmitate-induced cytotoxicity. 4.3.2. Identifying genes involved in an independent pathway related to cytotoxicity using CICA GA/PLS determined the frequency with which each gene was selected to predict LDH release. The frequency and the profile of LDH release were applied as constraints in 92 CICA to extract an independent component from the gene expression profile. The independent component in this case identified a subset of genes whose profiles corresponded to the profile of LDH release. CICA determined the weights for each gene by minimizing the mutual information between the independent components and maximizing the correlation to the constraints. Genes that have weights significantly different from zero with a 95% confidence using the Z-test were subjected to BN analysis for pathway reconstruction. 4.3.3. Reconstruct pathways related to cytotoxicity using BN BN reconstructed how the genes, identified by CICA, are connected in a network and involved in LDH release or cytotoxicity. The resulting network is shown in Figure 4.3. The model revealed possible mechanisms involved in palmitate-induced cytotoxicity. To evaluate these potential mechanisms, we performed perturbation analyses and experimental validation. In the perturbation studies, all connections relevant to the perturbed genes were included. 93 Trensloeese of outer TNF ligand TRAF Binding . . , . —— . — mrtochondnal membrane ——- 601- superfamrly protein (TOT) 20 CPT-1 BCLZ elF28 Capase 9 | 7 Apoptotic chromatin condensation CREB l l PKR PTP receptor N- F AAH 90043895092 I Apoptosis inhibitor I ’ (APl)5 Capase 10 I PP2A PPAR binding _ Fatty aeyl-coa _ ' protein ligase ‘ HMOXZ I TOM 70 ' l PTP non receptor 1 (PT 1) PPZA _ PKC-Iike l-KappanB Interacting Ras—lik protein Cytoeh c (J, i - , Un pling ATPase oxidase prolein PTB _- PKC delta PTPN9 Acetyl-CoAcarboxylase NF-Kappa-B JNK-1 l-kappa-B _1 l Stero -CoA desaTrase Apoptosis related protein LDH Figure 4.3 A representative sub-network related to cytotoxicity. GA/PLS and ICA select the relevant genes, and BN analysis reconstructs the network using the subset of selected genes. The network provides an overview of the factors and pathways involved in regulating cytotoxicity. 94 4.3.3.1 Perturbation of the reconstructed network Based upon the reconstructed network illustrated in Figure 4.3, BN inference was applied to predict the effect of perturbing a single gene/node on i) the other genes within the network and ii) reducing the level of LDH release, in the palmitate and control cultures. The reconstructed network allowed us to artificially perturb the network and identify which nodes were the most sensitive and thus had the most impact on modulating LDH release. An assumption in our model is that perturbations at the gene level will translate correspondingly (qualitatively, but not necessarily quantitatively) to the protein level. Therefore, we altered the activity of the nodes by either down-regulating (up-regulating) the genes or inhibiting (activating) the protein activity, to study how the nodes interact with each other in the network. Genetic perturbations of SCD, PKC-6 , Bel-2, NF-KB and double-stranded-RNA-dependent protein kinase (PKR) were simulated with the BN inference to study their effects on the probability of LDH taking on a phenotype either of high or low release (see Table 4.1). Table 4.1 Simulating genetic perturbation and its effects on LDH release Probability of LDH Release Control Palmitate Gene Perturbation Low High Low High no perturbation 0.94 0.06 0.02 0.98 Upregulation 0.94 0.06 0.55 0.451 SCD Downregulation 0.9 0.1 0.02 0.98 PKC-S Downregulation 0.9 0.1 0.02 0.98 PKR Downregulation 0.85 0.151 0.08 0.921 NF-KB Upregulation 0.94 0.06 0.15 0.851 95 Note that the probability of LDH release taking on a high or low level in the control and palmitate cultures. Using a BN inference on the structure shown in Figure 4.4 — simulations of up/down regulation of SCD, down-regulation of PKR, and down- regulation of PKC-5 and upregulation of NF-KB as compared to the original condition (unperturbed state). “1” indicates decreased probability and “T” indicates increased probability. 4.3.3.2 Role of stearoyl-CoA desaturase (SCD) in palmitate induced cytotoxicity A connection between SCD and acetyl-CoA carboxylase (ACC) was found in the reconstructed network. SCD is the rate-limiting enzyme to produce monounsaturated fatty acids. Its deficiency has been found to increase fatty acid oxidation by activating AMP-activated protein kinase (AMPK) in the liver. AMPK phosphorylates ACC at Ser- 79 which inhibits ACC activity and decreases malonyl-CoA concentration [Dobrzyn et al. 2004]. Malonyl-CoA inhibits carnitine palmitoyl-CoA transferase (CPT-l) [Kemer and Hoppel, 2000]. Thus a decreased level in malonyl-CoA activates CPT-l activity and increases fatty acid beta-oxidation in the mitochondrion [Dobrzyn et al. 2004]. In the palmitate cultures, the protein (Figure 4.4) expression levels of SCD were reduced compared to the control. This may explain in part the preference of hepatocytes to oxidize as oppose to synthesize TG from palmitate, as is the preference with the unsaturated fatty acids. Indeed, co-supplementing palmitate with oleate enhanced the accumulation of TG [Li 2006] and reduced the cytotoxicity of palmitate (Figure 4.4c). 96 (A) B—actin TNF(ng/ml)100 2o 0 100 20 o 100 20 o Oleate Palmitate HepGZ (B) TNF (nglml) 100 20 0 100 20 0 100 20 0 Palmitate + Oleate Palmitate HepGZ (C) 1:: /W P0 P20 P100 P00 PO20 P0100 Experlmental Conditions Figure 4.4 Effect of TNFo. and palmitate on stearoyl-CoA desaturase (SCD) measured by western blotting. (A) SCD was downregulated in the palmitate (0.7 mM) cultures as compared to the oleate (0.7mM) and control cultures. (B) Co-supplementation of oleate (0.3 mM) with palmitate (0.4mM) prevented the downregulation of SCD. TNFor decreased the expression of SCD in the control, oleate and co-supplementation (oleate + palmitate) cultures. (C) Co-supplementing palmitate (0.4 mM) with oleate (0.3mM) decreased LDH release significantly, P<0.01 (t-test). P: treated with 0.7mM palmitate for 48 hours, PO: treated with 0.4 mM palmitate plus 0.3 mM oleate for 48 hours. Number following P or PO represents the concentration of TNFcr in ng/ml, e.g. P20 is 0.7 mM palmitate with 20 ng/ml TNFcr. Data expressed as average +/— SD from three independent experiments. 97 To evaluate the role of SCD on LDH release, we simulated up- or down-regulation of SCD in the reconstructed network by setting the SCD gene expression level at either a high or low level, respectively. Up-regulating SCD in the palmitate cultures reduced the probability from ~98% to ~45% that the LDH release would remain high (Table 4.1). While down-regulating SCD in the control cultures suggests that the LDH release would likely remain low (i.e., down-regulating SCD will have no effect on LDH release). The simulation results agreed well with the literature, e.g., over-expressing SCD has been shown to prevent palmitate-induced cytotoxicity [Listenberger et al. 2001]. The induction of SCDl activity is transcriptionally activated [Ntambi 1995]. In further support of the protective of role of SCD, we supplemented the cultures with 50 M of clofibrate or ciprofibrate, which are known to increase the activity of SCD through a PPAR independent pathway [Rodriguez et al. 2001; Miller and Ntambi 1996]. The fibrate supplementation significantly decreased LDH release in the palmitate cultures (Figure 4.5). In the control cultures TNFa decreased the level of SCD protein (Figure 4.4 A), but did not affect the LDH release [Srivastava 2006], which agreed with the simulation results shown in Table 6.1. To examine the connection between TNFor and SCD shown in Figure 4.3, the protein expression level of SCD in control (HepG2 medium), palmitate, oleate, and palmitate + oleate cultures were measured as a function of the level of TNFa by western blotting. The presence of TNFcr down-regulated the protein expression level of SCD under all the conditions except palmitate (Figure 4.4A, B). Similarly, palmitate supplementation decreased the protein expression level of SCD as compared to the control and oleate cultures. Co-supplementing the palmitate cultures with oleate restored 98 the SCD protein expression level (Figure 4.4B) and correspondingly reduced LDH release (Figure 4.4 C). Administrating fibratcs increases the liver phospholipid proportion of monounsaturated fatty acids [Miller and Ntambi 1996], which appears to be similar in effect to the addition of oleate to the palmitate cultures. LU-lFbleaee(%) a 8 8 b O Ciprofibrate+Palm Clofibrate+Palm Palm Figure 4.5. Effect of SCD activator, clofibrate and ciprofibrate, on LDH release in the palmitate cultures. Clofibrate and ciprofibrate are known to increase the activity of SCD. Palm: treated with 0.7 mM palmitate for 48 hours, Palm+clofibrate: treated with 50 M clofibrate and 0.7 mM palmitate, Palm+ciprofibrate: treated with 50 11M ciprofibrate and 0.7mM palmitate. 6 hours pretreatment followed by 48 hours eo-supplementation of 50 M of clofibrate or ciprofibrate significantly decrease LDH release in the palmitate culture, P<0.01 (t-test). Data expressed as average +/- SD fiom three independent experiments. Oxidative stress appears to be involved in down-regulating the protein level of SCD in the palmitate cultures. Supplementing with 10 mM N-acetyl cysteine (NAC) prevented the down-regulation of SCD (Figure 4.6) and reduced the release of LDH. The involvement of SCD in palmitate-induced cytotoxicity is further supported by a recent study whereby over-expressing SCD was found to protect CHO cells from palmitate induced cytotoxicity [Listenberger et al. 2001]. 99 SCD w in.» w w 1.1—r d B-aetin NAC (10mM) + + + - - - Figure 4.6. Effect of antioxidant N-acetyl-cysteine on SCD measured by western blotting. Down- regulation of SCD in the palmitate culture was prevented by co-supplementing with an antioxidant N-acetyl cysteine (10 mM). 4.3.3.3 Activation of NF-KB by TNF-a is mediated by PKC-6 In a separate study, we found the phospho-p65 NF-KB levels to be significantly lower in the palmitate cultures than in the control cultures (data not shown). Using the BN inference to simulate an up-regulation of NF-KB in the palmitate cultures predicted that the probability of LDH release being high would decrease (see Table 4.1). NF-KB is an important cytoprotective transcription factor, which can be activated by oxidative stress and cytokines, including TNF-or [Haddad 2002]. From Figure 4.3 we find the connection between TNFor and NF-KB is linked through PKC-S, suggesting that PKC-B is an intermediate factor in the activation of NF-KB. PKC-5 has been found to be a redox- sensitive kinase in many cell types [Kanthasamy et al. 2003]. PKC-8 is activated through translocation, tyrosine phosphorylation or proteolysis. During proteolysis, the native PKC-5 (72-74 kDa) is cleaved into two fragments, a catalytically active 41 kDa and a regulatory 38 kDa fragment. The 41 kDa catalytically active fragment plays a key role in promoting apoptotic cell death [Kanthasamy et al. 2003]. In addition, PKC-5 has been found to associate with the P60 TNF receptor during TNFor triggered signaling [Kilpatrick et al. 2002]. As shown in Figure 4.7, TNFcr increased the expression of the 100 PKC-S 41 kDa catalytic active fi'agment. This confirmed the connection between TNFor and PKC-B in the reconstructed network. Connections between TNFcr, PKC-S and NF-KB have been identified in cells such as neutrophils [Kilpatrick et al. 2002] and pancreatic acinar cells [Satoh et al. 2004]. I I . -..r .1.-.” ‘ e. _ 3': r‘u-‘X‘A 4 Cleaved PKC-S B—actin TNF(ngIml) 100 2f0 0 100 20 O 100 20 0 Oleate Palmitate HepG2 Figure 4.7. Effect of TNFo. on PKC-6 expression measured by western blotting. Expression of the catalytically active PKC-S 41 kDa fragment was increased by TNFcr. Table 4.2 Simulating down-regulation of PKC-S and its effects on NF-ch in medium culture (with 20 nglml TNF-a). Probability of NF-KB activation Gene Perturbation Low High no perturbation 0.08 0.92 PKG-6 downregulation 0.4 0.6 1 Note: The probability of NF-xB taking on a high or low level. The model predicts that simulating a down-regulation in PKC-S will decrease the probability of NF-KB taking on a high value. “1” indicates decreased probability and “1” indicates increased probability. Inhibiting PKC-6 has been shown to attenuate TNF-a-mediated activation of the anti- apoptotic transcription factor NF-KB in adherent neutrophils [Kilpatrick et al. 2002]. There has been no study to date suggesting that PKC-S mediates TNF-a’s activation of NF-KB in HepG2 cells. Our model suggests that down-regulating PKC-S will decrease 101 the probability of NF-KB taking on a high expression level in the medium (plus TNF-or) cultures (Table 4.2), and maintain a high probability of a high LDH release phenotype in the palmitate cultures (Table 4.1). To determine whether PKC-5 is involved in mediating the activation of NF-KB by TNFcr in HepG2 cells, we added rottlerin, an inhibitor of PKC-5, and measured the activity levels of NF-KB by western blotting, and LDH release. Rotterlin inhibits PKC-8 by inhibiting the tyrosine phosphorylation of PKC-S. The activity of NF-KB was measured by detecting the levels of phophorylated NF-KB p65 at Set-536 [Sakurai et al. 2003]. As shown in Figure 4.8, the activation of NF-ch p65 was attenuated by rottlerin. Therefore, PKC-S was appropriately identified by the model to be an important factor in mediating the TNFcr signaling pathway. In addition, the model predicted that inhibiting PKC-5 in the palmitate cultures should maintain a high LDH release phenotype. Indeed, as shown in Figure 4.9, rottlerin supplementation increased LDH release in all the palmitate cultures and the medium cultures with TNFcr. Therefore, the cytoprotective role of NF-KB on LDH release in HepG2 cells is mediated in part by PKG-5. 102 TNF(ng/ml) 100 20 o 100 20 o Rottlerin + + + - - - HepGZ NF-ch Mediu B-aetin Palmitate nun—t m NF-ch Medium B-actin “—uww Figure 4.8. Effect of rottlerin on NF-KB measured by western blotting. Expression of phospho—P65 NF-KB in control and palmitate mediums with 0, 20, 100 ng/ml TNFu, with and without rottlerin (5 uM). TNFcr activated phospho-P65 NF-KB and this activation was attenuated with the PKC-6 inhibitor, rottlerin. %LOH Release Experimental Condition Figure 4.9 Measurement of LDH release in the different culture mediums, with and without rottlerin. Palmitate increased LDH release compared to control cultures. Rottlerin significantly increased LDH release in the palmitate cultures (comparing P0 with POR, P20 with P20R, P100 with PlOOR), P<0.05 (t- test). H: treated with culture medium (control) for 48 hours, P: treated with 0.7 mM palmitate for 48 hours; R: treated with 511M rottlerin. Data expressed as average +/- SD from 3 independent experiments. 4.3.3.4 Role of Bel-2 and PKR in Palmitate induced cytotoxicity 103 Bel-2 is a group of proteins including pro-apoptotic members, such as Bax, Bid, Bad, and anti-apoptotic ones such as Bel-2, Bcl-xl, Bel-w. Anti-apoptotic Bel-2 protein inhibits apoptosis by guarding the mitochondrial gate against the release of cytochrome c and the subsequent activation of caspases. Bel-2 was found to be connected to factors such as TNFa, PKR, TOM20, eIF2B, which in turn influenced LDH release (Figure 4.3). The regulation of Bel-2 by TNFa is cell dependent [Kim et al. 2002; Tamatani et a1. 1999]. The protein expression level of Bel-2 in cultured HepG2 cells as a function of TNFa concentrations were measured with western blots (Figure 4.10A). In the control cultures, TNFa (20-100 ng/ml) slightly suppressed the protein expression level of Bel-2 in a dose-dependent manner. Similarly, palmitate significantly decreased Bel-2 protein expression levels as compared to the control and oleate cultures. The suppression of Bel- 2 may explain in part the palmitate and TNFa induced cytoxicity. In support of this finding, over-expression of Bel-2 in 2B4.11 T cell hybridoma cell lines have been shown to inhibit palmitate induced cytotoxicity [dc Pablo et al. 1999]. Bel-2 p-actin TNF 100 20 0 100 20 0 100 20 0 r (nglml) Oleate Palmitate HepG2 Figure 4.10. Effects of TNF a on Bel-2. (A) Effect of palnritate and TNFa on Bel-2 expression measured by western blotting. TNFa supplementation at 20-100 ng/ml downregulated Bel-2 in the control, palmitate, and oleate cultures. Palmitate downreglated Bel-2 protein expression level as compared to the control and oleate cultures 104 (B) Bel-2 ~ ~ WW fl-actin TNF (nglml) 100 2; 0 100 20 0 PKR inhibitor + + + - - - (C) ”_ gum- .u. W... »~. as...“ 1 Bel-2 ”m- “ “I * B-actin TNF(ngImI) 100 go o 100 go o PKR inhibitor + + + - - - Figure 4.10. Effects of TNF a on Bel-2. (B) Effect of PKR inhibition on Bel—2 level in the control cultures. PKR inhibitor (6uM) decreased the expression of Bel-2. (C) Effect of PKR inhibition on Bel-2 level in the palmitate cultures. PKR inhibitor (611M) increased the expression of Bel-2. Bel-2 was connected to the translocase of outer membrane (TOM20), cysteine ligase (GCL) and carnitine palmitoyl transferase (CPT—l) in the reconstructed network. These connections are supported by published results [Motz 2002, Haddad 2004]. Targeting the Bel-2 protein to the mitochondria is mediated by the interaction between the C terminus of Bel-2 and TOM20 [Motz et al. 2002], whereas inhibiting GCL in fATIII cells has been found to enhance Bax and p53 expressions over Bel-2 expression [Haddad 2004]. Finally, direct interaction between Bel-2 and CPT-l has been identified with co- immunoprecipitations [Paumen et al. 1997]. These connections predicted by the model 105 require further analysis into how Bel-2 may be regulated as well as its role in regulating the mitochondrial membrane in HepG2 cells exposed to palmitate. PKR was also found ill the model to be connected to Bel-2. Simulating a down- regulation in the PKR node suggests that the LDH release will decrease in the palmitate cultures and increase in the control cultures. Indeed we found that inhibiting PKR (6 uM PKR inhibitor) in the control cultures decreased the Bel-2 protein expression (Figure 4.10B), correspondingly the LDH release increased from ~3% to ~25%. Inhibiting PKR in the palmitate cultures up-regulated the Bel-2 protein expression (Figure 4.10C) and decreased the LDH release from ~50% to ~40%. Therefore, the model appropriately identified PKR to be an important factor involved in regulating Bel-2 protein expression, and in turn LDH release. In addition, PKR was also found to be connected to eIF2B, PP2A, apoptotic chromatin condensation inducer and apoptosis inhibitor, suggesting that these factors are likely involved in the apoptotic signaling pathway. These connections are supported by published results in the literature. PKR can be cleaved by caspase-3, 7, 8 to liberate the eIF2-or kinase domain, which phosphorylates eIF2-or [Saelens et al. 2001]. Phosphorylation of eIF2-or by PKR will inhibit protein synthesis and lead to apoptosis. Translation of all proteins is dependent upon the binding of the eIF2-GTP-tRNA,M“ ternary complex to the 40S ribosomal subunit, and the amount of eIF2 present in the complex is dependent upon the activity of eIF2B. PKR also can bind to PP2A at the B56 alpha regulatory subunit and increase the phosphatase activity of PP2A [Xu and Williams 2000]. PP2A is a major Ser/Thr phosphatase involved in many signal transduction 106 pathways. PP2A can dephosphorylate and inactivate the anti-apoptotic Bel-2 at Ser-7O [Deng et al. 1998]. 4.3.4. Model extension With the availability of high dimensional biological data to characterize a cellular state, one of the challenges is the development of robust methods that can integrate various orthogonal datasets and identify the genes and pathways that induce a phenotype. The significance of the TIPS© framework is its ability to extract relevant information, both known and unknown, from the high dimensional data. The phenotypic profile guided the information extraction process. Proteins relay information from the genes to execute biological functions, which define the cellular phenotype, e. g. metabolic functions. Thus, the effects of regulation occurring at the protein level manifest themselves in the phenotype. This chapter illustrates that uncovering this information at the protein (i.e., intermediate) level may be achieved by integrating phenotype and gene expression data. Metabolite profiles, which characterize the cellular phenotype, may also be used as constraints. Currently, only one phenotypic profile, e.g., LDH release, was used to identify the active network perturbed by TNFa and FFA exposure. The phenotype characterization could be improved by incorporating more metabolic functions. This is discussed in more detail in chapter 6. An approach to obtain more constraints would be to apply ICA or PLS to extract several latent variables from the metabolic profiles [Griffin et al. 2004]. The TIPS‘9 framework provides hypotheses of potential mechanisms involved in palmitate and TNFa induced cytoxicity that may be experimentally tested and used to further enhance the model. Experimental approaches, such as ChIP technology, which 107 measure the physiological interactions between genes and proteins, are becoming more readily available. Integrating interaction data with the gene expression profile using models, i.e. BN, would improve the network inferred from gene expression data alone. One way to accomplish the integration would be to define the a priori probabilities of the connections in the gene regulatory network based upon the interaction measurement. Hartemink et al. 2002 provided an example of combining protein-DNA interaction with gene expression data in a BN framework to infer regulatory network underlying pheromone response in yeast. Due to computational limitation as well as limited availability of data, it is not possible to reconstruct a network with high confidence using all the genes across the genome. GA/PLS and ICA provide an approach to identify relevant genes for firrther analysis. However, useful information may be missed in the selection process due to low abundance transcripts or the noise in the data. To address the former, methods such as kinetically monitored reverse transcriptase-initiated PCR (kRT-PCR) could be used to measure genes with low abundance transcripts [Holland 2002]. For noise, more replicates of cDNA microarray should be obtained to achieve an accurate estimation of the mRNA expression level. For example, 7 arrrays per sample are required to identify 90% of the truly differentially expressed genes using two-sample t-test [Wang and Chen 2004]. Therefore, another approach to address the latter would be to use a more targeted array with a smaller subset of genes. In conclusion, we have illustrated that TIPS© may be applied to reconstruct the active associations, both known and currently unknown, from gene expression and metabolite profiles to help elucidate the pathways involved in regulating palmitate-induced cytotoxicity. 108 CHAPTER 5 STATE SPACE MODEL TO INFER TRANSCRIPTION FACTOR ACTIVITIES FROM DYNAMIC GENE EXPRESSION DATA 5.1 Introduction The framework introduced in chapter 4 provided very good predictions of the effect of some, but not all, of the perturbation studies. This may be due in part to Bayesian network analysis’ inability to handle transients, such as cycles and feedback loops. Uncovering the additional information from these transients would be valuable in elucidating the mechanism involved in producing a particular phenotype. Therefore, in chapter 5, we developed a dynamic model using time-series gene expression data. To illustrate the ability of the model, we applied the model to Escherichia coli K12 (E. coli) and Saccharomyces cerevisiae (yeast), which were more readily arnendable to validatation. Different gene regulatory motifs such as auto-regulatory, single input, multiple input, feed-forward, has been identified from protein-DNA interaction data of Saccharomyces cerevisiae[Lee 2002]. These various types of data capture different levels of cellular response to environmental factors and contain within it information about the underlying regulatory network structure. Much effort has been devoted to analyzing these various data sets to reconstruct the regulatory features. A majority of the mathematical approaches, such as clustering, independent component analysis (ICA), principal component analysis (PCA), Boolean network, and Bayesian network have been applied to infer biological information from high dimensional micro-array data. These methods 109 typically do not incorporate known regulatory information into the structure of the models. Therefore, we developed a state space model (SSM) that integrates gene ‘ expression and gene regulatory network structure information to extract underlying information on transcription factor activities. Cells respond to environmental and physiological changes through an extensive transcriptional regulatory network, which is composed of transcription factors (TF) and genes. These transcription factors bind to the promoter regions of specific genes to either positively or negatively regulate expression. High throughput technologies, such as cDNA microarray, allow the measurement of expression data of the whole genome; however, genome-wide measurement of the regulatory signals, i.e., transcription factor activities (TFA), remains a challenge. Clustering has been applied to gene expression data to identify co-regulated genes [Bar-Joseph et al., 2002; Eisen et al., 1998; Ramoni et al., 2002] and Bayesian network analysis has been applied to infer regulatory networks [Friedman et al., 2000]. The objective of this chapter is to infer TFAs from gene expression data. The advent of the genome-wide binding assay to measure protein-DNA interactions has helped to uncover the network structure describing the connections between TFs and genes in Escherichia coli K12 (E. coli) [Salgado et al., 2000] and Saccharomyces cerevisiae (yeast) [Lee et al., 2002]. Given the regulatory network structure and gene expression profiles, the TFAs can be inferred with mathematical modeling. Several methods have been developed to infer TFAs from gene expression data. A kinetic based approach [Nachman et al., 2004], which modeled mRNA transcription and decay, did not include feedback from genes to TFs. Network Component Analysis (N CA) 110 [Liao et al., 2003; Kao et al., 2004], which assumed a log-linear relationship between a gene’s expression and its regulatory signals, i.e. TFA, modeled the gene regulatory network as multiple-input motifs. Feedback from genes to TFs within network structures, such as in auto-regulation, feed-forward loops, the regulator chain, or the interaction between TFs, is modeled as a “closed-loop” from the TF to the genes, without explicitly modeling the feedback [Tran et al., 2005]. To complement existing approaches, we have developed a state-space model (SSM) with hidden variables that explicitly models feedback in gene regulatory networks to infer the regulatory signals from the gene expression profile. SSM is a subclass of dynamic Bayesian network (DBN). DBN and has been applied to infer the transcriptional regulatory network from gene expression profiles, e.g., T-cell activation [Range] et al., 2004; Beal et al., 2004]. Other models, such as Hidden Markov model (HMM), the Boolean network, and linear and nonlinear auto-regression models are also subclasses of DBN [Murphy and Mian, 1999]. SSM assumes the existence of state variables that produce observations that are measurable, as well as hidden variables, which are state variables that do not produce an observation. This feature of SSM is attractive for modeling gene regulatory networks. As illustrated in Figure 5.1, a gene regulatory network consisting of TFs and genes can be represented by a SSM. The state of each gene produces observations, such as expression profiles, that can be measured with cDNA microarray. The state of each TF is hidden, and thus, does not produce measurable observations. The structure of the connections can be deduced from measurements of protein-DNA interactions. 111 ___] G2 TF2 (A) . G3 «— e e Space \ G1 G2 G3 G1 G2 G3 (3, ........ -.-.-.-..l.-.-.- -.-......-.. .-.-.-...-.-.- ___.) Obsen atlon timer 'Spaee {El} E2 E3 <1} E2 E3 1 T- Transcription factor Gl T Gene @ Gene expression Figure 5.1: SSM representation of gene regulatory network. A) An example of a gene regulatory network with two transcription factors, TF1 and TF2, and three genes, G1, G2, and G3. TF1 regulates GI and G1 encodes TF1, which formulates auto-regulation between TF1 and G1. TF1 regulates G2, G2 encodes TF2, both TF1 and TF2 regulate G3, which formulates a feed-forward loop between TF1, G2, TF2 and G3. B) A SSM representation of the gene regulatory network illustrated in (A). The transcription factors and genes make up the state space, whereas the observation space is comprised of gene expression data. We demonstrate that SSM can be applied to represent gene regulatory networks of known structures and to infer the TFA from the gene expression profile. We first applied SSM to learn the TFA from data simulated for the gene regulatory network illustrated in Figure 5.1. Then we applied the model to experimental data from E. coli transitioning its carbon source from glucose to acetate [Kao et al., 2004] and to cell cycle data from Saccharomyces cerevisiae [Spellman et al., 1998]. The inferred activity profile for each TF was validated either by a physical measurement (if available) or activity information from the literature. Finally, further extensions to improve the current SSM model are discussed. 5.2 Materials and methods State-Space Model State and Observation: In SSM, a sequence of observations (01, 02,...,OT) is generated from a sequence of states (S1, 82,. . .,ST) with the following model: S,=AS,.1+Wt (5.1) O, = B S, + Vt (5.2) where A defines the state transition probability P(S(|S(-1), i.e., how the state at time point T can be determined from the state at time point T-l, and B defines the observation probability P(O,|S,), i.e., how the observation at time point at T can be determined from the state at time point T. W~N(0,Q) and V~N(0,R) define the Gaussian noise of the state and observation, respectively. For convenience of notation, parameters A, B, W, V were combined into a single parameter vector 0 = (A, B, W, V). In the SSM model, the structure is time invariant and the parameters are also time invariant, i.e., the parameters that determine the transition from T-l to T are the same as the parameters that govern the 113 .uxmr.e.-' L 011' m" transition from T to T+1. However, the time-scale for each loop is not assumed to be the same. For example, in the yeast dataset [Spellman et al., 1998], the time-scale for each loop is the same (~ 7 minutes). However, in the E. coli [Kao etal., 2004] example, the 10 time points were taken at 0, 5, 15, 30, 60, 120, 180, 240, 300, 360 minutes. Therefore, the time-scale for each loop using the first 5 data points is not the same as for the last 5 data points. Parameter learning: 0 = (A, B, W, V) defines a SSM and is learned from N samples of observation data 0 = (On (1),...., On (N)) by maximizing the likelihood of the observation. An Expectation-Maximization (EM) algorithm is used to learn the parameters. Starting with an initial guess of 0, we perform the E (expectation) step at iteration k to estimate the value of states S_hat with 6x and 0 using inference; then, we perform the M (maximization) step to maximize the likelihood of the conditional probability P(0,S_hat|t9), such that 6i.+1=argmax(P(0,S_hat| 6)). For details of the parameter learning, see [Murphy and Mian, 1999]. We used Bayes Net Toolbox [Murphy, 2001] for the model computation. State inference: After the parameters are learned with the EM algorithm from the observation data, the value of the state variables, including the hidden variables, can be recursively inferred with the Bayes rule: P(s, 101,) = mo, 15,) Ems, I s,_1 )P(s,_1 | 01-1—1) (5.3) t—l Sampling method to generate simulated data If the structure and parameters of a SSM are defined, data can be generated with sampling methods such as Gibbs sampling. The function of sample_bnet in Bayes Net 114 Toolbox [Murphy, 2001] was used to generate the simulated gene expression data and TFA profiles. Represent gene regulatory motifs within SSM framework Different motifs, such as auto-regulation, feed-forward loops, multiple-inputs, and single input, have been identified with protein-DNA interaction measurements [Lee et al., 2002]. Protein-DNA interaction measurements identify the DNA sequences of a gene to which a TF will bind. Once a TF binds to a DNA sequence (i.e., binding site) of a gene, the TF regulates the transcription, and in turn the level of expression of that gene. Some TFs bind to similar DNA sequences that may exist on many genes, while other TFs bind to specific DNA sequences present in only one or a few genes. TFA is defined as the concentration of the active conformation of a TF that is capable of DNA binding. In the current SSM representation of a gene regulatory network, each TF has one TFA node and its TFA is the same for all binding sites. Therefore, in the model the level of activity of the transcription factor (i.e., the TFA) indicates only whether a transcription factor is activating (either positively or negatively) its genes or not. The likelihood that a gene is activated by a transcription factor is inferred from the data as conditional probabilities. For example, transcription factor TF1 binds to genes G1 and G2 at the same (or different) binding sites. The activity level of TF1 is assumed to be the same for G1 and GZ, however, the probabilities that G1 or G2 is activated by TF1 are different as defined by P(GllTFl) and P(GZITFI). In addition, the SSM assumes that there is a time delay between binding and transcription. Here we demonstrate that SSM is able to model these motifs, i.e., auto-regulation, feed-forward loops, multiple-inputs, and single input. As shown in Figure 5.1, SSM is a 115 dynamic model composed of two parts: states and observations. State variables generate observations that are measurable, whereas state variables that do not generate observations are called hidden variables. A static gene regulatory motif can be represented dynamically by connecting genes at time point T-l to the TFs they encode at time point T, and connecting the TFs at time point T-l to the genes that they regulate at time point T. Each TF has a TFA node. Each gene has a state node and an observation node. Duplicating the state and observable variables of a gene will increase the computational load. However, this limitation has an intrinsic advantage. Having the state variable (gene) and the observation variable (gene expression) as separate entities allows the SSM approach to take into account potential effects, such as post-transcriptional effects and measurement noise. Ideally, if RNA decay and measurement noise did not exist, then a gene that is transcriptionally activated would be measured at the mRNA level to be activated. In other words, Pr (E=1 | G=1) =1. However, mRNA expression is determined by the net effect of mRNA transcription and degradation. Effects, such as RNA decay (post-transcriptional modification) and noise in the microarray measurement, could result in conditional probabilities Pr (E=1|G=1) less than 1. In other words, a gene that is transcriptionally activated may be measured at the mRNA level to be inactivated if mRNA decay dominates over transcription [Wang et al., 2002]. Figure 5.2 illustrates a graphical representation of the auto-regulation, feed-forward loop, multiple-input, and multi-component loop motifs. 116 stelZ Stelz @ T-1 T (A)autoregulation 2 9 9 I' 0x1 ap6 m ap6 ea c» O ) IROX 9 _ T-1 T (B) Mum-component loop Figure 5.2: An example and its SSM representation of the gene regulatory motifs of A) auto- regulation. Transcription factor Ste12 binds to the promoter sequence of gene STE12, which encodes transcription factor Ste12. B) multi-componcnt loop. Transcription factor Roxl binds to the promoter sequence of gene YAP6, which encodes transcription factor Yap6. Transcription factor Yap6 binds to the promoter sequence of gene ROXl , which encodes transcription factor Roxl. 117 [11er FPLMA [hrszre [arszzA IPL2B . L16 A I, 82113 , 822A 1 l l I | 1 j I T-1 1' Figure 5.2: An example and its SSM representation of the gene regulatory motifs of C) feed-forward loop. Transcription factor Mcml binds to the promoter sequence of gene SW14, which encodes transcription factor Swi4. Transcription factor Swi4 and Mcml bind to promoter sequence of gene CLBZ. and D) multi-input motif. Transcription factors Fhll, Rap2, YapS all bind to the promoter sequences of genes RPLZB, RPL16A, RPSZIB, RPSZZA. 118 Threshold determination We used a function with a definable threshold Th to discretize the gene expression data. Any gene that showed a change larger than Th, based upon log; ratio, was assigned a discrete value of 1, or otherwise was assigned a value of 0. Thus, a threshold of 1 indicates that a 2-fold (2') change in the expression of a gene, relative to its initial state, is significant. SSM requires an optimal threshold in order to obtain reliable results. To find an optimal threshold, we evaluate the TF A data for each TF over various thresholds between —1 and 1. The optimal threshold, which gives the most appropriate profile for each TF, is determined by comparing the TFA results with known (e.g., measured or literature) values. For example, the gene expression data that we used for Saccharomyces cerevisiae was taken over 18 time points that spanned two cell cycles [Spellman et al., 1998]. The TFs we studied in Saccharomyces cerevisiae are known to have phase-specific activity during the cell cycle [Acme et al. 1998; Baetz et al., 1999; Kovacech et al., 1996; Lee et al., 2002; MacKay et al., 2001; Mclnemy et al., 1997; Oehlen et al., 1996; Spellman et al., 1998]. Therefore, we assumed that the activity profile of each TF during the first cell cycle should be repeated in the second cell cycle. We identified the optimal threshold as the one that predicted this cyclic behavior for all the transcription factors. 119 5.3 Results 5.3.1. Inferring TFA from simulated data Before testing SSM with experimental data where the transcription factor activities are unknown, we applied SSM to a simulated system. We created a simple regulatory network (see Figure 5.1a) with the feed-forward loop and auto-regulation motifs containing two TFs and three genes. From the known network structure, we constructed a SSM representation of the network (see Figure 5.1b) and pre—defmed the 0 parameter, which is shown in Table 5.1(a). Using the network reconstructed by the SSM, we simulated the dynamic profiles of TFA and gene expression levels with the sampling method discussed in the Methods section. Given the simulated gene expression profile (shown in table 5.1(c)) and the network structure, SSM learned the parameters 0 and the TFA profile. The learned parameters are shown in Table 5.1(b). They match closely to the predefined parameters in Table 5.1(a). The learned or inferred TFA was compared to the simulated TFA. The learned TFA matched 95% of the time points (38 out of 40) of the simulated TFA. The results of this simulation provided confidence that the SSM may be able to analyze a real system. The simulations were performed on Matlab R13, Windows XP, on a PC with Celeron CPU 2.4 GHZ and RAM 512MB. The simulation time was 328.5 seconds. SSM requires sufficient data to infer the parameters. We evaluated how many data points are needed with the simulation data by varying the number of time points from 6 to 40. Ten time points were needed to correctly infer 80% of the TFA profile and this percentage increased with the number of time points used. The sampling time needs 120 to be small enough to capture the dynamic profile, which will vary with the biological system being modeled. Table 5.1 (a) Parameters of the simulated network shown in Figure 5.1 1.1nitial probabilities of (TF1, TF2, G1, G2, G3) [P(x=1),P(x=2)] = [0.5, 0.5] 2.Conditional probabilities between gene (G) and its expression (E) [P(E=1|G=1),P(E=2|G=l),P(E=1|G=2),P(E=2|G=2)= [0.9, 0.1, 0.1, 0.9] 3.Conditiona1 probabilities between transcription factors (TF) and genes (G) TF1—>Gl; TF1->G2 [P(G=1 |TF=1),P(G=2|TF=1),P(G=1|TF=2),P(G=2|TF=2)] = [0.9, 0.1, 0.1, 0.9] (TF1+TF2)->G(3) P(G=1|TF1=1,TF2=1)=O.9 P(G=1 |TF1=1,TF2=2)=0.7 P(G=1 |TF=2,TF2=1)=0.7 P(G=1 |TF=2,TF2=2)=0.01 4.Conditional probabilities between genes (G) and transcription factors (TF) G1->TF(1);G2->TF2 P(TF=1|G=1)=O.9 P(TF=2|G=1)=O.1 P(TF=1 |G=2)=0.1 P(TF=2|G=2)=0.9 121 Table 5.1 (b) Learned parameters of the simulated network shown in Figure 5.1 1.Initial probabilities of (T F1, TF2, G1, G2, G3) [P(TF1=1),P(TF1=2)] = [0.25, 0.75] [P(TF2=1),P(TF2=2)] = [0.55, 0.45] [P(Gl=1),P(G1=2)] = [049,051] [P(GZ=1), P(G2=2)] = [045,055] [P(G3=l),P(G3=2)] = [0.44, 0.56] 2.Conditional probabilities between gene (G) and its expression (E) [P(E1=1|G1=l),P(E1=2|G1=1),P(E1=1|Gl=2),P(El=2|G1=2)= [0.83, 0.17, 0.13, 0.87] [P(E2=1|G2=1),P(E2=2|G2=l),P(E2=l|G2=2),P(E2=2|GZ=2)= [0.95, 0.05, 0.12, 0.88] [P(E3=1|G3=l),P(E3=2|G3=l),P(E3=l|G3=2),P(E3=2|G3=2)= [0.91, 009,015,085] 3.Conditional probabilities between transcription factors (TF) and genes (G) TF1->Gl [P(G=1 |TF=1),P(G=2|TF=1),P(G=1lTF=2),P(G=2|TF=2)] = [0.94, 0.06, 0.05, 0.95] TF1->G2 [P(G=1 |TF=1),P(G=2|TF=1),P(G=1|TF=2),P(G=2|TF=2)] = [0.83, 017,007,093] (TF1+TF2)->G(3) P(G=1|TF1=1,TF2=1)=0.76 P(G=1|TF1=1,TF2=2)=O.86 P(G=1 lTF=2,TF2=1)=O.15 P(G=1 lTF=2,TF2=2)=0.04 4.Conditional probabilities between genes (G) and transcription factors (TF) G1->TF(1) P(TF=1|G=1)=0.92 P(TF=2|G=1)=0.08 P(TF=1 |G=2)=0.1 P(TF=2|G=2)=0.9 G2->TF(2) P(TF=1|G=1)=O.44 P(TF=2|G=1)=0.56 P(TF=1 |G=2)=0.34 P(TF=2|G=2)=0.66 122 123 Table 5.1 (c) Simulated gene expression data Note: T: time point, E1, E2, E3 are expression data of gene 1, 2, 3 respectively. 0: inactive. 1: active. Table 5.2 Eight genes regulated by ARCA and CRP Gene TFs Acs + SucD - SucC Sth - SdhA - SucB - mdh - - + Note: (+) positive control, (-) negative control 5.3.2. E. coli We applied SSM to a model system: E. coli transitioning from glucose to acetate as a carbon source. The gene expression data was obtained fi'om [Kao et al., 2004] and the regulatory information is available from the regulonDB database. The SSM model included two TFs, CRP and ArcA, and eight of the genes (shown in Table 5.2) that are regulated by the two TFs. The dynamic profiles of these two TFs were learned from the gene expression levels of the eight genes. In [Kao et al., 2004], cAMP was measured to indirectly indicate the activity of CRP, since the activation of CRP requires the binding of CAMP. From the measurement of CAMP (see Figure 5 in Kao 2004), we can see that the level decreased from its initial high but remained upregulated, around 10 fold above the basal level, fi'om the second time point onward. Indeed, SSM inferred that CRP is active from the second point onwards (see Figure 5.3). F or the first time point, however, CRP is predicted to be inactive. The expression level at the first time point is the reference; all subsequent expression levels are measured relative to the expression level at the first time point. SSM also identified that ArcA was inactive for the first four time points and active for the remaining 6 time points (see Figure 5.3). An ArcA measurement is not readily obtainable for comparison. 124 ___-4" AKA SSM Profile CRP SSM Profile ‘70 2 3 :fi ; e e 6 “e; e e e e at; I ii 2r f; lei . . TA 11 . . " 0 2 4 6 " 0 2 4 6 innerllours) Tilt-cilia") Figure 5.3: The results of using a SSM to analyze an E. coli system. SSM predicts that CRP is inactive initially and then active from the second time point onwards, as expected item the CAMP measurement [Kao 2004]. SSM predicts that ArcA is inactive for the first four time points, but is activated for the last six time points. 5.3.3. Saccharomyces Cerevisiae Next, we applied SSM to model several of the common regulatory motifs in Saccharomyces cerevisiae. According to Lee et al. (2002), 39 out of the 106 studied regulators were involved in feed forward loops and 10 of the 106 were involved in auto- regulation. Lee et al. (2002) also found that in combinations of two or more of the 106 regulators, 295 were involved in multi-input motifs. Therefore, we selected TFs with the aforementioned regulatory motifs, and whose activities could be verified by the literature. Namely, Mle, Swi4, Swi5, and Swi6, which are well studied and well—understood [Acme et al., 1998; Baetz et al., 1999; Kovacech et al., 1996; Lee et al., 2002; MacKay et al., 2001; McInemy et al., 1997; Oehlen et al., 1996; Spellman et al., 1998] TFs. SSM was applied to analyze the system illustrated in Figure 5.4. The gene regulatory motifs, feed-forward (Mcml 9 SW15 9 Swi5 9 YJL160C + PIRl + PIR3), multiple-input (Mcml + Swi4 + Swi6 9 CLA4 + SW14 + LSM4 + PCLl + SIMl + GIN4 + YDR509W), and auto-regulation, (Swi5 9 SW15; Swi4 9 SW14) were obtained from [Lee et al. 2002]. 125 masters @Fo‘mrag flow erasure ~ (sneer emanate Figure 5.4: A SSM representation of the Saccharomyces cerevisiae (yeast) system studied. The gene regulatory motifs were taken fiom [Lee et al. 2002], including feed-forward (Mcml->SW15->Swi5- >YJL160C+PIR1+PIRB), multiple-input (Mcml + Swi4 + Swi6 -> CLA4 + SW14+LSM4+PCL1+SIM1+GIN4+YDR509W), auto-regulation (Swi5->SW15; Swi4->SWI4). We evaluated the ability of the SSM to infer TFA profiles during the cell cycle of Saccharomyces cerevisiae. Lee et al. (2002) performed a genome-wide binding analysis to obtain the connectivity information between the TFs and genes in yeast. We coupled the connectivity information [Lee et al., 2002] with gene expression data taken from yeast cultures synchronized by or-factor arrest [Spellman et al., 1998]. Of the four synchronization methods used by Spellman et al., we chose the data obtained with the or- factor method because it presented the least amount of missing data for the genes studied. With the (Jr-factor arrest method, the data were sampled every 7 minutes, which captured approximately two cell cycles with the 18 time points [Spellman et al., 1998]. This provided 9 time points (~ 63 minutes) for each cell cycle. Each phase (i.e., M, G1, S, and G2) within a cell cycle takes about 15 minutes [Liao et al., 2003]. In other words, one cell cycle is about 60 minutes. Therefore, a 7 minute sampling time is small enough to 126 capture the phase change profile within a cell cycle. The binding motifs and gene expression data were used by SSM to infer the TFAs. The SSM predictions are consistent with the literature results. SSM inferred that Mle is active during the G2/M/Gl phases, Swi4 and Swi6 are active during the GI and S phases, and Swi5 is active during the M and G1 phases. We confirmed the predictions (Figure 5.5) made by SSM with the literature. Past studies found that Mle induced the expression of many genes during the G2/M/G1 phases. High transcription of both FAR] and STE2 in the G2/M phases requires Mle [Oehlen et al., 1996]. Mle is also known to induce the transcription of CLN3, SW14, and CDC6 at the M/Gl boundary [Spellman et al., 1998]. In contrast, Swi4 induced genes in the G1 and 8 phases. Swi4 is the DNA binding component of SBF [Baetz et al., 1999]. Baetz et a1. (1999) indicated that SBF promotes the induction of gene expression at the Gl/S-phase transition of the mitotic cell cycle. MacKay et al. (2000) also showed that a complex containing Swi4 induces CLNl and CLN2 transcription in the late G1 and drives the transition to S. Similarly, Swi6 induced genes during the G1 and 8 phases. The activity of Swi6 is very similar to that of Swi4, and these two factors are known to be connected [Baetz et al., 1999; MacKay et al., 2000]. MacKay et al. (2000) showed that a Swi4-Swi6 complex induces CLNl and CLN2 transcription in late G1 until S. Baetz et al. (1999) also suggested that the DNA binding domain of Swi4 is inaccessible in the full-length protein when not complexed with Swi6. In contrast, Swi5 was activated during the M phase and the M/Gl boundary. Kovacech et al. (1996) found that Swi5 is partially responsible for the peak in EGT2 expression during late M and early G1 phases. Aerne et al. (1998) 127 found that SwiS regulates the expression of PCL2, PCL9, and the SICl Cdk inhibitor in the late M phase. ‘ M’GlGl S G2 MIvI/Gl G1 S G2 M MJ'GIGI S G2 IVIIvi/Gl G1 S G2 M _. l .4 OIIIIIIITTIIIIIIIII OIIIIIIIIIITIIIIIII 1357911131517 1357911131517 1 “Mi'GlGl S G3 MM/Gl G1 3 (33 M NUGIGI S G2 MM/Gl G1 S G2 M 2 q M 2 - 1 ‘ l -( W Ollflllllllllllrllll 0 rilllllllllllllrll 1357911131517 1357911131517 Figure 5.5: The results of using a SSM to analyze a yeast system. The SSM predictions closely followed the experimental trends and phases in which each transcription factor is known to be active or inactive [Acme et al., 1998; Baetz et al., 1999; Kovacech et al., 1996; Lee et al., 2002; MacKay et al., 2001; McInemy et al., 1997; Oehlen et al., 1996; Spellman et al., 1998]. SSM inferred that Mle is active during the G2/M/Gl phases, that Swi4 and Swi6 are active during the G1 and S phases, and that Swi5 is active during the M and G1 phases. In summary, SSM was applied to three systems, a simulated gene regulatory network and two experimental systems, E. coli and S. cerevisiae. The simulation study confirmed the ability of SSM to infer network parameters and state values from observational data. Application of SSM to the experimental systems illustrates that TFAs can be inferred from the gene expression data given the regulatory network structure. 5.4 Discussion SSM is a subclass of DBN. DBN has been applied to infer the structure of regulatory networks from temporal gene expression data [Rangel et al., 2004; Beal et al., 2005; Ong 128 p7 et al., 2002; Perrin et al., 2003; Nachman et al., 2004]. Ong et al. (2002) explicitly included operons as hidden variables in the model to facilitate the incorporation of a priori biological knowledge of the co-expressed genes to improve the quality of the analysis. Here, we explicitly included TFs in the model and included the connections between TFs and genes obtained through binding analysis [Lee etal., 2002]. SSM has the potential to infer unmeasurable, as well as unmeasured, connections and events. The benefit of the SSM over existing models (e.g., NCA [Liao et al., 2003], kinetic modeling [Nachman et al., 2004) is that all gene regulatory motifs, including feedback from gene to TFs, are explicitly modeled. NCA implicitly models the feedback from gene to TFs [Tran et al., 2005]. In kinetic modeling [Nachman et al., 2004], feedback from gene to TFs is not incorporated. In contrast, SSM can explicitly model the feedback from gene to TFs, such as the auto-regulatory motif. This facilitates the incorporation of domain or experimental knowledge. For example, if a TF is experimentally knocked—out or silenced, the SSM approach could easily incorporate this information for the auto-regulatory motif. In SSM, the nonlinear relationship between the TFs and genes are quantified with conditional probabilities, i.e. P(0,|S,-,). By using conditional probabilities, SSM does not presuppose a relationship between the TFs and genes in the model, whereas NCA assumes a log-linear relationship and kinetic modeling assumes a form for the rate law, such as Michaelis-Menten kinetics. Conversely, NCA [Liao et al., 2003] and kinetic modeling [Nachman et al., 2004] can infer continuous profiles of TFA. Another advantage of kinetic modeling is the ability to explicitly model both mRNA transcription and decay. In the current application of SSM the gene expression data are discretized, 129 thus allowing us to infer when the TFAs are active. This was sufficient to allow verification of the model predictions with the literature. F or example, the model predicted the phase in the cell cycle in which the genes that are regulated by a TF are activated, which can be compared with the phase of the cell cycle in which the genes are known to be activated in the literature. The SSM assumes there is a time delay between binding and transcription. By considering the state values at time T as a function of the state values at time T-l, the SSM implicitly assumes a time delay for all TF effects on gene expression. In other words, although the binding of a TF to the DNA sequence of a gene may occur quickly [MeAdams and Arkin, 2002], there is a time offset between the binding of the TF to the DNA sequence of a gene and the onset of transcription. This time offset ranges between minutes to hours [Kersberg, 2004]. It has been shown that incorporating a time delay in modeling gene regulatory networks is critical to inferring the oscillatory behavior of NF- kB [Monk, 2003]. We further evaluated this assumption by allowing the genes to be regulated by the current TFA in the yeast dataset. Without the time delay, the cyclic TFA could not be inferred. This, in addition to the previous study [Monk, 2003], suggests that this biologically relevant time delay [Kersberg, 2004] must be incorporated in the model to accurately infer the TF A profiles. In some cases, if the actual time delay is on the order of minutes and the measurements are taken on the order of hours, then considerable error would be introduced. In those cases, it would be more appropriate to incorporate the connection between the TFS and genes in the same time slice. In the simulation study, the TFA inferred by the SSM matched the simulated TFA well but not exactly. The mismatches may be due in part to the EM algorithm being a 130 T local optimization method [Ong et al., 2002], in other words, the algorithm cannot guarantee a TFA of (global) maximal likelihood. The optimization could be improved by either running EM multiple times from different starting points or using a global search algorithm, such as Markov Chain Monte Carlo (MCMC) [Murphy 2001]. The SSM model determines an optimal threshold value for discretizing the gene expression data based upon a priori knowledge of the TFA. If no a priori knowledge is available for the TFA dynamics, this can be addressed one of two ways. In one approach, the TFAs could be estimated from other approaches, e. g. NCA, and the estimated TFA could be used to determine the threshold value. In the other approach, the optimal threshold could be determined in the SSM by including the threshold value (Th) as a part of the parameter learning process, i.e., in the parameters 0 = (A, B, W, V, Th). How the observation data O(Th) = (Om (1),. . .., Om (N)) is discretized depends on the threshold value, e.g., a very high threshold value will set all the genes in the inactive state while a very low threshold value will set all the genes in the active state. The Expectation- Maximization (EM) algorithm could be used to learn the parameters. Starting with an initial guess of 0, we can perform the E (expectation) step at iteration k to estimate the value of states S_hat with 0,. and O(Th) using inference; then, we can perform the M (maximization) step to maximize the likelihood of the conditional probability P(0(Th),S_hat|6), such that 61+1=argmax(P(0(Th),S_hat|6)). Therefore, the parameters, including the threshold value, can be inferred by maximizing the probability of the observation and state values given the inferred parameters (e.g., conditional probabilities). 131 The current SSM infers the most probable model, given the observed data, by approximating the underlying structure of the noise to be Gaussian [Murphy and Mian, 1999; Perrin et al., 2003]. Gene expression is an inherently stochastic phenomenon [McAdams and Arkin, 2002]. SSM modeled the regulatory networks stochastically using conditional probabilities. This probabilistic SSM may capture some of the stochastic nature of the gene regulatory network, but an accurate representation of the stochasticity requires further understanding of the underlying structure of the noise. Without knowing the structure of the noise, studies have assumed it to be Gaussian [Perrin et al., 2003]. The current SSM model could be extended to incorporate a step to learn the structure before inferring the TFAs by searching for a network that gives the maximal likelihood against the observation. The structural information obtained from the binding analysis could be used to construct the initial network as a starting point for the search [Nachman et al. 2004]. Alternatively, the connections indicated by the interaction data could be used to define a priori probabilities of the connections in the network [Hartemink et al., 2002]. Thus, the connections that are supported by the interaction measurements would have a higher likelihood of being valid connections in the network than the unsupported connections. By including a step to learn the structure, it could help refine the network by inferring the interactions that are unmeasurable or missing due to error (noise) in the measurement. A more accurate network provides more confidence to the inferred TFA profile. In addition, the model size was limited by the computational tool that was used, namely Bayes Net Toolbox [Murphy, 2001], which is on a Matlab platform. We are currently working on developing an executable SSM in a CH version 132 ..,... of Bayes Net Toolbox, Probabilistic Networks Library http://www.intel.corn/techno]ogy/computing/pnl/ to handle larger model sizes. 133 (PNL) “- .' 1 CHAPTER 6 EXTENSIONS TO IMPROVE THE T IPS© FRAMEWORK 6.1 Introduction In the previous chapters, we introduced the T [PS© framework to identify genes and pathways relevant to a phenotype, e. g. cytotoxicity. This chapter discusses improvements to the current T IPS© framework. Improvements include hierarchically integrating phenotypic, metabolic and gene expression profiles with genomic function information, inferring pathways that confer a phenotype involving multiples metabolites and developing a dynamic modeling. Preliminary results are presented in this chapter. 6.2 A Hierarchical Approach Employing Metabolic and Gene Expression Profiles to Identify the Pathways that Confer Cytotoxicity in HepG2 Cells 6.2.1 Introduction In the TIPS© approach introduced in chapter 2 to chapter 4, genes relevant to a phenotype e.g. cytotoxicity were selected with GA/PLS and ICA without incorporating metabolic profiles and functional information of the genes. It is the objective of this section to develop an approach to hierarchically integrate phenotypic, metabolic and gene expression profiles with a priori knowledge to select genes relevant to a phenotype. The approach was applied to the cell culture system introduced in chapter 4, namely HepG2 cells exposed to FFAs and TNF-or, to study the genes involved in palmitate induced cytotoxicity. 134 Metabolite Gene sets, groups or Measurement expression profile Correlation Correlation / FDA (GSEA) Identification of Identification of metabolites gene sets related related to to cytotoxicity MB-PL Identification of genes related be the metabolites Figure 6.1 An overview of the hierarchical approach. First, important metabolic fluxes relevant to a phenotype were identified with discriminant analysis. Second, to identify the transcriptionally altered pathways, gene set enrichment analysis (GSEA) was applied to the cDNA microarray data. Finally, the expression of the enriched gene sets and the metabolic profiles were integrated with a multi-block partial least squares analysis (MBPLS) regression model to identify the genes and gene sets that regulate the metabolic pathways identified in step 1. This figure was prepared by Shireesh Srivastava. An overview of the hierarchical framework that was developed is shown in Figure 6.1. The framework consisted of three stages. First, the metabolic fluxes of glucose, fatty acid and amino acid metabolism were measured to identify the metabolic changes responsible for producing the cytotoxic phenotype. Different phenotypes induced by TNF-or and FFA were characterized with Fisher’s Discriminant Analysis (FDA) (Chan 2003 c) to identify the metabolic fluxes that contributed most to separating the phenotypes. Ketone body (e.g., beta-hydroxybutyrate (BOH) and acetoacetate) release and TG accumulation were identified to be the most important in separating the phenotypes, suggesting their involvement in inducing the cytotoxicity. However, the FDA of the metabolic changes did not provide information on the changes in the signaling and gene regulatory pathways. To obtain this missing information, the genomic responses were measured using cDNA microarrays, and analyzed with the gene set 135 enrichment analysis (GSEA) [Subramanian 2005]. GSEA integrates a priori knowledge of a gene’s functional role with the expression data to detect concerted expression changes in a set of genes responsible for producing a phenotype. GSEA indicated the involvement of mitochondria-related, oxidative stress-related and FFA metabolism pathways in inducing the cytotoxic phenotype. GSEA identifies the possible functional pathways, but does not provide a quantitative model to predict the effects of individual genes on the phenotype. Therefore, in the final step, the gene expression and metabolic flux profiles were integrated with multi-block partial least squares analysis (MBPLS) [Hwang 2004] to identify the genes most relevant to the metabolic fluxes which correlated most to the cytotoxicity. In the MBPLS model, the different gene sets identified to be enriched by GSEA formed the blocks of the MBPLS. This ensured that the blocks were separated according to the functional role of the genes. Block scores were extracted from each block to predict the metabolic fluxes. Some of the identified genes were experimentally perturbed to validate their predicted roles in regulating the cytotoxicity. 6.2.2 Materials and Methods Fisher’s Discriminant Analysis FDA was applied to identify the measured metabolic fluxes that contributed to the separation of the different phenotypes (cytotoxic versus nontoxic). FDA was applied to 15 samples belonging to 5 groups with 27 metabolic flux measurements for each sample. FDA identifies the projection axes that maximize the ratio of the between-group and the within-group variations. Details on the FDA algorithm can be found in [Chan 20030]. 136 Two axes T1 and T2 were extracted; each is a linear combination of the 27 metabolic fluxes. T = 2W,- x Met(i) (6.1) W,- provides the contribution of each metabolic flux in separating the phenotypes in the T1 and T2 directions. Gene Set Enrichment analysis (GSEA) of the gene data In order to evaluate the genetic response of the cells to the treatments, cDNA microarray analyses were conducted. The expression levels of 19,522 genes were measured, and the data was analyzed using GSEA. GSEA aims to identify the gene sets whose coordinated change differentiates two phenotypes. The software GSEA-P, from http://www.broad.harvard.edu/gseal, was used for the GSEA analysis. Thirty seven gene sets from the molecular signature database, MsigDB [Subramanian 2005] functional gene group c2 were selected, as shown in Table 6.1. These sets included 10 metabolic pathways, 26 signal pathways and 1 cellular component. An enrichment score of a gene set S characterizes whether the set of genes randomly distributed across the list or falls mainly at the bottom or top of the list. The null hypothesis that a gene set S randomly distributes across the ranked gene list was tested with Kolmogorov-Smimov test, with the statistical significance value estimated through 1000 random permutations of the phenotype label. The gene sets with a high significance of enrichment are considered important in separating the distinct phenotypes. Here, the expressions of 19,522 genes were measured and ranked based upon the comparison between the palmitate treatment and the control using t-test. Integrating the gene expression and metabolic flux profiles 137 MBPLS is a hierarchical multivariate analysis method [MacGregor 1994; Lopes 2002], where the variables are divided into different blocks based upon a priori knowledge, for example according to different stages of a process [MacGregor 1994] or different metabolic pathways in a cell [Hwang 2004]. In this study, we separated the genes into different blocks based upon their functional roles in different pathways. This facilitates the identification of an important block (gene set) to a metabolic flux, and then further identifies the important genes within the block. Gene expression X is divided into K blocks X = [X1,X2,. . .,Xk], block scores B = [b1, b2, ...bk] are extracted from each block Table 6.1: Gene sets used in the GSEA analysis Metabolic pathway Glycolysis fatty acid metabolism Oxidative TCA phosphorylation PPP pathway beta-oxdiation Fructose Sphingoglycolipid Fatty acid biosynthesis Glutathione 8m pathway Bcl2family_and_reg_network Map_kinase CAPASE Mapkk Cell death Ppara ceramidePathway pparG crebPathway S1 P signaling NFKB Programmed cell death ERK1 Erk2 Mapk P38 pathway Akt JNK EGF TNFalpha EIF2 Stress Pathway E|F4 TNFR1 pathway Electron Transport Chain TNFR2 pathways ERK pathway Cellular Comflment Mitochondria respectively. A PLS regression model was then constructed to map B to a metabolic flux. For details of the MBPLS algorithm, please refer [Hwang 2004]. Important genes sets 138 were identified by evaluating the weights of each block and importance of individual genes were then further identified by evaluating the regression coefficients of the genes within the block. The N-way toolbox (http://www.models.kvl.dk/source/nwgytoolbox) [Anderson 2000] was applied to conduct the MBPLS modeling. 6.2.3 Results Metabolic flux important in separating cytotoxic phenotype FDA was applied to identify the metabolic fluxes responsible for separating the phenotypic response (cytotoxic versus nontoxic as defined by the level of lactate dehydrogenase (LDH) release). The metabolic flux data were obtained by Shireesh Srivastava. As shown in Figure 6.2 (a), palmitate cultured samples were separated from the rest in a two dimensional space defined by FDA projections T1 and T2. Furthermore, as shown in Figure 6.2 (b) metabolic fluxes of BOH, acetoacetate, intracellular TG accumulation and amino acids such as aspartate, alanine, ornithine were the most significant in separating the cytotoxic (palmitate) from the nontoxic phenotype (control, unsaturated fatty acids, TNF-or and their combinations). Intracellular TG, ornithine, aspartate have the highest coefficients in separating palmitate from the other conditions in the T2 direction, while BOH, acetoacetate and alanine separated the two groups in the T1 direction, which suggested differences in the metabolic state between the phenotypes exist mainly in the aforementioned metabolic fluxes such as TG, BOH and acetoacetate. While FDA can identify the important metabolic fluxes to the phenotype, it does not provide the direction of the relation, i.e., whether the variables are correlated positively or negatively. To obtain the direction of the relationships, we performed correlation analysis. The (Pearson’s) correlation coefficients (r) between cytotoxicity and metabolic 139 fluxes are shown in Table 6.2. BOH, acetoacetate, alanine release had positive correlations. Acetoacetate and BOH are produced from acetyl-CoA, the metabolic intermediate of fatty acid oxidation. Increased fatty acid oxidation is known to increase oxidative stress and cell death [Gill 2006, Sanyal 2002]. TG accumulation, ornithine and aspartate had negative correlations. TG accumulation has been found to protect cells from palmitate induced lipoxicity [Listenberger 2003]. Ornithine and aspartate may affect fatty acid metabolism through the TCA cycle. Thus, both the FDA and correlation analyses suggested that metabolic fluxes relevant to fatty acid oxidation are positively associated with the palmitate-induced cytotoxicity, while metabolic fluxes relevant to the accumulation of TG are negatively correlated to cytotoxicity. 140 10 OH .8 AP xC) XL T1 0'4 _ o Glucose N I Lacate _l Asp a Acec Acac .’ BOH X BOH B ' . Glycerol A J o FFA 0 X + T6 r I 1 r {P .., i I r ' 80' (i o 0.2 0.4 0.8 - Gin L1 ' His -o.2 ~ ' ‘59 o Glu 6'5! TG -NH3 \ -0.4 i , NO + Thr -Ala -0.6 a - Pro 9 Cys .. Tyr -08 d a Val >l< . Om / Met a: Om o Lys . lie -Leu - Phe Figure 6.2. Fisher’s discriminant analysis of metabolites. (a) Two dimensional score (T1 and T2) plot of the samples, H: control, B: BSA, P: palmitate culture, 0: oleate culture, L: linoleate culture. (b) Loading plot of the first two dimensions for the metabolites. 141 Table 6.2 Correlation between the metabolic fluxes and the LDH release Metabolite Correlation Beta- hydroxybutyrate 0.96 Acetoacetate 0.74 Lactate 0.47 Glutamate 0.37 Alanine 0.27 Fatty acid 0.25 NH4 (iri 0.19 Serine 0.14 Histidine -0.09 Aspartate -0.09 Triglyceride -0.33 Glutamine -0.35 Arginine -0.5 Lysine -0.51 Proline -0.52 Glycine —0.55 Tyrosine —0.55 lsoleucine -0.56 Leucine -0.6 Threonine -0.6 Ornithine -0.61 Valine -0.62 Phenylalanine -0.64 Methionine -0.69 Glycerol -0.7 Cysteine -0.84 Gene Sets Enriched by Palmitate 37 gene sets involved in a variety of cellular processes and organelles, such as fatty acid metabolism, cell death, TNF-a signaling, etc., were evaluated to identify which of these processes were associated with palmitate-induced cytotoxicity (Table 6.1). Of the 37 gene sets evaluated, 14 were found to be significantly enriched with nominal p values less than 0.05, and shown in Table 6.3. The pathways that were enriched included 142 electron transport chain (ETC), fatty acid metabolism, glycolysis, oxidative phosphorylation, ROS, the pentose phosphate pathway (PPP), cell death, fatty acid beta- oxidation, TCA cycle, fi'uctose metabolism, glutathione, and the ERK l, ERK 2, MAP kinase pathways. Therefore, it suggested that the altered fatty acid metabolism may be associated with oxidative stress related pathways such as ROS, glutathione, oxidative phosphorylation and ETC to induce cytotoxicity. Interestingly, the gene set belonging to sphingolipid (ceramide) metabolism was not selected, suggesting that ceramide metabolism does not play an important role in the observed toxicity. Notably, all the enriched gene sets belong to the mitochondria, for example fatty acid beta-oxidation, electron transport chain, oxidative phosphorylation, indicating a central role of this organelle in the observed cytotoxicity. To further support this finding, the gene set of mitochondria with 229 genes, defined according to MsigDB, was found to be significantly changed in the palmitate-treated cells, with a nominal p value of 0.00189. Genes Important to Cytotoxicity To identify the genes that regulate the metabolic functions most closely associated with cytotoxicity, a MBPLS model was developed to predict the metabolic processes based upon the expression data of the gene sets identified by GSEA. Ketogenesis was identified to be positively related to LDH release by the FDA and correlation analysis. Therefore, a MBPLS model was developed to predict BOH based upon the expression profiles of the 14 enriched gene sets shown in Table 6.3. Cytoxicity-related functional pathways were identified by evaluating the block weights. Based upon the block weigth of the MBPLS 143 Table 6.3 Enriched Gene sets in GSEA analysis NOM p- Gene Sets SIZE NES val Electron Transport Chain 43 -1.998 0 Fatty acid metabolism 33 -1.824 0 Glycolysis 55 -1 .765 0 Oxidative Phosphorylation 30 -1 .89 0.004 ROS 26 -1 .741 0.009 PPP pathway 15 -1.644 0.01 Cell death 15 -1.631 0.01 ERK pathway 31 -1.62 0.016 Fatty acids beta- oxidation 7 -1 .58 0.02 Fatty acid biosynthesis 5 -1.574 0.021 TCA 15 -1 .631 0.023 Fructose 19 -1 .599 0.026 ERK1 Erk2 Mapk pathway 29 -1 .542 0.034 Glutathione 23 -1 .52 0.036 Mitochondria 229 -1 .945 0.002 Note: Size: number of genes in the gene set; NES: normalized enrichment score; NOM p-val: nominal P value. identified by evaluating the block weights. Based upon the block weight of the MBPLS model, it was identified that functional groups such as glycolysis, oxidative phosphorylation, ETC, ERK, ROS, glutathione, and fatty acid metabolism had relatively higher weights, suggesting the involvement of these pathways in inducing the cytotoxicity. The genes relevant to these functional groups were identified by evaluating the regression coefficients of the individual genes, and are discussed below. Identification and literature evaluation of the cytotoxic pathways through BOH BOH and acetoacetate were identified to be important in separating the phenotypes. Increased production of ketone bodies, such as BOH and acetoacetate, were observed in the palmitate treated cells. Increased fatty acid oxidation has been shown to be associated with oxidative stress [Gill 2006, Sanya 2003]. Thus, the genes with large (positive or 144 negative) regression coefficients to BOH release, for example, could suggest pathways relevant to the cytotoxicity. The regression coefficients of the genes were inspected to identify the individual genes that were important. The genes with the largest regression coefficients are listed in Table 6.4. Genes relevant to ROS production, fatty acid metabolism, and detoxification of lipid peroxidation products were found to have high regression coefficients. Genes involved in ETC and oxidative phosphorylation including NADH dehydrogenase with positive regression coefficients, indicated a positive role in the cytotoxicity. ROS are produced during the electron transfer and oxidative phosphorylation process. NADH dehydrogenase complex, or complex I of the electron transport chain, is a main source of superoxide production [Moller 2001]. Table 6.4 Genes selected by MBPLS model Num VAID GenName 933 CASP8 and FADD-like apoptosis regulator 1088 glutathione S-transferase M1 1095 Glucose-6-phosphatase, catalytic 1160 mitogen-activated protein kinase 3 1 172 dihydrolipoamide dehydrogenase 1197 cytochrome P450, family 2, subfamily A, polypeptide 7 1222 clusterin (corn plement lysis inhibitor, SP-40) 1247 cytochrome P450, family 2, subfamily E, polypeptide 1 1319 aldehyde dehydrogenase 3 family, member 82 1322 alcohol dehydrogenase IB (class I), beta polypeptide 1332 acyi-Coenzyme A dehydrogenase, C-4 to C-12 straight chain 1341 ATPase, H+ transporting, lysosomal 42kDa, V1 subunit C, isoform 1 1342 ATPase, Cu++ transporting, alpha polypeptide nuclear factor of kappa light polypeptide gene enhancer in B-cells 1 1 3 1 554 (p105) 14 1650 TNF receptor-associated factor 2 [‘I’55353,NM_021 138.3,7186] 15 2024 forkhead box M1 [AA136566,NM_202003.1.2305] 16 2056 TNF receptor-associated factor 3 [AA504259,NM_145725.1 .71 87] 17 2111 brain and reproductive organ-expressed (TNFRSF1A modulator) a—l OOG‘JODU'leN-‘O ‘4 NA 145 Table 6.4 (Continued) 1 8 1 9 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 54 2152 2177 2179 2201 2202 2244 2408 2427 2510 2541 3926 4111 4824 5100 5110 5146 5147 5148 5186 5532 5640 5982 6532 7320 7546 8188 8331 8512 8692 8749 8959 9089 9112 9148 9160 9269 9282 9404 9456 10037 10252 10378 10441 10486 11001 11072 12701 peroxiredoxin 2 [H68845,NM_005809.4,7001] HIV-1 Rev binding protein [AA485958,NM_004504.3,3267] Fas (TNFRSF6)—associated via death domain isocitrate dehydrogenase 3 (NAD+) gamma caspase 8, apoptosis-related cysteine protease glutathione S-transferase M4 [AA486570,NM_000850.3,2948] v-raf murine sarcoma 3611 viral oncogene homolog neutrophil cytosolie factor 1 (47kDa) FOS-like antigen 2 ['l’58873,NM_005253.3,2355] zinc finger protein 118 [N57658,NM_006955.1,7558] (gF) insulin-like growth factor 1 receptor_(AA256532,_,Hs.239176) interleukin 1 7D [N92873,NM_1 38284.1 ,53342] myeloperoxidase [R05801,NM_000250.1 ,4353] FOS—like antigen 2 [T 58873,NM_005253.3,2355] v—raf murine sarcoma 3611 viral oncogene homolog baculoviral lAP repeat-containing 3 [AA002126,NM_001165.3,330] caspase 3, apoptosis-related cysteine protease baculoviral lAP repeat-containing 2 [AA702174,NM_001166.3,329] platelet-derived growth factor receptor, alpha polypeptide alcohol dehydrogenase 4 (class II), pi polypeptide BCL2-associated athanogene 4 [N25897,NM_004874.2,9530] (gC) ESTs, Highly similar to AS3983 pyruvate kinase (EC 2.7.1.40) KIAA1117 [AA001870,NM_015018.2,23033] tumor necrosis factor, alpha-induced protein 3 aldehyde dehydrogenase 1 family, member A2 tumor necrosis factor receptor superfamily, member 18 nerve growth factor receptor (TNFR superfamily, member 16) clusterin-like 1 (retinal) [H91647,NM_199167.1,27098] epithelial membrane protein 2 [T88721,NM_001424.3,2013] Kruppel-like factor 6 [AA055585,NM_001300.4,1316] fucose-1-phosphate guanylyltransferase [R38619,NM_003838.2,8790] glutathione S-transferase M5 [N56898,NM__000851.2,2949] fructose-1,6-bisphosphatase 1 [AA699427,NM_000507.2,2203] mitogen-activated protein kinase kinase 2 [AA425826,NM_030662.2,5605] cytochrome P450, family 2, subfamily J, polypeptide 2 aldehyde dehydrogenase 1 family, member A1 acyI-Coenzyme A dehydrogenase, C-2 to C-3 short chain ATP citrate lyase [H08548,NM_001096.2,47] acylphosphatase 1, erythrocyte (common) type acyl-CoA synthetase long-chain family member 3 mitogen-activated protein kinase kinase kinase 7 chemokine (C-C motif) ligand 5 [AA486072,NM_002985.2,6352] NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 4, 9kDa glutathione peroxidase 3 (plasma) [AA664180,NM_002084.2,2878] ATPase, H+ transporting, lysosomal 13kDa, V1 subunit G isoform 1 Phosphoglucomutase 3 [AA1891 13,NM_015599.1 .5238] amyotrophic lateral sclerosis 2 (juvenile) chromosome region, candidate 13 146 Table 6.4 (Continued) 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 12963 1 3744 1 3975 1 6995 17052 17066 171 14 17207 1 7299 1 7355 1 7461 1 7484 18054 18637 1971 7 receptor (TNFRSF)—interacting serine-threonine kinase 1 interleukin 17B [AA443286.NM_014443.2,27190] oxidative-stress responsive 1 [R95128.NM_005109.1,9943] copper chaperone for superoxide dismutase [N30404,NM_005125.1 .9973] nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, beta dual specificity phosphatase 2 [AA759046,NM_004418.2.1844] ecotropic viral integration site 1 [AA181023,NM_005241.1,2122] epithelial membrane protein 3 [W73810,NM_001425.1,2014] amyotrophic lateral sclerosis 2 (juvenile) chromosome region. candidate 3 TNFRSF1A-associated via death domain [AA916906,NM_003789.2,8717] serine/threonine kinase 25 (STE20 homolog, yeast 2,3-bisphosphoglycerate mutase [AA678065,NM_199186.1 .669] mitogen-activated protein kinase kinase kinase 3 [ NADH dehydrogenase (ubiquinone) flavoprotein 1, 51 kDa (gC) alcohol dehydrogenase IB (class I), beta polypeptide ACAC (Metabolic flux) BOH (Metabolic flux) TG (Metabolic flux) LDH (Phenotype) Experimental Validations In the hierarchical approach, FDA suggested a protective role of TG accumulation. Indeed, our laboratory had found that supplementing palmitate cultures with oleate increased TG synthesis and reduced the cytotoxicity significantly, to the same levels as the control [Li 2006]. GSEA found ROS to be an important factor associated with the cytotoxicity of saturated FFA and TNF-a. Indeed, our laboratory has found that ROS levels were increased in response to palmitate and that ROS played an important role in the observed cytotoxicity in the palmitate cultures [Srivastava 2006]. NADH dehydrogenase was identified by MBPLS to be one of the genes that is positively related to cytotoxicity. This was experimentally validated for its role in regulating palmitate- induced cytotoxicity [Srivastava 2006]. The levels of ROS and LDH release were 147 measured in the palmitate cultured cells with NADH dehydrogenase inhibitor and both were significantly reduced [Srivastava 2006, Li 2006], which confinned the role of NADH dehydrogenase in the observed toxicity as predicted by the model. 6.2.4 Discussion A hierarchical framework was developed to integrate phenotypic, metabolic and gene expression profiles with functional genomics information to identify the pathways which played an important role in regulating FFA and TNF-or induced cytotoxicity in HepG2/C3A cells. The phenotype was connected to the gene expression profile through the intermediate metabolic level. Metabolic changes that contributed and are relevant to the cytotoxic phenotype were identified by discriminant and correlation analyses. Pathway-based gene expression analysis, namely GSEA, was used to identify the functional pathways that were associated with the cytotoxic phenotype. Metabolic and signaling pathways identified from the gene expression analysis support the results obtained by FDA and correlation analysis of the metabolic analysis and provided additional information on the mechanisms of toxicity. While GSEA was able to identify the gene sets important for the phenotype, it did not provide information on which genes within these sets were responsible for mediating the cytotoxic effect. MBPLS offered the advantage of identifying the individual genes that were directly related to the desired phenotype. We applied this framework to a model system of HepG2/C3A cells and successfully identified the targets to reduce F FA toxicity. For example, GSEA identified that mitochondrial and ROS-related genes, but not ceramide synthesis contribute to the toxicity, which was found experimentally to be indeed the case [Srivastava 2006]. The 148 MBPLS prediction of the role of NADH dehydrogenase in regulating cytotoxicity was validated with complex I inhibitor studies [Srivastava 2006, Li 2006]. 6.2.5 Conclusion In summary, this section illustrated how phenotypic, metabolic and genetic profiles can be integrated hierarchically to identify phenotype relevant genes and pathways. The metabolites whose levels were most strongly related to the cytotoxicity were identified. The genes associated with these metabolic processes were subsequently identified. Determining the genes which control the levels of these metabolites provided important information on the likely mechanism of toxicity and identified the targets to control the cytotoxicity of palmitate. This approach identified the involvement of ROS generation, altered fatty acid and energy, but not ceramide metabolism in the cytotoxicity. Thus, the integration of metabolic and genetic information can provide a more comprehensive picture of the cellular states and provide potential targets that regulate the cellular I'CSpOl'lSCS . 6.3 Inferring active pathways that confer a phenotype involving multiple metabolites 6.3.1 Introduction In the current application of T IPS© to identify cytotoxicity related pathways in HepG2 cells, the profile of LDH release was used to characterize the phenotype. In order to define a phenotype more accurately and robustly, more information, e.g. more metabolites should be incorporated. Metabolites are connected to each other through the metabolic network, which are implicitly captured by the metabolite profiles [Li and Chan 149 2004b]. Therefore, multiple metabolites can discriminate different phenotypes more readily than a single biomarker. 6.3.2 Materials and Methods Extracting pathways for multiple metabolites Let S(t) be the profile of pathways related to a phenotype in experiment t and L1 (t) be the profile of metabolite L1 in experiment t, L2(t) be the profile of metabolite L2 in experiment t. S is constrained to have a correlation with L1 and L2 by equation (6.2), where p2 is a threshold value. Corr2 = ST*L1*L1T*S/(S*ST) *A+ ST*L2*L2T*S/(S*ST) *B>p2 (6.2) A and B in the above equation defines the weight of each latent variable that contributes to the overall correlation of the pathways related to a phenotype. 6.3.3 Results and discussion Preliminary study to identify genes relevant to both LDH and TG We applied the approach described above to identify genes relevant to two cellular processes, TG and LDH, hereby denoted as two factors. The results are presented in Table 6.5. The genes selected using only LDH are listed in Table 6.6 for comparison. Table 6.5 Important genes selected for two factors TG and LDH. Access Num Name AA018907 protein phosphatase 3 (formerly 28), regulatory subunit B, 19kDa, alpha isoform R06605 protein tyrosine phosphatase, non-receptor type 1 (PTPN1) AA160670 lysophosphatidic acid phosphatase (ACP6) AA464163 acyl-Coenzyme A dehydrogenase, very long chain (ACADVL) R39463 aldolase C, fructose-bisphosphate (ALDOC) AA437389 Lanosterol synthase (2,3-oxidosqualene-Ianosterol cyclase) R46512 protein phosphatase 1, regulatory subunit 7 150 Table 6.5 (Continued) T55835 AA644448 R46823 AA457700 T81 764 T40568 T71 976 AA490466 T77729 AA789328 N93686 M59851 3 N71 653 AA520978 N50655 AA1291 71 M443899 AA421 269 T61 256 M487588 N531 69 H21 869 AA443577 H08753 R71 691 R2221 9 W92859 AA1 86686 R06458 R98442 AA434420 M464957 N46830 AA01 0079 N58305 AA279072 AA425450 R95841 AA682469 AA022480 AA497051 AA41 7279 R53787 AA45481 9 AA608857 N62379 R88440 AA41 6584 H29475 histone deacetylase 6 (HDAC6) protein tyrosine phosphatase, receptor type, U (PTPRU) N-acetylgalactosaminidase, alpha- (NAGA) stearoyl-CoA desaturase (delta-9—desaturase) (SCD) cell division cycle 27 (CDC27) phosphatidylinositol-4-phosphate 5-kinase, type II, alpha (PIP5K2A) phosphatidic acid phosphatase type 28 (PPAPZB) Gap junction protein. beta 2. 26kDa (connexin 26)_(AA490466,_,Hs.323733) Pyruvate carboxylase (PC). nuclear gene encoding mitochondrial protein cyclin-dependent kinase (CDC2-like) 10 (CDK10) Aldehyde dehydrogenase 3 family, member B1 (ALDH3B1) protein tyrosine phosphatase, receptor type, F (PTPRF) aspartoacylase (aminoacylase 2, Canavan disease) (ASPA) Ubiquitin-conjugating enzyme E2H (UBC8 homolog, yeast) (UBEZH) solute carrier family 27 (fatty acid transporter), member 3 (SLC27A3) protein phosphatase 2, regulatory subunit B (356). beta isofonn (PPPZRSB) scavenger receptor class B, member 1 (SCARB1) phosphatidylinositol 4-kinase, catalytic. alpha polypeptide (PIK4CA) ketohexokinase (fructokinase) (KHK) ATPase. H+ transporting, lysosomal interacting protein 1 (ATP6IP1) apolipoprotein C-lll (APOC3) COX10 homolog tumor necrosis factor (ligand) superfamily, member 13 (T NFSF1 3) guanine nucleotide binding protein (G protein) TNF receptor-associated factor 1 (T RAF1) (gF) phosphate cytidylyltransferase 2, ethanolamine (PCYT2) protein tyrosine phosphatase prostaglandin E synthase 2 (PTGESZ) lecithin-cholesterol acyltransferase UDP-glucose ceramide glucosyltransferase-Iike 1 protein tyrosine phosphatase, non-receptor type 9 (PTPN9) Pleckstrin homology, Sec7 and coiled/coil domains 2 (cytohesin-Z) (PSCD2) Cytokine receptor-like factor 3 (CRLF3) protein kinase, interferon-inducible double stranded RNA dependent (PRKR) hydroxysteroid (17-beta) dehydrogenase 7 inositol polyphosphate phosphatase-like 1 (INPPL1) glycoprotein (transmembrane) nmb (GPNMB) ribosomal protein 86 kinase, 90kDa alcohol dehydrogenase lB (class I) CREB binding protein (Rubinstein-Taybi syndrome) (CREBBP) sialyltransferase (STHM) protein tyrosine phosphatase. non-receptor type substrate 1 (PTPNS1) protein phosphatase 2. regulatory subunit 8 (856), epsilon isoform (PPP2R5E) mitogen-activated protein kinase 3 serine/threonine protein kinase SSTK (SSTK) l-kappa-B-interacting Ras-like protein 1 Pas apoptotic inhibitory molecule 2 chemokine (C-C motif) ligand 3 Pyruvate dehjdrogenase kinase, isoengyme 2 (PDK2) 151 Table 6.5 (Continued) AA86281 3 AA4431 21 AA464590 AA236957 R83038 T61 271 AA446839 N34876 R6261 2 AA676836 AA464421 H75862 AA452802 M098980 AA485397 M490501 N801 29 AA1 48548 N59764 AA460646 M28062 AA4551 02 W80489 AA644550 H98694 AA451 886 N481 78 M405901 M678095 AA464580 AA001 614 M485347 H03436 M488373 AA1 56571 AA459265 H92232 AA429946 T62040 AA663439 H53340 AA401 1 1 1 AA42934 T81 033 cytochrome c oxidase subunit Vlll (COX8), nuclear gene encoding mitochondrial protein apoptosis-inducing protein D (APPD) protein tyrosine phosphatase. receptor type, N polypeptide 2 (PTPRN2) Rac/Cdo42 guanine nucleotide exchange factor (GEF) 6 (ARHGEF6) low density lipoprotein receptor-related protein 5 (LRP5) phospholipase A2, group IIA (platelets, synovial fluid) (PLA262A) BCL2/adenovirus E1B 19kDa interacting protein 3 (BNIP3) TRAP-binding protein domain fibronectin 1 (FN1) acid sphingomyelinase-like phosphodiesterase (ASM3A) zinc finger protein 144 (Mel-18) (ZNF144) Apoptotic chromatin condensation inducer in the nucleus (ACINUS) solute carrier family 4, sodium bicarbonate cotransporter. protein kinase C-like 2 (PRKCL2) Ubiquitin-Iike 4 UV radiation resistance associated gene (UVRAG) metallothionein 1L (MT1 L) fatty acid binding protein 3. muscle and heart (mammary-derived growth inhibitor) (FABP3) guanine monphosphate synthetase (GMPS) peroxisomal biogenesis factor 6 (PEX6) diacylglycerol kinase, delta 130kDa (DGKD) heat shock 70kDa protein 2 acylphosphatase 1. erythrocyte (common) type (ACYP1) translocase of outer mitochondrial membrane 20 (yeast) homolog (KIAA0016) Pl-3-kinase-related kinase SMG-1 (SMG1) cytochrome P450. subfamily l (dioxin-inducible) phosphoinositide-binding protein PlP3-E NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 3. 9kDa (NDUFA3) G protein-coupled receptor 48 (GPR48) acetyl-Coenzyme A carboxylase alpha insulin receptor_ protein phosphatase 1, regulatory (inhibitor) subunit 11 (PPP1R11) UDP-GlcNAczbetaGal beta-1.3-N-acetylglucosaminyitransferase 3 (B3GNT3) Phosphoglucomutase 1 (PGM1) alanyl-tRNA synthetase (AARS) (gF) C/EBP-induced protein (LOC81558) guanine nucleotide binding protein (G protein), alpha 11 (Go class) (GNA11) peroxisomal short-chain alcohol dehydrogenase (humNRDR) electron-transfer-flavoprotein, beta polypeptide (ETFB) solute carrier family 25 (mitochondrial carrier; adenine nucleotide translocator) metallothionein 1 G (MT1 G) glucose phosphate isomerase (GPI) Rho GTPase activating protein 1 (ARHGAP1) zinc finger protein 148 (pHZ-52) 152 Table 6.6 Important genes selected for LDH alone Access Num AA626370 AA86281 3 AA464590 AA453404 AA400973 AA495981 T67006 T81 033 AA430654 T73556 AA485898 W92859 H75862 H57850 T47788 W8571 0 H1 1 054 AA8724ZO N721 15 R2221 9 M644550 H77542 H92556 R31 562 R50407 AA676836 AA482070 R95841 T64223 AA481 464 H0981 9 AA1691 76 AA41 1656 R36587 H98694 AA1 60670 H95038 AA458563 AA776294 H6661 7 AA451 935 AA098980 N34876 AA281 945 Gene Name (gM) heme oxygenase (decycling) 2 (HMOXZ) (gC) cytochrome c oxidase subunit VIII (COX8) (gC) protein tyrosine phosphatase, receptor type, N polypeptide 2 (PTPRN2) (gM) PPAR binding protein (9) lipocalin 2 (oncogene 24p3) (LCN2) (gC) Rho GTPase activating protein 6 (ARHGAP6) (gF) glucokinase (hexokinase 4) regulatory protein (GCKR) (gC) zinc finger protein 148 (pHZ-52) (gN) ATPase, H+ transporting. lysosomal V0 subunit 3 isofonn 1 (ATP6VOA1) (gM) fatty-acid-Coenzyme A ligase, long-chain 2 (FACL2) (gC) apoptosis related protein APR-3 (APR-3) (gC) protein tyrosine phosphatase, non-receptor type 1 (90) apoptotic chromatin condensation inducer in the nucleus (ACINUS) (90) protein phosphatase 2 (formerly 2A), regulatory subunit A (PR 65). beta isoform (PPP2R1 B) (gC) apoptosis-inducing factor (AlF)-homologous mitochondrion-associated inducer of death (AMID) (gC) carnitine palmitoyltransferase l, muscle (CPT1 B) (gC) protein kinase C, delta (gN) collagen. type Vlll. alpha 1 (COL8A1) (gC) cyclin-dependent kinase inhibitor 2C (p18, inhibits CDK4) (CDKN2C) (gF) phosphate cytidylyltransferase 2. ethanolamine (PCYT2) (gM) translocase of outer mitochondrial membrane 20 (yeast) homolog (KIAA0016) (gC) hyperpolarization activated cyclic nucleotide-gated potassium channel 3 (9C) eukaryotic translation initiation factor 2B. subunit 2 beta, 39kDa (EIF2B2) (gC) GDP-diacylglycerol synthase (phosphatidate cytidylyltransferase) 1 (C081) (gM) chemokine (C-X-C motif) ligand 2 (CXCL2) (gC) acid sphingomyelinase-like phosphodiesterase (ASM3A) (gC) Rho guanine exchange factor (GEF) 16 (ARHGEF16) (gC) ribosomal protein 86 kinase, 90kDa, polypeptide 3 (9M) carboxypeptidase A3 (mast cell) (CPA3) (gC) peptidylprolyl isomerase B (cyclophilin B) (PPIB) (gC) phosphomevalonate kinase (PMVK) (gC) glycerol-3-phosphate dehydrogenase 2 (mitochondrial) (gC) chemokine (C-X-C motif) ligand 16 (CXCL16) (gM) phosphatidylinositol (4.5) bisphosphate 5-phosphatase homolog (gC) Pl-3-kinase—related kinase SMG-1 (SMG1) (gC) lysophosphatidic acid phosphatase (ACP6) (gl) uncoupling protein 4 (gN) ribulose-5-phosphate-3-epimerase (RPE) (gC) Rab geranylgeranyltransferase, alpha subunit (RABGGTA) (gC) golgi apparatus protein 1 (GLG1) (gC) apoptosis inhibitor 5 (API5) (gC) protein kinase C-like 2 (PRKCL2) (gC) TRAF-binding protein domain (90) MAP-kinase activating death domain (MADD) 153 Table 6.6 (Continued) AA45481 0 T65790 R5378? AA022480 AA1291 71 AA01 8907 AA279072 H1 751 3 AA463931 M6001 73 AA609421 R941 91 AA459265 AA431 988 W65461 AA437389 AA487586 AA434420 AA487588 AA464580 N71 497 R98442 H04769 AA490494 T61 256 AA486407 T55835 R39463 AA490501 AA451 886 AA280677 AA457700 R83038 N62847 T73468 AA443577 AA682469 T77729 AA644448 W20487 N91 1 35 (g) tumor-associated calcium signal transducer 2 (TACSTDZ) (gC) famesyl diphosphate synthase (famesyl pyrophosphate synthetase (gC) protein phosphatase 2, regulatory subunit B (B56), epsilon isoform (PPP2R5E) (gC) CREB binding protein (Rubinstein-Taybi syndrome) (CREBBP) (90) protein phosphatase 2. regulatory subunit B (356) (9C) protein phosphatase 3 (formerly 2B), regulatory subunit B, 19kDa (gK) inositol polyphosphate phosphatase-like 1 (INPPL1) (gF) heat shock 70kDa protein 1-like_(H17513,_.Hs.80288) (gC) inositol 1.3,4-triphosphate 5/6 kinase (ITPK1) (gC) ubiquitin-conjugating enzyme EZA (RADG homolog) (UBEZA) (gC) membrane protein, palmitoylated 6 (MAGUK p55 subfamily member 6) (MPP6) (gC) translocase of outer mitochondrial membrane 70 homolog A (yeast) (T OMM70A) (gF) C/EBP-induced protein (LOC81558), mRNA. (AA459265,NM_030802.Hs.9851 ) (gC) fatty acid amide hydrolase (FAAH), mRNA. (AA431988,NM__001441,Hs.288828) (gC) dual specificity phosphatase 5 (DUSP5), mRNA. (W65461,NM_004419,Hs.2128) (g) lanosterol synthase (2,3-oxidosqualene-lanosterol cyclase)_(AA437389._.Hs.93199) (gC) inorganic pyrophosphatase (SlD6-306), mRNA. (AA487586,NM_006903.Hs.5123) (gC) protein tyrosine phosphatase, non-receptor type 9 (PTPN9) (gC) ATPase, H+ transporting, lysosomal interacting protein 1 (ATP6IP1) (gC) acetyl-Coenzyme A carboxylase alpha_(AA464580,_,Hs.7232) (gC) low density lipoprotein receptor-related protein 6 (LRP6) (gC) UDP-glucose ceramide glucosyltransferase-like 1_(R98442,_,Hs.105794) (gC) protein kinase (CAMP-dependent. catalytic) inhibitor beta_(H04769,_. Hs.106106) (gC) tumor necrosis factor receptor superfamily, member 21 (T NFRSF21) (gM) ketohexokinase (fructokinase) (KHK) (gN) phosphodiesterase 4D interacting protein (myomegalin) (gC) histone deacetylase 6 (HDACG) (gC) aldolase C. fructose-bisphosphate (ALDOC) (gC) UV radiation resistance associated gene (UVRAG) (gC) cytochrome P450, subfamily I (dioxin-inducible) (gC) zinc finger protein 258 (ZNF258) (gC) stearoyl-CoA desaturase (delta-9—desaturase) (SCD) (90) low density lipoprotein receptor-related protein 5 (LRP5) (g) lysosomal-associated membrane protein 2 (LAMP2) (gC) glutathione S-transferase A2 (GSTA2) (gC) tumor necrosis factor (ligand) superfamily, member 13 (TNFSF13) (gC) alcohol dehydrogenase IB (class I), beta polypeptide (90) pyruvate carboxylase (PC) (90) protein tyrosine phosphatase. receptor type (90) BCL2-related protein A1 (BCL2A1) (9N) chloride intracellular channel 3 (CLIC3) 154 Table 6.6 (Continued) AA425450 (gC) glycoprotein (transmembrane) nmb (GPNMB) R71691 (gC) TNF receptor-associated factor 1 (TRAF1) AA186686 (gC) prostaglandin E synthase 2 (PTGESZ) AA454588 (gC) BCL2-Iike 2 (BCL2L2) T71976 (gC) phosphatidic acid phosphatase type 23 (PPAP2B) N46830 (gC) cytokine receptor-like factor 3 (CRLF3) T81764 (90) cell division cycle 27 (CD027) AA434130 (gC) thioredoxin reductase 2 (T XNRDZ) AA520978 (gC) ubiquitin-conjugating enzyme E2H (UBC8 homolog, yeast) . (gC) protein kinase, interferon-inducible double stranded RNA dependent AA010079 (PRKR) AA446839 (gC) BCL2/adenovirus E1 B 19kDa interacting protein 3 (BNIP3) (gC) cytochrome P450, subfamily l (aromatic compound-inducible). polypeptide AA418907 1 (CYP1A1) R06605 (gC) protein tyrosine phosphatase, non-receptor type 1 (PTPN1) N62379 (gC) l-kappa-B-interacting Ras-like protein 1 R46823 (g) N-acetylgalactosaminidase, alpha- (NAGA) Comparing Tables 6.5 with 6.6, we can see that many important genes such as SCD, ACC, TNF super family, TRAFI, Bel-2 were selected in both cases. However, the two factor case selected more lipid metabolism related genes, e. g. acyl-CoA dehydrogenase and diacylglycerol kinase. In addition, in the two factors case selected NADH dehydrogenase, which is involved in complex I of the electron transport chain. It is a major site of ROS production and has been shown to play an important role in the palmitate-induced toxicity [Srivastava 2006, Li 2006]. Therefore, more useful information was extracted with multiple factors. The next paragraph describes how the T IPS© framework would be extended to incorporate more than two metabolites. To incorporate more than two metabolites or cellular processes such as the entire metabolome profile into the T IPS© framework, latent variables can be extracted from the metabolome data with methods such as PCA, PLS [Griffin 2004], and FDA. L1 and L2 in Equation 6.2 would be substituted with the extracted latent variables that represent the metabolome data. PCA and PLS extracts latent variables that account for most of the 155 variance in the data while FDA extracts latent variables that have the largest discriminating capability. To incorporate nonlinearity effects, kernels may be applied to the aforementioned methods. The number of latent variables may be determined based upon the variance captured by the latent variables. For example, to capture 90% of the variance in the raw data, we included enough latent variables to extract 90% (or more) of the variance. In the network reconstruction stage by BN, the latent variables (rather than simply LDH and TG) may be used as pseudo nodes to represent pathways, the biological meaning of the pathways are then determined from the genes in the latent variables [Griffin 2004]. An alternative method is to select the metabolites with high weights to be the latent variables that would then be used for the final network reconstruction in the BN step. This method is not limited to two factors as illustrated in equation 6.2, but may be extended to multiple factors, L1, L2, ..., Ln. 6.4 Using a Dynamic Bayesian Network (DBN) to identify active pathways from the dynamic profiles 6.4.1 Introduction The current TIPS© model handles static data, i.e. data at a single time point. However, cells continuously reprogram gene regulatory networks when they sense the changes in the environmental conditions. In order to understand how cells are regulated in response to environmental changes, time series data (dynamic data) are required. In our system, gene expression data as well as metabolites over three consecutive days (24, 48, 72 hours afier FFA and TNFa exposure) have been profiled. It has been observed that cytotoxicity was regulated by FFA and TNFa differently as a function of time [Srivastava 2006]. For example, TNFa increased cytoxicity on day 1 and had no effects on day 2. However it 156 decreased cytotoxicity on day 3. This suggests that the gene regulatory network changes over time. Uncovering this information would be valuable in helping to elucidate the mechanism of toxicity and cell death. Models, such as DBN, can be applied to model time series data. To apply DBN to identify the pathways that regulate cytotoxicity in HepG2 cells, we first applied a hierarchical framework introduced in section 6.2 to select the genes relevant to cytotoxicity. 6.4.2 Materials and Methods Dynamic Bayesian network analysis We applied Banjo (http://www.cs.duke.edu/~amink/software/ba_mio/) from Alexander Hartemink’s group to learn the DBN. Every node in the Bayesian network represents a gene, a metabolic flux or a phenotypic response. A first order Markov DBN was used, i.e. a node at a given time point can be influenced by itself or its parents in the immediate previous time point. The search and score approach was used to find a network structure with the maximum score. Bayesian Dirichlet equivalence (BDe) was used to score a network. Greedy search algorithm was used to search for the optimal network. 6.4.3 Results and Discussion Network Reconstruction 157 Table 6.7: Experimental conditions for DBN learning TNF-(nglml) FFA 0 20 100 Hepcz medium (H)|H-0 H-20 H-100 BSA(B) B0 320 B100 Palmitate(P) P0 P20 P100 Oleate(0) 00 020 0100 Linoleate(L) L0 L20 L100 Based upon the hierarchial approach in section 6.2, we selected 80 genes, 3 metabolites (BOH, ACAC, TG) and one phenotype (LDH) for DBN reconstruction. The genes, metabolites and phenotypic profiles are listed in Table 6.4. The network was learned for each treatment, i.e. HepG2 medium (H, control), Bovine Serum Albumin (B), Palmitate (P), Oleate (O), Linoleate (L) and their combinations with TNF—0r, separately, based upon the dynamic profile of the genes measured at 24, 48 and 72 hours after the treatments. As summarized in Table 6.7, we learned the network for 15 conditions with two replicates for each condition. The experiments were performed by Shireesh Srivastava and the gene expression data were obtained at the Van Andel Institute. All connections that appeared more than twice were combined together for network reconstruction. The network is shown in Figure 6.3, with the numbers in the network corresponding to the Nums in Table 6.4. From the network shown in Figure 6.4 we can see that the phenotype is mainly regulated by two categories of factors, oxidative stress related factors and TNF-0r signal pathway related factors. The oxidative stress related factors include NADH 158 dehydrogenase, aldehyde dehydrogenase (ALDH), glutathione-S transferase (GST), copper chaperone for superoxide dismutase (CCS) and CYP2E1. TNF-or related factors include NF-KB, IKB, TRAF, TNF superfamily, capase 8, inhibitor of apoptosis protein (lAP). These two categories interact with each other in the network. LDH was found connected directly to genes such as copper chaperone for superoxide dismutase (CCS) and biophosphoglycerate mutase (BPGM). CCS was connected to ALDH1A2, NADH dehydrogenase, CYP2E1, and GSTMS. All these factors are relevant to cellular ROS and lipid peroxidation and thus play a role in the palmitate-induced cytotoxicity. Excessive accumulation of ROS leads to oxidative damage, e. g. peroxidizing membrane to produce deleterious aldehydes such as malonedialdehyde and 4-hydroxy-2,3-(E)-nonenal (4- HNE). ROS are produced during the electron transfer and oxidative phosphorylation process. NADH dehydrogenase complex, or complex I of the electron transport chain, is a main source of superoxide production [Moller 2001]. Superoxide dismutase (SOD) protects cells from free radical damage by reducing superoxide anion radical into hydrogen peroxide. 4-HNE has been found to elicit cytotoxic effects in diverse cells [Hartley 1995]. The detoxification pathways of 4-I-INE include oxidation of its carbonyl group into carboxylic acid by ALDH or conjugation with glutathione by GSTs. Cytochrome P-450 e.g. CYP2E1 sequentially transfers two electrons from NADPH to molecular oxygen. In the course of electron transfer some of the activated oxygen is released as superoxides or H202 [Moller 2001]. Excessive accumulation of ROS leads to oxidative damage. Therefore, the connections among CCS, ALDH, NADH dehydrogenase, GST, CYP2E1 suggest the involvement of ROS in the observed cytotoxicity and possible detoxification mechanisms. BPGM is an enzyme involved in 159 glycolysis to convert 3-phospho-D-glyceroyl phosphate to 2,3-bisphospho-D-glycerate and it is connected to two apoptosis inhibitor genes of IAP2 and IAP3. The connections identified between LDH and IAP and BPGM suggested the involvement of glycolysis and apoptosis in the palmitate induced cytotoxicity. High glucose level has been reported to synergize with saturated fatty acids to cause increased cell death through apoptosis in pancreatic beta cells because elevated glucose level has inhibitory effects on the lipid detoxification via beta-oxidation [El-Assaad 2003]. In addition, high glucose metabolism may increase superoxide production, which also induces apoptosis [Okuyama 2003]. Indeed the glucose level in the HepG2 culture medium is very high. Apoptosis was initiated in the palmitate cultures as was evident by a significantly higher caspase activity detected in these cultures as compared to the control and unsaturated fatty acid cultures [Srivastava 2006]. The activation of NF-KB by the TNF-0L signaling pathway was uncovered by DBN. TNFRSFI connects to TRAF3, which then connects to IKB. IKB is connected to NF- KB through GSTM4. TNF-0r is a cytokine that can be detected by the TNF receptor which activates TRAF and IKK. Activation of IKK will phosphorate and degrade IKB to release NF-KB. It is encouraging that the DBN model was able to reconstruct this signaling pathway. The model also indicated that the NF-KB activation may be mediated by genes such as GST and ALDH. Discriminant and correlation analyses found the metabolic flux, BOH, to be the most relevant to the palmitate-induced cytotoxicity. BOH was found to be connected to NADH dehydrogenase and SOD. BOH are produced from acetyl-CoA, which is a metabolic intermediate of fatty acid oxidation. Fatty acid oxidation induces R08 160 production [Gill 2006, Sanya 2003]. NADH dehydrogenase complex, or complex I of the electron transport chain, is a main site of superoxide production. Thus, the connection between NADH dehydrogenase, SOD and BOH indicates the involvement of ROS production in the palmitate-induced cytotoxicity. Indeed, experimentally inhibiting NADH dehydrogenase in the palmitate cultures significantly reduced the LDH release [Srivastava 2006, Li 2006]. Therefore based upon the important genes identified by the hierarchical approach (described in section 6.2), DBN was able to reconstruct a regulatory network from the dynamic gene expression and metabolic profiles. The resulting network suggested possible interactions between the identified genes and metabolic fluxes, which indicated the involvement of some pathways such as apoptosis, NF-KB and ROS in the palmitate induced cytotoxicity. The inferred network structure may be used to construct an in silico regulatory network to simulate how BOH or LDH release is dynamically regulated. Thus, potential hypothesis and mechanisms involved in inducing cytoxicity may be identified based upon the reconstructed network. 6.5 Future work We discussed in this chapter the improvements to TIPS© including hierarchically integrating phenotypic metabolic, and gene expression profiles with genomic function information, inferring pathways that confer a phenotype using multiples metabolites and dynamic modeling. Future work would be to integrate all these features into a dynamic TIPS© fiamework. Briefly, metabolites important to a phenotype can be identified with the hierarchical approach using FDA. This method would be able to identify multiple metabolites that may be important in characterizing the phenotype. GSEA can be used to 161 define the functronal gene groups and identify significantly enriched ones Wthh can be integrated With the multiple metabolites, identified in FDA by GA/MB PLS to select important functronal gene sets for the several metabolites that are most relevant to a phenotype Fmally, important gene sets may be subjected to DBN analysrs to learn the regulatory network and reconstruct the pathways from time senes data /\//’,/ /I/ , :\ \\\ ‘1411115411391 11(6411 \‘ \/\/ \‘T' l Vj \ \ 1 Y, : \ \ l ‘1 I I / ‘1 (if; 1 . \1 // \ \J\ 1 “‘1 ‘ ‘ 1/ ‘ 1 ' \\ ,1 \ 1 43 ,1,“ \‘f /X'\<1 ‘ 50) 1 \\ ‘\ 1 1 2,. :\\\ 1’ \\\ \ \.'\:\ \ ‘ .\ ‘1 j‘\.1\;_1\/// \ \. \ 1: 1173,): 7603 '1‘ \\ : X i 1 - 1 4.1 \_ 1' , 1 , \ :1 162114611 11 ‘1‘ 1 e54 \ : 1 0111911 (74"441 '1, 1: ::I:3‘&*< v/“K/ ’11 1X/;‘\ / 1:11 M571110111,,58/“111119/11 77* 5713;,31 1;: ’ \ \\ ‘\ 15W {\37’1161116‘ "U42 Figure 6 3 DBN reconstrunction of the network. The nubers in the nodes correspond to the nums rn Table 6.4. 162 CHAPTER 7 CONCLUSIONS It is the objective of this research to develop approaches that can identify the pathways involved in conferring a phenotype or cellular state. Once these pathways are elucidated, targets may be identified and engineered to more accurately modulate or optimize the cellular state. Microarray gene expression and metabolic flux data were obtained to characterize cellular states. In addition to the measurements, a priori knowledge about the genes, metabolites and pathways available in literature and public databases were also used. Systems approaches were developed in this thesis to integrate these multiple sources of data to identify the genes and active pathways relevant to a particular phenotype. T [PS© was developed in chapter 2 to chapter 4 to identify the active pathways that regulate a phenotype. T IPSO applied GA/PLS and CICA to identify a subset of relevant genes, and this subset of genes was subsequently subjected to BN analysis to reconstruct active pathways. T IP50 was successfully applied to identify active pathways relevant to palmitate and TNF-01 induced cytotoxicity in HepG2 cells. A dynamic SSM was developed in chapter 5 to integrate gene expression and a priori knowledge of the gene regulatory network structure to infer underlying transcription factor activity profiles. SSM was successfully applied to infer transcription factor activities of E. C011 and yeast, which were validated by the literature. Extensions to TIPS© were discussed in chapter 6 to incorporate functional information in the gene selection process, infer pathways relevant to multiple metabolites, and from time series data using DBN analysis. In summary, systems approaches were developed to integrate gene expression, metabolic and phenotypic profiles, along with a priori information, to elucidate mechanisms and 163 pathways involved in conferring a particular phenotype. As a proof of concept, the approaches developed were applied to HepG2 cells exposed to FFAs and TNF-01 to identify genes and pathways that confer a cytotoxic vs. a cytoprotective phenotype. 164 Appendix Figure l BIBLIOGRAPHY Abreo K., Sella M., Alvarez-Hemandez X., Jain S. (2004) “Antioxidants prevent aluminum-induced toxicity in cultured hepatocytes” J Inorg Biochem. 98(6):] 129—34. Aeme B.L., Johnson A.L., Toyn J.H., Johnston L.H. (1998). “Swi5 Controls a Novel Wave of Cyclin Synthesis in Late Mitosis” Mol. Biol. Cell 9, 945-956. Akutsu T., Miyano S., Kuhara S. (1999) “Identification of Genetic Networks from a Small Number of Gene Expression Patterns Under the Boolean Network Model” chific Symposium on Biocomputing, 4:17-28. Al-Shahrour F., Minguez P., Vaquerizas J .M., Conde L., Dopazo J .(2005) “Babelomics: a suite of web-tools for functional armotation and analysis of group of genes in high-throughput experiments” Nucleic Acids Research. 33(Web Server issue), W460-W464. Alter 0., Brown P.O., Bostein D. (2000) “Singular value decomposition for genome-wide expression data processing and modeling” PNAS. 97(18):]0101-10106. Andersson C. A. and Bro R. (2000) “The N-way Toolbox for MATLAB”, Chemometrics & Intelligent Laborgtory Systems. 52 (1):1-4 Basso K., Margolin A.A., Stolovitzky G., Klein U., Dalla-Favera R., Califano A.(2005) “Reverse engineering of regulatory networks in human B cells” Nat Genet. 37(4):382-90. Bangalore A.S., Shaffer R. E., Small G.W. (1996) “Genetic algorithm-based method for selecting wavelengths and model size for use with partial least-squares regression: application to near-infrared spectroscopy.” Anal. Chem, 68: 4200-4212. Baetz K., Andrews B. (1999). “Regulation of Cell Cycle Transcription Factor Swi4 through Auto-Inhibition of DNA Binding.” Mol. Cell Biol. 19, 6729-6741. Bar-Joseph Z., Gerber G., Gifford D., Jaakkola T., Simon 1. (2002). “A New Approach to Analyzing Gene Expression Time Series Data.” 6th Annual International Conference on Research in Computgtional Molecular Biology. pp. 39-48 Beal M.J., Falciani F., Ghahramani 2., Range] C., Wild, BL, (2005) “A Bayesian approach to reconstructing genetic regulatory networks with hidden factors.” Bioinforrnatics 21(3): 349-356. Bligh E.G., Dyer W.J., (1959) “A rapid method of total lipid extraction and purification”, Can J Biochem Physiol. 37(8):911-7. 166 Bolon C., Gauthier C., Simonnet H. (1997) “Glycolysis inhibition by palmitate in renal cells cultured in a two-chamber system.” Am J Physiol. 273(5 Pt 1):C1732-8. Brenner D. A. (1998)“Signal transduction during liver regeneration”, J Gwenterol. Hmatol. 13 Suppl: 893-95. Brookes P. S. (2005)“Mitochondrial H(+) leak and ROS generation: an odd couple.” FE Radic Biol Med. 38(1):]2-23. Castillo E. F., Gutierrez J. M., and Hadi A. S. (1997) “Sensitivity analysis in discrete Bayesian networ ”, IEEE Transactions on System; Man,421nd Cybernetics - Pa_rt A: Systemsfi and Humans, 27:412-423. Chan C., Berthiaume F., Washuzu J ., Toner M., and Yarmush M. L. (2002) “Metabolic Pre-Conditioning of Cultured Cells in Physiological Levels of Insulin: Generating Resistance to the Lipid Accumulating Effects of Plasma in Hepatocytes” Biotechnology and Bioengineerigg, 78: 753-760. Chan C., Berthiaun‘re F ., Lee K., and Yarmush M. L. (2003a) “Metabolic Flux Analysis of Cultured Hepatocytes Exposed to Plasma”, Biotechnology and Bioengineering, 81: 33-49. Chan C., Berthiaume F., Lee K., and Yarmush M. L. (2003b) “Effect of Hormones on Liver-specific Function and Lipid Accumulation of Hepatocytes during Plasma Exposure Analyzed by Metabolic Flux Analysis”, Metabolic Engineering, 5: 1-15. Chan C., Hwang D. H., Stephanopoulos G. N., Yarmush M. L., and Stephanopoulos G. (20030) “Application of Multivariate Analysis to Optimize Function of Cultured Hepatocytes”, Biotechnology Prggress, 19: 580-598. Chen J .L., Peacock E., Samady W., Turner S.M., Neese R.A., Hellerstein M.K., Murphy E.J.(2005) “Physiologic and pharmacologic factors influencing glyceroneogenic contribution to triacylglyceride glycerol measured by mass isotopomer distribution analysis”, J Biol Chem. 280(27):25396-402. Chen T., He H. L., Church GM. (1999) “Modeling Gene Expression with Differential Equations”, Pacific Symposium on Biocomputing, 4:29-40 Cheng J., Kelly J ., Bell DA. and Liu W. (2002) “Learning belief networks from data: An information theory based approach”, Artificial Intelligence J ourn_al, 137: 43-90. Ciccoli L., Ferrali M., Casini A. F., Comporti M. (1981) “Effect of reduced glutathione on carbon tetrachloride induced fatty liver: various considerations” Boll Soc Ital Biol Sper. 57(13):]463-9. 167 Cooper GP. (1990) “The computational complexity of probabilistic inference using Bayesian belief networks”. Artificial Intelligence, 42(2-3):393-405 Cooper G. and Herskovits E. (1992) “A Bayesian method for the induction of probabilistic networks from data”, Machine Learning, 9:309-347. de Pablo M.A. (1999) “Palmitate induces apoptosis via a direct effect on mitochondria.” Apoptosis 4, 81 (1999). Deng X., Ito T., Carr B., Mumby M., May Jr, W.S. (1998) “Reversible Phosphorylation of Bch following Interleukin 3 or Bryostatin 1 Is Mediated by Direct Interaction with Protein Phosphatase 2A”. J. Biol. Chem. 273, 34157-34163. Dentin R., Benhamed F., Pegorier J. P., Foufelle F., Viollet B., Vaulont S., Girard J., Postic C. (2005) “Polyunsaturated fatty acids suppress glycolytic and lipogenic genes through the inhibition of ChREBP nuclear protein translocation.” J Clin Invest. 1 15(10):2843-54. di Bernardo D., Thompson M. J., Gardner T. S., Chobot S. E., Eastwood E. L., Wojtovich A. P., Elliott S. J ., Schaus S. E., and Collins J .J . (2005) “Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks.” Nature Biotechnology 23(3):377-383. Di Giovanni S., Mirabella M., Papacci M., Odoardi F., Silvestri G., Servidei S. (2001) “Apoptosis and ROS detoxification enzymes correlate with cytochrome c oxidase deficiency in mitochondrial encephalomyopathies”, Mol Cell Neurosci. 17(4):696-705. Ding W. X., Yin X. M.(2004) “Dissection of the multiple mechanisms of TNF-alpha- induced apoptosis in liver injury”, J Cell Mol Med, 8(4): 445-54. Dobrzyn P., Dobrzyn A., Miyazaki M., Cohen R, Asilmaz E., Hardie D.G., Friedman J. M., Ntambi J. M. (2004) “Stearoyl-CoA desaturase ] deficiency increases fatty acid oxidation by activating AMP-activated protein kinase in liver.” PNAS. 101(17):6409-6414. Eisen M.B., Spellman P.T., Brown PO. and Botstein D. (1998) “Cluster analysis and display of genome-wide expression patterns.” Proc. Natl. Acad. Sci. USA, 95(25): 14863-14868. El-Assaad W., Buteau J ., Peyot M. L., Nolan C., Roduit R., Hardy S., Joly E., Dbaibo G., Rosenberg L., Prentki M. (2003) “Saturated fatty acids synergize with elevated glucose to cause pancreatic beta-cell death.” Endocrinology. 144(9):4154-63. Farmer W.R., and Liao J .C. (2000) “Improving lycopene production in Escherichia coli by engineering metabolic control”, Nature biotechnology, 1825332537. 168 Felber J. P. and Golay A. (2002) "Pathways from obesity to diabetes." International Journ_al of Obesity 26(Suppl. 2): $39-$45. Fellenberg K., Hauser N.C., Brors B., Neutzner A., Hoheisel J .D., Vingron M. (200]) “Correspondence analysis applied to microarray data.” Proc. Natl. Acad. Sci. USA, 98(19):]0781-10786. Fleury C. (1997) “Uncoupling protein-2: a novel gene linked to obesity and hyperinsulinemia.” Nature Genetics 15, 269-272. Friedman N., Linial M., Nachman 1., and Pe' er D. (2000) “Using Bayesian networks to analyze expression data” J. Comp. Bio.. 7:601-620. Friedman N. (2004) “Inferring Cellular Networks Using Probabilistic Graphical Models.” Science 303(5659):799-805. Geladi P., and Kowalski B. P. (1986) “Partial least-squares regression: a tutorial.” Anal. Chim. Acta. 185,]. Gill H., Wu G. (2006) “Non-alcoholic fatty liver disease and the metabolic syndrome: Effects of weight loss and a review of popular diets. Are low carbohydrate diets the answer.” World J Ga_stroente;o_l 12(3):345-353 Goldberg D.E, Deb K. (1989) “Genetic algorithms in search, optimization, and machine learning” Addison-Wesley, Reading, MA. Gomez E.O., Mendoza-Milla C., Ibarra-Sanchez M.J., Ventura-Gallegos J.L. (1996). “Ceramide reproduces late appearance of oxidative stress during TNF-mediated cell death in L929 cells”, Biochem Biophys Res Commun. 228(2): 505-9. Griffin J .L. (2004) “An integrated reverse functional genomic and metabolic approach to understanding orotic acid-induced fatty liver.” Physiol. Genomics 17, 140-149. Haddad J. (2002) “Oxygen-sensing mechanisms and the regulation of redox-responsive transcription factors in development and path0physiology” Respirgtory Research 3, 26. Haddad J.J. (2004) “On the antioxidant mechanisms of Bcl-2: a retrospective of NF- [kappa]B signaling and oxidative stress.” Biochemical and Biophysical Research Communicgtions 322, 355. Hakarnada K., Hanai T., Honda H., and Kobayashi T., (2001) “Identifying genetic network using experimental time series data by Boolean algorithm”, Genome Informatics, 12:272-273. 169 Henrion M. (1988) “Propagating uncertainty in Bayesian networks by probabilistic logic sampling”. In Uncertainty in Artificial Intellgience 2, pages 149-163, New York, N. Y., Elsevier Science Publishing Company, Inc. Hartley D. P., Ruth J. A., Petersen D. R. (1995) “The hepatocellular metabolism of 4- hydroxynonenal by alcohol dehydrogenase, aldehyde dehydrogenase, and glutathione S-transferase”Arch Biochem Biophys. 316(1): 197-205. Hartemink A.J., Gifford D. K., Jaakkola T. S., and Young RA (2001) “Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networ ” Pacific Symposium on Biocomputing, 6:422-433. Hayes J. D., Pulford D. J. (1995) “The glutathione S-transferase supergene family: regulation of GST and the contribution of the isoenzymes to cancer chemoprotection and drug resistance.” Crit Rev Biochem Mol Biol. 30(6):445- 600. Heckerman D., Geiger D. and Chickering D.M. (1995) “Learning Bayesian networks: the combination of knowledge and statistical data” firehine LeannrgLJoumJal. 20:197-243. Heyninck K., Wullaert A., Beyaert R. (2003) “Nuclear factor-kappa B plays a central role in tumour necrosis factor-mediated liver disease”, Biochem. Phannacol. 66(8):]409-15. Holland J. (1975) “Adaptation in natural and artificial system.” The university of Michigan press, Ann Arbor. Holland M. (2002) “Transcript abundance in yeast varies over six orders of magnitude.” J. Biol. Chem. 277, 14363-14366. Houck C.R., Joines J.A., Kay MG. (1995), “A genetic algorithm for function optimization: a Matlab implementation,” NCSU-IETR 95-09. Hwang D. H., Stephanopoulos G., Chan C. (2004) “Inverse modeling using multi-block PLS to determine the environmental conditions that provide optimal cellular function,” Bioinformatics 20, 487-499. Ideker T., Ozier O., Schwikowski B., Siege] Andrew F. (2002) “Discovering regulatory and signalling circuits in molecular interaction networks.” Bioinformatics 18 Suppl l:S233-240. Idris 1., Gray 8., Donnelly R. (200]) “Protein kinase C activation: isozyme-specific effects on metabolism and cardiovascular complications in diabetes” Diabetologig, 44: 659-673. 170 Ikuko M., Hiroto M., Yuki K., Tetsuya K., Sachiko S., Shinya T. (2003) “Acetaldehyde Induces Granulocyte Macrophage Colony-Stimulating Factor Production in Human Bronchi through Activation of Nuclear Factor-KB” Allergy and Asthma Proceedings, 24 (5): 367-371 Izpisua J.C., Barber T., Cabo J., Hrelia S., Rossi C.A., Parenti C. G., Lerker G., Biagi P.L., Bordoni A., Lenaz G. (1989) “Lipid composition, fluidity and enzymatic activities of rat liver plasma and mitochondiral membranes in dietary obese rats.” Int J Obes. 13(4): 531-542. J anicke R., Droge W. (1985) “Effect of L-ornithine on proliferative and cytotoxic T-cell responses in allogeneic and syngeneic mixed leukocyte cultures” Cell Irnmunol. 92(2):359-65. Jim R., Si L., Srivastava S., Li Z., Chan, C. (2006) “A Knowledge Driven Regression Model for Gene Expression and Microarray Analysis”, EMBC, accepted. J etten A. M., Suter U. (2000) “The peripheral myelin protein 22 and epithelial membrane protein family”, Prog Nucleic Acid Res Mol Biol. 64297-129. Ji J ., Zhang L., Wang P., Mu Y. M. (2006) “Saturated free fatty acid, palmitic acid, induces apoptosis in fetal hepatocytes in culture”, Exp Toxicol Pathol. 56(6):369- 76. J ia Z., Xu S. (2005) “Clustering expressed genes on the basis of their association with a quantitative phenotype.” Genet Res. 86(3):]93-207. Jump DB. (2004) “Fatty acid regulation of gene transcription”, Crit Rev Clin Lab Sci. 41(1):41-78. Kao K.C., Yang Y.L., Boscolo R., Sabatti C., Roychowdhury V.P., Liao J.C., (2004) “Transcriptome-based determination of multiple transcription regulator activities in Escherichia coli by using network component analysis.” PNAS, 101(2): 641- 646. Kao K.C., Tran L.M., Liao J .C., (2005) “A global regulatory role of gluconeogenic genes in Escherichia coli revealed by transcriptome network analysis,” The Journal of Biological Chemistry, Vol 280 (43):36079-36087 Kanthasamy A.G., Kitazawa M., Kanthasamy A., Anantharam V. (2003) “Role of Proteolytic Activation of Protein Kinase C in Oxidative Stress-Induced Apoptosis ”, Antioxid Redox Siggal. 5(5):609-620. Kerszberg M. (2004) “Noise, delays, robustness, canalization and all that” Current Opinion in Genetics and Development 14: 440-445 171 Kemer J., Hoppel C. (2000) “Fatty acid import into mitochondria.” Biochimica et Biophysica Acta (BBA) - Molecular and Cell Biology of Lipids 1486, l). Kilpatrick L. E., Lee J. Y., Haines K.M., Campbell D.E., Sullivan K.E., Korchak HM. (2002) “A role for PKC-delta. and PI 3-kinase in TNF-alpha -mediated antiapoptotic signaling in the human neutrophil.” Am J Physiol Cell Physiol 283(1):C48-57. Kim BC. (2002) “Tumor Necrosis Factor Induces Apoptosis in Hepatoma Cells by Increasing Ca2+ Release from the Endoplasmic Reticulum and Suppressing Bcl-2 Expression.” J. Biol. Chem. 277, 31381-31389. Kim J. S. He L., and Lemasters J. J. (2003) “Mitochondrial permeability transition: 3 common pathway to necrosis and apoptosis”, Biochem Biophys Res Comm 304(3):463-70. Kobayashi M. (1998) "Molecular mechanism of insulin resistance." Saishin Igaku 53(6): 1210-1216. Kovacech B., Nasmyth K., Schuster T. (1996). “EGT2 Gene Transcription is Induced Predominantly by Swi5 in Early G1.” Mol. Cell. Biol. 16, 3264—3274. Leardi R. (2000) “Application of genetic algorithm-PLS for feature selection in spectral data sets.” J. Chemometrics. 14: 643-655. Lee T. 1., Rinaldi N. J ., Robert F., Odom D. T., Bar-Joseph Z., Gerber G. K., Hannett N. M., Harbison C. T., Thompson C. M., Simon 1., Zeitlinger J., Jennings E. G., Murray H. L., Gordon D. B., Ren B., Wyrick J. J., Tagne J.-B., Volkert T. L., Fraenkel E., Gifford D. K., and Young R. A. (2002) “Transcriptional Regulatory Networks in Saccharomyces cerevisiae”. Science, 298(5594): 799-804. Lewandowski E. D., Kudej R. K., White L. T., O'Donnell J. M., Vatner S. F. (2002) “Mitochondrial preference for short chain fatty acid oxidation during coronary artery constriction.” Circulation. 105(3):367-72. Liang S., Fuhrman S. and Somogyi R. (1998) “REVEAL, a general reverse engineering algorithm for inference of genetic network architectures.” Pacific Symposium on BiocomputingLBAS -29. Liao, J .C., Boscolo, R., Yang, Y.L., Tran, L.M., Sabatti, C., Roychowdhury, V.P., (2003), “Network component analysis: Reconstruction of regulatory signals in biological systems. ” PNAS, 100(26):15522-15527. Lieberrneister W. (2002) “Linear modes of gene expression determined by independent component analysis.” Bioinformatics 18, 51-60 (2002). 172 Listenberger L. L., Han X., Lewis S. E., Cases 8., Farese R.V., Jr. Ory D. S., Schaffer J .E. (2003) “Triglyceride accumulation protects against fatty acid-induced lipotoxicity.” Proceedings of the National Academy of Sciences of the United States of America 100(6):3077-3082. Listenberger L.L., Ory D.S., Schaffer J .E. (2001) “Palmitate-induced apoptosis can occur through a ceramide-independent pathway.” Journgl of Biological Chemistry 276, 14890-14895. Li Z., Chan C. (2004a) “Integrating Gene Expression and Metabolic Profiles.” J Biol Chem 279(26):27124-27137. Li Z., Chan C. (2004b) “Inferring pathways and networks with a Bayesian framework” FASEB J 18(6):746-748. Li Z., Srivastava S., Mittal 8., Norton P., Resau J., Haab B., Chan C. (2006) “A Hierarchical Approach to Identify Pathways that Confer Cytotoxicity in HepG2 Cells from Metabolic and Gene Expression Profiles”, in review. Lin S. M., Liao X., McConnell P., Vata K., Carin L., Goldschmidt P. (2001) “Using functional genomic units to corroborate user experiments with the Rosetta compendium.” Methods of Microarray Data Analysis II, Papers from CAMDA 11, Durham, NC, United States, Oct 15-16, 2001 2002:123-137. Loew L.M., Schaff J .C. (2001) “The virtual cell: a software environment for computational cell biology.” Trends in Biotechnology, 19(10): 401-406. Lopes J.A., Menezes J.C., Westerhuis J.A., Smilde A.K.. (2002) “Multiblock PLS analysis of an industrial pharmaceutical process” Biotechnol Bioeng. 80(4):4l9- 27 Lu Z.H., Mu Y.M., Wang B.A., Li X.L.(2003) “Saturated free fatty acids, palmitic acid and stearic acid, induce apoptosis by stimulation of ceramide generation in rat testicular Leydig cell”, BiochemicL and Biophysigrl Research Communication; 303(4): 1002-1007. MacKay V.L., Mai B., Waters L., Breeden LL. (2001). “Early Cell Cycle Box-Mediated Transcription of CLN3 and SW14 Contributes to the Proper Timing of the Gl-to- S Transition in Budding Yeast.” Mol. Cell Biol. 21, 4140-4148. McAdams H. and Arkin A. (1997). "Stochastic mechanisms in gene expression." PNAS 94(3): 814-819. McInemy C.J., Partridge J.F., Mikesell G.E., Greemer D.P., Breeden LL. (1997). “A novel Mcml-dependent element in the SW14, CLN3, CDC6, and CDC47 promoters activates M/Gl-specific transcription.” Genes & Dev. 11, 1277-1288. 173 Monk N. A.M. (2003) “Oscillatory expression of Hesl, p53 and NF—kB driven by transcriptional time delays,” Current Biology 13: 1409-1413 Murphy K. and Mian S. (1999). “Modelling gene expression data using Dynamic Bayesian Networks.” Technical report, University of California, Berkeley. Murphy K. (2001) “The Bayes net toolbox for matlab.” Computing Science and Statistics: Proceedings of Interface. 33 MacGregor J. F., Jaeckle C., Kiparissides C., Koutoudi, M., (1994) “Process monitoring and diagnosis by multiblock PLS methods” AIChE J ., 40, 826-838. Maki Y., Tominaga D., Okarnoto M., Watanabe S., Eguchi Y. (2001) “Development of a System for the Inference of Large Scale Genetic Networ ” Pacific Symposium on Biocomputing, 62446—458. Magana M.M., Koo S.H., Towle H.C., Osborne T.F. (2000) “Different sterol regulatory element-binding protein-1 isoforms utilize distinct co-regulatory factors to activate the promoter for fatty acid synthase.” The ioumal of biologal chemistry, 275(7): 4726-4733. Martens G.A., Cai Y., Hinke S., Stange G., Van de Casteele M., Pipeleers D. (2005) “Glucose suppresses superoxide generation in metabolically responsive pancreatic beta cells.” J Biol Chem. 280(21):20389-96. Martin-Requero A., Cipres G., Rodriguez A., Ayuso M. S., and Parrilla R. (1992) “On the mechanism of stimulation of ureagenesis by glueoneogenie substrates: role of pyruvate carboxylase.” American J ou_m_al of Physiology 263: E493-E499 Martoglio A.M., Miskin J.W., Smith S.K., MacKay D.J.C. (2002) “A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer.” Bioinformatics 18, 1617-1624. Meijer A. J ., Lamers W. H., Chamuleau R. A. F. M. (1990) "Nitrogen metabolism and ornithine cycle fimction." Physiol. Rev., 70: 701-748. Miller C.W., Ntambi J .M. (1996) "Peroxisome proliferators induce mouse liver stearoyl- CoA desaturase 1 gene expression." PNAS 93, 9443-9448. Moller I. M. (2001) “Plant mitochondria and oxidative stress: Electron Transport, NADPH Turnover, and Metabolism of Reactive Oxygen Species” Annu Rev Plant Physiol Plant Mol Biol. 52:561-591. Mootha V. K., Lindgren C. M., Eriksson K. F., Subrarnanian A., Sihag S., Lehar J., Puigserver P., Carlsson E., Ridderstrale M., Laurila E., Houstis N., Daly M. J., 174 Patterson N., Mesirov J .P., Golub T.R., Tarnayo P., Spiegelman B., Lander E.S., Hirschhom J .N., Altshuler D., Groop LC. (2003) “PGC-lalpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes” Nat Genet. 34(3):267-73. Morris S. M., Jr. (1992) "Regulation of enzymes of urea and arginine synthesis." Annu. Rev. Nutr. 12: 81-101. Morris S. M., Jr. (2002) "Regulation of enzymes of the urea cycle and arginine metabolism." Annu. Rev. Nu_tr., 22: 87-105. Motz C., Martin H., Krimmer T., Rassow J. (2002) "Bel-2 and Porin Follow Different Pathways of TOM-dependent Insertion into the Mitochondrial Outer Membrane." J ourngl of Molecular Biology 323, 729-738. Murphy K. and Mian S. (1999). “Modelling gene expression data using Dynamic Bayesian Networks.” Technical report, University of California, Berkeley. Murphy KR (2001) “The Bayesian Net Toolbox for Matlab”, Computing Science and Statistics, 33. Available at http://www.cs.berl_