This is to certify that the thesis entitled Random Forests and Gene Selection to Classify Arabidopsis Thaliana Ecotypes presented by Hsueh-han Yeh has been accepted towards fulfillment of the requirements for the M.S. degree in Statistics and Probability.

Random Forests and Gene Selection to Classify Arabidopsis Thaliana Ecotypes

By Hsueh-han Yeh

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

Department of Statistics and Probability

2007

ABSTRACT

This thesis discusses the classification and gene selection of ecotype data for Arabidopsis thaliana. Gene expression values from oligonucleotide gene expression arrays were used to classify Arabidopsis thaliana ecotypes using statistical methods. Hierarchical clustering was used to group ecotypes according to latitude and altitude. Limma was used to select differentially expressed genes. The Random Forest algorithm provides a ranking of genes that indicates how well they discriminate between ecotypes. We focus on the Random Forest algorithm, an efficient approach that can deal with a large number of predictor variables in a classification process. Its parameters are tuned to achieve a small classification error rate.
The genes in the final selection may play an important role in adaptation to stress conditions. They were further examined for gene function and evidence regarding stress resistance.

Keywords: Arabidopsis thaliana, Microarray Data, Hierarchical Cluster, Limma, Random Forest, Classification.

ACKNOWLEDGEMENTS

I wish to thank the many people who made this thesis possible. First of all, it is hard to overstate my gratitude to my advisor, Dr. Marianne Huebner, Department of Statistics, Michigan State University. With her enthusiasm, her patience, and her encouragement, she helped to make statistics and biology fun for me. Throughout my thesis-writing period, she provided many suggestions and lots of good ideas. Dr. Huebner also helped me revise my English. I am very glad to have worked with her and enjoyed it.

I wish to thank Dr. Andreas Weber for his support and grant. Dr. Weber also gave me suggestions for examining gene functions, which made this thesis complete.

I wish to thank my parents. They raised me, supported me, taught me, and loved me. To them I dedicate this thesis.

I wish to thank my best friend Hsiu-ching Chang, for helping me get through the difficult times, and for all the emotional support. My special gratitude is due to my brother, for his loving support.

I also wish to thank William Robert Swindell for giving many helpful suggestions on the biology section. Finally, I have to say 'Thank You' to all my friends and family, wherever they are and wherever they go.

TABLE OF CONTENTS

List of Tables
List of Figures
Chapter 1 Introduction of Microarray and Arabidopsis Ecotypes Data
1.1 Microarray Data
1.2 Arabidopsis thaliana
1.3 Gene Selection Process
Chapter 2 Statistical Methodology
2.1 Hierarchical Clustering
2.2 Limma - Linear Models for Microarray Data
2.3 Random Forest
Chapter 3 Application of Limma and Random Forest to Ecotypes
3.1 Gene Selection using Limma
3.2 Ecotypes of Cvi and Shakdara
3.3 Gene Selection from Cvi contrasts with other 8 ecotypes
3.4 Gene Selection from Shakdara contrasts with other 8 ecotypes
3.5 Gene Selection from Cvi500 and Sha500 by Random Forest
3.6 Compare the OOB error rate of Random Forest
3.7 Misclassifications of Ecotypes
Chapter 4 Gene Ontology
4.1 Gene Ontology with Classification SuperViewer
4.2 Gene Ontology of Cvi43 and Sha84
Appendices
Bibliography

LIST OF TABLES

Table 1 Ecotypes Geography Information
Table 2 Resources of Arabidopsis Genome
Table 3 Geography of Ecotypes
Table 4 The number of significant genes per contrast
Table 5 Comparison of OOB error rate
Table 6 Misclassification List
Table 7 Main Function categories of FunCat
Table 8 FLC, Cytochrome P450 genes and Glutathione-S-transferase genes

LIST OF FIGURES

Figure 1 First 10 Ecotypes Distribution Map
Figure 2 Hierarchical Clustering Process
Figure 3 Ecotype Cluster
Figure 4 Random Forest Construction
Figure 5 Number of significant contrasts
Figure 6 The number of significant genes per contrast
Figure 7 Significant genes for Latitude and Altitude
Figure 8 Gene expression of 247999_at
Figure 9 Optimal value of ntree for Cvi
Figure 10 Optimal value of mtry for Cvi
Figure 11 Optimal value of ntree for Shakdara
Figure 12 Optimal value of mtry for Shakdara
Figure 13 Optimal value of the number of genes for Cvi
Figure 14 Optimal value of the number of genes for Shakdara
Figure 15 Overlapping genes from Cvi500 and Sha500
Figure 16 Misclassification figure
Figure 17 Cvi43 - Classification SuperViewer
Figure 18 Sha84 - Classification SuperViewer
Figure 19 Expression graph for 5 specific genes

Chapter 1 Introduction of Microarray and Arabidopsis Ecotypes Data

1.1. Microarray Data

Regulatory regions of plant genes are likely to be more compact than those of animal genes, but the number of transcription factors encoded in plant genomes is larger than in animal genomes. Thus, plants can contribute to research on the influence of transcription factors in multicellular development. Here, we study the reference plant Arabidopsis thaliana, and the dataset is the AtGenExpress Ecotypes Expression data estimated by gcRMA. The data are part of the public AtGenExpress expression atlas, which was created on the Affymetrix ATH1 array platform.

Microarrays, in the form of oligonucleotide chips or spotted arrays, are a technology for studying the expression of thousands of genes at once. Microarray technology requires statistical methods suited to high-dimensional data sets. Statistical approaches can be used for multiple comparisons of genes to identify the genes that are differentially expressed between arrays. Data mining is widely used for microarray data since it can use a subgroup of genes to predict the class of the observations (e.g. ecotypes), which helps to reduce the dimension of the microarray data. In this study, we use a classification approach and data mining technique, Random Forest, to classify the Arabidopsis thaliana Expression Ecotypes Data.

1.2. Arabidopsis Data

Arabidopsis thaliana

The Arabidopsis ATH1 Genome Array, built in collaboration with TIGR (The Institute for Genomic Research), contains more than 22,500 probe sets representing approximately 24,000 gene sequences on a single array (http://www.affymetrix.com). Arabidopsis thaliana is a flowering plant, an inconspicuous weed. It has been used as a model plant organism for many years and has been chosen for use in molecular genetic analysis.
Laibach (1943) was the first to point out that several characteristics of Arabidopsis thaliana make it suitable as a model plant organism. It has a short life cycle; it needs only several weeks to mature. Due to its small size, it can grow in a limited area. Furthermore, it has a small genome and nearly non-repetitive DNA (S. Barth, A. E. Melchinger, T. Lübberstedt, 2002). These features make Arabidopsis thaliana plants very convenient for genetic analysis, and an international effort has therefore been devoted to building methods for studying its genome.

Arabidopsis thaliana at an early stage of flowering. [Drawing by K. Sutliff]

Arabidopsis thaliana Ecotype Data

Figure 1 First 10 Ecotypes Distribution Map [map marking Bay-0, C24, Col-0, Cvi, Est, Kin-0, Ler, Nd-1, Shakdara and Van-0]

An ecotype is a population of a plant that survives as a distinct group in its ecological environment. The AtGenExpress Ecotypes Data used in this thesis come from weigelworld (www.weigelworld.org) and include 34 ecotypes. Each ecotype is represented by one or several arrays of 22810 genes each. Arabidopsis thaliana is widely distributed (Meinke et al., 1998), and the 34 ecotypes in the Arabidopsis thaliana Ecotype data used in this study represent locations in Europe, North America and Africa. The location, longitude, latitude, and altitude of each ecotype are listed in Table 1. The latitudes of these ecotypes range from 16N to 59N. The longitudes range from 0.53E to 73E, and from 0.22W to 123W. The highest altitude is 3400 m. Overall, 27 ecotypes are distributed throughout Europe, 12 of them in Germany. The other ecotypes are distributed in North America and Africa. We want to examine whether these gene expressions can be used to classify Arabidopsis thaliana ecotypes by statistical methods.
First of all, the problem we confront is the large number of genes measured for each ecotype. Dimension reduction helps to deal efficiently with a large number of variables and to select the most important ones. We use Random Forest to reduce the size of the dataset and to classify ecotypes. The Random Forest algorithm is discussed in Chapter 2.

Table 1 Ecotypes Geography Information

Ecotype    Location                  Altitude (m)  Latitude  Longitude  Temperature (°C)
Bay-0      Bayreuth, Germany         350           49N       11 E       -2 to 18
C24        Coimbra, Portugal         179           40N       8 E        7.2 to 27
Col-0      Columbia University (US)  49            39N       93 W       -3.3 to 28.9
Cvi        Cape Verde Islands        43            16N       24 W       24 to 29
Est        Estonia                   15            59N       26 E       -5.2 to 17
Kin-0      Kinneville, MI            273           43N       85 W       -12.2 to 32.2
Ler        Landsberg, Germany        628           53N       16 E       -1.7 to 19.4
Nd-1       Niederzunzheim, Germany   250           50N       8 E        5.5 to 9.5
Shakdara   Pamiro-Alay, Tadjikistan  3400          37N       71 E       0 to 30
Van-0      UBC (Vancouver)           50            50N       123 W      0 to 26

Table 2 Resources of Arabidopsis thaliana Genome

Resource                      Contact Person  Website
Arabidopsis database (AtDB)   M. Cherry       http://genome-www.stanford.edu/Arabidopsis/
ABRC Stock Center (USA)       R. Scholl       http://aims.cps.msu.edu/aims
NASC Stock Centre (UK)        M. Anderson     http://nasc.nott.ac.uk
TIGR (USA)                    S. Rounsley     http://www.tigr.org/tdb/at/at.html
SPP Consortium (USA)          R. Davis        http://sequence-www.stanford.edu/ara/SPP.html
CSHL Consortium (USA)         R. McCombie     http://nucleus.cshl.org/protarab/
ESSA Consortium (Europe)      M. Bevan        http://muntjac.mips.biochem.mpg.de/arabi/index.html
Genoscope (France)            F. Quetier      http://www.genoscope.cns.fr/externe/arabidopsis/Arabidopsis.html
Kazusa Institute (Japan)      S. Tabata       http://www.kazusa.or.jp/arabi/

Source: David W. Meinke, J. Michael Cherry, Caroline Dean, Steven D. Rounsley, Maarten Koornneef. Arabidopsis thaliana: A Model Plant for Genome Analysis (1998).

1.3. Gene Selection Process

Group the first 10 ecotypes of the Arabidopsis thaliana Ecotypes Data (3 replications each) by latitude and by altitude using Hierarchical Cluster.
The four groups in each grouping are as follows: La4 (La-A, La-B, La-C, La-D) and Al4 (Al-A, Al-B, Al-C, Al-D).
For each of these two groupings (Al4 and La4), the contrasts A-B, A-C, A-D, B-C, B-D, and C-D are fitted with the Limma package of the R software.
The number of significant genes for each contrast in each grouping is counted.
After counting the number of significant genes, we found that Cvi (La-D) has the largest number of differentially expressed genes in comparison with the other 3 latitude groups. Shakdara (Al-A) has the largest number of differentially expressed genes in comparison with the other 3 altitude groups.
Cvi (smallest latitude) and Shakdara (highest altitude) are compared to the other ecotypes to identify genes that differentiate them.
Contrasts to be considered:
$$\mathrm{Cvi} - \tfrac{1}{8}\,(\mathrm{Bay0} + \mathrm{C24} + \mathrm{Col0} + \mathrm{Est} + \mathrm{Kin0} + \mathrm{Ler} + \mathrm{Nd1} + \mathrm{Van0})$$
$$\mathrm{Sha} - \tfrac{1}{8}\,(\mathrm{Bay0} + \mathrm{C24} + \mathrm{Col0} + \mathrm{Est} + \mathrm{Kin0} + \mathrm{Ler} + \mathrm{Nd1} + \mathrm{Van0})$$
The top 500 differentially expressed genes are selected from each of these two contrasts. The corresponding gene sets are Cvi500 and Sha500.
Optimal parameters, ntree and mtry, in Random Forest are chosen for Cvi500 and Sha500.
Highly ranked genes (by variable importance) are selected from Cvi500 and Sha500.
There are 43 genes chosen from Cvi500 and 84 genes chosen from Sha500.
Compare the OOB error rate for the selected genes.
Discuss misclassified arrays in Random Forest.
Gene functions of the selected genes are considered.

Chapter 2 Statistical Methodology

In this chapter, clustering (2.1), linear models for microarray data (2.2), and Random Forest (2.3) are discussed.

(2.1) Clustering is the first step in our gene selection process. In this section, we use the Hierarchical Clustering method to group the 10 ecotypes into subsets, and those subsets will be contrasted with linear models.

(2.2) Limma is the second step.
In this step, we use Limma to choose smaller subgroups of differentially expressed genes by contrasting the subsets of ecotypes obtained from the clustering result. We explain what differentially expressed genes are.

(2.3) Random Forest is a method to rank genes by their importance in classifying ecotypes. In this section, we explain the Random Forest algorithm and the selection of important predictor variables (genes) from the gene sets chosen with the linear models.

2.1. Clustering

Grouping a collection of observations into subgroups (clusters) is called clustering. Observations within each cluster are closer to each other than to observations assigned to other clusters. In Hierarchical Clustering (Jinwook Seo, Ben Shneiderman 2002), the observations are not separated into subgroups in a single step. Instead, they are separated by a series of partitions. Clustering may start from a single cluster containing all observations and split it into subgroups of observations; this is called the Divisive method. Alternatively (Figure 2), it may start from n clusters (for n observations), each containing one observation, and then repeatedly find the closest pair of clusters and combine them into a single cluster. In the end, all clusters are combined into one cluster; this is called the Agglomerative method. The Agglomerative method is used here to identify latitude and altitude groups (Table 3).

Table 3 Geography of Ecotypes

No.  Ecotype   Location                  Altitude  Group (Al)  Latitude  Group (La)
1    Bay-0     Bayreuth, Germany         350       C           49.56     B
2    C24       Coimbra, Portugal         179       C           40.2      C
3    Col-0     Columbia University (US)  49        D           43.0125   C
4    Cvi       Cape Verde Islands        43        D           16        D
5    Est       Estonia                   15        D           59        A
6    Kin-0     Kinneville, MI            273       C           42.466    C
7    Ler       Landsberg, Germany        628       B           48.2      B
8    Nd-1      Niederzunzheim, Germany   250       C           50.778    B
9    Shakdara  Pamiro-Alay,
Tadjikistan 3400 A 37.183 C
10   Van-0     UBC (Vancouver)           50        D           49.85     B

The Agglomerative method proceeds as follows. We are given a set of n observations (ecotypes) to be grouped and an n x n distance matrix (Euclidean distance) that holds the distance between each pair of observations.

Step1. Start with n clusters, each containing a single observation.
Step2. Select the closest pair of clusters and merge them into one new cluster.
Step3. Calculate the distances between the new cluster and the remaining clusters.
Step4. Repeat Step2 and Step3 until all observations are merged into one cluster.

Figure 2 Hierarchical Clustering Process [diagram: five observations E1 to E5 merged step by step by the Agglomerative method]

Hierarchical clustering is used to cluster the 10 ecotypes into subgroups according to their altitude and latitude (Figure 3). From Figure 3, we can see that Cvi and Shakdara differ the most from the remaining ecotypes.

Figure 3 Ecotype Cluster [dendrograms of the latitude and altitude clusters]

2.2. Limma - Linear Models for Microarray Data

Before Random Forest is applied to gene sets, we use Limma, Linear Models for Microarray Data (Smyth, G. K. 2004), to choose smaller subgroups of genes that differ between ecotypes. The grouping is discussed in the following paragraph. Differentially expressed genes will be used in Random Forest to classify ecotypes and to assign ranks to the genes.

Limma is used to identify genes whose expression pattern differs between groups. Limma is a software package in Bioconductor in the R environment (http://www.r-project.org) for the analysis of gene expression microarray data. A linear model is constructed for each gene to determine whether it is differentially expressed between the subgroups of ecotypes defined by the latitude and altitude clusters.
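The latitude clusters mentioned here come from the agglomerative steps of Section 2.1. As a minimal sketch only, the Python code below applies Step1 to Step4 to the ten latitudes of Table 3. It assumes single linkage (the distance between two clusters is the smallest distance between their members), which the text does not specify, and it stops at four clusters instead of merging down to one; the function names `single_link` and `agglomerate` are hypothetical.

```python
from itertools import combinations

# Latitudes from Table 3 (degrees north).
LATITUDE = {
    "Bay-0": 49.56, "C24": 40.2, "Col-0": 43.0125, "Cvi": 16.0, "Est": 59.0,
    "Kin-0": 42.466, "Ler": 48.2, "Nd-1": 50.778, "Shakdara": 37.183,
    "Van-0": 49.85,
}


def single_link(a, b, values):
    """Smallest absolute difference between members of clusters a and b.

    Single linkage is an assumption made for this sketch.
    """
    return min(abs(values[x] - values[y]) for x in a for y in b)


def agglomerate(values, n_clusters):
    # Step1: start with one cluster per observation.
    clusters = [frozenset([name]) for name in values]
    # Step4: repeat until the requested number of clusters remains.
    while len(clusters) > n_clusters:
        # Step2: select the closest pair of clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(pair[0], pair[1], values))
        # Step3: the merged cluster replaces the pair; its distances to the
        # other clusters are recomputed by single_link on the next pass.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


groups = agglomerate(LATITUDE, 4)
# Print the clusters from the highest to the lowest latitude.
for group in sorted(groups, key=lambda g: -max(LATITUDE[n] for n in g)):
    print(sorted(group))
```

Under this linkage assumption, stopping at four clusters reproduces the latitude groups A to D listed in Table 3: Est alone, the four ecotypes near 49N, the four ecotypes between 37N and 43N, and Cvi alone.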
In the topTable function of Limma, the M-value, t-statistic, B-statistic and P.Value of each gene provide an overall ranking of genes in order of differential expression. The M-value is the log2-fold change between two groups:

$$M = \log_2\left(\frac{\text{expression value of gene in group A}}{\text{expression value of gene in group B}}\right)$$

The t-statistic is the familiar statistic for testing the difference between the means of two groups. The B-statistic is the log odds that the gene is differentially expressed. For example, if the B-statistic is 3.5, the probability that the gene is differentially expressed is

$$\frac{e^{3.5}}{1 + e^{3.5}} \approx 97\%.$$

A larger B-statistic indicates a higher probability that the gene is differentially expressed. The P.Value is adjusted for multiple hypothesis testing using Benjamini-Hochberg's method (BH). B-statistics and P.Values provide the same ranking when no data are missing. In addition, differentially expressed genes are ranked in topTable by their P.Values. Benjamini-Hochberg's method controls the false discovery rate (FDR) when testing thousands of hypotheses, as in microarray data. We identify genes differentially expressed between the subgroups obtained from Hierarchical Cluster (Figure 3) and assign the letters A, B, C, D to those four groups (Table 3).

2.3. Random Forest

The Random Forest algorithm by Leo Breiman (L. Breiman 2001) is a classification procedure consisting of a collection of tree-structured classifiers. Each tree depends on an independent, identically distributed random vector, and each tree casts a unit vote for the class of each input vector (array). Random Forest can analyze high-dimensional data efficiently. Randomization enters Random Forest at two levels: trees and nodes. Trees are built from bootstrap samples, and each node is split using randomly selected predictor variables (genes). In the ecotype data, there are ten ecotypes and each of them has 3 arrays, so there are 30 arrays in the ecotype data. Moreover, each array has 22810 genes.
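The bootstrap sampling behind each tree can be illustrated for the 30 arrays with a short Python sketch. It only draws array indices with replacement and records which arrays were never drawn; no tree is grown. The function name `bootstrap_split`, the seed, and the count of 500 resamples are hypothetical choices made for illustration.

```python
import random


def bootstrap_split(n_arrays, rng):
    """Draw n_arrays indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_arrays) for _ in range(n_arrays)]
    drawn = set(in_bag)
    # Arrays that were never drawn form the out-of-bag set for this tree.
    out_of_bag = [i for i in range(n_arrays) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)          # fixed seed, chosen arbitrarily
n_arrays, n_trees = 30, 500     # 30 arrays; 500 resamples is an assumption

# Average share of arrays left out of bag over all resamples.
oob_fraction = 0.0
for _ in range(n_trees):
    in_bag, oob = bootstrap_split(n_arrays, rng)
    oob_fraction += len(oob) / n_arrays
oob_fraction /= n_trees

# Probability that a given array is never drawn in one bootstrap sample.
expected = (1 - 1 / n_arrays) ** n_arrays
print(round(oob_fraction, 3), round(expected, 3))
```

For 30 arrays the probability of being left out, (1 - 1/30)^30, is about 0.36, so roughly one third of the arrays are available to test each tree.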
In the Random Forest, the 30 arrays are the "input vectors" (class observations) and the 22810 genes are the "predictor variables". N arrays are selected at random, with replacement, from those 30 arrays to form the training set (in-bag). The arrays which are not included in the training set are called out-of-bag (OOB). The training set data are used to grow the tree. The OOB data are used to estimate the classification error rate and to obtain a variable importance measure.

Figure 4 Random Forest Construction [diagram: 30 arrays (from 10 ecotypes, 3 arrays each; each array contains 22810 predictor variables (genes)); 500 bootstrap samples, each split into in-bag and OOB data; each tree is built from a single in-bag sample]

Each tree is grown as follows.

Step 1. The training set consists of N observations (arrays) selected at random. Take N observations (arrays) at random and with replacement from the original data set; these are called "in-bag". The observations not selected are called "out-of-bag". On average, about two thirds of the observations are "in-bag" and about one third are "out-of-bag".

Step 2. The observations selected for the training set are used to construct a decision tree. The number of variables is M. A fixed number m [...]

[Plot: x-axis "Number of mtry" (130 to 170); labeled points mtry133, mtry148, mtry161, mtry163]

3.5. Gene Selection from Cvi500 and Sha500 by Random Forest

In order to reduce the number of genes in Cvi500 and Sha500, we select important genes with Random Forest, but the question is how many genes are needed for the best classification performance. Besides ntree and mtry, we are interested in the number of genes that yields the smallest OOB error rate.
From the above procedure for finding the optimal mtry and ntree (Figures 9, 10, 11, 12), values of ntree greater than 200 give a stable, small OOB error rate, whereas the value of mtry shows no clear association with the OOB error rate. Thus, we select the most important genes from Random Forest with ntree=200 and keep mtry at its default, in Cvi500 and Sha500 respectively. To rank the genes, MeanDecreaseAccuracy was used as the importance measure. In Cvi500, 43 genes is the smallest number for optimal classification. In Sha500, 84 genes is the smallest number for optimal classification. We then compare those two sets of selected genes: there are 43 genes from the intersection of Cvi500 and Sha500, and there are 4 genes from the intersection of Cvi43 and Sha84 (Figure 15).

Figure 13 Optimal value of the number of genes for Cvi [plot: x-axis "Number of variables used" (1 to 500, logarithmic scale); marked point: 43 genes]

Figure 14 Optimal value of the number of genes for Shakdara [plot: x-axis "Number of variables used" (1 to 500, logarithmic scale); marked point: 84 genes]

Figure 15 Overlapping genes from Cvi500 and Sha500 [diagram: 102 genes overlapping Cvi500 and Sha500; 4 genes from the intersection of Cvi43 and Sha84]

3.6. Compare the OOB error rate of Random Forest

Several sets of genes were selected with Limma and Random Forest. We have two sets of 500 genes selected from the topTable of Limma; they are Cvi500 and Sha500. Moreover, we have a set of 43 genes from Cvi500, and a set of 84 genes from Sha500. The following table shows the OOB error rate for Cvi500 and Sha500 and compares the results obtained with and without the optimal values of ntree and mtry.
In addition, Table 5 shows the OOB error rate for the selected sets of 43 and 84 genes.

Table 5 Comparison of OOB error rate

Genes    Status                                                                    Number of Genes  OOB error rate
Cvi500   Without optimal values of ntree and mtry                                  500              16.67%
Cvi500   With optimal values of ntree and mtry and the smallest number of genes    43               6.67%
Sha500   Without optimal values of ntree and mtry                                  500              10%
Sha500   With optimal values of ntree and mtry and the smallest number of genes    84               3.33%

3.7. Misclassifications of Ecotypes

When running the Random Forest, some arrays are misclassified. Each run gives different misclassified ecotypes. Table 6 shows the misclassified arrays from all Random Forest runs. The most frequently misclassified ecotypes are Van-0 and Kin-0. The array ATGE_116_B.CEL (Kin-0) is often misclassified.

Table 6 Misclassification List

Array            Actual Ecotype  Predicted Ecotype
ATGE_112_A.CEL   C24             Shakdara
ATGE_115_D.CEL   Est             Col-0
ATGE_116_A.CEL   Kin-0           Van-0
ATGE_116_B.CEL   Kin-0           Van-0, Shakdara, Bay-0
ATGE_116_C.CEL   Kin-0           Van-0
ATGE_117_D.CEL   Ler             Est
ATGE_120_A.CEL   Van-0           Kin-0
ATGE_120_C.CEL   Van-0           Kin-0

Figure 16 Misclassification figure [plot of predicted class against true class; legend: misclassification, no misclassification]

Chapter 4 Gene Ontology

4.1. Gene Ontology with Classification SuperViewer

We have identified genes that may be important in adaptation. We selected two groups of genes, Cvi43 and Sha84, based on Random Forest. Cvi grows close to the equator, off the coast of Africa, at higher temperatures than the other ecotypes, and Shakdara comes from mountainous (near the Himalayas), landlocked Tadjikistan in Central Asia and is thus exposed to a harsh climate (e.g. temperature). The adaptation of these two ecotypes has likely been driven by these stress conditions. We argue that the selected genes are important for stress resistance.
In order to validate the genes we selected with Random Forest, we classify the functions of each group of genes using the website "The Bio-Array Resource for Arabidopsis thaliana Functional Genomics" (http://bar.utoronto.ca/). Its web-based tool, Classification SuperViewer, creates an overview of the functional classification of a group of AGI genes based on the MIPS database (Munich Information Center for Protein Sequences). Currently, there are 25450 genes with MIPS classifications in the MAtDB (MIPS Arabidopsis Thaliana Database). Here we do not focus on single genes. Instead, we want to find gene functions that are overrepresented in the selected sets of genes and that can provide important information on stress response.

Gene function classification is an approach for grouping genes based on functional similarity. The Functional Classification Pie Chart often used in bioinformatics, however, provides only the absolute numbers and percentages of genes per function. Absolute numbers of genes in a functional class can be misleading when treatments and situations differ, but normalizing the group of genes avoids this problem. In this way, differences in gene function are detected more easily. Classification SuperViewer includes normalization and bootstrap sampling, and it provides a confidence estimate for the accuracy of the results. A large standard deviation indicates that a result may be spurious and unreliable. Moreover, if the confidence interval includes one, the apparent enrichment of the functional class may be due to a small number of genes, and thus the class score is unreliable. We only consider classes with a class score greater than one and a confidence interval not including one when checking whether these categories of functions are associated with stress response.
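The selection rule stated above can be made concrete with a small Python sketch. It assumes that the class score is the frequency of a functional class in the input set divided by its frequency in the reference set, and that the spread is estimated from 100 bootstrap resamples of the input set. All counts below are hypothetical, as are the function names and the labels "stress" and "other"; this is an illustration, not the SuperViewer implementation.

```python
import random
import statistics


def class_score(n_class_input, n_classified_input, n_class_ref, n_classified_ref):
    """Class frequency in the input set divided by that in the reference set."""
    return (n_class_input / n_classified_input) / (n_class_ref / n_classified_ref)


def bootstrap_scores(labels, target, n_class_ref, n_classified_ref,
                     n_boot=100, seed=0):
    """Class scores of `target` over bootstrap resamples of the input labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        # Resample the input set with replacement, keeping its size.
        sample = [rng.choice(labels) for _ in labels]
        scores.append(class_score(sample.count(target), len(sample),
                                  n_class_ref, n_classified_ref))
    return scores


# Hypothetical input set: 40 classified genes, 8 of them in the class of interest.
labels = ["stress"] * 8 + ["other"] * 32
# Hypothetical reference set: 1000 of 25000 classified genes fall in that class.
score = class_score(8, 40, 1000, 25000)
scores = bootstrap_scores(labels, "stress", 1000, 25000)
print(score, statistics.mean(scores), statistics.stdev(scores))
```

With these made-up counts the class is five times as frequent in the input set as in the reference set. Under the rule above, a class would be retained only if the interval derived from the bootstrap spread also stays above one.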
The normalized class score is calculated with the following equation, where N denotes a number of genes and the input set is Cvi43 or Sha84:

$$\mathrm{Score}_{class} = \frac{N_{class}(\text{input set}) \,/\, N_{classified}(\text{input set})}{N_{class}(25K) \,/\, N_{classified}(25K)}$$

One hundred bootstrap samples are drawn from the input set. Each bootstrap sample is classified, and its class score is computed with the equation above. Furthermore, the standard deviation of each class is shown along with the class score. If the class score is greater than one and the confidence interval does not include one, the gene ontology category is overrepresented within the group of genes. In the following section, this procedure is applied to the gene groups Cvi43 and Sha84 in Classification SuperViewer, and we discuss how their overrepresented gene functions affect the stress response. After that, we simplify the broad spectrum of known protein functions based on the FunCat annotation, which includes 7 main gene categories (Table 7).

Table 7 Main Function categories of FunCat

Metabolism
  01 Metabolism
  02 Energy
  04 Storage protein
Information pathways
  10 Cell cycle and DNA processing
  11 Transcription
  12 Protein synthesis
  14 Protein fate (folding, modification and destination)
  16 Protein with binding function or cofactor requirement (structural or catalytic)
  18 Protein activity regulation
Transport
  20 Cellular transport, transport facilitation and transport routes
Perception and response to stimuli
  30 Cellular communication/signal transduction mechanism
  32 Cell rescue, defense and virulence
  34 Interaction with the cellular environment
  36 Interaction with the environment (systemic)
  38 Transposable elements, viral and plasmid proteins
Developmental processes
  40 Cell fate
  41 Development (systemic)
  42 Biogenesis of cellular components
  43 Cell type differentiation
  45 Tissue differentiation
  47 Organ differentiation
Localization
  70 Subcellular localization
  73 Cell type localization
  75 Tissue localization
  77 Organ localization
  78 Ubiquitous expression
Experimentally uncharacterized proteins
  98 Classification not yet clear-cut
  99 Unclassified proteins

With the exception of categories 78, 98 and 99, all main categories are the origin of hierarchical, tree-like structures. To make the introduction of new main categories possible, the numbering of the categories is not strictly sequential.

Source: The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes, Nucleic Acids Research, 2004, Vol.32, No.18: 5539-5545.

4.2 Gene Ontology of Cvi43 and Sha84

Figure 17 Cvi43 - Classification SuperViewer [bar chart of class scores by functional category]

As we can see, there are five terms whose class scores are greater than one and whose confidence intervals do not include one. The number of genes in Cvi43 associated with terms (1)-(5) below is greater than expected by chance. In other words, terms (1)-(5) are overrepresented in the gene set Cvi43.

(1) CELL TYPE LOCALISATION
(2) REGULATION OF/INTERACTION W. CELLULAR ENVIRONMENT
(3) SYSTEMIC REGULATION OF/INTERACTION W. ENVIRONMENT
(4) TRANSPORT FACILITATION
(5) CELL RESCUE, DEFENSE AND VIRULENCE

According to Table 7, terms (2), (3) and (5) belong to the category Perception and response to stimuli. Plant perception refers to the detection of changes in the environment.
The stimuli that plants perceive and respond to include chemicals, gravity, light, moisture, infections, temperature, oxygen, and carbon dioxide. Plants detect stimuli in different ways and respond with a variety of reactions, but plant perception generally occurs at the cellular level. Thus, the selected genes are related to the climatic conditions for Cvi.

Figure 18 Sha84: Classification Superviewer
[Figure: class scores by functional category for the Sha84 gene set.]

In Figure 18, eight terms have class scores greater than one and confidence intervals that do not include one. Thus, terms (1)-(8) below are overrepresented in the gene set Sha84.

(1) STORAGE PROTEIN
(2) TISSUE LOCALISATION
(3) CELL TYPE DIFFERENTIATION
(4) ORGAN LOCALISATION
(5) TISSUE DIFFERENTIATION
(6) METABOLISM
(7) CELL RESCUE, DEFENSE AND VIRULENCE
(8) ENERGY

Terms (1), (6) and (8) cover all subcategories of the Metabolism group in Table 7. The definition of metabolism is: "Chemical process occurring within a living cell or organism, including anabolism and catabolism. Metabolism is a chemical process that typically transforms small molecules, but also includes macromolecular process and protein synthesis and degradation."

Metabolism is closely tied to energy. Under stress, some compounds are broken down in metabolism to yield energy, and this energy is then directed at repairing the damage caused by the stress. Metabolism is therefore likely to be an important factor under many different types of stressors. Under stress, plants may undergo a change of metabolism that directs energy away from growth and reproduction and toward cellular defense and maintenance. This, in turn, helps plants survive in harsh environments. Thus, the selected genes in Sha84 may be important for adapting to the climatic conditions at high altitude.

Moreover, cytochrome P450 genes and glutathione-S-transferase genes may play an important role in oxidative stress resistance, since oxidative stress is generated to some degree by all forms of stress. Several papers report that cytochrome P450 genes are important for plants. Oxidative detoxification of some herbicides in plant tissues is carried out by a cytochrome P450-dependent monooxygenase system (Donaldson and Luster 1991, Hatzios 1991, and Sandermann 1992). Cytochrome P450s play important roles in the biosynthesis of a variety of endogenous lipophilic compounds (Donaldson and Luster 1991 and Bolwell et al. 1994). Cytochrome P450 monooxygenases are a group of haem-containing proteins which catalyze various oxidative reactions (Schuler 1996 and Chapple 1998).

In addition, several papers indicate that glutathione-S-transferase plays an important role in plants. Glutathione S-transferases (GSTs) appear to be ubiquitous in plants and have defined roles in herbicide detoxification (Lamoureux and Rusness 1993). The fundamental function of GSTs is the detoxification of both endogenous and xenobiotic compounds (Marrs 1996). GSTs play a fundamental role in protection against endogenous or exogenous toxic chemicals (Sheehan et al. 2001). Furthermore, cytochrome P450 and glutathione-S-transferase are phase I and phase II detoxification enzymes, respectively. Therefore, finding such genes associated with any form of stress may be biologically meaningful.
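The class score and the bootstrap standard deviation behind Figures 17 and 18 can be sketched in R as follows. This is only a minimal illustration under stated assumptions, not the Classification Superviewer implementation: the function names class_score and bootstrap_class_score are hypothetical, each gene is assumed to carry exactly one category label, and the category counts are made up.

```r
# Hypothetical sketch of the class score and its bootstrap standard deviation.
# input_cat: one functional category label per classified gene in the input set.
# ref_cat:   the same for the reference set (the "25K" set in the equation).
class_score <- function(input_cat, ref_cat, category) {
  input_frac <- sum(input_cat == category) / length(input_cat)
  ref_frac   <- sum(ref_cat == category) / length(ref_cat)
  input_frac / ref_frac
}

bootstrap_class_score <- function(input_cat, ref_cat, category, B = 100) {
  scores <- replicate(B, {
    resampled <- sample(input_cat, replace = TRUE)  # resample genes with replacement
    class_score(resampled, ref_cat, category)
  })
  c(score = class_score(input_cat, ref_cat, category), sd = sd(scores))
}

# Toy data, made up for illustration only
set.seed(1)
ref_cat   <- rep(c("METABOLISM", "ENERGY", "TRANSPORT"), times = c(500, 100, 400))
input_cat <- rep(c("METABOLISM", "ENERGY", "TRANSPORT"), times = c(20, 12, 8))

res <- bootstrap_class_score(input_cat, ref_cat, "ENERGY")
print(res)
```

In this toy example ENERGY makes up 12/40 of the input set and 100/1000 of the reference set, so its score is well above one. How the confidence interval is formed from the standard deviation is not stated in the text, so that step is left out of the sketch.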
In addition, one gene in Cvi43 (At5g10140) is FLC (FLOWERING LOCUS C), which is a main determinant of flowering time. Arabidopsis thaliana grows in the Northern Hemisphere, where long daylight hours may affect flowering time. The transition to flowering is an important event in the plant life cycle and is regulated by several environmental factors: photoperiod, light quality, vernalization, and growth temperature, as well as biotic and abiotic stresses. Thus, FLC can respond to stresses and environmental effects.

The following 5 genes, identified in Cvi43 and Sha84, correspond to these 3 specific gene types (FLC, cytochrome P450 and glutathione-S-transferase). Figure 19 shows their expression.

Figure 19 Expression graph for 5 specific genes.
[Figure: five panels (253534_at, 252827_at, 262916_at, 264052_at, 250476_at), each plotting gene expression value against ecotype: Bay-0, C24, Col-0, Cvi, Est, Kin-0, Ler, Nd-1, Sha, Van-0.]

Table 8 FLC, Cytochrome P450 and Glutathione-S-transferase genes

AGI ID  Affy ID  Annotation
At4g31500  253534_at  CYP83B1_ATR4_RED1_RNT1_SUR2_CYP83B1 (CYTOCHROME P450 MONOOXYGENASE 83B1); oxygen binding
At4g39950  252827_at  CYP79B2_CYP79B2 (cytochrome P450, family 79, subfamily B, polypeptide 2); oxygen binding
At1g59700  262916_at  ATGSTU16_ATGSTU16 (Arabidopsis thaliana Glutathione S-transferase (class tau) 16); glutathione transferase
At2g22330  264052_at  CYP79B3_CYP79B3 (cytochrome P450, family 79, subfamily B, polypeptide 3); oxygen binding
At5g10140  250476_at  FLC_AGL25_FLF_FLC (FLOWERING LOCUS C)

Annotation from TAIR, "affy_ATH1_array_elements-2006-07-14.txt"

APPENDICES

APPENDIX A
Selected groups of Genes: Cvi43 & Sha84

Cvi43

Affy ID  AGI ID  Annotation
246173_s_at  At3g61520, At5g28370, At5g28460  pentatricopeptide (PPR) repeat-containing protein
246671_at  At5g30450
246862_at  At5g25760  UBC21_PEX4_PEX4 (PEROXIN4); ubiquitin-protein ligase
247760_at  At5g59130  subtilase family protein
247791_at  At5g58710  ROC7_ROC7 (rotamase CyP 7); peptidyl-prolyl cis-trans isomerase
248460_at  At5g50915  basic helix-loop-helix (bHLH) family protein
249752_at  At5g24660  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G24655.1); similar to unknown protein [Brassica rapa subsp. pekinensis] (GB:AAQ92331.1)
249780_at  At5g24240  phosphatidylinositol 3- and 4-kinase family protein / ubiquitin family protein
250476_at  At5g10140  FLC_AGL25_FLF_FLC (FLOWERING LOCUS C)
251241_s_at  At3g62460, At3g62530  similar to PBS lyase HEAT-like repeat-containing protein [Arabidopsis thaliana] (TAIR:AT3G62530.1); similar to 80C09_3 [Brassica rapa subsp. pekinensis] (GB:AAZ41814.1); similar to Os07g0637200 [Oryza sativa (japonica cultivar-group)] (GB:NP_001060400.1); contains InterPro domain Protein of unknown function DUF537; (InterPro:IPR007491)
251962_at  At3g53420  PIP2A_PIP2_PIP2A (plasma membrane intrinsic protein 2;1)
252168_at  At3g50440  hydrolase
252231_at  At3g49720  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G65810.1); similar to Os01g0144000 [Oryza sativa (japonica cultivar-group)] (GB:NP_001042001.1); similar to conserved hypothetical protein [Medicago truncatula] (GB:ABE78370.1); contains domain S-adenosyl-L-methionine-dependent methyltransferases (SSF53335)
252459_s_at  At3g47220, At3g47290  phosphoinositide-specific phospholipase C family protein
252529_at  At3g46490  oxidoreductase, 2OG-Fe(II) oxygenase family protein
252723_at  At3g43520  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT2G26240.1); similar to Os04g0653100 [Oryza sativa (japonica cultivar-group)] (GB:NP_001054104.1); similar to transmembrane protein 14C [Argas monolakensis] (GB:ABI52790.1); similar to Os03g0568500 [Oryza sativa (japonica cultivar-group)] (GB:NP_001050510.1); contains InterPro domain Protein of unknown function UPF0136, Transmembrane; (InterPro:IPR005349)
253532_at  At4g31570  similar to myosin-related [Arabidopsis thaliana] (TAIR:AT1G24460.1); similar to hypothetical protein, conserved [Leishmania major] (GB:CAJ07774.1); contains InterPro domain Prefoldin; (InterPro:IPR009053); contains InterPro domain t-snare; (InterPro:IPR010989)
253534_at  At4g31500  CYP83B1_ATR4_RED1_RNT1_SUR2_CYP83B1 (CYTOCHROME P450 MONOOXYGENASE 83B1); oxygen binding
254351_at  At4g22300  carboxylic ester hydrolase
254361_at  At4g22212  Encodes a defensin-like (DEFL) family protein.
254928_at  At4g11410  short-chain dehydrogenase/reductase (SDR) family protein
255257_at  At4g05050  UBQ11_UBQ11 (UBIQUITIN 11); protein binding
255307_at  At4g04900  RIC10_RIC10 (ROP-INTERACTIVE CRIB MOTIF-CONTAINING PROTEIN 10)
255578_at  At4g01450  nodulin MtN21 family protein
256497_at  At1g31580  ECS1_CXC750_ECS1
256863_at  At3g24070  zinc knuckle (CCHC-type) family protein
257071_at  At3g28180  ATCSLC04_ATCSLC4_CSLC04_ATCSLC04 (Cellulose synthase-like C4); transferase, transferring glycosyl groups
257205_at  At3g16520  UDP-glucoronosyl/UDP-glucosyl transferase family protein
259067_at  At3g07550  F-box family protein (FBL12)
259591_at  At1g28150  similar to Os04g0528100 [Oryza sativa (japonica cultivar-group)] (GB:NP_001053373.1)
259733_at  At1g77480  nucellin protein, putative
260232_at  At1g74640  similar to unknown protein [Oryza sativa (japonica cultivar-group)] (GB:BAD28539.1); contains domain no description (G3D.3.40.50.1820); contains domain alpha/beta-Hydrolases (SSF53474)
260244_at  At1g74320  choline kinase, putative
260252_at  At1g74240  mitochondrial substrate carrier family protein
263034_at  At1g24020  Bet v I allergen family protein
263777_at  At2g46450  ATCNGC12_CNGC12_ATCNGC12 (cyclic nucleotide gated channel 12); cyclic nucleotide binding / ion channel
265142_at  At1g51360  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT2G31670.1); similar to Hypothetical protein [Oryza sativa] (GB:AAK55783.1); contains InterPro domain Stress responsive alpha-beta barrel; (InterPro:IPR013097); contains InterPro domain Dimeric alpha-beta barrel; (InterPro:IPR011008)
265162_at  At1g30910  molybdenum cofactor sulfurase family protein
265486_at  (no AGI ID or annotation listed)
265699_at  At2g03550  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT3G48690.1); similar to unknown protein [Arabidopsis thaliana] (TAIR:AT3G48700.1); similar to Esterase/lipase/thioesterase [Medicago truncatula] (GB:ABE83378.1); contains InterPro domain Esterase/lipase/thioesterase; (InterPro:IPR000379); contains InterPro domain Alpha/beta hydrolase fold-3; (InterPro:IPR013094)
265768_at  At2g48020  sugar transporter, putative
266643_s_at  At2g29710, At2g29730  UDP-glucoronosyl/UDP-glucosyl transferase family protein
267093_at  At2g38170  CAX1_RCI4_CAX1 (CATION EXCHANGER 1); calcium:hydrogen antiporter

Sha84

Affy ID  AGI ID  Annotation
245038_at  At2g26560  PLA2A_PLA IIA_PLP2_PLA IIA_PLP2 (PHOSPHOLIPASE A 2A); nutrient reservoir
245400_at  At4g17040  ATP-dependent Clp protease proteolytic subunit, putative
245456_at  At4g16950  RPP5_RPP5 (RECOGNITION OF PERONOSPORA PARASITICA 5)
245977_at  At5g13110  G6PD2_G6PD2 (GLUCOSE-6-PHOSPHATE DEHYDROGENASE 2); glucose-6-phosphate 1-dehydrogenase
246642_s_at  At5g34920, At5g59620
246708_at  At5g28150  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT3G04860.1); similar to Os07g0572300 [Oryza sativa (japonica cultivar-group)] (GB:NP_001060057.1); similar to Os03g0806700 [Oryza sativa (japonica cultivar-group)] (GB:NP_001051637.1); similar to Protein of unknown function DUF868, plant [Medicago truncatula] (GB:ABE92686.1); contains InterPro domain Protein of unknown function DUF868, plant; (InterPro:IPR008586)
247210_at  At5g65020  ANNAT2_ANNAT2 (ANNEXIN ARABIDOPSIS 2); calcium ion binding / calcium-dependent phospholipid binding
247313_at  At5g63980  SAL1_FRY1_HOS2_SAL1 (FIERY1); 3'(2'),5'-bisphosphate nucleotidase/ inositol or phosphatidylinositol phosphatase
247404_at  At5g62890  permease, putative
247814_at  At5g58310  hydrolase, alpha/beta fold family protein
247999_at  At5g56150  UBC30_UBC30; ubiquitin-protein ligase
248079_at  At5g55790  unknown protein
248200_at  At5g54160  ATOMT1_OMT1_ATOMT1 (O-METHYLTRANSFERASE 1)
248427_at  At5g51750  subtilase family protein
248796_at  At5g47180  vesicle-associated membrane family protein / VAMP family protein
248800_at  At5g47320  RPS19_RPS19 (40S ribosomal protein S19); RNA binding
248961_at  At5g45650  subtilase family protein
249258_at  At5g41650  lactoylglutathione lyase family protein / glyoxalase I family protein
249567_at  At5g38020  S-adenosyl-L-methionine:carboxyl methyltransferase family protein
249610_at  At5g37360  similar to Os02g0815400 [Oryza sativa (japonica cultivar-group)] (GB accession illegible in source)
249645_at  At5g36910  THI2.2_THI2.2 (THIONIN 2.2); toxin receptor binding
249733_at  At5g24400  EMB2024_EMB2024 (EMBRYO DEFECTIVE 2024); catalytic
250072_at  At5g17210  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT1G61065.1); similar to unknown protein [Saussurea involucrata] (GB:ABC68264.1); similar to Os06g0114700 [Oryza sativa (japonica cultivar-group)] (GB accession illegible in source); similar to Os05g0534800 [Oryza sativa (japonica cultivar-group)] (GB:NP_001055640.1); contains InterPro domain Protein of unknown function DUF1218; (InterPro:IPR009606)
250633_at  At5g07460  PMSR2_PMSR2 (PEPTIDEMETHIONINE SULFOXIDE REDUCTASE 2); protein-methionine-S-oxide reductase
250751_at  At5g05890  UDP-glucoronosyl/UDP-glucosyl transferase family protein
251031_at  At5g02030  HB-6_LSN_BLH9_BLR_PNY_RPL_VAN_LSN (LARSON, VAAMANA); DNA binding / transcription factor
251903_at  At3g54120  reticulon family protein (RTNLB12)
252318_at  At3g48730  GSA2_GSA2 (GLUTAMATE-1-SEMIALDEHYDE 2,1-AMINOMUTASE 2); glutamate-1-semialdehyde 2,1-aminomutase
252462_at  At3g47250  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT3G47200.2); similar to hypothetical protein LOC_Os12g29620 [Oryza sativa (japonica cultivar-group)] (GB:ABA98257.1); similar to Os11g0543300 [Oryza sativa (japonica cultivar-group)] (GB:NP_001068043.1); similar to Os04g0505400 [Oryza sativa (japonica cultivar-group)] (GB:NP_001053253.1); contains InterPro domain Protein of unknown function DUF247, plant; (InterPro:IPR004158)
252478_at  At3g46540  epsin N-terminal homology (ENTH) domain-containing protein / clathrin assembly protein-related
252529_at  At3g46490  oxidoreductase, 2OG-Fe(II) oxygenase family protein
252659_at  At3g44430  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G41660.1)
252678_s_at  At3g44300, At3g44310  NIT2_NIT2 (NITRILASE 2)
252724_at  At3g43540  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G47860.1); similar to Os09g0436900 [Oryza sativa (japonica cultivar-group)] (GB:NP_001063263.1); similar to unknown protein [Oryza sativa (japonica cultivar-group)] (GB:BAD36432.1); contains InterPro domain Protein of unknown function DUF1350; (InterPro:IPR010765)
252827_at  At4g39950  CYP79B2_CYP79B2 (cytochrome P450, family 79, subfamily B, polypeptide 2); oxygen binding
252863_at  At4g39800  MI-1-P SYNTHASE_MI-1-P SYNTHASE (Myo-inositol-1-phosphate synthase); inositol-3-phosphate synthase
253422_at  At4g32240  unknown protein
253666_at  At4g30270  MERI5B_BRU1_MERI-5_MERI5B (MERISTEM-5); hydrolase, acting on glycosyl bonds
254248_at  At4g23270  protein kinase family protein
254508_at  At4g20170  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G44670.1); similar to Os06g0328800 [Oryza sativa (japonica cultivar-group)] (GB:NP_001057533.1); similar to Os02g0712500 [Oryza sativa (japonica cultivar-group)] (GB:NP_001047907.1); similar to unknown protein [Oryza sativa (japonica cultivar-group)] (GB:BAD72474.1); contains InterPro domain Protein of unknown function DUF23; (InterPro:IPR008166)
254553_at  At4g19530  disease resistance protein (TIR-NBS-LRR class), putative
255437_at  At4g03060  AOP2_AOP2 (ALKENYL HYDROXALKYL PRODUCING 2); oxidoreductase, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors
255859_at  At5g34930  arogenate dehydrogenase
256021_at  At1g58270  ZW9_ZW9
256096_at  At1g13650  similar to 18S pre-ribosomal assembly protein gar2-related [Arabidopsis thaliana] (TAIR:AT2G03810.3); similar to hypothetical protein [Trypanosoma cruzi strain CL Brener] (GB:XP_813437.1)
256221_at  At1g56300  DNAJ heat shock N-terminal domain-containing protein
256454_at  At1g75280  isoflavone reductase, putative
256458_at  At1g75220  integral membrane protein, putative
256489_at  At1g31550  carboxylic ester hydrolase/ lipase
256940_at  At3g30720  unknown protein
257205_at  At3g16520  UDP-glucoronosyl/UDP-glucosyl transferase family protein
257228_at  At3g27890  NQR_NQR (NADPH:QUINONE OXIDOREDUCTASE); FMN reductase
257580_at  At3g06210  binding
258124_at  At3g?8215 (one digit illegible in source)  similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G24600.1); similar to hypothetical protein [Oryza sativa (japonica cultivar-group)] (GB:BAC55679.1); similar to Os02g0292800 [Oryza sativa (japonica cultivar-group)] (GB:NP_001046597.1); similar to Os08g0153900 [Oryza sativa (japonica cultivar-group)] (GB:NP_001061011.1); contains InterPro domain Protein of unknown function DUF599; (InterPro:IPR006747)
258322_at  At3g22740  HMT3_HMT3 (Homocysteine S-methyltransferase 3); homocysteine S-methyltransferase
259770_s_at  At1g07780, At1g29410, At5g05590  PAI1_TRP6_PAI1 (PHOSPHORIBOSYLANTHRANILATE ISOMERASE 1); phosphoribosylanthranilate isomerase
260546_at  At2g43520  ATTI2_ATTI2 (ARABIDOPSIS THALIANA TRYPSIN INHIBITOR PROTEIN 2); trypsin inhibitor
260567_at  At2g43820  GT_GT/UGT74F2 (UDP-GLUCOSYLTRANSFERASE 74F2); UDP-glucosyltransferase/ UDP-glycosyltransferase/ transferase, transferring glycosyl groups / transferase, transferring hexosyl groups
260685_at  At1g17650  phosphogluconate dehydrogenase (decarboxylating)
260872_at  At1g21350  electron carrier/ oxidoreductase
260981_at  At1g53460  similar to zinc finger (Ran-binding) family protein [Arabidopsis thaliana] (TAIR:AT1G55040.1); similar to Zn-finger in Ran binding protein and others, putative [Oryza sativa (japonica cultivar-group)] (GB:AAX95671.1); similar to Os03g0712200 [Oryza sativa (japonica cultivar-group)] (GB:NP_001051062.1); similar to Os01g0203300 [Oryza sativa (japonica cultivar-group)] (GB:NP_001042331.1)
261105_at  At1g63000  NRS/ER_NRS/ER (NUCLEOTIDE-RHAMNOSE SYNTHASE/EPIMERASE-REDUCTASE)
261326_s_at  At1g44180, At1g44820  aminoacylase, putative / N-acyl-L-amino-acid amidohydrolase, putative
261727_at  At1g76090  SMT3_SMT3 (S-adenosyl-methionine-sterol-C-methyltransferase 3); S-adenosylmethionine-dependent methyltransferase
261924_at  At1g22550  proton-dependent oligopeptide transport (POT) family protein
262134_at  At1g77990  AST56_SULTR2;2_AST56 (sulphate transporter 2;2); sulfate transporter
262458_at  At1g11280  carbohydrate binding / kinase
262875_at  At1g64970  G-TMT_TMT1_VTE4_G-TMT (GAMMA-TOCOPHEROL METHYLTRANSFERASE)
262916_at  At1g59700  ATGSTU16_ATGSTU16 (Arabidopsis thaliana Glutathione S-transferase (class tau) 16); glutathione transferase
263553_at  At2g16430  PAP10_PAP10; acid phosphatase/ protein serine/threonine phosphatase
263714_at  At2g20610  SUR1_ALF1_HLS3_RTY_SUR1_SUR1 (SUPERROOT 1); transaminase
264052_at  At2g22330  CYP79B3_CYP79B3 (cytochrome P450, family 79, subfamily B, polypeptide 3); oxygen binding
264513_at  At1g09420  G6PD4_G6PD4 (GLUCOSE-6-PHOSPHATE DEHYDROGENASE 4); glucose-6-phosphate 1-dehydrogenase
264790_at  At2g17820  ATHK1_AHK1_ATHK1_ATHK1 (HISTIDINE KINASE 1)
264954_at  At1g77060  mutase family protein
265032_at  At1g61580  ARP2_RPL3B_ARP2/RPL3B (ARABIDOPSIS RIBOSOMAL PROTEIN 2); structural constituent of ribosome
265058_s_at  At1g52030, At1g52040  MBP2_F-ATMBP_MBP1.2_MBP2 (MYROSINASE-BINDING PROTEIN 2)
265354_at  At2g16700  ADF5_ADF5 (ACTIN DEPOLYMERIZING FACTOR 5); actin binding
265486_at  (no AGI ID or annotation listed)
265611_at  At2g25510  unknown protein
265905_at  At2g25640  similar to transcription elongation factor-related [Arabidopsis thaliana] (TAIR:AT5G25520.2); similar to PHD finger protein-like [Oryza sativa (japonica cultivar-group)] (GB:BAD24999.1); similar to Os02g0208600 [Oryza sativa (japonica cultivar-group)] (GB:NP_001046260.1); contains InterPro domain Transcription elongation factor S-II, central region;
  (InterPro:IPR003618); contains InterPro domain SPOC; (InterPro:IPR012921)
266472_at  (no AGI ID or annotation listed)
266643_s_at  At2g29710, At2g29730  UDP-glucoronosyl/UDP-glucosyl transferase family protein
267078_at  At2g40960  nucleic acid binding

APPENDIX B
R CODE

#####################################################################
# Packages Used
#####################################################################
library(limma)
library(randomForest)
library(varSelRF)
library(maps)

#####################################################################
# Reading Data
#####################################################################
Ecodata = read.table("AtGE_ecotypes.txt", header=T, sep="\t")
Geo = read.table("Geo.txt", header=T, sep="\t")
x = read.table("EcotypesGeo.txt", sep="\t")

#####################################################################
# Map
#####################################################################
# [a par() call ending in ", 4))" is illegible in the source scan]
par(mfrow=c(2,1))
# World map with the ten ecotype locations
map("world", col="grey")
text(Geo$Longitude, Geo$Latitude, Geo$Ecotype, col="black", cex=0.8)
points(Geo$Longitude, Geo$Latitude, col=rainbow(16:20)[1:10], cex=0.7, lwd=3)
legend(120, 85, Geo$Ecotype, fill=rainbow(16:20)[1:10], cex=0.8, bty="n")
# Highlight Germany on the world map (7th row of Geo)
points(Geo$Longitude[7], Geo$Latitude[7], col="darkgoldenrod", cex=3, lwd=2)
arrows(10.5, 30, 10.5, -50, lwd=2, angle=15, col="darkgoldenrod")
text(5.5, -70, "Germany", adj=0, cex=1.5, col="darkgoldenrod")
# Enlarged map of Germany
map("world", "Germany", col="darkgoldenrod")
text(Geo$Longitude, Geo$Latitude, Geo$Ecotype, cex=0.8, col="red")
points(Geo$Longitude, Geo$Latitude, col=rainbow(16:20)[1:10], cex=0.7, lwd=3)

#####################################################################
# Cluster for La4 and Al4
#####################################################################
# x was read without a header, so its first row holds the column names and
# the values must be converted to numbers before computing distances.
la = as.numeric(as.character(x[-1,4]))
la = la[1:10]
names(la) = x[-1,1][1:10]
dist(la)
hc.la <- hclust(dist(la))
plot(hc.la)
la.km <- kmeans(dist(la),4)$cluster
la.km   # Cluster

al = as.numeric(as.character(x[-1,7]))
al = al[1:10]
names(al) = x[-1,1][1:10]
dist(al)
hc.al <- hclust(dist(al))
plot(hc.al)
al.km <- kmeans(dist(al),4)$cluster
al.km   # Cluster

#####################################################################
# Gene Expression Plot
#####################################################################
genelist = la4   # changeable variable, e.g. la4
genedata <- Ecodata[genelist,2:31]
# Mean of the three replicates of each ecotype
gene.x1 <- apply(genedata[,1:3],1,mean)
gene.x2 <- apply(genedata[,4:6],1,mean)
gene.x3 <- apply(genedata[,7:9],1,mean)
gene.x4 <- apply(genedata[,10:12],1,mean)
gene.x5 <- apply(genedata[,13:15],1,mean)
gene.x6 <- apply(genedata[,16:18],1,mean)
gene.x7 <- apply(genedata[,19:21],1,mean)
gene.x8 <- apply(genedata[,22:24],1,mean)
gene.x9 <- apply(genedata[,25:27],1,mean)
gene.x10 <- apply(genedata[,28:30],1,mean)
genegexp <- data.frame(gene.x1, gene.x2, gene.x3, gene.x4, gene.x5,
                       gene.x6, gene.x7, gene.x8, gene.x9, gene.x10)
# Overlay all selected genes in one plot
for (i in 1:length(genelist)){
  GeneExpression.gene = t(rbind(genegexp[i,]))
  matplot(GeneExpression.gene, axes=F, frame=T, type='b', pch=1)
  row.names(GeneExpression.gene) <- c("Bay0", "C24", "Col0", "Cvi", "Est",
                                      "Kin0", "Ler", "Nd1", "Sha", "Van0")
  axis(1, 1:10, row.names(GeneExpression.gene))
  par(new=T)
}
title(xlab="Ecotypes", main=paste(name))   # name: plot title, set by the user

#####################################################################
# Gene Expression Plot - Each picture represents one gene
#####################################################################
genelist = la4   # changeable variable
N = 20           # the number of genes
genedata <- Ecodata[genelist,2:31]
gene.x1 <- apply(genedata[,1:3],1,mean)
gene.x2 <- apply(genedata[,4:6],1,mean)
gene.x3 <- apply(genedata[,7:9],1,mean)
gene.x4 <- apply(genedata[,10:12],1,mean)
gene.x5 <- apply(genedata[,13:15],1,mean)
gene.x6 <- apply(genedata[,16:18],1,mean)
gene.x7 <- apply(genedata[,19:21],1,mean)
gene.x8 <- apply(genedata[,22:24],1,mean)
gene.x9 <- apply(genedata[,25:27],1,mean)
gene.x10 <- apply(genedata[,28:30],1,mean)
genegexp <- data.frame(gene.x1, gene.x2, gene.x3, gene.x4, gene.x5,
                       gene.x6, gene.x7, gene.x8, gene.x9, gene.x10)
for (i in 1:N){
  GeneExpression.gene = t(rbind(genegexp[i,]))
  matplot(GeneExpression.gene, axes=F, frame=T, type='b', pch=1)
  row.names(GeneExpression.gene) <- c("Bay0", "C24", "Col0", "Cvi", "Est",
                                      "Kin0", "Ler", "Nd1", "Sha", "Van0")
  axis(1, 1:10, row.names(GeneExpression.gene))
  title(main=paste("Gene",i))
}

#####################################################################
# Latitude - La4
#####################################################################
ecorep = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10,10)
design = model.matrix(~-1+factor(ecorep))
# Four latitude groups built from the ten ecotype indicator columns
Ala4 = design[,5]
Bla4 = design[,1]+design[,7]+design[,8]+design[,10]
Cla4 = design[,2]+design[,3]+design[,6]+design[,9]
Dla4 = design[,4]
designla4 = data.frame(Ala4, Bla4, Cla4, Dla4)
# All six pairwise contrasts between the four groups
contrast.matrixla4 = makeContrasts(Ala4 - Bla4, Ala4 - Cla4, Ala4 - Dla4,
                                   Bla4 - Cla4, Bla4 - Dla4, Cla4 - Dla4,
                                   levels=designla4)
eco.fitla4 = lmFit(Ecodata[,2:31],designla4)
eco.fit2la4 = contrasts.fit(eco.fitla4, contrast.matrixla4)
eco.ebla4 = eBayes(eco.fit2la4)

#= decideTests =#
clasla4 = decideTests(eco.ebla4, method="nestedF", adjust.method="fdr", p.value=0.05)
rownames(clasla4) = Ecodata[,1]
kla4 = rowSums(abs(clasla4))   # number of significant contrasts per gene

# select the genes which are significant at least in one contrast
c1.la4 = clasla4[,1]
c2.la4 = clasla4[,2]
c3.la4 = clasla4[,3]
c4.la4 = clasla4[,4]
c5.la4 = clasla4[,5]
c6.la4 = clasla4[,6]
la4.c1 = which(c1.la4 == 1 | c1.la4 == -1)
la4.c2 = which(c2.la4 == 1 | c2.la4 == -1)
la4.c3 = which(c3.la4 == 1 | c3.la4 == -1)
la4.c4 = which(c4.la4 == 1 | c4.la4 == -1)
la4.c5 = which(c5.la4 == 1 | c5.la4 == -1)
la4.c6 = which(c6.la4 == 1 | c6.la4 == -1)
la4.all = unique(c(la4.c1,la4.c2,la4.c3,la4.c4,la4.c5,la4.c6))

#= Look decideTests in different way =#
la4k0 = length(which(kla4==0))
la4k1 = length(which(kla4==1))
la4k2 = length(which(kla4==2))
la4k3 = length(which(kla4==3))
la4k4 = length(which(kla4==4))
la4k5 = length(which(kla4==5))
la4k6 = length(which(kla4==6))
la4k = c(la4k0, la4k1, la4k2, la4k3, la4k4, la4k5, la4k6)
names(la4k) = c(0:6)
la4bar = barplot(la4k, space=1.5, col=
                 c("yellow","red","blue","lightblue", "mistyrose", "lightcyan", "lavender"),
                 legend=la4k, xlab="number of significant contrasts", main="la4")
la4row = which(kla4>=4)

#####################################################################
# Altitude - Al4
#####################################################################
ecorep = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10,10)
design = model.matrix(~-1+factor(ecorep))
# Four altitude groups built from the ten ecotype indicator columns
Aal4 = design[,9]
Bal4 = design[,7]
Cal4 = design[,1]+design[,2]+design[,6]+design[,8]
Dal4 = design[,3]+design[,4]+design[,5]+design[,10]
designal4 = data.frame(Aal4, Bal4, Cal4, Dal4)
contrast.matrixal4 = makeContrasts(Aal4-Bal4, Aal4-Cal4, Aal4-Dal4,
                                   Bal4-Cal4, Bal4-Dal4, Cal4-Dal4,
                                   levels=designal4)
eco.fital4 = lmFit(Ecodata[,2:31],designal4)
eco.fit2al4 = contrasts.fit(eco.fital4, contrast.matrixal4)
eco.ebal4 = eBayes(eco.fit2al4)

#= decideTests =#
clasal4 = decideTests(eco.ebal4, method="nestedF", adjust.method="fdr", p.value=0.05)
kal4 = rowSums(abs(clasal4))   # number of significant contrasts per gene

# select the genes which are significant at least in one contrast
c1.al4 = clasal4[,1]
c2.al4 = clasal4[,2]
c3.al4 = clasal4[,3]
c4.al4 = clasal4[,4]
c5.al4 = clasal4[,5]
c6.al4 = clasal4[,6]
al4.c1 = which(c1.al4 == 1 | c1.al4 == -1)
al4.c2 = which(c2.al4 == 1 | c2.al4 == -1)
al4.c3 = which(c3.al4 == 1 | c3.al4 == -1)
al4.c4 = which(c4.al4 == 1 | c4.al4 == -1)
al4.c5 = which(c5.al4 == 1 | c5.al4 == -1)
al4.c6 = which(c6.al4 == 1 | c6.al4 == -1)
al4.all = unique(c(al4.c1,al4.c2,al4.c3,al4.c4,al4.c5,al4.c6))

#= Look decideTests in different way =#
al4k0 = length(which(kal4==0))
al4k1 = length(which(kal4==1))
al4k2 = length(which(kal4==2))
al4k3 = length(which(kal4==3))
al4k4 = length(which(kal4==4))
al4k5 = length(which(kal4==5))
al4k6 = length(which(kal4==6))
al4k = c(al4k0, al4k1, al4k2, al4k3, al4k4, al4k5, al4k6)
names(al4k) = c(0:6)
al4bar = barplot(al4k, space=1.5, col=
                 c("yellow","red","blue","lightblue", "mistyrose", "lightcyan", "lavender"),
                 legend=al4k, xlab="number of significant contrasts", main="al4")
al4row = which(kal4>=5)

#####################################################################
# Cvi vs. the other 8 Ecotypes (without Sha)
#####################################################################
ecorep = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10,10)
design = model.matrix(~-1+factor(ecorep))
designcvi = design
colnames(designcvi) <- c("Bay0", "C24", "Col0", "Cvi", "Est",
                         "Kin0", "Ler", "Nd1", "Shakdara", "Van0")
# Cvi against the average of the other eight ecotypes (Shakdara left out)
contrast.matrixcvi <- makeContrasts(Cvi - Bay0/8 - C24/8 - Col0/8 - Est/8
                                    - Kin0/8 - Ler/8 - Nd1/8 - Van0/8,
                                    levels=designcvi)
eco.fitcvi = lmFit(Ecodata[,2:31],designcvi)
eco.fit2cvi = contrasts.fit(eco.fitcvi, contrast.matrixcvi)
eco.ebcvi = eBayes(eco.fit2cvi)
clascvi = decideTests(eco.ebcvi, method="nestedF", adjust.method="fdr", p.value=0.05)
kcvi = rowSums(abs(clascvi))

#= Toptable =#
# selecting the first 500 genes from toptable
num = 500
cvi = topTable(eco.ebcvi, genelist=eco.ebcvi$genes, coef=1, number=num, adjust.method="fdr")
d.cvi = read.csv("cvinumber.csv")
n.cvi = d.cvi[,1]

#####################################################################
# Sha vs. the other 8 Ecotypes (without Cvi)
#####################################################################
designsha = design
colnames(designsha) <- c("Bay0", "C24", "Col0", "Cvi", "Est",
                         "Kin0", "Ler", "Nd1", "Shakdara", "Van0")
# Shakdara against the average of the other eight ecotypes (Cvi left out)
contrast.matrixsha <- makeContrasts(Shakdara - Bay0/8 - C24/8 - Col0/8 - Est/8
                                    - Kin0/8 - Ler/8 - Nd1/8 - Van0/8,
                                    levels=designsha)
eco.fitsha = lmFit(Ecodata[,2:31],designsha)
eco.fit2sha = contrasts.fit(eco.fitsha, contrast.matrixsha)
eco.ebsha = eBayes(eco.fit2sha)

#= Toptable =#
# selecting the first 500 genes from toptable
sha = topTable(eco.ebsha, coef=1, number=num, adjust.method="fdr")
d.sha = read.csv("shanumber.csv")
n.sha = d.sha[,1]
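# Illustration only, not part of the original listing: the Cvi and Sha
# contrasts above each compare one ecotype with the average of eight others.
# The base-R sketch below spells out those contrast weights without limma.
# The helper name one_vs_rest and the simulated expression values are
# hypothetical; with the cell-means design used above, each fitted
# coefficient is an ecotype mean, so the contrast is a weighted sum of means.

```r
# Ecotype labels in the column order used above (3 replicates each)
ecotypes <- c("Bay0", "C24", "Col0", "Cvi", "Est",
              "Kin0", "Ler", "Nd1", "Shakdara", "Van0")

# Hypothetical helper: weight +1 for the target ecotype, -1/k for each of the
# k comparison ecotypes, and 0 for the excluded ecotype.
one_vs_rest <- function(target, exclude, levels = ecotypes) {
  rest <- setdiff(levels, c(target, exclude))
  w <- setNames(numeric(length(levels)), levels)
  w[target] <- 1
  w[rest] <- -1 / length(rest)
  w
}

w_cvi <- one_vs_rest("Cvi", exclude = "Shakdara")
w_sha <- one_vs_rest("Shakdara", exclude = "Cvi")

# Apply the Cvi weights to the per-ecotype means of one simulated gene
set.seed(42)
expr <- rnorm(30, mean = 8)   # made-up values, 10 ecotypes x 3 replicates
group_means <- tapply(expr, rep(ecotypes, each = 3), mean)[ecotypes]
cvi_effect <- sum(w_cvi * group_means)

print(round(w_cvi, 3))
```

# The weights sum to zero, so cvi_effect is the Cvi mean minus the average of
# the eight comparison ecotypes, which is the quantity tested by the contrast.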

#####################################################################
# Highly Variation - geneselect
#####################################################################
# AtGE (expression matrix) and number (how many genes to keep) are set by the user
vars = apply(AtGE, 1, var)
sortvars = sort(vars, decreasing=TRUE)
geneselect = sortvars[1:number]
gs = names(geneselect)
gs = as.numeric(gs)
Ecodata[gs,1]

#####################################################################
# Randomly Selection - ran
#####################################################################
x = runif(number, min=1, max=22810)
ran = as.integer(x)

#####################################################################
# RandomForest
#####################################################################
rfgenes = n.cvi   # changeable variable
rfname = "n.cvi"
library(randomForest)
eco1 = t(Ecodata[rfgenes,2:31])   ## select "number" genes
# The class label must be a factor for randomForest to run a classification
econames = factor(rep(c("Bay0", "C24", "Col0", "Cvi", "Est",
                        "Kin0", "Ler", "Nd1", "Shakdara", "Van0"), each=3))
colnames(eco1) = Ecodata[rfgenes,1]
ecotype1 = data.frame(eco1, econames)   ## Data which we want ##
ecotype.rf = randomForest(econames ~ ., data=ecotype1, ntree=100,
                          keep.forest=TRUE, importance=TRUE)
ecotype.rf
imp = importance(ecotype.rf)
# Column 11 follows the ten per-class columns: mean decrease in accuracy
plot(sort(imp[,11]), type="h", ylab="Importance Score", main=rfname)   # see Accuracy

#####################################################################
# ntree vs. OOB error rate
#####################################################################
ntree = 300
nrf = 10   # number of bootstrap
m = matrix(rep(0,ntree*nrf), nrow=ntree)
for (j in 1:nrf){
  for (i in 1:ntree){
    ecotype.rf = randomForest(econames ~ ., data=ecotype1, ntree=i, mtry=sqrt(22810),
                              keep.forest=TRUE, importance=TRUE)
    m[i,j] = ecotype.rf$err.rate[i,1]
    matplot(m, type="l", col="grey", lty=1, xlab="number of trees",
            ylab="OOB error rate", ylim=c(0,1), frame.plot=F)
    axis(1, seq(0,ntree,by=50), col="#EE9A00", col.axis="blue", lwd=2)
    axis(2, seq(0,1,by=0.2), col="#EE9A00", col.axis="blue", lwd=2)
  }
  par(new=T)
}
# Mean OOB error over the nrf runs for each number of trees
mmean = apply(m, 1, mean)
mmean = ifelse(is.nan(mmean), 1, mmean)
op = par(new=T)
par(op)
plot(mmean, type="l", cex=1, col="red", lwd=2, xlab="number of trees",
     ylab="OOB error rate", ylim=c(0,1), frame.plot=F, axes=F)
axis(1, seq(0,ntree,by=50), col="#EE9A00", col.axis="blue", lwd=2)
axis(2, seq(0,1,by=0.2), col="#EE9A00", col.axis="blue", lwd=2)
par(op)
mini = min(mmean)
a = which(mmean==mini)   # a is the number of trees which we want
text(a, 0.4, paste("ntree",a), adj=1, cex=1.2, col="dark green")
points(a, mmean[a], col="dark green", cex=1.2, lwd=3)

#= After finding "optimal ntree" =#
ecotype.rf = randomForest(econames ~ ., data=ecotype1, ntree=a,
                          keep.forest=TRUE, importance=TRUE)
ecotype.rf
econames = as.factor(econames)
e = ecotype1[,-501]   # drop the class column (column 501 when 500 genes are used)
rf.eco <- varSelRF(e, econames, ntree=210, mtry=4)
rf.eco
plot(rf.eco)

#####################################################################
# mtry vs. OOB error rate
#####################################################################
rf.mtry = 170
nrf = 10   # number of bootstrap
numtree = a
m = matrix(rep(0,rf.mtry*nrf), nrow=rf.mtry)
for (j in 1:nrf){
  for (i in 130:rf.mtry){
    ecotype.rf = randomForest(econames ~ ., data=ecotype1, ntree=numtree, mtry=i,
                              keep.forest=TRUE, importance=TRUE)
    m[i,j] = ecotype.rf$err.rate[numtree,1]
    matplot(m, type="l", col="grey", lty=1, xlab="number of mtry",
            ylab="OOB error rate", ylim=c(0,1), xlim=c(130,170), frame.plot=F)
    axis(1, seq(130,170,by=2), col="#EE9A00", col.axis="blue", lwd=2)
    axis(2, seq(0,1,by=0.2), col="#EE9A00", col.axis="blue", lwd=2)
  }
  par(new=T)
}
mmean = apply(m, 1, mean)
#mmean = ifelse(is.nan(mmean), 1, mmean)
op = par(new=T)
par(op)
plot(mmean, type="l", cex=1, col="red", lwd=2, xlab="number of mtry",
     ylab="OOB error rate", ylim=c(0,1), xlim=c(130,170), frame.plot=F, axes=F)
axis(1, seq(130,170,by=2), col="#EE9A00", col.axis="blue", lwd=2)
axis(2, seq(0,1,by=0.2), col="#EE9A00", col.axis="blue", lwd=2)
par(op)
mini = min(mmean[131:170])
b = which(mmean==mini)   # b is the mtry value which we want
text(b, 0.4, paste("mtry",b), adj=0, cex=1.2, col="dark green")
points(b, mmean[b], col="dark green", cex=1.2, lwd=3)

#= After finding "optimal ntree" & "optimal mtry" =#
ecotype.rf = randomForest(econames ~ ., data=ecotype1, ntree=a, mtry=b,
                          keep.forest=TRUE, importance=TRUE)
ecotype.rf

#####################################################################
# Select the number of Genes from RandomForest
#####################################################################
rfgenes = n.cvi
eco1 = t(Ecodata[rfgenes,2:31])
econames = rep(c("Bay0", "C24", "Col0", "Cvi", "Est",
                 "Kin0", "Ler", "Nd1", "Shakdara", "Van0"), each=3)
colnames(eco1) = Ecodata[rfgenes,1]
ecotype1 = data.frame(eco1, econames)
econames = as.factor(econames)
e = ecotype1[,-501]
rf.eco <- varSelRF(e, econames, ntree=a,
                   mtry = b)
rf.eco
plot(rf.eco)


ESL.txt

Ecotype     Latitude       Longitude
Bay-0       49.56          11.34
C24         40.2           8.25
Col-0       43.01251667    -70.05
Cvi         16.00208056    -24.05
Est         59             25.04
Kin-0       42.46638889    -84.46
Ler         48.2           10.52
Nd-1        50.77777778    8.03
Shakdara    37.18333333    73.166
Van-0       49.85049722    -123.11


EcotypesGeo.txt

Ecotype                         Location                          Latitude   Longitude   Altitude
Bay-0      AtGE_111_A, B, C     Bayreuth, Germany                 49.56      11.34       350
C24        AtGE_112_A, C, D     Coimbra, Portugal                 40.20      8.25        179
Col-0      AtGE_113_A, C, D     Columbia University (US)          43.01      -70.05      49
Cvi        AtGE_114_A, B, C     Cape Verde Islands                16.00      24.05       43
Est        AtGE_115_A, B, D     Estonia                           59.00      25.04       15
Kin-0      AtGE_116_A, B, C     Kinneville, MI                    42.47      -84.46      273
Ler        AtGE_117_B, C, D     Landsberg, Germany                48.20      10.52       628
Nd-1       AtGE_118_A, B, C     Niederzunzheim, Germany           50.78      8.03        250
Shakdara   AtGE_119_A, C, D     Pamiro-Alay, Tadjikistan          37.18      73.17       3400
Van-0      AtGE_120_A, B, C     University of British Columbia    49.85      -123.11     50


AtGE_ecotypes.txt

These data are available on the Weigel World website:
http://www.weigelworld.org/resources/microarray/AtGenExpress/


BIBLIOGRAPHY

A. Liaw, M. Wiener (2002), Classification and regression by randomForest, R News 2/3: 18-22.

C. Strobl, A. Boulesteix, A. Zeileis, T. Hothorn (2007), Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution, BMC Bioinformatics 8:25.

C.V. Zwan, S. Brodie, J.J. Campanella (2000), The intraspecific phylogenetics of Arabidopsis thaliana in worldwide populations, Systematic Botany 25: 47-59.

D.W. Meinke, et al. (1998), Arabidopsis thaliana: a model plant for genome analysis, Science, Vol.282.

Furlanello et al. (2003), GIS and the Random Forest Predictor: Integration in R for Tick-borne Disease Risk Assessment, Proceedings of the DSC-03 International Workshop on Distributed Statistical Computing, Vienna, Austria.

H. Pang, A. Lin, M. Holford, B.E. Enerson, B. Lu, M.P. Lawton, E. Floyd, H. Zhao (2006), Pathway analysis using random forests classification and regression, Bioinformatics, Vol.22.

Jinwook Seo, Ben Shneiderman (2002), Interactively Exploring Hierarchical Clustering Results, IEEE Computer, Vol.35: 80-86.

K. Apel, H. Hirt (2004), Reactive oxygen species: Metabolism, oxidative stress and signal transduction, Annual Review of Plant Biology, Vol.55: 373-399.

L. Breiman (2001), Random Forests, Machine Learning, Vol.45: 5-32.

L. Breiman, A. Cutler, Random Forests. URL: http://www.stat.berkeley.edu/users/breiman/RandomForests/cc_papers.htm

M. Schmid, T.S. Davison, S.R. Henz, U.J. Pape, M. Demar, M. Vingron, B. Schölkopf, D. Weigel, J.U. Lohmann (2005), A gene expression map of Arabidopsis development, Nature Genetics, Vol.37: 501-506.

R. Diaz-Uriarte (2004), Variable Selection from Random Forests: Application to Gene Expression Data, Spanish Bioinformatics Conference 2004: 47-52.

R. Diaz-Uriarte (2005), GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest, CNIO, Spain.

R. Diaz-Uriarte, S. Alvarez de Andres (2006), Gene selection and classification of microarray data using random forest, BMC Bioinformatics 7:3.

S. Karpinski, H. Reynolds, B. Karpinska, G. Wingsle, G. Creissen, P. Mullineaux (1999), Systemic signaling and acclimation in response to excess excitation energy in Arabidopsis, Science, Vol.284: 654-657.

Smyth, G.K. (2004), Linear models and empirical Bayes methods for assessing differential expression in microarray experiments, Statistical Applications in Genetics and Molecular Biology 3, No. 1, Article 3.

T.K. Ho (1995), Random Decision Forests, Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, Canada: 278-282.

U. Johanson, et al. (2000), Molecular Analysis of FRIGIDA, a Major Determinant of Natural Variation in Arabidopsis Flowering Time, Science, Vol.290.

Y. Truong, X. Lin, C. Beecher (2004), Learning a complex metabolomic dataset using random forests and support vector machines, KDD (Knowledge Discovery and Data Mining): 835-840.