
This is to certify that the
thesis entitled

Random Forests and Gene Selection to Classify Arabidopsis
Thaliana Ecotypes

presented by

 

Hsueh-han Yeh

has been accepted towards fulfillment
of the requirements for the

M.S. degree in Statistics and Probability

Major Professor’s Signature

840»- 07

 

Date

MSU is an affirmative-action, equal-opportunity employer


Random Forests and Gene Selection to Classify Arabidopsis Thaliana Ecotypes

By
Hsueh-han Yeh

A THESIS

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

MASTER OF SCIENCE
Department of Statistics and Probability

2007

ABSTRACT

Random Forests and Gene Selection to Classify Arabidopsis Thaliana Ecotypes

By

Hsueh-han Yeh

This thesis discusses the classification and gene selection of ecotype data for
Arabidopsis thaliana. Gene expression values from oligonucleotide gene expression arrays
were used to classify Arabidopsis thaliana ecotypes using statistical methods.
Hierarchical clustering was used to group the ecotypes according to latitude and
altitude. Limma was used to select differentially expressed genes. The Random Forest
algorithm provides a ranking of genes that indicates how well they discriminate between
ecotypes.

We focus on the Random Forest algorithm. It is an efficient approach and can deal
with a large number of predictor variables in a classification process. Its parameters
are tuned to achieve a small classification error rate.

The genes in the final selection may play an important role in adaptation to stress
conditions. They were further examined for gene function and evidence regarding stress
resistance.

Keywords: Arabidopsis thaliana, Microarray Data, Hierarchical Cluster, Limma,
Random Forest, Classification.

ACKNOWLEDGEMENTS

I wish to thank the many people who made this thesis possible. First of all, it is hard
to overstate my gratitude to my advisor, Dr. Marianne Huebner, Department of Statistics,
Michigan State University. With her enthusiasm, her patience, and her encouragement,
she helped to make statistics and biology fun for me. Throughout my thesis-writing
period, she provided many suggestions and lots of good ideas. Dr. Huebner also helped
me revise my English. I am very glad to have worked with her. I wish to thank
Dr. Andreas Weber for his support and grant. Dr. Weber also gave me suggestions for
examining gene functions, which made this thesis complete. I wish to thank my parents.
They raised me, supported me, taught me, and loved me. To them I dedicate this thesis. I
wish to thank my best friend Hsiu-ching Chang for helping me get through the difficult
times, and for all the emotional support. My special gratitude is due to my brother for
his loving support. I also wish to thank William Robert Swindell for giving many helpful
suggestions on the biology section.

Finally, I have to say 'Thank You' to all my friends and family, wherever they are
and wherever they go.

TABLE OF CONTENTS
List of Tables
List of Figures

Chapter 1 Introduction of Microarray and Arabidopsis Ecotypes Data

1.1 Microarray Data
1.2 Arabidopsis thaliana
1.3 Gene Selection Process

Chapter 2 Statistical Methodology

2.1 Hierarchical Clustering
2.2 Limma - Linear Models for Microarray Data
2.3 Random Forest

Chapter 3 Application of Limma and Random Forest to Ecotypes

3.1 Gene Selection using Limma
3.2 Ecotypes of Cvi and Shakdara
3.3 Gene Selection from Cvi contrasts with other 8 ecotypes
3.4 Gene Selection from Shakdara contrasts with other 8 ecotypes
3.5 Gene Selection from Cvi500 and Sha500 by Random Forest
3.6 Compare the OOB error rate of Random Forest
3.7 Misclassifications of Ecotypes

Chapter 4 Gene Ontology

4.1 Gene Ontology with Classification Superviewer
4.2 Gene Ontology of Cvi43 and Sha84

Appendices
Bibliography

LIST OF TABLES

Table 1 Ecotypes Geography Information
Table 2 Resources of Arabidopsis Genome
Table 3 Geography of Ecotypes
Table 4 The number of significant genes per contrast
Table 5 Comparison of OOB error rate
Table 6 Misclassification List
Table 7 Main Function categories of FunCat
Table 8 FLC, Cytochrome P450 genes and Glutathione-S-transferase genes

LIST OF FIGURES

Figure 1 First 10 Ecotypes Distribution Map
Figure 2 Hierarchical Clustering Process
Figure 3 Ecotype Cluster
Figure 4 Random Forest Construction
Figure 5 Number of significant contrasts
Figure 6 The number of significant genes per contrast
Figure 7 Significant genes for Latitude and Altitude
Figure 8 Gene expression of 247999_at
Figure 9 Optimal value of ntree for Cvi
Figure 10 Optimal value of mtry for Cvi
Figure 11 Optimal value of ntree for Shakdara
Figure 12 Optimal value of mtry for Shakdara
Figure 13 Optimal value of the number of genes for Cvi
Figure 14 Optimal value of the number of genes for Shakdara
Figure 15 Overlapping genes from Cvi500 and Sha500
Figure 16 Misclassification figure
Figure 17 Cvi43 - Classification Superviewer
Figure 18 Sha84 - Classification Superviewer
Figure 19 Expression graph for 5 specific genes

Chapter 1 Introduction of Microarray and Arabidopsis Ecotypes Data
1.1. Microarray Data

The regulatory regions of plant genes are likely to be more concise than those of animal
genes, but the number of transcription factors encoded in plant genomes is larger than
in animals. Thus, plants can contribute to research on the influence of transcription
factors in multicellular development. Here, we study the reference plant Arabidopsis
thaliana, and the dataset is the AtGenExpress Ecotypes expression data estimated
by gcRMA. The data are part of the public AtGenExpress expression atlas, which was
created on the Affymetrix ATH1 array platform. Microarrays, either oligonucleotide
chips or spotted arrays, are a technology for studying the expression of thousands of
genes at once. Because the resulting data sets are high dimensional, microarray
technology requires statistical methods for their analysis.

Statistical approaches can be used for multiple comparisons of genes to identify the
genes that are differentially expressed between arrays. Data mining is widely used for
microarray data since it can use a subgroup of genes to predict the class of the
observations (e.g. ecotypes), which helps to reduce the dimension of the data. In this
study, we use a classification and data mining technique, Random Forest, to classify the
Arabidopsis thaliana ecotype expression data.

1.2. Arabidopsis Data
Arabidopsis thaliana

The Arabidopsis ATH1 Genome Array, built at TIGR (The Institute for Genomic
Research), contains more than 22,500 probe sets representing approximately 24,000 gene
sequences on a single array (http://www.affymetrix.com).

Arabidopsis thaliana is a flowering plant, an inconspicuous weed. It has been used as a
model plant organism for many years and has been chosen for use in molecular genetic
analysis. Laibach (1943) first pointed out that several characteristics of Arabidopsis
thaliana make it suitable as a model plant organism. It has a short life cycle;
it needs only several weeks to mature. Due to its small size, it can be grown in a
limited area. Furthermore, it has a small genome and nearly non-repetitive DNA
(S. Barth, A. E. Melchinger, T. Lübberstedt, 2002). These features make Arabidopsis
thaliana plants very convenient for genetic analysis, and an international effort has
therefore been devoted to developing methods for studying its genome.

 

Arabidopsis thaliana at an early stage of ﬂowering. [Drawing by K. Sutliff]

Arabidopsis thaliana Ecotype Data

Figure 1 First 10 Ecotypes Distribution Map

[Map of ecotype locations. Legend: Bay-0, C24, Col-0, Cvi, Est, Kin-0, Ler, Nd-1, Shakdara, Van-0]

An ecotype is a population of a plant that survives as a distinct group in its
ecological environment. The AtGenExpress Ecotypes Data used in this paper come from
weigelworld (www.weigelworld.org) and include 34 ecotypes. Each ecotype is represented
by one or several arrays of 22810 genes each. Arabidopsis thaliana is widely distributed
(Meinke et al., 1998), and the 34 ecotypes in the Arabidopsis thaliana Ecotype data used
in this study represent locations in Europe, North America and Africa. The location,
longitude, latitude, and altitude of each ecotype are listed in Table 1. The latitudes
of these ecotypes range from 16N to 59N. The longitudes range from 0.53E to 73E and
from 0.22W to 123W. The highest altitude is 3400 m. Of the 34 ecotypes, 27 are
distributed throughout Europe, 12 of them in Germany. The other ecotypes are distributed
in North America and Africa.

We want to examine whether these gene expression values can be used to classify
Arabidopsis thaliana ecotypes by statistical methods. The first problem we confront is
the large number of genes measured for each ecotype. Dimension reduction helps to deal
with a large number of variables efficiently and to select the most important
variables. We use Random Forest to decrease the size of the dataset and to classify
ecotypes. The Random Forest algorithm will be discussed in Chapter 2.
Table 1 Ecotypes Geography Information

Ecotype    Location                   Altitude (m)  Latitude  Longitude  Temperature (°C)
Bay-0      Bayreuth, Germany          350           49N       11 E       -2 to 18
C24        Coimbra, Portugal          179           40N       8 E        7.2 to 27
Col-0      Columbia University (US)   49            39N       93 W       -3.3 to 28.9
Cvi        Cape Verde Islands         43            16N       24 W       24 to 29
Est        Estonia                    15            59N       26 E       -5.2 to 17
Kin-0      Kinneville, MI             273           43N       85 W       -12.2 to 32.2
Ler        Landsberg, Germany         628           53N       16 E       -1.7 to 19.4
Nd-1       Niederzunzheim, Germany    250           50N       8 E        5.5 to 9.5
Shakdara   Pamiro-Alay, Tadjikistan   3400          37N       71 E       0 to 30
Van-0      UBC (Vancouver)            50            50N       123 W      0 to 26

 

 

 

 

 

 

Table 2 Resources of Arabidopsis thaliana Genome

Resources                     Contact Person  Website
Arabidopsis database (AtDB)   M. Cherry       http://genome-www.stanford.edu/Arabidopsis/
ABRC Stock Center (USA)       R. Scholl       http://aims.cps.msu.edu/aims
NASC Stock Centre (UK)        M. Anderson     http://nasc.nott.ac.uk
TIGR (USA)                    S. Rounsley     http://www.tigr.org/tdb/at/at.html
SPP Consortium (USA)          R. Davis        http://sequence-www.stanford.edu/ara/SPP.html
CSHL Consortium (USA)         R. McCombie     http://nucleus.cshl.org/protarab/
ESSA Consortium (Europe)      M. Bevan        http://muntjac.mips.biochem.mpg.de/arabi/index.html
Genoscope (France)            F. Quetier      http://www.genoscope.cns.fr/externe/arabidopsis/Arabidopsis.html
Kazusa Institute (Japan)      S. Tabata       http://www.kazusa.or.jp/arabi/

Source: David W. Meinke, J. Michael Cherry, Caroline Dean, Steven D. Rounsley, Maarten Koornneef.
Arabidopsis thaliana: A Model Plant for Genome Analysis (1998)

1.3. Gene Selection Process

1. The first 10 ecotypes of the Arabidopsis thaliana Ecotypes Data (3 replications
   each) are grouped by latitude and by altitude using Hierarchical Clustering. Each
   grouping consists of four groups:
   La4 (La-A, La-B, La-C, La-D)    Al4 (Al-A, Al-B, Al-C, Al-D)
2. For each of these two groupings (Al4 and La4), the Limma package of the R software
   is used to test the six pairwise contrasts A-B, A-C, A-D, B-C, B-D, and C-D.
3. The number of significant genes for each contrast in each grouping is counted.
4. After counting the number of significant genes, we found that Cvi (La-D) has the
   largest number of significant genes differentially expressed in comparison with
   the other 3 latitude groups. Shakdara (Al-A) has the largest number of significant
   genes differentially expressed in comparison with the other 3 altitude groups.
5. Cvi (smallest latitude) and Shakdara (highest altitude) are compared to the other
   ecotypes to identify genes that differentiate them. The contrasts considered are:
   • Cvi - (1/8)(Bay0 + C24 + Col0 + Est + Kin0 + Ler + Nd1 + Van0)
   • Sha - (1/8)(Bay0 + C24 + Col0 + Est + Kin0 + Ler + Nd1 + Van0)
6. The top 500 differentially expressed genes are selected from each of these two
   contrasts. The corresponding gene sets are Cvi500 and Sha500.
7. Optimal parameters, ntree and mtry, in Random Forest are chosen for Cvi500 and
   Sha500.
8. Highly ranked genes (by variable importance) are selected from Cvi500 and Sha500.
   There are 43 genes chosen from Cvi500 and 84 genes chosen from Sha500.
9. The OOB error rates for the selected genes are compared.
10. The arrays misclassified by Random Forest are discussed.
11. The gene functions of the selected genes are considered.

Chapter 2 Statistical Methodology

In this chapter, clustering (2.1), linear models for Microarray Data (2.2), and Random
Forest (2.3) will be discussed.

(2.1) Clustering is the first step in our gene selection process. In this section, we use
the Hierarchical Clustering method to group the 10 ecotypes into subsets, and those
subsets will be contrasted with linear models.

(2.2) Limma is the second step. In this step, we use the Limma method to choose smaller
subgroups of genes that are differentially expressed, by contrasting the subsets of
ecotypes obtained from the clustering result. We explain the differentially expressed
genes.

(2.3) Random Forest is a method to rank genes by their importance in classifying
ecotypes. In this section, we will explain the Random Forest algorithm and the selection
of important predictor variables (genes) from the gene sets chosen with the linear models.

2.1. Clustering

Grouping a collection of observations into subgroups (clusters) is called Clustering.
Observations within each cluster have a smaller distance to each other than to
observations assigned to other clusters.

In Hierarchical Clustering (Jinwook Seo, Ben Shneiderman 2002), the observations
are not separated into subgroups in only one step. Instead, observations are separated by
a series of partitions. Clustering may start from a single cluster containing all
observations, which is split into subgroups of observations; this is called the Divisive
method. Alternatively (Figure 2), it may start from n clusters (if there are n
observations), each containing one observation, then find the closest pair of clusters
and combine them into a single cluster. In the end, all clusters will be combined into
one cluster; this is called the Agglomerative method. The Agglomerative method is used
here to identify latitude and altitude groups (Table 3).

Table 3 Geography of Ecotypes.

     Ecotype    Location                   Altitude  Group(Al)  Latitude  Group(La)
 1   Bay-0      Bayreuth, Germany          350       C          49.56     B
 2   C24        Coimbra, Portugal          179       C          40.2      C
 3   Col-0      Columbia University (US)   49        D          43.0125   C
 4   Cvi        Cape Verde Islands         43        D          16        D
 5   Est        Estonia                    15        D          59        A
 6   Kin-0      Kinneville, MI             273       C          42.466    C
 7   Ler        Landsberg, Germany         628       B          48.2      B
 8   Nd-1       Niederzunzheim, Germany    250       C          50.778    B
 9   Shakdara   Pamiro-Alay, Tadjikistan   3400      A          37.183    C
10   Van-0      UBC (Vancouver)            50        D          49.85     B

 

 

 

 

 

 

 

The Agglomerative method proceeds as follows. We are given a set of n observations
(ecotypes) to be grouped and an n x n distance matrix (Euclidean distance) that holds
the distance between each pair of observations.

Step 1. Start with n clusters, each containing a single observation.

Step 2. Select the closest pair of clusters and merge them into one new cluster.

Step 3. Calculate the distances between the new cluster and each of the remaining clusters.

Step 4. Repeat Step 2 and Step 3 until all observations are merged into one cluster.
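The four steps can be sketched in a few lines of Python. This is only an illustration, not the code used in the thesis: the linkage rule is not stated in the text, so complete linkage (the largest pairwise distance between two clusters) is assumed here, the function names are hypothetical, and merging stops at four clusters instead of one so that the result can be compared with the latitude groups in Table 3.

```python
# Minimal agglomerative clustering sketch on the latitudes from Table 3.
# Assumption: complete linkage; the thesis does not state the linkage rule.

def complete_linkage(values, a, b):
    """Distance between clusters a and b: largest pairwise distance."""
    return max(abs(values[i] - values[j]) for i in a for j in b)

def agglomerate(values, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(values))]      # Step 1
    while len(clusters) > n_clusters:                 # Step 4
        best = None
        for p in range(len(clusters)):                # Step 2 and Step 3
            for q in range(p + 1, len(clusters)):
                d = complete_linkage(values, clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

names = ["Bay-0", "C24", "Col-0", "Cvi", "Est",
         "Kin-0", "Ler", "Nd-1", "Shakdara", "Van-0"]
latitudes = [49.56, 40.2, 43.0125, 16, 59, 42.466, 48.2, 50.778, 37.183, 49.85]

groups = [sorted(names[i] for i in c) for c in agglomerate(latitudes, 4)]
for g in groups:
    print(g)
```

Under this assumption the four clusters coincide with the latitude groups A to D listed in Table 3, with Cvi and Est each forming a cluster of their own.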

Figure 2 Hierarchical Clustering Process

[Diagram: hierarchical clustering of five observations E1, E2, E3, E4, E5, illustrating the Agglomerative method]

Hierarchical clustering was used to cluster the 10 ecotypes into subgroups according to
their altitude and latitude (Figure 3). From Figure 3, we can see that Cvi and Shakdara
differ the most from the remaining ecotypes.

Figure 3 Ecotype Cluster

[Dendrograms of the 10 ecotypes: latitude cluster and altitude cluster]


2.2. Limma — Linear Models for Microarray Data

Before Random Forest is applied to gene sets, we use Limma, Linear Models for
Microarray Data (Smyth, G. K. 2004), to choose smaller subgroups of genes that differ
between ecotypes. The grouping will be discussed in the following paragraph.
Differentially expressed genes will be used in Random Forest to classify ecotypes and to
assign ranks to the genes.

Limma is used to identify genes whose expression pattern differs from others.
Limma is a software package in Bioconductor in the R environment
(http://www.r-project.org) for the analysis of gene expression microarray data. Linear
models are constructed for each gene to determine whether it is differentially expressed
in the subgroups of ecotypes defined by the latitude and altitude clusters. In the
topTable function of Limma, the M-value, t-statistic, B-statistic and P.Value of each
gene provide an overall ranking of genes in order of differential expression. The
M-value is the log2-fold change between two groups.

$$M = \log_2\left(\frac{\text{expression value of gene in group A}}{\text{expression value of gene in group B}}\right)$$

The t-statistic is the familiar test statistic for comparing the means of two groups.
The B-statistic is the log odds that the gene is differentially expressed. For example,
if the B-statistic is 3.5, the probability that the gene is differentially expressed is

$$\frac{e^{3.5}}{1 + e^{3.5}} \approx 97\%.$$

A larger B-statistic indicates a higher probability that the gene is differentially expressed.
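The conversion from a B-statistic (log odds) to a probability can be checked with a short Python sketch. The function name is hypothetical and is not part of Limma; the code only evaluates the formula given above.

```python
import math

def b_to_probability(b):
    """Convert a log-odds value (B-statistic) to a probability."""
    return math.exp(b) / (1.0 + math.exp(b))

# B = 3.5 corresponds to a probability of about 0.97;
# B = 0 corresponds to even odds.
print(round(b_to_probability(3.5), 2))
print(b_to_probability(0.0))
```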
The P.Value is adjusted for multiple hypothesis testing using the Benjamini-Hochberg
method (BH). The B-statistic and the P.Value provide the same ranking when no data are
missing. In topTable, differentially expressed genes are ranked by their P.Values.

The Benjamini-Hochberg method controls the false discovery rate (FDR) when testing
thousands of hypotheses, as in microarray data. We identify genes differentially
expressed in the subgroups obtained from Hierarchical Clustering (Figure 3) and assign
the letters A, B, C, D to those four groups (Table 3).
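As an illustration of the Benjamini-Hochberg adjustment, the following Python sketch computes adjusted p-values for a small set of raw p-values. It is a plain re-implementation of the standard step-up procedure, not the code used by Limma, and the function name and the example p-values are made up for the illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]      # hypothetical raw p-values
adj = bh_adjust(raw)
significant = [p < 0.05 for p in adj]
print(adj, significant)
```

A gene is then called significant at FDR level 0.05 when its adjusted p-value is below 0.05.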


2.3. Random Forest

The Random Forest algorithm by Leo Breiman (L. Breiman 2001) is a classification
procedure consisting of a collection of tree-structured classifiers. Each tree depends
on an independent, identically distributed random vector, and each tree casts a unit
vote for the class of each input vector (array). Random Forest can analyze high
dimensional data efficiently. Randomization occurs at two levels in Random Forest: trees
and nodes. Trees are built from bootstrap samples, and each node is split using
randomly selected predictor variables (genes).

In the ecotype data, there are ten ecotypes and each of them has 3 arrays, so there
are 30 arrays in the ecotype data. Moreover, each array has 22810 genes. In
Random Forest, the 30 arrays are the “input vectors” (class observations) and the 22810
genes are the “predictor variables”. N arrays are selected at random with replacement
from those 30 arrays to form the training set (in-bag). The arrays that are not included
in the training set are called out-of-bag (OOB). The training set data are used to grow
the tree. The OOB data are used to estimate the classification error rate and to obtain
a variable importance measure.

Figure 4 Random Forest Construction

[Diagram: the 30 arrays (from 10 ecotypes; each ecotype has 3 arrays; each array contains 22810 predictor variables (genes)) are resampled into 500 bootstrap samples. Each bootstrap sample is divided into in-bag and OOB data, and a tree is built from each in-bag data set.]


Each tree is grown as follows.

Step 1. Take N observations (arrays) at random and with replacement from the original
        data set. These form the training set, called “in-bag”. The observations not
        selected are called “out-of-bag”. On average, about two thirds of the
        observations will be “in-bag” and one third “out-of-bag”.

Step 2. The observations in the training set are used to construct a decision
        tree. The number of variables is M. At each node, a fixed number mtry
        (mtry << M) of variables is chosen randomly from the M variables, and mtry is
        held constant while the forest is grown. These mtry variables are the
        candidates for splitting the node. The best split on these randomly chosen
        variables is used to split the node. For example, if we have M = 5 variables,
        we can choose mtry = 3 variables to split the node. There are $C_3^5 = 10$
        candidate sets, each with 3 variables. One of these 10 candidate sets is
        chosen at random, and the best predictor variable (gene) among its 3 variables
        is used to split the node. Each tree is grown as large as possible, without
        pruning.

Step 3. Repeat Step 1 and Step 2 to construct 500 trees, i.e. ntree = 500 (the default
        number in R). Thus, the algorithm is called “Random Forest.”

Step 4. Each tree gives a classification among the 10 ecotypes; we say the tree “votes”
        for that class (ecotype). For example, if AtGE_111_A is predicted to be Bay0 at
        the terminal node, we say the tree “votes” Bay0 for AtGE_111_A, and similarly
        for the other arrays. As the tree is built, each array is assigned to a class
        (ecotype) at a terminal node (vote). A tree is built for each of the ntree
        bootstrap samples. The majority vote for an array over the forest is its
        predicted class (ecotype).
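The two randomization steps (bootstrap sampling of arrays in Step 1 and random selection of mtry candidate variables in Step 2) can be sketched in Python as follows. The function names are hypothetical and the tree fitting itself is omitted; the sketch only shows how in-bag and out-of-bag sets and candidate variables could be drawn.

```python
import math
import random

def bootstrap_split(n_obs, rng):
    """Draw n_obs indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_obs) for _ in range(n_obs)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_obs) if i not in chosen]
    return in_bag, out_of_bag

def candidate_variables(n_vars, mtry, rng):
    """Choose mtry distinct variable indices as split candidates for a node."""
    return rng.sample(range(n_vars), mtry)

rng = random.Random(0)                      # fixed seed for reproducibility
in_bag, out_of_bag = bootstrap_split(30, rng)
candidates = candidate_variables(5, 3, rng)

# Number of possible candidate sets when M = 5 and mtry = 3.
n_candidate_sets = math.comb(5, 3)

# Expected fraction of arrays left out of a bootstrap sample of size 30.
expected_oob_fraction = (1 - 1 / 30) ** 30

print(len(set(in_bag)), len(out_of_bag), candidates, n_candidate_sets)
```

For N = 30 the expected out-of-bag fraction is roughly 36%, which is the "about one third" mentioned in Step 1.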

Notations
M : 22810 genes.
N : 30 arrays.
mtry : The number of variables (genes) randomly chosen as split candidates at each tree node.
ntree : The number of trees (bootstraps) in the forest.

In the original paper on Random Forests (Leo Breiman 2001), it was shown that the
error rate in Random Forest depends on two properties: the pairwise correlation between
trees and the strength of each individual tree. The correlation is the extent to which
the trees are similar to one another. The strength is the overall average
prediction quality. Higher correlation between trees will increase the error rate, and
larger strength of each individual tree will reduce the error rate. Increasing the
number of variables, mtry, will increase both the correlation between trees and the
strength of each individual tree. Decreasing mtry decreases both of them. Therefore, we
can use the error rate to estimate the optimal mtry. The parameter mtry is the only
adjustable parameter to which Random Forest is sensitive. The classification by Random
Forest is the class (ecotype) that receives the most votes over all trees.

Features of Random Forest

• It runs efficiently on thousands of observations.

• It can handle a large number of predictor variables (genes).

• It can rank the importance of predictor variables (genes) in the classification.

Parameters of ntree and mtry

In Random Forest, the most important and sensitive parameters are the number
of trees (ntree) and the number of variables (mtry) that are selected at random from all
variables. Each ecotype represents a class in Random Forest. We want to find the
optimal ntree and mtry to lower the OOB error rate, since the OOB error rate indicates
how well the ecotypes can be classified. The optimal values for ntree and mtry are not
unique.
OOB error estimate

About one third of the observations (arrays) are not included in the training set of a
given tree. Each tree is used to classify the observations left out of its training set
(OOB). If class j receives the most votes over all the trees for which observation n is
in the OOB data, class j is the predicted class of observation n. The proportion of
observations, out of all N observations, for which j is not equal to the true class i
is the OOB error rate estimate.

$$\text{OOB error rate} = \frac{1}{N}\sum_{n=1}^{N} I(C_{nj} \mid C_{ni})$$

(There are 500 bootstrap samples here.)

For observation n:
$C_{nj}$ : class j gets the most votes (over all the times observation n is in the OOB data)
$C_{ni}$ : the true class for observation n is i
N : the number of observations

$$I(C_{nj} \mid C_{ni}) = \begin{cases} 0 & \text{if } j = i \\ 1 & \text{if } j \neq i \end{cases}$$
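The OOB error rate can be sketched in Python from the votes that each observation collects while it is out-of-bag. The function name and the toy votes below are hypothetical; ties between classes are not treated specially in this sketch.

```python
from collections import Counter

def oob_error_rate(true_classes, oob_votes):
    """Fraction of observations whose majority OOB vote differs from the truth.

    true_classes: list with the true class i of each observation n
    oob_votes:    list of lists; oob_votes[n] holds the classes voted by the
                  trees for which observation n was out-of-bag
    """
    errors = 0
    for true_class, votes in zip(true_classes, oob_votes):
        predicted = Counter(votes).most_common(1)[0][0]   # class j with most votes
        if predicted != true_class:                       # I(C_nj | C_ni) = 1
            errors += 1
    return errors / len(true_classes)

# Toy example with three observations (made-up votes).
truth = ["Bay0", "Bay0", "Cvi"]
votes = [["Est", "Est", "Bay0"],     # majority Est  -> misclassified
         ["Bay0", "Bay0", "Est"],    # majority Bay0 -> correct
         ["Cvi"]]                    # majority Cvi  -> correct
rate = oob_error_rate(truth, votes)
print(rate)
```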


Example:

There are 10 classes (ecotypes) and each class has 3 arrays, so there are 30
observations in the data. The confusion matrix is computed as follows. For example, the
observations AtGE_111_A, AtGE_111_B, and AtGE_111_C belong to class Bay0. In the random
forest procedure, class Est gets the most votes for AtGE_111_A, which implies that

$$I(C_{\text{AtGE\_111\_A},\,j} \mid C_{\text{AtGE\_111\_A},\,i}) = 1 \quad (j = \text{Est},\ i = \text{Bay0}),$$

and class Bay0 gets the most votes for AtGE_111_B and AtGE_111_C, which implies that

$$I(C_{\text{AtGE\_111\_B},\,j} \mid C_{\text{AtGE\_111\_B},\,i}) = 0 \quad\text{and}\quad I(C_{\text{AtGE\_111\_C},\,j} \mid C_{\text{AtGE\_111\_C},\,i}) = 0.$$

In our example, there are 21 observations with $I(C_{nj} \mid C_{ni}) = 1$, so the OOB
error rate is

$$\frac{1}{N}\sum_{n=1}^{N} I(C_{nj} \mid C_{ni}) = \frac{21}{30} = 70\%.$$

OOB estimate of error rate: 70%
Confusion matrix:

          Bay0  C24  Col0  Cvi  Est  Kin0  Ler  Nd1  Shakdara  Van0  class.error
Bay0         2    0     0    0    1     0    0    0         0     0    0.3333333
C24          0    1     0    0    0     0    0    0         2     0    0.6666667
Col0         0    0     1    0    0     0    1    0         0     1    0.6666667
Cvi          0    0     0    1    0     1    0    0         0     1    0.6666667
Est          2    0     0    0    0     0    0    1         0     0    1.0000000
Kin0         0    0     1    1    0     0    0    0         0     1    1.0000000
Ler          0    0     1    1    1     0    0    0         0     0    1.0000000
Nd1          0    0     0    0    0     1    1    1         0     0    0.6666667
Shakdara     0    0     0    0    0     0    0    0         3     0    0.0000000
Van0         0    0     1    2    0     0    0    0         0     0    1.0000000
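The arithmetic of this example can be reproduced with a short Python sketch: the per-class error is one minus the diagonal entry divided by the row total, and the overall OOB error rate is the number of off-diagonal observations divided by N. The variable names are chosen for this illustration only.

```python
classes = ["Bay0", "C24", "Col0", "Cvi", "Est",
           "Kin0", "Ler", "Nd1", "Shakdara", "Van0"]

# Rows: true class; columns: predicted class (same order as `classes`).
confusion = [
    [2, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 2, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    [2, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 3, 0],
    [0, 0, 1, 2, 0, 0, 0, 0, 0, 0],
]

n_total = sum(sum(row) for row in confusion)
n_correct = sum(confusion[k][k] for k in range(len(classes)))
n_wrong = n_total - n_correct
oob_error = n_wrong / n_total

# Per-class error: share of each row that falls off the diagonal.
class_error = {
    name: 1 - confusion[k][k] / sum(confusion[k])
    for k, name in enumerate(classes)
}

print(n_wrong, n_total, oob_error)
```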

 

 

 

Variable importance

Much interest in bioinformatics is given to variable importance measures. In this
study, we rank the genes and thus reduce the number of variables. A variable importance
measure is obtained from the OOB data as the trees are built. The most important
predictor variables (genes) are identified by calculating an importance score for each
predictor variable (gene). For a predictor variable (gene) X, the gene expression values
of gene X are permuted in the OOB data of each tree, and the permuted OOB data are
passed down the tree. For each tree, the number of votes for the correct class with
permutation is subtracted from the number of votes for the correct class without
permutation. The average of these differences over all trees is the raw importance
score. The raw importance score is normalized by dividing by its standard error. There
are fewer correct votes when the values of an important predictor variable (gene) are
permuted. Thus, a higher importance score identifies a gene with more discriminatory
power.

$$\text{Raw-Score}(X) = \frac{1}{n_{tree}} \sum_{tree_i} \left(N_{\text{without-permutation}} - N_{\text{permutation}}\right)_{tree_i}$$

$$\text{Z-Score}(X) = \frac{\text{Raw-Score}(X)}{\sqrt{\mathrm{Variance}\left(N_{\text{without-permutation}} - N_{\text{permutation}}\right)}}$$

$N_{\text{without-permutation}}$ : the number of votes for the correct class without permutation

$N_{\text{permutation}}$ : the number of votes for the correct class after permutation
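The two scores can be illustrated with a small Python sketch that starts from per-tree counts of correct OOB votes before and after permuting gene X. The counts below are made up, the function name is hypothetical, and the normalization follows the Z-Score formula as displayed (division by the square root of the sample variance of the per-tree differences). Dividing by sqrt(variance / ntree) instead would give the standard-error scaling mentioned in the prose; which of the two was actually used is not settled by the text.

```python
import math
import statistics

def importance_scores(correct_without, correct_with):
    """Raw and normalized permutation importance for one gene.

    correct_without[t]: correct OOB votes of tree t without permutation
    correct_with[t]:    correct OOB votes of tree t after permuting gene X
    """
    diffs = [a - b for a, b in zip(correct_without, correct_with)]
    raw = sum(diffs) / len(diffs)              # average over all trees
    variance = statistics.variance(diffs)      # sample variance (assumption)
    z = raw / math.sqrt(variance) if variance > 0 else None
    return raw, z

# Hypothetical counts for four trees.
without_perm = [5, 6, 7, 5]
with_perm = [3, 4, 6, 3]
raw_score, z_score = importance_scores(without_perm, with_perm)
print(raw_score, z_score)
```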


Chapter 3 Results of Limma and Random Forest to Ecotypes
3.1. Gene Selection using Limma

There are thirty-four ecotypes in the original Ecotype data. Here, we use only the
first ten ecotypes, which have been replicated. These ecotypes come from locations on
several continents.

We examine the latitude clusters first and divide the 22810 genes into seven sets
(Figure 5) according to the number of contrasts in which a gene is significant. The
seven sets are:

• 6: genes are significant in all six contrast combinations

• 5: genes are significant in any five of six contrast combinations

• 4: genes are significant in any four of six contrast combinations

• 3: genes are significant in any three of six contrast combinations

• 2: genes are significant in any two of six contrast combinations

• 1: genes are significant in any one of six contrast combinations

• 0: genes are not significant in any of the six contrast combinations.
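The division into seven sets amounts to counting, for every gene, in how many of the six contrasts its adjusted p-value falls below the significance level. A minimal Python sketch, with hypothetical gene names and made-up adjusted p-values, is:

```python
def count_by_number_significant(adjusted_p, alpha=0.05):
    """Map k = 0..6 to the number of genes significant in exactly k contrasts.

    adjusted_p: dict of gene id -> list of six adjusted p-values, one per
                contrast (A-B, A-C, A-D, B-C, B-D, C-D)
    """
    counts = {k: 0 for k in range(7)}
    for pvals in adjusted_p.values():
        k = sum(1 for p in pvals if p < alpha)
        counts[k] += 1
    return counts

# Three hypothetical genes.
example = {
    "gene_a": [0.01, 0.02, 0.001, 0.03, 0.04, 0.002],   # significant in 6
    "gene_b": [0.20, 0.60, 0.010, 0.90, 0.03, 0.040],   # significant in 3
    "gene_c": [0.50, 0.70, 0.300, 0.90, 0.80, 0.600],   # significant in 0
}
sets = count_by_number_significant(example)
print(sets)
```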

Figure 5 Number of significant contrasts

[Two bar charts of the number of genes by number of significant contrasts. Values read from the chart legends:]

Number of significant contrasts    0      1    2    3    4    5   6
Genes (left panel)                 21444  276  382  584  96   27  1
Genes (right panel)                20705  480  570  759  225  67  4


Similarly, for the altitude clusters, genes are also divided into seven sets (Figure 5).
As expected, most genes are not statistically significant. Moreover, we are also
interested in the number of significant genes per contrast (Table 4 and Figure 6).

Table 4 The number of significant genes per contrast

          Latitude                           Altitude
Contrast  Number of Significant Genes       Contrast  Number of Significant Genes
A-B       349                               A-B       1118
A-C       403                               A-C       924
A-D       736                               A-D       916
B-C       295                               B-C       754
B-D       775                               B-D       897
C-D       759                               C-D       547

 

 

 

Figure 6 The number of significant genes per contrast

[Two bar charts of the number of significant genes for the six contrasts (A-B, A-C, A-D, B-C, B-D, C-D): contrasts in latitude and contrasts in altitude]


From Table 4, we can see that the contrasts A-D, B-D and C-D have the largest numbers
of significant genes at significance level 0.05 in the latitude grouping, and the contrasts
A-B, A-C and A-D have the largest numbers of significant genes at significance level 0.05 in
the altitude grouping. Group D in latitude and group A in altitude therefore differ most
from the other groups. These correspond to Cvi (group D in latitude) and Shakdara
(group A in altitude), so we discuss these two ecotypes (Cvi and Shakdara)
in more detail in the following sections. In total, 423 genes are significant in all three
contrasts A-D, B-D and C-D for latitude, and 8 genes are significant in all three contrasts
A-B, A-C and A-D for altitude (Figure 5). Only one gene (247999_at) appears in both the
423 latitude genes and the 8 altitude genes. As expected, gene expression differs the most in
the Shakdara and Cvi ecotypes compared to the others (Figure 3).
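
Finding the genes that are significant in all three contrasts, and then the overlap between the latitude and altitude results, is a set intersection. The sketch below is illustrative only: apart from 247999_at, which is named in the text, all gene identifiers are hypothetical placeholders.

```python
def significant_in_all(sig_by_contrast, contrasts):
    """Genes that are significant in every one of the named contrasts."""
    gene_sets = [set(sig_by_contrast[c]) for c in contrasts]
    return set.intersection(*gene_sets)

# Hypothetical significant-gene lists per contrast.
latitude = {
    "A-D": ["247999_at", "lat_gene_1", "lat_gene_2"],
    "B-D": ["247999_at", "lat_gene_1"],
    "C-D": ["247999_at", "lat_gene_1", "lat_gene_3"],
}
altitude = {
    "A-B": ["247999_at", "alt_gene_1"],
    "A-C": ["247999_at", "alt_gene_1", "alt_gene_2"],
    "A-D": ["247999_at", "alt_gene_1"],
}

lat_core = significant_in_all(latitude, ["A-D", "B-D", "C-D"])
alt_core = significant_in_all(altitude, ["A-B", "A-C", "A-D"])
shared = lat_core & alt_core  # genes present in both core sets
print(sorted(shared))
```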

Figure 7 Significant genes for Latitude and Altitude.

[Figure 7: Venn diagrams of the significant genes per contrast (extracted panel labels: Latitude A-D, Altitude A-B, Altitude A-C, Altitude A-D). The 423 significant genes from latitude and the 8 significant genes from altitude share one gene, 247999_at (AT5G56150). A line plot of 247999_at shows Gene Expression Value across the ecotypes Bay-0, C24, Col-0, Cvi, Est, Kin-0, Ler, Nd-1, Sha and Van-0.]

Annotation of 247999_at: ubiquitin-conjugating enzyme, putative, strong similarity to ubiquitin-conjugating enzyme UBC2 (Mesembryanthemum crystallinum) GI:5762457, UBC4 (Pisum sativum) GI:456568; contains Pfam profile PF00179: Ubiquitin-conjugating enzyme.

3.2. Ecotypes of Cvi and Shakdara
We are interested in how Cvi and Shakdara differ from the other 8 ecotypes. Thus,
we examine the contrast between Cvi and the average of the other 8 ecotypes, and the
contrast between Shakdara and the average of the other 8 ecotypes:

• Cvi - (1/8)(Bay0 + C24 + Col0 + Est + Kin0 + Ler + Nd1 + Van0)
• Sha - (1/8)(Bay0 + C24 + Col0 + Est + Kin0 + Ler + Nd1 + Van0)

For each of these two contrasts, we perform multiple comparisons and select the top 500
differentially expressed genes ranked by P.Value. This gives two sets of genes,
each containing 500 genes.

We use Random Forest to reduce the number of genes and to decide which of these
highly significant genes most affect the classification performance for the 10 ecotypes.
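
The two contrasts can be written as weight vectors over the ten ecotypes: 1 for the ecotype of interest, -1/8 for each of the eight comparison ecotypes, and 0 for the remaining one. The following Python sketch illustrates this under stated assumptions; the function names and the toy expression means are hypothetical, and the thesis itself built the contrasts in Limma.

```python
ECOTYPES = ["Bay0", "C24", "Col0", "Cvi", "Est", "Kin0", "Ler", "Nd1", "Sha", "Van0"]

def one_vs_rest_contrast(target, excluded=()):
    """Weight 1 for `target`, -1/k for each of the k comparison ecotypes, 0 for excluded ones."""
    rest = [e for e in ECOTYPES if e != target and e not in excluded]
    weights = {e: 0.0 for e in ECOTYPES}
    weights[target] = 1.0
    for e in rest:
        weights[e] = -1.0 / len(rest)
    return weights

def contrast_value(weights, group_means):
    """Linear combination of group means defined by the contrast weights."""
    return sum(weights[e] * group_means[e] for e in ECOTYPES)

# Cvi against the other 8 ecotypes (Sha left out), and Sha against the other 8 (Cvi left out).
cvi_contrast = one_vs_rest_contrast("Cvi", excluded=("Sha",))
sha_contrast = one_vs_rest_contrast("Sha", excluded=("Cvi",))

# Hypothetical mean log-expression of one gene in each ecotype.
toy_means = {e: 5.0 for e in ECOTYPES}
toy_means["Cvi"] = 8.0
cvi_effect = contrast_value(cvi_contrast, toy_means)
sha_effect = contrast_value(sha_contrast, toy_means)
print(cvi_effect, sha_effect)
```

The weights of each contrast sum to zero, so a gene with equal expression in all ecotypes has a contrast value of zero.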


3.3. Gene Selection from Cvi contrasts with other 8 ecotypes

Genes were selected from the topTable output of Limma for the contrast:

• Cvi - (1/8)(Bay0 + C24 + Col0 + Est + Kin0 + Ler + Nd1 + Van0)

The top 500 differentially expressed genes for this contrast were selected with Limma and
called Cvi500. We then use Random Forest to find the optimal number of
Cvi500 genes for classification. Before selecting the top-ranked genes from Random
Forest, we first need to find the values of ntree and mtry that minimize the OOB error rate. The
procedure for finding the optimal ntree is as follows:

1. Run Random Forest with different numbers of trees, keeping mtry at its default value
   (the default mtry is √(number of variables) ≈ 151).

2. Repeat step 1 ten times and average the OOB error rate over the ten runs for each
   number of trees.

3. The number of trees with the lowest average OOB error rate is taken as the optimal
   number of trees, i.e. ntree. For Cvi500, the optimal number of trees is 297 (Figure 9).
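
The three steps above amount to a repeated grid search. The sketch below shows only that search logic in Python; `toy_oob_error` is a hypothetical stand-in for fitting a Random Forest and reading off its OOB error rate, and the candidate grid is an assumption, not the grid used in the thesis.

```python
import random
from statistics import mean

def tune_parameter(candidates, oob_error, repeats=10, seed=0):
    """Average the OOB error over `repeats` runs per candidate value.

    Returns the candidate with the lowest average error and all averages.
    """
    rng = random.Random(seed)
    averages = {}
    for value in candidates:
        runs = [oob_error(value, rng.randrange(2**31)) for _ in range(repeats)]
        averages[value] = mean(runs)
    best = min(averages, key=averages.get)
    return best, averages

def toy_oob_error(ntree, run_seed):
    """Hypothetical error model: error shrinks with ntree, plus run-to-run noise."""
    noise = random.Random(run_seed).uniform(-0.01, 0.01)
    return max(0.0, 0.05 + 2.0 / ntree + noise)

best_ntree, avg_error = tune_parameter(range(50, 301, 50), toy_oob_error)
print(best_ntree, round(avg_error[best_ntree], 4))
```

Averaging over repeated runs matters because each forest is grown from random bootstrap samples, so a single run can favor a value of ntree by chance.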

Figure 9 Optimal value of ntree for Cvi

[Figure 9: OOB error rate (y-axis) against the number of trees (x-axis, 0 to 300) for Cvi.]

After finding the optimal ntree, we find the optimal mtry as follows:

4. Run Random Forest with ntree = 297 and different values of mtry near
   √(number of variables) ≈ 151. Here mtry ranges from 130 to 170.

5. Repeat step 4 ten times and average the OOB error rate over the ten runs for each
   value of mtry.

6. The value of mtry with the lowest average OOB error rate is taken as the optimal
   mtry. For Cvi500, the optimal mtry is 148 (Figure 10).
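
For orientation on the default mentioned in steps 1 and 4: the randomForest package in R uses floor(sqrt(p)) as the classification default for mtry, where p is the number of predictor variables. The Python check below is illustrative only. Taking p = 22,810, the number of probe sets on the ATH1 array, is an assumption made here to reproduce the quoted value of about 151; the text does not state which p was used.

```python
import math

def default_mtry(n_variables):
    """floor(sqrt(p)), the classification default of R's randomForest."""
    return max(1, math.isqrt(n_variables))

ATH1_PROBE_SETS = 22810  # assumption: all probe sets on the ATH1 array
N_SELECTED = 500         # size of the Cvi500 and Sha500 gene sets

print(default_mtry(ATH1_PROBE_SETS))  # 151
print(default_mtry(N_SELECTED))       # 22

# Search range for mtry used in step 4 (130 to 170 inclusive).
mtry_grid = list(range(130, 171))
```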

Figure 10 Optimal value of mtry for Cvi

[Figure 10: OOB error rate (y-axis) against mtry (x-axis, 130 to 170) for Cvi; the marked optimum is mtry = 148.]

3.4. Gene Selection from Shakdara contrasts with other 8 ecotypes

Genes were selected from the topTable output of Limma for the contrast:

• Sha - (1/8)(Bay0 + C24 + Col0 + Est + Kin0 + Ler + Nd1 + Van0)

The top 500 differentially expressed genes for this contrast were selected with Limma and
called Sha500. Following the same procedure for finding the optimal ntree and mtry, we
obtain an optimal ntree of 291 and four values of mtry that give the smallest
Random Forest OOB error rate: 133, 148, 161 and 163 (Figures 11 and 12).

Figure 11 Optimal value of ntree for Shakdara

[Figure 11: OOB error rate (y-axis) against the number of trees (x-axis, 0 to 300) for Shakdara; the marked optimum is ntree = 291.]

Figure 12 Optimal value of mtry for Shakdara

[Figure 12: OOB error rate (y-axis) against mtry (x-axis, 130 to 170) for Shakdara; the marked values are mtry = 133, 148, 161 and 163.]

3.5. Gene Selection from Cvi500 and Sha500 by Random Forest

To reduce the number of genes in Cvi500 and Sha500, we select important genes
with Random Forest; the question is how many genes are needed for the best
classification performance. Besides ntree and mtry, we are interested in the number of
genes that gives the smallest OOB error rate. From the procedure above for finding the
optimal mtry and ntree (Figures 9, 10, 11 and 12), values of ntree greater than 200 give a
stable, small OOB error rate, whereas mtry shows no clear association with the
OOB error rate. Thus, we select the most important genes from Random
Forest with ntree = 200 and mtry at its default, for Cvi500 and Sha500 separately.
Genes are ranked by the importance measure MeanDecreaseAccuracy.

In Cvi500, 43 genes is the smallest number that gives optimal classification (Figure 13). In
Sha500, 84 genes is the smallest number that gives optimal classification (Figure 14).
Comparing those two sets of selected genes, there are 43 genes in the intersection of
Cvi500 and Sha500, and there are 4 genes in the intersection of Cvi43 and Sha84 (Figure 15).
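
The selection rule, ranking genes by importance and keeping the smallest set that reaches the lowest error, can be sketched as below. This is an illustration under assumptions: the error values and importance scores are invented, and the candidate set sizes merely mirror the axis of Figures 13 and 14.

```python
def top_genes(importance, n):
    """The n genes with the highest importance score, best first."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:n]

def smallest_optimal_size(error_by_size, tolerance=0.0):
    """Smallest number of genes whose error is within `tolerance` of the minimum."""
    best = min(error_by_size.values())
    return min(n for n, err in error_by_size.items() if err <= best + tolerance)

# Hypothetical OOB error rates for nested gene sets of decreasing size.
toy_errors = {2: 0.40, 5: 0.30, 10: 0.20, 20: 0.10, 50: 0.05, 100: 0.05, 200: 0.05, 500: 0.08}
n_genes = smallest_optimal_size(toy_errors)

# Hypothetical MeanDecreaseAccuracy-style scores.
toy_importance = {"g1": 0.1, "g2": 0.5, "g3": 0.3}
selected = top_genes(toy_importance, 2)
print(n_genes, selected)
```

Preferring the smallest set among ties keeps the classifier compact without giving up accuracy.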

Figure 13 Optimal value of the number of genes for Cvi

[Figure 13: OOB error rate against the number of variables used (x-axis on a log scale, 2 to 500) for Cvi; the marked optimum is 43 genes.]

Figure 14 Optimal value of the number of genes for Shakdara

[Figure 14: OOB error rate against the number of variables used (x-axis on a log scale, 2 to 500) for Shakdara; the marked optimum is 84 genes.]

Figure 15 Overlapping genes from Cvi500 and Sha500

[Figure 15: Venn diagram with the labels "102 Genes overlapping Cvi500 and Sha500" and "4 Genes from intersection of Cvi43 and Sha84".]

3.6. Compare the OOB error rate of Random Forest

Several sets of genes were selected with Limma and Random Forest. We have two
sets of 500 genes selected from the topTable output of Limma, Cvi500 and Sha500.
Moreover, we have a set of 43 genes from Cvi500 and a set of 84 genes from Sha500.
Table 5 shows the OOB error rate for Cvi500 and Sha500 and compares
the results with and without the optimal values of ntree and mtry.
Table 5 also shows the OOB error rate for the selected 43 genes and the
selected 84 genes without adjusting parameters.

Table 5 Comparison of OOB error rate

Genes     Status                                                                    Number of Genes    OOB error rate
Cvi500    Without optimal values of ntree and mtry                                  500                16.67%
Cvi500    With optimal values of ntree and mtry and the smallest number of genes    43                 6.67%
Sha500    Without optimal values of ntree and mtry                                  500                10%
Sha500    With optimal values of ntree and mtry and the smallest number of genes    84                 3.33%

3.7. Misclassiﬁcations of Ecotypes

When running Random Forest, some arrays are misclassified, and each run
misclassifies a different set of arrays. Table 6 lists the misclassified arrays
from all Random Forest runs. The most frequently misclassified ecotypes are Van0 and Kin0.
The array ATGE_116_B.CEL (Kin0) is often misclassified.

Table 6 Misclassification List

Array             Actual Ecotype    Predicted Ecotype
ATGE_112_A.CEL    C24               Shakdara
ATGE_115_D.CEL    Est               Col0
ATGE_116_A.CEL    Kin0              Van0
ATGE_116_B.CEL    Kin0              Van0, Shakdara, Bay0
ATGE_116_C.CEL    Kin0              Van0
ATGE_117_D.CEL    Ler               Est
ATGE_120_A.CEL    Van0              Kin0
ATGE_120_C.CEL    Van0              Kin0
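
The pattern in Table 6 can be tallied directly. The Python sketch below transcribes the table (ecotype names written as Kin0, Van0, Col0 and Bay0) and counts misclassified arrays per actual ecotype and per actual/predicted pair. It counts arrays, not runs, because the table does not report how often each error occurred.

```python
from collections import Counter

# (array, actual ecotype, predicted ecotypes), transcribed from Table 6.
MISCLASSIFIED = [
    ("ATGE_112_A.CEL", "C24", ["Shakdara"]),
    ("ATGE_115_D.CEL", "Est", ["Col0"]),
    ("ATGE_116_A.CEL", "Kin0", ["Van0"]),
    ("ATGE_116_B.CEL", "Kin0", ["Van0", "Shakdara", "Bay0"]),
    ("ATGE_116_C.CEL", "Kin0", ["Van0"]),
    ("ATGE_117_D.CEL", "Ler", ["Est"]),
    ("ATGE_120_A.CEL", "Van0", ["Kin0"]),
    ("ATGE_120_C.CEL", "Van0", ["Kin0"]),
]

# Number of misclassified arrays per actual ecotype.
by_actual = Counter(actual for _, actual, _ in MISCLASSIFIED)

# Number of arrays showing each (actual, predicted) confusion.
pairs = Counter(
    (actual, predicted)
    for _, actual, predictions in MISCLASSIFIED
    for predicted in predictions
)

print(by_actual.most_common(2))
print(pairs.most_common(2))
```

The two most common confusions are between Kin0 and Van0 in both directions, which matches the statement that these ecotypes are misclassified most often.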

 

 

 

 

Figure 16 Misclassification figure

[Figure 16: grid of Predicted Class against True Class; the legend distinguishes "Misclassification" from "No Misclassification".]

Chapter 4 Gene Ontology
4.1. Gene Ontology with Classification Superviewer

We have identified genes that may be important in adaptation. We selected two
groups of genes, Cvi43 and Sha84, based on Random Forest. Cvi originates close to the
equator, off the coast of Africa, where temperatures are higher than for the other ecotypes,
and Shakdara originates from a mountainous (around the Himalayas), landlocked part of
Central Asia and is thus exposed to climatic extremes (e.g. temperature). The adaptation of
these two ecotypes has likely been driven by these stress conditions. We would like to
argue that the selected genes are important for stress resistance.

In order to validate the genes selected by Random Forest, we classify the
functions of each group of genes using the website "The Bio-Array Resource for
Arabidopsis thaliana Functional Genomics", http://bar.utoronto.ca/. Its web-based tool
Classification SuperViewer creates an overview of the functional classification of a
group of AGI genes based on the MIPS database (Munich Information Center for Protein
Sequences). Currently, there are 25450 genes with MIPS classifications in MAtDB
(MIPS Arabidopsis thaliana Database). Here we do not focus on single genes. Instead,
we want to find gene functions that are overrepresented in the selected sets of genes,
which can provide important information on stress response. Gene function classification
is an approach for grouping genes based on functional similarity. The functional
classification pie chart often used in bioinformatics reports only the absolute numbers and
percentages of genes per function. Absolute numbers of genes per functional class
can be misleading when treatments or situations differ, but normalizing the group of
genes avoids this problem, so that differences in gene function are more
easily detected. Classification SuperViewer includes normalization and bootstrap sampling,
and provides a confidence estimate for the accuracy of the results. A large standard deviation
indicates that a result may be spurious and unreliable. Moreover, if the confidence interval
includes one, the apparent enrichment of that functional class may be due to a small number
of genes, and the class score is unreliable. We therefore consider only classes with a class
score greater than one and a confidence interval not including one, and check whether these
functional categories are associated with stress response.

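A class score for normalization was calculated based on the following equation (N
is a number of genes):

\[
\mathrm{Score}_{\mathrm{class}} =
\frac{N_{\mathrm{class}(\mathrm{input\ set})} \,/\, N_{\mathrm{classified}(\mathrm{input\ set})}}
     {N_{\mathrm{class}(25\mathrm{K})} \,/\, N_{\mathrm{classified}(25\mathrm{K})}}
\]

(input set: Cvi43 and Sha84)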

One hundred bootstrap samples are drawn from the input set. Each sample is
classified and its class score is computed with the equation above.
The standard deviation of each class score is shown along with the class score.
If the class score is greater than one and the confidence interval does not include one, the
gene ontology category is overrepresented within the group of genes. In the following
section, this approach is applied to the gene groups Cvi43 and Sha84 in Classification
SuperViewer, and we discuss how their overrepresented gene functions affect the stress
response. After that, we simplify the broad spectrum of known protein functions using the
FunCat annotation, which includes 7 main gene categories (Table 7).
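
The class score and its bootstrap spread can be sketched as follows. This is a simplified Python illustration, not the Classification SuperViewer implementation: it assumes each gene carries a single class label, and the reference count of 1000 genes in the class and the toy input labels are hypothetical. Only the total of 25450 classified genes comes from the text.

```python
import random
from statistics import mean, stdev

def class_score(n_class_input, n_classified_input, n_class_ref, n_classified_ref):
    """Share of the class in the input set divided by its share in the reference set."""
    return (n_class_input / n_classified_input) / (n_class_ref / n_classified_ref)

def bootstrap_class_score(input_labels, target, n_class_ref, n_classified_ref,
                          n_boot=100, seed=0):
    """Mean and standard deviation of the class score over bootstrap resamples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        sample = rng.choices(input_labels, k=len(input_labels))
        scores.append(class_score(sample.count(target), len(sample),
                                  n_class_ref, n_classified_ref))
    return mean(scores), stdev(scores)

N_CLASSIFIED_REF = 25450  # classified genes in the reference set, from the text
N_CLASS_REF = 1000        # hypothetical size of the class in the reference set

# Hypothetical input set of 43 genes, 6 of them in the class of interest.
toy_labels = ["cell rescue"] * 6 + ["other"] * 37

score = class_score(6, len(toy_labels), N_CLASS_REF, N_CLASSIFIED_REF)
boot_mean, boot_sd = bootstrap_class_score(toy_labels, "cell rescue",
                                           N_CLASS_REF, N_CLASSIFIED_REF)
print(round(score, 3), round(boot_mean, 3), round(boot_sd, 3))
```

A score above one means the class makes up a larger share of the input set than of the reference set; the bootstrap standard deviation indicates how much that score depends on the particular genes in a small input set.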

Table 7 Main Function categories of FunCat

Metabolism
  01 Metabolism
  02 Energy
  04 Storage protein
Information pathways
  10 Cell cycle and DNA processing
  11 Transcription
  12 Protein synthesis
  14 Protein fate (folding, modification and destination)
  16 Protein with binding function or cofactor requirement (structural or catalytic)
  18 Protein activity regulation
Transport
  20 Cellular transport, transport facilitation and transport routes
Perception and response to stimuli
  30 Cellular communication/signal transduction mechanism
  32 Cell rescue, defense and virulence
  34 Interaction with the cellular environment
  36 Interaction with the environment (systemic)
  38 Transposable elements, viral and plasmid proteins
Developmental processes
  40 Cell fate
  41 Development (systemic)
  42 Biogenesis of cellular components
  43 Cell type differentiation
  45 Tissue differentiation
  47 Organ differentiation
Localization
  70 Subcellular localization
  73 Cell type localization
  75 Tissue localization
  77 Organ localization
  78 Ubiquitous expression
Experimentally uncharacterized proteins
  98 Classification not yet clear-cut
  99 Unclassified proteins

With the exception of categories 78, 98 and 99, all main categories are the origin of
hierarchical, tree-like structures. To make the introduction of new main categories
possible, the numbering of the categories is not strictly sequential.

The FunCat, a functional annotation scheme for systematic classification of
proteins from whole genomes, Nucleic Acids Research, 2004, Vol. 32, No. 18:
5539-5545.

4.2. Gene Ontology of Cvi43 and Sha84

Figure 17 Cvi43 - Classification SuperViewer

[Figure 17: Classification SuperViewer bar chart for Cvi43, showing the normalized class score of each MIPS functional category.]

As we can see, five terms have class scores greater than one and confidence
intervals that do not include one. The number of Cvi43 genes associated with terms (1)-(5)
below is greater than expected by chance. In other words, terms (1)-(5) are
overrepresented in the Cvi43 gene set.

(1) CELL TYPE LOCALISATION

(2) REGULATION OF/INTERACTION W. CELLULAR ENVIRONMENT

(3) SYSTEMIC REGULATION OF/INTERACTION W. ENVIRONMENT

(4) TRANSPORT FACILITATION

(5) CELL RESCUE, DEFENSE AND VIRULENCE

Referring to Table 7, terms (2), (3) and (5) fall in the category Perception and response to
stimuli. Plant perception detects changes in the environment. The stimuli that plants
perceive and respond to include chemicals, gravity, light, moisture, infections,
temperature, oxygen, and carbon dioxide. Plants detect stimuli by different mechanisms and
respond in a variety of ways, but in general plant perception occurs at the cellular level.
Thus, the selected genes are related to climatic conditions for Cvi.

Figure 18 Sha84 - Classification SuperViewer

[Figure 18: Classification SuperViewer bar chart for Sha84, showing the normalized class score of each MIPS functional category.]

In Figure 18, eight terms have class scores greater than one and confidence
intervals that do not include one. Thus, terms (1)-(8) below are overrepresented in the
Sha84 gene set.

(1) STORAGE PROTEIN

(2) TISSUE LOCALISATION

(3) CELL TYPE DIFFERENTIATION

(4) ORGAN LOCALISATION

(5) TISSUE DIFFERENTIATION

(6) METABOLISM

(7) CELL RESCUE, DEFENSE AND VIRULENCE

(8) ENERGY

Terms (1), (6) and (8) cover all sub-functions of the Metabolism category (Table 7). The
definition of metabolism is: "Chemical process occurring within a living cell or organism,
including anabolism and catabolism. Metabolism is a chemical process that typically
transforms small molecules, but also includes macromolecular process and protein synthesis
and degradation." Metabolism is closely tied to energy. Under stress, some compounds
are broken down in metabolism to yield energy, and this energy is
directed at repairing the damage caused by the stress. Thus, metabolism is an
important factor under many different types of stressors. Under stress, plants may
change their metabolism to direct energy away from growth and
reproduction and towards cellular defense and maintenance. This in turn helps plants
survive in tough environments. Thus, the selected Sha84 genes may be important for
adapting to the climatic conditions at high altitude.

Moreover, cytochrome P450 genes and glutathione S-transferase genes may play an
important role in oxidative stress resistance, since all forms of stress generate oxidative
stress in some way. Several papers report that cytochrome P450 genes are
important for plants. Oxidative detoxification of some herbicides in plant tissues is
carried out by a cytochrome P450-dependent monooxygenase system (Donaldson and
Luster 1991, Hatzios 1991, and Sandermann 1992). Cytochrome P450s play important
roles in the biosynthesis of a variety of endogenous lipophilic compounds (Donaldson and
Luster 1991 and Bolwell et al. 1994). Cytochrome P450 monooxygenases are a group of
haem-containing proteins which catalyze various oxidative reactions (Schuler 1996 and
Chapple 1998). In addition, some papers support an important role for glutathione
S-transferase in plants. Glutathione S-transferases (GSTs) appear to be ubiquitous in
plants and have defined roles in herbicide detoxification (Lamoureux and Rusness 1993).
The fundamental function of GSTs is the detoxification of both endogenous and
xenobiotic compounds (Marrs 1996). GSTs play a fundamental role in protection against
endogenous or exogenous toxic chemicals (Sheehan et al. 2001). Furthermore,
cytochrome P450s and glutathione S-transferases are phase I and phase II
detoxification enzymes, respectively. Therefore, finding such genes associated with any
form of stress may be biologically meaningful.

In addition, one gene in Cvi43 (At5g10140) is the FLC (FLOWERING LOCUS C) gene, which
is a main determinant of flowering time. Arabidopsis thaliana grows in the
Northern Hemisphere, where long daylight hours may affect flowering time.
The transition to flowering is an important event in the plant life cycle and is influenced by
several environmental factors, including photoperiod, light quality, vernalization and growth
temperature, as well as biotic and abiotic stresses. Thus, FLC can respond to stresses
and environmental effects. The following 5 genes from Cvi43 and Sha84 belong to these
3 specific gene types (FLC, cytochrome P450 and glutathione S-transferase); the graph
shows the expression of these 5 genes.

Figure 19 Expression graph for 5 specific genes.

[Figure 19: five line plots of Gene Expression Value (y-axis) across the ecotypes Bay-0, C24, Col-0, Cvi, Est, Kin-0, Ler, Nd-1, Sha and Van-0 (x-axis), one panel per gene: 253534_at, 252827_at, 262916_at, 264052_at and 250476_at.]

Table 8 FLC, Cytochrome P450 and Glutathione-S-transferase genes

AGI ID       Affy ID      Annotation
At4g31500    253534_at    CYP83B1_ATR4_RED1_RNT1_SUR2_CYP83B1 (CYTOCHROME P450 MONOOXYGENASE 83B1); oxygen binding
At4g39950    252827_at    CYP79B2_CYP79B2 (cytochrome P450, family 79, subfamily B, polypeptide 2); oxygen binding
At1g59700    262916_at    ATGSTU16_ATGSTU16 (Arabidopsis thaliana Glutathione S-transferase (class tau) 16); glutathione transferase
At2g22330    264052_at    CYP79B3_CYP79B3 (cytochrome P450, family 79, subfamily B, polypeptide 3); oxygen binding
At5g10140    250476_at    FLC_AGL25_FLF_FLC (FLOWERING LOCUS C)

Annotation from "TAIR affy_ATH1_array_elements-2006-07-14.txt"

APPENDICES


APPENDIX A

Selected groups of Genes: Cvi43 and Sha84

Cvi43

Affy ID      AGI ID                Annotation
246173_s_at  At3g61520, At5g28370, At5g28460  pentatricopeptide (PPR) repeat-containing protein
246671_at    At5g30450
246862_at    At5g25760             UBC21_PEX4_PEX4 (PEROXIN4); ubiquitin-protein ligase
247760_at    At5g59130             subtilase family protein
247791_at    At5g58710             ROC7_ROC7 (rotamase CyP 7); peptidyl-prolyl cis-trans isomerase
248460_at    At5g50915             basic helix-loop-helix (bHLH) family protein
249752_at    At5g24660             similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G24655.1); similar to unknown protein [Brassica rapa subsp. pekinensis] (GB:AAQ92331.1)
249780_at    At5g24240             phosphatidylinositol 3- and 4-kinase family protein / ubiquitin family protein
250476_at    At5g10140             FLC_AGL25_FLF_FLC (FLOWERING LOCUS C)
251241_s_at  At3g62460, At3g62530  similar to PBS lyase HEAT-like repeat-containing protein [Arabidopsis thaliana] (TAIR:AT3G62530.1); similar to 80C09_3 [Brassica rapa subsp. pekinensis] (GB:AAZ41814.1); similar to Os07g0637200 [Oryza sativa (japonica cultivar-group)] (GB:NP_001060400.1); contains InterPro domain Protein of unknown function DUF537; (InterPro:IPR007491)
251962_at    At3g53420             PIP2A_PIP2_PIP2A (plasma membrane intrinsic protein 2;1)
252168_at    At3g50440             hydrolase
252231_at    At3g49720             similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G65810.1); similar to Os01g0144000 [Oryza sativa (japonica cultivar-group)] (GB:NP_001042001.1); similar to conserved hypothetical protein [Medicago truncatula] (GB:ABE78370.1); contains domain S-adenosyl-L-methionine-dependent methyltransferases (SSF53335)
252459_s_at  At3g47220, At3g47290  phosphoinositide-specific phospholipase C family protein
252529_at    At3g46490             oxidoreductase, 2OG-Fe(II) oxygenase family protein
252723_at    At3g43520             similar to unknown protein [Arabidopsis thaliana] (TAIR:AT2G26240.1); similar to Os04g0653100 [Oryza sativa (japonica cultivar-group)] (GB:NP_001054104.1); similar to transmembrane protein 14C [Argas monolakensis] (GB:ABI52790.1); similar to Os03g0568500 [Oryza sativa (japonica cultivar-group)] (GB:NP_001050510.1); contains InterPro domain Protein of unknown function UPF0136, Transmembrane; (InterPro:IPR005349)
253532_at    At4g31570             similar to myosin-related [Arabidopsis thaliana] (TAIR:AT1G24460.1); similar to hypothetical protein, conserved [Leishmania major] (GB:CAJ07774.1); contains InterPro domain Prefoldin; (InterPro:IPR009053); contains InterPro domain t-snare; (InterPro:IPR010989)
253534_at    At4g31500             CYP83B1_ATR4_RED1_RNT1_SUR2_CYP83B1 (CYTOCHROME P450 MONOOXYGENASE 83B1); oxygen binding
254351_at    At4g22300             carboxylic ester hydrolase
254361_at    At4g22212             Encodes a defensin-like (DEFL) family protein.
254928_at    At4g11410             short-chain dehydrogenase/reductase (SDR) family protein
255257_at    At4g05050             UBQ11_UBQ11 (UBIQUITIN 11); protein binding
255307_at    At4g04900             RIC10_RIC10 (ROP-INTERACTIVE CRIB MOTIF-CONTAINING PROTEIN 10)
255578_at    At4g01450             nodulin MtN21 family protein
256497_at    At1g31580             ECS1_CXC750_ECS1
256863_at    At3g24070             zinc knuckle (CCHC-type) family protein
257071_at    At3g28180             ATCSLC04_ATCSLC4_CSLC04_ATCSLC04 (Cellulose synthase-like C4); transferase, transferring glycosyl groups
257205_at    At3g16520             UDP-glucoronosyl/UDP-glucosyl transferase family protein
259067_at    At3g07550             F-box family protein (FBL12)
259591_at    At1g28150             similar to Os04g0528100 [Oryza sativa (japonica cultivar-group)] (GB:NP_001053373.1)
259733_at    At1g77480             nucellin protein, putative
260232_at    At1g74640             similar to unknown protein [Oryza sativa (japonica cultivar-group)] (GB:BAD28539.1); contains domain no description (G3D.3.40.50.1820); contains domain alpha/beta-Hydrolases (SSF53474)
260244_at    At1g74320             choline kinase, putative
260252_at    At1g74240             mitochondrial substrate carrier family protein
263034_at    At1g24020             Bet v I allergen family protein
263777_at    At2g46450             ATCNGC12_CNGC12_ATCNGC12 (cyclic nucleotide gated channel 12); cyclic nucleotide binding / ion channel
265142_at    At1g51360             similar to unknown protein [Arabidopsis thaliana] (TAIR:AT2G31670.1); similar to Hypothetical protein [Oryza sativa] (GB:AAK55783.1); contains InterPro domain Stress responsive alpha-beta barrel; (InterPro:IPR013097); contains InterPro domain Dimeric alpha-beta barrel; (InterPro:IPR011008)
265162_at    At1g30910             molybdenum cofactor sulfurase family protein
265486_at
265699_at    At2g03550             similar to unknown protein [Arabidopsis thaliana] (TAIR:AT3G48690.1); similar to unknown protein [Arabidopsis thaliana] (TAIR:AT3G48700.1); similar to Esterase/lipase/thioesterase [Medicago truncatula] (GB:ABE83378.1); contains InterPro domain Esterase/lipase/thioesterase; (InterPro:IPR000379); contains InterPro domain Alpha/beta hydrolase fold-3; (InterPro:IPR013094)
265768_at    At2g48020             sugar transporter, putative
266643_s_at  At2g29710, At2g29730  UDP-glucoronosyl/UDP-glucosyl transferase family protein
267093_at    At2g38170             CAX1_RCI4_CAX1 (CATION EXCHANGER 1); calcium:hydrogen antiporter

 

Sha84

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Affy ID AGI ID Annotation

2 4 5 03 8__a t A t2g2 6 5 60 PLA2A_PLA IIA_PLP2_PLA IIA_PLP2 (PHOSPHOLIPASE A 2A);
nutrient reservorr

245400_at At4gl7040 ATP-dependent Clp protease proteolytic subunit, putative

245456_at At4gl6950 RPP5_RPP5 (RECOGNITION OF PERONOSPORA PARASITICA 5)

2459.77 at At5g13110 G6PD2_G6PD2 (GLUCOSE-6-PHOSPHATE DEHYDROGENASE 2);

- glucose-6-phosphate l-dehydrogenase
At5 34920

246642_s_at “£59620
similar to unknown protein [Arabidopsis thaliana] (TAIR:AT3G04860.1);
similar to OsO7g0572300 [Oryza sativa (japonica cultivar-group)]
(GB:NP_001060057.1); similar to OsO3g0806700 [Oryza sativa (japonica

246708_at At5g28150 cultivar-group)] (GB:NP_001051637.1); similar to Protein of unknown
ftmction DUF868, plant [Medicago truncatula] (GB:ABE92686.1); contains
InterPro domain Protein of unknown function DUF868, plant;
(InterPro:IPR008586L

247210 at Ath 6 5020 ANNAT2_ANNAT2 (ANNEXIN ARABIDOPSIS 2); calcrum 1011 bmdmg

— calcrum-dependent phospholrpid brndmg

SALl F RYl HOSZ SAL] FIERYl ; 3' 2' ,5'-bis hos hate nucleotidase/

247313_at At5g63980 inositch or phbsphaticglinositdl phosphzitas: ) P P

247404_at At5 g62890 permease, putative

247814_at At5 g5 83 10 hydrolase, alpha/beta fold family protein

247999_at At5g56150 UBC30__UBC30; ubiquitin-protein ligase

248079_at At5 g55790 unknown protein

248200_at At5g54160 ATOMT1_OMT1_ATOMT1 (O-METHYLTRANSFERASE 1)

248427_at At5g51750 subtilase family protein

248796_at At5g47180 vesicle-associated membrane family protein / VAMP family protein

248800_at At5g47320 RPS19_RPSI9 (4OS ribosomal protein S19); RNA binding

248961_at At5 g45 650 subtilase family protein

249258_at At5g4l650 lactoylglutathione lyase family protein / glyoxalase I family protein

249567_at At5 g3 8020 S-adenosyl-L-methionine:carboxyl methyltransferase family protein
similar to 0302 0815400 sativa 'a onica cultivar- ou

249610_at At5g37360 mam—00104385021) [0W 0 1’ gr 1’”

249645_at At5g36910 TH12.2.2_THI2.2 (THIONIN 2.2); toxin receptor binding

249733_at At5g24400 EMB2024_EMBZOZ4 (EMBRYO DEFECTIVE 2024); catalytic
similar to unknown protein [Arabidopsis thaliana] (TAIR:AT1G61065. 1);
similar to unknown protein [Saussurea involucrata] (GB:ABC68264.1);
similar to 0506 0114700 0 a sativa 'a onica cultivar- rou

”0072—“ “52417210 (GB:NP_00105I5606J); SIEIDIIZI'Z to 030530534800 [Oryza sgativiig'aponica
cultivar-group)] (GB:NP_001055640.1); contains InterPro domain Protein of
unknown function DUF1218; (InterPro:IPR009606)

250633_at A t5g07 4 6O PMSR2._PMSR2 (PEPTIDEMETHIONINE SULFOXIDE REDUCTASE
2); protem-methronme-S-oxrde reductase

250751__at At5 g05 890 UDP-glucoronosyl/UDP-glucosyl transferase family protein
HB-6 LSN BLH9 BLR PNY RPL VAN LSN LARSON,

2510313“ At5g02030 VAAMANA); DNA bincﬁng / tfanscfiption factor (

251903_at At3g54120 reticulon family protein (RTNLB 12)

 

41

GSA2_GSA2 (GLUTAMATE- 1 -SEMIALDEHYDE 2,1-

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2523181“ At3g48730 AMINOMUTASE 2); glutamate-l-semialdehyde 2,1-aminomutase
similar to unknown protein [Arabidopsis thaliana] (TAIR:AT3G47200.2);
similar to hypothetical protein LOC_Oleg29620 [Oryza sativa (japonica
cultivar-group)] (GB:ABA98257.1); similar to 031 1g0543300 [Oryza sativa

252462__at At3g47250 (japonica cultivar-group)] (GB:NP_001068043.1); similar to OsO4g0505400
[Oryza sativa (japonica cultivar-group)] (GB:NP_001053253.1); contains
InterPro domain Protein of unknown function DUF247, plant;
(InterProzﬂ’R0041 5 8)

252478_at At3g46540 epsin N-terminal homology (ENTH) domain-containing protein / clathrin assembly protein-related
252529_at At3g46490 oxidoreductase, 2OG-Fe(II) oxygenase family protein
252659_at At3g44430 similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G41660.1)
252678_s_at At3g44300; At3g44310 NIT2_NIT2 (NITRILASE 2)
252724_at At3g43540 similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G47860.1); similar to Os09g0436900 [Oryza sativa (japonica cultivar-group)] (GB:NP_001063263.1); similar to unknown protein [Oryza sativa (japonica cultivar-group)] (GB:BAD36432.1); contains InterPro domain Protein of unknown function DUF1350; (InterPro:IPR010765)
252827_at At4g39950 CYP79B2_CYP79B2 (cytochrome P450, family 79, subfamily B, polypeptide 2); oxygen binding
252863_at At4g39800 MI-1-P SYNTHASE_MI-1-P SYNTHASE (Myo-inositol-1-phosphate synthase); inositol-3-phosphate synthase
253422_at At4g32240 unknown protein
253666_at At4g30270 MERI5B_BRU1_MERI-5_MERI5B (MERISTEM-5); hydrolase, acting on glycosyl bonds
254248_at At4g23270 protein kinase family protein
254508_at At4g20170 similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G44670.1); similar to Os06g0328800 [Oryza sativa (japonica cultivar-group)] (GB:NP_001057533.1); similar to Os02g0712500 [Oryza sativa (japonica cultivar-group)] (GB:NP_001047907.1); similar to unknown protein [Oryza sativa (japonica cultivar-group)] (GB:BAD72474.1); contains InterPro domain Protein of unknown function DUF23; (InterPro:IPR008166)
254553_at At4g19530 disease resistance protein (TIR-NBS-LRR class), putative
255437_at At4g03060 AOP2_AOP2 (ALKENYL HYDROXALKYL PRODUCING 2); oxidoreductase, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors
255859_at At5g34930 arogenate dehydrogenase
256021_at At1g58270 ZW9_ZW9
256096_at At1g13650 similar to 18S pre-ribosomal assembly protein gar2-related [Arabidopsis thaliana] (TAIR:AT2G03810.3); similar to hypothetical protein [Trypanosoma cruzi strain CL Brener] (GB:XP_813437.1)
256221_at At1g56300 DNAJ heat shock N-terminal domain-containing protein
256454_at At1g75280 isoflavone reductase, putative
256458_at At1g75220 integral membrane protein, putative
256489_at At1g31550 carboxylic ester hydrolase/ lipase
256940_at At3g30720 unknown protein
257205_at At3g16520 UDP-glucoronosyl/UDP-glucosyl transferase family protein
257228_at At3g27890 NQR_NQR (NADPH:QUINONE OXIDOREDUCTASE); FMN reductase
257580_at At3g06210 binding
258124_at At3g18215 similar to unknown protein [Arabidopsis thaliana] (TAIR:AT5G24600.1); similar to hypothetical protein [Oryza sativa (japonica cultivar-group)] (GB:BAC55679.1); similar to Os02g0292800 [Oryza sativa (japonica cultivar-group)] (GB:NP_001046597.1); similar to Os08g0153900 [Oryza sativa (japonica cultivar-group)] (GB:NP_001061011.1); contains InterPro domain Protein of unknown function DUF599; (InterPro:IPR006747)
258322_at At3g22740 HMT3_HMT3 (Homocysteine S-methyltransferase 3); homocysteine S-methyltransferase
259770_s_at At1g07780; At1g29410; At5g05590 PAI1_TRP6_PAI1 (PHOSPHORIBOSYLANTHRANILATE ISOMERASE 1); phosphoribosylanthranilate isomerase
260546_at At2g43520 ATTI2_ATTI2 (ARABIDOPSIS THALIANA TRYPSIN INHIBITOR PROTEIN 2); trypsin inhibitor
260567_at At2g43820 GT_GT/UGT74F2 (UDP-GLUCOSYLTRANSFERASE 74F2); UDP-glucosyltransferase/ UDP-glycosyltransferase/ transferase, transferring glycosyl groups / transferase, transferring hexosyl groups
260685_at At1g17650 phosphogluconate dehydrogenase (decarboxylating)
260872_at At1g21350 electron carrier/ oxidoreductase
260981_at At1g53460 similar to zinc finger (Ran-binding) family protein [Arabidopsis thaliana] (TAIR:AT1G55040.1); similar to Zn-finger in Ran binding protein and others, putative [Oryza sativa (japonica cultivar-group)] (GB:AAX95671.1); similar to Os03g0712200 [Oryza sativa (japonica cultivar-group)] (GB:NP_001051062.1); similar to Os01g0203300 [Oryza sativa (japonica cultivar-group)] (GB:NP_001042331.1)
261105_at At1g63000 NRS/ER_NRS/ER (NUCLEOTIDE-RHAMNOSE SYNTHASE/EPIMERASE-REDUCTASE)
261326_s_at At1g44180; At1g44820 aminoacylase, putative / N-acyl-L-amino-acid amidohydrolase, putative
261727_at At1g76090 SMT3_SMT3 (S-adenosyl-methionine-sterol-C-methyltransferase 3); S-adenosylmethionine-dependent methyltransferase
261924_at At1g22550 proton-dependent oligopeptide transport (POT) family protein
262134_at At1g77990 AST56_SULTR2;2_AST56 (sulphate transporter 2;2); sulfate transporter
262458_at At1g11280 carbohydrate binding / kinase
262875_at At1g64970 G-TMT_TMT1_VTE4_G-TMT (GAMMA-TOCOPHEROL METHYLTRANSFERASE)
262916_at At1g59700 ATGSTU16_ATGSTU16 (Arabidopsis thaliana Glutathione S-transferase (class tau) 16); glutathione transferase
263553_at At2g16430 PAP10_PAP10; acid phosphatase/ protein serine/threonine phosphatase
263714_at At2g20610 SUR1_ALF1_HLS3_RTY_SUR1_SUR1 (SUPERROOT 1); transaminase
264052_at At2g22330 CYP79B3_CYP79B3 (cytochrome P450, family 79, subfamily B, polypeptide 3); oxygen binding
264513_at At1g09420 G6PD4_G6PD4 (GLUCOSE-6-PHOSPHATE DEHYDROGENASE 4); glucose-6-phosphate 1-dehydrogenase
264790_at At2g17820 ATHK1_AHK1_ATHK1_ATHK1 (HISTIDINE KINASE 1)
264954_at At1g77060 mutase family protein
265032_at At1g61580 ARP2_RPL3B_ARP2/RPL3B (ARABIDOPSIS RIBOSOMAL PROTEIN 2); structural constituent of ribosome
265058_s_at At1g52030; At1g52040 MBP2_F-ATMBP_MBP1.2_MBP2 (MYROSINASE-BINDING PROTEIN 2)
265354_at At2g16700 ADF5_ADF5 (ACTIN DEPOLYMERIZING FACTOR 5); actin binding
265486_at 265486_at
265611_at At2g25510 unknown protein
265905_at At2g25640 similar to transcription elongation factor-related [Arabidopsis thaliana] (TAIR:AT5G25520.2); similar to PHD finger protein-like [Oryza sativa (japonica cultivar-group)] (GB:BAD24999.1); similar to Os02g0208600 [Oryza sativa (japonica cultivar-group)] (GB:NP_001046260.1); contains InterPro domain Transcription elongation factor S-II, central region; (InterPro:IPR003618); contains InterPro domain SPOC; (InterPro:IPR012921)
266472_at
266643_s_at [illegible] UDP-glucoronosyl/UDP-glucosyl transferase family protein
267078_at At2g40960 nucleic acid binding

APPENDIX B
R CODE

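#####################################################################
# Packages Used #
#####################################################################

library(limma)
library(randomForest)
library(varSelRF)
library(maps)

#####################################################################
# Reading Data #
#####################################################################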

Ecodata = read.table("AtGE_ecotypes.txt", header =T, sep="\t")
Geo = read.table("Geo.txt", header =T, sep="\t")
x = read.table("EcotypesGeo.txt",sep="\t")

#####################################################################
# Map #
#####################################################################

par(mar=rep(0, 4))
par(mfrow=c(2,1))

map("world",col="grey")
text(Geo$Longitude,Geo$Latitude,Geo$Ecotype,col="black",cex=0.8)
points(Geo$Longitude,Geo$Latitude,col=rainbow(16:20)[1:10],cex=0.7,lwd=3)
legend(120,85,Geo$Ecotype,fill=rainbow(16:20)[1:10],cex=0.8,bty="n")
points(Geo$Longitude[7],Geo$Latitude[7],col="DarkGoldenRod",cex=3,lwd=2)
arrows(10.5,30,10.5,-50,lwd=2,angle=15,col="DarkGoldenRod")
text(5.5,-70,"Germany",adj=0,cex=1.5,col="DarkGoldenRod")

map("world","Germany",col="DarkGoldenRod")
text(Geo$Longitude,Geo$Latitude,Geo$Ecotype,cex=0.8,col="red")
points(Geo$Longitude,Geo$Latitude,col=rainbow(16:20)[1:10],cex=0.7,lwd=3)

 

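#####################################################################
# Cluster for la4 and al4 #
#####################################################################

# Latitude: hierarchical clustering and k-means with 4 clusters
la = x[-1,4]
la = la[1:10]
names(la) = x[-1,1][1:10]
dist(la)

hc.la <- hclust(dist(la))
plot(hc.la)

la.km <- kmeans(dist(la),4)$cluster
la.km # Cluster

# Altitude: hierarchical clustering and k-means with 4 clusters
al = x[-1,7]
al = al[1:10]
names(al) = x[-1,1][1:10]
dist(al)

hc.al <- hclust(dist(al))
plot(hc.al)

al.km <- kmeans(dist(al),4)$cluster
al.km # Cluster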
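
The clustering step can be tried without the input files. The following is a minimal sketch, assuming the rounded latitudes listed in the EcotypesGeo.txt table of this appendix are typed in by hand; the names `lat`, `hc` and `groups` are illustrative and do not appear in the thesis code.

```r
# Latitudes copied from the EcotypesGeo.txt listing (rounded to two decimals).
lat <- c(Bay0 = 49.56, C24 = 40.20, Col0 = 43.01, Cvi = 16.00, Est = 59.00,
         Kin0 = 42.47, Ler = 48.20, Nd1 = 50.78, Sha = 37.18, Van0 = 49.85)

# Hierarchical clustering on pairwise latitude differences
# (hclust uses complete linkage by default).
hc <- hclust(dist(lat))

# Cut the dendrogram into four groups and list the members of each group.
groups <- cutree(hc, k = 4)
print(split(names(groups), groups))
```

With these values, Est and Cvi each end up alone, which agrees with the single-column groups Ala4 and Dla4 used in the latitude design matrix later in this appendix.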

#####################################################################
# Gene Expression Plot #
#####################################################################

genelist = la4 # changeable variable
name = "La4"

# Average the three replicate arrays of each ecotype
genedata<-Ecodata[genelist,2:31]
gene.x1<-apply(genedata[,1:3],1,mean)
gene.x2<-apply(genedata[,4:6],1,mean)
gene.x3<-apply(genedata[,7:9],1,mean)
gene.x4<-apply(genedata[,10:12],1,mean)
gene.x5<-apply(genedata[,13:15],1,mean)
gene.x6<-apply(genedata[,16:18],1,mean)
gene.x7<-apply(genedata[,19:21],1,mean)
gene.x8<-apply(genedata[,22:24],1,mean)
gene.x9<-apply(genedata[,25:27],1,mean)
gene.x10<-apply(genedata[,28:30],1,mean)

genegexp<-data.frame(gene.x1, gene.x2, gene.x3, gene.x4, gene.x5,
gene.x6, gene.x7, gene.x8, gene.x9, gene.x10)

# Overlay the expression profile of every selected gene in one plot
for (i in 1:length(genelist)){
  GeneExpression.gene=t(rbind(genegexp[i,]))
  matplot(GeneExpression.gene,axes=F,frame=T,type='b',pch=1)
  row.names(GeneExpression.gene)<-c("Bay0", "C24", "Col0", "Cvi", "Est", "Kin0", "Ler", "Nd1", "Sha",
  "Van0")
  axis(1, 1:10, row.names(GeneExpression.gene))
  par(new=T)
}

title(xlab="Ecotypes",main=paste(name))
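
The ten `apply` calls above compute one replicate mean per ecotype. As a hedged alternative, the same per-group averaging can be written once with a grouping vector. The sketch below uses a small made-up matrix (`expr`, `group` and `group_means` are hypothetical names, and the numbers are not thesis data) so that it runs on its own.

```r
# Toy expression matrix: 2 genes (rows) x 6 arrays (columns),
# i.e. 2 hypothetical ecotypes with 3 replicates each.
expr <- matrix(c(1, 2, 3, 10, 11, 12,
                 4, 4, 4,  7,  8,  9),
               nrow = 2, byrow = TRUE)

# One label per column, in the same order as the columns.
group <- rep(c("EcoA", "EcoB"), each = 3)

# For each ecotype, average its replicate columns gene by gene.
group_means <- sapply(unique(group), function(g) {
  rowMeans(expr[, group == g, drop = FALSE])
})
print(group_means)
```

For the thesis data the grouping vector would have ten labels repeated three times each, matching the column order of `Ecodata[, 2:31]`.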

 

#####################################################################
# Gene Expression Plot - Each picture represents one gene #
#####################################################################

genelist = la4 # changeable variable
N = 20 # the number of genes
genedata<-Ecodata[genelist,2:31]
gene.x1<-apply(genedata[,1:3],1,mean)
gene.x2<-apply(genedata[,4:6],1,mean)
gene.x3<-apply(genedata[,7:9],1,mean)
gene.x4<-apply(genedata[,10:12],1,mean)
gene.x5<-apply(genedata[,13:15],1,mean)
gene.x6<-apply(genedata[,16:18],1,mean)
gene.x7<-apply(genedata[,19:21],1,mean)
gene.x8<-apply(genedata[,22:24],1,mean)
gene.x9<-apply(genedata[,25:27],1,mean)
gene.x10<-apply(genedata[,28:30],1,mean)
genegexp<-data.frame(gene.x1, gene.x2, gene.x3, gene.x4, gene.x5,
gene.x6, gene.x7, gene.x8, gene.x9, gene.x10)

# One plot per gene
for (i in 1:N){
  GeneExpression.gene=t(rbind(genegexp[i,]))
  matplot(GeneExpression.gene,axes=F,frame=T,type='b',pch=1)
  row.names(GeneExpression.gene)<-c("Bay0", "C24", "Col0", "Cvi",
  "Est", "Kin0", "Ler", "Nd1", "Sha", "Van0")
  axis(1, 1:10, row.names(GeneExpression.gene))
  title(main=paste("Gene",i))
}

#####################################################################
# Latitude - La4 #
#####################################################################

ecorep = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10,10)

design = model.matrix(~-1+factor(ecorep))

# Four latitude groups built from the ecotype indicator columns
Ala4 = design[,5]
Bla4 = design[,1]+design[,7]+design[,8]+design[,10]
Cla4 = design[,2]+design[,3]+design[,6]+design[,9]
Dla4 = design[,4]

designla4 = data.frame(Ala4, Bla4, Cla4, Dla4)
contrast.matrixla4 = makeContrasts(Ala4 - Bla4, Ala4 - Cla4, Ala4 - Dla4, Bla4 - Cla4, Bla4 - Dla4,
Cla4 - Dla4, levels=designla4)

eco.fitla4 = lmFit(Ecodata[,2:31],designla4)
eco.fit2la4 = contrasts.fit(eco.fitla4, contrast.matrixla4)
eco.ebla4 = eBayes(eco.fit2la4)

#= decideTests =#

clasla4 = decideTests(eco.ebla4, method="nestedF", adjust.method="fdr", p=0.05)
rownames(clasla4) = Ecodata[,1]
kla4 = rowSums(abs(clasla4)) # number of significant contrasts per gene

# select the genes which are significant in at least one contrast

c1.la4 = clasla4[,1]
c2.la4 = clasla4[,2]
c3.la4 = clasla4[,3]
c4.la4 = clasla4[,4]
c5.la4 = clasla4[,5]
c6.la4 = clasla4[,6]

la4.c1 = which(c1.la4 == 1 | c1.la4 == -1)
la4.c2 = which(c2.la4 == 1 | c2.la4 == -1)
la4.c3 = which(c3.la4 == 1 | c3.la4 == -1)
la4.c4 = which(c4.la4 == 1 | c4.la4 == -1)
la4.c5 = which(c5.la4 == 1 | c5.la4 == -1)
la4.c6 = which(c6.la4 == 1 | c6.la4 == -1)

la4.all = unique(c(la4.c1,la4.c2,la4.c3,la4.c4,la4.c5,la4.c6))

#= Look at decideTests in a different way =#

la4k0 = length(which(kla4==0))
la4k1 = length(which(kla4==1))
la4k2 = length(which(kla4==2))
la4k3 = length(which(kla4==3))
la4k4 = length(which(kla4==4))
la4k5 = length(which(kla4==5))
la4k6 = length(which(kla4==6))

la4k = c(la4k0, la4k1, la4k2, la4k3, la4k4, la4k5, la4k6)
names(la4k)=c(0:6)

la4bar = barplot(la4k,space=1.5,col=c("yellow","red","blue","lightblue","mistyrose","lightcyan",
"lavender"),legend=la4k, xlab="number of significant contrasts", main="la4")

la4row = which(kla4>=4)
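
The seven `length(which(...))` lines above tabulate how many genes are significant in 0, 1, ..., 6 contrasts. A minimal sketch of the same tabulation in one call follows, assuming a made-up vector `k` that stands in for `kla4` (the values are illustrative only, not thesis results).

```r
# Hypothetical stand-in for kla4: number of significant contrasts per gene.
k <- c(0, 2, 6, 0, 4, 4, 1, 5)

# Count genes at every level 0..6; factor() keeps levels with zero genes.
counts <- table(factor(k, levels = 0:6))
print(counts)

# Row indices of genes significant in at least four contrasts.
selected <- which(k >= 4)
print(selected)
```

Passing `counts` to `barplot()` would give the same kind of bar chart as the one drawn from `la4k`.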

#####################################################################
# Altitude - Al4 #
#####################################################################

ecorep = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10,10)
design = model.matrix(~-1+factor(ecorep))

# Four altitude groups built from the ecotype indicator columns
Aal4 = design[,9]
Bal4 = design[,7]
Cal4 = design[,1]+design[,2]+design[,6]+design[,8]
Dal4 = design[,3]+design[,4]+design[,5]+design[,10]

designal4 = data.frame(Aal4, Bal4, Cal4, Dal4)

contrast.matrixal4 = makeContrasts(Aal4-Bal4,Aal4-Cal4,Aal4-Dal4,Bal4-Cal4,
Bal4-Dal4,Cal4-Dal4,levels=designal4)

eco.fital4 = lmFit(Ecodata[,2:31],designal4)
eco.fit2al4 = contrasts.fit(eco.fital4,contrast.matrixal4)
eco.ebal4 = eBayes(eco.fit2al4)

#= decideTests =#

clasal4 = decideTests(eco.ebal4, method="nestedF", adjust.method="fdr", p=0.05)
kal4 = rowSums(abs(clasal4)) # number of significant contrasts per gene

# select the genes which are significant in at least one contrast

c1.al4 = clasal4[,1]
c2.al4 = clasal4[,2]
c3.al4 = clasal4[,3]
c4.al4 = clasal4[,4]
c5.al4 = clasal4[,5]
c6.al4 = clasal4[,6]

al4.c1 = which(c1.al4 == 1 | c1.al4 == -1)
al4.c2 = which(c2.al4 == 1 | c2.al4 == -1)
al4.c3 = which(c3.al4 == 1 | c3.al4 == -1)
al4.c4 = which(c4.al4 == 1 | c4.al4 == -1)
al4.c5 = which(c5.al4 == 1 | c5.al4 == -1)
al4.c6 = which(c6.al4 == 1 | c6.al4 == -1)

al4.all = unique(c(al4.c1,al4.c2,al4.c3,al4.c4,al4.c5,al4.c6))

#= Look at decideTests in a different way =#

al4k0 = length(which(kal4==0))
al4k1 = length(which(kal4==1))
al4k2 = length(which(kal4==2))
al4k3 = length(which(kal4==3))
al4k4 = length(which(kal4==4))
al4k5 = length(which(kal4==5))
al4k6 = length(which(kal4==6))

al4k = c(al4k0, al4k1, al4k2, al4k3, al4k4, al4k5, al4k6)
names(al4k)=c(0:6)

al4bar = barplot(al4k,space=1.5,col=c("yellow","red","blue","lightblue","mistyrose","lightcyan",
"lavender"),legend=al4k, xlab="number of significant contrasts", main="al4")

al4row = which(kal4>=5)

#####################################################################
# Cvi vs. the other 8 Ecotypes (without Sha) #
#####################################################################

ecorep = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9,10,10,10)

design = model.matrix(~-1+factor(ecorep))

designcvi = design
colnames(designcvi)<-c("Bay0", "C24", "Col0", "Cvi", "Est", "Kin0", "Ler", "Nd1", "Shakdara", "Van0")

# Cvi against the average of the other eight ecotypes (Shakdara excluded)
contrast.matrixcvi<-makeContrasts(Cvi - Bay0/8 - C24/8 - Col0/8 - Est/8 - Kin0/8 - Ler/8 - Nd1/8 -
Van0/8, levels=designcvi)

eco.fitcvi = lmFit(Ecodata[,2:31],designcvi)
eco.fit2cvi = contrasts.fit(eco.fitcvi, contrast.matrixcvi)
eco.ebcvi = eBayes(eco.fit2cvi)

clascvi = decideTests(eco.ebcvi, method="nestedF", adjust.method="fdr", p=0.05)
kcvi = rowSums(abs(clascvi))

#= Toptable =# selecting the first 500 genes from toptable

num=500

cvi = topTable(eco.ebcvi, genelist=eco.ebcvi$genes, coef=1, n=num, adjust="fdr")

d.cvi = read.csv("cvinumber.csv")
n.cvi = d.cvi[,1]

#####################################################################
# Sha vs. the other 8 Ecotypes (without Cvi) #
#####################################################################

designsha = design
colnames(designsha)<-c("Bay0", "C24", "Col0", "Cvi", "Est", "Kin0", "Ler", "Nd1", "Shakdara", "Van0")

# Shakdara against the average of the other eight ecotypes (Cvi excluded)
contrast.matrixsha<-makeContrasts(Shakdara - Bay0/8 - C24/8 - Col0/8 - Est/8 - Kin0/8 - Ler/8 - Nd1/8 -
Van0/8, levels=designsha)

eco.fitsha = lmFit(Ecodata[,2:31],designsha)
eco.fit2sha = contrasts.fit(eco.fitsha, contrast.matrixsha)
eco.ebsha = eBayes(eco.fit2sha)

#= Toptable =# selecting the first 500 genes from toptable

sha = topTable(eco.ebsha, coef=1, n=num, adjust="fdr")
d.sha = read.csv("shanumber.csv")
n.sha = d.sha[,1]

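#####################################################################
# High Variation - geneselect #
#####################################################################

# Rank genes by variance across arrays and keep the top "number" genes
vars=apply(AtGE, 1, var)
sortvars=sort(vars,decreasing = TRUE)
geneselect=sortvars[1:number]

gs = names(geneselect)
gs = as.numeric(gs)

Ecodata[gs,1]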

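#####################################################################
# Random Selection - ran #
#####################################################################

# Draw "number" random row indices from the 22810 probe sets
x=runif(number, min=1, max=22810)
ran=as.integer(x)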

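#####################################################################
# RandomForest #
#####################################################################

rfgenes = n.cvi # changeable variable
rfname = "n.cvi"

library(randomForest)

eco1=t(Ecodata[rfgenes,2:31]) ## select "number" genes
## factor response, so that randomForest performs classification
econames=factor(rep(c("Bay0", "C24", "Col0", "Cvi", "Est", "Kin0", "Ler", "Nd1", "Shakdara", "Van0"),each=3))

colnames(eco1)=Ecodata[rfgenes,1]

ecotype1=data.frame(eco1,econames) ## Data which we want ##

ecotype.rf = randomForest(econames ~ ., data=ecotype1, ntree=100,
keep.forest=TRUE, importance=TRUE)

ecotype.rf

imp = importance(ecotype.rf)
# see Accuracy: with 10 classes, column 11 is the mean decrease in accuracy
plot(sort(imp[,11]),type="h",ylab="Importance Score", main = rfname)

#####################################################################
# ntree vs. OOB error rate #
#####################################################################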

ntree=300
nrf=10 # number of bootstrap replicates

m = matrix(rep(0,ntree*nrf),nrow=ntree)

for (j in 1:nrf){
  for (i in 1:ntree){
    ecotype.rf = randomForest(econames ~ ., data=ecotype1, ntree=i, mtry=sqrt(22810),
    keep.forest=TRUE, importance=TRUE)
    m[i,j]=ecotype.rf$err.rate[i,1] # OOB error rate of a forest with i trees
    matplot(m,type="l",col="grey",lty=1,
    xlab="number of trees",ylab="OOB error rate",ylim=c(0,1),frame.plot=F)
    axis(1, seq(0,ntree,by=50),col = "#EE9A00", col.axis="blue", lwd = 2)
    axis(2, seq(0,1,by=0.2),col = "#EE9A00", col.axis="blue", lwd = 2)
  }
  par(new=T)
}

# Mean OOB error rate over the replicates; undefined values are set to 1
mmean = apply(m, 1, mean)
mmean = ifelse(is.nan(mmean), 1, mmean)

op = par(new=T)
par(op)
plot(mmean,type="l",cex=1,col="red",lwd=2,
xlab="number of trees",ylab="OOB error rate",ylim=c(0,1),frame.plot=F,axes=F)
axis(1, seq(0,ntree,by=50),col = "#EE9A00", col.axis="blue", lwd = 2)
axis(2, seq(0,1,by=0.2),col = "#EE9A00", col.axis="blue", lwd = 2)

par(op)

mini = min(mmean)

a = which(mmean==mini) # a is the number of trees which we want
text(a,0.4,paste("ntree",a),adj=1,cex=1.2,col="dark green")

points(a,mmean[a],col="dark green",cex=1.2,lwd=3)
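
The loop above refits a forest for every tree count. A cheaper way to inspect the same relationship, shown here only as a hedged sketch and not as the procedure used in this thesis, is to fit one forest and read the cumulative OOB error from its `err.rate` matrix, whose row i holds the error after the first i trees. The example uses the built-in `iris` data as a stand-in for the ecotype data; `rf`, `oob` and `best_ntree` are illustrative names.

```r
library(randomForest)

set.seed(1)  # make the sketch reproducible

# One forest with 300 trees on a stand-in data set (iris, not thesis data).
rf <- randomForest(Species ~ ., data = iris, ntree = 300)

# Column "OOB" of err.rate: cumulative OOB error after 1, 2, ..., 300 trees.
oob <- rf$err.rate[, "OOB"]

# Smallest tree count that attains the minimum OOB error.
best_ntree <- which.min(oob)

plot(oob, type = "l", xlab = "number of trees", ylab = "OOB error rate")
points(best_ntree, oob[best_ntree], col = "dark green", lwd = 3)
```

This single-fit curve is one realisation; averaging several such curves over different seeds would play the role of the `nrf` replicates in the loop above.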

#= After finding "optimal ntree" =#

ecotype.rf = randomForest(econames ~ ., data=ecotype1, ntree=a,
keep.forest=TRUE, importance=TRUE)
ecotype.rf

econames=as.factor(econames)
e = ecotype1[,-501] # drop the response column, keep the 500 genes
rf.eco <- varSelRF(e, econames, ntree = 210, mtry=4)

rf.eco

plot(rf.eco)

#####################################################################
# mtry vs. OOB error rate #
#####################################################################

rf.mtry=170
nrf=10 # number of bootstrap replicates

numtree=a

m = matrix(rep(0,rf.mtry*nrf),nrow=rf.mtry)

for (j in 1:nrf){
  for (i in 130:rf.mtry){
    ecotype.rf = randomForest(econames ~ ., data=ecotype1, ntree=numtree, mtry=i,
    keep.forest=TRUE, importance=TRUE)
    m[i,j]=ecotype.rf$err.rate[numtree,1] # OOB error rate of the full forest
    matplot(m,type="l",col="grey",lty=1,
    xlab="number of mtry",ylab="OOB error rate",ylim=c(0,1),xlim=c(130,170),frame.plot=F)
    axis(1, seq(130,170,by=2),col = "#EE9A00", col.axis="blue", lwd = 2)
    axis(2, seq(0,1,by=0.2),col = "#EE9A00", col.axis="blue", lwd = 2)
  }
  par(new=T)
}

mmean = apply(m, 1, mean)

#mmean = ifelse(is.nan(mmean), 1, mmean)

op = par(new=T)
par(op)
plot(mmean,type="l",cex=1,col="red",lwd=2,
xlab="number of mtry",ylab="OOB error rate",ylim=c(0,1),xlim=c(130,170),frame.plot=F,axes=F)
axis(1, seq(130,170,by=2),col = "#EE9A00", col.axis="blue", lwd = 2)
axis(2, seq(0,1,by=0.2),col = "#EE9A00", col.axis="blue", lwd = 2)

par(op)

mini = min(mmean[131:170])

b = which(mmean==mini) # b is the mtry value which we want
text(b,0.4,paste("mtry",b),adj=0,cex=1.2,col="dark green")

points(b,mmean[b],col="dark green",cex=1.2,lwd=3)

#= After finding "optimal ntree" & "optimal mtry" =#

ecotype.rf = randomForest(econames ~ ., data=ecotype1, ntree=a, mtry=b,
keep.forest=TRUE, importance=TRUE)

ecotype.rf

#####################################################################
# Select the number of Genes from RandomForest #
#####################################################################

rfgenes=n.cvi

eco1=t(Ecodata[rfgenes,2:31])

econames = rep(c("Bay0", "C24", "Col0", "Cvi", "Est", "Kin0", "Ler", "Nd1",
"Shakdara", "Van0"),each=3)

colnames(eco1)= Ecodata[rfgenes,1]

ecotype1 = data.frame(eco1,econames)
econames=as.factor(econames)

e = ecotype1[,-501] # drop the response column, keep the 500 genes

rf.eco <- varSelRF(e, econames, ntree=a, mtry=b)

rf.eco

plot(rf.eco)

Geo.txt

Ecotype     Latitude       Longitude
Bay-0       49.56          11.34
C24         40.2           8.25
Col-0       43.01251667    -70.05
Cvi         16.00208056    -24.05
Est         59             25.04
Kin-0       42.46638889    -84.46
Ler         48.2           10.52
Nd-1        50.77777778    8.03
Shakdara    37.18333333    73.166
Van-0       49.85049722    -123.11

EcotypesGeo.txt

Ecotype     Samples             Location                          Latitude   Longitude   Altitude
Bay-0       AtGE_111_A, B, C    Bayreuth, Germany                 49.56      11.34       350
C24         AtGE_112_A, C, D    Coimbra, Portugal                 40.20      8.25        179
Col-0       AtGE_113_A, C, D    Columbia University (US)          43.01      -70.05      49
Cvi         AtGE_114_A, B, C    Cape Verde Islands                16.00      24.05       43
Est         AtGE_115_A, B, D    Estonia                           59.00      25.04       15
Kin-0       AtGE_116_A, B, C    Kinneville, MI                    42.47      -84.46      273
Ler         AtGE_117_B, C, D    Landsberg, Germany                48.20      10.52       628
Nd-1        AtGE_118_A, B, C    Niederzunzheim, Germany           50.78      8.03        250
Shakdara    AtGE_119_A, C, D    Pamiro-Alay, Tadjikistan          37.18      73.17       3400
Van-0       AtGE_120_A, B, C    University of British Columbia    49.85      -123.11     50

AtGE_ecotypes.txt

This file is available on the Weigel World website:

http://www.weigelworld.org/resources/microarray/AtGenExpress/


BIBLIOGRAPHY


A. Liaw, M. Wiener (2002), Classiﬁcation and regression by randomForest, R News 2/3:
18—22.

C. Strobl, A. Boulesteix, A. Zeileis, T. Hothorn (2007), Bias in Random Forest Variable
Importance Measures: Illustrations, Sources and a Solution, BMC Bioinformatics 8:25.

C.V. Zwan, S.Brodie, J.J. Campanella (2000), The intraspeciﬁc phylogenetics of
Arabidopsis thaliana in worldwide populations, Systematic Botany 25: 47—59.

D. W. Meinke, et al. (1998), Arabidopsis thaliana: a model plant for genome analysis,
Science, Vol.282.

Furlanello et al. (2003), GIS and the Random Forest Predictor: Integration in R for
Tick-borne Disease Risk Assessment, Proceedings of the DSC-03 International Workshop on
Distributed Statistical Computing, Vienna, Austria.

H. Pang, A. Lin, M. Holford, B.E. Enerson, B. Lu, M. P. Lawton, E. Floyd, H. Zhao
(2006), Pathway analysis using random forests classification and regression,
Bioinformatics, Vol.22.

Jinwook Seo, Ben Shneiderman (2002), Interactively Exploring Hierarchical Clustering
Results, IEEE Computer, Vol 35: 80-86.

K. Apel, H. Hirt (2004), Reactive oxygen species: Metabolism, oxidative stress and
signal transduction, Annual Review Plant Biology, Vol.55: 373—399

L. Breiman (2001), Random Forests, In Machine Learning, Vol.45: 5-32

L. Breiman, A. Cutler, Random Forests.
URL: http://www.stat.berkeley.edu/users/breiman/RandomForests/cc_papers.htm

M. Schmid, T.S. Davison, S.R. Henz, U.J. Pape, M. Demar, M. Vingron, B. Schölkopf,
D. Weigel, and J.U. Lohmann (2005), A gene expression map of Arabidopsis
development. Nature Genetics, Vol.37: 501-506.

R. Diaz-Uriarte (2004), Variable Selection from Random Forests: Application to Gene
Expression Data, Spanish Bioinformatics Conference 2004: 47-52.

R. Diaz-Uriarte (2005), GeneSrF and varSelRF: a web-based tool and R package for
gene selection and classification using random forest, CNIO, Spain.

R. Diaz-Uriarte, S. Alvarez de Andres (2006), Gene selection and classiﬁcation of
microarray data using random forest, BMC Bioinformatics 2006, 7:3.

S. Karpinski, H. Reynolds, B. Karpinska, G. Wingsle, G. Creissen, P. Mullineaux
(1999), Systemic signaling and acclimation in response to excess excitation energy in
Arabidopsis. Science Vol.284: 654—657.

Smyth, G K. (2004), Linear models and empirical Bayes methods for assessing
differential expression in microarray experiments, Statistical Applications in Genetics
and Molecular Biology 3, No. 1, Article 3.

T.K. Ho (1995), Random Decision Forests, Proceedings of the 3rd International
Conference on Document Analysis and Recognition, Montreal, Canada: 278-282.

U. Johanson, et al. (2000), Molecular Analysis of FRIGIDA, a Major Determinant of
Natural Variation in Arabidopsis Flowering Time, Science Vol.290.

Y. Truong, X. Lin, C. Beecher (2004), Learning a complex metabolomic dataset using
random forests and support vector machines, KDD (Knowledge Discovery and Data
Mining): 835-840.
