UNRAVELING PLANT GENE REGULATORY NETWORKS USING MULTILAYER DATA 
INTEGRATION 

By 

Fabio Andrés Gómez Cano 

A DISSERTATION  

Submitted to  
Michigan State University 
in partial fulfillment of the requirements 
for the degree of 

Biochemistry and Molecular Biology - Doctor of Philosophy 

2023 

 
 
 
 
 
 
 
 
  
  
  
  
 
 
 
 
 
 
 
 
  
 
 
 
 
ABSTRACT 

The  translation  of  genotype  into  phenotype  largely  depends  on  genes  being  expressed  in  the 

appropriate  cell  types  at  the  correct  time.  These  expression  patterns  are  largely  determined  by 

transcription factors (TFs) controlling specific gene sets which together result in gene regulatory 

networks (GRN). GRNs may be elucidated using TF-centered approaches, such as DNA-affinity 

purification  and  chromatin  immunoprecipitation  sequencing  (DAP-  &  ChIP-seq,  respectively). 

Alternatively,  the  generation  of  thousands  of  gene  expression  samples  has  allowed  the 

implementation of robust methods for TF-target inference. As part of my research, I developed 

strategies  that  integrate  several  high-throughput  data  types  to  identify  transcription  factor 

regulators  of  a  broad  spectrum  of  metabolic  pathways  in  several  plant  systems.  Specifically,  I 

established frameworks for the analysis of Camelina sativa, maize (Zea mays), and Arabidopsis 

thaliana  with  species-specific  tailored  pipelines.  Data  resources  availability  by  species-guided 

pipeline differed between species. In Camelina, I combined expression and DAP-seq assays to 

identify transcriptional regulators of lipid metabolism. In maize, I integrated expression variation, 

expression  quantitative  loci  (eQTLs),  and  DAP-  &  ChIP-seq  to  build  a  multiple-layer  network 

predicting regulators of phenylpropanoid, lipids, and carbon metabolism. Lastly, for Arabidopsis, 

utilizing  a  vast  collection  of  RNA-seq  samples,  protein-DNA  interactions  (PDI),  and  protein-

protein  interactions  (PPI),  I  tested  co-regulation  models  that  incorporate  the  influence  of  TF 

physical interactors on TF-target co-expression profiles. This comprehensive analysis also enabled 

the prediction of high-level TF complexes, providing valuable insights for refining models of TF 

regulation  based  on  co-expression.  Together,  my  studies  contributed  new  knowledge  to  the 

regulatory  hypotheses  of  specific  metabolic  pathways  in  plants,  establishing  a  framework  for 

elucidating GRN in other systems. 

 
ACKNOWLEDGEMENTS 

I  wish  to  extend  sincere  gratitude  to  my  esteemed  advisor,  Dr.  Erich  Grotewold,  and  all  the 

distinguished  members  –  both  present  and  past  –  of  the  Grotewold  Lab.  The  remarkable 

contributions and fruitful collaborations bestowed upon this work have been invaluable, making 

this endeavor possible through their unwavering support and steadfast dedication. I would also like 

to  express  my  appreciation  to  all  the  committee  members  for  their  willingness  to  actively 

participate  and  continually  contribute,  further  enhancing  the  quality  and  impact  of  this  work. 

Moreover, I am deeply thankful for the support from Michigan State University and the NRT-

IMPACTS fellowship program during my doctoral studies. 

I would also like to acknowledge collaborators who played a crucial role in my research, such as 

Dr. Danny Schnell, Dr. Shin-Han Shiu, Dr. Patrick Edger, Dr. Arjun Krishnan, and Dr. Nathan 

Springer, all of whom were indispensable to my research outcomes. A special thanks goes to Dr. 

Arjun Krishnan, who consistently accompanied my journey through the PhD with a willingness to 

listen to any sort of unconventional ideas that come to mind, guiding them in the right direction. 

Although this PhD took five years, it is a dream that started many years ago involving a lot of 

people  who  supported  and  brought  light  to  me  in  the  darkest  moments.  I  would  like  to  thank 

especially Angela Natalia Cano Garavito and Nubia Suárez Rincón, who were like angels for me 

during my early days in high school and as an undergraduate student back in Colombia. Luis Angel 

Cano Garavito, friend and big brother who was always there for give me hand when need it. Fabio 

Aldemar  Gomez  Sierra,  mentor  who  guided  me  in  nurturing  my  scientific  passion.  I  will  also 

express my gratitude to Juan Gonzalez and Julieta Manrique, who were more than friends; they 

were like my second parents in the US. 

iii 

 
I want to extend my deep appreciation to my cherished parents and siblings for their unwavering 

support,  invaluable  guidance,  and  selfless  sacrifices.  Without  them,  none  of  my  achievements 

would have been possible. I'd also like to thank my dear Isaac and Zoe, who are my oxygen and 

reason to persevere each day. Lastly, I express my gratitude to my beloved wife – Mariel – who 

serves as my guiding light in the storm, my oasis in the desert, and my faithful companion on this 

journey. This PhD is dedicated to her! 

iv 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
TABLE OF CONTENTS 

LIST OF ABBREVIATIONS…………………………………………………………………...viii 

CHAPTER ONE: INTRODUCTION......………………………………………………………....1 
1.1 GENE REGULATORY NETWORKS (GRNs)………………….…………………………2 
1.2 GRN CHARACTERIZATION…………………………………………………...................3 
1.3 COMBINATORIAL GENE REGULATION (CGR) ………………………………………6 
1.4 UNRAVELING GENE REGULATION: THE ROLE OF MULTI-OMICS…………….…8 
1.5 WORKING SYSTEM AND CHAPTERS DISTRIBUTION……………………………….9 
REFERENCES………………………………………………………………………….……...11 

CHAPTER  TWO:  CAMREGBASE:  A  GENE  REGULATOIN  DATABASE  FOR  BIOFUEL 
CROP, CAMELINA SATIVA1…………….………….………………………………………...........18 
2.1 ABSTRACT………….……………….……………….…………………………………….…19 
2.2 INTRODUCTION………….……………….……………….………………………………...19 
2.3 RESULTS AND DISCUSSION………….……………….……………….…………………21 
2.3.1 Database structure………….……………….……………….……………………………..21 
2.3.2 Expression database content………….……………….……………….…………………22 
2.3.3 Annotation of TFs………….……………….……………….……………………………..24 
2.3.4 Database functionalities………….……………….……………….……………………....26 
2.4 METHODS………….……………….……………….………………………………………...26 
2.4.1 Gene expression data source………….……………….………….…………………….....26 
2.4.2 Database and web platform construction………….……………….……………….…....27 
2.4.3 Camelina sativa gene annotation………….………….…….……….……….…………….27 
2.4.4 Gene regulation data collection and analysis…………...………….……………….……28 
2.4.5 Gene co-expression analysis………….………………..…………..….…........…...…….28 
REFERENCES……….………….………………………………………………………….…...30 

CHAPTER THREE: EXPLORING CAMELINA SATIVA LIPID METABOLISM 
REGULATION BY COMBINING GENE CO-EXPRESSION AND DNA AFFINITY 
PURIFICATION ANALYSIS1…………………………………………………………………..33 
3.1 ABSTRACT……………………………………………….….…..…….………………….34 
3.2 INTRODUCTION…………………………………………….…………………………....34 
3.3 RESULTS………………………………….……………………………………………….38 
3.3.1  Expression  analysis  of  genes  involved  in  lipid  accumulation  during  Camelina  seed 
development………………………………….……………………………………………….38 
3.3.2  Identification  of  candidate 
transcriptional  regulators  by  co-expression 
analysis…………………………………………………………….…….…………….….…..41 
3.3.3 Establishing the DNA-binding landscape of the candidate transcription factors……....45 
3.3.4 Predicting gene targets for the selected TFs……………………………….……….…..51 
v 

lipid 

 
 
 
 
3.3.5 Identified TFs associate with distinct aspects of lipid metabolism…………........….…56 
3.3.6 Dynamic behavior of the predicted networks during seed development………………59 
3.4 DISCUSSION………….….….……….……………….…………………………………....63 
3.5 METHODS………………………………………………………………………………....68 
3.5.1 Plant materials and growth conditions…………………………………………………68 
3.5.2 Cloning and expression of transcription factors for DAP-seq…………………………68 
3.5.3 RNA-seq library preparation…………………………………………………………...68 
3.5.4 DAP-seq library preparation………………………………....………………..….…….69 
3.5.5 Data processing, quantification, and statistical analyses…………………….…………70 
3.5.6 Data availability and accession numbers………………………………………………74 
REFERENCES……………………………………………………………………………....…75 

CHAPTER FOUR: MULTI-NETWORK INTEGRATION TO PRIORITIZE REGULATORY 
GENES OF METABOLISM IN MAIZE.......................................................................................85 
4.1 ABSTRACT..........................................................................................................................86 
4.2 INTRODUCTION.................................................................................................................86 
4.3 RESULTS..............................................................................................................................91 
4.3.1 Construction of a maize regulatory network based on multiple layers...........................91 
4.3.2 TF Functional annotation................................................................................................96 
4.3.3 Evaluation of functional prediction with knockouts.......................................................100 
4.3.4 Evaluation of functional prediction by comparing with random networks................ ...102 
4.3.5 Prioritization of regulators by biological process..........................................................105 
4.3.6 Topological properties predict TF homeologs redundancy...........................................110 
4.4 DISCUSSION......................................................................................................................114 
4.5 METHODS..........................................................................................................................118 
4.5.1 Genetic markers.............................................................................................................118 
4.5.2 RNA-seq and co-expression data..................................................................................119 
4.5.3 eQTL identification and classification..........................................................................119 
4.5.4 Protein-DNA interactions data analysis........................................................................120 
4.5.5 Functional annotation....................................................................................................121 
4.5.6 Network integration......................................................................................................121 
4.5.7 Knockout and random network validation....................................................................123 
4.5.8 Prioritization of transcriptional regulators-process associations..................................124 
4.5.9 Similarities in sequence among TF paralogs................................................................125 
REFERENCES.........................................................................................................................126 
APPENDIX...............................................................................................................................134 

CHAPTER FIVE: ARABIDOPSIS CO-EXPRESSION SIGNATURES OF COMBINATORIAL 
GENE REGULATION.................................................................................................................149 
5.1 ABSTRACT.........................................................................................................................150 

vi 

 
 
 
5.2 INTRODUCTION................................................................................................................150 
5.3 RESULTS.............................................................................................................................153 
5.3.1 Transcription factors and their targets show varying levels of co-expression..............153 
5.3.2 Few targets are highly co-expressed with their respective TFs......................................158 
5.3.3 PPIs condition TF co-expression with direct targets......................................................158 
5.3.4 Co-expressed targets shared by binary TF complexes suggest higher-order 
arrangements... .........................................................................................................................163 
5.3.5 Genes highly co-expressed with TFs are enriched in indirect TF targets......................167 
5.4 DISCUSSION.......................................................................................................................169 
5.5 METHODS..........................................................................................................................172 
5.5.1 Data collection...............................................................................................................172 
5.5.2 Evaluation of co-expression and determination of mutual rank values.........................172 
5.5.3 Identification of TFs co-expressed with the corresponding target genes.......................173 
5.5.4 Identification of targets co-expressed with TF complexes............................................174 
5.5.5 Definition of highly co-expressed targets......................................................................175 
5.5.6 Degree network connectivity.........................................................................................175 
5.5.7 Protein-Protein Interactions (PPIs) and Protein-DNA interactions (PDIs) network 
randomization..........................................................................................................................175 
5.5.8 Definition of tri-bi complexes with significant number of shared targets....................176 
5.5.9 Counting the HCG of a TFx that are targeted by TFz partners and TFy downstream of 
the corresponding TFx............................................................................................................176 
5.5.10 Definition of local expression clusters.........................................................................176 
REFERENCES...........................................................................................................................178 
APPENDIX.................................................................................................................................186 

CHAPTER SIX: CONCLUSIONS................................................................................................192 

vii 

 
 
 
 
 
 
 
 
 
 
 
 
 
LIST OF ABBREVIATIONS 

ABA 

ACR 

Abscisic acid 

Accessible chromatin regions 

ATAC  

Assay for transposase accessible chromatin 

ChIP-seq 

Chromatin immunoprecipitation sequencing 

CRE 

Cis-regulatory element 

CRMs   

Cis-regulatory modules 

CUT&RUN  Cleavage under targets and release using nuclease 

CUT&Tag 

Cleavage under targets and tagmentation 

DEG 

DNA 

Differentially expressed gene 

Deoxyribonucleic acid  

eQTL   

expression quantitative loci 

GRN 

Gene regulatory network 

GWAS  

Genome-wide association studies  

HCG 

HCT 

LCT 

PDI 

PPI 

RNA 

SNP 

TF 

TE 

Highly co-expressed gene 

Highly co-expressed target 

Low co-expressed target 

Protein-DNA interaction 

Protein-protein interaction 

Ribonucleic acid 

Single-nucleotide polymorphism 

Transcription factor 

Transposable element 

viii 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
TRAP   

Translating ribosome affinity purification 

WiDiv  

Wisconsin diversity 

ix 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER ONE: INTRODUCTION 

1 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1.1 GENE REGULATORY NETWORKS (GRNs) 

Plants, unlike many other organisms, are sessile but account for over 80% of biomass on Earth 

(Bar-On et al., 2018). Their remarkable success can be attributed to their physiological diversity, 

which is governed by complex molecular networks. Therefore, a plant phenotype, whether it is 

morphological  or  physiological,  can  be  defined  as  an  emergent  property  of  the  molecular 

interactions that underlie it. Within these intricate molecular networks, transcription factor (TF) 

proteins play a crucial role as they are positioned at the end of signaling pathways and guide the 

transcription machinery responsible for the activation or repression of other genes (referred to as 

target genes of the corresponding TFs) (Gupta et al., 2021). The mechanistic basis of TF function 

lies in their ability to form protein-DNA interactions (PDI) by recognizing specific cis-regulatory 

elements  (CREs)  located  near  or  distant  from  their  target  genes.  Such  interactions  guide  the 

recruitment of the transcriptional machinery. The collection of TFs and their corresponding target 

genes constitutes a gene regulatory network (GRN). In plants, as in other organisms, the structure 

of  these  GRNs  determines  spatiotemporal  gene  expression  patterns  (Swift  and  Coruzzi,  2017). 

Consequently, the wiring of a GRN has implications for phenotypic variation (Deplancke et al., 

2016), plant responses to abiotic and biotic stress (Nakashima et al., 2014; Birkenbihl et al., 2017), 

speciation (Mack and Nachman, 2017), adaptation, and diversification (Mack and Nachman, 2017; 

Bowles  et  al.,  2020),  highlighting  and  justifying  any  effort  to  understand  its  structure  and 

dynamics.  

CRE sequence variation, primarily located in the non-coding regions of the genome, drives 

rewiring changes in GRNs (Sullivan et al., 2014). Single-nucleotide polymorphisms (SNPs) and 

small  insertions/deletions  within  CREs  can  affect  TF  binding  affinity,  altering  the  interaction 

between TFs and their corresponding CRE (Marand et al., 2023). However, transposable elements 

2 

 
(TEs), which are highly abundant in non-coding sequences (Bennetzen et al., 2005) and constitute  

up  to  85%  of  the  plant  genome,  such  as  maize  (Schnable  et  al.,  2009),  are  among  the  major 

contributors of genomic variability. TEs can impact gene function through various mechanisms, 

such  as  gene  inactivation,  gene  expression  reprogramming,  deletions,  rearrangements,  gene 

transposition, and protein exaptation (Lisch, 2013; Schmitz et al., 2022). In terms of expression 

variation, TEs can induce gene expression reprogramming by inserting, removing, or establishing 

new regulatory connections (Greene et al., 1994; Butelli et al., 2012). Moreover, TE insertions can 

modify  the  epigenetic  landscape  surrounding  a  gene,  leading  to  changes  in  gene  expression 

through chromatin modifications.  

1.2 GRN CHARACTERIZATION 

Wet lab approaches. Approaches to establish PDI can be categorized as gene-centered and TF-

centered methods, which correspond to strategies focused on identifying TF regulators for specific 

genes and target genes for specific TFs, respectively (Arda and Walhout, 2010; Mejia-Guerra et 

al., 2012; Yang et al., 2017). The yeast one-hybrid (Y1H) assay and the electrophoretic mobility 

shift assay (EMSA) are frequently employed gene-centered methods (Arda and Walhout, 2010; 

Yang  et  al.,  2016).  Among 

the  diverse  array  of  TF-centered  strategies,  Chromatin 

Immunoprecipitation Sequencing (ChIP-seq) is a highly utilized assay for the identification of TF 

binding sites (TFBS) in vivo. Variations of ChIP-seq include Cleavage Under Targets and Release 

Using  Nuclease  (CUT&RUN)  (Skene  and  Henikoff,  2017)  and  Cleavage  Under  Targets  and 

Tagmentation (CUT&Tag) (Kaya-Okur et al., 2019), which overcome challenges associated with 

crosslinking  and  solubilization.  These  methods  also  require  minimal  sample  material,  offering 

significant  advantages  in  experimental  applications.  Within  the  in  vitro  techniques,  systematic 

evolution of ligands by exponential enrichment (SELEX), protein binding microarrays (PBM), and 

3 

 
DNA affinity purification sequencing (DAP-seq) are within the most widely used methods (Yang 

et  al.,  2016;  O’Malley  et  al.,  2016).  Limitations  to  consider  for  EMSA  and  ChIP-seq  include 

restrictions  on  the  number  of  sequences  and  TFs  that  can  be  tested,  respectively.  Additionally, 

ChIP-seq  captures  numerous  indirect  binding  events,  making  it  challenging  to  identify  direct 

targets.  Similarly,  DAP-seq,  SELEX,  and  PBM  can  produce  a  high  number  of  non-functional 

PDIs, primarily due to the lack of a native chromatin environment (Yang et al., 2016). Therefore, 

TF-target  gene  associations  determined  by  these  methods  always  require  further  experimental 

validation. 

Given the inherent presence of false positives and the large number of interactions obtained 

through these experimental approaches, complementary analyses have been employed to identify 

high-confidence TF-target gene associations. The most widely used strategy is the identification 

of differentially expressed genes (DEGs) - after the perturbation of the corresponding TF - which 

identifies  downstream  genes  affected  by  the  perturbation  of  the  corresponding  TF.  The 

perturbation  itself  also  recovers  a  large  number  of  indirect  changes,  such  as  cellular  responses 

associated with the perturbation itself. However, the combination of PDI and DEG analyses allows 

for  the  identification  and  differentiation  between  direct  target  genes  and  indirect  effects  of  the 

perturbation, respectively. Shockingly, this approach has shown that the overlap between DEGs 

and  PDIs  is  overall  low  and  may  vary  between  5-30%  (Zeller  et  al.,  2006;  Morohashi  and 

Grotewold, 2009; Morohashi et al., 2012; Eveland et al., 2014; Liu et al., 2015), indicating that a 

large fraction of the PDI may not lead to expression changes of the corresponding target genes. In 

yeast, the low fraction of overlapping DEG and PDI was associated with paralog TFs backing-up 

the  function  of  knocked-out  TFs  (Gitter  et  al.,  2009).  This  phenomenon  has  not  yet  been 

investigated in the context of plants. As an alternative to perturbation analysis, the identification 

4 

 
of  co-expression  networks  has  gained  significant  attention  for  narrowing  down  target  genes  to 

those that exhibit high co-expression with the corresponding TF (Eisen et al., 1998; Allocco et al., 

2004; Vandepoele et al., 2009; Haynes et al., 2013; Wu and Ji, 2013; Angelini and Costa, 2014; 

Jiang  and  Mortazavi,  2018;  Haque  et  al.,  2019).  Thus,  the  integration  of  DEGs  under  TF 

perturbations and co-expression networks provides opportunities to improve predictions obtained 

from experiments like DAP-seq. This approach is particularly valuable for systems in which ChIP-

seq presents technical challenges, such as to generate mutants or antibodies for the corresponding 

TF. Additionally, it also allows for scalability in the number of TFs that can be tested (O’Malley 

et al., 2016; Ricci et al., 2019).  

Numerous  systematic  and  genome-wide  endeavors  have  led  to  the  discovery  of  millions  of 

PDIs in various model organisms (Harbison et al., 2004; Deplancke et al., 2006; Zhu et al., 2009; 

Gerstein et al., 2010; Consortium et al., 2010; Négre et al., 2011; ENCODE Project Consortium, 

2012). In the case of plants, specifically Arabidopsis thaliana and maize (Zea maize L.), similar 

efforts  have  been  undertaken  on  a  smaller  scale  and  within  specific  biological  contexts.  These 

include the regulation of the root stele (Brady et al., 2011), secondary cell wall synthesis (Taylor-

Teeples et al., 2015), phenolic metabolism (Yang et al., 2017), flower development (Chen et al., 

2018), as well as responses to ABA (Song et al., 2016) and nitrogen (Gaudinier et al., 2018) among 

others. It is also noteworthy to highlight the significant contributions made in the identification of 

TF binding motifs (TFBMs) for over 640 TFs in Arabidopsis (O’Malley et al., 2016; Weirauch et 

al., 2014; Franco-Zorrilla et al., 2014) and more than 30 TFs in maize (Ricci et al., 2019; Galli et 

al., 2018). Invaluable source of information for the construction of regulatory models based on 

multi-omic data integrations (Song et al., 2020; Pérez et al., 2023). 

5 

 
Computational  approaches.  Technological  advances  in  RNA  sequencing  have  allowed  the 

generation of thousands of expression samples, enabling the implementation of methods for TF-

target inference. All these methods utilize the idea of identifying co-expressed genes as a means 

of  inferring  regulation  without  prior  knowledge  of  the  regulatory  network.  Among  the  various 

forms of co-expression, the most commonly employed approach is the inference of gene regulatory 

networks (GRNs) through the analysis of expression variations in spatial (e.g., different organs), 

temporal (e.g., developmental trajectory), perturbation, or genetic background contexts (Haque et 

al., 2019; Zhou et al., 2020). In all scenarios, the construction of a co-expression network involves 

three  key  steps:  data  processing  and  normalization,  network  reconstruction,  and  network 

evaluation (Haque et al., 2019; Johnson and Krishnan, 2022). While all three steps are important, 

the reconstruction method is particularly critical due to the constraints/assumptions it imposes on 

the network and the ability to differentiate between association and causation associations (Haque 

et  al.,  2019).  The  strategies  for  network  reconstruction  can  be  classified  into  four  categories, 

including  correlation  and  information-theoretic  approaches,  Boolean  network  approaches, 

Bayesian network approaches, and regression and differential equation-based models (Banf and 

Rhee,  2017).  Each  approach  has  its  strengths  and  limitations,  especially  when  considering  the 

network's scale and the number of samples. However, common practices to enhance their strength 

and reduce limitations include restricting tested interactions, incorporating known interactions to 

improve threshold identification during the prediction process, and utilizing background models 

based on randomly assigned expression datasets (Banf and Rhee, 2017). 

1.3 COMBINATORIAL GENE REGULATION (CGR) 

A defining characteristic of GRNs is their combinatorial nature, where a single TF can regulate 

multiple sets of target genes through interactions with other proteins. These interactions can be 

6 

 
direct or indirect, for example mediated by DNA, and involve multiple regulatory proteins. This 

phenomenon  is  known  as  combinatorial  gene  regulation  (CGR).  From  a  practical  perspective, 

CGR presents unique challenges for the prediction or identification of transcriptional regulation of 

specific processes, as a single TF may be linked to multiple processes. Additionally, multiple TFs 

may  be  linked  to  the  same  process  simultaneously.  Consequently,  CGR  contributes  to  the 

expansion and diversification of the regulatory repertoire of TFs (Reményi et al., 2004; Brkljacic 

and Grotewold, 2017). At the molecular level, implications of CGR include that TFs may form 

different  protein  complexes  and/or  bind  to  DNA  in  modular  fashion  to  cis-regulatory  modules 

(CRMs) (Brkljacic and Grotewold, 2017). In general, TF binding to a CRM can be categorized 

into  three  models:  independent  binding,  competitive  binding,  and  cooperative  binding.  In 

independent binding, TFs bind to separate CREs without any physical interaction between them. 

Competitive  binding  occurs  when  different  TFs  compete  for  the  binding  of  the  same  CREs, 

potentially involving physical interactions. Cooperative binding, on the other hand, requires the 

formation of a TF complex to bind a CRE (Reiter et al., 2017; Colinas and Goossens, 2018). Major 

advances has been main to understand the molecular mechanisms behind the CGR, including TFs 

spatiotemporal expression variation, TFs post-translational modification, splicing of different TFs 

isoforms,  TF  conformational  changes  trigger  by  the  interaction  of  small  molecules,  as  well  as 

histone modifications and chromatin structure (Brkljacic and Grotewold, 2017; Reiter et al., 2017). 

However, there is currently no single model that comprehensively predicts the CGR landscape of 

a  gene  or  biological  process,  i.e.,  the  combination  of  TFs  that  may  exert  control  over  the 

corresponding gene or biological process. 

7 

 
 
 
1.4 UNRAVELING GENE REGULATION: THE ROLE OF MULTI-OMICS 

The incorporation of diverse genomic information has enhanced the accuracy of GRN models 

(Qian and Huang, 2020). Common sources of information include accessible chromatin regions 

(ACRs),  histone  marks,  and  DNA  methylation  patterns,  enabling  the  construction  of 

cell/tissue/condition-specific regulatory circuits (ENCODE Project Consortium, 2012; Neph et al., 

2012;  Baur  et  al.,  2020).  The  integration  of  additional  layers  of  information  offers  several 

advantages,  such  as  uncovering  novel  regulatory  principles  and  the  identification  of  new 

combinations  of  cis-regulatory  elements  (i.e.,  novel  CRMs)  (Neph  et  al.,  2012;  Sullivan  et  al., 

2014).  Furthermore,  the  inclusion  of  protein-protein  interactions  (PPIs)  between  TFs  and  their 

corresponding PDIs associates highly connected TFs with stronger expression effects (Gerstein et 

al., 2012). Additionally, genes targeted by multiple TFs exhibit broader expression windows and 

at  the  same  time  collection  of  co-binding  events  enables  the  identification  of  TF  complexes 

(Heyndrickx  et  al.,  2014).  These  co-binding  events  have  demonstrated  specificity  to  particular 

biological processes, such as development-specific gene expression patterns (Chen et al., 2018). 

In addition to the PDI-related datasets, the construction of GRNs based on transcriptomics and 

proteomics  has  shown  to  complement  each  other,  recovering  more  interactions  together  than 

individually when compared to GRNs built from ChIP-seq assays (Walley et al., 2016). Similarly, 

the integration of multiple layers of genomic information, ranging from chromatin to translation 

changes,  enables  the  identification  of  species-  and  layer-specific  responses  to  submergence,  as 

well as the CRE architectures responsible for submergence-induced expression changes (Reynoso 

et  al.,  2019).  The  inclusion  of  marks  that  capture  epigenetic  features,  along  with  chromatin 

accessibility  (Yan  et  al.,  2019)  and  chromatin  interaction  data  (Ricci  et  al.,  2019),  enables  the 

identification  of  development-specific  enhancers  and  the  association  of  distal  ACR  with  target 

8 

 
genes through the formation of chromatin loops. Additionally, the incorporation of genetics, i.e. 

molecular  trait  information  at  population  level,  has  demonstrated  an  outstanding  potential  to 

uncover the molecular mechanisms underlying complex traits (Li et al., 2013; Wen et al., 2014, 

2016;  Mizrachi  et  al.,  2017;  Kremling  et  al.,  2018;  Zhou  et  al.,  2019;  Mazaheri  et  al.,  2019; 

Shrestha et al., 2022). Notably, with a few exceptions (Yang et al., 2022; Schaefer et al., 2018; 

Mizrachi et al., 2017),  the integration of data through the identification of common features or 

patterns have been a common denominator in the described studies. However, significant progress 

has been made in other systems (Lee et al., 2019; Krassowski et al., 2020; Subramanian et al., 

2020; Qian and Huang, 2020; Kang et al., 2022; Vahabi and Michailidis, 2022), and approaches 

used in these studies still need to be tested in plant systems. 

1.5 WORKING SYSTEM AND CHAPTERS DISTRIBUTION 

Analyzing  multi-omics  data  has  unveiled  valuable  insights  in  gene  regulation,  yet  also 

introduced unique challenges. My research addresses some of these challenges by implementing 

and establishing strategies to integrate multiple-omic data and predicting GRNs, uncovering the 

regulatory circuits associated with specific plant biological processes. Specifically, I focused on 

predicting GRNs involved in the regulation of lipid metabolism in Camelina Sativa (Chapter 2), 

as well as other biological processes in Maize (Zea mays) (Chapter 3), and Arabidopsis thaliana 

(Chapter 4), using computational techniques. Due to the unique characteristics and data availability 

of each species, I have developed customized strategies for their analysis.  

Camelina  sativa,  is  a  winter  oilseed  annual  crop,  member  of  the  Brassicaceae  family.  

Camelina oilseed crop that has gained attention for its potential use in biofuel production (Bansal 

and Durrett, 2016; Carlsson, 2009). However, despite its growing popularity, the available gene 

expression datasets for Camelina are limited to a few tens of samples (Gomez-Cano et al., 2020), 

9 

 
which is ~150 and ~500 times less than the data available for maize and Arabidopsis, respectively. 

Thus, I used co-expression-based prediction with hard filters to build a GRN associated with the 

control  of  lipid  metabolism.  My  work  represented  the  first  lipid-related  GRN  described  for 

Camelina.  

Maize is one of the most widely grown cereal crops in the world, its grain and maize stover is 

a source of biomass for liquid fuel and is also extensively used as a major forage component (Khan 

et al., 2015; Trivedi et al., 2015).  Unlike Camelina, maize has a wealth of genomic and genetic 

resources, which favor the generation of regulatory models based on more sophisticated strategies. 

Specifically, I used several multi-omic datasets to build multiple molecular networks, which were 

integrated  using  three  different  approaches.  After  a  systematic  evaluation  of  the  integrations,  I 

selected the best method to describe TF regulators of a diverse set of biological processes in maize. 

These resources are crucial for guiding the design of future experiments and laying the foundation 

for integrating multi-omic datasets in maize and other plant systems.  

Arabidopsis,  like  Camelina,  belongs  to  the  Brassicaceae  family  and  is  one  of  the  most 

extensively studied plant species. This makes Arabidopsis an appealing system for exploring the 

co-expression  relationships  between  TFs  and  their  experimentally  identified  target  genes. 

Leveraging  the  vast  collection  of  expression  and  PDI  datasets  in  Arabidopsis,  I  uncovered 

previously unknown combinations of TFs that contribute to the regulation of diverse biological 

processes.  These  findings  carry  significant  implications  for  the  empirical  understanding  of 

complex gene regulatory networks, the function of transcription factors, and the significance of 

co-expression in protein-protein and protein-DNA interactions. 

10 

 
 
 
 
 
REFERENCES 

Allocco, D.J., Kohane, I.S., and Butte, A.J. (2004). Quantifying the relationship between co-

expression, co-regulation and gene function. BMC Bioinformatics 5: 18. 

Angelini, C. and Costa, V. (2014). Understanding gene regulatory mechanisms by integrating 

ChIP-seq and RNA-seq data: Statistical solutions to biological problems. Frontiers in Cell 
and Developmental Biology 2: 1–8. 

Arda, H.E. and Walhout, A.J.M. (2010). Gene-centered regulatory networks. Brief. Funct. 

Genomics 9: 4–12. 

Banf, M. and Rhee, S.Y. (2017). Computational inference of gene regulatory networks: 

Approaches, limitations and opportunities. Biochim. Biophys. Acta Gene Regul. Mech. 
1860: 41–52. 

Bansal, S. and Durrett, T.P. (2016). Camelina sativa: An ideal platform for the metabolic 
engineering and field production of industrial lipids. Biochimie 120: 9–16. 

Bar-On, Y.M., Phillips, R., and Milo, R. (2018). The biomass distribution on Earth. Proc. Natl. 

Acad. Sci. U. S. A. 115: 6506–6511. 

Baur, B., Shin, J., Zhang, S., and Roy, S. (2020). Data integration for inferring context-specific 

gene regulatory networks. Curr Opin Syst Biol 23: 38–46. 

Bennetzen, J.L., Ma, J., and Devos, K.M. (2005). Mechanisms of recent genome size variation 

in flowering plants. Ann. Bot. 95: 127–132. 

Birkenbihl, R.P., Liu, S., and Somssich, I.E. (2017). Transcriptional events defining plant 

immune responses. Curr. Opin. Plant Biol. 38: 1–9. 

Bowles, A.M.C., Bechtold, U., and Paps, J. (2020). The Origin of Land Plants Is Rooted in 

Two Bursts of Genomic Novelty. Curr. Biol. 30: 530–536.e2. 

Brady, S.M. et al. (2011). A stele-enriched gene regulatory network in the Arabidopsis root. 

Mol. Syst. Biol. 7: 1–9. 

Brkljacic, J. and Grotewold, E. (2017). Combinatorial control of plant gene expression. 

Biochim. Biophys. Acta 1860: 31–40. 

Butelli, E., Licciardello, C., Zhang, Y., Liu, J., Mackay, S., Bailey, P., Reforgiato-Recupero, 
G., and Martin, C. (2012). Retrotransposons control fruit-specific, cold-dependent 
accumulation of anthocyanins in blood oranges. Plant Cell 24: 1242–1255. 

Carlsson, A.S. (2009). Plant oils as feedstock alternatives to petroleum - A short survey of 

potential oil crop platforms. Biochimie 91: 665–670. 

Chen, D., Yan, W., Fu, L.Y., and Kaufmann, K. (2018). Architecture of gene regulatory 

networks controlling flower development in Arabidopsis thaliana. Nat. Commun. 9:1–13. 

11 

 
Colinas, M. and Goossens, A. (2018). Combinatorial Transcriptional Control of Plant 

Specialized Metabolism. Trends Plant Sci. 23: 324–336. 

Consortium, M. et al. (2010). Identification of Functional Elements and Regulatory Circuits by 

Drosophila modENCODE. Science 330: 1787–1797. 

Deplancke, B. et al. (2006). A gene-centered C. elegans protein-DNA interaction network. Cell 

125: 1193–1205. 

Deplancke, B., Alpern, D., and Gardeux, V. (2016). The Genetics of Transcription Factor 

DNA Binding Variation. Cell 166: 538–554. 

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and 

display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U. S. A. 95: 14863–
14868. 

ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the 

human genome. Nature 489: 57–74. 

Eveland, A.L. et al. (2014). Regulatory modules controlling maize inflorescence architecture. 

Genome Res. 24: 431–443. 

Franco-Zorrilla, J.M.M., López-Vidriero, I., Carrasco, J.L.L., Godoy, M., Vera, P., and 

Solano, R. (2014). DNA-binding specificities of plant transcription factors and their 
potential to define target genes. Proc. Natl. Acad. Sci. U. S. A. 111: 2367–2372. 

Galli, M., Khakhar, A., Lu, Z., Chen, Z., Sen, S., Joshi, T., Nemhauser, J.L., Schmitz, R.J., 

and Gallavotti, A. (2018). The DNA binding landscape of the maize AUXIN 
RESPONSE FACTOR family. Nat. Commun. 9: 4526. 

Gaudinier, A. et al. (2018). Transcriptional regulation of nitrogen-associated metabolism and 

growth. Nature 563: 259–264. 

Gerstein, M.B. et al. (2012). Architecture of the human regulatory network derived from 

ENCODE data. Nature 489: 91–100. 

Gerstein, M.B. et al. (2010). Integrative Analysis of the Caenorhabditis elegans Genome by the 

modENCODE Project. Science 330: 1775–1787. 

Gitter, A., Siegfried, Z., Klutstein, M., Fornes, O., Oliva, B., Simon, I., and Bar-joseph, Z. 

(2009). Backup in gene regulatory networks explains differences between binding and 
knockout results. Mol. Syst. Biol. 5: 1–7. 

Gomez-Cano, F., Carey, L., Lucas, K., García Navarrete, T., Mukundi, E., Lundback, S., 

Schnell, D., and Grotewold, E. (2020). CamRegBase: a gene regulation database for the 
biofuel crop, Camelina sativa. Database 2020. 

Greene, B., Walko, R., and Hake, S. (1994). Mutator insertions in an intron of the maize 
knotted1 gene result in dominant suppressible mutations. Genetics 138: 1275–1285. 

12 

 
Gupta, O.P., Deshmukh, R., Kumar, A., Singh, S.K., Sharma, P., Ram, S., and Singh, G.P. 

(2021). From gene to biomolecular networks: a review of evidences for understanding 
complex biological function in plants. Curr. Opin. Biotechnol. 74: 66–74. 

Haque, S., Ahmad, J.S., Clark, N.M., Williams, C.M., and Sozzani, R. (2019). 

Computational prediction of gene regulatory networks in plant growth and development. 
Curr. Opin. Plant Biol. 47: 96–105. 

Harbison, C.T. et al. (2004). Transcriptional regulatory code of a eukaryotic genome. Nature 

431: 99–104. 

Haynes, B.C., Maier, E.J., Kramer, M.H., Wang, P.I., Brown, H., and Brent, M.R. (2013). 

Mapping functional transcription factor networks from gene expression data. Genome 
Res. 23: 1319–1328. 

Heyndrickx, K.S., Van de Velde, J., Wang, C., Weigel, D., and Vandepoele, K. (2014). A 
functional and evolutionary perspective on transcription factor binding in Arabidopsis 
thaliana. Plant Cell 26: 3894–3910. 

Jiang, S. and Mortazavi, A. (2018). Integrating ChIP-seq with other functional genomics data. 

Brief. Funct. Genomics 17: 104–115. 

Johnson, K.A. and Krishnan, A. (2022). Robust normalization and transformation techniques 
for constructing gene coexpression networks from RNA-seq data. Genome Biol. 23: 1. 

Kang, M., Ko, E., and Mersha, T.B. (2022). A roadmap for multi-omics data integration using 

deep learning. Brief. Bioinform. 23. 

Kaya-Okur, H.S., Wu, S.J., Codomo, C.A., Pledger, E.S., Bryson, T.D., Henikoff, J.G., 

Ahmad, K., and Henikoff, S. (2019). CUT&Tag for efficient epigenomic profiling of 
small samples and single cells. Nat. Commun. 10: 1930. 

Khan, N.A., Yu, P., Ali, M., Cone, J.W., and Hendriks, W.H. (2015). Nutritive value of maize 
silage in relation to dairy cow performance and milk quality. J. Sci. Food Agric. 95: 238–
252. 

Krassowski, M., Das, V., Sahu, S.K., and Misra, B.B. (2020). State of the Field in Multi-

Omics Research: From Computational Needs to Data Mining and Sharing. Front. Genet. 
11: 610798. 

Kremling, K.A.G., Chen, S.-Y., Su, M.-H., Lepak, N.K., Romay, M.C., Swarts, K.L., Lu, F., 

Lorant, A., Bradbury, P.J., and Buckler, E.S. (2018). Dysregulation of expression 
correlates with rare-allele burden and fitness loss in maize. Nature 555: 520–523. 

Lee, B., Zhang, S., Poleksic, A., and Xie, L. (2019). Heterogeneous Multi-Layered Network 

Model for Omics Data Integration and Analysis. Front. Genet. 10: 1381. 

Li, H. et al. (2013). Genome-wide association study dissects the genetic architecture of oil 

biosynthesis in maize kernels. Nat. Genet. 45: 43–50. 

13 

 
Lisch, D. (2013). How important are transposons for plant evolution? Nat. Rev. Genet. 14: 49–

61. 

Liu, S., Kracher, B., Ziegler, J., Birkenbihl, R.P., and Somssich, I.E. (2015). Negative 

regulation of ABA signaling by WRKY33 is critical for Arabidopsis immunity towards 
Botrytis cinerea 2100. Elife 4: e07295. 

Mack, K.L. and Nachman, M.W. (2017). Gene Regulation and Speciation. Trends Genet. 33: 

68–80. 

Marand, A.P., Eveland, A.L., Kaufmann, K., and Springer, N.M. (2023). cis-Regulatory 

Elements in Plant Development, Adaptation, and Evolution. Annu. Rev. Plant Biol. 74: 
111–137. 

Mazaheri, M. et al. (2019). Genome-wide association analysis of stalk biomass and anatomical 

traits in maize. BMC Plant Biol. 19: 45. 

Mejia-Guerra, M.K., Pomeranz, M., Morohashi, K., and Grotewold, E. (2012). From plant 

gene regulatory grids to network dynamics. Biochimica et Biophysica Acta (BBA) - 
Gene Regulatory Mechanisms 1819: 454–465. 

Mizrachi, E., Verbeke, L., Christie, N., Fierro, A.C., Mansfield, S.D., Davis, M.F., Gjersing, 

E., Tuskan, G.A., Van Montagu, M., Van de Peer, Y., Marchal, K., and Myburg, 
A.A. (2017). Network-based integration of systems genetics data reveals pathways 
associated with lignocellulosic biomass accumulation and processing. Proc. Natl. Acad. 
Sci. U. S. A. 114: 1195–1200. 

Morohashi, K. et al. (2012). A genome-wide regulatory framework identifies maize pericarp 

color1 controlled genes. Plant Cell 24: 2745–2764. 

Morohashi, K. and Grotewold, E. (2009). A systems approach reveals regulatory circuitry for 

Arabidopsis trichome initiation by the GL3 and GL1 selectors. PLoS Genet. 5: e1000396. 

Nakashima, K., Yamaguchi-Shinozaki, K., and Shinozaki, K. (2014). The transcriptional 

regulatory network in the drought response and its crosstalk in abiotic stress responses 
including drought, cold, and heat. Front. Plant Sci. 5: 1–7. 

Négre, N. et al. (2011). A cis-regulatory map of the Drosophila genome. Nature 471: 527–531. 

Neph, S. et al. (2012). An expansive human regulatory lexicon encoded in transcription factor 

footprints. Nature 489: 83–90. 

O’Malley, R.C., Huang, S.S.C., Song, L., Lewsey, M.G., Bartlett, A., Nery, J.R., Galli, M., 

Gallavotti, A., and Ecker, J.R. (2016). Cistrome and Epicistrome Features Shape the 
Regulatory DNA Landscape. Cell 165: 1280–1292. 

Pérez, N.M., Ferrari, C., Engelhorn, J., Depuydt, T., Nelissen, H., Hartwig, T., and 

Vandepoele, K. (2023). MINI-AC: Inference of plant gene regulatory networks using 
bulk or single-cell accessible chromatin profiles. bioRxiv: 2023.05.26.542269. 

14 

 
Qian, Y. and Huang, S.-S.C. (2020). Improving plant gene regulatory network inference by 
integrative analysis of multi-omics and high resolution data sets. Current Opinion in 
Systems Biology 22: 8–15. 

Reiter, F., Wienerroither, S., and Stark, A. (2017). Combinatorial function of transcription 

factors and cofactors. Curr. Opin. Genet. Dev. 43: 73–81. 

Reményi, A., Schöler, H.R., and Wilmanns, M. (2004). Combinatorial control of gene 

expression. Nat. Struct. Mol. Biol. 11: 812. 

Reynoso, M.A. et al. (2019). Evolutionary flexibility in flooding response circuitry in 

angiosperms. Science 365: 1291–1295. 

Ricci, W.A. et al. (2019). Widespread long-range cis-regulatory elements in the maize genome. 

Nature Plants 5: 1237–1249. 

Schaefer, R.J., Michno, J.-M., Jeffers, J., Hoekenga, O., Dilkes, B., Baxter, I., and Myers, 

C.L. (2018). Integrating Coexpression Networks with GWAS to Prioritize Causal Genes 
in Maize. Plant Cell 30: 2922. 

Schmitz, R.J., Grotewold, E., and Stam, M. (2022). Cis-regulatory sequences in plants: Their 

importance, discovery, and future challenges. Plant Cell 34: 718–741. 

Schnable, P.S. et al. (2009). The B73 maize genome: complexity, diversity, and dynamics. 

Science 326: 1112–1115. 

Shrestha, V., Yobi, A., Slaten, M.L., Chan, Y.O., Holden, S., Gyawali, A., Flint-Garcia, S., 
Lipka, A.E., and Angelovici, R. (2022). Multiomics approach reveals a role of 
translational machinery in shaping maize kernel amino acid composition. Plant Physiol. 
188: 111–133. 

Skene, P.J. and Henikoff, S. (2017). An efficient targeted nuclease strategy for high-resolution 

mapping of DNA binding sites. Elife 6. 

Song, L., Huang, S.S.C., Wise, A., Castanoz, R., Nery, J.R., Chen, H., Watanabe, M., 

Thomas, J., Bar-Joseph, Z., and Ecker, J.R. (2016). A transcription factor hierarchy 
defines an environmental stress response network. Science 354. 

Song, Q., Lee, J., Akter, S., Rogers, M., Grene, R., and Li, S. (2020). Prediction of condition-

specific regulatory genes using machine learning. Nucleic Acids Res. 48: e62. 

Subramanian, I., Verma, S., Kumar, S., Jere, A., and Anamika, K. (2020). Multi-omics Data 

Integration, Interpretation, and Its Application. Bioinform. Biol. Insights 14: 
1177932219899051. 

Sullivan, A.M. et al. (2014). Mapping and dynamics of regulatory DNA and transcription factor 

networks in A. thaliana. Cell Rep. 8: 2015–2030. 

Swift, J. and Coruzzi, G.M. (2017). A matter of time - How transient transcription factor 

15 

 
interactions create dynamic gene regulatory networks. Biochimica et Biophysica Acta 
(BBA) - Gene Regulatory Mechanisms 1860: 75–83. 

Taylor-Teeples, M. et al. (2015). An Arabidopsis gene regulatory network for secondary cell 

wall synthesis. Nature 517: 571–575. 

Trivedi, P., Malina, R., and Barrett, S.R.H. (2015). Environmental and economic tradeoffs of 

using corn stover for liquid fuels and power production. Energy Environ. Sci. 8: 1428–
1437. 

Vahabi, N. and Michailidis, G. (2022). Unsupervised Multi-Omics Data Integration Methods: 

A Comprehensive Review. Front. Genet. 13: 854752. 

Vandepoele, K., Quimbaya, M., Casneuf, T., De Veylder, L., and Van de Peer, Y. (2009). 

Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and 
coexpression networks. Plant Physiol. 150: 535–546. 

Walley, J.W., Sartor, R.C., Shen, Z., Schmitz, R.J., Wu, K.J., Urich, M.A., Nery, J.R., 

Smith, L.G., Schnable, J.C., Ecker, J.R., and Briggs, S.P. (2016). Integration of omic 
networks in a developmental atlas of maize. Science 353: 814–818. 

Weirauch, M.T. et al. (2014). Determination and inference of eukaryotic transcription factor 

sequence specificity. Cell 158: 1431–1443. 

Wen, W., Li, D., Li, X., Gao, Y., Li, W., Li, H., Liu, J., Liu, H., Chen, W., Luo, J., and Yan, 

J. (2014). Metabolome-based genome-wide association study of maize kernel leads to 
novel biochemical insights. Nat. Commun. 5: 3438. 

Wen, W., Liu, H., Zhou, Y., Jin, M., Yang, N., Li, D., Luo, J., Xiao, Y., Pan, Q., and Tohge, 
T. (2016). Combining quantitative genetics approaches with regulatory network analysis 
to dissect the complex metabolism of the maize kernel. Plant Physiol. 170: 136–146. 

Wu, G. and Ji, H. (2013). ChIPXpress: Using publicly available gene expression data to 

improve ChIP-seq and ChIP-chip target gene ranking. BMC Bioinformatics 14. 

Yang, F. et al. (2017). A Maize Gene Regulatory Network for Phenolic Metabolism. Mol. Plant 

10: 498–515. 

Yang, F., Ouma, W.Z., Li, W., Doseff, A.I., and Grotewold, E. (2016). Establishing the 

Architecture of Plant Gene Regulatory Networks. Methods Enzymol. 576: 251–304. 

Yang, Z., Xu, G., Zhang, Q., Obata, T., and Yang, J. (2022). Genome-wide mediation 
analysis: an empirical study to connect phenotype with genotype via intermediate 
transcriptomic data in maize. Genetics 221. 

Yan, W., Chen, D., Schumacher, J., Durantini, D., Engelhorn, J., Chen, M., Carles, C.C., 
and Kaufmann, K. (2019). Dynamic control of enhancer activity drives stage-specific 
gene expression during flower morphogenesis. Nat. Commun. 10: 1705. 

16 

 
Zeller, K.I. et al. (2006). Global mapping of c-Myc binding sites and target gene networks in 
human B cells. Proceedings of the National Academy of Sciences 103: 17834. 

Zhou, P., Li, Z., Magnusson, E., Gomez Cano, F., Crisp, P.A., Noshay, J.M., Grotewold, E., 
Hirsch, C.N., Briggs, S.P., and Springer, N.M. (2020). Meta Gene Regulatory 
Networks in Maize Highlight Functionally Relevant Regulatory Interactions. Plant Cell 
32: 1377–1396. 

Zhou, S., Kremling, K.A., Bandillo, N., Richter, A., Zhang, Y.K., Ahern, K.R., Artyukhin, 
A.B., Hui, J.X., Younkin, G.C., Schroeder, F.C., Buckler, E.S., and Jander, G. 
(2019). Metabolome-Scale Genome-Wide Association Studies Reveal Chemical 
Diversity and Genetic Control of Maize Specialized Metabolites. Plant Cell 31: 937–955. 

Zhu, C. et al. (2009). High-resolution DNA-binding specificity analysis of yeast transcription 

factors. Genome Res. 19: 556–566. 

17 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER TWO: CAMREGBASE: A GENE REGULATION DATABASE FOR 
BIOFUEL CROP, CAMELINA SATIVA1 

1This chapter has been published in the following manuscript: 

Gomez-Cano F., Carey L., Lucas K., García Navarrete T., Mukundi E., Lundback S., Schnell S., 

Grotewold E., (2020), CamRegBase: a gene regulation database for the biofuel crop, Camelina 

sativa, Database, baaa075, https://doi.org/10.1093/database/baaa075 

Copyright © 2020, Oxford University Press.  

18 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2.1 ABSTRACT 

Camelina is an annual oilseed plant from the Brassicaceae family that is gaining momentum 

as a biofuel winter cover crop. However, a significant limitation in further enhancing its utility as 

a producer of oils that can be used as biofuels, jet fuels or bio-based products is the absence of a 

repository for all the gene expression and regulatory information that is being rapidly generated 

by  the  community.  Here,  we  provide  CamRegBase  (https://camregbase.org/)  as  a  one-stop 

resource  to  access  Camelina  information  on  gene  expression  and  co-expression,  transcription 

factors,  lipid  associated  genes  and  genome  wide  orthologs  in  the  close-relative  reference  plant 

Arabidopsis. We envision this as a resource of curated information for users, as well as a repository 

of new gene regulation information. 

2.2 INTRODUCTION 

Camelina sativa is an emerging biofuel crop (Carlsson, 2009; Iskandarov et al., 2014). With a 

low economic input requirement (Iskandarov et al., 2014), early season growth habit (Allen et al., 

2014;  Chaturvedi  et  al.,  2018),  genetic  similarity  to  the  model  plant  Arabidopsis  (Liang  et  al., 

2013) and relatively high oil composition in the seed (Moser, 2010; Berti et al., 2016), it has gained 

traction as a potential target for jet fuel and biodiesel production. Camelina’s genome has been 

sequenced and annotated, has a hexaploid genome structure harboring 89 418 protein-coding genes 

organized in 20 chromosomes (Liang et al., 2013; Kagale et al., 2014) and is relatively easy to 

genetically transform (Liu et al., 2012). 

A  challenge,  albeit  not  unique  to  Camelina,  is  how  to  best  utilize  the  burgeoning  genomic 

information for predictive metabolic engineering of seed oil production (Chappell and Grotewold, 

2008; Grotewold, 2008). Clearly, knowing how much and where gene expression takes place is 

necessary,  as  recently  demonstrated  by  recent  studies  aimed  at  increasing  oil  production  in 

19 

 
Camelina using the co-expression of select genes (Chhikara et al., 2018). While RNA-Seq is a 

very  powerful  tool  to  determine  global  levels  of  gene  expression,  each  analysis  yields  a  large 

amount  of  data  and  therefore  is  non-trivial  to  curate  and  analyze  for  potential  targets.  To  take 

advantage of all the currently available RNA-Seq data for Camelina, a relational database is the 

most  ideal  resource.  Currently,  Camelina  genomics  resources  are  part  of  the  Brassica  database 

BRAD (http://brassicadb.org) together with 11 Brassicaceae genomes. BRAD has a comparative 

approach to make plots of syntenic genomic regions and search the orthologs genes (Wang et al., 

2015), but the most information available in BRAND is directed to Brassica rapa. In particular, 

the Camelina Genome Portal (camelinagenomics.org) allows a user to browse the whole Camelina 

genome assembly, conduct BLAST analyses to the Camelina genome, and view any of 15,946 

(current number at date of publication) contig scaffolds on the sequenced genome. The University 

of Toronto has developed an electronic fluorescent pictograph browser (http://bar.utoronto.ca/) for 

Camelina  sativa,  which  allows  quick  visual  representation  of  expression  data  from  a  large 

developmental set. The Camelina Genomic Resources (camelinagenome.org) contains transcript 

data on protein and lipids but is only restricted to the developing embryo. Many databases also 

exist that provide information on TFs for one or multiple plants (Davuluri et al., 2003; Guo et al., 

2005; Gao et al., 2006; Guo et al., 2008; Rushton et al., 2008; Wang et al., 2010; Yilmaz et al., 

2009;  Kagale  et  al.,  2016).  AGRIS  (https://agris-knowledgebase.org/),  for  example,  provides  a 

useful  resource  for  the  knowledgebase  described  here,  because  it  provides  a  comprehensive 

collection of Arabidopsis TFs and other regulatory components, that can be easily translated to 

Camelina based on the close relationship between these plants. Here, we introduce the Camelina 

Gene Regulation Database (https://camregbase.org/), which is intended as a one-stop resource for 

aspects related to Camelina gene regulation. CamRegBase v1.0 harbors all RNA-Seq experiments 

20 

 
available  to-date  with  read  abundance  and  the  corresponding  metadata,  tissue-specific  gene 

expression visualization and gene co-expression analyses. Additionally, CamRegBase 1.0 offers 

information on the orthologous relationships between Camelina and Arabidopsis genes along with 

the  reported  syntelog  data  (Kagale  et  al.,  2016).  Finally,  as  a  valuable  resource  to  researchers 

interested in studying the control of gene regulation, CamRegBase 1.0 provides a comprehensive 

catalog of transcription factors (TFs) and co-activators identified by our own analyses and those 

previously  reported  (Kagale  et  al.,  2016)  (http://planttfdb.cbi.pku.edu.cn/).  With  all  the  above-

mentioned information integrated as one resource, CamRegBase is poised to become a primary 

resource for Camelina gene expression analyses. 

2.3 RESULTS AND DISCUSSION 

2.3.1 Database structure 

The utilization of the open source Tripal toolkit for the construction of the database web portal 

ensures that it can be expanded by the addition of compatible extension modules, and it ensures 

interoperability with a number of widely used biological knowledgebases (Spoor et al., 2019). The 

overall database organization is schematized in Figure 2.1, with the search functionality of the site 

relying on the underlying database tables shown in the entity relationship diagram. All the records 

in Drupal are stored in the ‘node’ table, which is queried in relation to the other tables on the search 

term provided by the end user. The lines shown in the diagram show how the tables are related 

when a search is run. For example, when a search is run using the Gene Search page, the ‘Search 

Data’ table is queried to return data matching the search term in the ‘title’, ‘name’ or ‘category’ 

fields. That table contains a consolidation of data from the ‘Node’ and ‘Taxonomy Data’ tables 

along with a ‘category’ value based on the presence of the record in any of the ‘Goslim Term’, 

‘Aralip  Pathway’  and/or  ‘TF  Family’  tables.  The  consolidation  was  done  to  improve  the 

21 

 
performance of the search function. Other searches query the tables directly. In the case of the 

Syntelogs search, the ‘Homolog’ is examined, and the data are returned along with related results 

from the ‘Csa_g1’, ‘Csa_g2’, ‘Csa_g3’ and ‘Taxonomy Data’ tables using the relationships shown 

in the entity relationship diagram.  

NCBI

Raw expression data
Data  description 

Query genes

Database

1. Literature
2. Arabidopsis 

homology

Annotation

TFs and CoRs in house
annotation

Query  gene organ
expression profile

Query  gene top co-
expressed  genes

Query  gene family 
classification (TF or 
CoR if apply)

Query  gene 
functional annotation  

Figure 2.1 Schematic diagram outlining the architecture of CamRegBase 1.0 

2.3.2 Expression database content 

The gene expression database was built based on 131 publicly available Camelina RNA-seq 

experiments (See ‘Materials and methods’). The data correspond to gene expression information 

from  five  different  Camelina  ‘varieties’,  with  DH55  and  Suneson  being  the  varieties  with  the 

largest number of samples (Figure 2.2a). Out of the 131 samples, 28 had no details regarding the 

variety and thus were labeled as unknown and utilized solely for the co-expression analyses (See 

below). Data were classified based on variety and further grouped on the basis of plant organs and 
22 

 
 
 
seed development stages. In total, we analyzed data from 12 different ‘organs’, including whole 

plant pools (referred as ‘Plant’), and samples without ‘organ’ specification (defined as ‘Unknown’) 

(Figure 2.2b). Notably, seeds and roots represented the majority of samples available (38.8% and 

16.8%,  respectively)  (Figure  2.2b).  In  terms  of  ‘seed  developmental  stages’,  samples  were 

analyzed that covered a range of 36 days post-anthesis (DPAs), with 14 different time points from 

4  to  40  days  post-anthesis.  Overall,  approximately  four  billion  reads  were  analyzed,  with  an 

average of 29.9 million reads per sample and with 95.7% of the reads mapping to the genome.  

To  characterize  the  transcriptome  at  the  sample  level,  the  top  5%  of  genes  with  highest 

expression  variation  (TPMs)  across  all  131  samples  were  selected  and  a  principal  component 

analysis  (PCA)  was  performed.  The  first  two  principal  components  explained  54.7%  of  the 

variation of the samples and allowed us to separate the 12 organs into 7 groups (Figure 2.2c). The 

‘root’  and  ‘seed’  samples  grouped  closest  together  and  were  the  most  distinct  from  the  other 

samples.  As  expected,  some  samples  aligned  closely  with  others  such  as  ‘embryo’  with  ‘seed’ 

samples, ‘cotyledons’ with ‘young leaf’, and ‘buds’ with ‘flowers’ (dashed circles, Figure 2.2c). 

The  observed  separation  suggests  that,  at  least  for  the  major  groups,  the  data  collected  and 

presented here capture relevant biological information. 

23 

 
 
a

b

1

6 6

28

51

39

DH55

Suneson

Unknown

CO-46

Joelle

Celine

c

Individuals − PCA

)

%
1
.
2
2
(
2
m
D

i

50

25

0

−25

3 3 3 3

3

4

4

6
6

10

50

14

22

Root
Plant
Stem
Embryo

Seed
Unknown
Flower
Cotyledon
Young leaf

Leaf
Shoot
Bud
Senescing leaf

Organs
Organs

Bud
Buds
Cotyledon
cotyledons
Embryo
Embryon

Flower
Flower

Leaf
Leaf

Plant
Plant

Root
Roots
Seed
Seed

Senescing leaf
senescing leaf

Shoot
Shoot
Stem
Stem

Young leaf
young leaf

−60

−30

0

30

60

Dim1 (32.6%)

Figure 2.2 Gene expression data hosted on CamRegBase 1.0 
Summary of expression data available on CamRegBase at the level of a Camelina varieties and b 
organ-specific  samples.  c.  PCA  of  the  Camelina  transcriptome  using  log2TPMs.  Dotted  ovals 
indicate major groups of samples identified by visual inspection of the PCA results. 

2.3.3 Annotation of TFs 

TFs and CoRs play central roles in controlling gene expression, and they provide powerful 

tools  to  manipulate  developmental  or  metabolic  pathways  for  biotechnological  purposes 

(Grotewold, 2008; Gray and Grotewold, 2011). Thus, to characterize Camelina TFs and CoRs, 

24 

 
 
 
 
Members

0

1

0

0

2

0

0

3

0

0

4

0

0

bHLH

MYB−related

AP2/ERF−ERF

MADS−M−type

NAC

C2H2

WRKY

bZIP

B3

C3H

F
a
m

i
l
i

e
s

MYB
LOB
HSF
GRAS
C2C2−Dof
G2−like
C2C2−GATA
Trihelix
HB
MYB−Related
HB−HD−ZIP
Homeobox
TCP
C2C2−CO−like
FAR1
AP2−EREBP
GARP−G2−like
SBP
zf−HD
RWP−RK
OFP
GeBP
HB−WOX
Tify
ARF
PLATZ
NF−YB
TUB
NF−YA
BES1
MADS−MIKC
SRS
GRF
NF−YC
CPP
Alfin−like
E2F−DP
MADS
EIL
BBR−BPC
C2C2−YABBY
CAMTA
ABI3VP1
AP2/ERF−RAV
CSD
AP2/ERF−AP2
B3−ARF
C2C2−LSD
S1Fa−like
Whirly
ARR−B
HB−other
CCAAT−HAP3
DBB
HRT
NF−X1
STAT
ULT
VOZ
CCAAT−HAP5
LIM
AP2
CCAAT−DR1
DBP
LFY
NOZZLE
Orphan
SAP
BZR
LBD
NLP

advantage was taken of the current literature in this regard (Kagale et al., 2016) and used to expand 

the  previous  collection  using  pipelines  based  on  protein  domain  characterization  and  family 

classifications that worked well before in other plants (Yilmaz et al., 2009, 2011) (See ‘Materials 

and methods’). In total, 4,619 TFs and 805 CoRs were identified of which 1,075 TFs and 793 CoRs 

had not been previously reported. Our analysis, however, failed to identify 971 TFs previously 

reported based on homology (Kagale et al., 2016). Currently, CamRegBase harbors information 

on 5,590 TFs classified into 81 families, and 805 CoR, classified into 25 different families (Figure 

2.3). 

a

Members
Number of TFs
3
2
0
0
300
200
0
0

1
0
100
0

4
0
400
0

0
0

b

Members
Number of CoR
6
60
0

3
30
0

9
90
0

0
0

y
F
l
i
a
m
m
a
F
R
e
o
s
C

i
l
i

SNF2
AUX/IAA
mTERF
SET
TRAF
GNAT
Jumonji
PHD
SWI/SNF−BAF60b
TAZ
IWS1
ARID
Rcd1−like
MBF1
SWI/SNF−SWI3
SOH1
Coactivator p15
HMG
MED7
LUG
DDT
MED6
RB
Pseudo ARR−B
Orphan

F
y
l
a
i
m
m
a
F
e
F
T
s

i
l
i

MYB−Related
bHLH
AP2/ERF−ERF
MADS−M−type
NAC
C2H2
WRKY
bZIP
B3
C3H
MYB
LOB
HSF
GRAS
C2C2−Dof
G2−like
C2C2−GATA
Trihelix
HB
HB−HD−ZIP
Homeobox
TCP
C2C2−CO−like
FAR1
AP2−EREBP
GARP−G2−like
SBP
zf−HD
RWP−RK
OFP
GeBP
HB−WOX
Tify
ARF
PLATZ
NF−YB
TUB
NF−YA
BES1
MADS−MIKC
SRS
GRF
NF−YC
CPP
Alfin−like
E2F−DP
MADS
EIL
BBR−BPC
C2C2−YABBY
CAMTA
ABI3VP1
AP2/ERF−RAV
CSD
AP2/ERF−AP2
B3−ARF
C2C2−LSD
S1Fa−like
Whirly
ARR−B
HB−other
CCAAT−HAP3
DBB
HRT
NF−X1
STAT
ULT
VOZ
CCAAT−HAP5
LIM
AP2
CCAAT−DR1
DBP
LFY
NOZZLE
Orphan
SAP
BZR
LBD
NLP

Figure 2.3 Distribution of the (a) TF and (b) CoR genes according to families as currently 

Members
6
0

3
0

9
0

0

F
a
m

i
l
i

e

s

SNF2
AUX/IAA
mTERF
SET
TRAF
GNAT
Jumonji
PHD
SWI/SNF−BAF60b
TAZ
IWS1
ARID
Rcd1−like
MBF1
SWI/SNF−SWI3

Coactivator p15

SOH1

HMG

MED7

LUG

DDT

MED6

RB

Pseudo ARR−B

Orphan

25 

 
 
 
Figure 2.3 (cont’d) 
present in CamRegBase 1.0 

2.3.4 Database functionalities 

CamRegBase  1.0  consists  of  quick-buttons  and  tabs  for  navigation.  The  buttons  are 

redundancies of the navigation tab. A unified search function within the ‘Gene Search’ tab was 

implemented, where a user may query Camelina genes by gene accession number, Arabidopsis 

GO  Slim  term  or  pathways  from  the  Aralip  database  to  find  a  gene  of  choice.  Once  a  gene  is 

selected, the resulting page provides gene information, a link to explore gene expression and a list 

of the top 50 co-regulated genes with their associated PCCs. When gene expression is explored, 

an expression analysis chart is displayed showing expression values across biosample numbers. 

Hovering  over  a  data  point  will  show  the  complete  data  information.  Charts  can  also  be 

downloaded in CSV format. Under the ‘Regulation’ tab a user may find a group of genes within a 

TF family, or they can go directly to the gene information page by searching with a Camelina gene 

accession  number.  Under  the  ‘Gene  Expression’  tab,  a  user  can  go  directly  to  gene  expression 

information, or click on ‘Heat Map’ to view the selection of genes in a heat map, which can be 

sorted  by  gene  name,  annotation,  or  blast  description.  A  drop-down  selection  of  the  samples 

permits to visualize just a few, or all the gene expression samples in the database. Alternatively, 

sample selection can also be done on the created heatmap by highlighting the desired samples; the 

heatmap will adjust accordingly. Finally, on the ‘Syntelogs’ tab, a user can query a Camelina or 

Arabidopsis gene accession number to see how they relate to one another.  

2.4 METHODS 

2.4.1 Gene expression data source 

Expression data present in CamRegBase 1.0 was retrieved from the Gene Expression Omnibus. 

All  samples  collected  corresponded  to  RNA-Seq  experiments  generated  using  the  Illumina 

26 

 
 
platform. RNA-Seq results for a total of 131 experiments (including replicates) were collected, 40 

of  which  corresponded  to  single-end  libraries  and  91  to  paired-end  libraries.  These  131 

experiments corresponded to a total of 16 different projects. All samples were subject to quality 

control  using  FastQC  (http://www.bioinformatics.babraham.ac.uk/proje  cts/fastqc/,  V0.11.5). 

Libraries with adapters and reads with low quality (Phred < 20) were removed using Cutadapt (-a 

and -u, respectively) (http://cutadapt.readthedocs.io/ en/stable/index.html, V1.9). Clean reads were 

mapped to the reference genome (V2.0, http://camelinadb.ca) using HISAT2 (2.0.4) (Kim et al., 

2019) with default parameters. Reads aligned to genes were counted with the R package Rsubread 

(V1.32.2), using default parameters and allowing multi-mapping reads (Liao et al., 2019), and the 

transcript abundance estimated as transcripts per kilobase million (TPM). 

2.4.2 Database and web platform construction 

The website sits on an Ubuntu 18.04 operating system, the current long-term support release, 

using  a  PostgreSQL  database  instance  for  backend  storage  and  the  Apache  webserver  for 

displaying pages. It was built on top of that base using the Drupal content management system 

with the Tripal and Tripal Analysis Expression modules along with their dependencies (Ficklin et 

al., 2011; Sanderson et al., 2013; Spoor et al., 2019). The data were loaded into the Chado and 

Drupal database schemas using importers constructed using Tripal, and custom PHP codes were 

written to provide the functionality seen on the site today. The software and versions currently in 

use  are  PostgreSQL  (v10.12),  PHP  (v7.1),  Apache  (v2.4.41),  Tripal  (v3.2),  Tripal  Analysis 

Expression (v3.0) and Drupal (v7.69). 

2.4.3 Camelina sativa gene annotation 

All the functional annotations of C. sativa genes analyzed here, except for TFs (see below), 

were based on homology with Arabidopsis thaliana obtained by performing reciprocal BLAST 

27 

 
analyses on ‘all proteins against all’, and from literature (Kagale et al., 2014). The characterization 

and annotation of TFs and co-regulatory proteins (CoRs) assigned to the two databases harboring 

TFs  and  co-regulators  (CsTFDB  and  CsCoTFDB,  respectively)  was  carried  out  based  on  the 

identification of proteins that contain domains distinctive of these groups of proteins, as previously 

described (Yilmaz et al., 2009, 2011). 

2.4.4 Gene regulation data collection and analysis 

To identify potential TFs, we utilized already existing knowledge of known and identified TF 

protein  domains  from  published  literature  sources,  particularly  AGRIS  and  GRASSIUS 

(grassius.org) (Yilmaz et al., 2009, 2011). The data obtained were used in conjunction with Pfam’s 

Hidden  Markov  Models  (HMM)  to  perform  a  domain  search  using  the  HMMER(v3)  software 

against  the  predicted  Camelina  proteins  sequences  (Kagale  et  al.,  2014).  Hit  scores  were  only 

retained if they were considered significant, where the threshold used was a gathering score greater 

than the reported HMM for domains that are found in the Pfam database.  

Once potential TFs were identified, they were classified based on already established domain 

rules. The rules consist of which protein domain or domains are required for a TF to be part of a 

certain  family.  In  some  instances,  it  involves  not  having  a  specific  domain  or  set  of  domains 

(forbidden  domains)  to  be  classified  as  part  of  the  specified  family.  The  co-regulators  were 

classified based on rules previously established (Burdo et al., 2014). A modified version of the 

iTAK Perl script (Zheng et al., 2016) was utilized to assign the proteins to families based on hits 

obtained from the hmmscan application in the HMMER Program. 

2.4.5 Gene co-expression analyses 

The co-expression analyses between pairs of genes was calculated using the log2 of the TPMs as input 

data  and  the  weighted  Pearson  correlation  coefficient  (PCC)  as  a  metric  for  co-expression  using  the  R 

28 

 
package wCorr (Version 1.9.1) (Emad and Bailey, 2017), with an optimal threshold of 0.4 to weight samples 

similarities. 

29 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
REFERENCES 

Allen, B.L., Vigil, M.F., and Jabro, J.D. (2014). Camelina growing degree hour and base 

temperature requirements. Agron. J. 106: 940–944. 

Berti, M., Gesch, R., Eynck, C., Anderson, J., and Cermak, S. (2016). Camelina uses, 
genetics, genomics, production, and management. Ind. Crops Prod. 94: 690–710. 

Burdo, B. et al. (2014). The Maize TFome--development of a transcription factor open reading 

frame collection for functional genomics. Plant J. 80: 356–366. 

Carlsson, A.S. (2009). Plant oils as feedstock alternatives to petroleum - A short survey of 

potential oil crop platforms. Biochimie 91: 665–670. 

Chappell, J. and Grotewold, E. (2008). Plant biotechnology - Predictive, green and 

quantitative. Curr. Opin. Biotechnol. 19: 129–130. 

Chaturvedi, S., Bhattacharya, A., Khare, S.K., and Kaushik, G. (2018). Camelina sativa: An 
emerging biofuel crop. In Handbook of Environmental Materials Management, C. Hussain, 
ed (Sringer: Switzerland), pp. 1–38. 

Chhikara, S., Abdullah, H.M., Akbari, P., Schnell, D., and Dhankher, O.P. (2018). 

Engineering Camelina sativa (L.) Crantz for enhanced oil and seed yields by combining 
diacylglycerol acyltransferase1 and glycerol-3-phosphate dehydrogenase expression. Plant 
Biotechnol. J. 16: 1034–1045. 

Davuluri, R.V., Sun, H., Palaniswamy, S.K., Matthews, N., Molina, C., Kurtz, M., and 
Grotewold, E. (2003). AGRIS: Arabidopsis gene regulatory information server, an 
information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC 
Bioinformatics 4: 25. 

Emad, A. and Bailey, P. (2017). wCorr: weighted correlations.–R package ver. 1.9. 1. 

Ficklin, S.P., Sanderson, L.-A., Cheng, C.-H., Staton, M.E., Lee, T., Cho, I.-H., Jung, S., 

Bett, K.E., and Main, D. (2011). Tripal: a construction toolkit for online genome databases. 
Database 2011. 

Gao, G., Zhong, Y., Guo, A., Zhu, Q., Tang, W., Zheng, W., Gu, X., Wei, L., and Luo, J. 
(2006). DRTF: a database of rice transcription factors. Bioinformatics 22: 1286–1287. 

Gray, J. and Grotewold, E. (2011). Transcription factors, gene regulatory networks and 

agronomic traits. In Sustainable Agriculture and New Biotechnologies (CRC Press), pp. 65–
94. 

Grotewold, E. (2008). Transcription factors for predictive plant metabolic engineering: are we 

there yet? Curr. Opin. Biotechnol. 19: 138–144. 

Guo, A., He, K., Liu, D., Bai, S., Gu, X., Wei, L., and Luo, J. (2005). DATF: a database of 

30 

 
Arabidopsis transcription factors. Bioinformatics 21: 2568–2569. 

Guo, A.Y., Chen, X., Gao, G., Zhang, H., Zhu, Q.H., Liu, X.C., Zhong, Y.F., Gu, X., He, K., 
and Luo, J. (2008). PlantTFDB: a comprehensive plant transcription factor database. 
Nucleic Acids Res. 36: D966–9. 

Iskandarov, U., Kim, H.J., and Cahoon, E.B. (2014). Camelina: An emerging oilseed platform 
for advanced biofuels and bio-based materials. In Plants and BioEnergy, MC McCann, M.S. 
Buckeridge, and N.C. Carpita, eds (Springer: New York), pp. 131–140. 

Kagale, S. et al. (2014). The emerging biofuel crop Camelina sativa retains a highly 

undifferentiated hexaploid genome structure. Nat. Commun. 5: 1–11. 

Kagale, S., Nixon, J., Khedikar, Y., Pasha, A., Provart, N.J., Clarke, W.E., Bollina, V., 

Robinson, S.J., Coutu, C., Hegedus, D.D., Sharpe, A.G., and Parkin, I.A.P. (2016). The 
developmental transcriptome atlas of the biofuel crop Camelina sativa. Plant J. 88: 879–894. 

Kim, D., Paggi, J.M., Park, C., Bennett, C., and Salzberg, S.L. (2019). Graph-based genome 
alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37: 907–
915. 

Liang, C., Liu, X., Yiu, S.-M., and Lim, B.L. (2013). De novo assembly and characterization 
of Camelina sativa transcriptome by paired-end sequencing. BMC Genomics 14: 146. 

Liao, Y., Smyth, G.K., and Shi, W. (2019). The R package Rsubread is easier, faster, cheaper 
and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 
47: e47–e47. 

Liu, X., Brost, J., Hutcheon, C., Guilfoil, R., Wilson, A.K., Leung, S., Shewmaker, C.K., 

Rooke, S., Nguyen, T., and Kiser, J. (2012). Transformation of the oilseed crop Camelina 
sativa by Agrobacterium-mediated floral dip and simple large-scale screening of 
transformants. In Vitro Cellular & Developmental Biology-Plant 48: 462–468. 

Moser, B.R. (2010). Camelina (Camelina sativa L.) oil as a biofuels feedstock: Golden 

opportunity or false hope? Lipid technology 22: 270–273. 

Rushton, P.J., Bokowiec, M.T., Laudeman, T.W., Brannock, J.F., Chen, X., and Timko, 

M.P. (2008). TOBFAC: the database of tobacco transcription factors. BMC Bioinformatics 
9: 53. 

Sanderson, L.-A., Ficklin, S.P., Cheng, C.-H., Jung, S., Feltus, F.A., Bett, K.E., and Main, 
D. (2013). Tripal v1.1: a standards-based toolkit for construction of online genetic and 
genomic databases. Database 2013. 

Spoor, S. et al. (2019). Tripal v3: an ontology-based toolkit for construction of FAIR biological 

community databases. Database 2019. 

Wang, X., Wu, J., Liang, J., Cheng, F., and Wang, X. (2015). Brassica database (BRAD) 

version 2.0: integrating and mining Brassicaceae species genomic resources. Database 2015. 

31 

 
Wang, Z., Libault, M., Joshi, T., Valliyodan, B., Nguyen, H.T., Xu, D., Stacey, G., and 

Cheng, J. (2010). SoyDB: a knowledge database of soybean transcription factors. BMC 
Plant Biol. 10: 14. 

Yilmaz, A., Mejia-Guerra, M., Kurz, K., Liang, X., Welch, L., and Grotewold, E. (2011). 
AGRIS: Arabidopsis Gene Regulatory Information Server, an update. Nucleic Acids Res. 
39: D1118–1122. 

Yilmaz, A., Nishiyama, M.Y., Jr, Fuentes, B.G., Souza, G.M., Janies, D., Gray, J., and 

Grotewold, E. (2009). GRASSIUS: a platform for comparative regulatory genomics across 
the grasses. Plant Physiol. 149: 171–180. 

Zheng, Y. et al. (2016). iTAK: A Program for Genome-wide Prediction and Classification of 

Plant Transcription Factors, Transcriptional Regulators, and Protein Kinases. Mol. Plant 9: 
1667–1670. 

32 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER THREE: EXPLORING CAMELINA SATIVA LIPID METABOLISM 

REGULATION BY COMBINING GENE CO-EXPRESSION AND DNA AFFINITY 

PURIFICATION ANALYSES1 

1This chapter has been published in the following manuscript: 

Gomez-Cano F., Chu Y.-H., Cruz-Gomez M., Abdullah H.M., Lee Y.S., Schnell D., Grotewold 

E. (2022), Exploring Camelina sativa lipid metabolism regulation by combining gene co-

expression and DNA affinity purification analyses. Plant J, 110: 589-606. 

https://doi.org/10.1111/tpj.15682 

© 2022 Society for Experimental Biology and John Wiley & Sons Ltd. 

33 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3.1 ABSTRACT 

Camelina  is  an  annual  oilseed  plant  that  is  gaining  momentum  as  a  biofuel  cover  crop.  

Understanding  gene  regulatory  networks  (GRNs)  is  essential  to  deciphering  plant  metabolic 

pathways, including lipid metabolism. Here, we take advantage of a growing collection of gene 

expression datasets to predict transcription factors (TFs) associated with the control of Camelina 

lipid metabolism. We identified ~350 TFs highly co-expressed with lipid-related genes (LRGs). 

These TFs are highly represented in the MYB, AP2/ERF, bZIP, and bHLH families, including a 

significant  number  of  homologs  of  well-known  Arabidopsis  lipid  and  seed  developmental 

regulators. After prioritizing the top 22 TFs for further validation, we identified DNA-binding sites 

and  predicted  target  genes  for  16  out  of  the  22  TFs  tested  using  DNA  affinity  purification 

sequencing (DAP-seq). Enrichment analyses of targets supported the co-expression prediction for 

most TF candidates, and the comparison to Arabidopsis revealed some common themes, but also 

aspects unique to Camelina. Within the top potential lipid regulators, we identified CsaMYB1, 

CsaABI3AVP1-2, CsaHB1, CsaNAC2, CsaMYB3, and CsaNAC1 as likely involved in the control 

of seed fatty acid elongation; and CsaABI3AVP1-2 and CsabZIP1 as potential regulators of the 

synthesis and degradation of triacylglycerols (TAGs), respectively. Altogether, the integration of 

co-expression data and DNA-binding assays permitted us to generate a high-confidence and short 

list of Camelina TFs involved in the control of lipid metabolism during seed development. 

3.2 INTRODUCTION 

The  Brassicaceae  Camelina  (Camelina  sativa  L.  Crantz)  annual  plant  is  gaining  increasing 

attention  as  a  potential  oilseed  crop  with  characteristics  that  make  it  alluring  as  a  renewable 

feedstock  for  biofuels  and  biobased  products,  among  many  other  applications  (Carlsson,  2009; 

Iskandarov et al., 2014). Camelina has a hexaploid genome that harbors ~90,000 genes organized 

34 

 
into 20 chromosomes (Liang  et al., 2013; Kagale et al., 2014). When compared with Arabidopsis 

(Arabidopsis  thaliana),  Camelina  genes  were  classified  into  three  types,  including  syntenic 

orthologs  (syntelogs,  ~70%  of  all  genes),  tandem  duplicates  (~12%),  and  non-syntenic  (~18%) 

genes (Kagale et al., 2014). Within the set of syntelogs (a.k.a. paralogs), 10% of them are defined 

as fractionated because not all the three copies are conserved (Kagale et al., 2014). Remarkably, 

in addition to having a low rate of fractionation, the majority of Camelina's paralogs (in the case 

of triplicated genes) display no significant differences in expression levels (Kagale et al., 2014). 

Despite  the  challenges  imposed  by  its  polyploid  genome,  extensive  gene  expression  analyses 

performed on developing Camelina seeds provided a transcriptome reference for this emerging 

crop,  which  includes  26  different  datasets  obtained  at  13  different  time  points  during  seed 

development,  and  one  immediately  after  germination,  expression  data  that  is  available  at 

CamRegBase  (Gomez-Cano  et  al.,  2020).  Yet,  despite  the  growing  collection  of  mRNA 

accumulation  data,  expression  information  from  early  time  points  during  seed  development,    a 

critical  stage  for  lipid  biosynthesis  (Rodríguez-Rodríguez  et  al.,  2013;  Pollard  et  al.,  2015),  is 

largely  missing.  Another  important  available  resource  in  Camelina,  given  its  biotechnological 

implications,  is  the  growing  list  of  genes  associated  with  fatty  acid  (FA)  and  oil  biosynthesis 

(Nguyen et al., 2013; Mudalkar et al., 2014; Abdullah et al., 2016; Gomez-Cano et al., 2020). This 

was  to  a  large  extent  possible  thanks  to  the  close  phylogenetic  relationship  of  Camelina  with 

Arabidopsis,  reflected  in  the  high  sequence  similarity  of  their  genomes  (Nikolov  et  al.,  2019; 

Mandáková et al., 2019).  

Camelina seeds are ~50 times larger than those of Arabidopsis, they are rich in triacylglycerols 

(TAGs)  containing  mainly  long  unsaturated  FAs,  including  linoleic  acid  (C18:3),  which  are 

excellent sources of omega-3 FAs (Gugel and Falk, 2006; Berti et al., 2016). Depending on the 

35 

 
ecotype, Camelina oil may represent up to 40% of the total seed dry weight, which also contains 

high levels of vitamin E and antioxidants responsible for extending the lifetime of Camelina oil-

containing products (Berti et al., 2016; Malik et al., 2018; Chaturvedi et al., 2019). As in other 

plants, Camelina TAG synthesis starts with the synthesis of FAs in plastids (Voelker and Kinney, 

2001).  In  Camelina  embryos,  the  maximum  rate  of  oil  synthesis  is  at  mid-maturation  (“green 

cotyledon”), i.e., between 14-20 days post anthesis (DPA), while the mid-point for oil deposition 

is  around  17-18  DPA.  Consistently,  C18:3  reaches  its  highest  accumulation  rate  at  22  DPA 

(Pollard et al., 2015). In addition to C18:3, Camelina TAG also contain significant amounts of 

very long chain FAs (VLCFA) (C20-C24) with similar accumulation rates to C18:3 (maximum 

rate ~ 22-24 DPA), and detected as early as 11 DPA (Pollard et al., 2015).  

A  growing  number  of  Camelina  genes  involved  in  FA  and  TAG  biosynthesis  are  being 

identified (Nguyen et al., 2013; Abdullah et al., 2016; Morineau et al., 2017; Ozseyhan et al., 2018; 

Neumann et al., 2021). However, Camelina transcription factors (TFs) that control the expression 

of the corresponding enzymatic genes remain largely unknown. In higher plants, the synthesis of 

FA and TAG in seeds is tightly coordinated with development. In Arabidopsis, there is a growing 

number of TFs involved in seed development with direct or indirect effects on FA/TAG synthesis 

(Le et al., 2010; Baud and Lepiniec, 2010; Leprince et al., 2016; Tian et al., 2020). Major regulators 

include 

the  ABI3VP1  proteins  LEAFY  COTYLEDON  2  (LEC2),  ABSCISIC  ACID 

INSENSITIVE  3  (ABI3),  and  FUSCA3  (FUS3),  which  besides  controlling  seed  development-

related  processes,  are  also  positive  regulators  of  FA/TAG  synthesis  (Giraudat  et  al.,  1992; 

Bäumlein  et  al.,  1994;  Stone  et  al.,  2001).  Other  important  regulators  include  LEAFY 

COTYLEDON 1 (LEC1) and LEAFY COTYLEDON1-LIKE (L1L), which are CCAAT-HAP3 

proteins (Lotan et al., 1998; Kwong et al., 2003), the basic leucine zipper 53 (bZIP53) (Alonso et 

36 

 
al.,  2009),  AGAMOUS-Like15  (AGL15)  (Zheng  et  al.,  2009),    the  MYB  proteins  MYB115, 

MYB118, MYB107, and MYB9 (Wang et al., 2009; Troncoso-Ponce et al., 2016; Lashbrooke et 

al.,  2016),  and  the  homeobox  GLABRA2  (Shen  et  al.,  2006).  Also,  VP1/ABSCISIC  ACID 

INSENSITIVE3-LIKE1, 2, and 3 (VAL1, 2, 3), all members of the ABI3VP1 family, are known 

for their roles in repressing the seed maturation program before germination (Tsukagoshi et al., 

2007;  Suzuki  and  McCarty,  2008;  Guerriero  et  al.,  2009).  Downstream  of  some  of  these 

developmental  regulators  are  several  TFs  that  modulate  specific  aspects  of  lipid  metabolism, 

including  WRINKLED1  (WRI1),  which  controls  carbon  flux  from  sucrose  to  FA  biosynthesis 

(Cernac and Benning, 2004). WRI1 is regulated at the transcriptional level by LEC1 and MYB89, 

and 

at 

the 

post-translational 

level 

by 

KIN10 

and 

TEOSINTE 

BRANCHED1/CYCLOIDEA/PROLIFERATING  CELL  FACTOR  4  (TCP4)  (Li  et  al.,  2017; 

Zhai et al., 2017; Kong et al., 2020; Pelletier et al., 2017). Some of these regulatory associations 

are conserved between species (Kong and Ma, 2018; Kong, et al., 2020; Devic and Roscoe, 2016). 

In  Camelina,  the  overexpression  of  Arabidopsis  MYB96  led  to  a  significant  increase  in 

epicuticular  and  total  wax  (Lee  et  al.,  2014),  resembling  the  functions  that  MYB96  has  in 

Arabidopsis under drought conditions (Seo et al., 2011). To what extent these regulatory networks 

are conserved between Arabidopsis and Camelina remains unknown.  

Despite  the  close  phylogenetic  relationship  of  Arabidopsis  and  Camelina,  they  accumulate 

different  quantities  and  types  of  seed  oils  (Li  et  al.,  2006).  Thus,  understanding  the  regulatory 

processes associated with these differences provides opportunities for further enhancing seed oil 

production.  Here,  we  describe  the  use  of  gene  co-expression  analyses  to  identify  several  TF 

candidates associated with the regulation of lipid biosynthesis in Camelina. These predictions were 

confirmed  using  DNA  affinity  purification  followed  by  sequencing  (DAP-seq)  analysis  of  the 

37 

 
corresponding TF candidates. Altogether, we identify and associate different TF candidates with 

specific aspects of the lipid-related process, including key players in regulating lipid accumulation 

during seed development in Camelina. 

3.3 RESULTS 

3.3.1 Expression analysis of genes involved in lipid accumulation during Camelina seed 

development 

To complement the sparse gene expression information available for early stages of Camelina 

seed development, we collected seeds from Suneson Camelina plants at 5, 8, and 11 days post-

anthesis  (DPA).  The  sampling  was  performed  for  three  biological  replicates  and  RNA  was 

extracted from seeds at the corresponding developmental stages and then used to perform RNA-

seq analyses (see Methods). To characterize the expression of genes involved in lipid metabolism, 

we first collated lipid-related genes (LRGs) from CamRegBase (https://camregbase.org/) (Gomez-

Cano  et  al.,  2020)  and  classified  them  according  to  the  information  provided  by  AraLip 

(http://aralip.plantbiology.msu.edu/) (Li-Beisson et al., 2013). In accordance with these criteria, a 

total of 2,765 Camelina LRGs were identified, which were then classified into 25 different groups 

according to their role in different aspects of lipid metabolism, and because of their homology to 

well-described Arabidopsis lipid regulators.  

We used the publicly available developing-seed gene expression datasets and the RNA-seq 

information generated here from 5-11 DPA seeds to analyze mRNA accumulation patterns of the 

annotated LRGs. Overall, we identified four major types of genes based on their mRNA abundance 

during seed development. The smallest group (121 genes) corresponded to genes expressed at high 

levels [average transcripts per million (TPM) ~380] across all the developing-seed stages tested. 

These genes were largely associated with functions such as TAG and FA synthesis (Figure 3.1a, 

38 

 
b).  The  second  highest-expressed  group  (average  TPM  ~20)  consisted  of  553  genes  with 

predominant functions associated with lipid synthesis, desaturation, and export from plastids. Most 

of the genes were associated with two groups with medium-low (average TPM 4.5, ~5,847 genes) 

and  low  (average  TPM  ~0.5,  1,244  genes),  primarily  related  to  functions  associated  with  the 

biosynthesis of membrane lipids, waxes, and suberins (Figure 3.1a, b).  

Genes highly expressed in developing seeds corresponded to functions associated with I) TAG 

synthesis, II) FA synthesis, and III) FA elongation & desaturation (Figure 3.1b), and this is why 

we analyzed the mRNA accumulation dynamics of these major groups of lipid metabolic genes 

across  the  various  developmental  stages.  TAG  synthesis  genes  peak  at  18  -  29  DPA  with 

expression values (TPM) several times (>10 times) higher than the other two processes (Figure 

3.1c). Partially consistent with metabolite data (Pollard et al., 2015), FA synthesis, elongation, and 

desaturation  genes  peak  at  10-11  DPA  (Figure  3.1c).  The  value  of  the  newly-added  RNA-seq 

datasets (indicated red in Figure 3.1a), particularly for 5 and 8 DPA is evident from the high level 

of  expression  of  several  LRGs  early  during  seed  development  (a  few  examples  indicated  with 

asterisks in Figure 3.1a). Taken together, our analyses provide a comprehensive overview of the 

expression of LRGs during Camelina seed development, featuring specific gene sets with potential 

major lipid metabolism roles, providing an opportunity to uncover key regulators. 

39 

 
4.3% (121/2,765)

20.0% (553/2,765)

30.6% (847/2,765)

44.9% (1,244/2,765)

Group 1

a

Group 1

Group 2

Group 2

Group 3

*

*

*

*

*
*

Group 3

*
G
R
L

Group 4

Group 4

*
*
*
*
*

*

*

*
*
*

5 8 10 11 12 14 16
5 8 11

58

0
1

1
1

2
1

4
1

6
1

8
1

0
2

2
2

4
2

6
2

1
2
−
6
1

16-21 18 20 22 24 26
35-39
25-29

9
2
−
5
2

S
9
3
G
S
−
G
5
3

DPA
DPA
log2(TPM)
0
0 5 10 15

5 10 15

Csa00532s200
Csa01g005810
Csa01g013220
Csa01g030760
Csa02g005050
Csa02g012990
Csa02g026890
Csa02g039290
Csa02g039360
Csa02g041750
Csa02g044890
Csa02g055650
Csa02g057710
Csa02g057920
Csa02g065010
Csa02g065020
Csa02g072770
Csa03g030350
Csa03g031170
Csa03g036140
Csa03g053070
Csa03g053840
Csa03g059760
Csa03g061400
Csa04g012470
Csa04g015780
Csa04g037660
Csa04g052350
Csa04g055500
Csa04g061470
Csa04g062460
Csa05g005670
Csa05g006640
Csa05g011590
Csa05g014740
Csa05g023410
Csa05g029860
Csa05g033930
Csa05g068550
Csa05g076270
Csa06g006590
Csa06g008780
Csa06g041400
Csa06g046960
Csa06g050010
Csa06g050960
Csa07g011510
Csa07g013360
Csa07g028160
Csa07g052330
Csa07g063870
Csa08g007680
Csa08g052610
Csa09048s010
Csa09g014800
Csa09g086550
Csa09g092100
Csa10g007610
Csa10g016060
Csa10g017840
Csa10g032090
Csa10g042420
Csa10g043160
Csa10g047190
Csa11g008950
Csa11g017470
Csa11g019460
Csa11g052000
Csa11g057650
Csa11g065710
Csa11g082030
Csa11g082710
Csa11g101180
Csa12g009650
Csa12g025120
Csa12g028090
Csa12g034710
Csa12g079570
Csa12g086580
Csa13g018350
Csa13g036670
Csa13g047050
Csa13g055360
Csa14g002090
Csa14g034720
Csa14g035500
Csa14g048970
Csa14g054890
Csa14g062020
Csa14g064720
Csa15g002300
Csa15g006200
Csa15g016000
Csa15g025140
Csa16g012010
Csa16g014970
Csa16g020350
Csa16g043970
Csa16g047510
Csa17g001570
Csa17g018960
Csa17g035810
Csa17g037690
Csa17g050040
Csa17g070600
Csa17g080430
Csa17g080630
Csa17g092830
Csa17g094610
Csa18g003740
Csa18g021370
Csa18g021430
Csa18g022020
Csa18g026690
Csa18g032610
Csa18g037700
Csa19g001360
Csa19g007270
Csa19g016350
Csa19g025230
Csa25320s010

Csa00511s040
Csa00630s010
Csa00765s020
Csa01g002220
Csa01g006360
Csa01g006930
Csa01g007550
Csa01g007860
Csa01g008750
Csa01g009150
Csa01g009410
Csa01g010330
Csa01g012210
Csa01g013930
Csa01g016910
Csa01g017540
Csa01g018370
Csa01g019650
Csa01g020740
Csa01g021420
Csa01g021440
Csa01g022250
Csa01g023380
Csa01g024080
Csa01g031130
Csa01g031720
Csa01g031830
Csa01g034760
Csa01g042590
Csa02g005780
Csa02g011170
Csa02g012830
Csa02g028230
Csa02g033420
Csa02g033710
Csa02g035580
Csa02g043020
Csa02g049780
Csa02g057870
Csa02g057970
Csa02g061450
Csa02g065510
Csa02g067250
Csa02g067560
Csa02g068560
Csa02g068600
Csa02g073210
Csa02g073590
Csa02g076920
Csa03g002150
Csa03g006040
Csa03g009740
Csa03g010630
Csa03g011960
Csa03g012700
Csa03g012810
Csa03g012900
Csa03g013730
Csa03g014250
Csa03g016400
Csa03g017090
Csa03g017310
Csa03g017420
Csa03g021660
Csa03g023080
Csa03g023250
Csa03g025780
Csa03g027250
Csa03g029260
Csa03g031200
Csa03g033280
Csa03g038570
Csa03g053630
Csa03g061330
Csa03g062460
Csa04g008800
Csa04g009470
Csa04g011090
Csa04g015990
Csa04g018180
Csa04g022560
Csa04g024180
Csa04g024660
Csa04g028400
Csa04g033750
Csa04g034000
Csa04g037310
Csa04g037370
Csa04g040040
Csa04g040400
Csa04g041210
Csa04g041610
Csa04g046850
Csa04g046960
Csa04g047810
Csa04g049710
Csa04g051720
Csa04g054530
Csa04g056010
Csa04g058040
Csa04g061620
Csa04g065230
Csa04g065570
Csa05g002680
Csa05g003010
Csa05g006520
Csa05g009140
Csa05g009530
Csa05g011070
Csa05g011780
Csa05g012590
Csa05g013960
Csa05g014730
Csa05g014870
Csa05g015510
Csa05g015650
Csa05g015750
Csa05g017850
Csa05g018780
Csa05g020570
Csa05g020680
Csa05g023080
Csa05g023090
Csa05g025750
Csa05g032270
Csa05g033570
Csa05g044720
Csa05g059570
Csa05g068650
Csa05g085820
Csa05g092130
Csa05g092320
Csa05g093640
Csa05g093810
Csa05g094970
Csa06488s010
Csa06g001230
Csa06g004790
Csa06g005340
Csa06g005440
Csa06g017080
Csa06g018480
Csa06g019520
Csa06g022600
Csa06g022970
Csa06g025650
Csa06g025800
Csa06g026040
Csa06g027980
Csa06g028490
Csa06g028810
Csa06g029540
Csa06g033110
Csa06g033370
Csa06g037050
Csa06g037170
Csa06g037980
Csa06g038840
Csa06g040780
Csa06g041540
Csa06g043170
Csa06g044570
Csa06g045350
Csa06g047480
Csa06g048550
Csa06g050120
Csa06g052640
Csa06g053000
Csa07g001770
Csa07g002960
Csa07g004710
Csa07g008380
Csa07g012660
Csa07g012680
Csa07g013000
Csa07g023510
Csa07g023540
Csa07g028210
Csa07g033170
Csa07g037530
Csa07g038900
Csa07g038910
Csa07g045710
Csa07g046400
Csa07g047670
Csa07g050270
Csa07g052320
Csa07g057200
Csa07g058960
Csa07g061970
Csa07g064130
Csa07g064380
Csa08g002790
Csa08g005750
Csa08g008240
Csa08g008360
Csa08g008370
Csa08g008550
Csa08g011200
Csa08g011940
Csa08g017810
Csa08g018210
Csa08g028460
Csa08g029940
Csa08g032230
Csa08g038880
Csa08g047170
Csa08g048870
Csa08g051930
Csa08g054420
Csa08g054430
Csa08g055910
Csa08g056520
Csa09g001290
Csa09g006870
Csa09g007220
Csa09g009040
Csa09g011440
Csa09g016030
Csa09g034290
Csa09g035780
Csa09g040850
Csa09g050950
Csa09g051410
Csa09g052210
Csa09g058550
Csa09g058880
Csa09g062470
Csa09g064030
Csa09g065870
Csa09g067340
Csa09g069840
Csa09g071950
Csa09g078030
Csa09g078240
Csa09g078890
Csa09g079550
Csa09g080850
Csa09g084190
Csa09g084390
Csa09g086540
Csa09g092370
Csa09g094070
Csa09g097250
Csa09g097470
Csa09g097760
Csa10g002250
Csa10g008140
Csa10g008170
Csa10g008690
Csa10g010740
Csa10g011570
Csa10g011640
Csa10g011940
Csa10g013510
Csa10g013600
Csa10g014990
Csa10g017440
Csa10g019220
Csa10g020660
Csa10g020850
Csa10g020950
Csa10g022650
Csa10g023370
Csa10g027790
Csa10g028750
Csa10g029660
Csa10g032670
Csa10g034800
Csa10g038720
Csa10g049100
Csa11g002930
Csa11g005290
Csa11g005910
Csa11g009560
Csa11g010130
Csa11g011540
Csa11g012410
Csa11g012490
Csa11g014490
Csa11g014560
Csa11g016230
Csa11g021040
Csa11g023750
Csa11g023870
Csa11g026470
Csa11g031070
Csa11g032960
Csa11g033900
Csa11g038700
Csa11g043130
Csa11g044740
Csa11g058960
Csa11g059590
Csa11g063970
Csa11g064690
Csa11g071650
Csa11g072150
Csa11g072410
Csa11g075840
Csa11g077030
Csa11g078780
Csa11g083530
Csa11g083570
Csa11g085290
Csa11g088480
Csa11g088640
Csa11g090570
Csa11g093140
Csa11g093150
Csa11g093610
Csa11g094190
Csa11g094450
Csa11g097450
Csa11g101570
Csa11g102300
Csa12g002720
Csa12g007640
Csa12g009060
Csa12g009600
Csa12g010270
Csa12g011830
Csa12g014560
Csa12g016160
Csa12g016230
Csa12g016580
Csa12g020390
Csa12g022880
Csa12g025980
Csa12g030590
Csa12g034590
Csa12g036760
Csa12g038660
Csa12g040690
Csa12g050960
Csa12g053150
Csa12g057400
Csa12g065690
Csa12g069570
Csa12g079300
Csa12g086590
Csa13635s010
Csa13g006100
Csa13g009890
Csa13g010480
Csa13g011920
Csa13g011930
Csa13g012250
Csa13g014450
Csa13g016510
Csa13g018930
Csa13g019050
Csa13g021560
Csa13g022210
Csa13g023030
Csa13g028040
Csa13g028790
Csa13g041680
Csa13g052720
Csa13g054650
Csa14g001390
Csa14g002130
Csa14g005850
Csa14g005930
Csa14g007700
Csa14g008640
Csa14g009990
Csa14g010140
Csa14g010750
Csa14g010860
Csa14g010940
Csa14g012300
Csa14g013790
Csa14g015560
Csa14g016250
Csa14g016490
Csa14g016600
Csa14g024600
Csa14g024750
Csa14g025070
Csa14g027130
Csa14g027210
Csa14g030620
Csa14g032490
Csa14g037700
Csa14g041770
Csa14g043220
Csa14g044300
Csa14g054310
Csa14g067710
Csa15g002280
Csa15g002290
Csa15g003110
Csa15g006690
Csa15g007500
Csa15g008120
Csa15g008250
Csa15g008510
Csa15g008630
Csa15g009440
Csa15g010030
Csa15g010950
Csa15g014910
Csa15g015500
Csa15g016760
Csa15g018720
Csa15g019440
Csa15g020390
Csa15g021020
Csa15g023000
Csa15g024360
Csa15g024790
Csa15g026960
Csa15g030520
Csa15g050020
Csa15g050420
Csa15g052810
Csa15g055530
Csa15g055690
Csa15g064270
Csa15g072100
Csa15g074060
Csa15g081030
Csa15g084220
Csa16g004410
Csa16g007890
Csa16g008170
Csa16g008180
Csa16g009200
Csa16g014410
Csa16g014430
Csa16g014700
Csa16g020320
Csa16g022880
Csa16g028120
Csa16g031150
Csa16g031720
Csa16g034610
Csa16g036500
Csa16g038240
Csa16g038860
Csa16g040270
Csa16g041780
Csa16g049520
Csa16g052640
Csa16g055340
Csa16g055500
Csa16g055780
Csa17g001530
Csa17g002210
Csa17g002220
Csa17g007890
Csa17g009720
Csa17g010620
Csa17g011970
Csa17g012110
Csa17g012820
Csa17g012930
Csa17g013010
Csa17g013940
Csa17g014450
Csa17g015030
Csa17g016850
Csa17g018610
Csa17g018860
Csa17g019670
Csa17g020660
Csa17g024140
Csa17g025060
Csa17g025260
Csa17g025440
Csa17g025600
Csa17g028710
Csa17g028800
Csa17g028820
Csa17g030510
Csa17g034620
Csa17g037730
Csa17g043080
Csa17g059730
Csa17g078890
Csa17g094560
Csa17g095770
Csa18g002190
Csa18g002740
Csa18g009590
Csa18g010580
Csa18g011360
Csa18g018830
Csa18g022810
Csa18g024490
Csa18g026900
Csa18g026910
Csa18g029980
Csa18g032620
Csa18g033660
Csa18g033950
Csa18g034930
Csa18g038810
Csa18g042580
Csa19g004750
Csa19g005470
Csa19g007750
Csa19g008560
Csa19g010180
Csa19g010300
Csa19g010540
Csa19g012130
Csa19g013020
Csa19g017020
Csa19g021040
Csa19g021690
Csa19g022540
Csa19g023890
Csa19g026750
Csa19g028170
Csa19g028990
Csa19g036630
Csa19g037960
Csa19g039640
Csa19g040900
Csa19g047700
Csa19g050830
Csa19g056370
Csa20g001900
Csa20g003090
Csa20g009760
Csa20g009980
Csa20g010310
Csa20g011910
Csa20g014160
Csa20g015430
Csa20g015440
Csa20g015800
Csa20g017180
Csa20g022050
Csa20g023740
Csa20g023750
Csa20g023910
Csa20g027020
Csa20g028930
Csa20g031950
Csa20g038500
Csa20g039090
Csa20g039590
Csa20g055600
Csa20g058900
Csa20g066350
Csa20g066660
Csa20g067180
Csa20g071220
Csa20g079430
Csa20g081510

Csa00382s080
Csa00382s230
Csa00430s100
Csa00511s160
Csa00555s050
Csa00579s080
Csa00894s030
Csa01215s030
Csa01264s010
Csa01264s020
Csa01g001410
Csa01g002200
Csa01g002210
Csa01g003020
Csa01g003220
Csa01g006240
Csa01g007680
Csa01g007950
Csa01g008410
Csa01g011210
Csa01g011790
Csa01g012530
Csa01g012760
Csa01g016590
Csa01g016720
Csa01g016830
Csa01g018280
Csa01g018460
Csa01g018820
Csa01g021080
Csa01g021830
Csa01g021850
Csa01g022270
Csa01g023560
Csa01g025630
Csa01g030380
Csa01g035770
Csa01g036250
Csa01g037870
Csa01g038310
Csa01g038410
Csa01g039960
Csa01g040150
Csa01g041860
Csa01g044650
Csa02238s010
Csa02263s020
Csa02350s010
Csa02g001440
Csa02g001530
Csa02g001540
Csa02g001630
Csa02g005760
Csa02g016220
Csa02g021800
Csa02g028180
Csa02g043430
Csa02g044850
Csa02g044860
Csa02g049060
Csa02g049110
Csa02g057190
Csa02g059740
Csa02g062360
Csa02g064350
Csa02g070110
Csa02g070590
Csa02g072460
Csa02g072760
Csa02g076340
Csa03g001510
Csa03g001630
Csa03g001640
Csa03g002680
Csa03g005960
Csa03g009800
Csa03g009940
Csa03g011180
Csa03g012390
Csa03g012400
Csa03g012770
Csa03g013910
Csa03g014330
Csa03g014540
Csa03g014710
Csa03g017030
Csa03g017330
Csa03g018880
Csa03g018890
Csa03g021310
Csa03g022190
Csa03g023430
Csa03g023640
Csa03g025850
Csa03g025860
Csa03g026470
Csa03g028590
Csa03g029410
Csa03g029640
Csa03g030000
Csa03g031590
Csa03g032300
Csa03g034930
Csa03g036080
Csa03g036550
Csa03g037520
Csa03g039320
Csa03g046720
Csa03g047400
Csa03g053460
Csa03g055300
Csa03g055480
Csa03g058380
Csa03g058630
Csa03g059790
Csa03g060520
Csa03g061830
Csa03g061840
Csa03g062840
Csa04g002360
Csa04g009630
Csa04g009790
Csa04g011010
Csa04g012540
Csa04g015800
Csa04g015810
Csa04g022310
Csa04g024230
Csa04g024240
Csa04g025980
Csa04g025990
Csa04g028640
Csa04g030200
Csa04g030530
Csa04g034350
Csa04g036660
Csa04g037650
Csa04g037780
Csa04g039560
Csa04g040050
Csa04g041120
Csa04g042410
Csa04g043530
Csa04g043810
Csa04g045020
Csa04g047040
Csa04g051480
Csa04g051560
Csa04g051790
Csa04g052510
Csa04g053290
Csa04g053480
Csa04g055050
Csa04g055300
Csa04g056170
Csa04g057500
Csa04g057840
Csa04g062850
Csa04g062860
Csa04g063050
Csa04g063640
Csa04g064760
Csa04g065020
Csa04g065130
Csa04g066390
Csa04g067780
Csa05027s010
Csa05642s010
Csa05g001040
Csa05g001530
Csa05g001930
Csa05g003100
Csa05g003260
Csa05g003480
Csa05g003620
Csa05g004120
Csa05g004320
Csa05g005950
Csa05g006090
Csa05g009930
Csa05g011450
Csa05g012020
Csa05g013720
Csa05g014070
Csa05g015740
Csa05g015850
Csa05g016500
Csa05g020450
Csa05g022900
Csa05g031830
Csa05g035870
Csa05g039760
Csa05g041910
Csa05g043280
Csa05g044600
Csa05g046090
Csa05g049640
Csa05g058130
Csa05g058160
Csa05g059420
Csa05g059550
Csa05g060850
Csa05g064870
Csa05g067230
Csa05g067240
Csa05g068800
Csa05g068860
Csa05g081840
Csa05g083370
Csa05g084930
Csa05g086400
Csa05g091360
Csa05g092430
Csa05g093030
Csa05g094920
Csa05g095180
Csa06908s010
Csa06g002780
Csa06g004990
Csa06g005050
Csa06g006640
Csa06g008810
Csa06g009010
Csa06g010190
Csa06g015300
Csa06g017130
Csa06g017140
Csa06g018860
Csa06g021000
Csa06g022410
Csa06g024190
Csa06g025030
Csa06g026020
Csa06g026030
Csa06g028500
Csa06g029450
Csa06g029870
Csa06g029970
Csa06g030710
Csa06g031410
Csa06g032880
Csa06g034300
Csa06g037250
Csa06g040040
Csa06g040680
Csa06g040850
Csa06g043260
Csa06g047660
Csa06g048050
Csa06g050050
Csa06g050240
Csa06g051350
Csa06g051540
Csa06g052070
Csa06g052090
Csa06g052420
Csa06g052530
Csa06g053760
Csa06g053780
Csa06g054160
Csa07g001420
Csa07g002090
Csa07g003100
Csa07g003790
Csa07g004320
Csa07g004440
Csa07g005350
Csa07g007870
Csa07g008020
Csa07g008030
Csa07g008400
Csa07g010470
Csa07g012330
Csa07g012560
Csa07g016680
Csa07g017010
Csa07g019750
Csa07g022100
Csa07g023480
Csa07g023930
Csa07g029390
Csa07g029440
Csa07g029910
Csa07g029920
Csa07g032970
Csa07g033690
Csa07g034210
Csa07g036400
Csa07g037900
Csa07g038580
Csa07g039800
Csa07g040110
Csa07g040670
Csa07g040680
Csa07g040900
Csa07g043100
Csa07g044560
Csa07g045620
Csa07g046450
Csa07g050070
Csa07g050330
Csa07g050420
Csa07g053240
Csa07g055190
Csa07g057690
Csa07g058860
Csa07g059410
Csa07g060190
Csa07g060400
Csa07g060440
Csa07g060470
Csa07g060690
Csa07g061690
Csa07g062040
Csa07g063740
Csa07g063910
Csa07g064950
Csa07g066010
Csa08g001470
Csa08g001600
Csa08g001670
Csa08g004790
Csa08g005560
Csa08g006190
Csa08g007050
Csa08g009620
Csa08g010030
Csa08g011190
Csa08g011830
Csa08g013630
Csa08g013880
Csa08g014550
Csa08g014620
Csa08g017550
Csa08g017990
Csa08g018580
Csa08g028100
Csa08g034560
Csa08g037570
Csa08g037600
Csa08g047360
Csa08g054040
Csa08g055280
Csa08g055300
Csa08g056830
Csa08g057930
Csa08g058890
Csa08g059510
Csa08g059930
Csa08g059940
Csa08g060480
Csa08g061580
Csa09g002440
Csa09g007470
Csa09g008540
Csa09g008880
Csa09g011500
Csa09g021290
Csa09g030650
Csa09g032920
Csa09g034370
Csa09g039570
Csa09g041070
Csa09g047650
Csa09g047990
Csa09g050690
Csa09g051620
Csa09g051630
Csa09g058490
Csa09g058640
Csa09g058860
Csa09g058870
Csa09g061880
Csa09g062480
Csa09g064730
Csa09g066260
Csa09g068070
Csa09g068220
Csa09g069390
Csa09g069610
Csa09g071100
Csa09g071960
Csa09g072350
Csa09g075300
Csa09g075310
Csa09g077250
Csa09g077510
Csa09g078140
Csa09g078340
Csa09g078800
Csa09g079600
Csa09g081420
Csa09g084470
Csa09g084560
Csa09g085780
Csa09g087490
Csa09g090360
Csa09g092780
Csa09g094000
Csa09g094530
Csa09g095390
Csa09g097070
Csa09g097190
Csa09g097210
Csa09g098250
Csa09g098530
Csa09g099310
Csa10g001440
Csa10g002240
Csa10g002400
Csa10g002490
Csa10g002540
Csa10g004090
Csa10g006270
Csa10g007950
Csa10g008130
Csa10g008980
Csa10g009240
Csa10g010470
Csa10g012170
Csa10g012960
Csa10g013990
Csa10g016940
Csa10g018810
Csa10g020840
Csa10g020880
Csa10g021840
Csa10g021850
Csa10g026620
Csa10g026630
Csa10g027920
Csa10g027930
Csa10g028850
Csa10g029840
Csa10g031090
Csa10g032080
Csa10g032380
Csa10g032940
Csa10g033300
Csa10g041200
Csa10g047050
Csa10g049570
Csa11g001480
Csa11g001630
Csa11g002490
Csa11g002500
Csa11g002980
Csa11g002990
Csa11g004600
Csa11g004610
Csa11g007410
Csa11g008720
Csa11g008920
Csa11g009790
Csa11g009870
Csa11g011280
Csa11g012790
Csa11g013060
Csa11g013960
Csa11g015010
Csa11g018420
Csa11g019530
Csa11g020530
Csa11g020620
Csa11g023900
Csa11g024850
Csa11g024860
Csa11g025640
Csa11g031210
Csa11g031250
Csa11g033050
Csa11g034080
Csa11g035440
Csa11g038690
Csa11g039990
Csa11g044750
Csa11g044930
Csa11g051470
Csa11g055550
Csa11g055560
Csa11g063960
Csa11g064070
Csa11g070430
Csa11g070500
Csa11g071140
Csa11g074490
Csa11g074620
Csa11g082810
Csa11g083150
Csa11g084630
Csa11g091410
Csa11g092430
Csa11g097420
Csa11g099320
Csa11g100520
Csa11g100840
Csa11g101170
Csa11g104540
Csa11g105070
Csa12377s010
Csa12g001380
Csa12g002340
Csa12g002790
Csa12g004340
Csa12g004990
Csa12g009440
Csa12g011480
Csa12g011550
Csa12g014010
Csa12g016820
Csa12g017860
Csa12g020490
Csa12g026120
Csa12g026400
Csa12g028170
Csa12g030160
Csa12g034280
Csa12g034570
Csa12g034640
Csa12g036750
Csa12g037190
Csa12g047190
Csa12g051070
Csa12g053430
Csa12g057390
Csa12g057600
Csa12g060050
Csa12g061290
Csa12g061590
Csa12g063010
Csa12g066070
Csa12g069680
Csa12g075140
Csa12g077540
Csa12g081940
Csa12g083770
Csa13g001680
Csa13g002980
Csa13g003830
Csa13g006520
Csa13g006530
Csa13g006940
Csa13g008570
Csa13g009260
Csa13g009610
Csa13g011030
Csa13g015280
Csa13g016300
Csa13g016990
Csa13g019040
Csa13g019190
Csa13g019590
Csa13g020200
Csa13g020600
Csa13g022110
Csa13g024650
Csa13g024720
Csa13g027780
Csa13g028190
Csa13g028430
Csa13g036280
Csa13g038010
Csa13g044710
Csa13g044750
Csa13g049040
Csa13g054680
Csa13g056850
Csa13g056910
Csa13g056920
Csa13g057020
Csa14g001560
Csa14g002450
Csa14g002690
Csa14g005360
Csa14g007750
Csa14g007930
Csa14g010450
Csa14g011810
Csa14g013390
Csa14g013620
Csa14g016510
Csa14g017150
Csa14g018220
Csa14g018240
Csa14g021590
Csa14g021690
Csa14g021790
Csa14g022180
Csa14g027200
Csa14g032670
Csa14g035530
Csa14g039490
Csa14g042110
Csa14g044140
Csa14g044150
Csa14g047290
Csa14g049770
Csa14g054710
Csa14g054720
Csa14g055300
Csa14g055810
Csa14g059740
Csa14g064660
Csa14g066070
Csa14g066080
Csa14g069150
Csa15033s010
Csa15g001450
Csa15g002660
Csa15g003290
Csa15g003300
Csa15g006830
Csa15g006990
Csa15g007630
Csa15g009130
Csa15g009820
Csa15g011390
Csa15g012850
Csa15g014050
Csa15g014060
Csa15g014420
Csa15g015210
Csa15g018370
Csa15g018520
Csa15g018630
Csa15g020250
Csa15g020340
Csa15g020490
Csa15g021670
Csa15g023010
Csa15g026520
Csa15g026530
Csa15g026980
Csa15g030730
Csa15g031400
Csa15g036980
Csa15g038240
Csa15g044570
Csa15g051440
Csa15g057230
Csa15g058700
Csa15g059780
Csa15g071200
Csa15g076280
Csa15g077900
Csa15g079450
Csa15g080950
Csa15g084470
Csa16896s010
Csa16g001770
Csa16g002110
Csa16g003090
Csa16g003160
Csa16g003620
Csa16g003810
Csa16g004280
Csa16g006180
Csa16g007730
Csa16g015430
Csa16g017200
Csa16g017490
Csa16g020300
Csa16g021520
Csa16g022830
Csa16g022910
Csa16g027400
Csa16g027640
Csa16g028550
Csa16g031080
Csa16g032260
Csa16g032700
Csa16g032890
Csa16g035660
Csa16g036170
Csa16g036340
Csa16g036600
Csa16g037650
Csa16g038150
Csa16g038910
Csa16g038920
Csa16g040840
Csa16g041610
Csa16g041840
Csa16g041970
Csa16g043960
Csa16g046780
Csa16g047830
Csa16g049390
Csa16g049970
Csa16g050320
Csa16g051010
Csa16g051040
Csa16g051070
Csa16g051300
Csa16g052350
Csa16g052730
Csa16g052740
Csa16g055180
Csa16g055310
Csa16g056180
Csa16g057210
Csa17g001450
Csa17g002080
Csa17g007350
Csa17g007970
Csa17g009480
Csa17g009910
Csa17g010220
Csa17g012460
Csa17g012890
Csa17g014600
Csa17g014800
Csa17g018880
Csa17g020640
Csa17g021650
Csa17g023060
Csa17g023140
Csa17g023610
Csa17g023790
Csa17g028840
Csa17g029530
Csa17g034790
Csa17g035430
Csa17g039140
Csa17g049990
Csa17g051550
Csa17g055610
Csa17g059490
Csa17g059560
Csa17g061000
Csa17g071380
Csa17g081040
Csa17g083150
Csa17g083310
Csa17g090380
Csa17g090610
Csa17g092600
Csa17g092850
Csa17g093770
Csa17g095060
Csa17g095070
Csa17g097110
Csa17g098440
Csa18g002080
Csa18g009540
Csa18g010110
Csa18g010980
Csa18g011070
Csa18g014440
Csa18g014580
Csa18g014860
Csa18g016090
Csa18g022120
Csa18g022780
Csa18g023800
Csa18g024810
Csa18g026790
Csa18g026810
Csa18g026830
Csa18g026860
Csa18g030890
Csa18g031850
Csa18g031880
Csa18g034960
Csa18g037360
Csa18g037690
Csa18g038060
Csa18g042120
Csa19g002500
Csa19g004730
Csa19g004740
Csa19g007460
Csa19g007920
Csa19g008090
Csa19g010660
Csa19g011120
Csa19g011460
Csa19g011870
Csa19g013930
Csa19g014200
Csa19g014220
Csa19g014550
Csa19g015230
Csa19g015580
Csa19g015830
Csa19g020680
Csa19g020840
Csa19g020950
Csa19g022440
Csa19g022610
Csa19g022640
Csa19g023040
Csa19g024920
Csa19g026290
Csa19g026300
Csa19g026770
Csa19g028390
Csa19g031620
Csa19g031820
Csa19g036260
Csa19g039180
Csa19g039770
Csa19g039780
Csa19g040300
Csa19g040810
Csa19g043380
Csa19g043490
Csa19g046330
Csa19g047650
Csa19g048140
Csa19g049020
Csa19g058420
Csa20g005210
Csa20g005730
Csa20g005750
Csa20g006200
Csa20g008850
Csa20g012500
Csa20g012520
Csa20g019000
Csa20g019190
Csa20g020640
Csa20g021480
Csa20g023160
Csa20g023920
Csa20g024230
Csa20g024700
Csa20g025580
Csa20g025960
Csa20g028800
Csa20g032720
Csa20g032790
Csa20g036130
Csa20g038120
Csa20g038700
Csa20g041260
Csa20g041390
Csa20g055560
Csa20g057510
Csa20g066730
Csa20g068670
Csa20g071290
Csa20g072130
Csa20g082110
Csa20g082220
Csa22442s010
Csa26607s010

Csa00382s240
Csa00441s260
Csa00474s110
Csa00512s020
Csa00518s150
Csa00566s170
Csa00566s180
Csa00579s070
Csa00619s050
Csa00633s100
Csa00637s030
Csa00637s040
Csa00692s010
Csa00780s010
Csa01215s040
Csa01270s010
Csa01694s010
Csa01718s010
Csa01726s010
Csa01730s010
Csa01832s010
Csa01g001030
Csa01g002590
Csa01g003210
Csa01g003230
Csa01g005030
Csa01g006520
Csa01g007070
Csa01g008480
Csa01g009030
Csa01g010730
Csa01g011430
Csa01g011450
Csa01g012370
Csa01g012490
Csa01g013090
Csa01g015460
Csa01g016340
Csa01g018480
Csa01g019450
Csa01g021000
Csa01g021070
Csa01g021840
Csa01g025160
Csa01g025170
Csa01g025180
Csa01g025640
Csa01g025830
Csa01g025840
Csa01g028090
Csa01g029620
Csa01g029630
Csa01g031850
Csa01g033200
Csa01g034730
Csa01g035750
Csa01g035920
Csa01g036170
Csa01g039800
Csa01g041500
Csa01g043720
Csa01g044280
Csa01g044790
Csa02023s010
Csa02088s020
Csa02142s010
Csa02144s010
Csa02874s010
Csa02994s010
Csa02g001310
Csa02g001460
Csa02g002060
Csa02g002100
Csa02g005090
Csa02g009820
Csa02g010990
Csa02g011080
Csa02g012940
Csa02g014480
Csa02g014570
Csa02g017430
Csa02g021750
Csa02g023450
Csa02g024680
Csa02g030700
Csa02g035180
Csa02g035360
Csa02g035850
Csa02g035910
Csa02g039280
Csa02g044910
Csa02g045170
Csa02g045360
Csa02g045380
Csa02g049070
Csa02g049700
Csa02g049710
Csa02g051520
Csa02g051530
Csa02g055390
Csa02g057830
Csa02g057840
Csa02g057850
Csa02g057860
Csa02g057880
Csa02g057890
Csa02g057930
Csa02g057940
Csa02g059720
Csa02g061640
Csa02g062240
Csa02g064320
Csa02g064340
Csa02g064500
Csa02g065030
Csa02g066920
Csa02g067700
Csa02g067940
Csa02g070130
Csa02g071970
Csa02g072110
Csa02g072140
Csa02g075640
Csa03324s010
Csa03423s010
Csa03584s010
Csa03963s010
Csa03984s010
Csa03g001000
Csa03g001770
Csa03g002110
Csa03g002140
Csa03g002410
Csa03g005200
Csa03g005450
Csa03g005460
Csa03g007020
Csa03g008510
Csa03g008530
Csa03g009690
Csa03g009750
Csa03g010170
Csa03g010200
Csa03g011360
Csa03g012090
Csa03g013930
Csa03g015230
Csa03g016320
Csa03g017900
Csa03g019180
Csa03g019870
Csa03g021120
Csa03g021200
Csa03g021840
Csa03g022100
Csa03g022250
Csa03g023040
Csa03g026900
Csa03g027170
Csa03g030250
Csa03g030260
Csa03g030270
Csa03g033550
Csa03g036070
Csa03g038240
Csa03g038440
Csa03g039310
Csa03g053440
Csa03g055180
Csa03g055310
Csa03g055540
Csa03g058310
Csa03g058330
Csa03g059560
Csa03g059670
Csa03g060660
Csa03g061240
Csa03g061270
Csa03g061600
Csa03g061750
Csa03g062230
Csa03g062760
Csa04114s010
Csa04234s010
Csa04556s010
Csa04561s010
Csa04592s010
Csa04776s010
Csa04837s010
Csa04917s010
Csa04990s010
Csa04g002120
Csa04g002450
Csa04g008260
Csa04g009760
Csa04g009830
Csa04g009840
Csa04g009860
Csa04g009870
Csa04g009880
Csa04g010000
Csa04g012500
Csa04g012530
Csa04g012550
Csa04g016850
Csa04g018090
Csa04g020000
Csa04g022290
Csa04g024220
Csa04g025800
Csa04g029230
Csa04g030150
Csa04g030190
Csa04g030540
Csa04g030830
Csa04g030880
Csa04g034640
Csa04g034650
Csa04g035700
Csa04g036300
Csa04g037250
Csa04g037360
Csa04g037980
Csa04g038140
Csa04g038290
Csa04g040010
Csa04g041310
Csa04g041320
Csa04g041360
Csa04g041420
Csa04g041430
Csa04g041540
Csa04g042310
Csa04g042430
Csa04g042440
Csa04g043000
Csa04g043090
Csa04g043350
Csa04g045080
Csa04g049090
Csa04g050910
Csa04g051350
Csa04g052330
Csa04g052340
Csa04g052360
Csa04g052950
Csa04g053320
Csa04g053460
Csa04g054350
Csa04g055270
Csa04g055630
Csa04g059270
Csa04g061710
Csa04g061810
Csa04g061990
Csa04g062120
Csa04g062130
Csa04g062660
Csa04g063750
Csa04g065830
Csa05086s010
Csa05342s010
Csa05867s010
Csa05g001000
Csa05g001010
Csa05g002440
Csa05g004460
Csa05g005960
Csa05g006110
Csa05g006120
Csa05g006200
Csa05g006390
Csa05g008890
Csa05g011810
Csa05g012750
Csa05g013740
Csa05g013750
Csa05g013880
Csa05g013890
Csa05g014300
Csa05g014310
Csa05g014710
Csa05g014720
Csa05g015440
Csa05g015830
Csa05g015970
Csa05g018510
Csa05g029330
Csa05g029470
Csa05g030270
Csa05g030350
Csa05g031890
Csa05g031910
Csa05g031940
Csa05g032000
Csa05g032010
Csa05g032020
Csa05g032120
Csa05g034570
Csa05g041820
Csa05g044310
Csa05g044350
Csa05g044360
Csa05g058110
Csa05g058140
Csa05g058150
Csa05g063420
Csa05g065890
Csa05g065900
Csa05g073520
Csa05g073720
Csa05g073750
Csa05g079530
Csa05g081640
Csa05g083380
Csa05g085240
Csa05g085860
Csa05g095950
Csa06364s010
Csa06595s010
Csa06607s010
Csa06720s010
Csa06724s010
Csa06728s010
Csa06871s010
Csa06880s010
Csa06g001110
Csa06g005110
Csa06g005130
Csa06g005320
Csa06g005330
Csa06g006610
Csa06g006630
Csa06g010140
Csa06g012440
Csa06g015280
Csa06g017120
Csa06g018430
Csa06g018870
Csa06g021240
Csa06g021510
Csa06g021560
Csa06g023150
Csa06g023160
Csa06g023170
Csa06g025600
Csa06g025730
Csa06g025790
Csa06g026190
Csa06g026360
Csa06g026400
Csa06g026580
Csa06g026760
Csa06g028460
Csa06g029640
Csa06g029650
Csa06g029670
Csa06g029690
Csa06g029750
Csa06g029760
Csa06g030610
Csa06g030740
Csa06g031320
Csa06g031430
Csa06g031680
Csa06g033390
Csa06g033410
Csa06g033430
Csa06g034330
Csa06g038260
Csa06g040480
Csa06g040580
Csa06g040600
Csa06g041380
Csa06g041390
Csa06g042930
Csa06g043330
Csa06g043430
Csa06g043450
Csa06g044410
Csa06g045120
Csa06g045310
Csa06g047090
Csa06g048840
Csa06g048850
Csa06g050480
Csa06g050490
Csa06g050500
Csa06g050620
Csa06g050630
Csa06g051180
Csa06g053230
Csa06g053240
Csa06g053770
Csa07800s010
Csa07815s010
Csa07g001000
Csa07g005340
Csa07g008050
Csa07g010990
Csa07g011160
Csa07g011800
Csa07g011880
Csa07g012400
Csa07g012410
Csa07g012460
Csa07g014830
Csa07g015800
Csa07g015890
Csa07g017900
Csa07g019650
Csa07g019670
Csa07g019690
Csa07g019700
Csa07g020880
Csa07g026560
Csa07g029370
Csa07g029950
Csa07g032000
Csa07g033700
Csa07g034440
Csa07g034450
Csa07g035800
Csa07g035840
Csa07g037540
Csa07g038130
Csa07g039070
Csa07g039100
Csa07g039130
Csa07g039620
Csa07g040190
Csa07g040270
Csa07g040740
Csa07g040780
Csa07g040790
Csa07g048250
Csa07g050240
Csa07g052020
Csa07g053250
Csa07g054790
Csa07g056820
Csa07g056920
Csa07g057370
Csa07g057960
Csa07g058830
Csa07g063450
Csa07g065190
Csa07g065950
Csa08004s010
Csa08046s010
Csa08064s010
Csa08389s010
Csa08517s010
Csa08518s010
Csa08625s010
Csa08g001330
Csa08g001490
Csa08g001580
Csa08g001590
Csa08g002290
Csa08g002810
Csa08g004600
Csa08g005480
Csa08g005490
Csa08g005880
Csa08g007000
Csa08g007420
Csa08g008290
Csa08g008510
Csa08g008910
Csa08g008980
Csa08g010970
Csa08g012100
Csa08g014240
Csa08g015630
Csa08g015710
Csa08g018530
Csa08g018550
Csa08g018560
Csa08g023030
Csa08g023040
Csa08g024000
Csa08g028070
Csa08g032210
Csa08g034640
Csa08g043040
Csa08g043110
Csa08g044130
Csa08g044140
Csa08g044650
Csa08g048970
Csa08g049110
Csa08g049120
Csa08g049130
Csa08g049320
Csa08g051960
Csa08g052570
Csa08g053540
Csa08g056390
Csa08g056790
Csa08g056840
Csa08g056850
Csa08g056860
Csa08g056870
Csa08g057200
Csa08g058450
Csa08g058550
Csa08g060720
Csa08g062850
Csa09466s010
Csa09513s010
Csa09908s010
Csa09g001000
Csa09g001150
Csa09g008520
Csa09g008530
Csa09g008600
Csa09g008610
Csa09g008810
Csa09g008820
Csa09g008840
Csa09g008850
Csa09g008870
Csa09g011460
Csa09g011470
Csa09g011510
Csa09g011670
Csa09g014820
Csa09g014830
Csa09g022670
Csa09g034330
Csa09g034350
Csa09g034360
Csa09g037040
Csa09g046600
Csa09g046650
Csa09g048430
Csa09g050410
Csa09g051590
Csa09g051600
Csa09g051610
Csa09g052100
Csa09g052950
Csa09g052960
Csa09g058630
Csa09g059000
Csa09g059200
Csa09g059390
Csa09g059420
Csa09g059540
Csa09g062440
Csa09g065970
Csa09g065980
Csa09g066030
Csa09g066050
Csa09g066090
Csa09g068200
Csa09g068240
Csa09g068950
Csa09g069070
Csa09g069080
Csa09g069110
Csa09g069120
Csa09g071150
Csa09g072280
Csa09g072540
Csa09g073950
Csa09g075460
Csa09g075490
Csa09g075520
Csa09g075530
Csa09g075550
Csa09g076110
Csa09g077590
Csa09g078070
Csa09g078080
Csa09g078480
Csa09g086260
Csa09g086270
Csa09g087170
Csa09g087460
Csa09g088950
Csa09g088960
Csa09g092010
Csa09g092540
Csa09g093970
Csa09g096780
Csa09g099270
Csa09g099540
Csa10123s010
Csa10190s010
Csa10197s010
Csa10361s010
Csa10596s010
Csa10870s010
Csa10g001310
Csa10g001330
Csa10g002260
Csa10g002270
Csa10g004100
Csa10g004770
Csa10g007250
Csa10g007620
Csa10g008110
Csa10g008420
Csa10g008890
Csa10g010320
Csa10g012780
Csa10g013020
Csa10g014260
Csa10g014270
Csa10g014560
Csa10g014570
Csa10g015430
Csa10g015800
Csa10g016030
Csa10g017150
Csa10g018510
Csa10g018900
Csa10g020550
Csa10g020710
Csa10g020720
Csa10g020730
Csa10g020740
Csa10g022180
Csa10g027940
Csa10g027950
Csa10g028670
Csa10g030720
Csa10g031540
Csa10g038400
Csa10g039800
Csa10g039820
Csa10g040780
Csa10g045070
Csa10g047130
Csa10g047760
Csa10g049520
Csa10g049580
Csa11632s010
Csa11727s010
Csa11822s010
Csa11862s010
Csa11g001460
Csa11g002480
Csa11g002520
Csa11g002820
Csa11g006950
Csa11g006960
Csa11g006970
Csa11g007400
Csa11g009270
Csa11g011130
Csa11g013730
Csa11g014030
Csa11g014040
Csa11g015490
Csa11g015500
Csa11g015780
Csa11g015790
Csa11g016740
Csa11g017130
Csa11g017440
Csa11g019060
Csa11g020220
Csa11g023600
Csa11g023720
Csa11g023730
Csa11g023740
Csa11g025210
Csa11g029860
Csa11g031220
Csa11g031230
Csa11g031240
Csa11g032860
Csa11g035060
Csa11g037040
Csa11g041510
Csa11g044470
Csa11g050960
Csa11g050970
Csa11g053930
Csa11g055430
Csa11g055720
Csa11g064400
Csa11g064410
Csa11g064450
Csa11g068200
Csa11g068510
Csa11g072070
Csa11g078360
Csa11g082020
Csa11g082940
Csa11g083590
Csa11g083830
Csa11g084000
Csa11g084010
Csa11g084020
Csa11g085220
Csa11g088540
Csa11g088550
Csa11g088560
Csa11g088570
Csa11g088600
Csa11g088610
Csa11g089280
Csa11g089880
Csa11g090750
Csa11g091270
Csa11g092400
Csa11g092420
Csa11g092580
Csa11g093160
Csa11g094590
Csa11g099150
Csa11g100530
Csa11g103880
Csa11g103890
Csa11g104980
Csa12118s010
Csa12574s010
Csa12588s010
Csa12864s010
Csa12g001360
Csa12g001500
Csa12g002370
Csa12g002510
Csa12g002530
Csa12g004350
Csa12g008700
Csa12g009590
Csa12g009950
Csa12g013860
Csa12g016260
Csa12g017460
Csa12g017920
Csa12g017930
Csa12g022180
Csa12g022460
Csa12g022470
Csa12g023450
Csa12g025090
Csa12g029840
Csa12g030250
Csa12g034210
Csa12g034320
Csa12g034330
Csa12g034340
Csa12g034350
Csa12g034370
Csa12g045680
Csa12g047200
Csa12g047210
Csa12g047220
Csa12g048240
Csa12g048250
Csa12g049890
Csa12g053410
Csa12g053420
Csa12g056840
Csa12g057760
Csa12g057830
Csa12g067150
Csa12g074630
Csa12g074640
Csa12g077990
Csa12g081320
Csa12g081730
Csa12g082010
Csa12g082580
Csa12g084000
Csa13236s010
Csa13858s010
Csa13869s010
Csa13g001050
Csa13g007570
Csa13g007940
Csa13g008080
Csa13g009400
Csa13g009550
Csa13g009560
Csa13g009570
Csa13g009580
Csa13g010000
Csa13g011020
Csa13g013920
Csa13g015290
Csa13g015300
Csa13g016210
Csa13g016630
Csa13g017750
Csa13g017800
Csa13g018160
Csa13g019150
Csa13g022350
Csa13g023360
Csa13g025900
Csa13g025970
Csa13g028700
Csa13g028720
Csa13g028730
Csa13g028760
Csa13g028770
Csa13g030600
Csa13g031260
Csa13g031270
Csa13g034190
Csa13g036250
Csa13g039230
Csa13g041760
Csa13g048060
Csa13g048130
Csa13g048670
Csa13g050670
Csa13g052810
Csa13g052890
Csa13g052900
Csa13g052910
Csa13g055330
Csa13g056390
Csa13g057000
Csa13g057160
Csa14458s010
Csa14464s010
Csa14645s010
Csa14g001690
Csa14g002040
Csa14g005130
Csa14g005370
Csa14g006980
Csa14g007140
Csa14g007470
Csa14g007490
Csa14g007660
Csa14g007760
Csa14g008200
Csa14g009130
Csa14g010820
Csa14g011960
Csa14g014350
Csa14g015470
Csa14g016190
Csa14g017430
Csa14g018290
Csa14g018510
Csa14g020190
Csa14g023550
Csa14g023720
Csa14g024560
Csa14g024910
Csa14g027790
Csa14g030530
Csa14g031760
Csa14g031950
Csa14g032580
Csa14g033940
Csa14g034330
Csa14g034610
Csa14g034630
Csa14g035330
Csa14g035910
Csa14g036700
Csa14g038020
Csa14g041720
Csa14g042830
Csa14g043940
Csa14g044320
Csa14g047130
Csa14g047270
Csa14g047280
Csa14g051500
Csa14g051510
Csa14g055520
Csa14g055640
Csa14g059340
Csa14g059690
Csa14g059710
Csa14g059940
Csa14g061840
Csa14g061960
Csa14g062050
Csa14g063810
Csa14g063960
Csa14g063970
Csa14g064580
Csa14g064920
Csa14g065060
Csa14g067480
Csa15052s010
Csa15630s010
Csa15g001040
Csa15g003310
Csa15g005200
Csa15g006820
Csa15g009190
Csa15g009690
Csa15g015060
Csa15g015170
Csa15g015870
Csa15g017250
Csa15g018160
Csa15g020510
Csa15g020520
Csa15g024730
Csa15g030670
Csa15g030720
Csa15g036990
Csa15g037020
Csa15g038250
Csa15g044590
Csa15g044630
Csa15g047140
Csa15g055710
Csa15g056820
Csa15g062040
Csa15g064440
Csa15g064630
Csa15g071180
Csa15g071330
Csa15g072130
Csa15g076730
Csa15g082310
Csa15g082460
Csa16010s010
Csa16226s010
Csa16399s010
Csa16807s010
Csa16g005560
Csa16g007900
Csa16g007910
Csa16g009720
Csa16g009880
Csa16g012520
Csa16g012600
Csa16g012610
Csa16g013090
Csa16g014140
Csa16g014160
Csa16g014170
Csa16g014200
Csa16g014220
Csa16g014340
Csa16g016450
Csa16g018260
Csa16g018920
Csa16g023330
Csa16g023370
Csa16g025400
Csa16g027660
Csa16g028130
Csa16g029120
Csa16g030190
Csa16g030270
Csa16g030440
Csa16g030480
Csa16g032620
Csa16g034310
Csa16g034750
Csa16g035210
Csa16g035380
Csa16g035760
Csa16g036280
Csa16g036290
Csa16g037780
Csa16g043680
Csa16g044640
Csa16g046420
Csa16g047430
Csa16g047970
Csa16g048240
Csa16g049360
Csa16g050830
Csa16g056420
Csa16g057140
Csa16g057410
Csa17228s010
Csa17g001170
Csa17g001600
Csa17g001930
Csa17g002070
Csa17g007100
Csa17g007110
Csa17g007360
Csa17g007380
Csa17g009010
Csa17g009190
Csa17g009490
Csa17g009500
Csa17g009510
Csa17g009670
Csa17g009740
Csa17g009750
Csa17g011150
Csa17g012480
Csa17g014110
Csa17g016790
Csa17g018560
Csa17g019120
Csa17g019580
Csa17g020940
Csa17g023250
Csa17g024050
Csa17g024200
Csa17g025010
Csa17g028830
Csa17g030380
Csa17g033940
Csa17g035040
Csa17g035670
Csa17g035700
Csa17g035710
Csa17g035720
Csa17g041880
Csa17g043320
Csa17g053070
Csa17g053080
Csa17g059350
Csa17g059740
Csa17g059750
Csa17g065810
Csa17g073110
Csa17g079260
Csa17g089790
Csa17g090310
Csa17g090330
Csa17g092760
Csa17g093760
Csa17g094450
Csa17g094460
Csa17g094480
Csa17g094780
Csa17g094950
Csa17g095480
Csa17g099080
Csa17g099420
Csa17g099440
Csa18078s010
Csa18558s010
Csa18758s010
Csa18772s010
Csa18940s010
Csa18g002460
Csa18g002470
Csa18g008530
Csa18g008540
Csa18g017450
Csa18g021820
Csa18g022430
Csa18g022770
Csa18g022830
Csa18g023050
Csa18g023240
Csa18g024400
Csa18g024410
Csa18g026750
Csa18g026760
Csa18g026780
Csa18g026800
Csa18g026820
Csa18g026870
Csa18g026880
Csa18g029380
Csa18g030170
Csa18g030780
Csa18g031870
Csa18g032030
Csa18g032630
Csa18g033050
Csa18g034100
Csa18g035740
Csa18g035760
Csa18g035920
Csa18g037100
Csa18g040500
Csa18g042500
Csa19165s010
Csa19510s010
Csa19585s010
Csa19g003550
Csa19g005080
Csa19g005650
Csa19g006470
Csa19g007800
Csa19g008710
Csa19g011740
Csa19g013410
Csa19g015390
Csa19g015530
Csa19g016150
Csa19g016160
Csa19g018500
Csa19g020460
Csa19g020850
Csa19g022670
Csa19g023650
Csa19g024450
Csa19g025550
Csa19g025630
Csa19g026660
Csa19g031630
Csa19g031830
Csa19g033960
Csa19g033970
Csa19g033980
Csa19g034010
Csa19g035580
Csa19g036290
Csa19g039800
Csa19g039890
Csa19g042190
Csa19g046290
Csa19g047840
Csa19g048040
Csa19g049170
Csa19g050630
Csa19g050870
Csa19g057610
Csa19g057960
Csa20138s010
Csa20670s010
Csa20853s010
Csa20939s010
Csa20g001130
Csa20g003920
Csa20g006850
Csa20g008180
Csa20g008300
Csa20g009600
Csa20g009940
Csa20g009950
Csa20g009960
Csa20g010030
Csa20g010410
Csa20g011880
Csa20g016150
Csa20g016570
Csa20g017590
Csa20g018060
Csa20g018070
Csa20g018940
Csa20g019320
Csa20g021430
Csa20g021810
Csa20g024050
Csa20g024150
Csa20g024170
Csa20g024180
Csa20g024190
Csa20g025120
Csa20g029070
Csa20g032360
Csa20g036060
Csa20g036070
Csa20g039470
Csa20g039490
Csa20g039500
Csa20g039520
Csa20g039540
Csa20g039550
Csa20g041370
Csa20g041380
Csa20g041400
Csa20g048720
Csa20g048730
Csa20g051560
Csa20g058850
Csa20g072180
Csa20g073440
Csa20g073450
Csa20g081800
Csa20g081810
Csa20g081820
Csa21383s010
Csa21740s010
Csa22276s010
Csa22574s010
Csa23699s010
Csa23721s010
Csa24110s010
Csa24404s010
Csa24700s010
Csa25431s010
Csa26268s010
Csa26874s010
Csa26989s010
Csa27191s010
Csa28125s010
Csa29178s010
Csa29237s010
Csa31267s010
Csa31435s010
Csa33248s010
Csa33412s010
Csa34923s010
Csa35871s010
Csa36287s010
NA.1
NA.10
NA.11
NA.12
NA.13
NA.14
NA.15
NA.16
NA.17
NA.18
NA.19
NA.2
NA.20
NA.21
NA.22
NA.23
NA.24
NA.25
NA.26
NA.27
NA.28
NA.29
NA.3
NA.30
NA.31
NA.32
NA.33
NA.34
NA.35
NA.36
NA.37
NA.38
NA.39
NA.4
NA.40
NA.41
NA.42
NA.43
NA.44
NA.45
NA.46
NA.47
NA.48
NA.5
NA.6
NA.7
NA.8
NA.9

b
1

p
u
o
r
G
2

p
u
o
r
G
3

p
u
o
r
G
4

p
u
o
r
G

60
40
20
0
60
40
20
0
60
40
20
0
60
40
20
0

P
P
e
e
r
c
r
c
e
e
n
n
t
t
a
a
g
g
e
e
o
o
f
g
f
e
g
n
e
e
n
s
e
s

g
n

i
t
i

d
p

i

i
l

d
E

n
w
o
n
k
n
U

n
o

g
n

i
t

i
l

a
g
n
o
E
A
F

l

i

a
n
g
S
d
p

i

i
l

i

s
s
e
h

t

n
y
S
d
p

i

i

t

s
s
e
h
n
y
S
x
a
W

o
h
p
s
o
h
P

i
l

o
h
p
s
o
h
P

t
r
o
p
s
n
a
r
T

i

t

s
s
e
h
n
y
S
G
A
T

m
s

i
l

t

o
b
a
e
M
n
p

i

i
l

y
x
O

n
o

i
t

a
d
a
r
g
e
D
G
A
T
−
A
F

i

i

s
s
e
h

s
s
e
h

t

t

n
o

i
t

l

a
u
g
e
R

i

t

s
s
e
h
n
y
S
A
F

i

s
s
e
h

d
p

i

i
l

o

i

s
s
e
h

t

n
y
S
d
p

i

i
l

l

t
c
a
a
G
c
i
t

t

n
y
S
d
p

i

i
l

n
y
S
n

l

a
n
o

i
t

u
C

i

s
s
e
h

t

n
y
S
n
i
r
e
b
u
S

o

f
l

u
S

o
y
r
a
k
u
E

i

o
g
n
h
p
S

i
t

p
i
r
c
s
n
a
r
T

d

l

i
t
s
a
P
m
o
r
F

t
r
o
p
x
E

n
o

i
t

a
r
u

t

a
s
e
D
n
o

i
t

a
g
n
o
E
A
F

l

o

f
l

u
S
d
p

i

i
l

o

l

t
c
a
a
G
c
i
t
o
y
r
a
k
o
r
P

i

g
n
k
c
i
f
f

a
r
T
d
p
L

i

i

i

s
s
e
h

t

i

n
y
S
d
c
A
c
o
p
L

i

i

m
s

i
l

o
b
a

t

e
M
A
F

l

a
i
r
d
n
o
h
c
o
t
i

M

i

s
s
e
h

t

n
y
S
d
p

i

i
l

o
h
p
s
o
h
P

l

a
i
r
d
n
o
h
c
o
t
i

M

n
y
S
e
d
i
r
a
h
c
c
a
s
y
o
p
o
p
L

l

i

l

a
i
r
d
n
o
h
c
o
t
i

M

TAG Synthesis

FA Synthesis

FA Elongation & Desaturation

400

300

200

100

0

300

200

100

0

5 8 10 11 12 14 16

16−21 18 20 22 24 26
35−39
25−29

S
G

5 8 10 11 12 14 16

16−21 18 20 22 24 26
35−39
25−29

S
G

5 8 10 11 12 14 16

16−21 18 20 22 24 26
35−39
25−29

S
G

DPA

c

M
P
T

8,000

6,000

4,000

2,000

0

(d)
d

s
F
T

20

10

0

l

t

d
e
a
e
R
−
B
Y
M

3
B

I

P
Z
b

C
A
N

H
L
H
b

2
H
2
C

Y
K
R
W

B
H

H
3
C

F
S
H

F
R
A

S
A
R
G

1
P
V
3
B
A

I

F
R
E
−
F
R
E
2
P
A

/

D
B
L

B
Y
M

D
H
−
f
z

P
B
S

1
S
E
B

Z
T
A
L
P

f

o
D
−
2
C
2
C

I

P
Z
−
D
H
−
B
H

P
B
E
R
E
−
2
P
A

e
p
y
t
−
M
−
S
D
A
M

I

L
E

i
l

e
k

X
O
W
−
B
H

I

C
K
M
−
S
D
A
M

−
2
G
−
P
R
A
G
Family

T
L
U

T
R
H

e
k

i
l

−
a
F
1
S

C
P
B
−
R
B
B

e
k

i
l

−
O
C
−
2
C
2
C

A
T
M
A
C

e
k

i
l

−
2
G

P
B
e
G

A
Y
−
F
N

C
Y
−
F
N

P
F
O

K
R
−
P
W
R

y
f
i

T

P
C
T

B
−
R
R
A

e
k

i
l

−
n

i
f
l

A

P
P
C

F
R
G

S
D
A
M

B
Y
−
F
N

S
R
S

x

i
l

e
h
i
r
T

A
T
A
G
−
2
C
2
C

Y
B
B
A
Y
−
2
C
2
C

Figure 3.1. Expression dynamics of LRGs during seed development  
a. Heatmap representing mRNA accumulation information data highlighting four LRG clusters 
(rows). The clusters were generated based on the expression level of the corresponding genes  

40 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 3.1 (cont’d) 
during sixteen timepoints across Camelina seed development including samples immediately after 
germination (columns). In total we analyzed 2,765 LRGs collected from CamRegBase. DAP: Days 
after pollination. GS: Germinated seed. b. Bar graph indicating the percentage of LRGs assigned 
to different lipid-related processes by each of the clusters of expression presented in (a). The lipid-
related processes were defined based on homology with Arabidopsis and following the AraLip 
classification.  c. Expression variation across seed development of three major LRG groups. d. Bar 
graph indicating the number of TF classified by families identified as potential lipid-metabolism 
regulators  in  Camelina.  Red  color  indicates  TF  families  significantly  enriched  (FDR  <  0.05, 
Fisher’s Exact Test).  

3.3.2 Identification of candidate lipid transcriptional regulators by co-expression analysis 

To  identify  candidate  genes  encoding  TFs  potentially  associated  with  the  regulation  of 

Camelina  LRGs,  we  estimated  the  mutual  information  (MI)  between  each  of  the  5,590  TFs 

annotated in CamRegBase and each gene in the genome using all the available Camelina gene 

expression  data.  For  each  TF,  we  extracted  the  highest  200  genes  (average  MI  ≥  1)  as 

corresponding  to  the  co-expressed  genes  of  the  corresponding  TF.  We  then  evaluated  whether 

LRGs were statistically overrepresented [False Discovery Rate (FDR) < 0.05, Fisher’s Exact Test] 

within these 200 genes. From the 5,590 TFs analyzed, we identified 350 TFs that met the criteria. 

The  350  TFs  belonged  to  52  different  TF  families  and  those  with  the  highest  representation 

corresponded to MYB, AP2/ERF, bZIP, and bHLH families (Figure 3.1d).  

We compared our list of TF candidates with 36 Arabidopsis TFs known to participate in the 

regulation of lipid and/or seed development. The 36 Arabidopsis TF corresponded to 105 Camelina 

homologous genes, as reported in CamRegBase (Gomez-Cano et al., 2020), consistent with the 

hexaploid nature of the Camelina genome. We excluded ten out of the 105 Camelina TFs because 

of the absence of evidence for expression in the available Camelina expression data. We found a 

significant overlap between the TFs annotated by homology as Arabidopsis lipid regulators and 

those TFs predicted by our analysis (28 TFs overlapped, P-value < 0.05, Hypergeometric test), 

providing confidence in our approach. These 28 TFs included homologs of WRI1, WRI4, ABI3, 

41 

 
 
FUS3, LEC2, MYB9, MYB41, MYB107, MYB94, AGL15, VAL2, EEL, and DEWAX (Meinke 

et  al.,  1994;  Focks  and  Benning,  1998;  Bensmihen  et  al.,  2002;  Cernac  and  Benning,  2004; 

Tsukagoshi et al., 2007; Braybrook and Harada, 2008; Zheng et al., 2009; To et al., 2012; Go et 

al., 2014; Kosma et al., 2014; Lee and Suh, 2015; Lashbrooke et al., 2016; Lee et al., 2016; Zhang 

et al., 2016; Pouvreau et al., 2020). Noteworthy, not all the Camelina paralogs were co-expressed 

with the same number of LRGs. For example, one of the three Camelina homologs of Arabidopsis 

AtMYB94, AtMYB41, AtVAL2, AtWRI4, and AtDEWAX were not co-expressed significantly 

with LRGs. Similarly, only one of the three Camelina paralogs of AtAGL15 and AtLEC2 were 

present in the list of 350 Camelina TFs (Figure 3.2a).  

To prioritize Camelina TF candidates for functional studies, we ranked the 350 identified TFs 

based on the number of co-expressed LRGs. Notably, the top candidates also showed preferential 

expression  in  seeds,  as  indicated  by  the  seed  Z-scores  (Kryuchkova-Mostacci  and  Robinson-

Rechavi, 2017) (See Methods). From the ranked list, we selected the top 35 TFs, which included 

13 pairs of paralogs. From the paralog pairs, we selected only the TF with the largest number of 

co-expressed  LRGs  and  the  highest  expression,  resulting  in  a  final  list  of  22  TFs  that  were 

subjected to further analyses. Four of these TFs were homologs of known seed development and/or 

lipid metabolism regulators in Arabidopsis, corresponding to ABI3, FUS3, MYB9, and MYB107 

(Giraudat et al., 1992; Keith et al., 1994; Lashbrooke et al., 2016).  

To further characterize the TF candidates, we evaluated the conservation of the predicted TF-

LRG  associations  between  Camelina  and  Arabidopsis.  For  this,  we  re-analyzed  >250  publicly 

available Arabidopsis RNA-seq experiments using identical pipeline and metrics as for Camelina, 

selecting datasets similar to the samples used for the Camelina co-expression analyses. We focused 

specifically on our list of 22 Camelina TFs. Arabidopsis homologs of CsaMYB1 and CsaMYB3 

42 

 
were not expressed on the analyzed data and therefore were excluded from this analysis. In total, 

within the remaining 20 TFs, ten showed a conserved significant co-expression with LRGs (Figure 

3.2b). Substantiating our analyses, the three well-described Arabidopsis lipid regulators AtABI3 

(CsaABI3VP1-1), AtFUS3 (CsaABI3VP1-2), and AtMYB9 (CsaMYB2), were identified as part 

of  the  conserved  co-expression  associations.  This  co-expression  analysis  identified  seven 

Camelina TFs (and their Arabidopsis homologs) that had not been previously associated with lipid 

metabolism, 

including  CsaNAC1,  CsaNAC2,  Csazf-HD1,  CsaB3-1,  CsaAP2/B3-like-1, 

CsaULT1,  and  CsaLBD1  (Figure  3.2b).  The  remaining  ten  Camelina  TFs  that  did  not  show  a 

conserved  co-expression  with  Arabidopsis  LRGs  are  likely  to  correspond  to  Camelina-specific 

lipid regulators, or alternatively they are not involved in the control of lipid metabolism. 

43 

 
 
(a)
a

t

s
r
o
a
u
g
e
r

l

i

i

s
s
p
o
d
b
a
r
A
n
w
o
n
k

f
o

l

s
g
o
o
m
o
h

a
n

i
l

e
m
a
C

l

)
y
g
o
o
m
o
h

'

i

i

s
s
s
p
o
d
b
a
r
A
y
b
d
e
p
u
o
r
G

(

AGL18
ABI5
MINI3
WRI2
SHN3
MYB89
LEC1
ASIL1
MYB16
MYB61
WRI3
TCP4
MYB30
MYB106
COG1
VAL1
MYB96
SHN2
SHN1
BBM
ABI4
AGL15
LEC2
DEWAX
WRI4
VAL2
MYB94
MYB41
EEL
WRI1
MYB107
MYB9
FUS3
ABI3

(b)
b

CsaABI3VP1−1 [ABI3]

CsaABI3VP1−2 [FUS3]

CsaLBD1 [LBD40]

CsaAP2/B3−like−1 [REM16]

i

t

s
e
a
d
d
n
a
C
F
T
p
o

t

a
n

i
l

e
m
a
C

f
o

l

s
g
o
o
m
o
h

'

i

s
s
s
p
o
d
b
a
r
A

i

CsaULT1 [ULT1]

CsaB3−1 [REM17]

Csazf−HD1 [HB27]

CsaNAC2 [NAC60]

CsaNAC1 [NAC38]

CsaMYB2 [MYB107]

CsaHRT1 [ET2]

CsaC3H2 [C3H2]

CsaC3H1 [C3H2]

CsaC2C2−Dof1 [SCAP1]

CsaS1Fa−like−1 [S1FA−like]

CsabZIP1 [TGA4]

CsaWRKY1 [WRKY3]

CsaHB1 [HB4]

CsaHB2 [HB7]

CsaTify1 [PPD2]

0

10
30
20
LRGs in Top 200

40

FDR ≤ 0.05

FDR > 0.05

10

0
20
LRGs in Top 200

30

40

Figure 3.2 Co-expression of known lipid/seed development regulators and LRGs in Camelina 
and Arabidopsis  
The bar graphs show the total number of LRG co-expressed with (a) Camelina homologs of each 
Arabidopsis  TF  (note  that  there  are  three  bars  for  each  Arabidopsis  regulator  because  of  the 
hexaploid  nature  of  the  Camelina  genome),  or  (b)  Arabidopsis  homologs  (names  in  square 
brackets) for the Camelina top TFs. The color of the bar indicates the significance of the number 
LRG co-expressed (light-red, FDR ≤ 0.05; turquoise, FDR > 0.05). 

44 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3.3.3 Establishing the DNA-binding landscape of the candidate transcription factors 

To further characterize the 22 TFs and to identify potential target genes, we applied DAP-seq 

(O’Malley et al., 2016). We synthesized and cloned the corresponding open reading frames (ORFs) 

for the 22 TFs in a vector that permitted expression of the protein fused at the N-terminus to a 

Halo-tag (Bartlett et al., 2017). We also generated a Camelina unmethylated DAP-seq DNA library 

(ampDAP-seq) from green tissues of mature plants (see Methods). We reasoned that unmethylated 

DNA better captures the majority of the PDIs in which these TFs are likely to participate (O’Malley 

et al., 2016), and eliminates variations in methylation patterns between cell types or tissues. We 

performed DAP-seq in duplicate for each Halo-TF, and with the Halo-tag alone as the control. We 

obtained on average 25.5 million reads per sample, out of which about half mapped uniquely to 

the available Camelina genome (v2, cv. DH55) (Kagale et al., 2014). To assess the variance and 

reproducibility  of  the  experiments,  we  performed  a  principal  component  analysis  (PCA)  using 

uniquely mapped reads. The first two PC showed all TFs well separated from the control (Halo). 

However,  we  also  observed  five  TFs  with  strikingly  different  replicates,  indicating  low 

reproducibility between them. For each TF, we also analyzed the similarity of the uniquely mapped 

reads  between  each  pair  of  replicates,  which  confirmed  the  differences  observed  on  the  PCA 

analysis for replicates of the five TFs. Based on these observations, we discarded the DAP-seq 

results obtained for CsaABI3VP1-1 and CsaB3-1 (because of its high correlation with the HALO 

control),  and  settled  on  analyzing  the  replicates  of  CsaMYB2,  CsaULT1,  and  CsaTify1 

independently  (replicates  with  PCC  <  0.7).  For  the  remaining  17  TFs,  DNA-binding  regions 

(peaks) were called using both replicates. Thus, in total, 20 TFs were tested for the presence of 

peaks.  The  number  of  identified  peaks  varied  greatly  between  the  TFs,  with  CsaC2C2-Dof1 

showing >100,000 peaks, and four TFs having less than 500 peaks (CsaTify1, CsaS1Fa-like-1, 

45 

 
CsaMYB2, and CsaULT1), which were not further used. In consequence, a total of 16 TFs were 

kept for further analyses.  

The analysis of the distance between the peak summit and the closest annotated transcription 

start sites (TSSs) indicated that, on average, ~63% of the total peak summits are within 3 kbs of 

the TSSs. Thus, our results are in agreement with the peak genomic distribution patterns previously 

observed in DAP-seq experiments for Arabidopsis and maize (O’Malley et al., 2016; Galli et al., 

2018). We compared, in terms of successful identification of TF binding motifs, all our DAP-seq 

results  (including  those  which  failed  to  pass  the  quality  controls)  with  those  performed  in 

Arabidopsis  (O’Malley  et  al.,  2016)  and  determined  that  17  common  TFs  were  tested  (TF 

homologs). To note, 3/17 TFs did not work in either plant, 7/17 TFs worked in Camelina but not 

in Arabidopsis, and 6/17 TFs worked in both plants. The remaining TF (AtMYB107 homolog of 

CsaMYB2)  worked  only  in  Arabidopsis,  likely  related  to  the  lack  of  MYB  domains  on  the 

Camelina annotated transcript. Finally, the corresponding genes for CsaMYB1, CsaNAC2, and 

CsaC3H2  were  not  previously  tested  in  Arabidopsis.  In  summary,  we  provide  here  high-

confidence DNA-binding data for 16 TFs, of which 10 were previously unknown in Arabidopsis.  

To  evaluate  the  quality  of  the  predicted  DAP-seq  peaks  of  the  corresponding  16  TFs,  we 

determined  the  log2  fold  change  of  the  binding  (log2FC,  See  Methods).  We  defined  high-

confidence  peaks  for  further  analyses  as  those  showing  log2FC  >  0.5  in  both  replicates,  which 

represented ~32.5% of the total peaks called (Figure 3.3). One additional criteria that we applied 

to  decide  whether  DAP-seq  provided  meaningful  information  or  not  was  the  enrichment  for 

particular TF-binding motifs (TFBM) within the recovered peaks, a widely accepted characteristic 

of the DNA fragments recognized by TFs (Lambert et al., 2018). To identify the TFBMs associated 

with each TF, we ranked all the high-quality peaks based on their log2FC, selected the top 1,000 

46 

 
peaks for each TF and identified the motif consensus using MEME-ChIP (Machanick and Bailey, 

2011). To evaluate the relevance of the predicted TFBMs in the context of the identified peaks, we 

searched each TFBM across the full set of peaks for each TF, focusing on two specific aspects: (1) 

The fraction of peaks that harbored the motif, and (2) the localization of the motif within the peak 

(distance to the summit). We carried out this analysis by extending each peak 50 bps around the 

summit (Figure 3.4). The most significant motifs identified for each of the 16 TFs corresponded 

to those with the largest abundance and which displayed a clear accumulation close to the summit 

of each peak (Motif 1 in Figure 3.4). Thus, for the rest of this study, we considered high-confidence 

peaks those that harbored such a motif, corresponding to ~92% of all the peaks evaluated.  

47 

 
Figure 3.3 Reproducibility analysis between TF replicates based on DNA-binding fold 
changes 

48 

 
 
Figure 3.3 (cont’d) 
We calculated the log2 of the binding fold change (log2FC) for the total predicted peaks for each 
TF dividing the number of reads obtained for each peak with Halo-TF by the number of reads 
obtain for the same peak for the Halo control. Peaks with log2FC ≥ 0.5 in both replicates were 
defined as highly reproducible peaks. 

We compared the DNA-binding specificities provided by DAP-seq between the corresponding 

six Camelina and Arabidopsis homologs. We re-analyzed all six Arabidopsis DAP-seq using the 

same pipeline employed in the current study. Five TF pairs (AtNAC38 and CsaNAC1; AtFUS3 

and  CsaABI3VP1-2;  AtMYB67  and  CsaMYB3;  AtTGA4  and  CsabZIP1;  AtWRKY3  and 

CsaWRKY1) showed almost identical DNA-binding preferences, suggesting that the amino acid 

residues that distinguish the Arabidopsis and Camelina homologs are not significantly affecting in 

vitro DNA-binding specificities. The only exception was CsaAP2/B3-like-1 for which none of the 

top motifs identified matched the TTTGGCGGGAA sequence consensus predicted for AtREM1. 

This result puzzled us, hence we decided to re-check if the Arabidopsis and Camelina genes were 

properly annotated. Indeed, we determined that one of the B3 domains that characterizes the DNA-

binding domain of AtREM1 (Romanel et al., 2009) was absent in the cloned CsaAP2/B3-like-1 

ORF, because of a likely error in the current Camelina genome annotation. Taken together, we 

identified  the  DNA-binding  patterns  for  16  Camelina  TFs,  and  determined  a  similar 

correspondence with the Arabidopsis homolog, when available.  

49 

 
 
a
(a)

1

f
i
t
o
M

1

1

(c)
c

f
i
t

o
M

3
1

3
1

3
1

3
1

3
1

3
1

CsaABI3VP1−2

M1 CsaABI3VP1-2

b
(b)(b)

CsaC2C2−Dof1

CsaC3H1

CsaC3H2

60

40

20

80
Start Position of Motif
in Peak (bps)

CsaAP2/B3−like−1

CsabZIP1

CsaHB1

CsaMYB3

CsaNAC1

CsaWRKY1

60

40

20

80
Start Position of Motif
in Peak (bps)

M1 CsaC3H1

M1 CsaC3H2

M3 CsaAP2/B3-like-1

M1 CsaAP2/B3-like-1

M3 CsabZIP1

M1 CsabZIP1

M3 CsaHB1

M1 CsaHB1

M3 CsaMYB3

M1 CsaMYB3

M3 CsaNAC1

M1 CsaNAC1

M3 CsaWRKY1

M1 CsaWRKY1

f
i
t

o
M

3
2
1

3
2
1

3
2
1

3
2
1

3
2
1

3
2
1

3
2
1

CsaHB2

CsaHRT1

CsaLBD1

CsaMYB1

CsaNAC2

Csazf−HD1

60

40

20

80
Start Position of Motif
in Peak (bps)

M3 CsaC2C2-Dof1

M2 CsaC2C2-Dof1

M1 CsaC2C2-Dof1

M3 CsaHB2

M2 CsaHB2

M1 CsaHB2

M3 CsaHRT1

M2 CsaHRT1

M1 CsaHRT1

M3 CsaLBD1

M2 CsaLBD1

M1 CsaLBD1

M3 CsaMYB1

M2 CsaMYB1

M1 CsaMYB1

M3 CsaNAC2

M2 CsaNAC2

M1 CsaNAC2

M3 Csazf-HD1

M2 Csazf-HD1

M1 Csazf-HD1

Motif Z−score

0

2

4

6

Motif Z−score

0

2

4

6

Figure 3.4 Distribution of predicted TF binding motifs (TFBMs) in the predicted peaks 
The prediction of TFBMs resulted in up to three different motifs for some of the TFs. To identify 
the main motif, we counted the frequency and location of each of the predicted TFBMs in all the 
predicted peaks. The frequency of each TFBMs is presented as the motif Z-score in a heatmap 
indicating  the  start  position  of  the  TFBMs  on  the  peak.  Each  TFBM  was  tested  independently 
regardless of whether the TFs have a single (a), two (b), or three (c) TFBMs. 

50 

 
 
 
3.3.4 Predicting gene targets for the selected TFs 

To identify potential gene targets for the 16 TF candidates, we determined which genes were 

located within 3 kbps of each high-confidence peak summit, since 3 kbps capture many of the 

biologically-relevant TF-target gene interactions (Springer et al., 2019). We identified a total of 

31,898 potential targets for these 16 TFs, with CsaMYB1 and CsaHRT1 showing the largest (6,816 

genes) and lowest (9 genes) number of target genes, respectively. As a first step towards assessing 

the  biological  significance  of  the  DAP-seq  results  and  its  concordance  with  the  co-expression 

prediction, we tested if the predicted targets were enriched in LRGs and/or in TFs associated with 

the control of LRGs and seed development (TF-LRG/development). Tellingly, 4/16 and 6/16 sets 

of  targets  showed  significant  enrichment  (P-value  ≤  0.05,  Fisher-exact  test)  on  LRGs  and  TF-

LRG/development  targets,  respectively  (Figure  3.5a).  Moreover,  CsaABI3VP1-2  (Camelina 

homolog of AtFUS3) showed enrichment on both sets of genes, suggesting an important role in 

lipid metabolism. Thus, in total, 9 out of the 16 TFs tested showed a significant enrichment for 

target genes associated with lipid metabolism in Camelina.  

Previously, we showed that half of the candidate TF homologs in Arabidopsis were enriched 

in LRGs by co-expression (Figure 3.5b). Thus, we tested if they were also enriched in target genes 

annotated as LRGs, as we found for the Camelina TFs (Figure 3.5a). We performed the analysis 

with  the  six  Arabidopsis  TFs  for  which  we  previously  evaluated  TFBMs.  The  analysis  of  the 

Arabidopsis DAP-seq data was performed using the same pipeline and controls as we used for the 

Camelina data. Out of the six Arabidopsis TFs, five showed enrichment for target genes annotated 

as LRGs and TF-LRG/development (Figure 3.5b). This finding, along with the conservation of the 

corresponding  TFBMs,  suggests  conservation  of  the  corresponding  regulatory  functions. 

Curiously, the five Arabidopsis TFs showed target enrichment for both type of genes: LRGs and 

51 

 
TF-LRGs  (Figure  3.5b).  This  contrasts  with  what  we  found  for  the  corresponding  Camelina 

homologs which showed in all cases but one either enrichment for LRGs or TF-LRGs but not in 

both (Figure 3.5a). Finally, neither CsaAP2/B3-like-1 nor AtREM16 showed targets enriched in 

LRGs  or  TF-LRG/development.  While  this  is  consistent  with  the  possibility  that  we  used  a 

truncated protein for CsaAP2/B3-like-1, our results suggest that AtREM16 plays a secondary role 

as a lipid metabolism regulator.  

To characterize other functional roles of the set of predicted target genes associated with the 

corresponding TFs, we investigated enrichment for Gene Ontology (GO) terms. All the TFs tested 

have at least one GO term enriched that is lipid-related. After removing redundant and general 

terms, we clustered all the TFs based on the top 10 GOs for each (based on P values), allowing us 

to separate them into two main clusters. One cluster (indicated in green) was associated with a 

wide range of GO terms, including regulation of development and several metabolic processes, 

particularly lipid metabolism-related functions, as well as phenylpropanoid and carboxylic acid 

biosynthesis  processes.  Members  of  the  other  cluster  (indicated  in  orange,  Figure  3.6)  have  in 

common  the  terms  signal  transduction,  defense  responses,  regulation  of  gene  expression,  and 

regulation of nitrogen compounds. When all the data is considered together, these analyses provide 

additional evidence that the DAP-seq results bore biologically meaningful targets and support the 

initial  co-expression  predictions,  including  the  discovery  of  previously  unrecognized  candidate 

regulators of lipid-related genes. 

52 

 
 
 
LRGs

TF LRG/Development

0

100 200 300 400 0

10

20

30

40

Targets

Enrichment

P value <= 0.05

P value > 0.05

LRGs

TF LRG/Development

c

d

GO Group
CsaC2C2−Dof1
CsaABI3VP1−2
CsaHB1
CsaWRKY1
CsabZIP1
Csazf−HD1
CsaMYB1
CsaMYB3
CsaC3H1
CsaC3H2
CsaNAC1
CsaNAC2
CsaAP2/B3−like−1
CsaHB2
CsaHRT1
CsaLBD1

8671 1620 786 1193 792 171 933 734 159

82

215 342

41

21

1620 11477 1084 1453 1138 235 1183 922 166

88

282 461

31

13

786 1084 4829 602 468 148 509 415

72

31

134 187

22

10

1193 1453 602 8307 767 155 922 658

88

44

211 286

35

14

792 1138 468 767 5891 99

574 495

62

32

166 231

16

171 235 148 155

99

975 117

80

23

14

34

37

3

9

4

933 1183 509 922 574 117 6789 2376 103

58

151 273

18

13

734 922 415 658 495

80 2376 5434 72

39

120 214

15

10

159 166

72

88

62

23

103

72

825 253

27

22

82

88

31

44

32

14

58

39

253 402

18

13

215 282 134 211 166

34

151 120

27

18 1427 131

342 461 187 286 231

37

273 214

22

13

131 2592

41

31

22

35

16

9

3

4

I

1
P
Z
b
a
s
C

21

13

10

14

2

6

1
B
H
a
s
C

2

9

1
Y
K
R
W
a
s
C

3

7

1
f
o
D
−
2
C
2
C
a
s
C

3

16

2
−
1
P
V
3
I
B
A
a
s
C

3

4

2

0

1
D
H
−
f
z
a
s
C

18

15

13

10

0

12

1
B
Y
M
a
s
C

1

6

3
B
Y
M
a
s
C

6

5

0

0

5

5

0

0

3

2

0

1

9

3

0

2

1
H
3
C
a
s
C

2
H
3
C
a
s
C

1
C
A
N
a
s
C

2
C
A
N
a
s
C

5

5

2

3

8

96

0

0

2
B
H
a
s
C

6

5

3

9

162

8

0

1

1
−
e
k

i
l

−
3
B
/
2
P
A
a
s
C

3

3

2

2

3

2

0

1

0

0

0

0

0

0

14

7

16

6

9

4

0

12

6

0

0

1

2

1

0

1

1

100

1
T
R
H
a
s
C

1
D
B
L
a
s
C

 −log10(P.val)

0 20 40 60 80

0

50

100 150 200 250 0.0
Targets

2.5

5.0

7.5 10.0 12.5

Enrichment

P value <= 0.05

P value > 0.05

0

0

CsaMYB1 
& 
CsaWRKY1

C
s
a
M
Y
B
1
~
C
s
a
W
R
K
Y
1

Distance between summits (kbp)
Distance between peaks' summits (kbps)

2

2

4

4

a

CsaLBD1

CsaHRT1

CsaHB2

CsaAP2/B3−like−1

F
T

CsaC3H2

Csazf−HD1

CsaC3H1

CsaNAC1

CsaNAC2

CsaHB1

CsabZIP1

CsaMYB3

CsaC2C2−Dof1

CsaWRKY1

CsaMYB1

CsaABI3VP1−2

F
T

REM16

WRKY3

FUS3

NAC38

TGA4

MYB67

b

e

CsaMYB1_R1

CsaMYB1_R2

CsaWRKY1_R1

CsaWRKY1_R2

Halo_R1

Halo_R2

Csa09g007470 (LOH1)

1 kb

Csa04g040040 (DEI1)

1 kb

Figure 3.5 Regulatory landscape of predicted lipid-related regulators based on DAP-seq 
The bar graph indicates the number of predicted target genes annotated as (a) Camelina or (b) 
Arabidopsis  LRGs.  Arabidopsis  TFs  correspond  to  homologs  of  the  Camelina  predicted 
candidates. The red color indicates the significance of the overlap of target genes annotated as 
LRG  vs  the  total  number  of  predicted  target  genes  for  each  of  the  tested  TF  (P-value  ≤  0.05, 
Fisher’s Exact Test). c. Heatmap indicating the number of common targets between pairs of TFs. 
Color scale indicates the P-value associated with the corresponding number of common targets 
(Fisher’s Exact Test). d. Violin plot showing the distribution of distances (in bps) between summits  
53 

 
 
Figure 3.5 (cont’d) 
(peak centers) of CsaMYB1 and CsaWRKY1 mapped to common targets. The vertical dashed line 
indicates the most frequent distance between summits. e. IGV plots with co-binding profiles (peak) 
generated from the DAP-seq experiments of CsaMYB1 and CsaWRKY1 highlighting two shared 
targets with the respective gene models obtained from Camelina V2.0 at the bottom. Peaks heights 
correspond to the number of reads by bins (10 bp) per million mapped reads.   

regulation of nitrogen compound metaboli...
regulation of primary metabolic process
aromatic compound biosynthetic process
regulation of macromolecule biosynthetic...
signal transduction
response to organic substance
response to other organism
defense response
phosphorus metabolic process
cellular protein modification process
protein modification process
cellular response to endogenous stimulus
hormone−mediated signaling pathway
lipid metabolic process
secondary metabolite biosynthetic proces...
interspecies interaction between organis...
modification of morphology or physiology...
response to jasmonic acid
S−glycoside metabolic process
glucosinolate biosynthetic process
endoplasmic reticulum tubular network or...
regulation of transcription by RNA polym...
root system development
phenylpropanoid metabolic process
organic acid biosynthetic process
carboxylic acid biosynthetic process
cellular lipid metabolic process
regulation of growth
plant−type cell wall organization or bio...
enzyme linked receptor protein signaling...
lignin metabolic process
cellular amino acid biosynthetic process
aromatic amino acid family metabolic pro...
regulation of meristem development
cinnamic acid biosynthetic process
mitochondria−nucleus signaling pathway
cytochrome complex assembly
cytochrome b6f complex assembly
mature ribosome assembly
nicotinate metabolic process
oxylipin biosynthetic process
fruit ripening
regulation of fertilization
regulation of ion transport
response to mechanical stimulus
generative cell differentiation
regulation of nitrogen utilization
starch metabolic process
ribonucleoprotein complex assembly
calcium ion transmembrane transport
nickel cation transport
negative regulation of photomorphogenesi...
glycogen metabolic process
energy reserve metabolic process
transmembrane transport
negative regulation of metabolic process

1
B
Y
M
a
s
C

2
H
3
C
a
s
C

3
B
Y
M
a
s
C

1
C
A
N
a
s
C

1
D
B
L
a
s
C

1
T
R
H
a
s
C

2
B
H
a
s
C

1
H
3
C
a
s
C

1
B
H
a
s
C

I

1
P
Z
b
a
s
C

1
f
o
D
−
2
C
2
C
a
s
C

1
D
H
−
f
z
a
s
C

2
−
1
P
V
3
I
B
A
a
s
C

2
C
A
N
a
s
C

1
Y
K
R
W
a
s
C

1
−
e
k

i
l

−
3
B
/
2
P
A
a
s
C

% Targets

0 5 10 15

Figure 3.6 Heatmap and hierarchical clustering of TF candidates based on the top 10 GO 
terms significantly enriched 

54 

 
 
Figure 3.6 (cont’d) 
GO  terms  not  significantly  enriched  are  shown  in  white.  The  color  indicates  the  percentage  of 
targets annotated within the corresponding GO term. 

Similarities  in  the  functional  annotation  of  target  genes  for  TF  pairs  may  indicate  that  the 

corresponding TFs share common targets. Alternatively, the TFs could regulate different genes in 

the  same  process/pathway.  To  distinguish  between  these  two  possibilities,  we  evaluated  the 

overlap in targets between the 16 TFs. Almost half of the comparisons showed significant target 

overlaps  (P-value  <  0.05,  Fisher’s  Exact  Test)  (Figure  3.5c,  darker  colors  indicate  smaller  P-

values).  As  anticipated,  TFs  from  the  same  family  (CsaMYB1  and  CsaMYB3;  CsaNAC1  and 

CsaNAC2) had the largest number of shared target genes, likely driven by the very similar in vitro 

DNA-binding  consensus  of  the  corresponding  TFs.  Noteworthy,  while  significant,  the  overlap 

comprises only a subset of all the targets for each of these TFs, suggesting that outside the shared 

core motif, each TF has specific DNA-binding preferences (Figure 3.4). Many of the TF pairs have 

overlapping  targets  (e.g.,  CsaMYB1  and  CsaMYB3;  CsaHB1,  CsaABI3VP1-2,  and  CsaC2C2-

Dof1; CsaNAC1, CsaNAC2, and CsaAP2/B3-like-1), indicating that they function in the control 

of  related  biological  processes.  We  explored  this  hypothesis  by  comparing  two  of  the  non-

homologous TF pairs with the highest number of common targets, corresponding to CsaMYB1-

CsaWRKY1 and CsabZIP1-CsaHB1, which had 922 and 468 common targets, respectively. For 

the  CsaMYB1-CsaWRKY1  pair,  we  found  that  shared  targets  were  enriched  in  multiple  lipid-

related GO terms at several levels of the GO hierarchy, including carboxylic acid biosynthesis and 

very  long-chain  fatty  acid  biosynthesis.  Contrary  to  the  pattern  observed  for  CsaMYB1-

CsaWRKY1,  common  targets  of  CsabZIP1-CsaHB1  were  enriched  in  a  more  diverse  list  of 

biological  processes  not  observed  on  the  corresponding  individual  list  of  enriched  GO  terms, 

including flavone biosynthesis, regulation of transcription, activation of protein kinase activity, 

55 

 
 
root hair cell tip growth, and leaf senescence, suggesting that their role in lipid metabolism control 

is not linked to common target genes in the pathway.  

To further understand the potential participation of CsaMYB1 and CsaWRKY1 in gene co-

regulation of their common targets, we evaluated the distribution of binding sites in the 922 shared 

targets. For most of them, the binding sites were within a few hundred base pairs apart from each 

other (the average distance was 320 bps; Figure 3.5d), highlighting a possible cooperative work at 

the DNA level (post-DNA binding) (Reiter et al., 2017). The proximity and potential significance 

for  transcriptional  regulation  is  exemplified  by  the  two  shared  targets  Csa03g002110  and 

Csa04g040040  (Figure  3.5e),  Arabidopsis  homologs  of  3-KETOACYL-COA  SYNTHASE 

(KCS1, At1g01120) and PASTICCINO 1 (PAS1/DEI1, At3g54010), which are involved in FA 

and VLCFA synthesis (Shang et al., 2016; Roudier et al., 2010), respectively, further underscoring 

the potential regulatory role of CsaMYB1 and CsaWRKY1 on lipid metabolism. 

3.3.5 Identified TFs associate with distinct aspects of lipid metabolism  

To better understand the specific aspects of lipid metabolism that each of the identified TFs 

might be involved with, we scored how many targets of each TF corresponded to each of the lipid 

pathway  categories  (as  presented  in  Figure  3.1b).  In  total,  11/16  TFs  showed  significant 

enrichment for targets annotated across several lipid-related processes (P-value < 0.05, Fisher’s 

Exact  Test).  As  examples,  CsaMYB3,  CsaMYB1,  CsaWRKY1,  and  CsaABI3VP1-2  were 

enriched  in  more  than  four  different  processes,  with  their  top  target  processes  being  suberin 

synthesis  (18.2%),  cutin  synthesis  (25.3%),  and  transcriptional  regulation  (18.1%  and  41.9%), 

respectively.  Remarkably,  several  combinations  of  TFs  showed  significant  enrichment  for  the 

same processes. Finally, we also observed that the targets for CsaABI3VP1-2 and CsaWRKY1 

56 

 
were  significantly  enriched  in  genes  associated  with  TAG  synthesis  (22.6%)  and  FA-TAG 

degradation (15.7%), respectively, which are core processes in the accumulation of seed oil.  

In parallel, to evaluate the biological significance of the regulatory interactions predicted at the 

pathway  co-expression  level,  we  applied  the  gene  set  enrichment  analysis  (GSEA)  algorithm 

(Subramanian et al., 2005) using the Pearson Correlation Coefficient (PCC) as the scoring metric. 

Thus, significant positive and negative enrichment values indicate association of the corresponding 

TF  with  a  metabolic  pathway  in  a  positive  or  negative  fashion,  respectively.  Also,  under  these 

conditions, GSEA permits the identification of TF-process relationships that have significant co-

expression signals at the pathway rather than as individual target gene levels (Subramanian et al., 

2005). Eight out of the sixteen TFs tested showed significant enrichment (P-value < 0.05) for at 

least  one  of  the  processes  tested.  CsaABI3VP1-2  showed  the  largest  number  of  significant 

associations (up to ten), including FA elongation and desaturation, FA and TAG synthesis, and 

transcriptional  regulation.  The  second  and  third  TFs  with  most  enriched  processes  were 

CsaWRYK1 and CsaMYB1, with seven each. We also observed eleven TF-process associations 

with negative enrichment scores, indicating enrichment for negative co-expression values, within 

which  CsaWRKY1,  CsaMYB1,  CsaMYB3,  and  CsabZIP1  are  included.  The  former  showed 

enrichment  for  negative  scores  on  its  corresponding  targets  annotated  under  FA  synthesis, 

transcriptional  regulation,  and  transport,  while  the  latter  with  targets  annotated  under  cutin 

synthesis, wax synthesis, and FA elongation. These results suggest major roles of these TFs as 

negative regulators of the mentioned pathways.  

Finally,  we  combined  both  sets  of  results  (target  enrichment  and  GSEA  results)  to  identify 

high-confidence  TF-process  associations.  Six  of  the  eleven  TFs  analyzed  showed  significant 

associations  in  both  tests  with  at  least  six  different  processes,  to  a  total  of  ten  TF-pathway 

57 

 
associations (pink edges in Figure 3.7). Transcriptional regulation was the process with the largest 

number of connections. CsaNAC1, CsaWRKY1, and CsaABI3VP1-2 were the three TFs with the 

largest number of associations (two for each of them, Figure 3.7). Three out of the ten TF-pathway 

associations  showed  significant  negative  enrichment  (Figure  3.7),  indicating  transcriptional 

repression roles of the corresponding TFs on the respective pathways. Also, it is worth noting that 

one of the main processes enriched for the targets of CsaMYB1 and CsaNAC2 was cutin synthesis 

(Figure  3.7).  CsaABI3VP1-2  was  the  only  TF  significantly  enriched  in  TAG  synthesis-  and 

transcriptional  regulation-related  targets  (Figure  3.7),  and  remarkably  we  found  that  the  large 

majority  of  the  targets  that  we  predicted  for  CsaABI3VP1-2  were  also  TF  targets  previously 

identified  for  AtFUS3  by  either  chromatin  immunoprecipitation-DNA  microarray  (ChIP-chip) 

(Wang  and  Perry,  2013)  or  DAP-seq  assays  (O’Malley  et  al.,  2016),  uncovering  potential 

Camelina-specific  interactions  as  well  as  unreported  Arabidopsis  targets.  Altogether,  these 

analyses underscore CsaABI3VP1-2 as a good candidate playing a major role in lipid metabolism 

in Camelina, similar to AtFUS3 (Yamamoto et al., 2010; Wang and Perry, 2013; Zhang et al., 

2016). 

58 

 
 
Suberin 
Synthesis 
[170] 

Oxylipin 
Metabolism 
[273] 

FA 
Elongation 
Desaturation 
[47] 

Export 
From 
Plastid 
[47] 

CsaABI3VP1-2 

FA 
Synthesis 
[152] 

TAG 
Synthesis 
[212] 

CsaC2C2-Dof1 

Transcriptional 
Regulation 
[105] 

CsaHB1

Sphingolipid 
Synthesis 
[108] 

CsaMYB3

Prokaryotic 
Galactolipid 
Sulfolipid 
[178] 

CsaNAC1

Wax 
Synthesis 
[766] 

CsaNAC2

FA 
Elongation 
[766] 

CsabZIP1

Unknown 
[191] 

Cutin 
Synthesis 
[79] 

CsaWRKY1 

FA-TAG 
Degradation 
[204] 

Phospholipid 
Signaling 
[365] 

Transport 
[227] 

Eukaryotic 
Galactolipid 
[114] 

Mitochondrial 
Lipopolysaccharide 
Synthesis 
[61] 

Sulfolipid 
Synthesis 
[114] 

CsaMYB1

Phospholipid 
Synthesis 
[352] 

CsaC3H2

CsaAP2/B3-like-1 

Target 
enrichment

co-expression 
enrichment

Target & co-
expression enrichment

Figure 3.7. High-confidence TF-process network 
Associations predicted based on target enrichment and GSEA using TF-target PCC as score metric 
are indicated by lines joining TFs (blue) and specific processes associated with lipid metabolism 
(black). The thickness of the edges represents the fraction of lipid-related genes in the pathway 
that is being targeted by the corresponding TF. The total number of genes annotated for each of 
the corresponding lipid-related processes are indicated inside square brackets. 

3.3.6 Dynamic behavior of the predicted networks during seed development 

To  gain  further  insights  on  the  regulatory  effect  of  the  identified  TF-target  interactions  in 

Camelina seeds, we performed a second co-expression analysis with GENIE3 (Huynh-Thu et al., 

2010) using only expression data from seeds. GENIE3 uses a regression tree and random forest 

algorithm to make regulatory prediction implying causality (Huynh-Thu et al., 2010). Thus, we 

assumed that predictions identified by GENIE3 and supported by DAP-seq are highly confident 

59 

 
 
 
regulatory interactions occurring specifically in seeds. The significance of the predicted score was 

assayed using a permutations test (FDR ≤ 0.001, 1,000 permutations). Overall, 35% of the targets 

identified by DAP-seq were also predicted as targets of the corresponding TFs by GENEI3 (Figure 

3.8a). The highest percentage of DAP-seq seed co-expressed targets was observed for CsaNAC2 

and CsabZIP1 (~54% each, Figure 3.8a). These results suggest that many of the predicted TF-

target associations have a regulatory effect in the context of seed development.  

To parse TFs involved in controlling FA and TAG-related genes in seeds, we combined the 

target  enrichment  and  the  GSEA  results  (Figure  3.7)  to  select  TFs  associated  with  the 

corresponding  pathways.  Consequently,  we  reduced  the  TF-target  DAP-seq  network  to  only 

targets  co-expressed  in  seeds  (as  predicted  by  GENIE3)  (Figure  3.8a).  With  this  subset  of  TF-

target interactions, we tested the enrichment for targets on the corresponding pathways once again 

to determine if the reduced TF-target network still had a significant number of targets associated 

with FA and TAG-related processes. Seven TFs showed enrichment for seed co-expressed targets 

associated with at least three different pathways (FDR < 0.05, Fisher’s Exact Test) (Figure 3.8b). 

FA elongation was the pathway most frequently targeted, with six different TFs associated with it 

(Figure  3.8b).  CsaABI3VP1-2  and  CsaMYB1  were  the  two  TFs  with  most  seed  co-expressed 

targets annotated under FA elongation. However, TAG synthesis and FA-TAG degradation were 

significantly targeted by just one TF each, CsaABI3VP1-2 and CsabZIP1, respectively (Figure 

3.8b).  

60 

 
 
 
a

c

d

s
t
e
g
r
a
T

s
t
e
g
r
a
T

F
T

CsaNAC2
CsabZIP1
CsaHB2
CsaABI3VP1−2
CsaWRKY1
CsaAP2/B3−like−1
CsaC3H2
CsaLBD1
CsaC3H1
Csazf−HD1
CsaHB1
CsaMYB1
CsaC2C2−Dof1
CsaNAC1
CsaMYB3

b

F
T

CsaMYB1

CsaABI3VP1−2

CsaHB1

CsaNAC2

CsabZIP1

CsaMYB3

CsaNAC1

0

20

40

Percentage targets
co−expressed (GENEI3)

0

10 20 30 40

Targets

Pathway

FA Elongation

FA−TAG Degradation

TAG Synthesis

CsaMYB1

Csa01g025830*, Csa15g038240*, Csa19g031820* AT3G22600 LTPG5

Csa08g012100*, Csa13g022350*, Csa20g029070* AT5G19410 ATP-BINDING CASSETTE G23

Camelina target

Arabidopsis homlog

*

*
*

*
*
*
*

*

*
*

*

*
*

*
*
*

Csa17g083150*, Csa03g055300*

AT1G49430 LONG-CHAIN ACYL-COA SYNTHETASE 2

Csa03g060520*

Csa06g040040*

Csa06g027980*

Csa09g034370*

Csa09g069120*

Csa10g007950*

Csa08g058550*

Csa08g005490*

AT1G53270 ATP-BINDING CASSETTE G10

AT2G37360 ATP-BINDING CASSETTE G2

AT3G53510 ATP-BINDING CASSETTE G20

AT3G44550 FATTY ACID REDUCTASE 5

AT3G56700 FATTY ACID REDUCTASE 6

AT4G34250 3-KETOACYL-COA SYNTHASE 16

AT5G05960 Inhibitor/lipid-transfer protein/seed storage 2S albumin

AT5G13580 ATP-BINDING CASSETTE G6

Csa03g058380, Csa14g059740, Csa17g090380 AT1G51500 ATP-BINDING CASSETTE G12

Csa10g008420, Csa11g009270, Csa12g009950 AT4G33790 FATTY ACID REDUCTASE 3

Csa13g008570, Csa20g008850

AT5G06530 ATP-BINDING CASSETTE G22

Csa11g071140, Csa18g010110

AT5G43760 3-KETOACYL-COA SYNTHASE 20

Csa05g084930, Csa07g034210

AT1G68530 3-KETOACYL-COA SYNTHASE 6

Csa04g066390, Csa06g053760

AT2G47240 LONG-CHAIN ACYL-COA SYNTHASE 1

Csa18g022770, Csa02g044850

AT5G55340 MBOAT

Csa05g059550

Csa17g095070

Csa09g092370

Csa16g016450

Csa01g007680

Csa19g012130

Csa15g031400

Csa11g083530

Csa02g065020

AT1G54570 PHYTYL ESTER SYNTHASE 1

AT1G55260 LTPG6

AT2G26250 3-KETOACYL-COA SYNTHASE 10

AT2G28630 3-KETOACYL-COA SYNTHASE 12

AT3G06650 ATP-CITRATE LYASE B-1

AT3G08770 LIPID TRANSFER PROTEIN 6

AT3G21090 ATP-BINDING CASSETTE G15

AT4G32170 CYP96A2

AT5G59310 LIPID TRANSFER PROTEIN 4

Camelina target

Arabidopsis homlog

Csa06g008780, Csa04g015780, Csa09g014800

AT3G27660 OLEOSIN 4   

Csa12g079570, Csa11g057650, Csa10g047190

AT5G40420 OLEOSIN 2   

Csa09g071950, Csa07g037530

AT1G70670 PEROXYGENASE 4

Csa16g014970, Csa07g013360

AT2G29980 FATTY ACID DESATURASE 3

Csa09g034290, Csa06g017080

AT3G44460 bZIP67

Csa04g024660, Csa06g018480

AT3G44830 Lecithin:cholesterol acyltransferase

Csa02g026890, Csa13g047050

AT4G10020 HYDROXYSTEROID DEHYDROGENASE 5

Csa11g017470, Csa12g025120

AT4G26740 CALEOSIN1

Csa02g039360, Csa18g021430

AT5G50770 HYDROXYSTEROID DEHYDROGENASE 6

Csa02g068600, Csa11g097450

AT5G61610 Oleosin family protein  

Csa17g095060

Csa15g025140

Csa13g044710

Csa08g056830

Csa11g072150

Csa11g088480

Csa03g053840

AT1G55250 HISTONE MONO-UBIQUITINATION 2

AT3G18570 Oleosin family protein  

AT4G09760 choline synthase

AT5G07571 Oleosin family protein  

AT5G42870 Phosphatidate phosphohydrolase  (PAH2)

AT5G55240 PEROXYGENASE 2

AT1G48990 Oleosin family protein  

Camelina target

Arabidopsis homlog

Csa08g028100*, Csa20g055600*

AT5G27600 LONG-CHAIN ACYL-COA SYNTHETASE 7

Csa05g023410*

Csa01g007860*

Csa10g013600*

Csa20g020640*

Csa17g015030

Csa06g043450

Csa15g076730

Csa11g072070

Csa20g067180

AT2G33150 3-KETOACYL-COA THIOLASE 2

AT3G06860 MULTIFUNCTIONAL PROTEIN 2

AT4G29010 ABNORMAL INFLORESCENCE MERISTEM (AIM1)

AT5G14180 MYZUS PERSICAE-INDUCED LIPASE 1

AT1G11090 Alpha/beta-Hydrolases protein (MAGL1)

AT2G39420 Alpha/beta-Hydrolases protein (MAGL8)

AT4G24160 Alpha/beta-Hydrolases protein

AT5G42930 Alpha/beta-Hydrolases protein

AT5G43280 Delta (3,5), Delta(2,4)-DIENOYL-CoA ISOMERASE 1

5 8 10 11 12 14 16

18 20 22 24 26

16−21

35−39
25−29

S
G

DPA

CsaABI3VP1-2

5 8 10 11 12 14 16

18 20 22 24 26

16−21

35−39
25−29

S
G

e

s
t
e
g
r
a
T

DPA

CsabZIP1
*
**
*
*
*

5 8 10 11 12 14 16

18 20 22 24 26

16−21

35−39
25−29

S
G

DPA

log2(TPM + 1) 

0

5

10

Figure 3.8. Integration of seed co-expression and DNA-binding information  

61 

 
 
Figure 3.8 (cont’d) 
a.  Bar  graph  indicating  the  percentage  of  DAP-seq  targets  supported  by  the  co-expression 
associations predicted with GENEI3 using seed expression data. b. Bar graph of the seven most 
significant TF-lipid-related process interactions that passed the enrichment test after incorporation 
of  seed  co-expressed  targets.  For  each  TF,  the  total  number  of  target  genes  annotated  for  the 
corresponding  FA  and  TAG  related  processes  are  indicated.  c.  Heatmap  representing  the 
expression  dynamics  of  targets  of  CsaMYB1  associated  with  FA  elongation  during  seed 
development.  d.  Heatmap  representing  the  expression  dynamics  of  targets  of  CsaABI3VP1 
associated with TAG synthesis during seed development. e. Heatmap representing the expression 
dynamics of targets of CsabZIP1 associated with FA/TAG degradation during seed development. 
The right panel list the gene IDs for the Camelina genes represented in the heatmaps and the gene 
IDs for the corresponding Arabidopsis homologs. 

We  selected  three  of  seven  TF-pathway  interactions  (CsaMYB1  &  FA  elongation, 

CsaABI3VP1 & TAG synthesis, and CsabZIP1 & FA-TAG degradation, Figure 4b) to analyze the 

expression dynamics of the corresponding TFs and targets during seed development. CsaMYB1 

showed two expression windows, one at 12 - 21 DPA and a second at 25 - 29 DPA (Figure 3.8c). 

These expression profiles are in concordance with the reported peaks of FA synthesis and TAG 

accumulation (11-24 DPA) (Pollard et al., 2015). CsaABI3VP1-2 showed a broader expression 

window,  starting  at  8  DPA  with  constant  expression  until  25  -  29  DPA  (Figure  3.8d).  Finally, 

CsabZIP1 is mainly expressed during the later stages of seed development (expression peak ~35 - 

39 DPA) (Figure 3.8e), consistent with the expected pattern for controlling TF/TAG degradation 

right before seed germination.  

As for the corresponding target genes, we observed several expression patterns consistent with 

activation or repression by the respective TFs, as exemplified for the targets of CsaMYB1 and 

CsabZIP1 (Figure 3.8c, e). Within the set of the CsabZIP1 targets, it is worth mentioning multiple 

Arabidopsis homologs involved in FA beta-oxidation during seed germination (Fulda et al., 2004; 

Footitt  et  al.,  2006;  Jiang  et  al.,  2011;  Richmond  and  Bleecker,  1999),  and  homeostasis  of 

phospholipid and neutral lipids (Ghosh et al., 2009) (Figure 3.8e). Finally, most (28/30) of the 

CsaABI3VP1-2 targets showed a similar expression to the corresponding TF (Figure 3.8d). Within 

62 

 
 
these 

targets, 

it 

is  worth 

noting 

the 

Csa16g014970/Csa07g013360 

and 

Csa09g034290/Csa06g017080  gene  pairs  which  are  homologs  of  Arabidopsis  FATTY  ACID 

DESATURASE 3 (AtFAD3) (At2g29980) and AtbZIP67 (At3g44460), respectively. The former 

is involved in linolenic acid synthesis (O’Neill et al., 2011), while the latter is a known regulator 

of  AtFAD3  (Mendes  et  al.,  2013),  highlighting  a  potential  feedforward  loop  between 

CsaABI3VP1-2, Csa09g034290/Csa06g017080, and Csa16g014970/Csa07g013360 in Camelina.  

3.4 DISCUSSION 

Camelina  is  an  oilseed  crop  that  has  emerged  as  a  prominent  feedstock  for  biofuels  and 

industrial oils during the past decade. Its polyploid genome makes it challenging to identify genes 

involved in the biosynthesis or regulation of seed oils by classical loss-of-function approaches. 

The homology to Arabidopsis has permitted to translate knowledge gained in this model plant to 

Camelina,  exemplified  in  the  manipulation  of  epicuticular  and  total  wax  production  by  the 

overexpression of AtMYB96 (Lee et al., 2014), or in the increase of seed oil by the overexpression 

of Arabidopsis WRI1 (An and Suh 2015). However, homology-based approaches are unlikely to 

reveal the regulators that make Camelina such a good oil producer. Moreover, techniques such as 

ChIP-seq, classically used to discover TF targets, can be challenging to implement because of the 

difficulties associated with developing antibodies that recognize a single homolog in a polyploid, 

and the use of epitope-tagged version of the TF for ChIP experiments is questionable because the 

function of the epitope-tagged TF cannot be tested unless a mutant is available (which again is 

difficult to obtain in a polyploid).  

We  present  here  a  co-expression  guided  approach  aimed  at  identifying  candidate  Camelina 

TFs involved in the control of seed oils, followed by the evaluation of TF target genes based on 

DAP-seq.  While  not  perfect,  this  strategy  overcomes  many  of  the  limitations  imposed  by  a 

63 

 
polyploid  genome,  providing  a  small  set  of  candidate  TFs  that  can  be  used  for  metabolic 

engineering  efforts  (Grotewold  2008).    Co-expression  analyses  identified  22  TFs  strongly  co-

expressed  with  LRGs,  which  were  further  reduced  to  16  after  several  quality  control  steps. 

Furthermore,  co-expression  analyses  with  seed  expression  data  allowed  us  to  identify  specific 

metabolic  processes  targeted  by  our  regulators,  including  the  control  of  FA-  and  TAG-related 

genes during Camelina seed development.  

Evidence of the robustness of our co-expression analysis is provided by the inclusion in our 

list  of  candidate  TFs  homologs  of  well-known  regulators  of  lipid-related  metabolism  in 

Arabidopsis, including the Camelina homologs of ABI3, FUS3, MYB9, and MYB107 (Giraudat 

et al., 1992; Keith et al., 1994; Lashbrooke et al., 2016). Our list of Camelina TFs also includes 

homologs  of  TFs  indirectly  associated  with  lipid  metabolism  in  Arabidopsis,  such  as  the 

Arabidopsis  homolog  of  CsaNAC2  (AtNAC60).    AtNAC60  was  shown  to  play  a  role  in  sugar 

sensing (Li et al., 2014), as a negative regulator of AtABI5 (Yu et al., 2020), and is a target of 

AtABI4  (Li  et  al.,  2014).  Both  AtABI5  and  AtABI4  are  known  regulators  of  sugar-responsive 

expression, seed germination, and lipid metabolism (Chandrasekaran et al., 2020; Skubacz et al., 

2016). While AtULT1 (homolog of CsaULT1) has been implicated in various Arabidopsis plant 

developmental processes (Fletcher, 2001; Pires et al., 2015; Ornelas-Ayala et al., 2020), a recent 

transcriptome  analysis  of  loss  of  AtUTL1  function  (Tyler  et  al.,  2019)  showed  a  significant 

enrichment  (P-value  <  0.05)  for  LRGs  among  the  differentially  expressed  genes,  indicating  a 

participation of AtUTL1 in the control of lipids. We could not test the co-expression of AtMYB67 

(homolog  of  CsaMYB3)  with  Arabidopsis  LRGs  because  it  is  expressed  at  very  low  levels. 

However,  supporting  a  potential  role  of  CsaMYB3  in  lipid-related  metabolism,  AtMYB67 

physically interacts with the known negative regulator of cuticular wax biosynthesis AtDEWAX 

64 

 
(Trigg et al., 2017), which is a target of AtAGL15, a regulator of embryogenesis and gibberellic 

acid  catabolism  (Zheng  et  al.,  2013).  However,  our  analysis  also  identified  13  Camelina  TFs 

(including  CsaC3H2,  CsaHB2,  Csazf-HD1,  CsaC3H1,  CsaAP2/B3-like-1,  CsaC2C2-Dof1, 

CsaB3-1, CsaS1Fa-like-1, CsaNAC1, CsaWRKY1, CsaTify1, CsaHB1, and CsaLBD1) that were 

previously not associated with the regulation of lipids. 

To  further  elucidate  the  role  of  these  22  Camelina  TFs  in  lipid  regulation  and  to  identify 

potential targets of these TFs, we applied DAP-seq to them. DAP-seq has many limitations, chief 

among them is that it is performed in a chromatin-free context, resulting in the identification of 

binding  sites  and  potential  targets  that  might  not  be  accessible  in  vivo.  However,  it  is  easy  to 

implement,  and  it  is  not  affected  by  a  polyploid  genome,  as  ChIP  techniques  are.  DAP-seq 

permitted  us  to  identify  TFBMs  and  potential  target  genes  for  16  out  of  the  22  TFs  identified. 

When we compared our DAP-seq results with those derived from a large scale analysis conducted 

for Arabidopsis TFs (O’Malley et al., 2016), we determined that DNA-binding properties for ten 

out of the 16 TFs are not available for the corresponding Arabidopsis orthologs, either because 

were  not  tested  (homologs  of  CsaMYB1,  CsaNAC2,  CsaC3H2),  or  because  the  corresponding 

experiments for Arabidopsis TFs did not result in meaningful results (homologs of CsaC2C2-Dof1, 

CsaC3H1, CsaHB1, CsaHB2, CsaHRT1, CsaLBD1, and Csazf-HD1).  

The  analysis  of  TF  target  enrichment  for  LRGs  and  GO  terms  within  the  sets  of  predicted 

targets provided additional validation for the results obtained from the co-expression analyses for 

nine TFs (Figure 3.5a). Of interest, in several instances, the GO enrichment analysis also exposed 

biological  processes  previously  reported  for  the  corresponding  Arabidopsis  homologs.  For 

example,  the  DAP-seq  results  for  CsaMYB1  and  CsaWRKY1  showed  enrichment  for  targets 

associated  with  suberin  biosynthesis  and  defense  responses,  which  are  known  functions  of 

65 

 
AtMYB9 and AtWRKY3, respectively (Lai et al., 2008; Lashbrooke et al., 2016; Birkenbihl et al., 

2018).  The  targets  of  CsaNAC2  were  also  enriched  in  genes  associated  with  fruit  ripening, 

hormone biosynthesis processes, exit from dormancy, inositol lipid-mediated signaling, and lipid 

homeostasis  terms  (Figure  3.6),  which  are  in  good  agreement  with  the  functions  attributed  to 

AtANAC60, the Arabidopsis homolog (Li et al., 2014; Yu et al., 2020). The targets of Csazf-HD1 

showed  enrichment  for  several  GO  terms,  including  cell  death,  cell  wall  organization,  cellular 

response  to  endogenous  stimulus,  and  cellular  response  to  hormone  stimulus,  matching  the 

predicted functions of AtZHD13/RHD1 (Liu et al., 2021). CsaHB1’s targets were enriched in GO 

terms related to leaf development and light responses, response to auxin, response to abscisic acid, 

and post-embryonic plant organ development, among others, similar to the known functions of 

AtHB4 (Carabelli et al., 1993; Sorin et al., 2009; Bou-Torrent et al., 2012). Finally, the targets of 

CsaABI3VP1-2 showed enrichment for GO terms that cover the full spectrum of the functions 

known for its Arabidopsis homolog, AtFUS3 (Curaba et al., 2004; Kagaya et al., 2005; Tiedemann 

et al., 2008; Lumba et al., 2012; Tang et al., 2017). When considered together, these results not 

only provide evidence of the biological significance of the associations identified here, but also 

reveal  the  intertwined  connections  between  lipid  metabolism  and  other  biological  processes  in 

Camelina. 

Regulators  of  plant  metabolism  often  regulate  multiple  genes  in  a  pathway,  making  them 

attractive for metabolic engineering (Broun 2004; Grotewold 2008). We took advantage of this 

characteristic  of  metabolic  regulators  to  identify  TF-process  relationships  with  significant  co-

expression signals at the pathway, rather than as individual target gene level by applying GSEA 

(Subramanian  et  al.,  2005).  This  permitted  us  to  identify  previously  unstated  associations  that 

further support several of the identified TFs as important lipid regulators (Figure 3.7). Finally, we 

66 

 
took advantage of a computational method (GENIE3) that involves causality (rather than simply 

correlation)  to  further  support  the  role  of  several  of  the  identified  TFs  in  controlling  particular 

aspects of lipid metabolism in seeds. When taken together with the results from the other methods 

applied in this study, our results suggest that CsabZIP1 is involved in controlling FA and TAG 

degradation just before seed germination, CsaMYB1 regulates FA elongation, and CsaABI3VP1 

controls the synthesis of TAG (Figures 3 & 4). Our results also implicate CsaMYB1 and CsaNAC2 

as participating in the regulation of cutin biosynthesis (Figure 3.7). 

Gene regulation is at the core of many important agronomic attributes and TFs have a large 

potential to modify complex traits (Century et al. 2008; Springer et al. 2019).  Identifying TFs that 

control specific metabolic or developmental processes in polyploids is challenging because the 

effect of mutations is often masked by redundancy, and traditional approaches to investigate TF 

function  are  limited  by  high  sequence  identity  between  homologs.  Our  strategy  to  identify 

Camelina  candidate  TFs  involved  in  the  regulation  of  lipid  metabolism  was  based  on  a 

combination of co-expression analyses and target identification using DAP-seq. These resulted in 

the  identification  of  a  set  of  16  TFs.  The  presence  among  these  16  TF  of  several  that  were 

previously shown in Arabidopsis to participate in different aspects of lipid accumulation furnished 

a validation for the approach. Incorporating into our pipeline co-expression analyses that imply 

causality  and  that  take  into  consideration  that  TFs  often  control  multiple  genes  in  a  metabolic 

pathway further provided a better picture of the regulatory events involving the identified TFs in 

seed  oil  accumulation  in  Camelina.  Similar  combination  of  approaches  could  significantly 

contribute to identify key regulators for important agronomic traits in other polyploids. 

67 

 
 
 
3.5 METHODS 

3.5.1 Plant materials and growth conditions 

Camelina sativa cultivar Suneson was grown in the plant biology greenhouse at Michigan State 

University. RNA-seq and DAP-seq experiments were performed on plants grown for one month 

at 22 °C and under 16:8 -h light/dark cycles. For seed RNA-seq, total RNA was extracted from 

seedpods harvested at 5, 8, and 11 DPAs. For DAP-seq, a pool of ten leaves from six mature (two-

month-old) plants was collected.  

3.5.2 Cloning and expression of transcription factors for DAP-seq 

A set of 22 full-length Camelina TF-ORFs were annotated using the Camelina sativa cultivar 

DH55  (reference  genome,  V2.0,  http://camelinadb.ca).  Coding  regions  were  assembled  (when 

required) using expression data available for the Camelina cultivar Suneson. TF-ORFs Gateway 

compatible  were  synthesized  by  Genewiz  (https://www.genewiz.com/Public/Services/Gene-

Synthesis/Standard). Clones were recombined using LR clonase II (Life Technologies) into the 

pIX-Halo expression vector containing both T7 and SP6 promoters (pIX-Halo:ccdB) 6xHis-tag at 

C-terminus, no stop codon but has T7 terminator. 

3.5.3 RNA-seq library preparation 

Total RNA from fresh seed, after removing pod covers, was extracted using Spectrum Plant 

Total  RNA  Kit  (Sigma-Aldrich)  according  to  the  manufacturer’s  protocol.  The  total  RNA  was 

prepared with three biological replicates, each with ~100 mg seeds. The quality of total RNA was 

determined  by  TapeStation4200  (Agilent),  and  cDNA  library  was  generated  with  1  μg  of  total 

RNA using TruSeq stranded mRNA (Illumina). The pooled libraries were sequenced with a pair-

ended read length of 150 bp by Illumina HiSeq 4000 at the Research Technology Support Facility 

Genomics Core at Michigan State University.   

68 

 
3.5.4 DAP-seq library preparation 

Camelina  sativa  genomic  DNA  were  extracted  using  urea  buffer  (7M  urea,  350mM  NaCl, 

50mM  Tris-HCl  pH8,  20mM  EDTA,  1%  N-lauroyl  sarcosine)  and  mixed  with 

phenol:chloroform:isoamyl  alcohol  25:24:1.  The  supernatants  containing  DNA  were  further 

precipitated using 3M NaOAc (pH 5.2) and isopropanol followed by 70% ethanol wash. The DNA 

pellet was resuspended in UltraPure™ DNase/RNase-Free Distilled Water (Invitrogen) followed 

by  RNase  A  (Roche)  treatment  and  ethanol  precipitation.    DAP-seq  gDNA  libraries  were 

constructed following the protocol of Bartlett et al. (2017) with minor modifications. Extracted 

gDNA were fragmented to the size range between 200-400 bp using Diagenode’s Bioruptor® 300 

for 40 cycles with 30 seconds on/off at high energy. The fragmented DNA was further used for 

end repair and adapter ligation. To create modification-free DNA, additional 11 cycles of PCR 

amplifications  were  performed  using  the  adapter  ligated  libraries,  followed  by  ethanol 

precipitation.  Finally,  the  amplified  gDNA  libraries  (ampDAP)  were  used  for  all  protein-DNA 

interaction  procedures.  The  Halo-tagged  TFs  were  expressed  in  the  wheat  germ  in  vitro 

transcription/translation SP6 promoter system (Promega). All the buffers and procedures for DNA-

protein interaction were as published (Bartlett et al., 2017), except that the input gDNA library 

amount and the final step of library size selection. About 200 ng of ampDAP gDNA library were 

added  as  an  input  to  mix  with  each  pIX-HALO-TF  protein.  Finally,  to  perform  double-size 

selection targeting 300-400 bp fragments, 0.7 volume of Agencourt AMPure XP beads (35 µl) to 

1  volume  of  sample  (50  µl)  were  mixed  for  5  minutes  and  the  bead  was  discarded  to  remove 

fragments with size larger than 400 bp. Next, the supernatant (85 µl) containing < 400 bp fragments 

were added to 0.2 volume of Agencourt AMPure XP beads (10 µl) and mixed for 15 minutes. The 

bound  fragments  were  eluted  from  the  beads  by  adding  18  ml  UltraPure™  DNase/RNase-Free 

69 

 
Distilled Water (Invitrogen). The concentrations of eluted DNA were measured using the Qubit 

HS  dsDNA  assay  kit,  and  approximate  5-20  ng/µl  final  concentrations  were  obtained.  The 

fragment size and binding capacity to the flow cell were further examined on the agarose gel by 

six  cycles  of  PCR  using  2  µl  of  eluted  ampDAP-seq  library  with  Illumina  P5  and  P7  primers. 

Twelve libraries were pooled in one lane and sequenced by Illumina HiSeq 4000 SE50 at RTSF 

Genomics core at Michigan State University.  

3.5.5 Data processing, quantification, and statistical analyses 

RNA-seq. 

Sample 

quality 

control 

was 

performed 

using 

FastQC 

(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, V0.11.5). Adapters and low quality 

reads  were  trimmed  with  Trimmomatic  (Bolger  et  al.,  2014)  using  the  following  parameters: 

ILLUMINACLIP:Adapter.fastq:2:40:15  SLIDINGWINDOW:4:20  MINLEN:30.  Cleaned  reads 

were mapped to the reference genome (V2.0, http://camelinadb.ca) using HISAT2 (2.0.4) (Kim et 

al.,  2019)  with  default  parameters.  Reads  aligned  to  genes  were  counted  with  the  R  package 

Rsubread v1.32.2 (Liao et al., 2019), using default parameters and counting only uniquely-mapped 

reads (Liao et al., 2019), and the transcript abundance estimated as transcripts per kilobase million 

(TPM). Arabidopsis RNA-seq samples were re-analyzed using the same pipeline. Cleaned reads 

were mapped to the TAIR10 Arabidopsis genome (https://www.arabidopsis.org/).  

Selection  of  TF  candidates  based  on  co-expression  analyses.  Camelina  lipid-related  genes 

(LRGs),  TFs  and  the  whole  genome  expression  data  were  collected  from  CamRegbase 

(https://camregbase.org/, Gomez-Cano et al., 2020). Specifically, we retrieved 2,765 LRGs, 5,590 

TF, and TPMs values for 131 publicly available RNA-seq experiments. Mutual information (MI) 

was used as the co-expression metric and estimated with the R package Parmigene v1.0.2 (Sales 

and Romualdi, 2011). The top 200 genes with the highest MI (MI value ≥ 1) were assumed as the 

70 

 
 
co-expressed  genes  of  the  corresponding  TFs.  The  significance  of  the  common  LRGs  and  co-

expressed  genes  by  TF  was  tested  with  a  Fisher-Exact  Test.  TF  family  enriched  on  TFs 

significantly  co-expressed  with  LRGs  were  characterized  using  the  R  package  GeneOverlap 

v1.26.0 (Shen and Sinai, 2020). TF-target genes co-expression analysis was performed using the 

log2 of TPMs+1 collected from CamRegBase and generated in this work from seed samples (Table 

S1).  The  co-expression  was  estimated  as  the  Pearson  coefficient  of  the  expression  of  the 

corresponding  TF  and  target  expression  profiles  and  was  calculated  with  the  cor  function 

implemented in R v3.6.0 (https://www.r-project.org/). Arabidopsis homologs of the corresponding 

Camelia candidates were collected from CamRegbase (https://camregbase.org/, Gomez-Cano et 

al.,  2020),  and  the  co-expression  analysis  was  performed  with  the  sample  pipeline  and  filters 

implemented in Camelina’s analyzes.   

DAP-seq  read  mapping,  filtering,  and  peak  calling.  Sample  quality  control  was  performed 

using  FastQC  (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/,  V0.11.5).  Adapters 

and low quality reads were trimmed with Trimmomatic (Bolger et al., 2014) using the following 

parameters:  ILLUMINACLIP:Adapter.fastq:2:40:15  SLIDINGWINDOW:4:20  MINLEN:30. 

Cleaned reads were mapped to the reference genome (V2.0, http://camelinadb.ca) with Bowtie2 

v2.3.4.1 (Langmead and Salzberg, 2012) and only using nuclear chromosomes. Multi-mapping 

reads were filtered with Samtools v1.9 (Li et al., 2009): samtools view -q 30. Peaks were called 

using GEM v3.4 (Guo et al., 2012) using the HALO vector as negative control, and the  following 

parameters: --d Read_Distribution_default.txt --k_min 6 --k_max 15 --k_seqs 2000 --outNP --sl. 

For TFs with replicates, GEM was called with the replicate mode. Only TFs with >500 predicted 

peaks were used for further analysis. Arabidopsis DAP-seq samples were re-analyzed using the 

71 

 
same  pipeline.  Cleaned 

reads  were  mapped 

to 

the  TAIR10  Arabidopsis  genome 

(https://www.arabidopsis.org/).  

Peak quality control, and motif enrichment. Summit peaks were extended 50 bps and formatted 

into SAF files to count mapped reads by peak using the R package Rsubread v1.32.2 (Liao et al., 

2019). Read abundance was estimated as counts per kilobase million (CPM) by peak, which were 

then used to estimate the log2FC (TF/Halo) of each peak. Peaks with log2FC > 0.5 were used for 

further  analysis.  TF  binding  motifs  were  identified  with  the  meme-chip  tool  (-meme-minw  6  -

meme-maxw 20) from the MEME suite v5.1.1 (Machanick and Bailey, 2011), using top 1,000 

peaks with largest log2FC. DNA sequences of the corresponding peaks used as input on meme-

chip  were  extracted  from  the  reference  genome  (V2.0,  http://camelinadb.ca)  using  the  getfasta 

function from bedtools v2.26.0 (Quinlan and Hall, 2010). MEME’s predicted motifs frequency 

and distribution were assayed with the FIMO tool from the MEME suite v5.1.1 (Grant et al., 2011). 

Fimo analysis was performed with default parameters using the total set of peaks selected after the 

log2FC  filter  as  the  fasta  database.  Motif  Z-scores  were  estimated  as  follow:  𝑍!"#$%𝑠𝑐𝑜𝑟𝑒 =

(𝑋!$ − 𝑋!)/𝑠𝑑(𝑋!), such that 𝑋! and 𝑠𝑑(𝑋!) are the average and standard deviation of the total 

number  of  significant  hits  of  the  motif  m  (Fimo  q-value  <=0.05)  for  the  corresponding  TF  X. 

And,	𝑋!$ represents the total number of significant hits (Fimo q-value <=0.05) of the TF X at the 

peak’s position i for the corresponding motif m. The position i was defined as the position at which 

the motif's first nucleotide did match within the corresponding peak sequence (i.e., position 1 to 

100, having 50 as the peak’s summit). Peaks without a motif match were filtered out from further 

analysis.  Motif logos were generated using MotifStack (Ou et al., 2018). Peaks were visualized 

with  the  Integrated  Genome  Browser  (IGV)  (Robinson  et  al.,  2011),  for  which  bam  files  were 

converted into bigwig files normalizing mapped reads by bins per million mapped reads (BPM) 

72 

 
using bins 10 bps long. Bigwig files were generated using the bamCoverage tool from deepTools 

v3.5.0 (Ramírez et al., 2016). 

Target  genes  identification,  lipid-related  target  enrichment,  and  GO  enrichment  analysis. 

Targets  genes  were  assigned  based  on  the  peak’s  summit  distance  to  the  closest  annotated 

transcription start site (TSS). We use 3 kpbs around the TSS as the maximum distance threshold. 

Summit-TSS distances were estimated using the closest function from bedtools v2.26.0 (Quinlan 

and Hall, 2010) as follows: closestBed -a Summit.file.bed -b Cs_TSS.bed -D "b". The TSSs bed 

file  was  generated  using  the  current  genome  annotation  available  (V2.0,  http://camelinadb.ca). 

Lipid-related target gene enrichment was carried out using the R package GeneOverlap v2.28.0 

(Shen and Sinai, 2020), with the total number of Camelina genes annotated as background. GO 

enrichment on predicted target genes was performed using the R package topGO v2.38.1 (Alexa 

and Rahnenfuhrer, 2010). The top 10 GO terms were selected based on the P-value filtering out 

general and redundant terms. Genes GO annotation was retrieved from CamRegBase, which is 

based on homolog with Arabidopsis  (Gomez-Cano et al., 2020).   

Target enrichment at pathway level and gene set enrichment analysis (GSEA) algorithm. The 

identification  of  TF  enriched  on  target  genes  associated  to  specific  lipid-related  pathways  we 

performed using the gene-pathway annotation introduced in Figure 3.1, and using the R package 

GeneOverlap v2.28.0 (Shen and Sinai, 2020), with the total number of Camelina genes annotated 

as  background.    The  GSEA  assay  was  performed  with  the  list  of  pathways/genes  presented  in 

Figure  3.1  using  the  R  package  FGSEA  v1.18.0  (Korotkevich  et  al.,  2021),  and  with  Pearson 

Correlation  Coefficients  (PCCs)  as  scoring  metric.  The  PCC  was  estimated  as  weighted  PCC 

(wPCC)  between  the  corresponding  TFs  and  the  current  annotated  genes  in  Camelina  (V2.0, 

http://camelinadb.ca). We use in this analysis the same list of expression samples analyzed during 

73 

 
the prediction of TF candidates. Expression values (TPMs) were log2 transformed, and wPCCs 

were calculate using the R package wCorr (Version 1.9.1) (Emad and Bailey, 2017)with an optimal 

threshold  of  0.4,  which  reduces  overestimation  of  the  PCC  because  of  similar  samples  (e.g., 

biological replicates). 

Seed co-expression analysis based on GENIE3. The estimation of potential target genes based 

on  expression  was  performed  with  only  seed  expression  data  using  the  GENIE3  algorithm, 

implemented 

on 

the 

R 

package 

GENEI3 

v1.14.0 

(https://bioconductor.org/packages/release/bioc/html/GENIE3.html) (Huynh-Thu et al., 2010). To 

identify significant scores, we re-assigned the gene IDs randomly at the expression matrix to then 

recalculate the corresponding GENIE3 score. This process was replicated 1,000 times in order to 

generate a null distribution for each potential target gene. The significance of the observed score 

vs the null distribution was estimated calculated as the significance of the Z-score observed, which 

calculated as follows:  

𝑍&’"() = 	

𝑆𝑐𝑜𝑟𝑒"*&)(+), − 	𝐴𝑣𝑒𝑟𝑎𝑔𝑒(𝑆𝑐𝑜𝑟𝑒(-.,"![0,..,0333])
𝑠𝑑(𝑆𝑐𝑜𝑟𝑒(-.,"!)
√𝑡𝑜𝑡𝑎𝑙	𝑟𝑎𝑛𝑑𝑜𝑚	𝑣𝑎𝑙𝑢𝑒𝑠

Given  the  number  of  comparisons,  estimated  P-values  were  corrected  for  multiple  testing  by 

Benjamini-Hochberg method (Yoav Benjamini and Yosef Hochberg, 1995). 

3.5.6 Data availability and accession numbers 

The  supporting  the  findings  of  this  work  are  available  on  the  supplementary  files.  Raw 

sequencing data generated can be found in the NCBI SRA databased under the accession number 

PRJNA763897.  Processed  data  have  been  deposited  in  the  NCBI  GEO  databased  under  the 

accession number GSE184283. 

74 

 
 
 
REFERENCES 

Abdullah,  H.M.,  Akbari,  P.,  Paulose,  B.,  Schnell,  D.,  Qi,  W.,  Park,  Y.,  Pareek,  A.  and 
Dhankher,  O.P.  (2016)  Transcriptome  profiling  of  Camelina  sativa  to  identify  genes 
involved  in  triacylglycerol  biosynthesis  and  accumulation  in  the  developing  seeds. 
Biotechnol. Biofuels, 9, 136.  

Alexa,  A.  and  Rahnenfuhrer,  J.  (2010)  topGO:  enrichment  analysis  for  gene  ontology.  R 

package version 2 

Alonso,  R.,  Oñate-Sánchez,  L.,  Weltmeier,  F.,  Ehlert,  A.,  Diaz,  I.,  Dietrich,  K.,  Vicente-
Carbajosa,  J.  and  Dröge-Laser,  W.  (2009)  A  pivotal  role  of  the  basic  leucine  zipper 
transcription factor bZIP53 in the regulation of Arabidopsis seed maturation gene expression 
based on heterodimerization and protein complex formation. Plant Cell, 21, 1747–1761. 

An,  D.  and  Suh,  M.C.  (2015)  Overexpression  of  ArabidopsisWRI1  enhanced  seed  mass  and 

storage oil content in Camelina sativa. Plant Biotechnol. Rep., 9, 137–148. 

Bartlett, A., O’Malley, R.C., Huang, S.-S.C., Galli, M., Nery, J.R., Gallavotti, A. and Ecker, 
J.R.  (2017)  Mapping  genome-wide  transcription-factor  binding  sites  using  DAP-seq.  Nat. 
Protoc., 12, 1659–1672. 

Baud,  S.  and  Lepiniec,  L.  (2010)  Physiological  and  developmental  regulation  of  seed  oil 

production. Prog. Lipid Res., 49, 235–249. 

Bäumlein, H., Miséra, S., Luerßen, H., Kölle, K., Horstmann, C., Wobus, U. and Müller, A.J. 
(1994) The FUS3 gene of Arabidopsis thaliana is a regulator of gene expression during late 
embryogenesis. Plant J., 6, 379–387. 

Bensmihen, S., Rippa, S., Lambert, G., Jublot, D., Pautot, V., Granier, F., Giraudat, J. and 
Parcy,  F.  (2002)  The  homologous  ABI5  and  EEL  transcription  factors  function 
antagonistically to fine-tune gene expression during late embryogenesis. Plant Cell, 14, 1391–
1403. 

Berti, M., Gesch, R., Eynck, C., Anderson, J. and Cermak, S. (2016) Camelina uses, genetics, 

genomics, production, and management. Ind. Crops Prod., 94, 690–710. 

Birkenbihl, R.P., Kracher, B., Ross, A., Kramer, K., Finkemeier, I. and Somssich, I.E. (2018) 
Principles  and  characteristics  of  the  Arabidopsis  WRKY  regulatory  network  during  early 
MAMP-triggered immunity. Plant J., 96, 487–502. 

Bou-Torrent,  J.,  Salla-Martret,  M.,  Brandt,  R.,  Musielak,  T.,  Palauqui,  J.-C.,  Martínez-
García, J.F. and Wenkel, S. (2012) ATHB4 and HAT3, two class II HD-ZIP transcription 
factors, control leaf development in Arabidopsis. Plant Signal. Behav., 7, 1382–1387. 

Bolger, A.M., Lohse, M. and Usadel, B. (2014) Trimmomatic: a flexible trimmer for Illumina 

sequence data. Bioinformatics, 30, 2114–2120. 

75 

 
Braybrook, S.A. and Harada, J.J. (2008) LECs go crazy in embryo development. Trends Plant 

Sci., 13, 624–630.  

Broun, P. (2004) Transcription factors as tools for metabolic engineering in plants. Curr. Opin. 

Plant Biol., 7, 202–209. 

Carabelli, M., Sessa, G., Baima, S., Morelli, G. and Ruberti, I. (1993) The Arabidopsis Athb-

2 and -4 genes are strongly induced by far-red-rich light. Plant J., 4, 469–479. 

Carlsson,  A.S.  (2009)  Plant  oils  as  feedstock  alternatives  to  petroleum  –  A  short  survey  of 
at: 

665–670.  Available 

Biochimie, 

platforms. 

crop 

91, 

oil 

potential 
http://dx.doi.org/10.1016/j.biochi.2009.03.021. 

Century,  K.,  Reuber,  T.L.  and  Ratcliffe,  O.J.  (2008)  Regulating  the  regulators:  the  future 
prospects for transcription-factor-based agricultural biotechnology products. Plant Physiol., 
147, 20–29. 

Cernac,  A.  and  Benning,  C.  (2004)  WRINKLED1  encodes  an  AP2/EREB  domain  protein 
involved in the control of storage compound biosynthesis in Arabidopsis. Plant J., 40, 575–
585. 

Chandrasekaran, U., Luo, X., Zhou, W. and Shu, K. (2020) Multifaceted Signaling Networks 

Mediated by Abscisic Acid Insensitive 4. Plant Commun, 1, 100040. 

Chaturvedi, S., Bhattacharya, A., Khare, S.K. and Kaushik, G. (2019) Camelina sativa: An 
Emerging  Biofuel  Crop.  In  C.  M.  Hussain,  ed.  Handbook  of  Environmental  Materials 
Management. Cham: Springer International Publishing, pp. 2889–2925. 

Curaba, J., Moritz, T., Blervaque, R., Parcy, F., Raz, V., Herzog, M. and Vachon, G. (2004) 
AtGA3ox2, a key gene responsible for bioactive gibberellin biosynthesis, is regulated during 
embryogenesis by LEAFY COTYLEDON2 and FUSCA3 in Arabidopsis. Plant Physiol., 136, 
3660–3669. 

Devic, M. and Roscoe, T. (2016) Seed maturation: Simplification of control networks in plants. 

Plant Sci., 252, 335–346. 

Emad, A. and Bailey, P. (2017) wCorr: weighted correlations. R package version. 1.9. 1. 

Fletcher,  J.C.  (2001)  The  ULTRAPETALA  gene  controls  shoot  and  floral  meristem  size  in 

Arabidopsis. Development, 128, 1323–1333. 

Focks, N. and Benning, C. (1998) wrinkled1: a novel, low-seed-oil mutant of Arabidopsis with a 
deficiency in the seed-specific regulation of carbohydrate metabolism. Plant Physiol., 118, 
91–101. 

Footitt,  S.,  Marquez,  J.,  Schmuths,  H.,  Baker,  A.,  Theodoulou,  F.L.  and  Holdsworth,  M. 
(2006)  Analysis  of  the  role  of  COMATOSE  and  peroxisomal  beta-oxidation  in  the 
determination of germination potential in Arabidopsis. J. Exp. Bot., 57, 2805–2814. 

76 

 
Fulda, M., Schnurr, J., Abbadi, A., Heinz, E. and Browse, J. (2004) Peroxisomal Acyl-CoA 
synthetase activity is essential for seedling development in Arabidopsis thaliana. Plant Cell, 
16, 394–405. 

Galli, M., Khakhar, A., Lu, Z., Chen, Z., Sen, S., Joshi, T., Nemhauser, J.L., Schmitz, R.J. 
and Gallavotti, A. (2018) The DNA binding landscape of the maize AUXIN RESPONSE 
FACTOR family. Nat. Commun., 9, 4526. 

Ghosh,  A.K.,  Chauhan,  N.,  Rajakumari,  S.,  Daum,  G.  and  Rajasekharan,  R.  (2009) 
At4g24160,  a  soluble  acyl-coenzyme  A-dependent  lysophosphatidic  acid  acyltransferase. 
Plant Physiol., 151, 869–881. 

Giraudat,  J.,  Hauge,  B.M.,  Valon,  C.,  Smalle,  J.,  Parcy,  F.  and  Goodman,  H.M.  (1992) 
Isolation of the Arabidopsis ABI3 gene by positional cloning. Plant Cell, 4, 1251–1261. 

Grant, C.E., Bailey, T.L. and Noble, W.S. (2011) FIMO: scanning for occurrences of a given 

motif. Bioinformatics, 27, 1017–1018. 

Gomez-Cano, F., Carey, L., Lucas, K., García Navarrete, T., Mukundi, E., Lundback, S., 
Schnell,  D.  and  Grotewold,  E.  (2020)  CamRegBase:  a  gene  regulation  database  for  the 
biofuel 
at: 
http://dx.doi.org/10.1093/database/baaa075. 

Available 

Database 

Camelina 

sativa. 

2020. 

crop, 

, 

Go, Y.S., Kim, H., Kim, H.J. and Suh, M.C. (2014) Arabidopsis Cuticular Wax Biosynthesis Is 
Negatively  Regulated  by  the  DEWAX  Gene  Encoding  an  AP2/ERF-Type  Transcription 
Factor. Plant Cell, 26, 1666–1680. 

Grotewold,  E.  (2008)  Transcription  factors  for  predictive  plant  metabolic  engineering:  are  we 

there yet? Curr. Opin. Biotechnol., 19, 138–144. 

Guerriero, G., Martin, N., Golovko, A., Sundström, J.F., Rask, L. and Ezcurra, I. (2009) The 
RY/Sph element mediates transcriptional repression of maturation genes from late maturation 
to early seedling growth. New Phytol., 184, 552–565. 

Gugel, R.K. and Falk, K.C. (2006) Agronomic and seed quality evaluation of Camelina sativa in 

western Canada. Can. J. Plant Sci., 86, 1047–1058. 

Guo,  Y.,  Mahony,  S.  and  Gifford,  D.K.  (2012)  High  resolution  genome  wide  binding  event 
finding  and  motif  discovery  reveals  transcription  factor  spatial  binding  constraints.  PLoS 
Comput. Biol., 8, e1002638. 

Huynh-Thu,  V.A.,  Irrthum,  A.,  Wehenkel,  L.  and  Geurts,  P.  (2010)  Inferring  regulatory 
networks  from  expression  data  using  tree-based  methods.  PLoS  One,  5.  Available  at: 
http://dx.doi.org/10.1371/journal.pone.0012776. 

Iskandarov, U., Kim, H.J. and Cahoon, E.B. (2014) Camelina: An emerging oilseed platform 
for advanced biofuels and bio-based materials. In MC McCann, M. S. Buckeridge, and N. C. 
Carpita, eds. Plants and BioEnergy. New York: Springer, pp. 131–140. 

77 

 
Jiang,  T.,  Zhang,  X.-F.,  Wang,  X.-F.  and  Zhang,  D.-P.  (2011)  Arabidopsis  3-ketoacyl-CoA 
thiolase-2  (KAT2),  an  enzyme  of  fatty  acid  β-oxidation,  is  involved  in  ABA  signal 
transduction. Plant Cell Physiol., 52, 528–538. 

Kagale, S., Koh, C., Nixon, J., et al. (2014) The emerging biofuel crop Camelina sativa retains a 

highly undifferentiated hexaploid genome structure. Nat. Commun., 5, 1–11. 

Kagaya, Y., Okuda, R., Ban, A., Toyoshima, R., Tsutsumida, K., Usui, H., Yamamoto, A. 
and Hattori, T. (2005) Indirect ABA-dependent regulation of seed storage protein genes by 
FUSCA3 transcription factor in Arabidopsis. Plant Cell Physiol., 46, 300–311. 

Keith, K., Kraml, M., Dengler, N.G. and McCourt, P. (1994) fusca3: A Heterochronic Mutation 

Affecting Late Embryo Development in Arabidopsis. Plant Cell, 6, 589–600. 

Kim, D., Paggi, J.M., Park, C., Bennett, C. and Salzberg, S.L. (2019) Graph-based genome 
alignment  and  genotyping  with  HISAT2  and  HISAT-genotype.  Nat.  Biotechnol.,  37,  907–
915. 

Kong, Q. and Ma, W. (2018) WRINKLED1 transcription factor: How much do we know about 
at: 

153–156.  Available 

Science, 

272, 

regulatory  mechanism?  Plant 
its 
http://dx.doi.org/10.1016/j.plantsci.2018.04.013. 

Kong, Q., Singh, S.K., Mantyla, J.J., Pattanaik, S., Guo, L., Yuan, L., Benning, C. and Ma, 
W. (2020) TEOSINTE BRANCHED1/CYCLOIDEA/PROLIFERATING CELL FACTOR4 
Interacts with WRINKLED1 to Mediate Seed Oil Biosynthesis. Plant Physiol., 184, 658–665. 

Kong,  Q.,  Yang,  Y.,  Guo,  L.,  Yuan,  L.  and  Ma,  W.  (2020)  Molecular  Basis  of  Plant  Oil 
Biosynthesis: Insights Gained From Studying the WRINKLED1 Transcription Factor. Front. 
Plant Sci., 11, 24. 

Korotkevich, G., Sukhov, V., Budin, N., Shpak, B., Artyomov, M.N. and Sergushichev, A. 

(2021) Fast gene set enrichment analysis. bioRxiv, 060012.  

Kosma, D.K., Murmu, J., Razeq, F.M., Santos, P., Bourgault, R., Molina, I. and Rowland, 
O.  (2014)  AtMYB41  activates  ectopic  suberin  synthesis  and  assembly  in  multiple  plant 
species and cell types. Plant J., 80, 216–229. 

Kryuchkova-Mostacci, N. and Robinson-Rechavi, M. (2017) A benchmark of gene expression 

tissue-specificity metrics. Brief. Bioinform., 18, 205–214. 

Kurdyukov, S., Faust, A., Trenkamp, S., et al. (2006) Genetic and biochemical evidence for 
involvement of HOTHEAD in the biosynthesis of long-chain alpha-,omega-dicarboxylic fatty 
acids and formation of extracellular matrix. Planta, 224, 315–329. 

Kwong, R.W., Bui, A.Q., Lee, H., Kwong, L.W., Fischer, R.L., Goldberg, R.B. and Harada, 
J.J. (2003) LEAFY COTYLEDON1-LIKE defines a class of regulators essential for embryo 
development. Plant Cell, 15, 5–18. 

Lai, Z., Vinod, K., Zheng, Z., Fan, B. and Chen, Z. (2008) Roles of Arabidopsis WRKY3 and 

78 

 
WRKY4 transcription factors in plant responses to pathogens. BMC Plant Biol., 8, 68. 

Langmead,  B.  and  Salzberg,  S.L.  (2012)  Fast  gapped-read  alignment  with  Bowtie  2.  Nat. 

Methods, 9, 357–359. 

Lambert, S.A., Jolma, A., Campitelli, L.F., et al. (2018) The Human Transcription Factors. Cell, 

175, 598–599. 

Lashbrooke, J., Cohen, H., Levy-Samocha, D., et al. (2016) MYB107 and MYB9 Homologs 

Regulate Suberin Deposition in Angiosperms. Plant Cell, 28, 2097–2116. 

Le, B.H., Cheng, C., Bui, A.Q., et al. (2010) Global analysis of gene activity during Arabidopsis 
seed development and identification of seed-specific transcription factors. Proc. Natl. Acad. 
Sci. U. S. A., 107, 8063–8070. 

Lee, S.B., Kim, H., Kim, R.J. and Suh, M.C. (2014) Overexpression of Arabidopsis MYB96 
confers drought resistance in Camelina sativa via cuticular wax accumulation. Plant Cell Rep., 
33, 1535–1546. 

Lee, S.B., Kim, H.U. and Suh, M.C. (2016) MYB94 and MYB96 Additively Activate Cuticular 

Wax Biosynthesis in Arabidopsis. Plant Cell Physiol., 57, 2300–2311. 

Lee,  S.B.  and  Suh,  M.C.  (2015)  Cuticular  wax  biosynthesis  is  up-regulated  by  the  MYB94 

transcription factor in Arabidopsis. Plant Cell Physiol., 56, 48–60. 

Leprince,  O.,  Pellizzaro,  A.,  Berriri,  S.  and  Buitink,  J.  (2016)  Late  seed  maturation:  drying 

without dying. J. Exp. Bot., 68, 827–841. Available at: [Accessed July 7, 2021]. 

Liao, Y., Smyth, G.K. and Shi, W. (2019) The R package Rsubread is easier, faster, cheaper and 
better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res., 47, e47–
e47. 

Liang, C., Liu, X., Yiu, S.-M. and Lim, B.L. (2013) De novo assembly and characterization of 

Camelina sativa transcriptome by paired-end sequencing. BMC Genomics, 14, 146. 

Li-Beisson, Y., Shorrosh, B., Beisson, F., et al. (2013) Acyl-lipid metabolism. Arabidopsis Book, 

11, e0161. 

Li,  H.,  Handsaker,  B.,  Wysoker,  A.,  et  al.  (2009)  The  Sequence  Alignment/Map  format  and 

SAMtools. Bioinformatics, 25, 2078–2079. 

Li,  D.,  Jin,  C.,  Duan,  S.,  et  al.  (2017)  MYB89  Transcription  Factor  Represses  Seed  Oil 

Accumulation. Plant Physiol., 173, 1211–1225. 

Li, P., Zhou, H., Shi, X., et al. (2014) The ABI4-induced Arabidopsis ANAC060 transcription 
factor attenuates ABA signaling and renders seedlings sugar insensitive when present in the 
nucleus. PLoS Genet., 10, e1004213. 

Li, Y., Beisson, F., Pollard, M. and Ohlrogge, J. (2006) Oil content of Arabidopsis seeds: the 

influence of seed anatomy, light and plant-to-plant variation. Phytochemistry, 67, 904–915. 

79 

 
Liu, P., Nie, W.-F., Xiong, X., et al. (2021) A novel protein complex that regulates active DNA 

demethylation in Arabidopsis. J. Integr. Plant Biol., 63, 772–786. 

Lotan, T., Ohto, M., Yee, K.M., et al. (1998) Arabidopsis LEAFY COTYLEDON1 is sufficient 

to induce embryo development in vegetative cells. Cell, 93, 1195–1205. 

Lumba, S., Tsuchiya, Y., Delmas, F., Hezky, J., Provart, N.J., Shi Lu, Q., McCourt, P. and 
Gazzarrini, S. (2012) The embryonic leaf identity gene FUSCA3 regulates vegetative phase 
transitions by negatively modulating ethylene-regulated gene expression in Arabidopsis. BMC 
Biol., 10, 8. 

Machanick,  P.  and  Bailey,  T.L.  (2011)  MEME-ChIP:  motif  analysis  of  large  DNA  datasets. 

Bioinformatics, 27, 1696–1697. 

Malik, M.R., Tang, J., Sharma, N., Burkitt, C., Ji, Y., Mykytyshyn, M., Bohmert-Tatarev, 
K.,  Peoples,  O.  and  Snell,  K.D.  (2018)  Camelina  sativa,  an  oilseed  at  the  nexus  between 
model system and commercial crop. Plant Cell Rep., 37, 1367–1381. 

Mandáková, T., Pouch, M., Brock, J.R., Al-Shehbaz, I.A. and Lysak, M.A. (2019) Origin and 
Evolution  of  Diploid  and  Allopolyploid  Camelina  Genomes  Were  Accompanied  by 
Chromosome Shattering. Plant Cell, 31, 2596–2612. 

Meinke,  D.W.,  Franzmann,  L.H.,  Nickle,  T.C.  and  Yeung,  E.C.  (1994)  Leafy  Cotyledon 

Mutants of Arabidopsis. Plant Cell, 6, 1049–1064. 

Mendes, A., Kelly, A.A., Erp, H. van, Shaw, E., Powers, S.J., Kurup, S. and Eastmond, P.J. 
(2013) bZIP67 regulates the omega-3 fatty acid content of Arabidopsis seed oil by activating 
fatty acid desaturase3. Plant Cell, 25, 3104–3116. 

Morineau, C., Bellec, Y., Tellier, F., Gissot, L., Kelemen, Z., Nogué, F. and Faure, J.-D. (2017) 
Selective gene dosage by CRISPR-Cas9 genome editing in hexaploid Camelina sativa. Plant 
Biotechnol. J., 15, 729–739. 

Mudalkar, S., Golla, R., Ghatty, S. and Reddy, A.R. (2014) De novo transcriptome analysis of 
an imminent biofuel crop, Camelina sativa L. using Illumina GAIIX sequencing platform and 
identification of SSR markers. Plant Mol. Biol., 84, 159–171. 

Neumann,  N.G.,  Nazarenus,  T.J.,  Aznar-Moreno,  J.A.,  Rodriguez-Aponte,  S.A.,  Mejias 
Veintidos, V.A., Comai, L., Durrett, T.P. and Cahoon, E.B. (2021) Generation of camelina 
mid-oleic acid seed oil by identification and stacking of fatty acid biosynthetic mutants. Ind. 
Crops Prod., 159, 113074. 

Nguyen, H.T., Silva, J.E., Podicheti, R., et al. (2013) Camelina seed transcriptome: a tool for 
meal and oil improvement and translational research. Plant Biotechnol. J., 11, 759–769. 

Nikolov, L.A., Shushkov, P., Nevado, B., Gan, X., Al-Shehbaz, I.A., Filatov, D., Bailey, C.D. 
and  Tsiantis,  M.  (2019)  Resolving  the  backbone  of  the  Brassicaceae  phylogeny  for 
investigating trait diversity. New Phytol., 222, 1638–1651. 

80 

 
O’Malley, R.C., Huang, S.S.C., Song, L., Lewsey, M.G., Bartlett, A., Nery, J.R., Galli, M., 
Gallavotti,  A.  and  Ecker,  J.R.  (2016)  Cistrome  and  Epicistrome  Features  Shape  the 
Regulatory DNA Landscape. Cell, 165, 1280–1292. 

O’Neill, C.M., Baker, D., Bennett, G., Clarke, J. and Bancroft, I. (2011) Two high linolenic 
mutants of Arabidopsis thaliana contain megabase-scale genome duplications encompassing 
the FAD3 locus. Plant J., 68, 912–918. 

Ornelas-Ayala, D., Vega-León, R., Petrone-Mendoza, E., Garay-Arroyo, A., García-Ponce, 
B.,  Álvarez-Buylla,  E.R.  and  Sanchez,  M.  de  la  P.  (2020)  ULTRAPETALA1  maintains 
Arabidopsis  root  stem  cell  niche  independently  of  ARABIDOPSIS  TRITHORAX1.  New 
Phytol., 225, 1261–1272. 

Ozseyhan,  M.E.,  Kang,  J.,  Mu,  X.  and  Lu,  C.  (2018)  Mutagenesis  of  the  FAE1  genes 
significantly  changes  fatty  acid  composition  in  seeds  of  Camelina  sativa.  Plant  Physiol. 
Biochem., 123, 1–7. 

Ou,  J.,  Wolfe,  S.A.,  Brodsky,  M.H.  and  Zhu,  L.J.  (2018)  motifStack  for  the  analysis  of 

transcription factor binding site evolution. Nat. Methods, 15, 8-9. 

Pelletier,  J.M.,  Kwong,  R.W.,  Park,  S.,  et  al.  (2017)  LEC1  sequentially  regulates  the 
transcription of genes involved in diverse developmental processes during seed development. 
Proc. Natl. Acad. Sci. U. S. A., 114, E6710–E6719. 

Pires, H.R., Shemyakina, E.A. and Fletcher, J.C. (2015) The ULTRAPETALA1 trxG factor 
contributes  to  patterning  the  Arabidopsis  adaxial-abaxial  leaf  polarity  axis.  Plant  Signal. 
Behav., 10, e1034422. 

Pollard, M., Martin, T.M. and Shachar-Hill, Y. (2015) Lipid analysis of developing Camelina 

sativa seeds and cultured embryos. Phytochemistry, 118, 23–32. 

Pouvreau, B., Blundell, C., Vohra, H., Zwart, A.B., Arndell, T., Singh, S. and Vanhercke, T. 
(2020)  A  Versatile  High  Throughput  Screening  Platform  for  Plant  Metabolic  Engineering 
Highlights the Major Role of ABI3 in Lipid Metabolism Regulation. Front. Plant Sci., 11, 
288. 

Quinlan,  A.R.  and  Hall,  I.M.  (2010)  BEDTools:  a  flexible  suite  of  utilities  for  comparing 

genomic features. Bioinformatics, 26, 841–842. 

Ramírez, F., Ryan, D.P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter, A.S., Heyne, S., 
Dündar,  F.  and  Manke,  T.  (2016)  deepTools2:  a  next  generation  web  server  for  deep-
sequencing data analysis. Nucleic Acids Res., 44, W160–5. 

Reiter,  F.,  Wienerroither,  S.  and  Stark,  A.  (2017)  Combinatorial  function  of  transcription 

factors and cofactors. Curr. Opin. Genet. Dev., 43, 73–81. 

Richmond,  T.A.  and  Bleecker,  A.B.  (1999)  A  defect  in  beta-oxidation  causes  abnormal 

inflorescence development in Arabidopsis. Plant Cell, 11, 1911–1924. 

81 

 
Robinson, J.T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E.S., Getz, G. and 

Mesirov, J.P. (2011) Integrative genomics viewer. Nat. Biotechnol., 29, 24–26. 

Rodríguez-Rodríguez, M.F., Sánchez-García, A., Salas, J.J., Garcés, R. and Martínez-Force, 
E. (2013) Characterization of the morphological changes and fatty acid profile of developing 
Camelina sativa seeds. Ind. Crops Prod., 50, 673–679. 

Romanel,  E.A.C.,  Schrago,  C.G.,  Couñago,  R.M.,  Russo,  C.A.M.  and  Alves-Ferreira,  M. 
(2009) Evolution of the B3 DNA binding superfamily: new insights into REM family gene 
diversification. PLoS One, 4, e5791. 

Roudier, F., Gissot, L., Beaudoin, F., et al. (2010) Very-long-chain fatty acids are involved in 
polar auxin transport and developmental patterning in Arabidopsis. Plant Cell, 22, 364–375. 

Sadler, C., Schroll, B., Zeisler, V., Waßmann, F., Franke, R. and Schreiber, L. (2016) Wax 
and  cutin  mutants  of  Arabidopsis:  Quantitative  characterization  of  the  cuticular  transport 
barrier in relation to chemical composition. Biochim. Biophys. Acta, 1861, 1336–1344. 

Saez,  A.,  Rodrigues,  A.,  Santiago,  J.,  Rubio,  S.  and  Rodriguez,  P.L.  (2008)  HAB1-SWI3B 
interaction reveals a link between abscisic acid signaling and putative SWI/SNF chromatin-
remodeling complexes in Arabidopsis. Plant Cell, 20, 2972–2988. 

Sales,  G.  and  Romualdi,  C.  (2011)  Parmigene-a  parallel  R  package  for  mutual  information 

estimation and gene network reconstruction. Bioinformatics, 27, 1876–1877. 

Sarnowski, T.J., Ríos, G., Jásik, J., et al. (2005) SWI3 subunits of putative SWI/SNF chromatin-
remodeling  complexes  play  distinct  roles  during  Arabidopsis  development.  Plant  Cell,  17, 
2454–2472. 

Seo, P.J., Lee, S.B., Suh, M.C., Park, M.-J., Go, Y.S. and Park, C.-M. (2011) The MYB96 
transcription  factor  regulates  cuticular  wax  biosynthesis  under  drought  conditions  in 
Arabidopsis. Plant Cell, 23, 1138–1152. 

Shang, B., Xu, C., Zhang, X., Cao, H., Xin, W. and Hu, Y. (2016) Very-long-chain fatty acids 
restrict  regeneration  capacity  by  confining  pericycle  competence  for  callus  formation  in 
Arabidopsis. Proc. Natl. Acad. Sci. U. S. A., 113, 5101–5106. 

Shen, B., Sinkevicius, K.W., Selinger, D.A. and Tarczynski, M.C. (2006) The homeobox gene 

GLABRA2 affects seed oil content in Arabidopsis. Plant Mol. Biol., 60, 377–387.  

Shen, L. and Sinai, I. (2020) GeneOverlap: Test and visualize gene overlaps. R package 1.26.0 

Skubacz, A., Daszkowska-Golec, A. and Szarejko, I. (2016) The Role and Regulation of ABI5 
(ABA-Insensitive  5)  in  Plant  Development,  Abiotic  Stress  Responses  and  Phytohormone 
Crosstalk. Front. Plant Sci., 7, 1884. 

Sorin, C., Salla-Martret, M., Bou-Torrent, J., Roig-Villanova, I. and Martínez-García, J.F. 
(2009) ATHB4, a regulator of shade avoidance, modulates hormone response in Arabidopsis 
seedlings. Plant J., 59, 266–277. 

82 

 
Springer, N., León, N. de and Grotewold, E. (2019) Challenges of Translating Gene Regulatory 

Information into Agronomic Improvements. Trends Plant Sci., Dic, 1075–1082. 

Stone, S.L., Kwong, L.W., Yee, K.M., Pelletier, J., Lepiniec, L., Fischer, R.L., Goldberg, R.B. 
and Harada, J.J. (2001) LEAFY COTYLEDON2 encodes a B3 domain transcription factor 
that induces embryo development. Proc. Natl. Acad. Sci. U. S. A., 98, 11806–11811. 

Subramanian,  A.,  Tamayo,  P.,  Mootha,  V.K.,  et  al.  (2005)  Gene  set  enrichment  analysis:  a 
knowledge-based  approach  for  interpreting  genome-wide  expression  profiles.  Proc.  Natl. 
Acad. Sci. U. S. A., 102, 15545–15550. 

Suzuki, M. and McCarty, D.R. (2008) Functional symmetry of the B3 network controlling seed 

development. Curr. Opin. Plant Biol., 11, 548–553. 

Tang,  L.P.,  Zhou,  C.,  Wang,  S.S.,  Yuan,  J.,  Zhang,  X.S.  and  Su,  Y.H.  (2017)  FUSCA3 
interacting with LEAFY COTYLEDON2 controls lateral root formation through regulating 
YUCCA4 gene expression in Arabidopsis thaliana. New Phytol., 213, 1740–1754. 

Tian,  R.,  Paul,  P.,  Joshi,  S.  and  Perry,  S.E.  (2020)  Genetic  activity  during  early  plant 

embryogenesis. Biochem. J, 477, 3743–3767. 

Tiedemann, J., Rutten, T., Mönke, G., et al. (2008) Dissection of a complex seed phenotype: 

novel insights of FUSCA3 regulated developmental processes. Dev. Biol., 317, 1–12. 

To, A., Joubès, J., Barthole, G., Lécureuil, A., Scagnelli, A., Jasinski, S., Lepiniec, L. and 
Baud, S. (2012) WRINKLED Transcription Factors Orchestrate Tissue-Specific Regulation 
of  Fatty  Acid  Biosynthesis  in  Arabidopsis.  The  Plant  Cell,  24,  5007–5023.  Available  at: 
http://dx.doi.org/10.1105/tpc.112.106120. 

Trigg, S.A., Garza, R.M., MacWilliams, A., et al. (2017) CrY2H-seq: a massively multiplexed 

assay for deep-coverage interactome mapping. Nat. Methods, 14, 819–825. 

Troncoso-Ponce,  M.A.,  Barthole,  G.,  Tremblais,  G.,  To,  A.,  Miquel,  M.,  Lepiniec,  L.  and 
Baud, S. (2016) Transcriptional Activation of Two Delta-9 Palmitoyl-ACP Desaturase Genes 
by MYB115 and MYB118 Is Critical for Biosynthesis of Omega-7 Monounsaturated Fatty 
Acids in the Endosperm of Arabidopsis Seeds. Plant Cell, 28, 2666–2682. 

Tsukagoshi,  H.,  Morikami,  A.  and  Nakamura,  K.  (2007)  Two  B3  domain  transcriptional 
repressors  prevent  sugar-inducible  expression  of  seed  maturation  genes  in  Arabidopsis 
seedlings. Proc. Natl. Acad. Sci. U. S. A., 104, 2543–2547. 

Yoav Benjamini and Yosef Hochberg (1995) Controlling the False Discovery Rate: A Practical 
and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. Methodol., 57, 289–
300. 

Tyler, L., Miller, M.J. and Fletcher, J.C. (2019) The Trithorax Group Factor ULTRAPETALA1 
Regulates  Developmental  as  Well  as  Biotic  and  Abiotic  Stress  Response  Genes  in 
Arabidopsis. G3, 9, 4029–4043. 

83 

 
Voelker,  T.  and  Kinney,  A.J.  (2001)  VARIATIONS  IN  THE  BIOSYNTHESIS  OF  SEED-

STORAGE LIPIDS. Annu. Rev. Plant Physiol. Plant Mol. Biol., 52, 335–361. 

Wang, F. and Perry, S.E. (2013) Identification of direct targets of FUSCA3, a key regulator of 

Arabidopsis seed development. Plant Physiol., 161, 1251–1264. 

Wang, X., Niu, Q.-W., Teng, C., Li, C., Mu, J., Chua, N.-H. and Zuo, J. (2009) Overexpression 
in 

of  PGA37/MYB118  and  MYB115  promotes  vegetative-to-embryonic 
Arabidopsis. Cell Res., 19, 224–235. 

transition 

Yamamoto, A., Kagaya, Y., Usui, H., Hobo, T., Takeda, S. and Hattori, T. (2010) Diverse 
roles and mechanisms of gene regulation by the Arabidopsis seed maturation master regulator 
FUS3 revealed by microarray analysis. Plant Cell Physiol., 51, 2031–2046. 

Yu, B., Wang, Y., Zhou, H., Li, P., Liu, C., Chen, S., Peng, Y., Zhang, Y. and Teng, S. (2020) 
Genome-wide  binding  analysis  reveals  that  ANAC060  directly  represses  sugar-induced 
transcription of ABI5 in Arabidopsis. Plant J., 103, 965–979. 

Zhai, Z., Liu, H. and Shanklin, J. (2017) Phosphorylation of WRINKLED1 by KIN10 Results 
in  Its  Proteasomal  Degradation,  Providing  a  Link  between  Energy  Homeostasis  and  Lipid 
Biosynthesis. Plant Cell, 29, 871–889. 

Zhang,  M.,  Cao,  X.,  Jia,  Q.  and  Ohlrogge,  J.  (2016)  FUSCA3  activates  triacylglycerol 

accumulation in Arabidopsis seedlings and tobacco BY2 cells. Plant J., 88, 95–107. 

Zheng,  Q.,  Zheng,  Y.  and  Perry,  S.E.  (2013)  AGAMOUS-Like15  promotes  somatic 
embryogenesis in Arabidopsis and soybean in part by the control of ethylene biosynthesis and 
response. Plant Physiol., 161, 2113–2127. 

Zheng, Y., Ren, N., Wang, H., Stromberg, A.J. and Perry, S.E. (2009) Global identification of 
targets of the Arabidopsis MADS domain protein AGAMOUS-Like15. Plant Cell, 21, 2563–
2577. 

84 

 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER FOUR: MULTI-NETWORK INTEGRATION TO PRIORITIZE 

REGULATORY GENES OF METABOLISM IN MAIZE 

85 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4.1 ABSTRACT 

Elucidating gene regulatory networks (GRNs) is a major area of study within plant systems 

biology.  Phenotypic  traits  are  intricately  linked  to  specific  gene  expression  profiles.  These 

expression  patterns  primarily  arise  from  regulatory  connections  between  sets  of  transcription 

factors (TFs) and their target genes. In this study, I integrated publicly available co-expression 

networks  encompassing  over  6,000  RNA-seq  samples,  approximately  16  million  SNPs,  and 

around  300  protein-DNA  interaction  assays,  which  comprised  245  ChIP-seq  and  38/  DAP-seq 

assays. Overall, I constructed four distinct types of TF-target networks, including co-expression, 

protein-DNA  interaction  (PDI),  trans-expression  quantitative  loci  (trans-eQTL),  and  cis-eQTL 

combined  with  PDIs.  In  total,  I  analyzed  ~4.6M  interactions.  I  implemented  three  different 

strategies to integrate these four types of networks, and performed evaluation of the method based 

on knockouts and random networks. These results identify transcriptional regulators of different 

biological processes, including hormone-, metabolic- and development-related processes. Finally, 

using  the  topological  properties  of  the  full  integrated  network  I  identify  potentially  functional 

redundant TF paralogs. Our findings retrieve functions previously documented for numerous TFs 

and  reveal  novel  functions  that  are  crucial  for  informing  the  design  of  future  experiments. 

Moreover, I am laying the foundation for the integration of multi-omic datasets in maize and other 

plant systems. 

4.2 INTRODUCTION 

Plant cells, like those of other organisms, use multiple interconnected molecular layers which 

collaboratively coordinate every cellular process, from cellular division to metabolite synthesis 

and adaptation to environment changes. Among these molecular layers, transcription factor (TF) 

proteins  play  a  vital  role  in  controlling  the  expression  of  other  genes  (known  as  target  genes) 

86 

 
(Gupta et al., 2021). The regulation requires the direct protein-DNA interactions (PDI) between 

TFs  and  specific  cis-regulatory  elements  (CREs)  located  in  close  (promoters)  or  distal 

(enhancers/silencers) regulatory regions of the corresponding target genes (Schmitz et al., 2022). 

Furthermore,  TFs  can  also  regulate  gene  expression  indirectly  by  engaging  in  protein-protein 

interactions (PPI) with other proteins. Together, collections of PDIs form highly interconnected 

gene  regulatory  networks  (GRNs).  Overall,  GRN  are  characterized  by  gene-centered  and  TF-

centered  approaches  (Arda  and  Walhout,  2010;  Mejia-Guerra  et  al.,  2012;  Yang  et  al.,  2016). 

Common  gene-centered  methods  include  yeast  one-hybrid  (Y1H)  assay  and  electrophoretic 

mobility shift assay (EMSA) (Arda and Walhout, 2010; Yang et al., 2016). TF-centered strategies 

involve techniques like ChIP-seq for in vivo TF binding site discovery, and SELEX, PBM, and 

DAP-seq for in vitro analysis (Yang et al., 2016; O’Malley et al., 2016). The organization of GRNs 

has implications for phenotypic variation (Deplancke et al., 2006), plant responses to abiotic and 

biotic  stress  (Nakashima  et  al.,  2014;  Birkenbihl  et  al.,  2017;  Sun  et  al.,  2022),  development 

(Marand  et  al.,  2023),  speciation  (Mack  and  Nachman,  2017),  as  well  as  adaptation  and 

diversification  (Mack  and  Nachman,  2017;  Bowles  et  al.,  2020;  Marand  et  al.,  2023),  among 

others.  Therefore,  it  is  crucial  to  comprehend  the  structure  and  dynamics  of  these  networks, 

emphasizing the significance and rationale behind such endeavors.  

Maize holds great agricultural significance due to its versatility and wide range of applications. 

It serves as a staple food, particularly in sub-Saharan Africa and Latin America, and is also used 

for animal feed, meat production, dairy, and poultry products (Erenstein et al., 2022). Part of maize 

versatility underlines its extraordinary maize metabolic diversity (Riedelsheimer et al., 2012; Wen 

et al., 2014, 2016; Zhou et al., 2019), which is established by its genetic diversity (Schnable et al., 

2009;  Hufford  et  al.,  2021;  McMullen  et  al.,  2009),  and  varies  as  a  function  of  endogenous 

87 

 
variables (e.g. organs or development stages) (Zhou et al., 2019) and environment factors (Wen et 

al., 2014; Kusmec et al., 2017). Irrespective of the methodology employed and the traits analyzed, 

maize has consistently shown a complex genetic architecture, as indicated by the number of loci 

potentially linked to a single trait and their minor contribution to the variance of the corresponding 

traits (Riedelsheimer et al., 2012; Wen et al., 2014, 2016; Xiao et al., 2017; Mazaheri et al., 2019; 

Zhou  et  al.,  2019).  These  challenges  present  two  primary  obstacles  in  comprehending  the 

molecular  mechanisms  underlying  these  traits:  the  involvement  of  multiple  genes  in  a  single 

phenotypic trait and the influence of additional genetic factors that determine and modulate the 

genetic contribution to phenotypic variations. 

A distinctive characteristic of maize is its genome itself, which has undergone a recent whole 

genome duplication (WGD) event ~5-12 Mya, and is currently defined as an ancient tetraploid 

(Wei et al., 2007). It has an abundance of tandem duplicated genes (Kono et al., 2018), and is 

highly-enriched in transposable elements (~85%) (Schnable et al., 2009). Furthermore, the WGD 

event resulted in the formation of two subgenomes (maize1 and maize2), exhibiting unequal gene 

loss  and  expression  patterns,  primarily  driven  by  the  subgenome  with  a  lower  fraction  rate 

(Schnable  et  al.,  2011).  The  dominant  subgenome  (i.e.,  maize1)  was  shown  to  have  a  larger 

contribution  to  phenotypic  variations  (Renny-Byfield  et  al.,  2017).  Nevertheless,  the  precise 

molecular  mechanisms  underlying  the  asymmetric  contributions  of  each  subgenome  remain 

largely  unknown  and  are  likely  orchestrated  at  multiple  molecular  levels,  including  regulation, 

signaling,  and  interactome  level,  as  evidenced  by  examining  co-expression  and  multi-network 

comparisons of homeologs (Li et al., 2016; Han et al., 2023). Understanding these mechanisms 

holds  significant  implications  for  modeling,  prioritizing,  and  unraveling  the  principal  factors 

88 

 
behind  agriculturally  relevant  traits,  while  also  advancing  our  fundamental  comprehension  of 

maize evolution. 

The multi-omic data sets have gained attention as alternatives to address the complexity of 

genetic and phenotypic variation observed in biological systems, offering insights at various levels 

of a biological process (Tolani et al., 2021). Integrating these diverse datasets is an evolving field, 

with  four  main  approaches:  conceptual  integration  (overlapping  observations),  statistical 

integration,  network-based  integration,  and  machine  learning-based  integration  (Depuydt  et  al., 

2023). In maize, like in other organisms, the advancements in technology have led to the rapid 

generation of genomic data. These data encompass various molecular layers at different scales, 

such  as  TF  binding  profiles  (Galli  et  al.,  2018;  Ricci  et  al.,  2019;  Tu  et  al.,  2020),  accessible 

chromatin regions (ACRs) (Rodgers-Melnick et al., 2016; Ricci et al., 2019; Marand et al., 2021), 

expression and co-expression atlases (Sekhon et al., 2011; Stelpflug et al., 2016; Hoopes et al., 

2019; Zhou et al., 2020), and transcriptomic, proteomic, and metabolic at population-level (Li et 

al., 2013; Wen et al., 2014, 2016; Kremling et al., 2018; Zhou et al., 2019; Mazaheri et al., 2019; 

Shrestha et al., 2022). Consequently, there has been a growing focus on integrating multi-omic 

datasets (Liu et al., 2016; Walley et al., 2016; Wen et al., 2016; Jin et al., 2017; de Abreu E Lima 

et al., 2018; Lee et al., 2019; Schaefer et al., 2018; Wen et al., 2018). However, most integration 

efforts  primarily  involve  the  verification  of  each  layer  with  one  another  (i.e.,  conceptual 

integration) (Depuydt et al., 2023). There are a few exceptions where the layers are leveraged to 

enhance the integration (Schaefer et al., 2018; Yang et al., 2022) or to learn from their combined 

information (Han et al., 2023). Therefore, emphasizing the need for a comprehensive assessment 

of integration strategies and its effectiveness to prioritize gene-specific processes. 

89 

 
In this study, I analyzed genetic and gene expression variation in 304 maize inbred lines. I 

utilized  data  from  over  300  publicly  available  ChIP-  and  DAP-seq  experiments,  along  with  45 

previously analyzed co-expression networks (Zhou et al., 2020). Combining these datasets, I built 

four  molecular  networks  and  employed  three  integration  methods.  I  sought  to  annotate 

transcription factors (TFs) based on their predicted target genes. I combined published knockouts 

and created random networks as strategies to evaluate the corresponding functional predictions. 

This allowed me to identify the integration strategy that made functional predictions more similar 

to  those  observed  in  knockout  assays,  while  minimizing  the  chance  of  random  predictions.  In 

essence, it allowed me to recover predictions rarely predicted by a random network. I provided 

evidence  that  these  predictions  recovered  TF-process  associations  previously  linked  to  specific 

biological processes. The compiled predictions enabled the creation of a TF-process association 

list, which, when combined with TF-target networks, facilitates the identification of regulators for 

processes like abscisic acid (ABA), lipid, phenylpropanoid, and leaf-related processes. Finally, I 

demonstrated that employing the generated embedding post-integration of all four networks, which 

recovers  pattern  of  connectivity  within  the  full  combined  network,  allows  distinguishing 

homologous (aka, paralogs) with potential redundancy in maize. Collectively, these findings offer 

a  remarkable  amalgamation  of  TF-process  associations  and  lay  the  foundation  for  prospective 

network-based functional prediction in maize. Moreover, this invaluable tool facilitates the linkage 

of previously identified genetic markers with clusters of functionally associated genes, utilizing 

their connectivity patterns within the presented networks. 

90 

 
 
 
 
4.3 RESULTS 

4.3.1 Construction of a maize regulatory network based on multiple layers 

To build a multi-layer TF-function association network, I collected previously published co-

expression networks, single-nucleotide polymorphisms (SNPs), and reanalyzed publicly available 

expression, DAP-seq, and ChIP-seq datasets in maize. In total, I included several co-expression 

networks, genetic variation data for 304 maize inbred lines, and 289 DNA-binding assays (DAP-

seq and ChIP-seq) associated with 144 TFs (Figure 4.1a). In total, I identified ~3.4M, ~155.1K, 

~1.18M, and 112.46K TF-gene associations derived from the co-expression networks (CENs), a 

trans-eQTL  association  network  (GAN),  a  gene-regulatory  network  (GRN),  and  cis-eQTLs 

overlapped with GRN interactions (eGRN), respectively. The GRN was built based on DAP/ChIP-

seq assays (Figure 4.1b). Construction details of the corresponding networks are described below.  

Coexpression  network  (CEN).  To  build  the  CEN  layer,  I  started  by  collecting  45  CENs 

previously  published  (Zhou  et  al.,  2020),  and  added  an  additional  network  constructed  with  a 

subset of expression datasets associated with 304 inbred lines [Wisconsin Diversity (WiDiv) panel 

(Mazaheri et al., 2019)]. The 304 lines were selected based on availability of high-density whole 

genome  sequencing  derived  SNPs  (Bukowski  et  al.,  2018),  following  consistent  methods  with 

previously reported CENs (Zhou et al., 2020) (see Methods). Thus, in total, I utilized 46 different 

co-expression  networks  to  define  the  TF-target  CEN  layer.  Each  network  was  reduced  to  only 

maize genes in synteny with Sorghum bicolor (Schnable, 2019) to avoid potential bias towards 

non-functional  genes  when  conducting  gene  enrichment  analyses.  The  syntenic  gene  filter  was 

also applied to all other network types (i.e., GRN, eGRN, and GAN). On average, the collected 

CENs showed ~1,055 TFs co-expression network (Figure S4.1a) with ~74 predicted target genes 

(a.k.a,  targets)  per  TF  (Figure  S4.1b).  Combining  all  46  CENs,  I  identified  ~3.4M  TF-target 

91 

 
associations involving 1,852 TFs and 23,788 targets (on average, ~1,350 targets per TF; targets 

can include other TF genes). To note, some of these TFs had several times more targets than the 

average  TF  (Figure  S4.1b).  For  example,  the  ABI3VP1-7  (ABI7)  and  the  C2C2-CO-like-

transcription factor 8 (COL8), showed >400 targets in five and four CENs, respectively (Figure 

S3.1b); or COL13 and the bHLH-transcription factor 127 (bHLH127), which in total showed > 

6,000 targets each (Figure S4.1c).  

Gene association network based on trans-eQTLs (GAN). This layer was built based on trans-

eQTLs  identified  in  eight  distinct  tissues  encompassing  several  developmental  stages.  Overall, 

after quality control and data preprocessing (see Methods), I tested between 15.5M - 16.7M SNPs 

against the expression of 15.3K - 26.4K genes across the eight tissue types. Thus, after discarding 

non-significant eQTLs (See Methods) and non-syntenic genes (Schnable, 2019), I obtained a total 

of ~22.9M eQTL-gene associations including ~10M and ~26.4k different SNPs and target genes, 

respectively. These associations were classified as cis-eQTL overlapped with its target genes (cis-

eQTLt), trans/cis-eQTL, cis-eQTL, trans-eQTL, and unassigned eQTL according to the distance 

between each eQTL and its corresponding target gene and eQTL-gene co-location (Figure S4.2a). 

Under this classification schema, I identified 10.2M, 6.7M, 1.20M, 1.18M, and 3.5M unassigned 

eQTL,  trans-eQTL,  trans/cis-eQTL,  cis-eQTLt,  and  cis-eQTL,  respectively  (Figure  S4.2a). 

Within them, trans-eQTLs (eQTLs overlapped with annotated genes and located >50 kbs far away 

from their corresponding target genes) were used to define the GAN.  

After  removing  redundant  links  (e.i.,  multiple  eQTL  supporting  the  same  gene-to-gene 

connection), the resultant GAN harbored ~155k associations, including 23.9K and 18.9K source 

and  target  genes,  respectively.  Here,  “source  gene”  was  defined  as  a  gene  overlapped  with  the 

corresponding eQTL. To better understand the nature of the genes captured on the predicted GAN, 

92 

 
I classified source and target genes into five functional categories including transcription factors 

(TFs),  co-regulatory  factors  (CoReg),  mediators,  kinases,  enzymes,  and  others  (Yilmaz  et  al., 

2009; Zheng et al., 2016; Mathur et al., 2011). Interestingly, “Kinase” and “Enzyme” were the top 

two classes with the highest number of target genes, even larger than the “TF” class (Figure S4.2b). 

Similarly,  “Enzyme”  class  was  the  most  frequent  target  class  followed  by  “Mediator”  and  co-

regulators  (“CoReg”)  (Figure  S4.2c).  I  counted  the  interaction  frequency  between  the 

corresponding  classes,  and  after  “Other”,  “Enzyme”  was  the  functional  class  with  more 

interactions (13.7K) (Figure S4.2d), highlighting “Enzymes” as one of the functional classes more 

interconnected within the predicted GAN. Finally, I also note that the GAN recovered gene-gene 

interactions  that  capture  both  typical  TF-target  interactions,  but  also  physical  protein-protein 

interactions.  An  example  is  provided  by  the  the  HSF-transcription  factor  20  (HSF20)  which 

showed 354 targets, including 27 genes previously reported as heat-response related genes (Zhou 

et  al.,  2021),  as  well  as  five  known  physical  interactors  (Zhu  et  al.,  2016).  This  highlights  a 

typically unexplored set of regulatory connections among genes at several hierarchical levels.  

Gene-regulatory network (GRN), and cis-eQTLs overlapped with GRN interactions (eGRN). 

To  construct  the  GRN,  I  collected  and  reanalyzed  283  PDI  experiments  associated  with  142 

different TFs. All the reanalyzed assays corresponded to TF-centered approaches, including 215 

ChIP-seqs in protoplast (pChIP-seq), 30 classic ChIP-seq, and 38 DAP-seq. A single data analysis 

pipeline was used to process all PDI assays to reduce pipeline-specific bias (See Methods).  On 

average,  I  obtained  ~52k  peaks  per  TF  which,  in  total,  represented  ~7.6M  PDIs.  Most  of  the 

predicted peaks were contributed by pChIP-seq (Tu et al., 2020), which represented 75% of the 

data analyzed (on average, ~55k peaks by TF) (Figure S4.3a). To identify high-confidence peaks, 

I  applied  two  filtering  criteria.  First,  I  gathered  accessible  chromatin  regions  (ACRs)  from  the 

93 

 
recently published single-nuclei ATAC-seq (snATAC-seq) atlas (Marand et al., 2021), retaining 

only TF's peaks that overlapped with ACRs. Therefore, I compared all the DAP-seq and ChIP-seq 

datasets to a shared regulatory maize space. Second, I removed peaks with low counts per million 

(CPM) (as defined by a Z-score ≤ -0.5) for each PDI assay. Overall, I filtered out ~3.8M and ~1.1M 

peaks  using  the  ACR  and  CPM  criteria,  respectively  (Figure  S4.3b,  c).  As  expected,  DAP-seq 

assays,  in  both  filters,  have  the  largest  percentage  of  discarded  peaks  (Figure  S4.3b,  c). 

Interestingly, comparing low-coverage and ACR co-location peaks and their distance to the closest 

annotated transcription start site (TSS), I find that peaks with the highest Z-values mapped largely 

to ACRs near TSSs (~10 kbs around) (Figure S4.3d). These last patterns were observed in all data 

types  (DAP-,  ChIP-,  and  pChIP-seq),  thus,  supporting  the  biological  relevance  of  the  high-

confidence peaks retained. After filtering, I ended with a set of ~3.6M of peaks that were used for 

further analyses. 

To define target genes, I integrated the peak-TSS distance and their overlap with cis-eQTLs 

(declared when a peak summit and a cis-eQTL were at ≤ 20 bp away). Combining these metrics, I 

classify the peaks into three types of peaks close to TSSs (≤ 3 kb) and two types of peaks far away 

from TSSs (> 3 kbs and ≤ 50 kb). Specifically, peaks in close proximity (≤ 3 kb) were defined as 

follows: peaks without cis-eQTL support (Figure S4.3e, light purple peaks), with cis-eQTL support 

and  similar  target  prediction  (Figure  S4.3e,  light  green  peaks),  and  with  cis-eQTL  support  and 

different target prediction (Figure S4.3e, yellow peaks). These categories represented the 54.9%, 

1.9%, and 0.1% of the total analyzed peaks, respectively. Similarly, peaks located far away were 

classified as peaks with (3.3%) and without (39.5%) cis-eQTL support (Figure S4.3e; light blue 

and gray peaks, respectively). Overall, I did not observe differences in peak categories among PDI 

data types (Figure S4.3e, bottom panel). Thus, after discarding peaks located far away and without 

94 

 
cis-eQTL  support,  I  build  a  gene  regulatory  network  (GRN)  and  cis-eQTL  supporting  GRN 

(eGRN) combined all peaks by TF irrespectively of the PDI source. In total, I captured ~1.12M 

(GRN) and ~1123.46K (eGRN) TF-target interactions, including 138 TFs and ~23.9K and 13.9K 

target genes, respectively (Figure 4.1a, b).  

Figure 4.1 Construction of maize gene regulatory network based on multiple data types 
a. Model indicating the different types of TF-gene associations used to define the network types 
analyzed in this work. b. Summary of the metrics of the four types of network layers.  c. Schematic  

95 

 
 
 
Figure 4.1 (cont’d) 
representation of the pipeline implemented to annotate and evaluate the corresponding functional 
predictions.  

4.3.2 TF Functional annotation 

A  major  difference  between  the  networks  constructed  is  the  number  of  TFs  and  their 

corresponding target/associated genes (here, indistinguishably called target genes), which hinders 

comparisons between layers. For instance, all four networks have 111 TFs with at least one target 

gene (Figure S4.4a, b), however, this number is reduced to only 17 TFs when comparing TFs with 

at least ten different target genes by network (Figure S4.4a, c). This reduction is largely caused by 

the  low  number  of  predicted  targets  on  the  GAN  layer  (on  average,  ~6.5  targets  by  gene).  In 

consequence,  I 

implemented 

three  different  strategies  (common 

interactions,  common 

integrations,  and  network-based)  to  functionally  annotate  the  TFs  present  in  the  corresponding 

networks. In all three approaches, the annotation was performed based on enrichment of metabolic 

pathways (PWYs) (Andorf et al., 2016) and GO terms (Wimalanathan et al., 2018) (Figure 4.1c) 

(See Methods). Briefly, the most conservative approach, common interactions, assumes that only 

common  TF-target  interactions  between  layers  (i.e.,  GAN,  GRN,  eGRN,  and  CEN  networks) 

capture true targets of the corresponding TF, and by extension its function. 

Common function, assumes that a TF function is most accurately captured by those functions 

commonly  enriched  across  different  network  types.  Thus,  it  prioritizes  functions  commonly 

enriched for the corresponding TF across layers. Finally, network-based combines all layers to 

then extract topological properties for each gene. It assumes that each interaction type bore equally 

valid information about the function of the corresponding TFs. Specifically, it combines all four 

layers  (GAN,  CEN,  eGRN,  and  GRN)  creating  a  denser  network  (combined  network)  to  then 

extract  physical  parameters  -  embeddings  -  from  each  gene  in  the  combined  network  (See 

96 

 
 
methods). The transformation of the networks into a matrix of genes and embeddings allows the 

grouping of genes based on the similarity of their embeddings. Here, I used the mutual rank of the 

mutual information as the metric to identify highly similar genes in the embedding matrix, to then 

test for enrichment with PWYs and GO terms between the corresponding genes. Ultimately, this 

strategy allowed me to make functional annotation of TFs independently of the number of target 

genes predicted at the individual layer types (i.e., GAN, CEN, eGRN, and GRN).  

Common interactions. To identify common interactions, I compared all layers with each other 

(Figure S4.4a) and obtained ~4.6M TF-target interactions. As expected, GRN and eGRN were the 

layers with the largest number of overlapping interactions (~112.5K), followed by GRN and CEN 

(~102.7K) (Figure S4.4d). After identifying common interactions, I keep 206.2k out of the 4.6M 

interactions, including 934 and ~20.6K different TFs and target genes (Figure S4.4d). Using target 

genes as a proxy to annotate the TFs function, I test the enrichment of common target genes with 

PWYs  and  GO  terms  by  TF  (See  Methods).  Also,  given  the  similarities  in  their  molecular 

functions, I included co-regulators in the analysis and treated them without distinction from TF. 

In total, I found 2,812 TF-PWY and 8,550 TF-GO significant associations [False Discovery Rate 

(FDR) ≤ 0.1, Fisher’s Exact Test] (Figure 4.2a), which on average represented ~8 and ~80 PWYs 

and GO terms by TFs (Figure 4.2b). Combining PWY and GO term results, I annotated 347 TFs, 

out  of  which  235  TFs  showed  enrichment  only  in  the  PWYs  analysis.  The  remaining  112  TFs 

showed enrichment with both PWYs and GO terms (Figure 4.2c).  

Common function. To identify common functions, I initially tested the enrichment of target 

genes with PWYs and GO terms for each TF in each layer. I retained TFs that had at least one 

PWY/GO  term  enriched  in  the  last  two  different  layers.  This  allowed  me  to  explore  common 

predictions between layers for the corresponding TFs (Figure S4.5a). I observed a variable number 

97 

 
of TFs enriched with PWYs (ranging from 120 to 2,019 TFs) and GO terms (ranging from 72 to 

1,777 TFs) across the different layers. Between the layers, eGRN had the fewest annotations, while 

CEN had the largest number of annotations (Figure S4.5b). Thus, after selecting TFs with at least 

one  PWY  and/or  GO  enrichment,  I  ended  with  966  TFs  and  245  TFs,  respectively.  When 

considering  PWY  annotations,  the  layer  pair  of  CEN  &  GAN  had  the  highest  number  of  TFs 

annotated (888 TFs), while the layer pair of GRN & GAN had the lowest (59 TFs). Similarly, for 

GO term annotations, CEN & GAN had the highest number (130 TFs), while GRN & GAN had 

the  lowest  number  (59  TFs)  (Figure  S4.5c,  d).  Regarding  common  predictions,  I  identified 

overlapping PWYs by evaluating gene overlap among all PWYs enriched per TF between layers 

(P-value ≤ 0.05, Fisher’s Exact) (Figure S4.5a). A similar approach was used to identify common 

functions at the GO term level. However, due to the hierarchical and redundant nature of the GO 

terms, I employed semantic similarity rather than gene overlap to determine common GO terms 

per TF between layers (FDR ≤ 0.1) (See Methods). These two annotation analyses together yielded 

7,081 TF-function annotations (727 TF-PWY and 6,354 TF-GO) (Figure 4.2a). On average, this 

corresponds to 3.5 different PWYs and 57.7 different GO term associations per TF (Figure 4.2b). 

In terms of TFs, these associations encompass annotations for 204 TFs through PWY enrichment 

and 110 TFs through GO term enrichment (Figure 4.2c). 

Network-based. I combined all four layers, i.e., CEN, GEN, GRN, and eGRN, to then scale the 

interaction frequencies from 0.5 to 1, being 0.5 and 1 the weight for interactions observed in one 

and all four layers, respectively. With the scale version of the combined networks, I proceeded to 

identify  low-dimensional  representations  (embeddings)  for  each  gene/node  in  the  combined 

network  (Figure  S4.6a)  (See  Methods).  The  combined  network  included  4.6M  interactions 

associated with 36.4K genes. Unlike the previous two strategies, this method generated an equal 

98 

 
number of descriptors (embeddings) for each gene in the network, thus allowing the identification 

of genes with similar properties, including TFs present in the CEN and/or GAN layers without 

data on the GRN/eGRN layers. The distance of the embedding vector between genes was defined 

as the decay function of the mutual rank of the mutual information of the embedding (See Methods) 

(Figure S4.6a). On average, I found 235 highly similar genes per TF [Distance (D), ≤ 0.05, See 

Methods]  (Figure  S4.6b).  As  in  previous  approaches,  I  annotated  the  corresponding  TFs  by 

assaying the enrichment with PWYs and GO terms of their highly similar genes (Figure S4.6a). In 

total, I found 23,796 and 7,722 TF-PWY and TF-GO significant associations (FDR ≤ 0.1, Fisher’s 

Exact)  (Figure  4.2a),  which  on  average  captured  ~7  and  ~8  PWYs  and  GO  terms  per  TF, 

respectively (Figure 4.2b). Combining both assays, I annotated 2,910 different TFs, out of which 

1,030 TFs showed enrichment with both PWYs and GO terms (Figure S4.6c). To note, these 1,030 

TFs belong to 82 different TFs (including co-regulator) families capturing and representing - on 

average - 34% of the total proteins annotated in the corresponding families (Figure S4.6d). This 

highlights the potential of the method to annotate TFs with unobserved layers.  

Comparing  all  three  methods,  network-based  allowed  me  to  identify  the  largest  and  lowest 

total number of TF-PWYs and TF-GOs associations, respectively (Figure 4.2a). Also, it has the 

lowest average of PWYs and GO terms per TF (Figure 4.2b). Unexpectedly, network-based and 

common target methods predicted a similar number of PWYs per TFs, which contrasts with the 

significantly  lower  number  of  GOs  between  network-based  and  the  other  two  methods  (Figure 

4.2b). Importantly, the number of TF annotated by the network-based is >2.5 times larger than the 

other two methods (Figure 4.2c, left panel). Finally, combining all results, I functionally annotated 

2,917 TFs. However, 94 (Figure 4.2c, violet plus green labels) and 32 (light blue plus green labels) 

99 

 
out  of  the  2,243  TFs  showed  at  least  a  PWY  and  a  GO  term  enrichment  in  all  three  methods, 

respectively.  

4.3.3 Evaluation of functional prediction with knockouts 

TF  perturbation  experiments  enable  the  understanding  of  the  TF  regulatory  landscape  by 

unraveling the direct and indirect effects of expression changes induced by the expression variation 

of  the  corresponding  TFs.  Here,  I  used  21  previously  published  knockouts  associated  with  13 

different TFs (Zhou et al., 2020; Ellison et al., 2023) to assay the accuracy of each of the three 

methods by two independent strategies. Specifically, I questioned the overlap between predicted 

and observed PWYs/GO terms within DEGs for the corresponding knockouts. In parallel, I also 

tested  the  gene  set  enrichment  analysis  (GSEA)  of  the  predicted  PWYs/GO  terms  within  the 

corresponding TF knockouts (Subramanian et al., 2005) (See Methods). Unexpectedly, predicted 

PWYs - without distinction of the methods - showed poor overlapping with PWYs observed at the 

knockout’s  assays,  as  well  as  low  recovering  of  PWY  significantly  enriched  within  DEGs  as 

estimated by the GSEA (Figure S4.7). Conversely, comparisons between predicted and observed 

GO analysis showed similarities [measured by the GO semantic similarity (GSS)] different than 

the expected by chance (P-value ≤ 0.05) (Figure S4.8). Overall, the GO terms from knockouts and 

the GO terms predicted by network-based and common function are significantly more similar than 

the common targets predictions (higher GSS values, P-value ≤ 0.05, Wilcoxon test) (Figure 4.2d). 

Remarkably,  when  a  prediction  is  available,  the  network-based  method  recovers  the  GO  terms 

with  the  highest  GSS  values  among  all  the  methods  (Figure  4.2d,  TB1  and  FEA4  results). 

Additionally, I observed that seven knockouts lacked predictions from network-based methods, 

while  eight  others  had  predictions  only  with  network-based  methods  (Figure  4.2d,  e).  This 

variability  in  predictions  can  be  partly  attributed  to  the  low  number  of  target  genes  (when  the 

100 

 
prediction is absent; Figure S4.9a, TFs with Z-score ≤ 0) and the absence of data in at least one of 

the four layers (i.e., GRN, eGRN, CEN, and GAN) (when the network-based method is the only 

one making the prediction, Figure S4.9b). Consistently with the GSS analysis, the GSEA results 

indicate  that  GO  terms  recovered  with  the  network-based  and  the  common  function  are  more 

consistently identified across the different knockouts (Figure 4.2e). Thus, all together, my results 

suggest that network-based predictions are resilient to the presence-absence variation of layer’s 

data, although susceptible to the number of targets by TF. By extension, they also indicate that 

common targets and function predictions are more sensitive to the absence of data in at least one 

of the layers.  

Combining all GSEA results, I find that, on average, only 25% of total GO predictions show 

significant GSEA scores (P-value ≤ 0.05), denoting a low recovering rate of GO terms (Figure 

4.2e). I argue that the characteristic indirect effects of the knockout can explain these low recovery 

rates, combined with the tissue/condition/genotype differences between the knockout assays and 

the data used in the corresponding predictions. I used the TFs expression-specificity as a proxy to 

understand  the  relationship  between  the  low  fraction  of  GO  recovered  by  GSEA  and  the 

tissue/condition/genotype variation among the corresponding TFs. Including all the TFs for which 

I obtained at least PWY/GO term prediction and using the Tau index as a metric (Kryuchkova-

Mostacci and Robinson-Rechavi, 2017), I find a bi-modal expression distribution with ~55% of 

the  TFs  trending  into  a  sample-specific  expression  fashion  (Figure  S4.9c,  Tau  ≥  0.65). 

Interestingly, only four out of the 13 TFs tested in the knockout analysis are expressed in a sample-

specific fashion (P-value ≤ 0.05) (Figure S4.9d, labeled in green). The top four included RA1 (Tau 

0.99) and TB1 (Tau 0.96), which also are the top two TFs with the largest fraction of GO term 

supported by the GSEA (Figure 4.2e). Hence, the results support the notion that a portion of the 

101 

 
low fraction of GO terms recovered can be attributed to the differences in conditions used on the 

knockout  and  prediction  analyses.  Thus,  I  predict  that  perturbation  analyses  conducted  under 

conditions that mitigate tissue/condition effects may lead to a higher overlap.  

4.3.4 Evaluation of functional prediction by comparing with random networks 

Despite recovering GO terms similar to - and enrichment with - GO terms from knockouts 

(Figure  4.2d,e),  which  method  generates  fewer  false  positives  still  needs  to  be  determined.  In 

consequence, I assayed the identification of GO terms from ~3,000 random networks to establish 

which method recovered the lower fraction of false positives (See Methods). I counted the number 

and significance of the GO terms enriched in random networks as a proxy for the precision, and 

the similarity of observed GOs (true TF-target interactions) with the GOs from random networks 

as  proxy  for  the  accuracy  of  the  corresponding  methods.  Also,  to  compare  predictions  across 

methods for the same TF, I reduced our analysis to only the 32 TFs with GO predictions in all 

three methods (Figure 4.2c, green and blue intersection). I posited that methods with fewer GO 

terms,  less  significant  P-values  (FDR),  and  GO  terms  from  random  networks  less  similar  to 

observed  GO  terms  are  indicative  of  better  predictions.  Remarkably,  network-based  identified 

significantly enriched GO terms in only ~12% of random networks tested, which contrasts with 

the  ~28%  and  ~72%  obtained  with  the  common  function  and  the  common  target  methods, 

respectively (Figure 4.2f). Concordantly, Network-based predicted significantly fewer GO terms 

(Figure 4.2g), with less significant P-values (Figure 4.2h) and GO terms less similar (lowest GSS 

values) to the predicted from true interactions per TF than those observed with common function 

and  common  target  methods  (Figure  4.2i),  highlighting  Network-based  as  the  method  with  the 

highest precision and accuracy. To be noted, the common function predicted fewer GO terms per 

TFs than the common target; although its predictions have P-values more significant and with GSS 

102 

 
values equally similar to those observed in the common target (Figure 4.2g-i), patterns that were 

consistently  observed  also  at  individual  TF  level  (Figure  S4.10),  positionings  the  common 

functional and the common target as equally noisy methods. 

Overall, the network-based method detected a lower number of GO terms per TF (Figure 4.2b) 

and had GO terms enriched in significantly fewer random networks (Figure 4.2f). This suggests 

limitations in the method's ability to identify GO term associations, as it inherently identifies fewer 

GO terms per TF. To examine this possibility, I investigated whether the number of GO terms 

observed with true interactions could be attributed to chance. Remarkably, 30 out of 32 tested TFs 

exhibited a significantly higher total number of GO terms compared to those expected by chance 

(P-value <= 0.05) (Figure S4.11). Thus, despite the network-based approach yielding fewer GO 

terms per TF, these identified GO terms contain valuable biological information that is unlikely to 

occur  randomly.  In  conclusion,  within  the  given  data  context  used  in  this  work,  I  affirm  the 

network-based method as the superior approach. Consequently, I exclusively relied on network-

based predictions for subsequent analyses. 

103 

 
 
Figure 4.2 Annotation and evaluation of TF functional annotation by contrasting predictions 
with knockout assays and random networks  

104 

 
 
 
Figure 4.2 (cont’d) 
Total PWYs and GO terms predicted per integration method after combining all TFs predictions 
(a) and per TF (b). c. Upset plot comparing total TFs annotated by each method and annotation 
system.  Colors  indicate  the  groups  of  TFs  functionally  annotated  by  all  three  methods  by 
enrichment with PWYs (fuchsia), GO terms (blue), and both PWYs and GO terms (green). Black 
groups indicate TFs annotated by at least one of the methods and annotation systems. d. Boxplot 
of the GO semantic similarity for the top 10 most similar GO terms observed in knockout assays 
for each of the predicted GO terms per TF and methods. e. Stacked barplot indicating the fraction 
of the GO terms predicted and significantly enriched - by GSEA analysis - in the knockouts. f. 
Violin plot showing the fraction of random networks with at least one significant (FDR ≤ 0.1, 
Fisher exact test) GO term by TF. g, h, and i. Boxplot showing the average number of GO terms 
(g), -log10FDR (h), and GSS (i) observed in 3000 random networks by method. The GSS values 
were calculated by comparing each random network with the observed GO terms from the true 
TF-target interactions. Asterisks indicate P-value significance (*: p ≤ 0.05, **: 633 p ≤ 0.01, ***: 
p ≤ 0.001, ****: p ≤ 0.0001, two-sided t-test). “TFm” denotes multiple mutant lines for the same 
TF.  

4.3.5 Prioritization of regulators by biological process 

The network-based method detected approximately 7.7K TF-GO associations, encompassing 

1,036 TFs and 2,219 GO terms (Figure 4.2a, c & Figure S4.6b, c). For ease of TF comparison, I 

retained associations involving GO terms within the biological process (BP) category and having 

fewer  than  eight  hundred  associated  genes  (when  a  more  specific  GO  term  association  was 

present). Additionally, to minimize GO term redundancy, I mapped GO terms with a small number 

of annotated genes (less than 50 genes) to their nearest GO term parent. After applying the filters, 

I continued with 4,337 TF-GO associations, including 902 TFs and 559 GO terms. The distribution 

TF-GO  associations  obtained  hold  a  scale-free  distribution  (Figure  4.3a,  b).  Typically,  highly 

interconnected  GO  terms  and  TFs  suggest  a  greater  number  of  annotated  and  targeted  genes, 

respectively. Nevertheless, I did not discover any evidence linking the gene count per GO term or 

the target gene count per TF to their respective degrees (Figure 4.3c, d). Therefore, these analyses 

highlight GO terms whose regulation may depend on multiple TFs, and TFs that may contribute 

to regulating several biological processes. 

105 

 
 
From the perspective of gene regulation, when multiple TFs are associated with multiple GO 

terms, it suggests that the regulatory impact of a TF on a GO term is influenced by the presence of 

other TFs. To assess the contribution of individual TFs to their respective GO terms, I calculated 

a scaled enrichment score (Z-score of the enrichment) for each TF and GO term (See Methods). 

Utilizing the scores as an indicator to assess the significance of individual TF-GO associations 

relative to all other TF and GO term associated, I observed that only 3.4% (151/4,337) of the TF-

GO  associations  exhibit  high  enrichment  scores  (Z-score  ≥  1)  (Figure  4.3e,  top  right  corner), 

indicating that only a reduced fraction of the TFs and GOs analyzed have strong enriched scores 

for the corresponding association. Consequently, this implies that most of the analyzed TFs/GO 

terms have multiple associations of comparable significance. I combined both Z-scores (per TF 

and GO term) into a reciprocated Z-score (rZ, See Methods) to rank TFs by GO term using a single 

metric. I evaluated the ranking after grouping GO terms into specific biological processes (Figure 

4.3f,  and  Figure  S4.12).  I  highlight  here,  46,  62,  47,  and  50  abscisic  acid  (ABA)-,  lipid-, 

phenylpropanoid-, and leaf-related TF-GO associations that were targeted by 44, 55, 47, and 50 

different TFs, respectively (Figure 4.3f). Using the rZ score as filter (rZ ≥ 0.5), I narrowed down 

the list to 25, 27, 15, and 19 TF candidates to control the corresponding processes (Figure 4.3f, 

dots with name label included). Some examples included the top two TFs ABA-related, NAC56 

and WRINKLED2 (WRI2). Additionally, WRI2 was also on the top three of the TFs related to 

lipid-related  metabolism  (Figure  4.3f,  second  panel).  Finally,  five  (WOX9a,  OFP39, 

Zm00001d024353, EREB149, and LBD24) out of the initial 47 TFs phenylpropanoid-related were 

previously  identified  as  maize  regulators  of  phenolic-related  genes  by  yeast-one  hybrid  assays 

(Y1H) (Yang et al., 2017). Altogether, this highlights the biological relevance of the associations 

this analysis predicted.  

106 

 
Apart from controlling specific enzymatic or signaling-related genes, TFs can also regulate 

biological processes by targeting other TFs. This leads to the formation of regulatory circuits with 

multiple hierarchical levels and network motifs. To identify TFs that play a higher-level role in 

controlling a biological process, I calculated the ratio of TFs targeted by other TFs within specific 

GO terms to the total number of TFs targeted by the corresponding TF. Given that I looked for TF 

associated with a common function, this ratio represents the weighted proportion of feed-forward 

loops associated at the level of biological process compared to the overall TF targets of each TF. 

This measure is referred to as the upstream regulator score (URS). I calculated the URS for the 

twenty different processes, including eight hormones-, seven metabolic-, and five developmental-

related processes (Figure 4.3g). Cytokinin- and shoot-related GO terms were the top two processes 

with  the  highest  score,  with  C2C2-CO-like  13  (COL13)  and  RAMOSA1  (RA1)  as  their  top 

regulators,  respectively  (Figure  4.3g).  To  note,  COL13  was  previously  associated  with  carbon 

metabolism (Tu et al., 2020), and is also differential expressed on the indeterminate1 (1d1) loss of 

function mutant (Minow et al., 2018). ID1 is a maize regulator of autonomous floral induction 

(Colasanti  et  al.,  1998).  Thus,  our  results  suggest  a  role  of  COL13  in  the  connection  among 

cytokinin, carbon metabolism, and flowering; mechanistic association previously reported in other 

plant systems (Bartrina et al., 2011; Wahl et al., 2013). Similarly, RA1 was predicted as the top 

upstream  regulator  of  shot-related  processes,  and  RA1  itself  was  linked  with  shoot  system 

development (Figure 4.3g, process number 17), both of them functions previously associated with 

RA1 (Eveland et al., 2014). To further understand the regulatory landscape of the four processes 

described previously (Figure 4.3f), I selected the top two predictions - URS score - for each process 

and traced their TF targets back into the original networks (i.e., GRN, eGRN, CEN, and GAN) 

(Figure  4.3h).  Specifically,  I  looked  for  regulators  directly  upstream  of  any  of  the  top  TFs  as 

107 

 
predicted by the reciprocal Z-score (rZ) analysis (Figure 4.3f). Without exception, I found at least 

an upstream regulator directly targeting (GRN network) at least one of the top three TFs from the 

rZ analysis, i.e., a top regulator of the corresponding biological process (Figure 4.3h). To highlight 

an example within the ABA-related process network, EREB17 targeted NAC56 and WRI2 (tops 

TFs in rZ analysis), and bHLH43 (URS top one) targeted WRI2 and EREB17 (Figure 4.3f, h, first 

panel). This configuration forms a feed-forward loop with bHLH43 on top (i.e., bHLH43 targets 

EREB17  and  WRI2,  and  EREB17  targets  WRI2).  Within  the  lipid-related  network,  ARF14 

targeted WRI2 and PRH65 (top two and three by rZ score) (Figure 4.3f, h, second panel), as so 

did HB33 targeting LBD23 (top in rZ) in the phenylpropanoid-related network (Figure 4.3f, h, 

third panel). Finally, WRKY25 (top USR) targeted the MYBR4 and EREB126 (top two TFs in 

rZ), as well as BZR2 (top two in URS analysis) on the leaf-related network. Thus, it highlights 

specific regulatory hypotheses to further experiment validations.  

108 

 
 
 
Figure 4.3 Prioritization of regulators by biological process using network-based prediction 
a  Out  degree  and  b  in  degree  distributions  of  the  TF-GO  term  predictions  obtained  from  the 
network-based integration analysis. c and d, scatter plots indicating the frequency - as density - of 
the number of TFs by GO terms (in degree) and GOs per TF (out degree) as a function of the 
number gene annotated per GO term (c) and target genes per TF (d), respectively. e, Scatter plot 
indicating the frequency - as density - of the TF-GO term enrichment scores scaled, which allows 
to rank GOs highly enriched with specific TF (GOz) and TFs highly enriched with specific GO 
term (TFz). The enriched was calculated only with TF-GO term associations already predicted in  
109 

 
 
Figure 4.3 (cont’d) 
previous analysis. Dotted line orange indicates TF-GO term associations with enrichment score a 
standard deviation over the observed average for the corresponding TF and GO term (Z-score of 
enrichment  ≥  1  for  both  GO  term  and  TF).  f.  Scatter  plot  with  reciprocal  Z  score  (rZ)  of  four 
different biological process mapped into the GO and TF scaled score coordinates as presented in 
e; GO terms were grouped as follow: ABA-related (GO:0009737, GO:0009738, GO:0009688, and 
GO:0009788),  Lipid-related 
(GO:0031408,  GO:0006099,  GO:0006635,  GO:0019915, 
GO:0006629,  GO:0019375,  GO:0044255,  GO:0016042,  GO:0051790,  GO:0008610, 
GO:0009062,  and  GO:0045332),  phenylpropanoid-related 
(GO:0009963,  GO:0009698, 
GO:2000762,  and  GO:0009699),  and  leaf-related  (GO:0009965,  GO:0048366,  GO:0010305, 
GO:0010150). TF name/gene id labels are included for TFs with rZ ≥ 0.5. g. Scatter plot with TF 
ranked their upstream regulator score (URS) by biological process. TF name labels are included 
for  TFs  with  rank  ≤  2.  Square  brackets  indicate  an  arbitrary  biological  process  index  which 
matches with the number in square brackets of the corresponding TF names. All URS scores are 
calculated based on the original GRN, eGRN, CEN and GAN networks. h. Heatmap with top two 
TFs (y axes) from the URS analysis (g) for the four biological processes presented in f. X axes 
indicate  the  corresponding  TF  targets.  Colors  indicating  the  network(s)  source  of  the 
corresponding interactions. 

4.3.6 Topological properties predict TF homeologs redundancy 

Although  substantial  efforts  have  been  made  to  comprehend  and  anticipate  the  functional 

redundancy between maize paralogs in subgenomes (Schnable et al., 2011; Li et al., 2016; Kono 

et al., 2018; Han et al., 2023), the problem remains far from complete comprehension. I anticipate 

that if a pair of paralogs exhibit functional redundancy, these differences may manifest in their 

topological  properties,  i.e.,  functional  redundant  paralogs  would  display  similar  properties 

indicating  a  comparable  network  arrangement.  To  assess  the  similarity  between  paralogs,  I 

generated  a  distance  matrix  from  the  embeddings  using  the  mutual  rank  (MR)  of  mutual 

information as metric (Figure 4.4a). Next, I mapped TF paralogs (Schnable, 2019) and analyzed 

their MR and the similarity of their MR profiles with all the genes in the embeddings matrix as a 

proxy for understanding the similarity of their embeddings and the similarity of their resemblance 

with other TFs examined. I also differentiated between paralogs located on the same chromosome, 

serving  as  a  proxy  for  pre-speciation  tandem  duplicates.  In  total,  I  tested  932  TF  pairs,  and 

regardless of the metric used, TF paralogs situated on the same chromosome demonstrated greater 
110 

 
similarity compared to TF paralogs on different chromosomes (Figure 4.4b, c). I combined both 

scaled metrics to identify highly similar TF pairs (Figure 4.4d). As expected, both metrics were 

correlated  (Pearson  correlation  0.68),  yet  they  effectively  served  the  purpose  of  identifying  TF 

paralog  pairs  that  were  highly  similar.  I  tallied  the  number  of  interactions  after  the  embedding 

integrations (Figure S4.6a) for the top ten TF pairs that were most and less similar (Figure 4.4e), 

all top ten TF pairs more similar have several common interactions (Figure 4.4e, light brown TF 

pairs highlighted), contrary to those observed within the top less similar which have none (Figure 

4.4e, gray brown TF pairs highlighted). Additionally, I find seven TF pairs mapped to the same 

chromosome out of the top ten TF pairs (Figure 4.4e, TF pairs with asterisk). Thus, for additional 

assessment of tandem duplicates distribution and the shared interactions between TF homologs in 

the context of the embedding similarities, I categorize all TF pairs into nine bins using both scaled 

similarity metrics (Figure 4.4d, dashed black lines). The bins are structured to include the most 

dissimilar TF pairs in the first bin (I) and the most similar pairs in the last bin (IX) (Figure 4.4f, 

internal box). Confirming the observation from the top ten TFs (Figure 4.4e), the bin IX contained 

5-7 times more tandem duplicates than the other bins (Figure 4.4f). I quantify shared interactions 

using the Jaccard index. By considering bin I as a reference, I detect significant differences (p < 

0.05, two-sided t-test) across five distinct bins (Figure 4.4g), primarily categorized based on the 

scaled correlation between TF pairs (Figure 4.4d, x axes). Furthermore, bin IX exhibits the utmost 

values, validating the predictive capacity of embedding similarities for functional redundancy in 

TF paralogs. 

Considering  variations  and  similarities  in  topological  properties  as  indications  of  function 

divergence  and  conservation,  respectively,  I  expect  that  the  protein  sequence  or  expression 

variation of the TFs in bin I will be greater. Focusing exclusively on TF pairs from bin I and IX 

111 

 
(representing  a  TF  pair  with  contrasting  embedding  similarities),  I  calculated  the  Hamming 

distance  of  the  amino  acid  sequences  and  co-expression  as  proxies  to  understand  the  observed 

differences  in  topological  dissimilarities.  Unexpectedly,  I  did  not  notice  any  differences  in  the 

Hamming  distance  between  TF  pairs  highly  similar  or  dissimilar  at  the  topological  level,  as 

evidenced by TF pairs highly conserved (low Hamming distance) in both groups of TFs (Figure 

4.4h). However, TF pairs in bin I (PCC = 0.44) showed slightly lower average co-expression values 

compared  to  those  observed  for  TFs  in  bin  IX  (PCC  =  0.5).  Interestingly,  when  co-expression 

values  are  mapped  in  the  context  of  TFs'  Hamming  distance,  it  allows  me  to  differentiate  TF 

paralog pairs that may be undergoing neofunctionalization/subfunctionalization due to variations 

in  their  protein  sequences  or  its  regulation.  A  striking  example  of  the  former  is  observed  in 

MYBR1 and MYBR81, which have significantly different sequences (Hamming distance close to 

1), distinct embedding profiles (bin I), and yet display high co-expression (PCC > 0.9) (Figure 

4.4i). In contrast, HAG1 and HAG38, as well as GRAS14 and GRAS82, which also belong to bin 

I and have high similarity in peptide sequences (Hamming distance close to 0), show variation 

only in their co-expression (PCC < 0.3) suggesting variation at the regulation level (Figure 4.4i). 

Additionally, within the groups of TFs sharing similar embedding profiles (bin IX), I identified 

TFs exhibiting high conservation (Hamming distance close to 0) and similar expression, implying 

a  significant  degree  of  redundancy  (e.g.,  MADS73  and  TU1)  (Figure  4.4j).  Furthermore,  I 

observed TFs with limited co-expression but high conservation (Hamming distance close to 0, e.g., 

C3H53  and  C3H36),  as  well  as  TFs  with  fairly  poor  conserved  peptide  sequences  (hamming 

distance close to 1, ABI5 and ABI4), indicating differences in its regulation (Figure 4.4j). Thus, 

altogether,  the  combination  of  embedding  similarity,  protein  amino  acid  similarities,  and  co-

expression enables the identification of TFs that are clearly variable or redundant, which is a key 

112 

 
observation  for  understanding  function  redundancy  (in  terms  of  GO  enrichment)  as  described 

previously (Figure 4.3). 

Figure 4.4 Network embedding as predictor of TF paralogs with functional variation 

113 

 
 
 
 
Figure 4.4 (cont’d) 
a. Diagram illustrating the key stages of comparing TF paralogs through embedding similarities. 
b and c. Box plots displaying the MRMI of TF pairs (b) and the Spearman correlations (SCC) of 
the observed MRMI profiles (c) derived from the embedding. d. Combined scaled scores of the  
MRMI and SCC for TF pairs. Black dashed lines with Z-scores of -0.5 and 0.5 indicate values 
below and above the average observed standard deviation. e. Heatmap indicating the total number 
of associated genes for the top ten TFs, on the top right corner and the bottom left corner TF pair 
(d). G1 and G2 represent the number of unique genes associated with the first and second TFs in 
the corresponding pair. G1:G2 indicates the common associations between the corresponding pair. 
F. Bar plot indicating the total number of TF pairs by bin. Bins are indicated on the interval box, 
which is a map of the zones in the plot in (d). g. Box plot with Jaccard index (as an approximation 
of common associated genes) by TF pair by bin (as presented in f). h. Jitter plot displaying amino 
acid (AA) differences (Hamming distance) between TF pairs in bins I and IX. i and j, Jitter density 
of points representing AA Hamming distance and co-expression (measured as PCC) for TF pairs 
in bin I (i) and IX (j). Asterisks indicate P-value significance (*: p ≤ 0.05, **: 633 p ≤ 0.01, ***: 
p ≤ 0.001, ****: p ≤ 0.0001, two-sided t-test). “TFm” denotes multiple mutant lines for the same 
TF.  

4.4 DISCUSSION 

Cells  utilize  complex  networks  of  proteins  to  integrate  and  synergistically  regulate  their 

activities. Capturing the full extent of biological complexity requires the integration of multiple-

omic disciplines that generate layers of information from the cell. From a technical perspective, 

the integration of multiple network types allows one to verify and complement one another (Tolani 

et al., 2021; Shen et al., 2023; Depuydt et al., 2023). Maize, as many other plants, accumulates a 

vast and diverse type of metabolites (Riedelsheimer et al., 2012; Wen et al., 2014; Zhou et al., 

2019).  Despite  major  advances  in  the  understanding  of  the  genetic  and  external  factors  that 

influence  metabolic  variation  and  accumulation  in  maize,  transcriptional  regulators  of  many  of 

these metabolic pathways are largely unknown. This represents a knowledge gap that could be 

bridged by integrating multiple networks, which leverages the continuously growing multi-omic 

data available in maize (Liu et al., 2016; Walley et al., 2016; Wen et al., 2016; Jin et al., 2017; de 

Abreu E Lima et al., 2018; Lee et al., 2019; Schaefer et al., 2018; Wen et al., 2018). In this study, 

I analyzed three distinct data types (PDI, expression, and natural variation) and constructed four 

114 

 
 
different  molecular  networks  (layers).  I  utilized  various  integration  methods  to  prioritize 

transcriptional regulators associated with specific biological processes, which allows me to gained 

valuable insights into potential regulatory mechanisms underlying maize metabolism - as well as 

developmental-related processes - paving the way for designing specific experiments aimed at crop 

improvement,  metabolic  engineering,  and  basic  gene  regulation  understanding  of  the  of  the 

corresponding processes. 

The  rapid  generation  of  multi-omic  genomic  data  in  maize  has  led  to  growing  efforts  to 

implement integration strategies (Liu et al., 2016; Walley et al., 2016; Wen et al., 2016; Jin et al., 

2017; de Abreu E Lima et al., 2018; Lee et al., 2019; Schaefer et al., 2018; Wen et al., 2018). 

However,  the  large  majority  of  the  studies  relies  on  the  idea  of  verifying  each  layer  with  one 

another (i.e., conceptual integration) (Depuydt et al., 2023), with a few exceptions where layers 

are used to level up each layer with one another (Schaefer et al., 2018; Yang et al., 2022) or to 

learn  from  the  combination  of  them  (Han  et  al.,  2023).  Here,  I  implemented  three  different 

integration strategies to make functional annotations of the TFs (Figure 4.1). Our findings indicate 

that  the  integration  of  multiple  layers  based  on  common  targets  and  functions,  although  more 

intuitive, does not effectively recover observed GO terms in knockouts (Figure 4.2d, e). Instead, 

it frequently yields results that can be readily attributed to chance, as demonstrated by the number 

of times that a GO term may be retrieved from random networks, as well as their high similarity 

with the GO terms from the true network (Figure 4.2f-i). Surprisingly, the common targets strategy 

predicts  a  similar  number  (Figure  4.2a,  g)  and  category  (Figure  4.2i)  of  GO  terms  in  random 

networks  as  in  the  true/observed  networks  (Figure  4.2a,  g).  From  a  technical  perspective,  this 

suggests that the initial set of interactions contains a significant number of false positives, which 

explains why random networks can recover similar sets of GO terms. Overall, the number of GO 

115 

 
terms and their significance (P values) in the enrichment analysis with random networks (Figure 

4.2g, h), suggest that common target and common function are more lenient than network-based. 

This drawback is compounded by the inherent technical noise associated with corresponding layers 

(e.g.,  PDI  without  transcriptional  effect).  Furthermore,  it  indicates  that  even  when  a  TF-target 

interaction  is  highly  reliable  (due  to  its  presence  in  multiple  layers),  it  alone  is  insufficient  to 

provide an accurate representation of the biological landscape associated with the corresponding 

TF (Figure 4.2d).  Interestingly, unlike the first two methods, the network-based approach proved 

to predict GO terms that are less likely to be observed from a random network. I interpreted this 

as a sign of robustness in the identification of genes truly functionally related (Figure 4.2g-i). My 

contention  is  that  this  robustness  is  rooted  in  the  inherent  nature  of  the  embedding  generation 

process,  as  it  is  highly  improbable  to  observe  similar  wiring  patterns  across  layers,  despite  the 

expected presence of potential false positives within each respective layer. Additionally, of equal 

significance, the network-based approach facilitated predictions for a considerably larger number 

of TFs (Figure 3.4c), thereby influencing the design of future experiments aimed at uncovering 

and validating specific TF functions in maize. 

In general, TF expand their regulatory repertoire through functional or physical interactions 

with other TFs (Reményi et al., 2004; Brkljacic and Grotewold, 2017), as evidence the formation 

of regulatory cluster both at the level of TF-target genes (Tu et al., 2020) and in the organization 

cis-regulatory  elements  across  cell  types  (Marand  et  al.,  2021).  Here,  combining  all  the  TF-

function  predictions  made  by  our  network-based  integration  I  find  a  network-like  structure 

independent of the number of genes by GO term or targets by TF (Figure 4.3a-d). Using a scaled 

enrichment score for each TF and GO term, I showed that only ~3% of our predictions had a single 

TF  as  the  primary  regulator  of  the  corresponding  GO  term,  which  indicated  that  most  TFs 

116 

 
contributes to the regulation of multiple functions, and that the regulation of a biological process 

requires the involvement of multiple TFs. These similarities between the patterns observed here 

and previously reported ones are interpreted as validation of the presented results. Additionally, I 

prioritized TFs by biological process combining scaled scores from both TF and GO terms and 

built two tier regulatory models for specific biological processes. Noteworthy, the top two TFs 

ABA-related (NAC56 and WRI2) are differentially expressed under drought and cold conditions 

(Hoopes et al., 2019),  both conditions trigger the accumulation of ABA (Cutler et al., 2010; Waadt 

et al., 2022). Interestingly, WRI2 - a homeolog of the lipid metabolism master regulator WRI1 

(Pouvreau et al., 2011) - was also in the top three of the TFs related to lipid-related metabolism 

(Figure 3.3f, second panel) highlighting molecular connections between ABA signaling/control 

and lipid metabolism (Guschina et al., 2002; Chen et al., 2020). Similarly, five out of the previously 

identified  maize  regulators  (Yang  et  al.,  2017)  and  predicted  by  here  as  regulator  of  phenolic-

related genes included LBD24, which has the higher enrichment score (rZ = 0.08) confirming its 

previous identification as a highly-connected TF within the phenolic metabolism Y1H network 

(Yang et al., 2017). Finally, within the leaf-related predictions, I find MYBR4 as the top prediction 

linked  with  leaf  morphogenesis  term  (GO:0009965)  (Figure  4.3f,  four  panel).  The  closest 

MYBR4’s protein in Arabidopsis is AtMYB46 (AT5G12870, 68% identity and 29% coverage), 

which  is  a  direct  target  of  SECONDARY  WALL-ASSOCIATED  NAC  DOMAIN  PROTEIN1 

(SND1) and works as regulator of secondary wall biosynthesis in fibers and vessels in Arabidopsis 

(Zhong et al., 2007). Thus, it provides a plausible mechanism for the association of MYBR4 with 

leaf morphology in maize. Altogether, this highlights the biological relevance of the associations 

this analysis predicted. I showed the presence of feed-forward motifs within my results, which are 

known  as  a  mechanism  for  reinforcement  of  regulatory  signals  (Alon,  2007).  Together,  these 

117 

 
discoveries expand the anticipation of TF regulators of metabolic pathways to encompass a wider 

array of biological processes, bearing significant implications for forthcoming biotechnological 

applications, such as precise modifications of developmental processes, for instance. 

In summary, I built four different gene networks in maize, which included the re-analysis of 

almost  300  PDI  assays  under  the  same  pipeline,  and  the  associations  of  ~15M  with  public 

expression  in  a  population  of  >300  inbred  lines.  I  integrated  these  datasets  with  co-expression 

networks from our previous work (Zhou et al., 2020). Considering the inherent challenge posed 

by the variations in each respective network, I examined three distinct integration methods and 

employed two different strategies to functionally annotate TFs. Our findings demonstrated that 

integrating  all  layers,  followed  by  the  identification  of  highly-similar  genes  based  on  their 

embeddings,  enabled  the  identification  of  genes  which  functionally  allows  the  annotation  of 

>1,000  TFs.  Notably,  the  embedding  similarities  create  a  network  of  gene-gene  associations 

involving over 24K genes. This study focused exclusively on regulatory-related genes, such as TFs 

and  coregulators.  Nevertheless,  I  foresee  that  the  resources  provided  here  for  the  remaining 

unexplored ~22k genes will also offer a wider array of functionalities. 

4.5 METHODS 

4.5.1 Genetic markers  

A set of 304 diverse inbred lines with publicly available SNP and gene expression information 

were included in our eQTL analysis (Bukowski et al., 2018; Kremling et al., 2018; Mazaheri et al., 

2019).  SNP  marker  data  from  whole  genome  sequencing  along  with  RNA-seq  were  combined 

between studies based on physical positions. In the case that an overlap was observed between the 

two  datasets,  the  RNA-seq  marker  was  preferentially  kept.  The  expression  datasets  capture 

variation both at the genotypic and tissue level. 

118 

 
4.5.2 RNA-seq and co-expression data 

All the RNA-seq and co-expression datasets utilized here were previously published (Zhou et 

al., 2020), except for co-expression network 46, which was constructed using the 304 inbred lines 

analyzed for genetic markers (referred to as n304). Specifically, pre-mapped CPM values were 

gathered for the respective inbred lines and employed the exact strategy outlined by Zhou et al. 

(2020) to construct the corresponding co-expression network, ensuring comparability among all 

46 networks. All co-expression networks were based on RandomForestRegressor and using the 

top 100K association by TF. 

4.5.3 eQTL identification and classification 

eQTLs were identified using eight distinct tissue types encompassing different developmental 

stages from germination to plant maturity (GRoot, Gshoot, Kern, L3Base, L3Tip, LMAD, LMAN, 

and seedling) (Bukowski et al., 2018; Kremling et al., 2018; Mazaheri et al., 2019). SNPs were 

filtered by removing non-biallelic markers, and those with a minor allele frequency < 0.05. Each 

of the tissue-specific expression datasets were filtered independently by retaining genes with ≥ 6 

reads  in  ≥  20%  of  samples  and  ≥  0.1  TPM  in  ≥  20%  of  samples.  After  filtering,  it  was  tested 

between  15.5M  -  16.7M  SNPs  against  the  expression  of  15.3K  -  26.4K  genes  across  the  eight 

tissue types. Briefly, to test SNP-gene associations, a series of eight candidate linear models were 

fitted beginning with a naive T-test then progressively controlling for different levels of kinship 

and  population  structure  in  a  mixed  linear  model.  For  each  model  tested,  the  association  was 

deemed significant if the observed P-value surpassed the 10K permutation threshold computed for 

each gene. Non-significant eQTLs were discarded when the association was supported by fewer 

than two of the candidate linear models and when the association involved non-syntenic genes 

(Schnable, 2019). The significant associations were classified as cis-eQTLt, trans/cis-eQTL, cis-

119 

 
eQTL, trans-eQTL, and unassigned eQTL according to the distance between each eQTL and its 

corresponding target gene as well as its co-location with annotated maize genes (genome B73-V4, 

Figure S4.2a).  

4.5.4 Protein-DNA interactions data analysis 

Raw  reads  from  classic  ChIP-  (Bolduc  et  al.,  2012;  Morohashi  et  al.,  2012;  Eveland  et  al., 

2014; Pautler et al., 2015; Li et al., 2018; Zhan et al., 2018; Dong et al., 2019), ChIP-seq from 

protoplast (pChIP-seq) (Tu et al., 2020), and DAP-seq (Ricci et al., 2019; Dong et al., 2020) were 

collected  from  publicly  available  dataset.  Reads  quality  control  and  peaks  identification  was 

performed  as  reported  previously  (Gomez-Cano  et  al.,  2022).  Briefly,  read  quality  control  was 

performed  using  FastQC  (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/,  V0.11.5). 

Adapters and low quality reads were trimmed with Trimmomatic (Bolger et al., 2014) using the 

following 

parameters: 

ILLUMINACLIP:Adapter.fastq:2:40:15  SLIDINGWINDOW:4:20 

MINLEN:30.  Cleaned  reads  were  mapped  to  the  maize  genome  (V4)  (Jiao  et  al.,  2017)  with 

Bowtie2 v2.3.4.1 (Langmead and Salzberg, 2012) and only using nuclear chromosomes. Multi-

mapping reads were filtered with Samtools v1.9 (Li et al., 2009)  (q 30). Peaks were called using 

GEM v3.4 (Guo et al., 2012). In DAP-seq assays, HALO vector was used as a control. Peaks from 

classic  ChIP-seq  were  called  including  duplicates  and  using  the  corresponding  mutants  or  tag-

protein as control. Finally, ChIP from prototals were called including replicates. All of them used 

the following parameters: --d Read_Distribution_default.txt --k_min 6 --k_max 15 --k_seqs 2000 

--outNP --sl. Only TFs with >500 predicted peaks were used for further analysis.  

Peak quality control. Peaks were filtered by testing overlapping with ACRs, and by scaling the 

number of Counts per Million (CPMs) by peaks per assay. Briefly, peaks with a Z-score larger 

than  -0.5  and  mapping  to  ACR  were  kept  for  further  analysis.  The  CPMs  were  obtained  after 

120 

 
extending the peaks’ summit 50 bps and converted to SAF files for counting mapped reads per 

peak using Rsubread v1.32.2 (Liao et al., 2019). Available accessible chromatin regions (ACRs) 

were collected from previously published maize ATLAS (Marand et al., 2021). Peaks with a Z-

score ≤ -0.5 and mapping to ACRs were retained for further analysis.  

4.5.5 Functional annotation 

All functional annotations were performed after discarding non syntenic genes. PWYs were 

collected  from  CornCYC  (Andorf  et  al.,  2016),  and  GO  terms  were  obtained  from  GAMER 

(Wimalanathan  et  al.,  2018).  Syntenic  genes  were  defined  based  on  Schnable  et  al.  (2019). 

Enrichment analysis for PWYs and GO terms was conducted in R using the GeneOverlap (v1.30.0) 

and topGO (v2.46.0) packages, respectively. GO term semantic similarity was calculated using the 

GOSemSim  (v2.20.0)  package  in  R,  employing  the  "Wang"  method.  For  common  function 

analysis, all GO comparisons were performed on the original set of enriched GO terms, and after 

the comparisons (GO semantic similarity), all GO terms were mapped to their closest parent GO 

terms  using  the  R  package  Rrvgo  (v1.6)  (Sayols,  2023).  Similarly,  all  GO  terms  significantly 

enriched from common target and network-based analyses were mapped to its closed parent before 

any description to reduce redundancy. 

4.5.6 Network integration 

Common  interactions.  I  compared  TF-target  associations  across  all  layers  (i.e.,  GAN,  GRN, 

eGRN, and CEN networks) and considered interactions as common when they are present in at 

least two different layers for the same corresponding TF. Subsequently, I assessed the enrichment 

of PWYs and GO terms using the common TF interactions. Any significant GO terms were then 

mapped to their closest parent terms. 

121 

 
Common function. Common functions were identified by testing the enrichment of target genes 

with PWYs and GO terms for each TF in each layer. TFs that had at least one PWY/GO term 

enriched in at least two different layers were retained to assess common predictions across layers 

(Figure S3.5a). The similarity in PWY predictions among layers was performed by comparing all 

PWYs between layers for each TF using a Fisher exact test. Enrichment test was used because a 

single gene could be annotated in multiple PWYs. Hence, PWYS were considered overplayed only 

if they exhibited a significant number of overlapped genes (P-value < 0.05). The similarity in GO 

term predictions were performed by measuring the semantic similarity between the corresponding 

terms. Significant GO terms are then mapped to their closest parent terms.  

Network-based. All four layers were combined and then scaled the interaction frequencies from 

0.5  to  1,  as  follow  0.5  +  (0.5/4)*N,  being  N  the  number  of  times  that  same  interaction  was 

observed.  Embeddings  of  the  scaled  network  were  identified  with  PecanPy  (Liu  and  Krishnan, 

2021) using the following parameters: --weighted --dimensions 50 --walk-length 80 --num-walks 

10 --directed. Gene similarity was assessed by computing the mutual rank (MR) of the mutual 

information (MI) using the following formula: MRMI = sqrt(MI_rank * tMI_rank), where MI_rank 

represents the rank of the MI matrix and tMI_rank represents the transpose matrix of MI_rank. 

The  MI  was  calculated  using  the  R  package  Parmigene  (Sales  and  Romualdi,  2011).  To  select 

highly similar genes by TF based on its MRMI I used a decay function as follows: D = e-(MRMI -1)/50. 

D values  ≤ 0.05 were taken as highly similar (Wisecaver et al., 2017). After identifying genes 

highly similar per TF, I proceed to test the enrichment of PWY and GO terms. Significant GO 

terms are then mapped to their closest parent terms. 

122 

 
 
 
4.5.7 Knockout and random network validation 

PWY  and  GO  term  predictions  for  TFs  were  contrasted  with  enriched  PWY  and  GO  terms 

identified in knockout analysis using DEGs and their corresponding log2FC values. I used data 

from previous studies for KN1 (Bolduc et al., 2012), RA1 (Eveland et al., 2014), FEA4 (Pautler et 

al., 2015), O2 (Zhan et al., 2018), bZIP22 (Li et al., 2018), and TB1 (Dong et al., 2019) reanalyzed 

in  (Zhou  et  al.,  2020).  Additional  knockout  data  (MYBR32_m1,  WRKY82_m1,  HSF13m1m2, 

HSF18m1,  HSF20m1,  HSF29m1,  HSF29m2,  WRKY2m2,  WRKY8m1,  and  WRKY8m2)  were 

collected from (Ellison et al., 2023). The enrichment of PWY and GO terms were performed with 

DEG selected based on adjusted P-value as reported by DESeq2 (Padj ≤ 0.05) (Love et al., 2014) 

and following indication described above (Methods session 4.5.5). The similarities between PWYs 

and  GO  terms  predicted  by  each  integration  method  and  those  observed  in  the  corresponding 

knockout  were  estimated  using  PWY  overlapping  and  GO  semantic  similarity,  as  previously 

described  (Methods  sections  4.5.5).  Gene  set  enrichment  analyses  were  conducted  using  the  R 

package FGSEA  (v1.20) (Korotkevich et al., 2021)  with the parameters: minSize = 5, maxSize = 

1000, and eps = 0. The gene sets tested were defined based on the predicted PWYs and GO terms 

for  each  TF,  considering  the  available  knockout  data  (Methods  section  4.5.6).  The  fraction  of 

recovered predictions was calculated by determining the number of significant (P-value ≤ 0.05) 

PWYs and GO terms out of the total tested. 

The comparison of each method's prediction against the random networks was conducted by 

randomizing  each  of  the  four  initial  networks  (GRN,  eGRN,  CEN,  and  GAN)  3,000  times, 

generating 3,000 random versions of each layer. Subsequently, I annotated and integrated each set 

of  random  networks  following  the  procedures  described  in  Methods  sections  4.5.5  and  4.5.6, 

similar to the original networks. All random networks were generated using the “rewire” function 

123 

 
from the R package Igraph (v1.2.4.1), with the following parameters:  avoided loops and with niter 

= NodesInNetwork * 10000).  

4.5.8 Prioritization of transcriptional regulators-process associations 

All prioritization analyses were conducted using network-based results. GO terms with less 

than 800 genes were retained, and after mapping excessively specific GO terms (≤ 50 genes) to 

their  corresponding  GO  terms  parent.  Mapping  to  parent  terms  was  performed  following  the 

procedures described in Methods section 3.5.5. Then, I proceeded to calculate the enrichment score 

associated with each TF-GO association as follow:  

Eij = Log2[(c/t)/(p/u)] 

Where Eij is the enrichment score of the TFi with the GOj, c is the intersection of target genes 

of  TFi  and  annotated  genes  on  GOj,  t  is  the  total  number  of  target  genes  of  TFi,  p  is  the  total 

number of genes annotated on GOj, and u is the total number of genes in maize, which in this case 

refers to the total number of syntenic genes with sorghum (Schnable, 2019). All Eij values were 

subsequently normalized by each TFi and GOj as follows:  

Zi = (Eij - Ui)/𝜎i 

and  

Zj = (Eij - Uj)/𝜎j 

Here, Ui and Uj represent the average enrichment score value for all the GOj associated with 

TFi and all the TFi associated with GOj, respectively. Similarly, 𝜎i and 𝜎j represent the standard 

deviation  of  the  enrichment  score  value  for  all  the  GOj  associated  with  TFi  and  all  the  TFi 

associated with GOj, respectively. Finally, I calculated the reciprocal Z-score (rZ) as follow: 

rZij = sqrt (max(0, Zi)^2 + max(0, Zj)^2) 

124 

 
 
4.5.9 Similarities in sequence among TF paralogs 

Sequences for all peptides associated with the corresponding pair of paralogs were collected 

from MaizeGDB (https://maizegdb.org/) using genome v4 (Jiao et al., 2017). TFs' similarities were 

calculated by averaging the Hamming distance between all amino acid sequences associated with 

the respective TFs. The Hamming distance was computed using the R package DECIPHER (v2.22) 

(Wright,  2016)  and 

the  "DistanceMatrix" 

function  with 

the 

following  parameters: 

includeTerminalGaps = TRUE, penalizeGapLetterMatches = TRUE, and correction = "none". 

125 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
REFERENCES 

de Abreu E Lima, F., Li, K., Wen, W., Yan, J., Nikoloski, Z., Willmitzer, L., and Brotman, Y. 
(2018). Unraveling lipid metabolism in maize with time-resolved multi-omics data. Plant J. 
93: 1102–1115. 

Alon, U. (2007). Network motifs: theory and experimental approaches. Nat. Rev. Genet. 8: 450–

461. 

Andorf, C.M. et al. (2016). MaizeGDB update: new tools, data and interface for the maize model 

organism database. Nucleic Acids Res. 44: D1195–201. 

Arda, H.E. and Walhout, A.J.M. (2010). Gene-centered regulatory networks. Brief. Funct. 

Genomics 9: 4–12. 

Bartrina, I., Otto, E., Strnad, M., Werner, T., and Schmülling, T. (2011). Cytokinin regulates 

the activity of reproductive meristems, flower organ size, ovule formation, and thus seed 
yield in Arabidopsis thaliana. Plant Cell 23: 69–80. 

Birkenbihl, R.P., Liu, S., and Somssich, I.E. (2017). Transcriptional events defining plant 

immune responses. Curr. Opin. Plant Biol. 38: 1–9. 

Bolduc, N., Yilmaz, A., Mejia-Guerra, M.K., Morohashi, K., O’Connor, D., Grotewold, E., 

and Hake, S. (2012). Unraveling the KNOTTED1 regulatory network in maize meristems. 
Genes Dev. 26: 1685–1690. 

Bolger, A.M., Lohse, M., and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina 

sequence data. Bioinformatics 30: 2114–2120. 

Bowles, A.M.C., Bechtold, U., and Paps, J. (2020). The Origin of Land Plants Is Rooted in Two 

Bursts of Genomic Novelty. Curr. Biol. 30: 530–536.e2. 

Brkljacic, J. and Grotewold, E. (2017). Combinatorial control of plant gene expression. Biochim. 

Biophys. Acta 1860: 31–40. 

Bukowski, R. et al. (2018). Construction of the third-generation Zea mays haplotype map. 

Gigascience 7: 1–12. 

Chen, K., Li, G.-J., Bressan, R.A., Song, C.-P., Zhu, J.-K., and Zhao, Y. (2020). Abscisic acid 

dynamics, signaling, and functions in plants. J. Integr. Plant Biol. 62: 25–54. 

Colasanti, J., Yuan, Z., and Sundaresan, V. (1998). The indeterminate gene encodes a zinc finger 
protein and regulates a leaf-generated signal required for the transition to flowering in 
maize. Cell 93: 593–603. 

Cutler, S.R., Rodriguez, P.L., Finkelstein, R.R., and Abrams, S.R. (2010). Abscisic acid: 
emergence of a core signaling network. Annu. Rev. Plant Biol. 61: 651–679. 

Deplancke, B. et al. (2006). A gene-centered C. elegans protein-DNA interaction network. Cell 

126 

 
125: 1193–1205. 

Depuydt, T., De Rybel, B., and Vandepoele, K. (2023). Charting plant gene functions in the 

multi-omics and single-cell era. Trends Plant Sci. 28: 283–296. 

Dong, Z., Xiao, Y., Govindarajulu, R., Feil, R., Siddoway, M.L., Nielsen, T., Lunn, J.E., 

Hawkins, J., Whipple, C., and Chuck, G. (2019). The regulatory landscape of a core 
maize domestication module controlling bud dormancy and growth repression. Nat. 
Commun. 10: 3810. 

Dong, Z., Xu, Z., Xu, L., Galli, M., Gallavotti, A., Dooner, H.K., and Chuck, G. (2020). 

Necrotic upper tips1 mimics heat and drought stress and encodes a protoxylem-specific 
transcription factor in maize. Proc. Natl. Acad. Sci. U. S. A. 117: 20908–20919. 

Ellison, E.L., Zhou, P., Hermanson, P., Chu, Y.-H., Read, A., Hirsch, C.N., Grotewold, E., and 

Springer, N.M. (2023). Mutator transposon insertions within maize genes often provide a 
novel outward reading promoter. bioRxiv: 2023.06.05.543741. 

Erenstein, O., Jaleta, M., Sonder, K., Mottaleb, K., and Prasanna, B.M. (2022). Global maize 
production, consumption and trade: trends and R&D implications. Food Security 14: 1295–
1319. 

Eveland, A.L. et al. (2014). Regulatory modules controlling maize inflorescence architecture. 

Genome Res. 24: 431–443. 

Galli, M., Khakhar, A., Lu, Z., Chen, Z., Sen, S., Joshi, T., Nemhauser, J.L., Schmitz, R.J., 

and Gallavotti, A. (2018). The DNA binding landscape of the maize AUXIN RESPONSE 
FACTOR family. Nat. Commun. 9: 4526. 

Gomez-Cano, F., Chu, Y.-H., Cruz-Gomez, M., Abdullah, H.M., Lee, Y.S., Schnell, D.J., and 

Grotewold, E. (2022). Exploring Camelina sativa lipid metabolism regulation by combining 
gene co-expression and DNA affinity purification analyses. Plant J. 

Guo, Y., Mahony, S., and Gifford, D.K. (2012). High resolution genome wide binding event 

finding and motif discovery reveals transcription factor spatial binding constraints. PLoS 
Comput. Biol. 8: e1002638. 

Gupta, O.P., Deshmukh, R., Kumar, A., Singh, S.K., Sharma, P., Ram, S., and Singh, G.P. 

(2021). From gene to biomolecular networks: a review of evidences for understanding 
complex biological function in plants. Curr. Opin. Biotechnol. 74: 66–74. 

Guschina, I.A., Harwood, J.L., Smith, M., and Beckett, R.P. (2002). Abscisic acid modifies the 

changes in lipids brought about by water stress in the moss Atrichum androgynum. New 
Phytol. 156: 255–264. 

Han, L. et al. (2023). A multi-omics integrative network map of maize. Nat. Genet. 55: 144–153. 

Hoopes, G.M., Hamilton, J.P., Wood, J.C., Esteban, E., Pasha, A., Vaillancourt, B., Provart, 
N.J., and Buell, C.R. (2019). An updated gene atlas for maize reveals organ-specific and 

127 

 
stress-induced genes. Plant J. 97: 1154–1167. 

Hufford, M.B. et al. (2021). De novo assembly, annotation, and comparative analysis of 26 diverse 

maize genomes. Science 373: 655–662. 

Jiao, Y. et al. (2017). Improved maize reference genome with single-molecule technologies. Nature 

546: 524–527. 

Jin, M. et al. (2017). Integrated genomics-based mapping reveals the genetics underlying maize 

flavonoid biosynthesis. BMC Plant Biol. 17: 17. 

Kono, T.J.Y., Brohammer, A.B., McGaugh, S.E., and Hirsch, C.N. (2018). Tandem Duplicate 

Genes in Maize Are Abundant and Date to Two Distinct Periods of Time. G3 8: 3049–3058. 

Korotkevich, G., Sukhov, V., Budin, N., Shpak, B., Artyomov, M.N., and Sergushichev, A. 

(2021). Fast gene set enrichment analysis. bioRxiv: 060012. 

Kremling, K.A.G., Chen, S.-Y., Su, M.-H., Lepak, N.K., Romay, M.C., Swarts, K.L., Lu, F., 

Lorant, A., Bradbury, P.J., and Buckler, E.S. (2018). Dysregulation of expression 
correlates with rare-allele burden and fitness loss in maize. Nature 555: 520–523. 

Kryuchkova-Mostacci, N. and Robinson-Rechavi, M. (2017). A benchmark of gene expression 

tissue-specificity metrics. Brief. Bioinform. 18: 205–214. 

Kusmec, A., Srinivasan, S., Nettleton, D., and Schnable, P.S. (2017). Distinct genetic 

architectures for phenotype means and plasticities in Zea mays. Nat Plants 3: 715–723. 

Langmead, B. and Salzberg, S.L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. 

Methods 9: 357–359. 

Lee, T., Lee, S., Yang, S., and Lee, I. (2019). MaizeNet: a co-functional network for network-

assisted systems genetics in Zea mays. Plant J. 99: 571–582. 

Liao, Y., Smyth, G.K., and Shi, W. (2019). The R package Rsubread is easier, faster, cheaper and 

better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 47: 
e47–e47. 

Li, C., Yue, Y., Chen, H., Qi, W., and Song, R. (2018). The ZmbZIP22 Transcription Factor 

Regulates 27-kD γ-Zein Gene Transcription during Maize Endosperm Development. Plant 
Cell 30: 2402–2424. 

Li, H. et al. (2013). Genome-wide association study dissects the genetic architecture of oil 

biosynthesis in maize kernels. Nat. Genet. 45: 43–50. 

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., 
Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The Sequence 
Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079. 

Li, L., Briskine, R., Schaefer, R., Schnable, P.S., Myers, C.L., Flagel, L.E., Springer, N.M., 

and Muehlbauer, G.J. (2016). Co-expression network analysis of duplicate genes in maize 
128 

 
(Zea mays L.) reveals no subgenome bias. BMC Genomics 17: 875. 

Liu, H. et al. (2016). MODEM: multi-omics data envelopment and mining in maize. Database 

2016. 

Liu, R. and Krishnan, A. (2021). PecanPy: a fast, efficient, and parallelized Python 

implementation of node2vec. Bioinformatics 37: 3377–3379. 

Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and 

dispersion for RNA-seq data with DESeq2. Genome Biol. 15: 550. 

Mack, K.L. and Nachman, M.W. (2017). Gene Regulation and Speciation. Trends Genet. 33: 68–

80. 

Marand, A.P., Chen, Z., Gallavotti, A., and Schmitz, R.J. (2021). A cis-regulatory atlas in maize 

at single-cell resolution. Cell 184: 3041–3055.e21. 

Marand, A.P., Eveland, A.L., Kaufmann, K., and Springer, N.M. (2023). cis-Regulatory 

Elements in Plant Development, Adaptation, and Evolution. Annu. Rev. Plant Biol. 74: 111–
137. 

Mathur, S., Vyas, S., Kapoor, S., and Tyagi, A.K. (2011). The Mediator complex in plants: 
structure, phylogeny, and expression profiling of representative genes in a dicot 
(Arabidopsis) and a monocot (rice) during reproduction and abiotic stress. Plant Physiol. 
157: 1609–1627. 

Mazaheri, M. et al. (2019). Genome-wide association analysis of stalk biomass and anatomical 

traits in maize. BMC Plant Biol. 19: 45. 

McMullen, M.D. et al. (2009). Genetic properties of the maize nested association mapping 

population. Science 325: 737–740. 

Mejia-Guerra, M.K., Pomeranz, M., Morohashi, K., and Grotewold, E. (2012). From plant 

gene regulatory grids to network dynamics. Biochimica et Biophysica Acta (BBA) - Gene 
Regulatory Mechanisms 1819: 454–465. 

Minow, M.A.A., Ávila, L.M., Turner, K., Ponzoni, E., Mascheretti, I., Dussault, F.M., Lukens, 
L., Rossi, V., and Colasanti, J. (2018). Distinct gene networks modulate floral induction of 
autonomous maize and photoperiod-dependent teosinte. J. Exp. Bot. 69: 2937–2952. 

Morohashi, K. et al. (2012). A genome-wide regulatory framework identifies maize pericarp 

color1 controlled genes. Plant Cell 24: 2745–2764. 

Nakashima, K., Yamaguchi-Shinozaki, K., and Shinozaki, K. (2014). The transcriptional 

regulatory network in the drought response and its crosstalk in abiotic stress responses 
including drought, cold, and heat. Front. Plant Sci. 5: 1–7. 

O’Malley, R.C., Huang, S.S.C., Song, L., Lewsey, M.G., Bartlett, A., Nery, J.R., Galli, M., 

Gallavotti, A., and Ecker, J.R. (2016). Cistrome and Epicistrome Features Shape the 

129 

 
Regulatory DNA Landscape. Cell 165: 1280–1292. 

Pautler, M., Eveland, A.L., LaRue, T., Yang, F., Weeks, R., Lunde, C., Je, B.I., Meeley, R., 
Komatsu, M., Vollbrecht, E., Sakai, H., and Jackson, D. (2015). FASCIATED EAR4 
encodes a bZIP transcription factor that regulates shoot meristem size in maize. Plant Cell 
27: 104–120. 

Pouvreau, B., Baud, S., Vernoud, V., Morin, V., Py, C., Gendrot, G., Pichon, J.-P., Rouster, J., 

Paul, W., and Rogowsky, P.M. (2011). Duplicate maize Wrinkled1 transcription factors 
activate target genes involved in seed oil biosynthesis. Plant Physiol. 156: 674–686. 

Reményi, A., Schöler, H.R., and Wilmanns, M. (2004). Combinatorial control of gene 

expression. Nat. Struct. Mol. Biol. 11: 812. 

Renny-Byfield, S., Rodgers-Melnick, E., and Ross-Ibarra, J. (2017). Gene Fractionation and 
Function in the Ancient Subgenomes of Maize. Mol. Biol. Evol. 34: 1825–1832. 

Ricci, W.A. et al. (2019). Widespread long-range cis-regulatory elements in the maize genome. 

Nature Plants 5: 1237–1249. 

Riedelsheimer, C., Lisec, J., Czedik-Eysenberg, A., Sulpice, R., Flis, A., Grieder, C., Altmann, 

T., Stitt, M., Willmitzer, L., and Melchinger, A.E. (2012). Genome-wide association 
mapping of leaf metabolic profiles for dissecting complex traits in maize. Proc. Natl. Acad. 
Sci. U. S. A. 109: 8872–8877. 

Rodgers-Melnick, E., Vera, D.L., Bass, H.W., and Buckler, E.S. (2016). Open chromatin reveals 

the functional maize genome. Proc. Natl. Acad. Sci. U. S. A. 113: E3177–84. 

Sales, G. and Romualdi, C. (2011). parmigene—a parallel R package for mutual information 

estimation and gene network reconstruction. Bioinformatics 27: 1876–1877. 

Sayols, S. (2023). rrvgo: a Bioconductor package for interpreting lists of Gene Ontology terms. 

MicroPubl Biol 2023. 

Schaefer, R.J., Michno, J.-M., Jeffers, J., Hoekenga, O., Dilkes, B., Baxter, I., and Myers, C.L. 
(2018). Integrating Coexpression Networks with GWAS to Prioritize Causal Genes in 
Maize. Plant Cell 30: 2922. 

Schmitz, R.J., Grotewold, E., and Stam, M. (2022). Cis-regulatory sequences in plants: Their 

importance, discovery, and future challenges. Plant Cell 34: 718–741. 

Schnable, J. (2019). Pan-Grass Syntenic Gene Set (sorghum referenced) with both maize v3 and 

maize v4 gene models. figShare. 

Schnable, J.C., Springer, N.M., and Freeling, M. (2011). Differentiation of the maize 

subgenomes by genome dominance and both ancient and ongoing gene loss. Proc. Natl. 
Acad. Sci. U. S. A. 108: 4069–4074. 

Schnable, P.S. et al. (2009). The B73 maize genome: complexity, diversity, and dynamics. Science 

130 

 
326: 1112–1115. 

Sekhon, R.S., Lin, H., Childs, K.L., Hansey, C.N., Robin Buell, C., de Leon, N., and Kaeppler, 

S.M. (2011). Genome-wide atlas of transcription during maize development. Plant J. 66: 
553–563. 

Shen, S., Zhan, C., Yang, C., Fernie, A.R., and Luo, J. (2023). Metabolomics-centered mining of 

plant metabolic diversity and function: Past decade and future perspectives. Mol. Plant 16: 
43–63. 

Shrestha, V., Yobi, A., Slaten, M.L., Chan, Y.O., Holden, S., Gyawali, A., Flint-Garcia, S., 

Lipka, A.E., and Angelovici, R. (2022). Multiomics approach reveals a role of translational 
machinery in shaping maize kernel amino acid composition. Plant Physiol. 188: 111–133. 

Stelpflug, S.C., Sekhon, R.S., Vaillancourt, B., Hirsch, C.N., Buell, C.R., de Leon, N., and 

Kaeppler, S.M. (2016). An Expanded Maize Gene Expression Atlas based on RNA 
Sequencing and its Use to Explore Root Development. Plant Genome 9. 

Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., 
Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., and Mesirov, J.P. (2005). 
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide 
expression profiles. Proc. Natl. Acad. Sci. U. S. A. 102: 15545–15550. 

Sun, Y., Oh, D.-H., Duan, L., Ramachandran, P., Ramirez, A., Bartlett, A., Tran, K.-N., 

Wang, G., Dassanayake, M., and Dinneny, J.R. (2022). Divergence in the ABA gene 
regulatory network underlies differential growth control. Nat Plants 8: 549–560. 

Tolani, P., Gupta, S., Yadav, K., Aggarwal, S., and Yadav, A.K. (2021). Big data, integrative 

omics and network biology. Adv. Protein Chem. Struct. Biol. 127: 127–160. 

Tu, X., Mejía-Guerra, M.K., Valdes Franco, J.A., Tzeng, D., Chu, P.-Y., Shen, W., Wei, Y., 
Dai, X., Li, P., Buckler, E.S., and Zhong, S. (2020). Reconstructing the maize leaf 
regulatory network using ChIP-seq data of 104 transcription factors. Nat. Commun. 11: 
5089. 

Waadt, R., Seller, C.A., Hsu, P.-K., Takahashi, Y., Munemasa, S., and Schroeder, J.I. (2022). 
Plant hormone regulation of abiotic stress responses. Nat. Rev. Mol. Cell Biol. 23: 680–694. 

Wahl, V., Ponnu, J., Schlereth, A., Arrivault, S., Langenecker, T., Franke, A., Feil, R., Lunn, 
J.E., Stitt, M., and Schmid, M. (2013). Regulation of flowering by trehalose-6-phosphate 
signaling in Arabidopsis thaliana. Science 339: 704–707. 

Walley, J.W., Sartor, R.C., Shen, Z., Schmitz, R.J., Wu, K.J., Urich, M.A., Nery, J.R., Smith, 
L.G., Schnable, J.C., Ecker, J.R., and Briggs, S.P. (2016). Integration of omic networks 
in a developmental atlas of maize. Science 353: 814–818. 

Wei, F. et al. (2007). Physical and genetic structure of the maize genome reflects its complex 

evolutionary history. PLoS Genet. 3: e123. 

131 

 
Wen, W. et al. (2018). An integrated multi-layered analysis of the metabolic networks of different 

tissues uncovers key genetic components of primary metabolism in maize. Plant J. 93: 
1116–1128. 

Wen, W., Li, D., Li, X., Gao, Y., Li, W., Li, H., Liu, J., Liu, H., Chen, W., Luo, J., and Yan, J. 

(2014). Metabolome-based genome-wide association study of maize kernel leads to novel 
biochemical insights. Nat. Commun. 5: 3438. 

Wen, W., Liu, H., Zhou, Y., Jin, M., Yang, N., Li, D., Luo, J., Xiao, Y., Pan, Q., and Tohge, T. 

(2016). Combining quantitative genetics approaches with regulatory network analysis to 
dissect the complex metabolism of the maize kernel. Plant Physiol. 170: 136–146. 

Wimalanathan, K., Friedberg, I., Andorf, C.M., and Lawrence-Dill, C.J. (2018). Maize GO 
Annotation-Methods, Evaluation, and Review (maize-GAMER). Plant Direct 2: e00052. 

Wisecaver, J.H., Borowsky, A.T., Tzin, V., Jander, G., Kliebenstein, D.J., and Rokas, A. 

(2017). A Global Coexpression Network Approach for Connecting Genes to Specialized 
Metabolic Pathways in Plants. Plant Cell 29: 944–959. 

Wright, E. (2016). Using DECIPHER v2.0 to analyze big biological sequence data in R. R J. 8: 

352. 

Xiao, Y., Liu, H., Wu, L., Warburton, M., and Yan, J. (2017). Genome-wide Association 

Studies in Maize: Praise and Stargaze. Mol. Plant 10: 359–374. 

Yang, F. et al. (2017). A Maize Gene Regulatory Network for Phenolic Metabolism. Mol. Plant 10: 

498–515. 

Yang, F., Ouma, W.Z., Li, W., Doseff, A.I., and Grotewold, E. (2016). Establishing the 

Architecture of Plant Gene Regulatory Networks. Methods Enzymol. 576: 251–304. 

Yang, Z., Xu, G., Zhang, Q., Obata, T., and Yang, J. (2022). Genome-wide mediation analysis: 
an empirical study to connect phenotype with genotype via intermediate transcriptomic data 
in maize. Genetics 221. 

Yilmaz, A., Nishiyama, M.Y., Garcia-Fuentes, B., Souza, G.M., Janies, D., Gray, J., and 

Grotewold, E. (2009). GRASSIUS: A platform for comparative regulatory genomics across 
the grasses. Plant Physiol. 149: 171–180. 

Zhan, J., Li, G., Ryu, C.H., Ma, C., Zhang, S., Lloyd, A., Hunter, B.G., Larkins, B.A., Drews, 
G.N., Wang, X., and Yadegari, R. (2018). Opaque-2 Regulates a Complex Gene Network 
Associated with Cell Differentiation and Storage Functions of Maize Endosperm. Plant Cell 
30: 2425–2446. 

Zheng, Y. et al. (2016). iTAK: A Program for Genome-wide Prediction and Classification of Plant 
Transcription Factors, Transcriptional Regulators, and Protein Kinases. Mol. Plant 9: 1667–
1670. 

Zhong, R., Richardson, E.A., and Ye, Z.H. (2007). The MYB46 transcription factor is a direct 

132 

 
target of SND1 and regulates secondary wall biosynthesis in Arabidopsis. Plant Cell 19: 
2776–2792. 

Zhou, P., Enders, T.A., Myers, Z.A., Magnusson, E., and Crisp, P.A. (2021). Applying cis-

regulatory codes to predict conserved and variable heat and cold stress response in maize. 
bioRxiv. 

Zhou, P., Li, Z., Magnusson, E., Gomez Cano, F., Crisp, P.A., Noshay, J.M., Grotewold, E., 
Hirsch, C.N., Briggs, S.P., and Springer, N.M. (2020). Meta Gene Regulatory Networks 
in Maize Highlight Functionally Relevant Regulatory Interactions. Plant Cell 32: 1377–
1396. 

Zhou, S., Kremling, K.A., Bandillo, N., Richter, A., Zhang, Y.K., Ahern, K.R., Artyukhin, 

A.B., Hui, J.X., Younkin, G.C., Schroeder, F.C., Buckler, E.S., and Jander, G. (2019). 
Metabolome-Scale Genome-Wide Association Studies Reveal Chemical Diversity and 
Genetic Control of Maize Specialized Metabolites. Plant Cell 31: 937–955. 

Zhu, G., Wu, A., Xu, X.-J., Xiao, P.-P., Lu, L., Liu, J., Cao, Y., Chen, L., Wu, J., and Zhao, 

X.-M. (2016). PPIM: A Protein-Protein Interaction Database for Maize. Plant Physiol. 170: 
618–626. 

133 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
APPENDIX 

Figure S4.1 TFs and interactions used in the layer of co-expression network (CEN) 
a. Histogram showing the frequency of TFs with at least a target gene per CEN. Dotted gray line 
indicates average TFs in all 46 CEN. b. Boxplot indicating total target genes per TFs across the 
different  CENs.  CEN  are  named  following  Zhou  et  al.,  (2022)  nomenclature.  Orange  labels 
highlight  TFs  with  the  largest  number  of  target  genes  in  several  CEN.  c.  Histogram  with  the 
frequency of total target genes per TF after combined results from all 46 CENs. 

134 

 
 
 
 
Figure S4.2 Defining a gene association network (GAN) based on trans-eQTL 
a.  model  indicating  total  eQTLs  identified  and  the  classification  schema  used  to  define  trans-
eQTL, trans/cis-eQTL, cis-eQTLt, and cis-eQTL. Within them, trans-eQTLs were used to define 
the GAN. In the context of trans-eQTLs, a source gene (in blue) was defined as a gene whose 
promoter  (2kb  upstream  from  TSS)  or  gene  baby  overlapped  with  an  eQTL.  Genes  whose 
expression is explained by the SNP variation were defined as gene targets (gene in yellow). b and 
c. I classify each source and target gene into five functional categories to count the number of  

135 

 
 
Figure S4.2 (cont’d) 
associations by category (unclassified genes defined as other). Left panel, Boxplot indicating the 
number of targets (b) and source (c) genes by each gene category. Right panel, Stacked bar plots 
indicate the fraction of each gene category over the total genes in GAN. d. Bar plot indicating total 
interactions by gene category pair.  

136 

 
 
 
 
Figure S4.3 Establishing the maize gene regulatory network (GRN) layer based on protein-
DNA interaction data 
a. Density plot with distribution of peaks by PDI data type. b and c. Stacked bar plot with fraction 
of peaks mapped to accessible chromatin region (ACR) (b) and with low peak coverage (c) (CPM 
scaled and filtered; Z ≤ -0.5). d. Locally weighted scatterplot smoothing (LOESS) line plot of Z-
scores by peak in 10 kb bins around 200 kb of the closest transcription start site (TSS).  

137 

 
 
Figure S4.3 (cont’d) 
e. Classification schema (top) and corresponding proportion of total combined peaks (bottom, first 
stacked bar plot) and peaks by method (bottom, second stacked bar plot) utilized for determining 
target genes.  

138 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure S4.4 Strategy to annotate TFs based on common targets 
a. Schema of pipeline used to annotate TFs based on common target genes amount layer (GAN, 
GRN, eGRN, and CEN). b and c. Venn diagram indicating the number of common TFs with at 
least one and ten target genes.  d. Venn diagram indicating total common interactions (TF-target 
gene) among layers.  

139 

 
 
 
Figure S4.5 Strategy to annotate TFs based on common functions 
a. Schema of pipeline used to annotate TFs based on common function amount layer (GAN, GRN, 
eGRN, and CEN). b. Bar plot indicating total TFs annotated by layer and by type of function. c 
and  d.  Venn  diagram  indicating  the  number  of  TFs  with  at  least  a  PWY  (c)  and  GO  term  (d) 
commonly enriched among the corresponding layers. 

140 

 
 
 
 
 
 
 
Figure S4.6 Network-based strategy to annotate TFs 
a.  Schema  of  pipeline  used  on  the  integration  of  layers  to  identify  TF  with  similar  topological 
properties,  defined  here  as  network-based  TF  annotation.  b.  Histogram  plot  indicating  the 
distribution of genes associated per TF. c. Stacked bar plot with total TFs annotated by enrichment  

141 

 
 
Figure S4.6 (cont’d) 
with PWYs and GO terms. d. Bar plot indicating the percentage of TFs annotated for the 82 TF 
families (and co-regulator) with at least a TF annotated (c). 

142 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a

b

]
s
G
E
D
n
i

s
Y
W
P
[
F
T

BAF6021m1_tassel [26]
BAF6021m2_ntassel [3]
bZIP22 [16]
BZIP76m2_leaf [50]
BZIP76m3_leaf [40]
C3H42m1_tassel_stem [50]
E2F13m1_coleoptile [10]
E2F19m1_leaf [25]
E2F19m2_leaf [37]
FEA4 [27]
GRAS52m1_embryo [42]
GRAS75m1_embryo [38]
HSF13m1m2_leaf [41]
HSF18m1_embryo [37]
HSF20m1_embryo [36]
HSF24m3_tassel [60]
HSF24m4_tassel [31]
HSF29m1_embryo [36]
HSF29m2_embryo [30]
HSF6m1_embryo [29]
HSF6m2_embryo [40]
JMJ13m4_tassel [7]
KN1_leaf [42]
KN1_SAM [42]
KN1_tassel [58]
KN1:near [22]
MYB40_m1:coleoptile_tip [6]
MYB40_m2:coleoptile_tip [6]
MYBR21m1_embryo [38]
MYBR32m1_leaf [54]
O2 [51]
ORPHAN249m2_embryo [30]
RA1 [12]
SBP20m2_embryo [25]
SBP20m3_embryo [40]
TB1:buds_12DAP [26]
TB1:buds_8DAP [30]
WRKY2m2_coleoptile [8]
WRKY82m1_embryo [48]
WRKY87m1_embryo [43]
WRKY87m2_embryo [37]
WRKY8m1_embryo [39]
WRKY8m2_embryo [52]

0

0

0

0

1

1

0

0

0

0

4

5

0

0

0

1

2

7

1

0

2

0

1

0

0

0

0

0

0

3

0

0

0

2

0

0

0

0

0

1

0

0

0

0

0

0

0

0

1

0

0

0

0

0

0

0

2

0

1

0

0

0

0

0

0

0

d
o
h
t
e

M

Network−based
Network−based
Network−based
Comm.Function
Comm.Target
Network−based
Network−based
Network−based
Network−based
Network−based
Network−based
Network−based
Comm.Target
Network−based
Network−based
Comm.Target
Network−based
Comm.Target
Network−based
Network−based
Comm.Function
Comm.Target
Network−based
Network−based
Network−based
Network−based
Network−based
Network−based
Network−based
Network−based
Comm.Target
Network−based
Comm.Target
Network−based
Comm.Target
Network−based
Comm.Target
Network−based
Network−based
Comm.Function
Comm.Target
Comm.Function
Comm.Target
Network−based
Network−based
Comm.Target
Network−based
Comm.Target
Network−based
Comm.Target
Network−based
Comm.Function
Comm.Target
Network−based
Comm.Function
Comm.Target
Network−based
Network−based
Comm.Target
Network−based
Network−based
Network−based
Network−based

BAF6021m1_tassel

BAF6021m2_tassel

bZIP22_kernel

BZIP76m2_leaf

BZIP76m3_leaf

C3H42m1_tassel_stem

E2F13m1_coleoptile

E2F19m1_leaf

E2F19m2_leaf

FEA4_ear

GRAS52m1_embryo

GRAS75m1_embryo

HSF13m1m2_leaf

HSF18m1_embryo

HSF20m1_embryo

HSF24m3_tassel

HSF24m4_tassel

HSF29m1_embryo

HSF29m2_embryo

HSF6m1_embryo

HSF6m2_embryo

JMJ13m4_tassel

KN1_ear

KN1_leaf

KN1_SAM

KN1_tassel

MYBR21m1_embryo

MYBR32m1_leaf

O2

ORPHAN249m2_embryo

RA1

SBP20m2_embryo

SBP20m3_embryo

TB1_buds_12DAP

TB1_buds_8DAP

WRKY2m2_coleoptile

WRKY82m1_embryo

WRKY87m1_embryo

WRKY87m2_embryo

WRKY8m1_embryo

WRKY8m2_embryo

.

t
e
g
r
a
T
m
m
o
C

n
o
i
t
c
n
u
F
m
m
o
C

.

e
s
a
b
.
k
r
o
w
t
e
N

0
0
.
0

5
2
.
0

0
5
.
0

5
7
.
0

0
0
.
1

PWYs Fraction

P-value

≥ 0.05

≤ 0.05

Figure S4.7 Predicted PWY overlapped poorly with PWY observed in knockouts assays 
a.  Heatmap  shows  the  count  of  overlapped  PWYs  between  predicted  and  observed  PWY  in 
knockouts per method. The Violet box signifies significant overlap (P-value 0.05, Fisher test). An 
empty box (white) denotes no predicted PWYs for the corresponding TF. Square braking indicates 
the number of PWYs significantly enriched in the corresponding knockout (P-value 0.05, Fisher 
test). b. Stacked bar plot indicating the fraction of predicted PWY significantly enriched on DEGs 
per knockout assay and method. 

143 

 
 
 
 
 
 
Figure S4.8 GO semantic similarities observed between predicted GO terms and enriched 
GO terms in knockout are not occurring by chance 

144 

 
 
 
 
Figure S4.8 (cont’d) 
Density  plot  displaying  the  distribution  of  random  GO  term  semantic  similarity.  The  observed 
value on real GO term enrichment, along with its corresponding P-value concerning the random 
distribution, is highlighted by the horizontal purple line. 

Figure S4.9 Target genes and expression distribution of TFs compared with knockout 
results 
a. Histogram and density plot display the scaled (Z-score) number of target genes for each layer. 
TFs  utilized  in  the  knockout  analysis  are  indicated  by  dotted  red  lines.  b.  Heatmap  shows  the 
presence or absence of target genes in each of the four layers for every TF analyzed in the knockout  
assays. c. Histogram and density plot of Tau index distribution for TF 2,910 TFs annotated with  
145 

 
 
Figure S4.9 (cont’d) 
at least PWY/GO term. TFs utilized in the knockout analysis are indicated by dotted purple lines. 
d. Histogram and density plot illustrate the null distribution of Tau after randomly sampling 13 
TFs a thousand times. TFs used in the knockout analysis are represented by dotted purple lines. P-
values were calculated using the null distribution as a reference. 

Figure S4.10 GO term significance and similarity distributions from random networks per 
TF 
a and b. Density ridges plot showing the average -log10FDR (a) and GSS (b) distributions in 3,00 
random  networks  for  each  TF.  The  GSS  values  were  calculated  by  comparing  each  random 
network with the observed GO terms from the true TF-target interactions. 

146 

 
 
 
 
Figure S4.11 Scale count of GO terms in random networks 
Density plot displaying the distribution of GO terms in random networks predicted by the network-
based  method  for  each  corresponding  TF.  The  observed  number  of  significantly  enriched  GO 
terms for the corresponding TF is indicated by a dotted orange line. The p-value was calculated 
using the random distribution as the null distribution.  

147 

 
 
a

O
G

r
e
p

s
F
T

;

Z
O
G

O
G

r
e
p

s
F
T

;

Z
O
G

1
0
−1
−2
−3

Auxin−related

Zm00001d046346
LBD15

Zm00001d020446

Zm00001d024814

NPL14
MYBR4 WRKY24

LEC1

EREB132

ARF12

CADR12

EREB142
Zm00001d002562
Zm00001d021995
NAC27Zm00001d052738
Zm00001d002449

Zm00001d011730

OFP35

Zm00001d021582

GRAS4

TB1

WOX3a

Zm00001d044988

GRAS27

Zm00001d002449

Zm00001d018510

WRKY134

TSH1

BD1

Zm00001d016876

Zm00001d026197

Zm00001d032217

Zm00001d015525

FEA4

Zm00001d021390

Zm00001d029711

SBP1

ABI26

EREB132

Zm00001d013074

Zm00001d006463

Zm00001d008665

Zm00001d035440

OFP35

Zm00001d026231

1
0
−1
TFZ ;GOs per TF

rZ

0
.
0

5
.
0

0
.
1

5
.
1

Zm00001d027686

Jasmonic−related
1.5
1.0
0.5
0.0
−0.5
−1.0
−1.5

Zm00001d032208

Zm00001d045617

WRKY134

Zm00001d004851

Zm00001d043423

GRAS54

Zm00001d039764

Zm00001d033957

Zm00001d026184

EREB132

MYBR90

WRI2

MYB114

bHLH128

bHLH181

−2 −1

1
0
TFZ ;GOs per TF
rZ

0
.
0

5
.
0

0
.
1

5
.
1

0
.
2

O
G

r
e
p

s
F
T

;

Z
O
G

O
G

r
e
p

s
F
T

;

Z
O
G

Brassinosteroid−related
1.0

bHLH143

Zm00001d026252

0.5

0.0

−0.5

−1.0

Zm00001d038473

Zm00001d021995

1

0
TFZ ;GOs per TF

2

rZ

0
.
0

5
.
0

0
.
1

5
.
1

0
.
2

Cytokinin−related

Salicylic−related

O
G

r
e
p

s
F
T

;

Z
O
G

1.0
0.5
0.0
−0.5
−1.0

COL13

Zm00001d049058

ABI34

JMJ14

Zm00001d033246

Zm00001d033319

0.5 1.0 1.5
TFZ ;GOs per TF

rZ

5
.
0

0
.
1

5
.
1

0
.
2

Zm00001d031416

Zm00001d047995

Zm00001d006768

Zm00001d034664

NAC77

Zm00001d002482

Zm00001d038216

PRH48

RAP2

EREB122

1

0

−1

O
G

r
e
p

s
F
T

;

Z
O
G

0

1

2

TFZ ;GOs per TF

rZ

0
.
0

5
.
0

0
.
1

5
.
1

0
.
2

Ethylene−related

Gibberellin−related

Carotenoid−related

3
2
1
0
−1
−2

Zm00001d031272

Zm00001d041205

IAA15

bZIP26

Zm00001d011730 MYB114

TCP15

WOX2B

Zm00001d023882

WRKY86

MADS40

EREB142

Zm00001d039424

Zm00001d046292

TCP15

Zm00001d021818
OCL2

WRKY1

Zm00001d027459

OFP32

Zm00001d016175

Zm00001d017682

Zm00001d012445

GRAS60

Zm00001d017592
EREB100
Zm00001d051956
LBD24

HB56

Zm00001d009668

bHLH113

Zm00001d021279

Zm00001d013467

Zm00001d036648

Zm00001d023411

IDDP10

Zm00001d051507

Zm00001d018142

Zm00001d024436

Zm00001d010978

WRKY8

Zm00001d047359

Zm00001d016876

Zm00001d032217

Zm00001d008404

Zm00001d010264

Zm00001d047644

Zm00001d033876

Zm00001d046774

0.0 0.5 1.0 1.5
TFZ ;GOs per TF
rZ

0 1 2 3

O
G

r
e
p

s
F
T

;

Z
O
G

2

1

0

−1

−2

Zm00001d052732

ABI34

bHLH77

bHLH77

GLK7

Zm00001d042472

Zm00001d007188

WRKY8

RAP2

bHLH99

WRKY134

Zm00001d031463

Zm00001d017618

Zm00001d052102

Zm00001d039424

bHLH77

NAC119

Zm00001d046346

Zm00001d047330

Zm00001d029553

Zm00001d018776

bZIP45

Zm00001d030744

OFP36

Zm00001d043902

Zm00001d009160

Zm00001d033876

EREB185

Zm00001d009030

0.0 0.5 1.0 1.5 2.0
TFZ ;GOs per TF
rZ

0
.
0

5
.
0

0
.
1

5
.
1

0
.
2

5
.
2

Zm00001d051507

MYBR104

C3H45

Zm00001d016095

EREB122

Zm00001d046759

O
G

r
e
p

s
F
T

;

1

0

Z
O
G

−1

−1.5 −1.0 −0.5 0

0.5

TFZ ;GOs per TF

rZ

0

5
.
0

0
.
1

2
.
1

Zm00001d043663

Zm00001d046346

Zm00001d002934

SBP1

bHLH43

Zm00001d035195

Nitrogen−related

Cell wall−related

O
G

r
e
p

s
F
T

;

Z
O
G

2
1
0
−1
−2

Zm00001d052355

Zm00001d022142

EREB146

Zm00001d036594

Zm00001d038283

Zm00001d021636

WOX2B

Zm00001d037737

Zm00001d032178

Zm00001d006735

Zm00001d047359

Zm00001d002642

EREB45

WRKY56

EREB142

Zm00001d021714

Zm00001d005071

Zm00001d021545

MYB114

NAC10

Zm00001d036426

Zm00001d013200

PRH43

Zm00001d015525

WRKY133

ZAG6
Zm00001d045477

Zm00001d051739

RAP2

Zm00001d005566

Zm00001d024436

SDG129

NPL14

bZIP81

Zm00001d021927

Zm00001d022142

ZNF3

Zm00001d026184

EREB211

Zm00001d014565

Zm00001d046346

GLK7

CADR12

Zm00001d042197

Zm00001d018510

GBP20

Zm00001d032178

Zm00001d027409

Zm00001d039424

−2 −1 0

1

2
TFZ ;GOs per TF

3

rZ

0 1 2 3

O
G

r
e
p

s
F
T

;

Z
O
G

1

0

−1

Zm00001d035492

GRAS56

TRU1
Zm00001d026182

Zm00001d024627

MADS15

Zm00001d022046

Zm00001d042917

Zm00001d044940

Zm00001d042472

Zm00001d021714

mTERF1

Zm00001d003677

Zm00001d022139

NAC27

PRH48

Zm00001d006733

Zm00001d045617

Zm00001d028413

Zm00001d030907

−2

0

−1
1
TFZ ;GOs per TF

rZ

0
.
0

5
.
0

0
.
1

5
.
1

Pentose−related
2

Zm00001d046346

PRH48

TRU1

WRKY56

C3H45

Zm00001d052102

1

0

−1

O
G

r
e
p

s
F
T

;

Z
O
G

O
G

r
e
p

s
F
T

;

Z
O
G

−1.0 −0.5 0.0 0.5
TFZ ;GOs per TF

rZ

0
.
0

5
.
0

0
.
1

5
.
1

0
.
2

b

c

Sugar−related
2

Zm00001d046346

WRKY81

TRU1

C3H45

Zm00001d021636

C3H45

WRKY56

C3H45

1

0

−1

PRH48

Zm00001d052102

ELF3.1

Zm00001d049309

PRH115

Zm00001d026182

Zm00001d049320

Zm00001d018225

C3H45

JMJ25

Zm00001d039492

Zm00001d021785

Zm00001d017101

Zm00001d012330

Zm00001d021208

Zm00001d048172

Zm00001d032213

Zm00001d044168

Zm00001d016950

Zm00001d044167

−1
0
1
TFZ ;GOs per TF

2

rZ

0
.
0

5
.
0

0
.
1

5
.
1

0
.
2

Meristem−related

Pollen−related

Flower−related

Seed−related

O
G

r
e
p

s
F
T

;

Z
O
G

1

0

−1

NAC65

ZMM17

Zm00001d004897

Zm00001d008171

C3H45

MYBR40

Zm00001d049058

Zm00001d025944

OFP4

Zm00001d020932

O
G

r
e
p

s
F
T

;

Z
O
G

1

0

−1

−2

EREB63 Zm00001d043663

TUB11

Zm00001d022142

WRKY124

MYB77

bHLH129

Zm00001d048172
bHLH113

Zm00001d021785
Zm00001d013074

MYBR4

Zm00001d047330

JMJ25

Zm00001d021892

MYB116

Zm00001d034641

bHLH182

Zm00001d021714

Zm00001d021390

Zm00001d021607

OHP1

IAA15

Zm00001d012482

PRH43

Zm00001d018510

HB56

GATA32

Zm00001d021636

Zm00001d048600

Zm00001d045162

Zm00001d047752

Zm00001d020680

O
G

r
e
p

s
F
T

;

Z
O
G

1

0

−1

−2

EREB63

NAC10

GRAS28

ARF5

GRAS81

WRKY1

MYBR57

Zm00001d010616
GLK47

bHLH7

WRI2

bHLH99

OFP32

MYB116

Zm00001d034084

IAA15

CCHH2

ZAG6

THX16

Zm00001d027317

SBP4

bHLH143

Zm00001d004897

Zm00001d032502

bHLH182

O
G

r
e
p

s
F
T

;

Z
O
G

1
0
−1
−2
−3

−2 −1

1
0
TFZ ;GOs per TF
rZ

0
.
0

5
.
0

0
.
1

5
.
1

0
.
2

−1

0
1
TFZ ;GOs per TF
rZ

0
.
0

5
.
0

0
.
1

5
.
1

0
.
2

2

−2

−1

0

1

TFZ ;GOs per TF
rZ

0
.
0

4
.
0

8
.
0

2
.
1

6
.
1

GLK7

GLK7

Zm00001d038473
bHLH28

Zm00001d002642

NAC116 ZNF3
BD1

Zm00001d049903

Zm00001d036244

Zm00001d010144

NAC10

Zm00001d018510

bZIP45
Zm00001d026184

SBP1SBP1
bZIP93

WRKY86

Zm00001d016260

Zm00001d052102
Zm00001d035195
Zm00001d010175

bHLH49
Zm00001d009030

bHLH77

WRKY56Zm00001d043066

MYBR90

OCL2

PRH48

THX4

bHLH43

Zm00001d012330

Zm00001d045477

Zm00001d039254

Zm00001d031463

Zm00001d020932

OCL2

Zm00001d021208

bHLH74

WRKY24

Zm00001d043837

Zm00001d005749

Zm00001d032217

Zm00001d005300

ABI26

RAP2

Zm00001d010616

−1

1
0
TFZ ;GOs per TF
rZ

0
.
0

5
.
0

0
.
1

5
.
1

0
.
2

Shoot−related

O
G

r
e
p

s
F
T

;

Z
O
G

1

0

−1

Zm00001d002405

Zm00001d008812

NAC77

MYBR40

GRAS28

Zm00001d025944

Zm00001d020932

−1 −0.5 0 0.5 1
TFZ ;GOs per TF

rZ

0
.
0

5
.
0

0
.
1

5
.
1

0
.
2

Figure S4.12 Enrichment score for TF and GO term in several biological processes 
a,  b,  and  c.  Scatter  plot  with  reciprocal  Z  score  (rZ)  of  hormones-  (a),  metabolism-  (b),  and 
development-related (c) process. 

148 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER FIVE: ARABIDOPSIS CO-EXPRESSION SIGNATURES OF 

COMBINATORIAL GENE REGULATION 

149 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5.1 ABSTRACT 

Gene  co-expression  analyses  provide  a  powerful  tool  to  determine  gene  associations.  The 

interaction  of  transcription  factors  (TFs)  with  their  target  genes  is  an  essential  step  in  gene 

regulation, yet to what extent TFs-target gene associations are recovered in co-expression studies 

remains unclear. Using the wealth of data available for Arabidopsis, I show here that protein-DNA 

interactions  are  overall  poor  indicators  of  TF-target  co-expression,  yet  the  inclusion  of  TF-TF 

interaction information significantly enhances co-expression signals. These results highlight the 

impact  of  combinatorial  gene  control  on  such  gene  association  networks.  I  integrated  this 

information  to  predict  higher-order  regulatory  complexes,  which  are  difficult  to  identify 

experimentally.  I  demonstrate  that  genes  strongly  co-expressed  with  a  TF  are  also  enriched  in 

indirect  targets.  These  results  have  significant  implications  on  the  empirical  understanding  of 

complex gene regulatory networks and transcription factor function, and the significance of co-

expression from the perspective of protein-protein and protein-DNA interactions 

5.2 INTRODUCTION 

The translation of genotype into phenotype is largely dependent on genes being expressed in 

the appropriate cell types at the correct time (Swift and Coruzzi, 2017). Such expression is mainly 

controlled by transcription factors (TFs) recognizing specific cis-regulatory regions in the genes 

that  they  regulate  resulting  in  protein-DNA  interaction  (PDI)  which  together  define  a  gene 

regulatory  network  (GRN)  (Gupta  et  al.,  2021).  PDIs  are  experimentally  identified  using 

combinations  of  gene-  and  TF-centered  approaches;  gene-centered  approaches  result  in  the 

identification of TF regulators for specific genes, while TF-centered approaches permit identifying 

target genes of a particular TF (Arda and Walhout, 2010; Yang et al., 2017; Mejia-Guerra et al., 

2012).  Within 

the  most  commonly  used  TF-centered  strategies 

include  chromatin-

150 

 
immunoprecipitation (ChIP) and DNA-affinity purification (DAP) methods, often coupled with 

high-throughput sequencing (ChIP-Seq and DAP-Seq, respectively) (Park, 2009; O’Malley et al., 

2016).  

Identification of PDIs is particularly important in the context of the effect that a TF has on the 

expression of its target genes. Often, however, identified TF targets show no changes in expression 

when  the  activity  of  the  corresponding  TF  is  perturbed  (Zeller  et  al.,  2006;  Morohashi  and 

Grotewold, 2009; Morohashi et al., 2012; Eveland et al., 2014; Liu et al., 2015). While in some 

instances technical artifacts are responsible, the low overlap between TF targets and differentially 

expressed genes are more often due to redundancy in the activity of the TF (Gitter et al., 2009; Hu 

et al., 2007), the timing of the PDI interactions (Para et al., 2014; Swift and Coruzzi, 2017; Brooks 

et al., 2019), the ability of some master regulators to bind closed chromatin regions (Pajoro et al., 

2014; Sayou et al., 2016; Tao et al., 2017; Jin et al., 2021; Lai et al., 2021), and/or  regulation of 

the target gene by the TF in only a fraction of the cells sampled (Nolan et al., 2023). For these 

reasons, the tethering of a TF to the regulatory region of a gene without a clear contribution to the 

control of the gene’s expression is often considered of limited biological significance (Banks et 

al.,  2016;  Jiang  and  Mortazavi,  2018).  Additionally,  TF  are  also  known  by  their  combinatorial 

nature, where a single TF can regulate multiple sets of target genes through interactions with other 

proteins,  defined  as  combinatorial  gene  regulation  (CGR)  (Reményi  et  al.,  2004;  Brkljacic  and 

Grotewold, 2017). However, despite CGR being a well-documented phenomenon in plant systems 

(Reményi  et  al.,  2004;  Heyndrickx  et  al.,  2014a;  Brkljacic  and  Grotewold,  2017;  Colinas  and 

Goossens, 2018; Lacchini and Goossens, 2020), there is no single study that attempts to predict 

the contribution of CGR to the low overlap in expression changes observed after perturbation and 

target genes observed in PDI assays. 

151 

 
In  general,  it  is  assumed  that  genes  with  very  similar  expression  patterns  are  regulated  by 

similar mechanisms, involving shared TFs (Eisen et al., 1998; Vandepoele et al., 2009; Haynes et 

al.,  2013;  Zhou  et  al.,  2020;  Geng  et  al.,  2021;  Burks  et  al.,  2022).  Similar  patterns  of  gene 

expression can be captured by gene co-expression networks (Eisen et al., 1998; Stuart et al., 2003; 

Haynes et al., 2013; Wisecaver et al., 2017; Rao and Dixon, 2019; Zhou et al., 2020; Geng et al., 

2021;  Burks  et  al.,  2022).  Multiple  examples  of  implementation  of  co-expression  networks  or 

specific TF-target co-expression patterns have allowed the prioritization of PDIs (Wu and Ji, 2013; 

Jiang and Mortazavi, 2018; Zhou et al., 2020; Furuya et al., 2021; Geng et al., 2021; Burks et al., 

2022;  Gomez-Cano  et  al.,  2022).  Here,  I  took  advantage  of  data-rich  Arabidopsis  thaliana 

(Arabidopsis), which provides an attractive system to investigate the co-expression relationships 

between TFs and their corresponding predicted target genes, and how the co-expression patterns 

are  affected  by  the  formation  of  TF-TF  complexes.  Specifically,  I  obtained  expression  and  co-

expression  data  from  ATTED-II  (http://atted.jp/),  a  database  that  provides  co-expression 

information  obtained  from  various  gene  expression  analyses    (Obayashi  et  al.,  2018).  The  co-

expression data was combined with over five million PDIs identified through ChIP-chip, ChIP-

seq, and DAP-Seq. All of these PDIs are accessible via AGRIS (http://agris-knowledgebase.org/) 

(Palaniswamy  et  al.,  2006;  Yilmaz  et  al.,  2011).  Additionally,  I  included  9,503  experimentally 

established PPI for Arabidopsis TFs that can be accessed through the BioGRID database (Oughtred 

et al., 2019). Combining the expression and co-expression from ATTED-II, I determined that about 

half of the TFs are globally co-expressed with their targets as a set, with this number increasing to 

85% when local co-expression patterns are considered. I show that a small fraction (in average 

~5%) of the direct targets are robustly co-expressed with the corresponding TFs. However, when 

TF complexes deduced from available PPI data are considered, the number of targets co-expressed 

152 

 
with  a  TF  significantly  increases.  By  integrating  PDIs,  PPIs,  and  co-expression  information,  I 

predicted  the  formation  of  ternary  TF  complexes,  some  with  strong  support  from  experimental 

data. Finally, I determined the TFs most highly co-expressed are largely represented by direct and 

indirect TF targets. These findings have significant implications on the empirical understanding of 

complex gene regulatory networks, and the meaning of co-expression from the standpoint of PPIs 

and PDIs. 

5.3 RESULTS 

5.3.1 Transcription factors and their targets show varying levels of co-expression 

To investigate the co-expression of Arabidopsis TFs and their corresponding target genes, I 

collected existing PDI data involving 555 TFs and 25,255 target genes (see Methods). The target 

genes were determined based on the proximity, when coordinates of peak were available, between 

the peak of the respective TF and the target genes. It is worth noting that the majority of PDIs used 

were derived from DAP-seq, which, due to the absence of chromatin context, may contain a higher 

proportion of non-functional TF-target associations (O’Malley et al., 2016). With these datasets, I 

built a PDI network that included 2,271,066 interactions that were then used to interrogate the co-

expression relationships between each TF and its targets, using the mutual rank (MR) of the PCC 

(MR-PCC), as reported by ATTED-II (Obayashi et al., 2018), and the mutual rank of the mutual 

information  (MR-MI)  (See  Methods).  I  used  PCC  and  MI  capturing  linear  and  non-linear 

relationships, respectively (Banf and Rhee, 2017), and the corresponding MR value in order to 

reduce  dataset-dependent  associations  and  to  improve  the  predictive  power  of  the  correlation 

(Obayashi and Kinoshita, 2009; Obayashi et al., 2018).  

To assess the significance of co-expression between each TF and its corresponding set of target 

genes, I conducted two distinct analyses for each TF: (1) I compared the average MR of a TF with 

153 

 
its targets to the average MR of the TF with a randomly selected gene set of similar size. TFs that 

exhibited significant differences compared to the random set were classified as 'co-expressed by 

average MR' (see Methods). (2) I examined differences in the distributions of MRs between a TF 

and  its  target  genes  versus  all  non-target  genes.  TF-target  pairs  that  demonstrated  significant 

differences (P < 0.05, Kolmogorov-Smirnov test) compared to the distribution of TF-non-target 

pairs were categorized as 'co-expressed by MR distribution' (see Methods). It should be noted that 

the  analyses  based  on  MR-PCC  values  were  performed  separately  for  negative  and  positive 

correlation values. Hence, based on the results of the statistical tests, I determined that 231/555 

TFs  (using  MR-PCC)  and  172/555  TFs  (using  MR-MI)  showed  significant  co-expression  with 

their  respective  target  genes  (Figure  S5.1a,  b).  Additionally,  by  comparing  both  co-expression 

metrics (MR-PCC and MR-MI), I identified 124 TFs that were common to both analyses (Figure 

5.1a). In total, I identified 279 (172 + 231 - 124) TFs that exhibited significant co-expression with 

their corresponding target gene sets, while the remaining 276 TFs did not show significant co-

expression. A closer look into only the MR-PCC results allowed us to establish that 186/231 TFs 

showed significant co-expression (either by MR distribution and/or MR average tests) only with 

positively  co-expressed  targets  (potential  transcriptional  activators),  and  23/231  only  with 

negatively co-expressed targets (potential transcriptional repressors) (Figure S5.1c). Remarkably, 

22  TFs  showed  significant  co-expression  with  different  sets  of  both  positively  and  negatively 

associated  target  genes,  indicating  that  they  can  function  both  as  transcriptional  activators  or 

repressors, depending on the target gene subset (Figure S5.1c). 

To further characterize the TF-target genes co-expression profiles observed, I classified the 

TFs  into  four  co-expression  categories:  TFs  co-expressed  with  their  targets  based  on  MR-PCC 

(107 TFs), TFs co-expressed based on both MR-PCC and MR-MI (124 TFs), TFs co-expressed 

154 

 
based on MR-MI alone (48 TFs), and TFs that did not display significant co-expression with their 

corresponding  targets  (276  TFs)  (Figure  5.1a).  Next,  I  grouped  the  MR  distribution  into  bins, 

ranging from the smallest to the largest rank, to analyze the proportion of targeted genes in each 

bin  (~250  MR  values  per  bin)  per  TF.  Consequently,  smaller  and  larger  MR-PCC  values 

correspond  to  more  positive  and  negative  co-expression  values,  respectively.  In  the  MR-PCC 

distribution, TFs that displayed significant co-expression with their targets were predominantly 

distributed within the first 25 bins (i.e. within around the first 6,250 genes most co-expressed per 

TF)  (Figure  5.1b).  Conversely,  TFs  that  did  not  show  significant  co-expression  with  their 

respective targets demonstrated a distinct pattern in the MR-PCC distribution (Figure 5.1b, gray 

panel). I observed similar patterns in the MR-MI distribution as well (Figure S5.2). Notably, MI 

does  not  differentiate  between  positive  and  negative  associations.  Thus,  all  significant  values, 

when present, are captured in the left tail of the distribution. Additionally, there was a consistent 

~1% presence of targets across all bins in the distribution (Figure 5.1b, indicated by line plot with 

target  %  beneath  each  heatmap).  These  findings  validate  earlier  observations  in  Arabidopsis 

(Zaborowski  and  Walther,  2020),  corroborating  the  absence  or  low  co-expression  relationship 

between TFs and their respective target genes. 

Given that many TF functions are often highly cell-type, tissue, or stress specific, I analyzed 

the  co-expression  at  different  scales  (Zhou  et  al.,  2020;  Lee  et  al.,  2023;  Nolan  et  al.,  2023).  

Specifically, I introduced a new category called "local co-expression," which involved analyzing 

subsets of expression datasets obtained after clustering similar samples. These subsets served as 

proxies for organ- and condition-specific co-expression (see Methods). In total, I identified twelve 

distinct sample clusters representing potential conditions (Figure S5.3). Similar to the previous 

global  co-expression  analysis,  I  employed  two  statistical  methods  (average  MR  and  MR 

155 

 
distribution)  and  two  metrics  (MR-PCC  and  MR-MI).  To  explore  the  presence  of  local  co-

expression patterns in the 276 TFs that did not exhibit significant global co-expression with their 

target genes, I kept these sets separate. Overall, I discovered that 199 out of 276 TFs displayed 

significant co-expression with their target genes in at least one of the clusters (Figure 5.1c). As 

expected, TFs with global co-expression patterns were found to exhibit co-expression with targets 

in multiple local clusters (Figure 5.1c), with the exception of seven TFs (WIP5, MYB1, PLT1, 

ERF109,  HHO5,  NAC4,  and  AT5G47660).  These  seven  TFs  showed  significant  global  co-

expression, but no evident local co-expression in any of the clusters. The reason for this intriguing 

behavior is not yet clear. 

I  explored  the  distinguishing  characteristics  of  TFs  that  do  not  exhibit  global  or  local  co-

expression with their target genes. I observed a significant difference in the connectivity within 

the network between TFs showing co-expression and those that do not. TFs lacking co-expression 

with  their  alleged  targets  displayed  significantly  smaller  in-degree  (representing  the  number  of 

TFs binding to a specific promoter region of the corresponding TF) and out-degree (representing 

the  number  of  target  genes  bound  by  a  TF)  compared  to  co-expressed  TFs  (P  <  0.05,  Mann-

Whitney  U  test;  Figure  S5.4).  These  findings  suggest  that  TFs  with  lower  connectivity  in  the 

network  may  have  distinct  co-expression  relationships  with  their  targets.  However,  I  cannot 

dismiss the possibility that the identified clusters may not be sufficiently resolved for these TFs. 

156 

 
Figure 5.1 Patterns of co-expression between TFs and their direct target genes 
a. Total number of TFs globally co-expressed with their corresponding targets across all tissues 
and conditions based on MR-PCC and MR-MI. The Venn diagrams show the overlap between the 
two metrics. b. Heatmaps displaying the distribution of MR-PCC values across 25,296 Arabidopsis 
genes. TFs are divided into four co-expression groups: TFs co-expressed with their targets based 
on MR-PCC (107 TFs), on both MR-PCC and MR-MI (124 TFs), MR-MI only (48 TFs), and TFs 
that do not show significant co-expression with their targets (276 TFs). The colors indicate the 
percentage  of  TF  targets  within  each  bin  of  250  MRs.  There  are  101  bins  along  the  PCC 
distribution, representing the co-expression values of each TF with the 25,296 Arabidopsis genes. 
Small MR values correspond to positive PCC values, while large MR values represent negative 
PCC values. The line-dot plots below each heatmap display the average percentage of targets for 
all TFs in each bin. c. Heatmap illustrates the local co-expression profiles of each TF analyzed 
across 12 different expression clusters. The color indicates whether there is co-expression (orange) 
or  no  co-expression  (gray).  The  left  panel  shows  TFs  that  are  globally  co-expressed  with  their 
targets, while the right panel shows those that are not. The number in brackets represents the count 
of TFs with significant co-expression in at least one of the local clusters.  

157 

 
 
5.3.2 Few targets are highly co-expressed with their respective TFs 

The  distribution  of  target  genes  along  the  MR-PCC  range  mentioned  earlier  (Figure  5.1b) 

reveals a limited presence of targets among the genes exhibiting the highest co-expression with 

each TF. Specifically, the maximum proportion of targets within a bin containing 250 co-expressed 

genes is approximately 5% (Figure 5.1b). Moreover, the percentage of targets gradually decreases 

beyond the first 5,000 MRs, capturing a maximum of 25% of the total identified direct targets for 

each TF. To assess the proportion of highly co-expressed targets (HCT) for each TF, I defined the 

top and bottom 2.5% of the MR-PCC distribution as the set of highly co-expressed genes (HCGs) 

and  tallied  the  total  number  of  targets  within  these  intervals.  Among  all  TFs,  ARABIDOPSIS 

PSEUDO-RESPONSE REGULATOR 9 (PRR9) exhibited the highest percentage (36%) of target 

genes identified as HCTs according to the defined criteria. However, on average, only 4.7% of the 

targets qualified as HCTs (Figure 5.2a), indicating that, on average, the remaining 95.3% of the 

targets were classified as low co-expressed targets (LCTs).  

5.3.3 PPIs condition TF co-expression with direct targets 

To gain insights into the limited co-expression between TFs and their target genes, I explored 

how  the  presence  of  multiple  physically  interacting  TFs  regulating  a  gene  could  influence  the 

observed  co-expression  pattern.  I  obtained  815  experimentally  determined  protein-protein 

interactions  (PPIs)  involving  313  out  of  the  555  TFs  analyzed  in  this  study  from  BioGRID. 

Specifically,  using  this  PPI  information,  I  assessed  the  extent  to  which  the  formation  of  TF 

complexes (e.g., TFx-TFz) could account for the high fraction of low co-expressed targets (LCTs) 

associated with each TFx. To do this, I calculated the partial co-expression correlation of TFx with 

all LCTs, conditioned on the presence of TFz (de la Fuente et al., 2004; Kim, 2015; Uygun et al., 

2016). This analysis allowed me to examine the co-expression of TFx target genes with a TFx 

158 

 
complex (TxCC). It is important to note that these correlations are not symmetric, meaning that 

TxCC may differ from TzCC. Additionally, TCC refers specifically to correlations conditioned by 

already reported TF heterodimers. I performed the correlation analysis using all Arabidopsis genes 

and identified the top 2.5% highly co-expressed genes (at each tail of the correlation distribution 

as cut-off) for each TFx-TFz complex. I found that, on average, 5% of the LCTs of a TF are co-

expressed with the complexes in which the TF is involved (i.e., TxCC) (Figure 5.2b). Furthermore, 

I calculated the percentage of TxCC based on the number of interactions, revealing that the average 

of  TxCC  is  not  influenced  by  the  total  number  of  targets  associated  with  the  respective  TF 

(Spearman  Correlation,  rs  =  0.02)  (Figure  5.2c,  color  scale  distribution).  However,  when 

considering  all  interactions  for  each  TF,  it  became  evident  that  the  percentage  of  targets  co-

expressed with a complex increased proportionally with the number of known interactors that a 

TF possesses (correlation, rs = 0.69) (Figure 5.2c), indicating that a significant proportion of the 

LCTs described previously can be explained by considering complexes of interacting TFs. 

Even among TFs with a similar number of analyzed complexes, there is notable variation in 

the proportion of TxCC (Figure 5.2d). For instance, within the subset of TFs that have a single 

known  partner,  I  observed  distinct  cases  represented  by  DEHYDRATION  RESPONSE 

ELEMENT-BINDING  PROTEIN  26  (DREB26)  and  ethylene  response  factor  (ERF) 

(AT4G18450). These TFs interact with BASIC HELIX LOOP HELIX PROTEIN 10 (BHLH010) 

and GT-1, respectively, and the corresponding complexes explain 5.5% and 1.6% of the LCTs 

(Figure 5.2d). This finding highlights the specific and unique impact of each TF complex on the 

percentage of co-expressed target genes, potentially reflecting functional aspects of combinatorial 

gene regulation. 

159 

 
Thus far, I have demonstrated that incorporating regulatory complexes can enhance the co-

expression of TFs with their targets. Despite the variable number of common targets shared by 

these interacting TFs (TFx-TFz in Figure 5.2e), only a small fraction of these shared targets exhibit 

co-expression  with  the  complex  (TxCC-Tz,  Figure  5.2e).  Therefore,  to  gain  a  deeper 

understanding of the co-expression patterns among the shared targets of TFx and TFz, I compared 

the proportion of these targets that co-expressed with the TFx-TFz complex and also exhibited 

high co-expression with TFz (Figure 5.2f, blue box), with the TFz-TFx complex (TzCC) (Figure 

5.2f, orange box), or show low co-expression with TFz (Figure 5.2f, gray box). Overall, 91% of 

the shared targets that are also TxCC were found to have modest co-expression with TFz (LCTz, 

gray  in  Figure  5.2g).  Only  3.9%  of  the  shared  targets  exhibited  high  co-expression  with  TFz 

(Figure 5.2g, blue box), and 4.1% co-expressed with both complexes (TxCC and TzCC) (orange 

in  Figure  5.2g).  These  findings  emphasize  the  significance  of  considering  TF  complexes  when 

interpreting the co-expression between TFs and their targets. 

To assess the biological significance of the co-expression observed between targets and TF 

complexes,  I  examined  specific  examples.  HHO2  (HRS1  HOMOLOG2)  and  HHO3  (HRS1 

HOMOLOG3) are MYB-related TFs involved in phosphate homeostasis, lateral root development 

(Nagarajan et al., 2016), and nitrogen responses (Varala et al., 2018). Our analysis revealed that 

the HHO2-HHO3 complex co-expressed with 43 targets. Notably, HHO2, HHO3, and six of their 

targets exhibited differential expression in response to different nitrogen growth conditions (Figure 

5.2h), supporting the functional relevance of complex formation and its associated targets.  

I  also  examined  the  SVP  (SHORT  VEGETATIVE  PHASE)  -  GBF2  (G-BOX  BINDING 

FACTOR 2) complex. SVP acts as a flowering repressor (Chen et al., 2018) and is also involved 

in drought responses (Bechtold et al., 2015), while GBF2 is associated with abscisic acid (ABA) 

160 

 
responses (Song et al., 2016). My results identified 429 shared co-expressed targets for the SVP-

GBF2  complex  (Figure  5.2i),  of  which  130  genes  were  differentially  expressed  under  drought 

conditions (Harb et al., 2010; Wilkins et al., 2010; Bechtold et al., 2015). These findings support 

the  notion  that  TF  targets,  which  lack  significant  co-expression  with  the  TFs  individually,  do 

exhibit co-expression when considering TF complexes. 

161 

 
 
 
 
 
Figure 5.2 Targets are more frequently co-expressed with TF complexes than with individual 
TFs 
a.  Violin  plot  displaying  the  proportion  of  highly  co-expressed  targets  (HCT)  for  313  TFs.  b. 
Boxplot illustrating the percentage of low co-expressed targets (LCTs) that coincide with targets 

162 

 
 
Figure 5.2 (cont’d) 
co-expressed with a TFx complex (TxCC). c. Percentage of TxCCs in relation to the total number 
of PPIs involving each TF. d. Enlarged view of the section in (c) depicting TFs with only one 
interacting partner. DREB26-bHLH10 and ERF (At4g18450)-GT-1 represent extreme cases in the 
distribution. The color scale in (c) and (d) indicates the number of targets for each TF. e. Boxplot 
presents the number of shared targets between the 815 analyzed TF complexes (TFx-TFz) or the 
number of targets of a given TFx co-expressed with the TFx-TFz complex (TxCC) that are also 
targets of TFz. f. Schematic representation of the comparison made among target genes of TFz and 
targets of TFx categorized as HCTs, TCCs, or LCTs of TFx, denoted by blue, orange, and yellow, 
respectively. g. Distribution of targets based on the comparison in (f) for the 815 analyzed TFx-
TFz complexes. Complexes are shown on the x-axis, while the y-axis represents the frequency of 
overlap. The HHO2-HHO3 (h) and SVP-GBF2 (i) TF complexes serve as representative examples 
from  the  analyzed  TF  complexes.  h.  The  numbers  indicate  the  differentially  expressed  genes 
(DEGs) under various nitrogen growth conditions. i. The numbers indicate DEGs, also identified 
as  targets  of  the  corresponding  complexes,  under  drought  stress  in  three  different  studies.  The 
sidebar  plot  provides  a  zoomed-in  view  of  the  HHO2-HHO3  and  SVP-GBF2  positions  on  the 
shared target distribution shown in g. 

5.3.4  Co-expressed  targets  shared  by  binary  TF  complexes  suggest  higher-order 

arrangements  

The  results  presented  so  far  indicate  that  the  integration  of  co-expression  and  physical 

interaction  information  contributes  to  the  identification  of  TFs  that  control  gene  expression 

working as part of complexes. There are many instances in which Arabidopsis TF pairs interact 

and  control  shared  sets  of  target  genes  (Brkljacic  and  Grotewold,  2017;  Bemer  et  al.,  2017). 

However,  the  experimental  identification  of  higher-order  (beyond  binary)  TF  complexes  is  not 

without  challenges  (Lambert  et  al.,  2018).  To  investigate  whether  the  combination  of  co-

expression,  PPI,  and  PDI  information  might  provide  insights  on  higher-order  TF  complexes,  I 

started by describing the complexes made up by TGA10 (TGACG MOTIF-BINDING PROTEIN 

10), TCP14 (TGA10 with TEOSINTE BRANCHED, cycloidea and PCF 14), and a homeodomain-

like  TF  (AT2G40260)  (Trigg  et  al.,  2017).  The  TGA10-TCP14  and  TGA10-AT2G40260 

complexes  share  80%  of  targets  co-expressed  with  each  complex  (Figure  5.3a,  black  nodes). 

Moreover, shared targets had similar expression correlation with both heterodimers (either positive 

163 

 
 
or negative), indicating that both complexes potentially activate or repress the same sets of genes 

(Figure  5.3a).  These  results,  combined  with  the  information  that  TCP14  and  AT2G40260 

physically  interact  with  each  other  (Trigg  et  al.,  2017),  provide  strong  evidence  that  TGA10, 

TCP14,  and  AT2G40260  form  a  ternary  complex  that  controls  the  expression  of  all  targets 

indicated in Figure 5.3a. 

I  proceeded  to  examine  the  presence  of  other  triple-binary  (tri-bi)  TF  combinations  in 

Arabidopsis,  similar  to  the  TGA10-TCP14  and  TGA10-AT2G40260  complexes.  To  do  this,  I 

initially identified 47 TFs that had at least two interacting partners and PDI information. I then 

determined the percentage of shared target genes between these pairs (Figure 5.3b, orange) and 

compared it to the percentage of targets unique to each pair (Figure 5.3b, gray). In certain cases, 

all targets were shared by both binary complexes (indicated by the orange columns in Figure 5.3c), 

while only around 8% were shared by binary complexes with minimal overlap (columns on the 

right in Figure 5.3c). Notably, 13 out of the 47 tri-bi combinations tested showed experimental 

evidence for all three binary interactions (indicated by black arrows in Figure 5.3c), supporting the 

existence of higher-order (ternary) complexes. However, I was unable to establish a statistically 

significant correlation between the number of shared targets and experimental evidence confirming 

the  formation  of  ternary  complexes.  This  lack  of  correlation  likely  stems  from  the  limited 

availability of PPI data for many of the TF pairs involved, rather than the shared percentage of co-

expressed targets being an inadequate indicator of ternary complex formation. 

I next investigated how frequently TFs involved in tri-bi interactions share common targets. 

Unlike the previous analysis, I now considered TFs with more than two PPIs. I identified a total 

of 2,013 true tri-bi instances (i.e., with evidence of physical interaction for all pairs of the tri-bi) 

involving 140 TFs. In approximately 90% of these instances, the TFs showed a significant overlap 

164 

 
of target genes (false discovery rate < 0.01, Fisher’s exact test). This indicates that TFs involved 

in tri-bi interactions often share a substantial number of targets, making them strong candidates 

for the formation of tertiary, or even higher-order, complexes. To assess whether the fraction of 

shared targets differs from random tri-bi complexes, I compared the co-expressed shared targets 

of TF complexes from experimentally demonstrated tri-bi instances to those from tri-bi instances 

obtained through a randomized binary interactome approach for each TF (see Methods).  

Among  the  104  TFs  analyzed,  I  identified  12  TFs  involved  in  tri-bi  instances  with  a 

significantly larger fraction of shared targets compared to the background model (Figure S5.5a). 

An illustrative case is ABI5 (ABA INSENSITIVE5), which participates in eight tri-bi instances 

and exhibits a median shared fraction of targets of 0.77 (Figure 5.3d). Remarkably, six out of the 

eight  tri-bi  instances  involving  ABI5  consist  of  a  combination  of  four  TFs  from  the  ABF 

(ABSCISIC ACID RESPONSIVE ELEMENTS-BINDING PROTEIN) family (Figure 5.3e). The 

number of target genes varies across the tri-bi instances, ranging from 258 for ABF2-ABI5-ABF4 

to 290 for ABF3-ABI5-ABF4 (Figure 5.3e). The 290 ABF3-ABI5-ABF4 gene targets include 46 

genes differentially expressed in abi5 mutant seeds (Bi et al., 2017). Remarkably, ABF2, ABI5, 

and  ABF4  also  interact  with  SnRK2.2  (SNF1-RELATED  PROTEIN  KINASE  2),  PP2CA 

(PROTEIN PHOSPHATASE 2CA) (Yoshida et al., 2010; Lynch et al., 2012), and AHG1 (ABA-

HYPERSENSITIVE  GERMINATION  1)  (Lynch  et  al.,  2012),  which  are  key  known  post-

translational regulators of ABI5 (Skubacz et al., 2016). I found 41 TFs involved in tri-bi instances 

with a significantly reduced fraction of shared targets compared to the expected background model 

(Figure S5.5b). These findings suggest that these TFs may participate at least in dimeric complexes 

where they bind overlapping sets of target genes. 

165 

 
Figure  5.3  Common  co-expressed  targets  of  TF  complexes  suggest  higher-order  TF 
arrangements 
a. Co-expressed targets shared by the TGA10-TCP14 and TGA10-AT2G40260 TF complexes are 
represented.  Black  nodes  indicate  common  targets  for  both  complexes,  while  light  gray  nodes 
represent targets controlled by one complex but not the other. Green arrows indicate positive co-
expression  correlation  (activation),  and  blue  arrows  indicate  negative  co-expression  correlation 
(repression) with the respective TF complexes. b. The strategy used to identify shared targets by 
comparing TxCC between pairs of dimers is illustrated schematically. c. The percentage of total 
targets bound by both complexes (orange) or only by one complex (gray) is shown. Black arrows 
indicate tri-bi complexes with experimental evidence for all three binary interactions. d. ABI5 

166 

 
 
 
Figure 5.3 (cont’d) 
serves  as  an  example  of  12  TFs  with  significantly  larger  fractions  of  shared  targets  in  tri-bi 
complexes compared to randomly formed tri-bi complexes (two-sided t-test P < 0.05). Similarity 
between the sets of target genes for corresponding dimers was measured using Jaccard indices. e. 
Tri-bi complexes involving ABI5 are depicted, with experimentally verified interactions shown as 
lines and the numbers in blue indicating targets of the complexes. 

5.3.5 Genes highly co-expressed with TFs are enriched in indirect TF targets 

In  previous  sections,  I  focused  on  the  co-expression  patterns  between  TFs  and  their  direct 

targets. However, a question that remains unanswered is whether there is a relationship between a 

TF and the genes that are most highly co-expressed with that TF. To explore this, I examined how 

many target genes of a TF also belong to the top 5% most highly co-expressed genes (HCG) with 

that  TF.    Surprisingly,  for  the  large  majority  of  the  TFs  (80%),  less  than  30%  of  the  HCG  are 

among the target genes. There is one exception, NF-BY2 (nuclear factor Y, subunit B2), where 

this number is as high as 82% (Figure 5a). I explored the possibility that genes that are not direct 

targets of a TFx could be targets of a TFx partner (TFz), or that they could be targets of a second 

TF (TFy) that is itself a direct target of TFx.  

To  assess  the  impact  of  TF  partners  (TFz)  on  the  highly  co-expressed  genes  of  TFx,  I 

investigated the proportion of highly co-expressed genes that are targets of TFz but not of TFx 

itself. Our analysis revealed that out of the 313 tested TFs, 309 TFs had at least one highly co-

expressed gene that was a target of one of its TFz partners. On average, approximately 10% of the 

highly co-expressed genes of a TF belonged to this category (Figure 5.4b). Similarly, to understand 

the contribution of downstream targets to the highly co-expressed genes of a downstream TFy in 

the regulatory hierarchy, I examined the same set of 313 TFs. Among these TFs, 306 TFs bound 

to a TFy that had at least one direct target gene highly co-expressed with the upstream TFx. On 

average, around 9.8% of the genes most highly co-expressed with TFx were indirect targets of TFy 

(Figure 5.4c). I also compared the actual set of highly co-expressed genes recovered using true 
167 

 
interactions  with  those  obtained  using  random  networks  (PPI  and  PDI,  respectively)  (See 

Methods). The random TF PPIs yielded a similar number of highly co-expressed genes compared 

to the known PPIs (P > 0.05, Mann-Whitney U test) (Figure 5.6a). It is worth noting that the PPI 

network  used  in  this  analysis  had  an  average  path  length  of  3.5  edges  between  all  TF  nodes, 

indicating weak independence between the true and random PPIs. In contrast, the random target 

TFy resulted in a significantly smaller number of highly co-expressed genes compared to the true 

targets (P < 0.05, Mann-Whitney U test) (Figure S5.6b), suggesting that downstream hierarchical 

regulators  play  a  crucial  role  in  explaining  the  presence  of  highly  co-expressed  genes  for  the 

corresponding TF. 

I computed the combined contribution of TFz interactors and downstream TFs (TFy) to the set 

of  highly  co-expressed  genes  for  each  of  the  313  TFs.  This  allowed  me  to  determine  that,  on 

average, 90% of the genes most highly co-expressed with a TF consist of its direct targets (~16%), 

targets of its TFz partners (~4%, after excluding partners that are also direct targets of TFx), and 

downstream  targets  (~70%,  targets  of  a  TF's  target)  (Figure  5.4d).  Interestingly,  I  also  found 

examples in which the partner for TFx is also a downstream target, participating in a feed-forward 

loop (FFL) (26% out of total TFs). FFLs are among the most highly represented regulatory motifs 

present in Arabidopsis (Chen et al., 2018) and other eukaryotes (Milo et al., 2004). 

168 

 
 
Figure 5.4 Genes highly co-expressed with TFs are enriched in indirect TF targets 
a. Percentage of highly co-expressed genes (HCGs) of TFx that are confirmed targets of TFx. b 
and c. Model and percentage of highly co-expressed genes that are potential indirect targets of TFx 
through its TFz interactors (b) and a TFy downstream of the corresponding TFx (c). d. Percentage 
of HCGs attributed to direct or indirect targeting by TFx. 

5.4 DISCUSSION 

In  this  chapter,  I  examined  the  co-expression  patterns  between  TFs  and  their  targets  using 

comprehensive PDI, PPI, and gene expression data for Arabidopsis. I found that approximately 

half (279) of the TFs studied exhibit global co-expression with their targets, while an additional 

35%  (199)  display  local-specific  co-expression  in  at  least  one  of  the  twelve  sample  clusters 

identified.  Interestingly,  for  77  Arabidopsis  TFs  with  extensive  PDI  information,  there  is  no 

conclusive evidence of co-expression with their identified targets beyond what would be expected 

by chance. This suggests that certain TFs only show co-expression under specific conditions, and 

it  is  possible  that  utilizing  single-cell  sequencing  will  uncover  additional  co-expression 

169 

 
 
 
relationships  that  are  not  apparent  in  organ-level  gene  expression  experiments  due  to  the 

complexity of cell populations. I show that only a small fraction (on average 4.7%; Figure 5.2a) 

of the direct targets are among the genes most highly co-expressed with a given TF. Conversely, 

direct targets are a small fraction of the genes highly co-expressed with a TF (in average 14.3%; 

Figure 5.4a). Considering that high co-expression is frequently employed as an additional measure 

to establish the biological importance of a PDI, my findings suggest that these comparisons involve 

a more intricate regulatory framework. 

In  the  endeavor  to  uncover  the  co-expression  connections  between  TFs  and  their  targets,  I 

observed that a significant proportion (up to 17%) of targets that are not highly co-expressed with 

a specific TF are indeed co-expressed with TF complexes. Interestingly, a substantial number of 

co-expressed targets (up to 100%, averaging around 22%) were shared by multiple members of 

the complex, even if they were not highly co-expressed with individual TFs. These findings align 

with extensive literature highlighting the concept of combinatorial gene regulation (Ravasi et al., 

2010;  Brkljacic  and  Grotewold,  2017;  Colinas  and  Goossens,  2018;  Droge-Laser  and  Weiste, 

2018).  To  investigate  the  biological  significance  of  co-expressed  targets  associated  with  two 

distinct TF complexes (HHO2-HHO3 and SVP-GBF2), I examined their expression changes under 

stress conditions. Remarkably, in both cases, I identified differentially expressed target genes and 

TF  members  within  the  complex.  Our  results  emphasize  the  necessity  of  considering  the 

combinatorial nature of gene regulation to fully harness the potential of co-expression analyses. 

Identifying ternary TF complexes experimentally presents significant challenges. To address 

this,  I  employed  a  comprehensive  approach  combining  co-expression  data,  protein-protein 

interactions (PPIs), and shared targets obtained from PDI data to analyze potential TF pairs that 

may form ternary complexes (Figure 5.3c). For instance, I discovered eight potential ABI5 ternary 

170 

 
complexes  involving  four  TFs  from  the  ABF  family  (ABF1/2/3/4).  These  findings  align  with 

experimental evidence suggesting functional redundancy between ABF3 and ABI5 (Finkelstein et 

al., 2005), as well as the regulatory role of ABI5 and ABF2/3/4 in the degradation of chlorophyll-

related genes (Gao et al., 2016). Moreover, it is known that ABF3/4 and NF-YC (nuclear factor Y 

subunit  C)  form  a  complex  that  controls  flowering  in  response  to  drought  by  regulating  SOC1 

(SUPPRESSOR OF OVEREXPRESSION OF CONSTANS1) expression (Hwang et al., 2019), 

which is also targeted by ABI5 during seedling development  (O’Malley et al., 2016). These results 

strongly suggest the formation of a larger-order complex involving ABF3-ABF4-ABI5. Together, 

by  integrating  PPIs  between  TFs  with  co-expression  studies,  I  predicted  a  number  of  potential 

ternary TF complexes, which could now be experimentally validated, an easier undertaking than 

carrying out do novo identification. 

Another  question  addressed  by  this  study  regards  the  nature  of  the  association  of  the  other 

genes that are highly co-expressed with a TF, if they are not targets of the TF itself. I showed that, 

on average for the 313 TFs investigated, almost a third of the highly co-expressed genes are either 

indirect targets of the TF (targets of a TF target), direct targets of the TF or direct targets of a TF 

partner. Is important to note that in many instances this number was much larger, which to some 

extent justifies the wide-spread use of co-expression as a proxy to carry out functional association 

of TFs and different plant traits (Haque et al., 2019; Kulkarni and Vandepoele, 2019). However, 

what  these  studies  also  show  is  that  the  use  of  co-expression  is  a  poor  indicator  of  direct 

interactions between TFs and their target genes. Establishing the co-expression relationships of 

TFs and their target genes has wide implications for elucidating the architecture of gene regulatory 

networks  in  all  organisms  and  establishing  the  meaning  of  co-expression  as  a  tool  to  elucidate 

molecular interactions. 

171 

 
5.5 METHODS 

5.5.1 Data collection  

Expression  and  global  co-expression  data  were  collected  from  the  ATTED-II  database 

(http://atted.jp/,  versions  Ath-r.v15-08  and  Ath-r.c2-0,  respectively)  (Obayashi  et  al.,  2018).  In 

total, I used 1,416 different RNA-Seq libraries with expression data associated for 25,296 different 

genes. I collected the protein-DNA interaction information as raw peaks (bed or narrowpeak files 

from  ChIP-chip,  ChIP-Seq,  and  DAP-Seq  experiments)  from  the  Gene  Expression  Omnibus 

(GEO) and/or supplementary material from reference source (Yant et al., 2010; Wang et al., 2010; 

Brandt et al., 2012; Gregis et al., 2013; Jensen et al., 2013; Merelo et al., 2013; ÓMaoiléidigh et 

al., 2013; Heyndrickx et al., 2014b; Verkest et al., 2014; Liu et al., 2015; Nagel et al., 2015; Li et 

al., 2016; Liu et al., 2016; O’Malley et al., 2016; Song et al., 2016; Van Leene et al., 2016; Albihlal 

et al., 2018; Besbrugge et al., 2018; Chen et al., 2018; Shanks et al., 2018; Xu et al., 2018). The 

assignment of a peak region to a gene was carried out assuming a promoter region of 2 kb upstream 

from the transcription start site (TSS) for each Arabidopsis gene (genome annotation TAIR10). I 

used all peak region sizes as reported originally. All protein-protein interactions (PPIs) used for 

the identification of complex co-expressed targets were collected from the BioGRID database for 

Arabidopsis (V3.5.169) (Oughtred et al., 2019).  

5.5.2 Evaluation of co-expression and determination of mutual rank values 

For the evaluation of the global co-expression between TFs and their corresponding targets, I 

used  the  mutual  ranks  (MRs)  of  the  Pearson  Correlation  Coefficient  (PCC)  and  the  Mutual 

Information (MI) as co-expression metrics. MR were defined for each gene as follows: Rij is the 

rank of the correlation of gene i with the gene j, and Rji is the rank of the correlation of gene j with 

the gene i, with the lowest value as the best rank (close to 1). Then, MR is equal to the square root 

172 

 
of Rij times Rji. Global MRs from positive PCC were used as reported by ATTED-II, while global 

MRs from negative PCC values were transformed into a second MR by subtracting the original 

MR reported from the maximum possible MR (25,296) for each TF. For the calculation of local 

MRs-PCC, I used the expression normalized as reported by ATTED-II, parsing the samples into 

twelve expression conditions through a dimensional reduction of the total dataset, followed by a 

k-means  analysis  (see  Methods  5.5.10).  Grouping  these  samples  as  expression  conditions,  I 

proceeded to calculate the PCC between genes. I employed a weighted PCC to accurately measure 

the  correlation  between  genes.  To  avoid  an  inflated  correlation  influenced  by  replicates,  I 

incorporated  a  weighting  parameter  based  on  the  correlation  of  corresponding  samples.  This 

approach helps prevent overestimation of the gene correlation. The weighted PCC was calculated 

using  the  R  package  wCorr  (Version  1.9.1)  (Emad  and  Bailey,  2017),  using  the  same  optimal 

threshold (0.4) as in ATTED-II. All global and local co-expression analyses using MR-MI values 

were carried out with the same samples used for the calculation of the respective MR-PCC values. 

The correlation-based on MI was estimated using the R package Parmigene (Version 1.0.2) (Sales 

and Romualdi, 2011), and with 1e-12 as noise to break ties due to limited numerical precision.  

5.5.3 Identification of TFs co-expressed with the corresponding target genes 

The significance of the MRs between TFs and their corresponding targets was assayed using 

both  MR-PCC  and  MR-MI  correlation  metrics,  and  two  independent  statistics  tests.  First,  I 

compared for each TF the average MR value of the targets vs. a null distribution of average MRs 

values from 1,000 random sets of genes, referred to as co-expression by MR average. Each random 

sample was generated by sampling with replacement N random genes to the N number of direct 

targets of each TF. For the MR-PCC values, I compared separately MR distributions of positively 

and negatively PCC values. To define if average MRs of the target genes were significantly smaller 

173 

 
than the null distribution, I calculated the Z-score using the MR values of the true targets using the 

random set of genes as background (which follow a gaussian distribution). The significance (P-

value)  of  corresponding  Z-score  was  corrected  for  multiple  testing  (FDR  <  0.05,  Benjamini-

Hochberg  method)  (Yoav  Benjamini  and  Yosef  Hochberg,  1995).  Secondly,  I  evaluated  the 

differences  between  target  and  non-target  genes  by  comparing  their  empirical  cumulative 

distributions.  This  was  done  using  a  one-sided  Kolmogorov-Smirnov  test,  with  the  alternative 

hypothesis being that the target genes' distribution is greater than the non-target genes' distribution. 

This test determined if the MRs of the target genes deviated significantly from those of the non-

target genes (FDR < 0.05). Both positive and negative correlations were tested independently for 

both the average-based and distribution-based co-expression assessments. 

5.5.4 Identification of targets co-expressed with TF complexes  

The identification of complex-co-expressed targets was carried out for TFs present in our list 

of TFs with PDI data and at least one protein-protein interaction (PPI) between them in BioGRID. 

In total, I found 815 protein-protein interactions (PPIs) associated with 313 different TFs. Using 

these  PPIs,  I  evaluated  the  effect  of  the  formation  of  a  TF  complex  (TFx-TFz)  over  lowly  co-

expressed targets (LCTs) of TFx by: (1) Assuming TFx-TFz as a new protein, thus, I averaged 

their expression (TFx and TFz) and then re-calculated the co-expression of the complex with a 

target y. This co-expression analysis was carried out using the weighted PCC as described above. 

(2) I also calculated the partial correlation of TFx with genes y conditioned by TFz: p(TFx ~ y | 

TFz), such that TFx and TFz interact between them and y is a TFx target. The partial correlation 

was  calculated  using  the  R  package  PPCOR  (Kim,  2015).  In  both  cases,  I  calculated  the  co-

expression of the complex against all genes in the genome to define the significant values on the 

distribution obtained (See below).  

174 

 
5.5.5 Definition of highly co-expressed targets 

I defined highly co-expressed genes as those genes in the top 5% of the correlation distribution, 

assuming  them  as  genes  with  correlation  values  significantly  different  from  the  average  of 

correlation distribution (P < 0.05). For PCC values, I took the 2.5% from each tail (i.e., 5% in 

total), while for MI values I took the top 5%. This last, given that MI does not discriminate between 

positive and negative associations. The approach was also implemented to define highly targeted 

co-expressed with a complex (TCC). 

5.5.6 Degree network connectivity 

I defined the in-degree and the out-degree as the number of TFs that bound the promoter of a 

particular target gene and the number of targets of a particular TF, respectively. Differences in 

both degrees, in- & out-degree, between TF co-expressed with its corresponding targets and those 

than not were tested by a Mann-Whitney test.  

5.5.7  Protein-Protein  Interactions  (PPIs)  and  Protein-DNA  interactions  (PDIs)  network 

randomization 

I created random PPIs and PDI networks to test the significance of the shared targets between 

dimers of the tri-bi and to test the significance of number the indirect targets within the set if genes 

highly co-expressed with a TFs, as well as significance of number the indirect targets by TFs in 

cascade. In all the cases I used the rewire function from the R package Igraph (v1.2.4.1) to generate 

the 

random 

network  with 

similar 

degree 

by 

node 

and 

avoiding 

loops 

(niter=NodesInNetwork*1000).  Random  PPI  network  was  built  with  the  directed  parameter  as 

FALSE while the random PDI was set as TRUE, which allows the shuffling of edges between TF 

and target genes only. 

175 

 
 
5.5.8 Definition of tri-bi complexes with significant number of shared targets 

In total, I selected 104 TFs after discarding tri-bi instances with no significant target overlap, 

as well as TFs involved in less than two tri-bi instances (to avoid comparison with few samples). 

To compute the differences between the random and true PPIs, I calculated the Jaccard index (J) 

between every pair of dimers involved in each tri-bi, and then I asked if the mean of the J values 

between true tri-bi instances was different from the J values mean of tri-bi instances derived from 

the random PPI collection (see randomization network description).  

5.5.9 Counting the HCG of a TFx that are targeted by TFz partners and TFy downstream of 

the corresponding TFx 

To test the significance of the percentages of HCG of TFx explained because either they are 

targets of an interactor TFz or a target TFy; I compared the actual set of HCGs recovered based on 

true interaction versus random networks (of PPI and PDI, respectively). I measured the overlap 

(Jaccard index) of the HCGs of TFx with the corresponding set of TFz and TFy targets. 

5.5.10 Definition of local expression clusters 

Given the heterogeneity of the annotation of the expression samples used in this work, I defined 

expression  clusters  based  on  the  expression  similarities  between  the  samples  analyzed.  First,  I 

downloaded  from  the  ATTED-II  database  the  normalized  expression  data  (Ath-r.v15-08) 

(Obayashi et al., 2018) used for the construction of the global co-expression database analyzed 

here.  Second,  I  dimensionally-reduced  the  expression  data  by  means  of  t-distributed  stochastic 

neighbor embedding (t-SNE) method, to then cluster the samples using the respective t-SNE 1 and 

t-SNE  2  values.  The  t-SNE  analysis  was  performed  using  the  R  package  Rtsne  (V0.15) 

(https://cran.r-project.org/web/packages/Rtsne/index.html), with the following parameters: pca set 

TRUE, perplexity=30, theta=0.5, dims=2. The clustering was performed using the R "kmeans” 

176 

 
function with scale t-SNE values and number of clusters equal to 12. I choose 12 clusters based 

on the total within sum of square (wss) value calculated using the fviz_nbclust (nboot = 300, k.max 

= 

25) 

function 

of 

the  R 

packages 

factoextra 

(v1.0.5) 

(https://cran.rproject. 

org/web/packages/factoextra/index.html). 

177 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
REFERENCES 

Albihlal, W.S., Obomighie, I., Blein, T., Persad, R., Chernukhin, I., Crespi, M., Bechtold, 

U., and Mullineaux, P.M. (2018). Arabidopsis HEAT SHOCK TRANSCRIPTION 
FACTORA1b regulates multiple developmental genes under benign and stress 
conditions. J. Exp. Bot. 69: 2847–2862. 

Arda, H.E. and Walhout, A.J.M. (2010). Gene-centered regulatory networks. Brief. Funct. 

Genomics 9: 4–12. 

Banf, M. and Rhee, S.Y. (2017). Computational inference of gene regulatory networks: 

Approaches, limitations and opportunities. Biochim. Biophys. Acta Gene Regul. Mech. 
1860: 41–52. 

Banks, C.J., Joshi, A., and Michoel, T. (2016). Functional transcription factor target discovery 

via compendia of binding and expression profiles. Sci. Rep. 6: 20649. 

Bechtold, U. et al. (2015). Time-series transcriptomics reveals that AGAMOUS-LIKE22 affects 

primary metabolism and developmental processes in drought-stressed arabidopsis. Plant 
Cell 28: 345–366. 

Bemer, M., van Dijk, A.D.J., Immink, R.G.H., and Angenent, G.C. (2017). Cross-family 

transcription factor interactions: an additional layer of gene regulation. Trends Plant Sci. 
22: 66–80. 

Besbrugge, N. et al. (2018). GSyellow, a Multifaceted Tag for Functional Protein Analysis in 

Monocot and Dicot Plants. Plant Physiol. 177: 447–464. 

Bi, C., Ma, Y., Wu, Z., Yu, Y.T., Liang, S., Lu, K., and Wang, X.F. (2017). Arabidopsis 
ABI5 plays a role in regulating ROS homeostasis by activating CATALASE 1 
transcription in seed germination. Plant Mol. Biol. 94: 197–213. 

Brandt, R. et al. (2012). Genome-wide binding-site analysis of REVOLUTA reveals a link 

between leaf patterning and light-mediated growth responses. Plant J. 72: 31–42. 

Brkljacic, J. and Grotewold, E. (2017). Combinatorial control of plant gene expression. 

Biochim. Biophys. Acta 1860: 31–40. 

Brooks, M.D., Cirrone, J., Pasquino, A.V., Alvarez, J.M., Swift, J., Mittal, S., Juang, C.-L., 
Varala, K., Gutiérrez, R.A., Krouk, G., Shasha, D., and Coruzzi, G.M. (2019). 
Network Walking charts transcriptional dynamics of nitrogen signaling by integrating 
validated and predicted genome-wide interactions. Nat. Commun. 10: 1569. 

Burks, D.J., Sengupta, S., De, R., Mittler, R., and Azad, R.K. (2022). The Arabidopsis gene 

co-expression network. Plant Direct 6: e396. 

Chen, D., Yan, W., Fu, L.Y., and Kaufmann, K. (2018). Architecture of gene regulatory 

networks controlling flower development in Arabidopsis thaliana. Nat. Commun. 9: 1–
13. 

178 

 
Colinas, M. and Goossens, A. (2018). Combinatorial Transcriptional Control of Plant 

Specialized Metabolism. Trends Plant Sci. 23: 324–336. 

Droge-Laser, W. and Weiste, C. (2018). The C/S1 bZIP Network: A Regulatory Hub 
Orchestrating Plant Energy Homeostasis. Trends Plant Sci. 23: 422–433. 

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and 

display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U. S. A. 95: 14863–
14868. 

Emad, A. and Bailey, P. (2017). wCorr: weighted correlations.–R package ver. 1.9. 1. 

Eveland, A.L. et al. (2014). Regulatory modules controlling maize inflorescence architecture. 

Genome Res. 24: 431–443. 

Finkelstein, R., Gampala, S.S.L., Lynch, T.J., Thomas, T.L., and Rock, C.D. (2005). 

Redundant and distinct functions of the ABA response loci ABA-insensitive(ABI)5 and 
ABRE-binding factor (ABF)3. Plant Mol. Biol. 59: 253–267. 

de la Fuente, A., Bing, N., Hoeschele, I., and Mendes, P. (2004). Discovery of meaningful 
associations in genomic data using partial correlation coefficients. Bioinformatics 20: 
3565–3574. 

Furuya, T., Saito, M., Uchimura, H., Satake, A., Nosaki, S., Miyakawa, T., Shimadzu, S., 
Yamori, W., Tanokura, M., Fukuda, H., and Kondo, Y. (2021). Gene co-expression 
network analysis identifies BEH3 as a stabilizer of secondary vascular development in 
Arabidopsis. Plant Cell 33: 2618–2636. 

Gao, S., Gao, J., Zhu, X., Song, Y., Li, Z., Ren, G., Zhou, X., and Kuai, B. (2016). ABF2, 
ABF3, and ABF4 Promote ABA-Mediated Chlorophyll Degradation and Leaf 
Senescence by Transcriptional Activation of Chlorophyll Catabolic Genes and 
Senescence-Associated Genes in Arabidopsis. Mol. Plant 9: 1272–1285. 

Geng, H., Wang, M., Gong, J., Xu, Y., and Ma, S. (2021). An Arabidopsis expression 

predictor enables inference of transcriptional regulators for gene modules. Plant J. 107: 
597–612. 

Gitter, A., Siegfried, Z., Klutstein, M., Fornes, O., Oliva, B., Simon, I., and Bar-joseph, Z. 

(2009). Backup in gene regulatory networks explains differences between binding and 
knockout results. Mol. Syst. Biol. 5: 1–7. 

Gomez-Cano, F., Chu, Y.-H., Cruz-Gomez, M., Abdullah, H.M., Lee, Y.S., Schnell, D.J., 
and Grotewold, E. (2022). Exploring Camelina sativa lipid metabolism regulation by 
combining gene co-expression and DNA affinity purification analyses. Plant J. 

Gregis, V. et al. (2013). Identification of pathways directly regulated by SHORT 

VEGETATIVE PHASE during vegetative and reproductive development in Arabidopsis. 
Genome Biol. 14: R56. 

179 

 
Gupta, O.P., Deshmukh, R., Kumar, A., Singh, S.K., Sharma, P., Ram, S., and Singh, G.P. 

(2021). From gene to biomolecular networks: a review of evidences for understanding 
complex biological function in plants. Curr. Opin. Biotechnol. 74: 66–74. 

Haque, S., Ahmad, J.S., Clark, N.M., Williams, C.M., and Sozzani, R. (2019). 

Computational prediction of gene regulatory networks in plant growth and development. 
Curr. Opin. Plant Biol. 47: 96–105. 

Harb, A., Krishnan, A., Ambavaram, M.M.R., and Pereira, A. (2010). Molecular and 

physiological analysis of drought stress in arabidopsis reveals early responses leading to 
acclimation in plant growth. Plant Physiol. 154: 1254–1271. 

Haynes, B.C., Maier, E.J., Kramer, M.H., Wang, P.I., Brown, H., and Brent, M.R. (2013). 

Mapping functional transcription factor networks from gene expression data. Genome 
Res. 23: 1319–1328. 

Heyndrickx, K.S., Vandepoele, K., Weigel, D., de Velde, J.V., and Wang, C. (2014a). A 

Functional and Evolutionary Perspective on Transcription Factor Binding in Arabidopsis 
thaliana. 

Heyndrickx, K.S., Van de Velde, J., Wang, C., Weigel, D., and Vandepoele, K. (2014b). A 
functional and evolutionary perspective on transcription factor binding in Arabidopsis 
thaliana. Plant Cell 26: 3894–3910. 

Hu, Z., Killion, P.J., and Iyer, V.R. (2007). Genetic reconstruction of a functional 

transcriptional regulatory network. Nat. Genet. 39: 683–687. 

Hwang, K., Susila, H., Nasim, Z., Jung, J.Y., and Ahn, J.H. (2019). Arabidopsis ABF3 and 

ABF4 Transcription Factors Act with the NF-YC Complex to Regulate SOC1 Expression 
and Mediate Drought-Accelerated Flowering. Mol. Plant 12: 489–505. 

Jensen, M.K., Lindemose, S., de Masi, F., Reimer, J.J., Nielsen, M., Perera, V., Workman, 
C.T., Turck, F., Grant, M.R., Mundy, J., Petersen, M., and Skriver, K. (2013). 
ATAF1 transcription factor directly regulates abscisic acid biosynthetic gene NCED3 in 
Arabidopsis thaliana. FEBS Open Bio 3: 321–327. 

Jiang, S. and Mortazavi, A. (2018). Integrating ChIP-seq with other functional genomics data. 

Brief. Funct. Genomics 17: 104–115. 

Jin, R., Klasfeld, S., Zhu, Y., Fernandez Garcia, M., Xiao, J., Han, S.-K., Konkol, A., and 
Wagner, D. (2021). LEAFY is a pioneer transcription factor and licenses cell 
reprogramming to floral fate. Nat. Commun. 12: 626. 

Kim, S. (2015). ppcor: An R Package for a Fast Calculation to Semi-partial Correlation 

Coefficients. Commun Stat Appl Methods 22: 665–674. 

Kulkarni, S.R. and Vandepoele, K. (2019). Inference of plant gene regulatory networks using 
data-driven methods: A practical overview. Biochim. Biophys. Acta Gene Regul. Mech.: 
194447. 

180 

 
Lacchini, E. and Goossens, A. (2020). Combinatorial Control of Plant Specialized Metabolism: 

Mechanisms, Functions, and Consequences. Annu. Rev. Cell Dev. Biol. 36: 291–313. 

Lai, X. et al. (2021). The LEAFY floral regulator displays pioneer transcription factor 

properties. Mol. Plant 14: 829–837. 

Lambert, S.A., Jolma, A., Campitelli, L.F., Das, P.K., Yin, Y., Albu, M., Chen, X., Taipale, 
J., Hughes, T.R., and Weirauch, M.T. (2018). The Human Transcription Factors. Cell 
175: 598–599. 

Lee, T.A., Nobori, T., Illouz-Eliaz, N., Xu, J., Jow, B., and Nery, J.R. (2023). A single-

nucleus atlas of seed-to-seed development in Arabidopsis. bioRxiv. 

Li, D. et al. (2016). FAR-RED ELONGATED HYPOCOTYL3 activates SEPALLATA2 but 

inhibits CLAVATA3 to regulate meristem determinacy and maintenance in Arabidopsis. 
Proc. Natl. Acad. Sci. U. S. A. 113: 9375–9380. 

Liu, S., Kracher, B., Ziegler, J., Birkenbihl, R.P., and Somssich, I.E. (2015). Negative 

regulation of ABA signaling by WRKY33 is critical for Arabidopsis immunity towards 
Botrytis cinerea 2100. Elife 4: e07295. 

Liu, T.L., Newton, L., Liu, M.-J., Shiu, S.-H., and Farré, E.M. (2016). A G-Box-Like Motif 

Is Necessary for Transcriptional Regulation by Circadian Pseudo-Response Regulators in 
Arabidopsis. Plant Physiol. 170: 528–539. 

Lynch, T., Erickson, B.J., and Finkelstein, R.R. (2012). Direct interactions of ABA-

insensitive(ABI)-clade protein phosphatase(PP)2Cs with calcium-dependent protein 
kinases and ABA response element-binding bZIPs may contribute to turning off ABA 
response. Plant Mol. Biol. 80: 647–658. 

Mejia-Guerra, M.K., Pomeranz, M., Morohashi, K., and Grotewold, E. (2012). From plant 

gene regulatory grids to network dynamics. Biochimica et Biophysica Acta (BBA) - 
Gene Regulatory Mechanisms 1819: 454–465. 

Merelo, P., Xie, Y., Brand, L., Ott, F., Weigel, D., Bowman, J.L., Heisler, M.G., and 

Wenkel, S. (2013). Genome-wide identification of KANADI1 target genes. PLoS One 8: 
e77341. 

Milo, R., Itzkovitz, S., Kashtan, N., Levitt, R., Shen-Orr, S., Ayzenshtat, I., Sheffer, M., and 

Alon, U. (2004). Superfamilies of Evolved and Designed Networks. Science 303: 1538–
1542. 

Morohashi, K. et al. (2012). A genome-wide regulatory framework identifies maize pericarp 

color1 controlled genes. Plant Cell 24: 2745–2764. 

Morohashi, K. and Grotewold, E. (2009). A systems approach reveals regulatory circuitry for 

Arabidopsis trichome initiation by the GL3 and GL1 selectors. PLoS Genet. 5: e1000396. 

Nagarajan, V.K., Satheesh, V., Poling, M.D., Raghothama, K.G., and Jain, A. (2016). 

181 

 
Arabidopsis MYB-Related HHO2 Exerts a Regulatory Influence on a Subset of Root 
Traits and Genes Governing Phosphate Homeostasis. Plant Cell Physiol. 57: 1142–1152. 

Nagel, D.H., Doherty, C.J., Pruneda-Paz, J.L., Schmitz, R.J., Ecker, J.R., and Kay, S.A. 

(2015). Genome-wide identification of CCA1 targets uncovers an expanded clock 
network in Arabidopsis. Proc. Natl. Acad. Sci. U. S. A. 112: E4802–10. 

Nolan, T.M. et al. (2023). Brassinosteroid gene regulatory networks at cellular resolution in the 

Arabidopsis root. Science 379: eadf4721. 

Obayashi, T., Aoki, Y., Tadaka, S., Kagaya, Y., and Kinoshita, K. (2018). ATTED-II in 

2018: A Plant Coexpression Database Based on Investigation of the Statistical Property 
of the Mutual Rank Index. Plant Cell Physiol. 59: e3–e3. 

Obayashi, T. and Kinoshita, K. (2009). Rank of correlation coefficient as a comparable 
measure for biological significance of gene coexpression. DNA Res. 16: 249–260. 

O’Malley, R.C., Huang, S.S.C., Song, L., Lewsey, M.G., Bartlett, A., Nery, J.R., Galli, M., 

Gallavotti, A., and Ecker, J.R. (2016). Cistrome and Epicistrome Features Shape the 
Regulatory DNA Landscape. Cell 165: 1280–1292. 

ÓMaoiléidigh, D.S., Wuest, S.E., Rae, L., Raganelli, A., Ryan, P.T., Kwasniewska, K., Das, 

P., Lohan, A.J., Loftus, B., Graciet, E., and Wellmer, F. (2013). Control of 
reproductive floral organ identity specification in Arabidopsis by the C function regulator 
AGAMOUS. Plant Cell 25: 2482–2503. 

Oughtred, R. et al. (2019). The BioGRID interaction database: 2019 update. Nucleic Acids Res. 

47: D529–D541. 

Pajoro, A. et al. (2014). Dynamics of chromatin accessibility and gene regulation by MADS-
domain transcription factors in flower development. Genome Biol. 15: R41. 

Palaniswamy, K., James, S., Sun, H., Lamb, R., Davuluri, R.V., and Grotewold, E. (2006). 

AGRIS and AtRegNet: A platform to link cis-regulatory elements and transcription 
factors into regulatory networks. Plant Phyisiol. 140: 818–829. 

Para, A. et al. (2014). Hit-and-run transcriptional control by bZIP1 mediates rapid nutrient 
signaling in Arabidopsis. Proc. Natl. Acad. Sci. U. S. A. 111: 10371–10376. 

Park, P.J. (2009). ChIP–seq: advantages and challenges of a maturing technology. Nat. Rev. 

Genet. 10: 669–680. 

Rao, X. and Dixon, R.A. (2019). Co-expression networks for plant biology: Why and how. Acta 

Biochim. Biophys. Sin. 51: 981–988. 

Ravasi, T. et al. (2010). An atlas of combinatorial transcriptional regulation in mouse and man. 

Cell 140: 744–752. 

Reményi, A., Schöler, H.R., and Wilmanns, M. (2004). Combinatorial control of gene 

182 

 
expression. Nat. Struct. Mol. Biol. 11: 812. 

Sales, G. and Romualdi, C. (2011). Parmigene-a parallel R package for mutual information 
estimation and gene network reconstruction. Bioinformatics 27: 1876–1877. 

Sayou, C. et al. (2016). A SAM oligomerization domain shapes the genomic binding landscape 

of the LEAFY transcription factor. Nat. Commun. 7: 11222. 

Shanks, C.M., Hecker, A., Cheng, C.-Y., Brand, L., Collani, S., Schmid, M., Schaller, G.E., 

Wanke, D., Harter, K., and Kieber, J.J. (2018). Role of BASIC PENTACYSTEINE 
transcription factors in a subset of cytokinin signaling responses. Plant J. 95: 458–473. 

Skubacz, A., Daszkowska-Golec, A., and Szarejko, I. (2016). The Role and Regulation of 

ABI5 (ABA-Insensitive 5) in Plant Development, Abiotic Stress Responses and 
Phytohormone Crosstalk. Front. Plant Sci. 7: 1884. 

Song, L., Huang, S.-S.C., Wise, A., Castanon, R., Nery, J.R., Chen, H., Watanabe, M., 

Thomas, J., Bar-Joseph, Z., and Ecker, J.R. (2016). A transcription factor hierarchy 
defines an environmental stress response network. Science 354. 

Stuart, J.M., Segal, E., Koller, D., and Kim, S.K. (2003). A gene-coexpression network for 

global discovery of conserved genetic modules. Science 302: 249–255. 

Swift, J. and Coruzzi, G.M. (2017). A matter of time - How transient transcription factor 

interactions create dynamic gene regulatory networks. Biochimica et Biophysica Acta 
(BBA) - Gene Regulatory Mechanisms 1860: 75–83. 

Tao, Z., Shen, L., Gu, X., Wang, Y., Yu, H., and He, Y. (2017). Embryonic epigenetic 
reprogramming by a pioneer transcription factor in plants. Nature 551: 124–128. 

Trigg, S.A. et al. (2017). CrY2H-seq: a massively multiplexed assay for deep-coverage 

interactome mapping. Nat. Methods 14: 819–825. 

Uygun, S., Peng, C., Lehti-Shiu, M.D., Last, R.L., and Shiu, S.H. (2016). Utility and 

Limitations of Using Gene Expression Data to Identify Functional Associations. PLoS 
Comput. Biol. 12: 1–27. 

Vandepoele, K., Quimbaya, M., Casneuf, T., De Veylder, L., and Van de Peer, Y. (2009). 

Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and 
coexpression networks. Plant Physiol. 150: 535–546. 

Van Leene, J. et al. (2016). Functional characterization of the Arabidopsis transcription factor 
bZIP29 reveals its role in leaf and root development. J. Exp. Bot. 67: 5825–5840. 

Varala, K. et al. (2018). Temporal transcriptional logic of dynamic regulatory networks 

underlying nitrogen signaling and use in plants. Proc. Natl. Acad. Sci. U. S. A. 115: 
6494–6499. 

Verkest, A. et al. (2014). A generic tool for transcription factor target gene discovery in 

183 

 
Arabidopsis cell suspension cultures based on tandem chromatin affinity purification. 
Plant Physiol. 164: 1122–1133. 

Wang, C., Xu, J., Zhang, D., Wilson, Z.A., and Zhang, D. (2010). An effective approach for 

identification of in vivo protein-DNA binding sites from paired-end ChIP-Seq data. BMC 
Bioinformatics 11: 81. 

Wilkins, O., Bräutigam, K., and Campbell, M.M. (2010). Time of day shapes Arabidopsis 

drought transcriptomes. Plant J. 63: 715–727. 

Wisecaver, J.H., Borowsky, A.T., Tzin, V., Jander, G., Kliebenstein, D.J., and Rokas, A. 

(2017). A Global Coexpression Network Approach for Connecting Genes to Specialized 
Metabolic Pathways in Plants. Plant Cell 29: 944–959. 

Wu, G. and Ji, H. (2013). ChIPXpress: Using publicly available gene expression data to 

improve ChIP-seq and ChIP-chip target gene ranking. BMC Bioinformatics 14. 

Xu, C., Cao, H., Xu, E., Zhang, S., and Hu, Y. (2018). Genome-Wide Identification of 

Arabidopsis LBD29 Target Genes Reveals the Molecular Events behind Auxin-Induced 
Cell Reprogramming during Callus Formation. Plant Cell Physiol. 59: 744–755. 

Yang, F. et al. (2017). A Maize Gene Regulatory Network for Phenolic Metabolism. Mol. Plant 

10: 498–515. 

Yant, L., Mathieu, J., Dinh, T.T., Ott, F., Lanz, C., Wollmann, H., Chen, X., and Schmid, 

M. (2010). Orchestration of the floral transition and floral development in Arabidopsis by 
the bifunctional transcription factor APETALA2. Plant Cell 22: 2156–2170. 

Yilmaz, A., Mejia-Guerra, M.K., Kurz, K., Liang, X., Welch, L., and Grotewold, E. (2011). 
AGRIS : the Arabidopsis Gene Regulatory Information Server , an update. Nucleic Acids 
Res. 39: 1118–1122. 

Yoav Benjamini and Yosef Hochberg (1995). Controlling the False Discovery Rate: A 

Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Series B Stat. 
Methodol. 57: 289–300. 

Yoshida, T., Fujita, Y., Sayama, H., Kidokoro, S., Maruyama, K., Mizoi, J., Shinozaki, K., 
and Yamaguchi-Shinozaki, K. (2010). AREB1, AREB2, and ABF3 are master 
transcription factors that cooperatively regulate ABRE-dependent ABA signaling 
involved in drought stress tolerance and require ABA for full activation. Plant J. 61: 672–
685. 

Zaborowski, A.B. and Walther, D. (2020). Determinants of correlated expression of 

transcription factors and their target genes. Nucleic Acids Res. 48: 11347–11369. 

Zeller, K.I. et al. (2006). Global mapping of c-Myc binding sites and target gene networks in 
human B cells. Proceedings of the National Academy of Sciences 103: 17834. 

Zhou, P., Li, Z., Magnusson, E., Gomez Cano, F., Crisp, P.A., Noshay, J.M., Grotewold, E., 

184 

 
Hirsch, C.N., Briggs, S.P., and Springer, N.M. (2020). Meta Gene Regulatory 
Networks in Maize Highlight Functionally Relevant Regulatory Interactions. Plant Cell 
32: 1377–1396. 

185 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
APPENDIX 

Co-expression based 
on MRs-PCC

57

170

4

Co-expressed 
by MR average

Co-expressed by 
MR distribution

Co-expression 
based on MR-MIs

38

128

6

Co-expressed by 
MR average

Co-expressed by 
MR distribution

Total 
TFs

s
e
Y

231

o
N

324

y
l
l
a
b
o
l
g
s
F
T

d
e
s
s
e
r
p
x
e
-
o
c

Total 
TFs

y
l
l
a
b
o
l
g
s
F
T

d
e
s
s
e
r
p
x
e
-
o
c

s
e
Y

172

o
N

383

a 

b

c 

Total 
TFs

Co-expression based on 
MR-PCCs

y
l
l
a
b
o
l
g
s
F
T

d
e
s
s
e
r
p
x
e
-
o
c

s
e
Y

231

o
N

324

186

22

23

Positive 
PCC

Negative 
PCC

Figure S5.1 Evaluation of co-expression of TFs and corresponding target genes 
a and b. Comparison of the two statistical approaches used to test differences in either average or 
distribution of MRs between targets and not targets genes by (a) PCC-MR or (b) MI-MR. c. Venn 
diagrams  comparing  the  total  number  of  positive  and  negatively  co-expressed  TFs  with  their 
targets based on PCC-MR. 

186 

 
 
 
  
 
 
 
 
 
 
 
 
MRs-MI
distribution 

s
F
T

TFs

MRs_PCC

TFs
TFs

MRs_PCC
MRs_PCC

TFs

MRs_PCC

s
t

e
g
r
a
T
20000

5000

10000

15000

MRs

%

1.6

1.6
1.6

t
t
e
e
g
g
r
r
a
a
T
T
e
e
g
g
a
a
t
t
n
n
e
e
c
c
r
r
e
e
P
P

1.2
1.2

0.9
0.9

0.9
0.3

0.6
0.6

0.3
0.3

25000

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

1.6

1.2

0.9

0.6

0.3

5000
5000

10000
10000

15000
15000

20000
20000

25000
25000

5000

10000

15000

20000

25000

MRs
MRs

MRs

s
F
T

TFs

Common

TFs
TFs

Common
Common

TFs

Common

s
t

e
g
r
a
T
20000

s
t

e
g
r
a
T
20000

5000

10000

15000

MRs

TFs

MRs_MI

5000

10000

15000

MRs

%

1.2
1.2

1.6

1.6
1.6

t
t
e
e
g
g
r
r
a
a
T
T
e
e
g
g
a
a
t
t
n
n
e
e
c
c
r
r
e
e
P
P

0.9
0.9

0.9
0.3

0.3
0.3

0.6
0.6

25000

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

1.6

1.2

0.9

0.6

0.3

5000
5000

10000
10000

15000
15000

20000
20000

25000
25000

5000

10000

15000

20000

25000

MRs
MRs

MRs

s
F
T

TFs

TFs

MRs_MI

MRs_MI

TFs

MRs_MI

%

1.2
1.2

1.6

1.6
1.6

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

0.9
0.9

0.9
0.3

0.3
0.3

0.6
0.6

25000

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

1.6

1.2

0.9

0.6

0.3

20000
25000

25000

5000

10000

15000

20000

25000

MRs

5000

5000

10000

10000
MRs

15000
MRs

15000

20000

Target %
mat
Bin
4
4
3
3
2
2
1
1
0
0

Not_Coexp

s
F
T

TFs

Not_Coexp

TFs

TFs

Not_Coexp

Not_Coexp

TFs

s
t
e
g
r
a
T
20000

5000

10000

15000

MRs

%

1.6

1.6
1.6

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

1.2
1.2

0.6
0.6

0.9
0.9

0.9
0.30.3
MR 0

25000

0.3

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

1.6

1.2

0.9

0.6

0.3

5000

5000

10000

15000

20000

10000
MRs

15000
MRs

12,500

25000
20000

25000

5000

10000

15000

20000

25000

25,000

MRs

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

1.6

1.2

0.9

0.6

0.3

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

1.6

1.2

0.9

0.6

0.3

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

1.6

1.2

0.9

0.6

0.3

t
e
g
r
a
T
e
g
a
t
n
e
c
r
e
P

1.6

1.2

0.9

0.6

0.3

Bins of 250 MRs

Figure  S5.2  Heatmaps  displaying  the  distribution  of  MR-MI  values  across  25,296 
Arabidopsis genes 
Colors represent the percentage of TF targets within bins of 250 MRs. In total, there are 101 bins 
along the PCC distribution corresponding to co-expression values of each TF with 25,296 genes 
(genes expressed in the dataset used, see Methods). Small MR represent larger MI, thus, better 
association between TF and genes in bin. Dot plots under each heat map represent the average 
percentage of targets for all the TFs along each bin. Color side bars represent TFs categories as 
presented in Figure 5.1. 

187 

 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8

6

4

10

3

7

9

2

2
e
n
s
t

11

1

12

5

tsne 1

Figure S5.3 Sample expression clusters used to define local expression values 
Clusters  were  defined  by  k-means  clustering  (k=12  defined  by  Elbow  method)  using  the  t-
Distributed Stochastic Neighbor Embedding (t-SNE) 1 and 2 of the expression data.  

188 

 
 
  
 
 
 
 
 
 
 
 
 
In-degree

Out-degree

)
e
e
r
g
e
d
(

2
g
o
L

)
q
e
r
F
(
2
g
o

l

8
8

6
6

4
4

2
2

0
0

0
No

Coexpressed

1
Yes

15.0
15.0

12.5
12.5

10.0
10.0

)
q
e
r
F
(
2
g
o

l

7.5
7.5

5.0
5.0

1
0
Yes
No
Coexpressed

TFs significantly co-expressed with their targets

Figure S5.4 In- and out-degree differences between TFs co-expressed and not-co-expressed 
with their targets 
This  classification  accounts  for  both  globally  and  locally  co-expression  results.  Both  types  of 
degree (in and out) showed statistically significant differences between TFs co-expressed or not 
co-expressed with its targets (Mann-Whitney U test, P. value < 0.05). 

189 

 
 
  
 
 
 
 
 
 
 
 
 
 
 
AT1G08320 AT1G69180 AT2G17950 AT2G31220 AT2G33710 AT2G36270 AT3G27010 AT3G57230 AT5G08130 AT5G17300 AT5G59990 AT5G61380

****

**

*

*

**

***

***

**

**

*

**

*

AT1G09530 AT1G13450 AT1G24260 AT1G25550 AT1G45249 AT1G53230 AT1G69120 AT1G69560 AT1G69690 AT1G69780 AT2G01570 AT2G33860

*

*

****

*

***

****

****

**

****

*

**

**

Class

AT2G40260 AT2G42280 AT2G45650 AT2G45660 AT2G45680 AT2G46830 AT3G01970 AT3G02150 AT3G15210 AT3G19290 AT3G21175 AT3G28920

****

**

****

****

*

***

*

****

**

**

**

***

a

b

x
e
d
n

I

d
r
a
c
c
a
J

1.00
0.75
0.50
0.25
0.00

1.00

0.75

0.50

0.25

0.00

1.00

0.75

0.50

x
e
d
n

I

0.25

0.00

AT3G47620 AT3G50700 AT3G59060 AT4G01720 AT4G18960 AT4G34000 AT5G06950 AT5G06960 AT5G08070 AT5G08330 AT5G20240 AT5G23280
****

****

****

****

**

**

*

*

*

*

*

*

AT1G69180
AT1G69180
****
****

AT1G77080
AT1G77080
***
***

AT2G01930
AT2G01930
*
*

AT2G04038
AT2G04038
*
*

AT2G17950
AT2G17950
****
****

AT2G36270
AT2G36270
****
****

AT3G57230
AT3G57230
*
*

AT3G57920
AT3G57920
*
*

1.00

d
r
a
c
c
a
0.75
J
AT1G08320
AT1G08320
0.50
****
****

0.25

1.00
x
e
0.75
d
n
I

d
0.50
r
a
c
c
a
0.25
J

1.00

0.75

0.50

0.25

x
e
d
n
I

d
r
a
c
c
a
J

AT5G24470 AT5G41315 AT5G47370 AT5G63090 AT5G65210
*

**

**

*

*

0.00

1.00

0.75

0.50

0.25

True_True
Random_Random

Random_Random

True_True

True_True

True_True
Random_Random

Random_Random

True_True

True_True
Random_Random

Random_Random

True_True

True_True
Random_Random

Random_Random

True_True

True_True
Random_Random

Random_Random

True_True

True_True
Random_Random

Random_Random

True_True

True_True
Random_Random

Random_Random

True_True

True_True
Random_Random

Random_Random

True_True

True_True
Random_Random

Random_Random

0.00

Class

Class

Class

Class
Tri-bi

True_True
True_True
True PPI

Random_Random
Random PPI

Random_Random

Class

Figure S5.5 Target genes recovered for tri-bi complexes 
a  and  b.  Comparison  of  target  genes  recovered  for  tri-bi  of  53  TFs  with  a  shared  fraction 
significantly larger (a) or smaller (b) than by random PPIs. The similarity of the recovery set of 
targets was measured as the Jaccard index between the set of targets of each pair of dimers that 
form a tri-bi complex. Asterisks indicate P-value significance (*: p <= 0.05, **: p <= 0.01, ***: p 
<= 0.001, ****: p <= 0.0001, two-sided t-test).  

190 

 
 
 
 
 
 
 
a

b

Wilcoxon, p = 0.59

Wilcoxon, p < 2.2e−16

J

0.10

0.05

0.00

J

0.10

0.05

0.00

True

Random

Class
PPIs

True

Random

Class
PDIs

Figure S5.6 Evaluation of HCG not targets of TFx 
a and b. Comparison of HCG which are not targets of TFx recovered because they are either (a) a 
target of a TFz interactor of TFx, or (b) a target of a TFy regulated by TFx vs random interaction. 
Jaccard index (J) calculated as the number of TFz/TFy targets shared with the HCGs non-targets 
of TFx over the total TFz/y targets plus total HCGs no-targets.    

191 

 
 
  
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER SIX: CONCLUSIONS 

192 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Understanding gene regulatory networks (GRNs) has significant implications at various levels 

and in every biological system. However, the unraveling of GRNs, predicting their interactions, 

and prioritizing regulatory associations with phenotypic/biological consequences remains a long-

standing  and  unsolved  problem.  I  employed  several  strategies  in  this  study  to  predict  and 

comprehend the associations between transcription factors (TFs) and target genes in different plant 

systems using various data types. The results propose previously unknown regulatory associations 

specific to these plant systems and offer guidance for future research aimed at unraveling GRNs 

in a species-specific manner. It is important to note that the species-specific strategies presented 

here were primarily based on the availability and nature of the data.  

I  developed  a  co-expression  system  with  highly  stringent  thresholds  in  Camelina,  a  plant 

known for its complex genome and limited data available (compared to maize and Arabidopsis). 

This analysis did not rely on any assumptions about TF-target gene relationships using a specific 

metabolic pathway as an example, simplifying the analysis of regulatory associations. Thus, my 

combined analysis of co-expression and PDI predicted six TFs involved with lipid metabolism in 

Camelina. Five of these TFs were not previously associated with lipid metabolism in any other 

plant system. Moving on to maize, the extensive genetic data and the increasing availability of PDI 

data sets enabled me to create a framework for evaluating various approaches for the integration 

of multi-omic data to predict TF regulatory function. In addition to finding the best strategy and 

specific regulatory hypotheses for further validation, these analyses identify numerous potential 

functional  connections  which  showed  enrichment  with  GO  terms  also  observed  in  random 

networks.  Importantly,  this  indicates  that  a  substantial  portion  of  the  predicted  associations 

involves a significant number of interactions that are either false positives or whose related GO 

terms  are  just  too  over-represented,  resulting  in  enrichment  even  from  random  interactions 

193 

 
(random networks). In addition, I showed that the embedding representation of the network allows 

not  only  to  identify  TF-functionally  redundant  but  also  TF  paralogous  potentially  redundant. 

Lastly,  the  established  method  allowed  me  to  predict  transcriptional  regulators  of  different 

biological processes as well as potential upstream regulators of the corresponding TFs. Based on 

my analysis, I predicted a comprehensive list of regulators for twenty distinct biological processes. 

These processes range from development to metabolism and include associations of transcription 

factors that were identified in the past, thus validating our predictions. 

Finally,  the  abundance  of  data  available  in  Arabidopsis  enabled  me  to  establish  a  broader 

framework regarding the predictability of target genes for TFs based on co-expression models. My 

findings confirm a long-standing observation: many direct target genes do not exhibit significant 

co-expression with their corresponding TF. Additionally, many genes co-expressed with TFs are 

not direct targets of the respective TF. However, my analysis expanded on this observation by 

revealing  the  influence  of  physical  TF-TF  interactions  and  downstream  TFs  in  explaining  the 

occurrence  of  either  low-expression  targets  or  highly  co-expressed  genes  that  are  not  targets. 

Altogether, this work established tools and strategies and provided hypotheses to understand GRN 

in plants better.  

My results predicted regulatory associations that were previously unknown, it's important to 

note  that  some  of  these  associations  are  currently  under  validation  experimentally  by  other 

researchers in the Grotewold lab. Yet, it’s also worth mentioning that these findings may have a 

fraction of false positives regulatory associations for the corresponding TFs. Determining the exact 

fraction of false positives is challenging due to the nature of the data available. While it's common 

to identify examples of potential regulatory associations in the literature, it is less common to find 

experimentally validated negative examples with nonregulatory effects. This creates uncertainty 

194 

 
regarding the false positive fraction. Yet, it is important to mention, one contributing factor to false 

positives  is  the  heterogeneity  of  the  data.  For  example,  there  are  variations  in  the  expression 

datasets analyzed using different pipelines, as well as differences in the way peaks from PDI assays 

are  called  and  the  technologies  used  to  generate  the  peaks.  Thus,  my  implementation  of  filters 

based  on  normalized  counts  and  discarding  of  peaks  without  consensus  TFBM  was  largely 

complemented by the utilization of the peaks in the context of open chromatin regions. In maize, 

this normalization allowed for comparisons across different types of experiments (i.e., DAP-seq 

and ChIP-seq) and reduced the assignment of potential false positive target genes. Specifically, I 

discarded at least 50% of the called peaks because of their overlap with nearby closed chromatin 

regions.  Therefore,  it  is  highly  recommended  to  exclude  this  layer  of  information  in  future 

endeavors to predict and understand GRN in plants. 

Aside from the methodological limitations, it is also important to highlight that all the analyses 

and results presented here examined TFs and genes as individual interactions. However, in several 

cases, my results keep leading back to situations where only the integration of the corresponding 

predictions makes sense. For example, this is evident in their interpretation as complexes (partial 

correlation  results  in  Arabidopsis)  and  modules  of  TFs  working  together  in  specific  biological 

processes  (presence  of  multiple  TFs  with  similar  influence  on  individual  GO  terms,  results  in 

maize). Therefore, a major improvement to the models and analyses presented here would include 

the interrogation of multiple TFs in the same model or establishing predictive models that consider 

the modular and combinatorial nature of gene regulation. For instance, it is easy to envision the 

typical  co-expression  model  for  linear  regression  or  a  similar  model  (regardless  of  the  model's 

assumptions)  that  considers  the  contribution  of  multiple  TFs  to  the  expression  variation  of  the 

corresponding target genes. Additionally, with techniques such as DAP-seq and ATAC-seq, it is 

195 

 
undeniable that models aiming to build regulatory models in the context of open chromatin regions, 

not  only  for  single  TFs  but  also  for  co-expressed  TFs,  will  enable  the  prediction  of  TF-target 

interactions from PDIs dataset with significantly higher accuracy.  

When I combined the results of all three systems, it's important to highlight the effectiveness 

of  integrating  multiple  layers  to  handle  the  complexity  of  gene  regulatory  networks  (GRNs). 

Moreover, considering the identification of common associations across different data types offers 

a straightforward method to analyze data and discover highly reliable interactions. Yet, looking at 

the  bigger  picture,  incorporating  information  from  various  layers  into  more  robust  prediction 

strategies reduces the likelihood of false positives, and consequently increasing the accuracy of 

the corresponding predictions. It's also worth noting that TFs generally have multiple regulatory 

functions and are significantly redundant, which has been extensively documented. In this study, 

I demonstrated that by integrating multiple data types, it's possible to narrow down this complexity 

and formulate specific testable regulatory hypotheses. 

196