MODELING AGE-DEPENDENT GENE EXPRESSION VARIABILITY IN ACUTE 

MYELOID LEUKEMIA USING A LINEAR MODEL 

 
By 
 

Raeuf Roushangar 

 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 

 

 

 
 
 
 
 

2018 

	

	

A DISSERTATION 

Submitted to 

Michigan State University 

in partial fulfillment of the requirements  

for the degree of  

Biochemistry and Molecular Biology—Doctor of Philosophy 

Quantitative Biology—Dual Major 

MODELING AGE-DEPENDENT GENE EXPRESSION VARIABILITY IN ACUTE 

MYELOID LEUKEMIA USING A LINEAR MODEL 

ABSTRACT 

 

 

 
By 
 

Raeuf Roushangar 

In 2018 alone, an estimated 20,000 new acute myeloid leukemia (AML) patients were
diagnosed in the United States, and over 10,000 of them are expected to die from the
disease. Although AML can occur in people of all ages, it is primarily diagnosed
among the elderly (median 68 years old at diagnosis), and its age-specific incidence and
prevalence increase exponentially after 50 years of age. Prognoses have significantly
improved for younger patients, but for patients older than 60 years, prognoses remain
grim: with current treatments, as many as 70% of patients will die within a year of
diagnosis. Reassessment of early diagnosis and treatment approaches should therefore be
considered, since relapse after complete remission is still the main obstacle. In this study,
we conducted a stratified computational meta-analysis of 2,213 AML patients compared to
548 healthy individuals, using curated publicly available data. We carried out an analysis of
variance on normalized, batch-corrected data, including considerations for disease, age,
tissue and sex. We identified 964 unique differentially expressed genes and 4
associated significant pathways involved in AML. Additionally, we identified 69
sex- and 372 age-related gene expression signatures relevant to AML. Finally, we used a
k-nearest neighbor (KNN) machine learning model to classify AML patients versus healthy
individuals, achieving > 90% accuracy. Overall, our findings provide a new reanalysis
of public datasets that enabled the identification of potential new gene sets relevant to
AML, which can potentially be used in future experiments and possible stratified disease
diagnostics.

	

	

Copyright by 

RAEUF ROUSHANGAR 

2018 

	

	

I dedicate this book to my mother for dedicating her entire life to raising me with
unconditional love as a single mother. I dedicate this work to this wonderful country for
opening its doors and giving me a shot. It’s a true privilege to wake up every day and be
able to think freely, which I couldn't do for so many years, and do scientific research with
a shot to contribute to science and humanity!

 


ACKNOWLEDGMENTS 

 

First and foremost, I would like to thank my PhD advisor, Dr. George I. Mias, who not
only welcomed me to his lab and mentored me with kindness and calm, but also both
challenged me and helped me mature as a scientist. Dr. Mias gave me all necessary resources
and created a great environment for me to do scientific research freely. He gave me space
to express and discuss my views without worrying; this is monumental to me. He
supported me and let me reincarnate myself into a form that I have always wanted, which
allowed me to determine who I am and what I wanted to become as a scientist. I want to
deeply thank him for teaching me how to be a scientist who strives for depth, is aware of
fundamental assumptions, and is always objective. I am forever in his debt.

 

I would also like to thank the people in the lab who made the last four years fun and
possible. Dr. Vikas V. Singh and Lavida R. K. Brooks didn’t just shape my scientific
thinking through discussions and help interpret results; they also became my friends, which I
deeply value. I am also deeply indebted to my close friends, Timothy Stachelski and
Qingfeng Zeng, who supported me through the years, edited my thoughts, and were
always there for me when I needed them. I appreciate their friendship very much.

 

I am also extremely thankful for having Dr. Ronald Henry, Dr. A. Daniel Jones, Dr.
Carlo Piermarocchi, Dr. Curtis Wilkerson, and Dr. Timothy Zacharewski as my guiding
committee. My time as a PhD student would have been dramatically different if not for
their guidance and constant support. I also want to thank Dr. Thomas D. Sharkey, Dr. Jon
Kaguni, and Mrs. Jessica Lawrence for guiding me through my graduate program and
helping on many occasions.

 

I am also thankful to Dr. Brian Schutte for giving me the opportunity and resources to
conduct research in his lab during my undergraduate years at Michigan State University. I
am also deeply indebted to Dr. Youssef A. Kousa for dedicating his time to mentor me
during my 3 years in Dr. Schutte’s lab, which helped me grow as a future scientist and
develop my own individual creative spirit.

 

Finally, I want to deeply thank Dr. Pamela J. Fraker and Dr. John LaPres for giving me a 

in helping me to be admitted into the PhD program at Michigan State University. I would 

also like to thank The Paul & Daisy Soros Fellowships for supporting this work, which 

allowed me to tackle research problems without any financial burdens. 

 

 

 

 

 

 

 

 

 

 

 


TABLE OF CONTENTS 

 

LIST OF TABLES ......................................................................................................... xi 

LIST OF FIGURES....................................................................................................... xii 

KEY TO ABBREVIATIONS ....................................................................................... xiii 

CHAPTER 1 – Acute Myeloid Leukemia..................................................................... 1 
Historical ............................................................................................................. 2 
AML statistics and incidences.............................................................................. 3 
AML characteristics............................................................................................. 3 
AML classification .............................................................................................. 4 
AML with recurrent genetic abnormalities ........................................................... 4 
•  AML with t(8;21)(q22;q22.1);RUNX1-RUNX1T1 .................................. 5 
•  AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22);CBFB-MYH11 ...... 6 
•  Acute promyelocytic leukemia with PML-RARA .................................... 7 
•  AML with t(9;11)(p21.3;q23.3);MLLT3-KMT2A .................................... 7 
•  Acute myelogenous leukemia with t(6;9)(p23;q34.1);DEK-NUP214 ........ 8 
•  AML with inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2); GATA2, MECOM. 8 
•  Acute megakaryoblastic leukemia with t(1;22)(p13.3;q13.3);RBM15-MKL1 ...... 9
•  AML with BCR-ABL1........................................................................... 10
•  AML with mutated NPM1...................................................................... 10
•  AML with biallelic mutations of CEBPA ............................................... 12
•  AML with mutated RUNX1 ................................................................... 12
AML with myelodysplasia-related changes ........................................................ 13
Therapy-related myeloid neoplasms ................................................................... 14
AML not otherwise specified ............................................................................. 15
AML complex genetic abnormalities suggests that AML evolves over time ....... 15
AML standard treatments .................................................................................. 16
The nature of AML changes with patients age ................................................... 17
Conclusion ........................................................................................................ 19
APPENDIX ....................................................................................................... 20
BIBLIOGRAPHY ............................................................................................. 24

 
CHAPTER  2  –  ClassificaIO:  machine  learning  for  classification  graphical  user 
interface ....................................................................................................................... 37 
Abstract ............................................................................................................. 38 
Introduction ....................................................................................................... 40 
ClassificaIO implementation .............................................................................. 43 
ClassificaIO backend ......................................................................................... 43 
ClassificaIO functionalities ................................................................................ 45 
•  Data input .............................................................................................. 45 

 


•  Classifier selection ................................................................................. 46 
•  Model training ....................................................................................... 47 
•  Results output ........................................................................................ 47 
•  Model export .......................................................................................... 47 
•  Results export ........................................................................................ 48 
Results: illustrative examples and data used ....................................................... 48 
•  Iris prediction using Iris dataset .............................................................. 48
•  Sex prediction using microarray gene expression data ............................ 48 
Discussion ......................................................................................................... 49 
ClassificaIO:  setup,  dependencies,  installation,  instruction,  and,  step  by  step 
working examples .............................................................................................. 51 
Summary ........................................................................................................... 51 
Dependencies .................................................................................................... 51 
Prerequisites ...................................................................................................... 51 
Installation instructions ...................................................................................... 52 
Iris dataset prediction using a logistic regression classifier ................................. 53 
•  Training data input ................................................................................. 53 
•  Data format ............................................................................................ 53 
•  Classifier selection ................................................................................. 54 
•  Model training, evaluation, validation and result output ......................... 56 
•  Testing data input and result output ........................................................ 57 
•  Result export .......................................................................................... 58 
•  ClassificaIO model input ........................................................................ 58 
APPENDIX ....................................................................................................... 60 
BIBLIOGRAPHY ............................................................................................. 77 

 
CHAPTER 3 – Computational Meta-analysis of Gene Expression in Acute Myeloid 
Leukemia ..................................................................................................................... 80 
Abstract ............................................................................................................. 81 
Introduction ....................................................................................................... 82 
Results ............................................................................................................... 84 
•  Data curation and gene expression preprocessing .............................................. 84
•  Classification of missing metadata annotation .................................................. 85 
•  Batch correction ..................................................................................... 85 
•  Analysis  1:  Gene  expression  meta-analysis  and  enrichment  analysis  of 
AML disease state compared to healthy individuals. .............................. 86 
o  Gene expression meta-analysis of AML disease state.................. 86 
o  Gene enrichment analysis AML disease state DE genes .............. 87 
•  Analysis 2: gene expression meta-analysis and enrichment analysis of sex- 
and age-related DE genes in AML .......................................................... 88 
o  Analysis  2a.  Sex-relevance  differential  gene  expression  meta-
analysis and associated signaling pathways in AML ................... 89 
o  Analysis  2b.  Age-dependent  differential  gene  expression  meta-
analysis and associated signaling pathways in AML ................... 90 
o  Age-dependent genes analysis for drug to gene interaction ......... 91 

 


Discussion ......................................................................................................... 92 
•  Analysis  1  discussion:  Gene  expression  meta-analysis  of  AML  disease 
state ....................................................................................................... 94 
•  Analysis 2a discussion ........................................................................... 97 
•  Analysis 2b discussion ........................................................................... 98 
•  Future research possibilities and study limitations ................................ 100 
Methods........................................................................................................... 102 
•  Gene expression data curation and screening criteria ............................ 102 
•  Gene expression data sets used in our analysis ..................................... 103 
•  Datasets annotation and preprocessing ................................................. 103 
•  Prediction  of  missing  sex-  and  sample  source  annotations  from  curated 
datasets ................................................................................................ 104 
•  Dataset-wise correction approach for batch effects correction .............. 105 
•  Gene expression meta-analysis ............................................................. 106 
•  Functional and pathway enrichment analysis ........................................ 108 
•  Using k-nearest neighbor to predict AML ............................................ 108 
•  Online data availability ........................................................................ 108 
APPENDIX ..................................................................................................... 109 
BIBLIOGRAPHY ........................................................................................... 147 

 
CHAPTER 4 – Summary and Outlook .................................................................... 157 
Conclusion ...................................................................................................... 158 

Outlook ........................................................................................................... 160	

 

 

 

 


LIST OF TABLES 

 

Table  1.  Classification  of  AML  according  to  the  WHO  acute  myeloid  leukemia  and 
related neoplasms classification system ......................................................................... 22 
 
Table  2.  Cytogenetic  abnormalities  sufficient  for  the  diagnosis  of  AML  with 
myelodysplasia-related changes (AML-MRC) ............................................................... 23 
 
Table 3. ClassificaIO software information ................................................................... 75 
 
Table 4. Classification algorithms included in ClassificaIO ........................................... 76 
 
Table 5. Summary table of all 34 gene expression data sets used in our study .............. 137 
 
Table 6. Top 10 up- and down-regulated of DE genes in AML from disease state meta-
analysis ....................................................................................................................... 139 
 
Table  7.  KEGG  functional  analysis  of  974  DEPS  from  meta-analysis  of  34  gene 
expression data sets ..................................................................................................... 140 
 
Table 8. AML sex relevance (male - female) DE genes & associated signaling pathways
.................................................................................................................................... 142 
 
Table  9.  AML  age-dependent  (AML  -  healthy)  DE  genes  &  associated  signaling 
pathways ..................................................................................................................... 143 
 
Table 10. Age-dependent genes show drug to gene interaction .................................... 145 

 


LIST OF FIGURES 

 

Figure 1. Genes frequently mutated in AML according to TCGA .................................. 21 
 
Figure 2. Workflow summary of ClassificaIO ............................................................... 61 
 
Figure 3. ClassificaIO main window ............................................................................. 62 
 
Figure 4. ClassificaIO user interface (Mac OS shown) .................................................. 63 
 
Figure 5. Graphical control element dialog box ............................................................. 65 
 
Figure 6. Current data upload panel ............................................................................... 66 
 
Figure 7. Gene expression sex prediction using linear support vector classifier .............. 67 
 
Figure 8. Selected logistic regression classifier .............................................................. 68 
 
Figure 9. Trained logistic regression classifier ............................................................... 69 
 
Figure 10. Tested logistic regression classifier ............................................................... 70 
 
Figure 11. ‘Already Trained My Model’ window .......................................................... 71 
 
Figure 12. Training and testing using gene expression data ........................................... 72 
 
Figure 13. Trained linear support vector machine classifier ........................................... 73 
 
Figure 14. Features data ................................................................................................ 74 
 
Figure 15. General approach, data curation, and analysis workflow summary .............. 110 
 
Figure  16.  Principal  component  analysis  of  all  2,761  subjects  before  and  after  batch 
correction .................................................................................................................... 112 
 
Figure 17. Functional classification of DEPS from AML disease state meta-analysis and 
associated KEGG and GO enrichment analysis ........................................................... 117 
 
Figure 18. Sex-related gene expression meta-analysis in AML .................................... 124 
 
Figure 19. Age-related gene expression meta-analysis in AML ................................... 129 

 


KEY TO ABBREVIATIONS 

 

AML................................................................................................acute myeloid leukemia 
 
WHO…………………………………………………………..World Health Organization 
 
AML-RGA...…………………………………….AML with recurrent genetic abnormalities
 
t(8;21)………………………………….…………t(8;21)(q22;q22.1);RUNX1-RUNX1T1 
 
RUNX1………………………………………………Runt Related Transcription Factor 1 
 
CBFB…………………………………………………..Core-Binding Factor Beta Subunit 
 
inv(16) or t(16;16)……….........inv(16)(p13.1q22) or t(16;16)(p13.1;q22);CBFB-MYH11 
 
MYH11……………………………………………………...……Myosin Heavy Chain 11 
 
APL……………………………………………….…….…..acute promyelocytic leukemia 
 
PML.……………………………………………………………...Promyelocytic Leukemia
 
RARA………………………………………………………Retinoic Acid Receptor Alpha 
 
t(9;11)………………………..…………………….t(9;11)(p21.3;q23.3);MLLT3-KMT2A 
 
KMT2A……………………………………………………...Lysine Methyltransferase 2A 
 
MLLT3……………………………………...MLLT3, Super Elongation Complex Subunit 
 
AMGL…………………………………………...…...…..….acute myelogenous leukemia 
 
t(6;9).…………………………………………………….t(6;9)(p23;q34.1);DEK-NUP214 
 
DEK…………………………………………………………………DEK Proto-Oncogene 
 
NUP214…………………………………………………………………...Nucleoporin 214 
 
CR……………………………………………………………………...complete remission 
 
inv(3) or t(3;3)..………...…inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2); GATA2, MECOM 
 
GATA2…………………………………………………………..GATA Binding Protein 2 
 

 


MECOM………………………………………………..MDS1 And EVI1 Complex Locus 
 
AMKL……………………………………………….….acute megakaryoblastic leukemia 
 
t(1;22)………………………………………………..t(1;22)(p13.3;q13.3);RBM15-MKL1 
 
RBM15…………………………………………………….RNA Binding Motif Protein 15 
 
MKL1……………………………………..Megakaryoblastic Leukemia (Translocation) 1 
 
BCR……………………………………...BCR, RhoGEF And GTPase Activating Protein 
 
ABL1…………………………...ABL Proto-Oncogene 1, Non-Receptor Tyrosine Kinase 
 
CML…………………………………………………………….chronic myeloid leukemia 
 
ALL…………………………………………………………acute lymphoblastic leukemia 
 
NPM1………………………………………………………………..…...Nucleophosmin 1 
 
CEBPA…………………………………………CCAAT Enhancer Binding Protein Alpha 
 
bi-CEBPA..………………………………..……………......biallelic mutations of CEBPA 
 
AML-MRC…………………………………......AML with myelodysplasia-related changes
 
OS………………………………………………………………………….overall survival 
 
MDS……………………………………………………………myelodysplastic syndrome 
 
MDS/MPN………………………………..myelodysplastic/myeloproliferative neoplasms 
 
t-MN…………………………………………………..therapy-related myeloid neoplasms 
 
AML-NOS…………………………………………………….AML not otherwise specified
 
GUI………………………………………………………………...graphical user interface 
 
GEO……………………………………………………………Gene Expression Omnibus 
 
BM……………………………………………………………………………bone marrow 

PB…………………………………………………………………………peripheral blood 
 
KNN…………………………………………………………………….k-nearest neighbor  
 

 


LR……………………………………………………………………….logistic regression 
 
PCA………………………………………………………….principal component analysis 
 
DGE…………………………………………………………...differential gene expression 
 
DEPS…………………………………………………...differentially expressed probe sets 
 
ANOVA……………………………………………………………...Analysis of Variance 
 
HSD………………………………………………………Honestly Significant Difference 
 
WT1………………………………………………………………………...Wilms tumor 1 
 
CRISP3…………………………………………………...cysteine-rich secretory protein 3 
 
KEGG……………………………………….Kyoto Encyclopedia of Genes and Genomes 
 
GO…………………………………………………………………….……Gene Ontology 
 
DAVID………………...Database for Annotation, Visualization and Integrated Discovery 
 
JUP……………………………………………………………………junction plakoglobin 
 
CCNA1…………………………………………………………………………...cyclin A1 
 
FLT3…………………………………………………………fms-related tyrosine kinase 3 
 
PIK3R1…………………………..phosphoinositide-3-kinase, regulatory subunit 1 (alpha) 
 
CD14……………………………………………………………………….CD14 molecule 
 
CEBPE………………………………CCAAT/enhancer binding protein (C/EBP), epsilon 
 
DDX3Y………………………………………………….DEAD-Box Helicase 3 Y-Linked 
 
EIF1AY……………………………Eukaryotic Translation Initiation Factor 1A Y-Linked 
 
KDM5D…………………………………………………………...Lysine Demethylase 5D 
 
RPS4Y1………………………………………………...Ribosomal Protein S4 Y-Linked 1 
 
XIST………………………………………………………..X Inactive Specific Transcript 
 
TSIX……………………………………………...TSIX Transcript, XIST Antisense RNA 
 

 


PRKX…………………………………………………………….Protein Kinase X-Linked 
 
HOX………………………………………………………………………homeobox genes 
 
ORM1………………………………………………………………………Orosomucoid 1 
 
NCBI……………………………………..National Center for Biotechnology Information 
 
RMA…………………………………………………………Robust Multi-Array Average 
 
PM……………………………………………………………………………perfect match 
 
DGIdb………………………………………………..The Drug Gene Interaction Database 
 

 


Chapter 1 - 

Acute Myeloid Leukemia 


 

Historical 

Cancer was first identified and described in Egypt, where evidence from ancient
Egyptian mummies and manuscripts dates back more than 3,500 years 1. It was the Greek
physician Hippocrates (460-370 BC), however, who coined the word “cancer”
(καρκίνος in Greek) 1. According to Gordon Piller 2, blood malignancies were hard to
diagnose since microscopic examination of the blood was not possible until Robert
Hooke published work on microscopy in 1665 3. In 1674, Anton van Leeuwenhoek was
the first to describe human red blood cells, and in 1749, white blood cells (including
lymphocytes) were described by Joseph Lieutaud 2,4.

 

In 1845, John Hughes Bennett, then pathologist at the Royal Infirmary of Edinburgh,
carried out the post mortem of a patient and reported that the patient’s blood was affected
throughout his system 5. Around the same time, other cases with blood abnormalities were
reported by Rudolf Virchow in Berlin and Henry Fuller in London 6,7. The findings of
Bennett, Virchow, and Fuller led to the recognition of leukemia as a distinct disease 2.
The earliest recorded case of acute leukemia, a form of leukemia, dates to 1857,
when the German pathologist Nikolaus Friedreich observed masses of leukocytes that had
formed in the thorax of his 46-year-old patient 6 weeks before her death 8.

 

In 1880, Paul Ehrlich developed staining methods to stain and trace blood cells; his
work led to the classification of myeloid and lymphoid leukemia subtypes 2,9. One of the
earliest recorded epidemiological studies, covering 154 cases of leukemia, took place
in 1879, when W. R. Gowers and others speculated that the disease might be due to
exposure to malaria 10. In 1894, Dr. Richard Cabot, then a physician in Boston,
published his work on 34 patients who had an average survival of 4.5 weeks after their
diagnosis, which was vital to the recognition of acute leukemia 11. In 1909,
Dr. Robert J. M. Buchanan clinically described acute myeloid leukemia (AML), its onset
and rapid progression 2,12.

 

AML statistics and incidences 

Each year, cancer affects millions of people in the United States (US) and around the
world 13-15. Within the US, cancer is the second leading cause of death after heart disease,
with 1,735,350 new cases and 609,640 deaths projected for 2018 14. Leukemia is a cancer
of the blood and is currently the 9th most common type of cancer and the 6th leading cause
of death in males and 7th in females in the US 14. Myeloid leukemia is the most common
type of leukemia, and AML accounts for 70% of myeloid leukemia and nearly 80% of
acute leukemia cases, making it the most common form of both myeloid and acute
leukemia 14,16,17. The number of new AML cases is increasing each year: in 2018 alone,
an estimated 60,300 patients were newly diagnosed with leukemia. About 20,000 of
these are AML cases, and over 10,000 of these AML patients are expected to die from the
disease 18. In fact, AML has the highest mortality rate of all leukemia-related diseases 19.

 

AML characteristics 

AML is a blood cancer that is best described as several heterogeneous diseases with many
complex genetic abnormalities. Specifically, AML is a multifactorial cancer of the
myeloid cell lineage of the hematopoietic system that begins in the bone marrow. AML is
characterized by terminal differentiation of normal blood cells and excessive proliferation
and release of abnormally differentiated myeloid cells (leukemia cells) at various stages
of myeloid hematopoiesis 20. This faster than normal and uncontrolled growth leads to
abnormal accumulation of leukemic cells in the bone marrow and peripheral
blood, frequently resulting in suppression of healthy myeloid precursors of the
hematopoietic system and hematopoietic insufficiency 20.

 

AML classification 

According to the 2016 revision of the World Health Organization (WHO) classification
of myeloid neoplasms and acute leukemia, there are a number of major disease
categories of AML and many subtypes (Table 1) 21. This classification system is based on
factors that affect AML prognosis, including cytogenetic abnormalities, molecular
genetic alterations, morphologic features, immunophenotype, and biological and clinical
information 16,21. The major categories of AML classification are: 1) ‘AML with recurrent
genetic abnormalities’, 2) ‘AML with myelodysplasia-related changes’, 3) ‘Therapy-related
myeloid neoplasms’, and 4) ‘AML not otherwise specified’, described below.

 

AML with recurrent genetic abnormalities 

Hereafter abbreviated AML-RGA. Chromosomal abnormalities including deletions,
duplications, translocations, inversions, and gene fusions occur frequently in AML 22.
AML-RGA encompasses a number of different AML subgroups with specific distinctive
chromosomal abnormalities, which we list here and discuss further below:

•  ‘AML with t(8;21)(q22;q22.1);RUNX1-RUNX1T1’
•  ‘AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22);CBFB-MYH11’
•  ‘Acute promyelocytic leukemia with PML-RARA’
•  ‘AML with t(9;11)(p21.3;q23.3);MLLT3-KMT2A’
•  ‘Acute myelogenous leukemia with t(6;9)(p23;q34.1);DEK-NUP214’
•  ‘AML with inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2); GATA2, MECOM’
•  ‘Acute megakaryoblastic leukemia with t(1;22)(p13.3;q13.3);RBM15-MKL1’
•  ‘AML with BCR-ABL1’
•  ‘AML with mutated NPM1’
•  ‘AML with biallelic mutations of CEBPA’
•  and ‘AML with mutated RUNX1’ 16,21.

 

•  AML with t(8;21)(q22;q22.1);RUNX1-RUNX1T1 

Hereafter abbreviated AML with t(8;21). The translocation between chromosomes 8 and 21,
t(8;21), is one of the most common AML chromosomal abnormalities and is associated
with 12% of all AML cases 23. In 1973, Dr. Janet Rowley was the first to discover the
translocation and breaks at q22;q22 in chromosomes 8 and 21 in a female patient with
acute leukemia 24. In early 1990, the locations of RUNX1 and RUNX1T1 were identified
at the translocation site 23. In 1993, Miyoshi et al. 25 reported that the t(8;21)
translocation in AML results in the RUNX1-RUNX1T1 fusion protein. RUNX1 is a gene
encoding a DNA-binding transcription factor that binds to DNA using its runt-homology
domain and interacts with CBFB, a common heterodimeric partner 26.

 


 

•  AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22);CBFB-MYH11 

Hereafter abbreviated AML with inv(16) or t(16;16). The inversion and/or translocation
of chromosome 16, inv(16) or t(16;16), is among the most frequently observed
chromosomal abnormalities found in AML and is detected in about 16% of AML cases 27.
In 1983, Le Beau et al. 28 were the first to report inv(16) in leukemic cells from
newly diagnosed AML patients with abnormal bone marrow. In 1993, Dr. Paul Liu
identified the two genes located at the inversion breakpoints, CBFB and MYH11, whose
fusion produces a chimeric mRNA that encodes CBFB-MYH11, the fusion protein
product of inv(16) in AML 29,30. CBFB, located at 16q22, encodes the beta
subunit of the core binding transcription factor, whereas MYH11, located at 16p13.1,
encodes the smooth muscle myosin heavy chain 11.

 

RUNX1 and CBFB are both crucial to transcriptional regulation of healthy hematopoietic 

development 26,31. Chromosomal abnormalities and mutations in RUNX1 and CBFB 

result in terminal differentiation of healthy myeloid cells and uncontrolled proliferation 

of  leukemia  cells  at  various  stages  of  hematopoiesis,  which  ultimately  leads  to 

hematological malignancies  26,32. AML with t(8;21) and AML with inv(16) or t(16;16) 

are  classified  as  core  binding  factor  (CBF)  AML  and  together  they  account  for 

approximately  20%  of  all  adult  AML  cases  31.  These  AML  subtypes  are  commonly 

associated with favorable prognosis and response to conventional therapy 33. 

 

 

 


 

•  Acute promyelocytic leukemia with PML-RARA 

Acute  promyelocytic  leukemia  (APL)  is  a  subtype  of  AML  that  has  distinct  and  clear 

biological features. APL accounts for approximately 10% of all AML cases 34,35. In 1957, 

Dr. Leif Hillestad was the first to identify APL and characterize its clinical features 36. 

Rowley et al. (1977) 37 identified the APL cytogenetic signature as the reciprocal 

translocation between chromosomes 15 and 17, t(15;17), which results in fusion of the 

PML and RARA genes 38. RARA is involved in transcriptional regulation, gene 

expression, and various other biological processes, including its function as a ligand-

dependent receptor for retinoic acid binding 39-41. The PML-RARA fusion protein 

represses the downstream retinoic acid response and target gene expression, resulting in 

abnormal and uncontrolled cell proliferation and suppression of normal cellular processes 

42,43. 

 

•  AML with t(9;11)(p21.3;q23.3);MLLT3-KMT2A 

Hereafter abbreviated AML with t(9;11). Chromosomal abnormalities that lead to the 

translocation between chromosomes 9 and 11, t(9;11), result in fusion of the KMT2A gene 

(also known as MLL, MLL1, ALL1) with the MLLT3 gene (also known as AF9). AML 

patients with the MLLT3-KMT2A fusion resulting from t(9;11) usually have 

short survival, frequent disease relapse, and poor clinical outcome 44-47. KMT2A 

gene rearrangement has been reported in approximately 10% of all acute leukemia cases 

48.  KMT2A,  located  at  11q23,  encodes  a  transcription  factor  that  is  involved  in  gene 

expression regulation essential to chromatin remodeling, development, and hematopoiesis 

49. 


 

•  Acute myelogenous leukemia with t(6;9)(p23;q34.1);DEK-NUP214 

Hereafter  abbreviated  AMGL  with  t(6;9).  AMGL  with  chromosomal  aberration 

t(6;9)(p23;q34.1) is a rare form of AML and is observed in about 0.5% to 4% of all AML 

patients 50. In 1976, Dr. Janet Rowley and Dr. David Potter studied bone marrow samples 

obtained from 50 adult patients and reported the translocation between chromosomes 6 

and 9 in AMGL 51. This translocation, t(6;9), results in 

fusion of the DEK gene (located at 6p23) with the NUP214 gene (located at 9q34) 50,52. 

AML patients with the chimeric DEK-NUP214 fusion gene have a poor prognosis, and only 

50% achieve complete remission (CR) with conventional chemotherapy 52. 

 

•  AML with inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2); GATA2, MECOM 

Hereafter abbreviated AML with inv(3) or t(3;3). Chromosomal abnormalities that lead to 

inversion  and/or  translocation  in  the  chromosome  3  long  arm  have  been  detected  in 

approximately  1%  to  2%  of  all  AML  cases  53.  In  particular, inv(3)  abnormalities  have 

been  the  most  frequently  observed  chromosomal  abnormalities  in  this  group  and  are 

associated with poor prognosis, poor treatment response, and a median survival of less than 

1 year 53-56. The GATA2 gene encodes the transcription factor GATA binding protein 2, 

an important regulator of hematopoietic cell differentiation 57, whereas the MECOM gene 

(also known as EVI1, MDS1) encodes the transcriptional regulator MDS1 and EVI1 

complex  locus,  which  is  involved  in  cell  differentiation  essential  to  development  and 

hematopoiesis 58.  

 


 

Recently, Gröschel et al. (2014) 59 and Yamazaki et al. (2014) 60 revealed inv(3) biology 

in AML: they discovered that in inv(3), the GATA2 enhancer is repositioned from 3q21 

to  be  in  close  proximity  with  the  MECOM  gene  at  3q26.  This  rearrangement  in  turn 

activates MECOM gene expression and causes GATA2 haploinsufficiency at its original 

location, which ultimately leads to leukemogenesis 59,60. 

 

•  Acute megakaryoblastic leukemia with t(1;22)(p13.3;q13.3);RBM15-MKL1 

Hereafter  abbreviated  AMKL  with  t(1;22).  AMKL  is  a  rare  hematologic  malignant 

disease that is detected in <1% of all AML patients 61. AMKL is closely associated (high 

incidence) with infants and young children 62-64. In 1991, the translocation between 

chromosomes 1 and 22, t(1;22), was first reported as the principal nonrandom cytogenetic 

signature in infants with AMKL 65,66. Ma et al. (2001) 67 reported that this 

chromosomal translocation fuses two novel genes, RBM15 (also known as OTT), 

located  at  1p13,  and  MKL1,  located  at  22q13,  which  generates  the  RBM15-MKL1 

chimeric protein product. 

 

The  RBM15  gene  encodes  three  RNA-recognition  motifs  involved  in  modulating  Hox 

homeotic  function,  which  regulates  the  Ras/MAP  kinase  signaling  essential  to  cell 

differentiation  and  proliferation  68.  The  MKL1  gene  encodes  an  SAP  DNA  binding 

domain  that  is  involved  in  transcription  regulation,  chromatin  remodeling,  and 

extracellular signaling pathways 67,69,70. AMKL patients with RBM15-MKL1 fusion as a 

result of the t(1;22) translocation have poor prognosis and clinical course with less than 1 

year survival time from diagnosis 63. 


 

•  AML with BCR-ABL1 

Chromosomal abnormalities leading to the translocation between chromosomes 9 and 22 

result  in  the  BCR-ABL1  fusion  gene,  commonly  referred  to  as  the  Philadelphia 

chromosome. It is most frequently associated with chronic myeloid leukemia (CML) and 

acute lymphoblastic leukemia (ALL). AML with BCR-ABL1 accounts for 0.5% to 3% of 

all AML cases 71-74. Because of recent improvement in the reliability and standardization 

of  diagnosis  for  this  rare  disease,  AML  with  BCR-ABL1  was  recently  added  as  a 

provisional  entity  in  the  2016  WHO  newly  revised  myeloid  neoplasms  and  acute 

leukemia classification system 21. 

 

The  ABL1  gene  encodes  the  ABL  protooncogene  non-receptor  tyrosine  kinase  protein 

involved in cell division and apoptosis 75. The function of the BCR gene product is 

complex; however, Duejmann et al. (1991) 76 and Maru et al. (1991) 77 showed the 

participation of BCR in eukaryotic intracellular signaling via phosphorylation and GTP-

binding 78. Since BCR-ABL1 fusion affects the regulation of hematopoietic cells 78, AML 

patients with the aberrant BCR-ABL1 fusion gene have an unfavorable prognosis and are 

among the AML poor-risk group 79. 

 

•  AML with mutated NPM1 

Mutations in the NPM1 gene are the most common mutations found in 

AML patients: they are detected in approximately 30% of all AML patients and 60% 

of AML patients with normal karyotype 80,81. The NPM1 gene encodes the 

nucleophosmin 1 protein, which is a member of the nucleophosmin/nucleoplasmin 

protein family 82. Under normal conditions, the NPM1 protein is mainly restricted to the 

nucleolus,  but  shuttles  between  the  nucleus,  where  it  modulates  pre-ribosomal  protein 

nuclear export, and the cytoplasm, where it regulates centrosome duplication, during the 

cell cycle 81,83. 

 

Falini et al. (2005) 80 examined bone marrow specimens from 591 primary 

AML patients and found that 208 (35.2%) of the 591 AML patients had NPM1 gene 

mutations and cytoplasmic dislocation of the NPM1 protein, and suggested that NPM1 

gene  mutations  that  cause  changes  in  the  NPM1  protein  are  responsible  for  the 

translocation  of  the  NPM1  protein  from  the  nucleus  to  the  cytosol.  Cytoplasmic 

dislocation of the NPM1 protein as a result of NPM1 genetic mutations is thought to play 

a major role in leukemogenesis 81. 

 

NPM1 is important in many cellular processes, including DNA repair and cell survival 84, 

ribosome biogenesis  85, chromatin remodeling  86, protein chaperoning  87, and regulation 

of the ARF–p53 tumor suppressor pathway 80,88,89. AML patients with NPM1 gene 

mutations, a normal karyotype, and no Fms related tyrosine kinase 3 internal 

tandem duplication (FLT3 ITD) mutations are associated with a favorable 

prognosis, respond to conventional therapy, achieve CR, and are among the AML 

favorable-risk group 80,81,90. 

 

 

 


 

•  AML with biallelic mutations of CEBPA 

Hereafter abbreviated AML with bi-CEBPA. Mutations in the CEBPA 

gene are present in 10% to 15% of all AML cases and are closely associated with 

AML patients who have a cytogenetically normal karyotype 91,92. The CEBPA gene encodes a 

transcription factor that is involved in, and upregulated during, progenitor cell differentiation 

and proliferation 93,94. Mutations in the CEBPA gene lead to terminal or abnormal cell 

differentiation, and ultimately to leukemogenesis 94,95. 

 

The  CEBPA  gene  has  two  hotspots  where  mutations  cluster:  the  N-terminal,  where 

frame-shift  (insertions/deletions)  mutations  affect  the  CEBPA  transactivation  domains, 

and the C-terminal, where in-frame mutations (insertions/deletions) in the DNA-binding 

motif affect protein dimerization and DNA binding 95-97. AML patients with biallelic 

mutations of CEBPA (bi-CEBPA), that is, mutations on both CEBPA alleles, have mutations 

in both hotspots: frame-shift and in-frame mutations at the N- and C-termini, 

respectively 98. Furthermore, only AML patients with bi-CEBPA are associated 

with favorable clinical outcome and improved survival; AML patients with single 

CEBPA mutations or wild-type CEBPA are not 99,100. 

 

•  AML with mutated RUNX1 

RUNX1 is a transcription factor that is expressed in healthy hematopoietic cells and is essential 

to the hematopoietic system. In adults, mutations in the RUNX1 gene lead to AML 101,102. 

AML with mutated RUNX1 accounts for approximately 10% of all AML cases and is 

associated  with  newly-diagnosed  (de  novo)  AML  patients  103-105.  Gaidzik  et  al.  (2016) 


 

investigated the RUNX1 gene mutation frequency and prognosis in 2439 de novo AML 

patients  and  reported  that  245  (10%)  of  the  2439  AML  patients  had  RUNX1  gene 

mutations with almost no chromosomal abnormalities 105. The disease was recently added 

as  a  provisional  entity  in  the  2016  WHO  newly  revised  myeloid  neoplasms  and  acute 

leukemia classification system, since AML patients with mutated RUNX1 have distinct 

biological  features  including  poor  disease  prognosis  with  worse  overall  survival  (OS) 

compared to other AML types 21. 

 

AML with myelodysplasia-related changes 

AML with myelodysplasia-related changes (AML-MRC) is a heterogeneous disease that 

is closely associated with myelodysplastic syndromes (MDS), 

myelodysplastic/myeloproliferative neoplasms (MDS/MPN), poor prognosis, and elderly 

patients 21. AML-MRC cases account for approximately 20% to 30% of all AML cases 

and  are  among  an  AML  poor  risk  group  21,106.  Cell  morphology,  cytogenetic 

abnormalities, and clinical features are important factors for AML-MRC prognosis 21. 

 

According to the 2016 WHO newly-revised ‘myeloid neoplasms and acute leukemia’ 

classification system, AML-MRC classification criteria include: 1) presence of 50% or 

more multilineage dysplasia (two or more cell lines) with no presence of NPM1 gene 

mutations or biallelic mutated CEBPA, and 2) presence of at least 20% blast cells in the 

bone marrow or peripheral blood, and/or 3) previous history of MDS or MDS/MPN, or 

presence of MDS or MDS/MPN related cytogenetic abnormalities (Table 2), excluding 

abnormalities associated with NPM1 gene mutations or biallelic mutated CEBPA 

(del(9q)), or related to AML-RGA, or to prior cytotoxic therapy used for unrelated 

disease 107-111. 
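
One possible reading of the criteria above can be sketched as a small decision function. This is an illustrative sketch only and not a diagnostic tool: the `Case` fields and the `is_aml_mrc` name are hypothetical, and the way the numbered criteria are combined (a blast threshold, plus at least one of the dysplasia, history, or cytogenetic criteria, minus the stated exclusions) is an assumption about how the "and/or" in the text is meant.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """Hypothetical summary of one patient's findings (all names are assumptions)."""
    blast_percent: float                         # blasts in bone marrow or peripheral blood
    multilineage_dysplasia: bool = False         # >=50% dysplasia in two or more cell lines
    npm1_mutated: bool = False
    biallelic_cebpa: bool = False
    prior_mds_or_mds_mpn: bool = False           # previous history of MDS or MDS/MPN
    mds_related_cytogenetics: bool = False       # any abnormality listed in Table 2
    recurrent_genetic_abnormality: bool = False  # case already belongs to AML-RGA
    prior_cytotoxic_therapy: bool = False        # therapy given for an unrelated disease


def is_aml_mrc(case: Case) -> bool:
    """Return True if the case meets the AML-MRC criteria as read above."""
    # Criterion 2: at least 20% blasts is treated as mandatory in this reading.
    if case.blast_percent < 20:
        return False
    # Exclusions: AML-RGA and prior cytotoxic therapy are classified elsewhere.
    if case.recurrent_genetic_abnormality or case.prior_cytotoxic_therapy:
        return False
    # Criterion 1: dysplasia counts only without NPM1 or biallelic CEBPA mutations.
    dysplasia_qualifies = case.multilineage_dysplasia and not (
        case.npm1_mutated or case.biallelic_cebpa
    )
    # Criterion 3: prior MDS or MDS/MPN, or MDS-related cytogenetics.
    return (
        dysplasia_qualifies
        or case.prior_mds_or_mds_mpn
        or case.mds_related_cytogenetics
    )


if __name__ == "__main__":
    example = Case(blast_percent=25, multilineage_dysplasia=True)
    print(is_aml_mrc(example))
```

Keeping each criterion as a separate named flag makes the exclusions explicit, which is where the prose version is hardest to follow.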

 

Therapy-related myeloid neoplasms 

Therapy-related  myeloid  neoplasms  (t-MN)  encompass  a  number  of  therapy-related 

malignant diseases (therapy-related AML (t-AML), MDS (t-MDS), and MDS/MPN (t-MDS/MPN)) 

occurring as a consequence of prior cytotoxic therapy (radiation, chemotherapy, or 

both), regardless of the disease (malignant or not) for which the therapy was given 112. 

 

There are two major clinical subtypes of t-MN 113. The first subtype accounts for approximately 

70% of all t-MN and is associated with chromosomal abnormalities that result in the 

deletion of part of chromosome 5 (del(5q)) and/or part or all of chromosome 7 (del(7q)/-7) 

114. Furthermore, this subtype is associated with patients who received prior 

alkylating/radiation therapy, have poor prognosis, and are initially diagnosed with MDS that 

progresses to AML 114. Finally, the median survival of patients with this subtype is 8 

months. The second subtype of t-MN is associated with chromosomal abnormalities that 

result in the translocation of the KMT2A gene (located at 11q23) or the RUNX1 gene 

(located at 21q22) 113,114. Moreover, this subtype is associated with patients who received 

prior topoisomerase II inhibitor therapy, are diagnosed with leukemia in the absence of MDS, 

and have favorable clinical outcomes with standard treatment 114. 

 

 

 


 

AML not otherwise specified 

AML not otherwise specified (AML-NOS) encompasses a number of heterogeneous 

AML subtypes that are not classified in any of the other three AML categories (AML-RGA, 

AML-MRC, and t-MN) 21. According to the WHO, AML-NOS subtypes include ‘AML 

with minimal differentiation’, ‘AML without maturation’, ‘AML with maturation’, ‘acute 

myelomonocytic  leukemia’,  ‘acute  monoblastic/monocytic  leukemia’,  ‘pure  erythroid 

leukemia’,  ‘acute  megakaryoblastic  leukemia’,  ‘acute  basophilic  leukemia’,  and  ‘acute 

panmyelosis with myelofibrosis’ 21,115. Since AML-NOS patients have intermediate 

prognosis and lack consistent diagnostic criteria such as clinical features or cytogenetic 

abnormalities,  AML-NOS  patients  are  often  classified  based  on  cell  morphology, 

pancytopenia and/or bone marrow dysfunction 115. 

 

AML’s complex genetic abnormalities suggest that AML evolves over time 

AML is typically diagnosed through microscopic, cytogenetic, and molecular genetic 

analyses of patients’ blood and bone marrow samples. Microscopic examination is used 

to detect distinctive features (e.g. Auer rods) in cell morphology, cytogenetic analysis to 

identify  chromosomal  structural  aberrations  (e.g.,  t(8;21),  inv(16)  or  t(16;16),  t(9;11)), 

and molecular genetic analysis to identify mutations in genes frequently mutated in AML 

(e.g., NPM1, RUNX1, FLT3) 116-118. 

 

Cytogenetic and molecular genetic analyses are used to identify prognostic markers that 

can be used to classify AML patients into three risk categories: favorable, intermediate, 

and unfavorable. The largest group of AML patients (almost 50%), however, present with a 


 

normal karyotype and lack genetic abnormalities  117-121. These patients are classified as 

intermediate risk, and often have heterogeneous clinical outcomes with standard therapy 

and a risk of AML relapse 122,123. Further complicating matters, AML has multiple driver 

mutations and competing clones that evolve over time, making it a very dynamic disease 

124-126. 

 

In 2013, The Cancer Genome Atlas (TCGA) Research Network 124 published a study entitled 

“Genomic and Epigenomic Landscapes of Adult De Novo Acute Myeloid Leukemia”, 

which generated a catalogue of 23 genes that are significantly mutated in AML (Fig. 1). 

Building upon previous findings, these discoveries are expanding our knowledge of AML 

as well as revealing how biologically complex AML is, but the extent of their utility in 

disease prognosis, clinical practice, and patients’ clinical outcomes is as yet unclear 127. 

 

AML standard treatments 

Briefly, AML standard treatments consist of two phases: remission induction therapy to 

eradicate as many leukemia cells as possible and to produce a CR in the bone marrow, 

followed by an intensive consolidation phase (post-remission) to prevent AML relapse 

22,116. Generally, treatments employ a 7 + 3 regimen that consists of two chemotherapy drugs: a 

7-day continuous infusion of standard-dose cytarabine and 3 days of an anthracycline 

(daunorubicin or idarubicin), followed by the consolidation phase if CR is achieved 

(otherwise a second induction course can be considered) 116,128. 

 


 

Cytarabine, typically used in AML treatment, inhibits polymerase activity and damages 

DNA during the cell cycle, while daunorubicin, another chemotherapy medication, 

intercalates between base pairs of DNA/RNA strands, which prevents DNA replication 

and inhibits the enzyme topoisomerase II from relaxing supercoiled DNA. With current 

standard treatments, 40% of younger patients achieve 5-year OS. The majority of AML 

patients are elderly (> 60 years) and have no standard treatment option, which is 

reflected in a much lower 5-year OS: 10% to 20% 22,116. This is because older AML 

patients more often have adverse cytogenetic abnormalities, decreased sensitivity to 

chemotherapy, and poor clinical outcomes owing to their inability to tolerate the ideal dose of 

chemotherapy 22,129-134. While other treatments are available, including intensive chemotherapy and 

immunotherapies for younger patients and hematopoietic stem cell transplantation 

(HSCT) 133-135, AML remains a major therapeutic challenge, since relapse after CR is still 

the main obstacle and is difficult to manage due to patients’ nonrandom, heterogeneous 

response to standard treatment 128,136. 

 

The nature of AML changes with patients’ age 

AML can occur in people of all ages. However, AML is primarily diagnosed among 

patients older than 60 years of age, with a median age of 68 years at diagnosis in the US 

18. Recent advances in AML biology have expanded our understanding of its complex 

genetic landscape and led to significant improvement in AML prognoses for younger 

patients. However, therapeutic strategies for AML patients have been nearly the same for 

more  than  30  years  116,137,  with  almost  no  treatment  options  for  patients  older  than  60 

years of age 129,130. Approximately 70% of patients older than 65 years of age die within 


 

one year of diagnosis with current treatment 138. Additionally, AML prognosis worsens 

as age increases due to an increase in adverse cytogenetic abnormalities. Furthermore, 

response to treatment also worsens with age: older patients respond less well to 

treatment and have poorer clinical outcomes. 

 

It is therefore clear that the nature of AML changes with age, but despite this, 

little is known about the extent of these associations and how they vary with AML 

patients’ age 22,129,139. Thus, in the present study we seek to identify AML age-dependent 

and sex-related gene expression signatures by exploring the age-related gene expression 

patterns in AML. 

 

 

 

 

 

 

 

 

 

 

 

 

 


 

Conclusion 

In this dissertation, we aimed to establish sex-linked and age-dependent biomarkers from 

genes with similar alterations in expression level, and their associated signaling pathways, in 

AML. The approach utilized machine learning, which led to the development of a 

graphical user interface to facilitate model training and classification (Chapter 2). 

Subsequently, meta-analyses of publicly available data were used to study the effects of 

age and sex in AML (Chapter 3). In particular, three analyses were performed to help us 

reach our aims. Analysis 1: “Gene expression meta-analysis and associated signaling 

pathways of AML disease state compared to healthy individuals”, to identify 

differentially expressed (DE) genes in the AML disease state, followed by gene enrichment 

analysis on the identified DE genes to find signaling pathways associated with AML. 

Analysis 2a: “Sex-relevance differential gene expression meta-analysis and 

associated signaling pathways in AML”, to explore the relevance of patients’ sex to 

gene expression and to identify sex-linked genes and associated signaling pathways in 

AML. Analysis 2b: “Age-dependent gene expression meta-analysis and associated 

signaling pathways in AML”, to identify a common set of age-dependent genes and 

associated signaling pathways and to explore age-dependent trends in AML gene 

expression. Finally, combining our results with a machine learning model (KNN 

model), we were able to classify AML patients compared to healthy individuals with > 

90% accuracy. Overall, our findings provide a new reanalysis of public datasets 

that enabled the identification of potential new gene sets relevant to AML, which can 

potentially be used in future experiments and possible stratified disease diagnostics. 
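
As a rough illustration of the k-nearest-neighbors (KNN) classification step mentioned above, the following minimal sketch classifies synthetic "expression profiles" by majority vote among the nearest training samples. It is not the model developed in this dissertation: the data are randomly generated, and the function names, the choice of Euclidean distance, the number of genes, and k = 5 are all assumptions made only for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Return the majority label among the k training samples nearest to query."""
    # Euclidean distance from the query to every training sample, nearest first.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def make_synthetic(n_per_class, n_genes, shift, seed=0):
    """Hypothetical data: 'healthy' genes ~ N(0, 1), 'AML' genes ~ N(shift, 1)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label, mean in (("healthy", 0.0), ("AML", shift)):
        for _ in range(n_per_class):
            xs.append([rng.gauss(mean, 1.0) for _ in range(n_genes)])
            ys.append(label)
    return xs, ys


def accuracy(train, test, k=5):
    """Fraction of held-out samples whose predicted label matches the true one."""
    (train_x, train_y), (test_x, test_y) = train, test
    hits = sum(
        knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y)
    )
    return hits / len(test_y)


if __name__ == "__main__":
    train = make_synthetic(n_per_class=50, n_genes=20, shift=2.0, seed=1)
    test = make_synthetic(n_per_class=20, n_genes=20, shift=2.0, seed=2)
    print(f"held-out accuracy: {accuracy(train, test):.2f}")
```

An odd k avoids tied votes in a two-class problem; with real expression data, features would normally be normalized and batch corrected before distances are computed.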

 


 

APPENDIX


 

Figure 1. Genes frequently mutated in AML according to TCGA 

The Cancer Genome Atlas Research Network (TCGA) analyzed the genomes of 200 de novo 

AML patients. Analysis revealed a total of 23 genes (shown) that are frequently and 

significantly mutated in AML; a further 237 genes (not shown) were mutated in 2 or more 

samples in de novo AML (data from reference 124). 

 
 
 
 
 
 
 
 
 
 
 


 

Table 1. Classification of AML according to the WHO acute myeloid leukemia and 

related neoplasms classification system. 

1: AML with recurrent genetic abnormalities 
    AML with t(8;21)(q22;q22.1);RUNX1-RUNX1T1 
    AML with inv(16)(p13.1q22) or t(16;16)(p13.1;q22);CBFB-MYH11 
    APL with PML-RARA 
    AML with t(9;11)(p21.3;q23.3);MLLT3-KMT2A 
    AML with t(6;9)(p23;q34.1);DEK-NUP214 
    AML with inv(3)(q21.3q26.2) or t(3;3)(q21.3;q26.2); GATA2, MECOM 
    AML (megakaryoblastic) with t(1;22)(p13.3;q13.3);RBM15-MKL1 
    Provisional entity: AML with BCR-ABL1 
    AML with mutated NPM1 
    AML with biallelic mutations of CEBPA 
    Provisional entity: AML with mutated RUNX1 
2: AML with myelodysplasia-related changes 
3: Therapy-related myeloid neoplasms 
4: AML, NOS 
    AML with minimal differentiation 
    AML without maturation 
    AML with maturation 
    Acute myelomonocytic leukemia 
    Acute monoblastic/monocytic leukemia 
    Pure erythroid leukemia 
    Acute megakaryoblastic leukemia 


 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Table  2.  Cytogenetic  abnormalities  sufficient  for  the  diagnosis  of  AML  with 

myelodysplasia-related changes (AML-MRC). 

Complex karyotype (3 or more abnormalities) 
    None included in AML with recurrent genetic 
    abnormalities 
Unbalanced abnormalities 
    -7/del(7q) 
    del(5q)/t(5q) 
    i(17q)/t(17p) 
    -13/del(13q) 
    del(11q) 
    del(12p)/t(12p) 
    idic(X)(q13) 
Balanced abnormalities 
    t(11;16)(q23;p13.3) 
    t(3;21)(q26.2;q21.2) 
    t(1;3)(p36.3;q21.1) 
    t(2;11)(p21;q23) 
    t(5;12)(q32;p13.2) 
    t(5;7)(q32;q11.2) 
    t(5;17)(q32;p13.2) 
    t(5;10)(q32;q21.2) 
    t(3;5)(q25.3;q35.1) 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 


 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

BIBLIOGRAPHY 

 

1  Sudhakar, A. History of Cancer, Ancient and Modern Treatment Methods. J 
Cancer Sci Ther 1, 1-4, doi:10.4172/1948-5956.100000e2 (2009). 

2  Piller, G. Leukaemia - a brief historical review from ancient times to 1950. Br J 
Haematol 112, 282-292 (2001). 

3  Gest, H. The remarkable vision of Robert Hooke (1635-1703): first observer of 
the microbial world. Perspect Biol Med 48, 266-272, doi:10.1353/pbm.2005.0053 
(2005). 

4  Lane, N. The unseen world: reflections on Leeuwenhoek (1677) 'Concerning little 
animals'. Philos Trans R Soc Lond B Biol Sci 370, doi:10.1098/rstb.2014.0344 
(2015). 

5  Bennett, J. H. Case of hypertrophy of the spleen and liver, in which death took 
place from suppuration of the blood. Edinburgh Med Surg J 64, 413-423 (1845). 

6  Fuller, H. Particulars of a case in which enormous enlargement of the spleen and 
liver, together with dilation of all the blood vessels of the body were found 
co-incident with a peculiarly altered condition of the blood. Lancet 2, 43-44 (1846). 

7  Velpeau, A. Sur la resorption du pusaet sur l’alteration du sang dans les maladies 
clinique de persection nenemant. Premier observation. Rev Med 2, 216 (1827). 

8  Friedreich, N. Ein neuer Fall von Leukämie. Archiv für pathologische Anatomie 
und Physiologie und für klinische Medicin 12, 37-58, doi:10.1007/bf01938747 
(1857). 

9  Ehrlich, P. Beiträge zur Kenntniss der Anilinfärbungen und ihrer Verwendung in 
der mikroskopischen Technik. Archiv für mikroskopische Anatomie 13, 263-277, 
doi:10.1007/bf02933937 (1877). 

10  Gowers, W. R. Splenic leucocythaemia. A system of medicine 5, 216-305 (1879). 

11  Cabot, R. C. Acute Leukemia. The Boston Medical and Surgical Journal 131, 
507-511, doi:10.1056/nejm189411221312103 (1894). 

12  Buchanan, R. J. M. Flagellation of lymphocytes. Brit Med J 1909, 306-306 
(1909). 

13  Farmer, P. et al. Expansion of cancer care and control in countries of low and 
middle income: a call to action. Lancet 376, 1186-1193, 
doi:10.1016/S0140-6736(10)61152-X (2010). 

14  Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2018. CA Cancer J Clin 
68, 7-30, doi:10.3322/caac.21442 (2018). 

15  Yamamoto, J. F. & Goodman, M. T. Patterns of leukemia incidence in the United 
States by subtype and demographic characteristics, 1997-2002. Cancer Causes 
Control 19, 379-390, doi:10.1007/s10552-007-9097-2 (2008). 

16  De Kouchkovsky, I. & Abdul-Hay, M. 'Acute myeloid leukemia: a 
comprehensive review and 2016 update'. Blood Cancer J 6, e441, 
doi:10.1038/bcj.2016.50 (2016). 

17  Deschler, B. & Lubbert, M. Acute myeloid leukemia: epidemiology and etiology. 
Cancer 107, 2099-2107, doi:10.1002/cncr.22233 (2006). 

18  National Cancer Institute. SEER Cancer Stat Facts: Acute Myeloid Leukemia (Percent of 
New Cases by Age Group). [https://seer.cancer.gov/statfacts/html/amyl.html] 
(2011-2015; accessed 11.30.18). 

19  Estey, E. & Dohner, H. Acute myeloid leukaemia. Lancet 368, 1894-1907, 
doi:10.1016/S0140-6736(06)69780-8 (2006). 

20  Kumar, C. C. Genetic abnormalities and challenges in the treatment of acute 
myeloid leukemia. Genes Cancer 2, 95-107, doi:10.1177/1947601911408076 
(2011). 

21  Arber, D. A. et al. The 2016 revision to the World Health Organization 
classification of myeloid neoplasms and acute leukemia. Blood 127, 2391-2405, 
doi:10.1182/blood-2016-03-643544 (2016). 

22  Dohner, H., Weisdorf, D. J. & Bloomfield, C. D. Acute Myeloid Leukemia. N 
Engl J Med 373, 1136-1152, doi:10.1056/NEJMra1406184 (2015). 

23  Peterson, L. F. & Zhang, D. E. The 8;21 translocation in leukemogenesis. 
Oncogene 23, 4255-4262, doi:10.1038/sj.onc.1207727 (2004). 

24  Rowley, J. D. Identification of a translocation with quinacrine fluorescence in a 
patient with acute leukemia. Ann Genet 16, 109-112 (1973). 

25  Miyoshi, H. et al. The t(8;21) translocation in acute myeloid leukemia results in 
production of an AML1-MTG8 fusion transcript. EMBO J 12, 2715-2721 (1993). 

26  Speck, N. A. & Gilliland, D. G. Core-binding factors in haematopoiesis and 
leukaemia. Nature Reviews Cancer 2, 502-513, doi:10.1038/nrc840 (2002). 

 

27  Shurtleff, S. A. et al. Heterogeneity in CBF beta/MYH11 fusion messages 
encoded by the inv(16)(p13q22) and the t(16;16)(p13;q22) in acute myelogenous 
leukemia. Blood 85, 3695-3703 (1995). 

28  Le Beau, M. M. et al. Association of an inversion of chromosome 16 with 
abnormal marrow eosinophils in acute myelomonocytic leukemia. A unique 
cytogenetic-clinicopathological association. N Engl J Med 309, 630-636, 
doi:10.1056/NEJM198309153091103 (1983). 

29  Liu, P. et al. Identification of yeast artificial chromosomes containing the 
inversion 16 p-arm breakpoint associated with acute myelomonocytic leukemia. 
Blood 82, 716-721 (1993). 

30  Liu, P. et al. Fusion between transcription factor CBF beta/PEBP2 beta and a 
myosin heavy chain in acute myeloid leukemia. Science 261, 1041-1044 (1993). 

31  Sinha, C., Cunningham, L. C. & Liu, P. P. Core Binding Factor Acute Myeloid 
Leukemia: New Prognostic Categories and Therapeutic Opportunities. Semin 
Hematol 52, 215-222, doi:10.1053/j.seminhematol.2015.04.002 (2015). 

32  Sood, R., Kamikubo, Y. & Liu, P. Role of RUNX1 in hematological 
malignancies. Blood 129, 2070-2082, doi:10.1182/blood-2016-10-687830 (2017). 

33  Estey, E. H. Acute myeloid leukemia: 2012 update on diagnosis, risk 
stratification, and management. Am J Hematol 87, 90-99, doi:10.1002/ajh.22246 
(2012). 

34  Adams, J. & Nassiri, M. Acute Promyelocytic Leukemia: A Review and 
Discussion of Variant Translocations. Arch Pathol Lab Med 139, 1308-1313, 
doi:10.5858/arpa.2013-0345-RS (2015). 

35  Stone, R. M. & Mayer, R. J. The unique aspects of acute promyelocytic leukemia. 
J Clin Oncol 8, 1913-1921, doi:10.1200/JCO.1990.8.11.1913 (1990). 

36  Hillestad, L. K. Acute promyelocytic leukemia. Acta Med Scand 159, 189-194 
(1957). 

37  Rowley, J. D., Golomb, H. M. & Dougherty, C. 15/17 translocation, a consistent 
chromosomal change in acute promyelocytic leukaemia. Lancet 1, 549-550 
(1977). 

38  Grignani, F. et al. The acute promyelocytic leukemia-specific PML-RAR alpha 
fusion protein inhibits differentiation and promotes survival of myeloid precursor 
cells. Cell 74, 423-431 (1993). 

 

39  Lee, G. Y. et al. Acute promyelocytic leukemia with PML-RARA fusion on 
i(17q) and therapy-related acute myeloid leukemia. Cancer Genet Cytogenet 159, 
129-136, doi:10.1016/j.cancergencyto.2004.09.019 (2005). 

40  Giguere, V., Ong, E. S., Segui, P. & Evans, R. M. Identification of a receptor for 
the morphogen retinoic acid. Nature 330, 624-629, doi:10.1038/330624a0 (1987). 

41  Petkovich, M., Brand, N. J., Krust, A. & Chambon, P. A human retinoic acid 
receptor which belongs to the family of nuclear receptors. Nature 330, 444-450, 
doi:10.1038/330444a0 (1987). 

42  de The, H. et al. The PML-RAR alpha fusion mRNA generated by the t(15;17) 
translocation in acute promyelocytic leukemia encodes a functionally altered 
RAR. Cell 66, 675-684 (1991). 

43  Kakizuka, A. et al. Chromosomal translocation t(15;17) in human acute 
promyelocytic leukemia fuses RAR alpha with a novel putative transcription 
factor, PML. Cell 66, 663-674 (1991). 

44  Swansbury, G. J., Slater, R., Bain, B. J., Moorman, A. V. & Secker-Walker, L. M. 
Hematological malignancies with t(9;11)(p21-22;q23)--a laboratory and clinical 
study of 125 cases. European 11q23 Workshop participants. Leukemia 12, 
792-800 (1998). 

45  Schoch, C. et al. AML with 11q23/MLL abnormalities as defined by the WHO 
classification: incidence, partner chromosomes, FAB subtype, age distribution, 
and prognostic impact in an unselected series of 1897 cytogenetically analyzed 
AML cases. Blood 102, 2395-2402, doi:10.1182/blood-2003-02-0434 (2003). 

46  Bower, M. et al. Prevalence and clinical correlations of MLL gene 
rearrangements in AML-M4/5. Blood 84, 3776-3780 (1994). 

47  Hilden, J. M. & Kersey, J. H. The MLL (11q23) and AF-4 (4q21) genes disrupted 
in t(4;11) acute leukemia: molecular and clinical studies. Leuk Lymphoma 14, 
189-195, doi:10.3109/10428199409049668 (1994). 

48  Marschalek, R. Systematic Classification of Mixed-Lineage Leukemia Fusion 
Partners Predicts Additional Cancer Pathways. Ann Lab Med 36, 85-100, 
doi:10.3343/alm.2016.36.2.85 (2016). 

49  Hess, J. L. MLL: a histone methyltransferase disrupted in leukemia. Trends Mol 
Med 10, 500-507, doi:10.1016/j.molmed.2004.08.005 (2004). 

50  Schwartz, S., Jiji, R., Kerman, S., Meekins, J. & Cohen, M. M. Translocation 
(6;9)(p23;q34) in acute nonlymphocytic leukemia. Cancer Genet Cytogenet 10, 
133-138 (1983). 

 

 
51 

 
52 

 
53 

 
54 

 
55 

 
56 

 
57 

 
58 

 
59 

 
60 

 
61 

Rowley, J. D. & Potter, D. Chromosomal banding patterns in acute 
nonlymphocytic leukemia. Blood 47, 705-721 (1976). 

Chi, Y., Lindgren, V., Quigley, S. & Gaitonde, S. Acute myelogenous leukemia 
with t(6;9)(p23;q34) and marrow basophilia: an overview. Arch Pathol Lab Med 
132, 1835-1837, doi:10.1043/1543-2165-132.11.1835 (2008). 

Suzukawa, K. et al. Identification of a breakpoint cluster region 3' of the 
ribophorin I gene at 3q21 associated with the transcriptional activation of the 
EVI1 gene in acute myelogenous leukemias with inv(3)(q21q26). Blood 84, 2681-
2688 (1994). 

Bitter, M. A., Neilly, M. E., Le Beau, M. M., Pearson, M. G. & Rowley, J. D. 
Rearrangements of chromosome 3 involving bands 3q21 and 3q26 are associated 
with normal or elevated platelet counts in acute nonlymphocytic leukemia. Blood 
66, 1362-1370 (1985). 

Grigg, A. P., Gascoyne, R. D., Phillips, G. L. & Horsman, D. E. Clinical, 
haematological and cytogenetic features in 24 patients with structural 
rearrangements of the Q arm of chromosome 3. Br J Haematol 83, 158-165 
(1993). 

Lugthart, S. et al. Clinical, molecular, and prognostic significance of WHO type 
inv(3)(q21q26.2)/t(3;3)(q21;q26.2) and various other 3q abnormalities in acute 
myeloid leukemia. J Clin Oncol 28, 3890-3898, doi:10.1200/JCO.2010.29.2771 
(2010). 

Lugus, J. J. et al. GATA2 functions at multiple steps in hemangioblast 
development and differentiation. Development 134, 393-405, 
doi:10.1242/dev.02731 (2007). 

Buonamici, S., Chakraborty, S., Senyuk, V. & Nucifora, G. The role of EVI1 in 
normal and leukemic cells. Blood Cells Mol Dis 31, 206-212 (2003). 

Groschel, S. et al. A single oncogenic enhancer rearrangement causes 
concomitant EVI1 and GATA2 deregulation in leukemia. Cell 157, 369-381, 
doi:10.1016/j.cell.2014.02.019 (2014). 

Yamazaki, H. et al. A remote GATA2 hematopoietic enhancer drives 
leukemogenesis in inv(3)(q21;q26) by activating EVI1 expression. Cancer Cell 
25, 415-427, doi:10.1016/j.ccr.2014.02.008 (2014). 

Orazi, A. Histopathology in the diagnosis and classification of acute myeloid 
leukemia, myelodysplastic syndromes, and myelodysplastic/myeloproliferative 
diseases. Pathobiology 74, 97-114, doi:10.1159/000101709 (2007). 

29 

 

 
62 

 
63 

 
64 

 
65 

 
66 

 
68 

 
71 

 
72 

 
73 

Carroll, A. et al. The t(1;22) (p13;q13) is nonrandom and restricted to infants with 
acute megakaryoblastic leukemia: a Pediatric Oncology Group Study. Blood 78, 
748-752 (1991). 

 
67  Ma, Z. et al. Fusion of two novel genes, RBM15 and MKL1, in the 

t(1;22)(p13;q13) of acute megakaryoblastic leukemia. Nat Genet 28, 220-221, 
doi:10.1038/90054 (2001). 

Novoyatleva, T. et al. Protein phosphatase 1 binds to the RNA recognition motif 
of several splicing factors and regulates alternative pre-mRNA processing. Hum 
Mol Genet 17, 52-70, doi:10.1093/hmg/ddm284 (2008). 

 
69  Mercher, T. et al. Recurrence of OTT-MAL fusion in t(1;22) of infant AML-M7. 

Genes Chromosomes Cancer 33, 22-28 (2002). 

Bennett, J. M. et al. Proposals for the classification of the acute leukaemias. 
French-American-British (FAB) co-operative group. Br J Haematol 33, 451-458 
(1976). 

Lu, G., Altman, A. J. & Benn, P. A. Review of the cytogenetic changes in acute 
megakaryoblastic leukemia: one disease or several? Cancer Genet Cytogenet 67, 
81-89 (1993). 

Lion, T. et al. The translocation t(1;22)(p13;q13) is a nonrandom marker 
specifically associated with acute megakaryocytic leukemia in young children. 
Blood 79, 3325-3330 (1992). 

Baruchel, A., Daniel, M. T., Schaison, G. & Berger, R. Nonrandom t(1;22)(p12-
p13;q13) in acute megakaryocytic malignant proliferation. Cancer Genet 
Cytogenet 54, 239-243 (1991). 

 
70  Wiellette, E. L. et al. spen encodes an RNP motif protein that interacts with Hox 

pathways to repress the development of head-like sclerites in the Drosophila 
trunk. Development 126, 5373-5385 (1999). 

Konoplev, S. et al. Molecular characterization of de novo Philadelphia 
chromosome-positive acute myeloid leukemia. Leuk Lymphoma 54, 138-144, 
doi:10.3109/10428194.2012.701739 (2013). 

Keung, Y. K. et al. Philadelphia chromosome positive myelodysplastic syndrome 
and acute myeloid leukemia-retrospective study and review of literature. Leuk Res 
28, 579-586, doi:10.1016/j.leukres.2003.10.027 (2004). 

Soupir, C. P. et al. Philadelphia chromosome-positive acute myeloid leukemia: a 
rare aggressive leukemia with clinicopathologic features distinct from chronic 

30 

 

Diekmann, D. et al. Bcr encodes a GTPase-activating protein for p21rac. Nature 
351, 400-402, doi:10.1038/351400a0 (1991). 

 
77  Maru, Y. & Witte, O. N. The BCR gene encodes a novel serine/threonine kinase 

activity within a single exon. Cell 67, 459-468 (1991). 

 
74 

 
75 

 
76 

 
78 

 
79 

 
80 

 
81 

 
82 

 
83 

 
84 

 
85 

 

myeloid leukemia in myeloid blast crisis. Am J Clin Pathol 127, 642-650, 
doi:10.1309/B4NVER1AJJ84CTUU (2007). 

Berger, R. Differences between blastic chronic myeloid leukemia and Ph-positive 
acute leukemia. Leuk Lymphoma 11 Suppl 1, 235-237, 
doi:10.3109/10428199309047892 (1993). 

Colicelli, J. ABL tyrosine kinases: evolution of function, regulation, and 
specificity. Sci Signal 3, re6, doi:10.1126/scisignal.3139re6 (2010). 

Laurent, E., Talpaz, M., Kantarjian, H. & Kurzrock, R. The BCR gene and 
philadelphia chromosome-positive leukemogenesis. Cancer Res 61, 2343-2355 
(2001). 

Neuendorff, N. R., Burmeister, T., Dorken, B. & Westermann, J. BCR-ABL-
positive acute myeloid leukemia: a new entity? Analysis of clinical and molecular 
features. Ann Hematol 95, 1211-1221, doi:10.1007/s00277-016-2721-z (2016). 

Falini, B. et al. Cytoplasmic nucleophosmin in acute myelogenous leukemia with 
a normal karyotype. N Engl J Med 352, 254-266, doi:10.1056/NEJMoa041974 
(2005). 

Heath, E. M. et al. Biological and clinical consequences of NPM1 mutations in 
AML. Leukemia 31, 798-807, doi:10.1038/leu.2017.30 (2017). 

Federici, L. & Falini, B. Nucleophosmin mutations in acute myeloid leukemia: a 
tale of protein unfolding and mislocalization. Protein Sci 22, 545-556, 
doi:10.1002/pro.2240 (2013). 

Borer, R. A., Lehner, C. F., Eppenberger, H. M. & Nigg, E. A. Major nucleolar 
proteins shuttle between nucleus and cytoplasm. Cell 56, 379-390 (1989). 

Colombo, E., Alcalay, M. & Pelicci, P. G. Nucleophosmin and its complex 
network: a possible therapeutic target in hematological diseases. Oncogene 30, 
2595-2609, doi:10.1038/onc.2010.646 (2011). 

Lindstrom, M. S. NPM1/B23: A Multifunctional Chaperone in Ribosome 
Biogenesis and Chromatin Remodeling. Biochem Res Int 2011, 195209, 
doi:10.1155/2011/195209 (2011). 

31 

 

Okuwaki, M., Iwamatsu, A., Tsujimoto, M. & Nagata, K. Identification of 
nucleophosmin/B23, an acidic nucleolar protein, as a stimulatory factor for in 
vitro replication of adenovirus DNA complexed with viral basic core proteins. J 
Mol Biol 311, 41-55, doi:10.1006/jmbi.2001.4812 (2001). 

Dumbar, T. S., Gentry, G. A. & Olson, M. O. Interaction of nucleolar 
phosphoprotein B23 with nucleic acids. Biochemistry 28, 9495-9501 (1989). 

Bertwistle, D., Sugimoto, M. & Sherr, C. J. Physical and functional interactions of 
the Arf tumor suppressor protein with nucleophosmin/B23. Mol Cell Biol 24, 985-
996 (2004). 

Colombo, E., Marine, J. C., Danovi, D., Falini, B. & Pelicci, P. G. 
Nucleophosmin regulates the stability and transcriptional activity of p53. Nat Cell 
Biol 4, 529-533, doi:10.1038/ncb814 (2002). 

Dohner, K. et al. Mutant nucleophosmin (NPM1) predicts favorable prognosis in 
younger adults with acute myeloid leukemia and normal cytogenetics: interaction 
with other gene mutations. Blood 106, 3740-3746, doi:10.1182/blood-2005-05-
2164 (2005). 

Pabst, T. et al. Dominant-negative mutations of CEBPA, encoding 
CCAAT/enhancer binding protein-alpha (C/EBPalpha), in acute myeloid 
leukemia. Nat Genet 27, 263-270, doi:10.1038/85820 (2001). 

Green, C. L. et al. Prognostic significance of CEBPA mutations in a large cohort 
of younger adult patients with acute myeloid leukemia: impact of double CEBPA 
mutations and the interaction with FLT3 and NPM1 mutations. J Clin Oncol 28, 
2739-2747, doi:10.1200/JCO.2009.26.2501 (2010). 

86 

 
87 

 
88 

 
89 

 
90 

 
91 

 
92 

 
94 

 
95 

 
96 

 
97 

 
93  Mrozek, K., Heinonen, K. & Bloomfield, C. D. Clinical importance of 

cytogenetics in acute myeloid leukaemia. Best Pract Res Clin Haematol 14, 19-
47, doi:10.1053/beha.2000.0114 (2001). 

Calkhoven, C. F., Muller, C. & Leutz, A. Translational control of C/EBPalpha 
and C/EBPbeta isoform expression. Genes Dev 14, 1920-1932 (2000). 

Nerlov, C. C/EBPalpha mutations in acute myeloid leukaemias. Nat Rev Cancer 
4, 394-400, doi:10.1038/nrc1363 (2004). 

Tenen, D. G., Hromas, R., Licht, J. D. & Zhang, D. E. Transcription factors, 
normal myeloid development, and leukemia. Blood 90, 489-519 (1997). 

Dufour, A. et al. Acute myeloid leukemia with biallelic CEBPA gene mutations 
and normal karyotype represents a distinct genetic entity associated with a 

32 

 

favorable clinical outcome. J Clin Oncol 28, 570-577, 
doi:10.1200/JCO.2008.21.6010 (2010). 

 
98 

Pabst, T. & Mueller, B. U. Transcriptional dysregulation during myeloid 
transformation in AML. Oncogene 26, 6829-6837, doi:10.1038/sj.onc.1210765 
(2007). 

 
99  Wouters, B. J. et al. Double CEBPA mutations, but not single CEBPA mutations, 

define a subgroup of acute myeloid leukemia with a distinctive gene expression 
profile that is uniquely associated with a favorable outcome. Blood 113, 3088-
3091, doi:10.1182/blood-2008-09-179895 (2009). 

Pabst, T., Eyholzer, M., Fos, J. & Mueller, B. U. Heterogeneity within AML with 
CEBPA mutations; only CEBPA double mutations, but not single CEBPA 
mutations are associated with favourable prognosis. Br J Cancer 100, 1343-1346, 
doi:10.1038/sj.bjc.6604977 (2009). 

Ichikawa, M. et al. AML-1 is required for megakaryocytic maturation and 
lymphocytic differentiation, but not for maintenance of hematopoietic stem cells 
in adult hematopoiesis. Nat Med 10, 299-304, doi:10.1038/nm997 (2004). 

 
100 

 
101 

 
102 

 
104 

 
107 

 

Sakurai, M. et al. Impaired hematopoietic differentiation of RUNX1-mutated 
induced pluripotent stem cells derived from FPD/AML patients. Leukemia 28, 
2344-2354, doi:10.1038/leu.2014.136 (2014). 

 
103  Gaidzik, V. I. et al. RUNX1 mutations in acute myeloid leukemia: results from a 

comprehensive genetic and clinical analysis from the AML study group. J Clin 
Oncol 29, 1364-1372, doi:10.1200/JCO.2010.30.7926 (2011). 

Lindsley, R. C. et al. Acute myeloid leukemia ontogeny is defined by distinct 
somatic mutations. Blood 125, 1367-1376, doi:10.1182/blood-2014-11-610543 
(2015). 

 
105  Gaidzik, V. I. et al. RUNX1 mutations in acute myeloid leukemia are associated 
with distinct clinico-pathologic and genetic features. Leukemia 30, 2160-2168, 
doi:10.1038/leu.2016.126 (2016). 

 
106  Yanada, M. et al. Long-term outcomes for unselected patients with acute myeloid 
leukemia categorized according to the World Health Organization classification: a 
single-center experience. Eur J Haematol 74, 418-423, doi:10.1111/j.1600-
0609.2004.00397.x (2005). 

Falini, B. et al. Multilineage dysplasia has no impact on biologic, 
clinicopathologic, and prognostic features of AML with mutated nucleophosmin 
(NPM1). Blood 115, 3776-3786, doi:10.1182/blood-2009-08-240457 (2010). 

33 

 

108  Diaz-Beya, M. et al. The prognostic value of multilineage dysplasia in de novo 

acute myeloid leukemia patients with intermediate-risk cytogenetics is dependent 
on NPM1 mutational status. Blood 116, 6147-6148, doi:10.1182/blood-2010-09-
307314 (2010). 

 
109  Bacher, U. et al. Multilineage dysplasia does not influence prognosis in CEBPA-

mutated AML, supporting the WHO proposal to classify these patients as a 
unique entity. Blood 119, 4719-4722, doi:10.1182/blood-2011-12-395574 (2012). 

 
110  Rozman, M. et al. Multilineage dysplasia is associated with a poorer prognosis in 
patients with de novo acute myeloid leukemia with intermediate-risk cytogenetics 
and wild-type NPM1. Ann Hematol 93, 1695-1703, doi:10.1007/s00277-014-
2100-6 (2014). 

 
111  Haferlach, C. et al. AML with mutated NPM1 carrying a normal or aberrant 
karyotype show overlapping biologic, pathologic, immunophenotypic, and 
prognostic features. Blood 114, 3024-3032, doi:10.1182/blood-2009-01-197871 
(2009). 

 
112 

Singh, Z. N. et al. Therapy-related myelodysplastic syndrome: morphologic 
subclassification may not be clinically relevant. Am J Clin Pathol 127, 197-205, 
doi:10.1309/NQ3PMV4U8YV39JWJ (2007). 

 
113  Arber, D. A. et al. Acute myeloid leukaemia and related precursor neoplasms. 
WHO classification of tumours of haematopoietic and lymphoid tissues 1, 110-
139 (2008). 

 
114  McNerney, M. E., Godley, L. A. & Le Beau, M. M. Therapy-related myeloid 

neoplasms: when genetics and environment collide. Nat Rev Cancer 17, 513-527, 
doi:10.1038/nrc.2017.60 (2017). 

 
115  Walter, R. B. et al. Significance of FAB subclassification of "acute myeloid 

leukemia, NOS" in the 2008 WHO classification: analysis of 5848 newly 
diagnosed patients. Blood 121, 2424-2431, doi:10.1182/blood-2012-10-462440 
(2013). 

 
116  Dohner, H. et al. Diagnosis and management of acute myeloid leukemia in adults: 

recommendations from an international expert panel, on behalf of the European 
LeukemiaNet. Blood 115, 453-474, doi:10.1182/blood-2009-07-235358 (2010). 

 
117  Grimwade, D. & Hills, R. K. Independent prognostic factors for AML outcome. 

Hematology Am Soc Hematol Educ Program, 385-395, doi:10.1182/asheducation-
2009.1.385 (2009). 

 

34 

 

118  Dohner, H. Implication of the molecular characterization of acute myeloid 

leukemia. Hematology Am Soc Hematol Educ Program, 412-419, 
doi:10.1182/asheducation-2007.1.412 (2007). 

 
119  Bullinger, L. et al. Identification of acquired copy number alterations and 

uniparental disomies in cytogenetically normal acute myeloid leukemia using 
high-resolution single-nucleotide polymorphism analysis. Leukemia 24, 438-449, 
doi:10.1038/leu.2009.263 (2010). 

 
120  Walter, M. J. et al. Acquired copy number alterations in adult acute myeloid 

leukemia genomes. Proc Natl Acad Sci U S A 106, 12950-12955, 
doi:10.1073/pnas.0903091106 (2009). 

Suela, J., Alvarez, S. & Cigudosa, J. C. DNA profiling by arrayCGH in acute 
myeloid leukemia and myelodysplastic syndromes. Cytogenet Genome Res 118, 
304-309, doi:10.1159/000108314 (2007). 

 
122  Martelli, M. P., Sportoletti, P., Tiacci, E., Martelli, M. F. & Falini, B. Mutational 
landscape of AML with normal cytogenetics: biological and clinical implications. 
Blood Rev 27, 13-22, doi:10.1016/j.blre.2012.11.001 (2013). 

Zaidi, S. Z. et al. The challenge of risk stratification in acute myeloid leukemia 
with normal karyotype. Hematol Oncol Stem Cell Ther 1, 141-158 (2008). 

 
124  Cancer Genome Atlas Research, N. et al. Genomic and epigenomic landscapes of 

adult de novo acute myeloid leukemia. N Engl J Med 368, 2059-2074, 
doi:10.1056/NEJMoa1301689 (2013). 

 
125  Welch, J. S. et al. The origin and evolution of mutations in acute myeloid 

leukemia. Cell 150, 264-278, doi:10.1016/j.cell.2012.06.023 (2012). 

 
126  Walter, M. J. et al. Clonal architecture of secondary acute myeloid leukemia. N 

Engl J Med 366, 1090-1098, doi:10.1056/NEJMoa1106968 (2012). 

 
121 

 
123 

 
127 

 
129 

 
130 

Papaemmanuil, E. et al. Genomic Classification and Prognosis in Acute Myeloid 
Leukemia. N Engl J Med 374, 2209-2221, doi:10.1056/NEJMoa1516192 (2016). 

 
128  Yang, X. & Wang, J. Precision therapy for acute myeloid leukemia. J Hematol 

Oncol 11, 3, doi:10.1186/s13045-017-0543-7 (2018). 

Ferrara, F. & Schiffer, C. A. Acute myeloid leukaemia in adults. Lancet 381, 484-
495, doi:10.1016/S0140-6736(12)61727-9 (2013). 

Isidori, A. et al. Alternative novel therapies for the treatment of elderly acute 
myeloid leukemia patients. Expert Rev Hematol 6, 767-784, 
doi:10.1586/17474086.2013.858018 (2013). 

35 

 

 
131  Nazha, A. & Ravandi, F. Acute myeloid leukemia in the elderly: do we know who 

should be treated and how? Leukemia Lymphoma 55, 979-987, 
doi:10.3109/10428194.2013.828348 (2014). 

 
132  Kantarjian, H. et al. Intensive chemotherapy does not benefit most older patients 

(age 70 years or older) with acute myeloid leukemia. Blood 116, 4422-4429, 
doi:10.1182/blood-2010-03-276485 (2010). 

Lowenberg, B. et al. Cytarabine dose for acute myeloid leukemia. N Engl J Med 
364, 1027-1036, doi:10.1056/NEJMoa1010222 (2011). 

Lowenberg, B. Sense and nonsense of high-dose cytarabine for acute myeloid 
leukemia. Blood 121, 26-28, doi:10.1182/blood-2012-07-444851 (2013). 

Lowenberg, B. et al. High-dose daunorubicin in older patients with acute myeloid 
leukemia. N Engl J Med 361, 1235-1248, doi:10.1056/NEJMoa0901409 (2009). 

Estey, E. Acute Myeloid Leukemia - Many Diseases, Many Treatments. N Engl J 
Med 375, 2094-2095, doi:10.1056/NEJMe1611424 (2016). 

 
137  Reese, N. D. & Schiller, G. J. High-dose cytarabine (HD araC) in the treatment of 

leukemias: a review. Curr Hematol Malig Rep 8, 141-148, doi:10.1007/s11899-
013-0156-3 (2013). 

 
138  Meyers, J., Yu, Y., Kaye, J. A. & Davis, K. L. Medicare fee-for-service enrollees 
with primary acute myeloid leukemia: an analysis of treatment patterns, survival, 
and healthcare resource utilization and costs. Appl Health Econ Health Policy 11, 
275-286, doi:10.1007/s40258-013-0032-2 (2013). 

 
139  Appelbaum, F. R. et al. Age and acute myeloid leukemia. Blood 107, 3481-3485, 

doi:10.1182/blood-2005-09-3724 (2006). 

 
133 

 
134 

 
135 

 
136 

 
 
 
 
 
 
 
 
 
 
 
 
 

36 

 

Chapter 2 – 

ClassificaIO: machine learning for classification 

graphical user interface 


Abstract 

Machine learning methods are routinely used by scientists in many research areas, but
they typically require significant statistical and programming knowledge. Here we present
ClassificaIO, an open-source Python graphical user interface for machine learning
classification with the scikit-learn Python library. ClassificaIO provides an interactive
way to train, validate, and test data on a range of classification algorithms.
ClassificaIO’s core aim is to provide a point-and-click graphical user interface that
enables fast comparisons within and across classifiers, and that facilitates uploading and
exporting of trained models and exporting of both validation and testing data results,
maximizing machine learning utility in a shorter time than writing a script for each task.
ClassificaIO can also be an educational tool that enables biomedical and other researchers
with minimal machine learning background to apply machine learning algorithms to their
research in an interactive point-and-click way. For this thesis, the primary motivation
for creating and utilizing ClassificaIO was to address missing annotations in publicly
available AML microarray data. Our aim was to train multiple machine learning classifiers
and use them to predict such missing annotations, including sex and sample source (tissue
type), from curated publicly available AML gene expression data that could be used in a
meta-analysis of a large cohort of studies on AML (Chapter 3).

 

The ClassificaIO package is available for download and installation through the Python
Package Index (PyPI) (http://pypi.python.org/pypi/ClassificaIO), and it can be deployed
using the “import” statement in Python once the package is installed. The application is
distributed under an MIT license, and the source code is publicly available for download
(for Mac OS X, Linux and Microsoft Windows) through PyPI and GitHub
(http://github.com/gmiaslab/ClassificaIO, and https://doi.org/10.5281/zenodo.1320465).

 

A version of this chapter and its results has been submitted for publication.

Introduction 

Recent advances in high-throughput technologies, especially in genomics, have led to an
explosion of large-scale structured data (e.g. RNA-sequencing and microarray data) 1.
Machine learning methods (classification, regression, clustering, etc.) are routinely used
in mining such big data to extract biological insights in a range of areas within genetics
and genomics 2. Examples include using unsupervised machine learning classification
methods to predict the sex of donors of gene expression microarray samples 3, and using
genome sequencing data to train machine learning models to identify transcription start
sites 4, splice sites 5, and transcriptional promoter and enhancer regions 6. Recent
examples of machine learning classification methods include their use to detect
neurofibromin 1 tumor suppressor gene inactivation in glioblastoma 7, and to identify
reliable gene markers for drug sensitivity in acute myeloid leukemia 8. Many advanced
machine learning algorithms have been developed in recent years. Scikit-learn 9 is one of
the most popular machine learning libraries in Python, with a plethora of thoroughly
tested and well-maintained machine learning algorithms. However, these algorithms are
primarily aimed at users with computational and statistical backgrounds, which may
discourage many biologists, biomedical scientists or beginning students (who may have
minimal machine learning background but still want to explore its application in their
research) from using machine learning. Ching et al. (2018) recently highlighted the role
that deep learning (a class of machine learning algorithms) currently plays in biology,
and how such algorithms present new opportunities and obstacles for a data-rich field such
as biology 10.

 


 

Several open source machine learning applications with graphical user interfaces have been
developed, such as KNIME 11 and Weka 12, written in Java, and Orange 13, written in
Python. The dataflow process for most of these applications is generally constructed
graphically by the user, by placing and connecting widgets with drag-and-drop. Such a
graphical workflow and representation of data input, processing and output is visually
appealing, but can be computationally demanding (memory, storage, processing, etc.) and
limiting for algorithm comparison, since each machine learning algorithm can have many
different parameters. These tools are very mature, offer numerous algorithms, and are well
documented. However, they can be intimidating for machine learning beginners and students
who want to perform simple tasks such as data classification. In addition, scikit-learn
has comprehensive documentation 14, many online resources, including through Kaggle 15 and
Stack Overflow 16, and a large online user base, which make scikit-learn a very popular
package for beginners learning machine learning with Python.

 

Here, we present ClassificaIO, an open-source Python graphical user interface (GUI) for
supervised machine learning classification with the scikit-learn library. To our
knowledge, no standalone GUI exists for the scikit-learn machine learning library. The
core aim of ClassificaIO is to provide our research group with a point-and-click graphical
user interface that enables fast comparisons within and across different machine learning
classifiers, in order to predict (classify) missing annotations, including sex and sample
source, from our curated microarray data (for more details, see the Chapter 3 section
“Classification of missing metadata annotation”). In our curated data, 805 arrays were
missing sex annotations and 737 arrays were missing sample source annotations. Since these
arrays correspond to AML patients and healthy individuals, predicting their missing
annotations was essential for preserving the sample size and statistical power of our
study.

 

ClassificaIO can also serve as a teaching and educational software tool: a visually
minimalistic, computationally light interactive interface that gives machine learning
beginners who have some basic knowledge of Python and of using a terminal, and a broad
background in machine learning, access to a range of state-of-the-art classification
algorithms, allowing them to use machine learning and apply it to their research.

What distinguishes ClassificaIO from other similar applications is:

1.  Cross-platform implementation for Mac OS X, Linux, and Windows operating systems
2.  Interactive point-and-click GUI to 25 supervised classification algorithms in
    scikit-learn
3.  Accessible clickable links to scikit-learn’s well-written online documentation for
    each implemented classification algorithm
4.  Simple upload of all data files with dedicated buttons, with a robust CSV reader and a
    displayed history log to track uploaded files, file names and directories
5.  Fast comparisons within and across classifiers to train, validate, and test data
6.  Upload and export of ClassificaIO trained models (for future use of a trained model
    without the need to retrain), and export of both validation and testing data results
7.  Small application footprint in terms of disk space usage (<2 MB)


 

ClassificaIO implementation 

ClassificaIO has been developed using Tkinter, the standard Python interface to the Tk GUI
toolkit 17, for Mac OS X (using High Sierra ≥ 10.13), Linux (using Ubuntu 18.04 64-bit),
and Windows (using Windows 10 64-bit). It uses external packages including Tkinter,
Pillow, Pandas 18, NumPy 19, scikit-learn and SciPy 20. To avoid system errors, crashes,
and crude fonts, we recommend not installing ClassificaIO using integrated environment
package installers; instead, native installation of ClassificaIO and its dependencies
(using pip for Mac and Windows, and pip3 and apt-get for Linux) is encouraged. Once
installed, ClassificaIO can be deployed using the ‘import’ statement. Installation
instructions and step-by-step working examples are distributed with the ClassificaIO GUI
and can be accessed directly through the ‘HELP’ button at the upper left of the GUI, which
points the user’s default browser to ClassificaIO’s online user manual on GitHub. Some
basic knowledge of Python and of accessing it through a terminal is required for
installing and running the software. Links to all supplementary files and additional
ClassificaIO software information are provided in Table 3.
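
As a rough illustration of the dependency list above, the following sketch checks whether the named packages can be located in the current Python environment before the GUI is launched. It is not part of ClassificaIO; the helper name find_missing is hypothetical, and the module names are simply the import names of the packages listed in this section (Pillow is imported as PIL and scikit-learn as sklearn).

```python
import importlib.util

# Import names of the packages listed in this section.
# Pillow is imported as "PIL" and scikit-learn as "sklearn".
REQUIRED_MODULES = ["tkinter", "PIL", "pandas", "numpy", "sklearn", "scipy"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment.

    Hypothetical helper for illustration only; it checks that each module
    can be found, not that it imports without error.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All listed packages were found.")
```

Any package reported as not found would need to be installed natively (for example with pip) as recommended above.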

 

ClassificaIO backend 

ClassificaIO implements 25 scikit-learn classification algorithms for supervised machine
learning. A list of all these algorithms, their corresponding scikit-learn functions, and
their immutable (unchangeable) parameters with default values is presented in Table 4, and
ClassificaIO’s workflow is outlined in Figure 2. Once training and testing data are
uploaded to the front-end as described below, and a classifier selection is made and
submitted, ClassificaIO’s backend calls the selected scikit-learn classifier to create the
model, passing any manually set parameter values; parameters that are not set manually
keep their default values. For example, “LogisticRegression” is defined in the
scikit-learn library as a class; in terms of the Python code used in the backend, its
details are outlined in the scikit-learn documentation:

sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0,
fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None,
solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=)

 

The inputs to the class, within the parentheses, such as “penalty”, “dual”, “tol”, etc.,
correspond to the model parameters, each followed by an equal sign assigning its default
value. Rather than requiring the user to type these values, the ClassificaIO GUI displays
the parameters with input fields and radio buttons for each classifier, initially
populated with the default values. More information on all the parameters is available in
the GUI through a link named “Learn More”, provided for each classifier in the interface.
The link directs the default browser to the scikit-learn online documentation for the
selected classifier, including the underlying backend documentation and parameter
descriptions. The details and code complexity of the backend implementation are
effectively hidden from the user, who can interact with the ClassificaIO GUI to set the
relevant parameters or leave them unchanged at their default values. On the training data,
ClassificaIO fits the estimator for classification using the scikit-learn ‘fit’ method,
e.g. fit(x_train, y_train), to train the model (i.e. to learn from the training data), and
uses the scikit-learn ‘predict’ method, e.g. predict(x_validation), to validate the model.
Finally, ClassificaIO predicts new values using the scikit-learn ‘predict’ method again,
this time on the testing data, e.g. predict(testing_X), applying the model to new data
that have not been used in model training.
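
The fit, validate and predict sequence described above can be sketched directly with scikit-learn. This is a minimal illustration under stated assumptions, not ClassificaIO's backend code: the toy data, the class labels and the variable names validation_predictions and testing_predictions are hypothetical, while x_train, y_train, x_validation and testing_X follow the names used in the text.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features per sample, two well-separated classes.
x_train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
           [3.0, 3.1], [3.2, 2.9], [2.8, 3.3]]
y_train = ["class_a", "class_a", "class_a", "class_b", "class_b", "class_b"]

# Held-out samples with known labels, used to validate the trained model.
x_validation = [[0.1, 0.1], [3.0, 3.0]]
y_validation = ["class_a", "class_b"]

# New samples whose labels are unknown and are to be predicted.
testing_X = [[0.0, 0.2], [3.1, 3.0]]

# Parameters not passed here keep their scikit-learn defaults, which is
# analogous to leaving a field in the GUI unchanged.
model = LogisticRegression(C=1.0, max_iter=100)

# Train the model on the training data.
model.fit(x_train, y_train)

# Validate: predict the held-out samples and compare with their known labels.
validation_predictions = model.predict(x_validation)
validation_accuracy = model.score(x_validation, y_validation)

# Predict annotations for data not used in training.
testing_predictions = model.predict(testing_X)

print("validation accuracy:", validation_accuracy)
print("testing predictions:", list(testing_predictions))
```

Swapping LogisticRegression for another scikit-learn classifier class leaves the fit and predict calls unchanged, which is what makes comparison across classifiers straightforward.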

 

ClassificaIO functionalities 

ClassificaIO’s GUI consists of three windows: ‘Main’, ‘Use My Own Training Data’, and
‘Already Trained My Model’. Each window is implemented within the code as a class with
several functions/methods that are dynamically connected to provide the GUI.
ClassificaIO’s Main window (Fig. 2) has two buttons: (i) the ‘Use My Own Training Data’
button, which when clicked allows the user to train and test classifiers using their own
training and testing data, and (ii) the ‘Already Trained My Model’ button, which when
clicked allows the user to apply a model they have already trained with ClassificaIO to
their testing data.

 

•  Data input 

In the ‘Use My Own Training Data’ window (Fig. 4a), clicking the corresponding
buttons in the ‘UPLOAD TRAINING DATA FILES’ and ‘UPLOAD TESTING DATA
FILE’ panels opens a file selector that directs the user to upload all required comma-separated
values (CSV) data files (‘Dependent and Target’ or ‘Dependent, Target and Features’, and
‘Testing Data’) (Fig. 5). A history of all uploaded data files (file name and directory) is
automatically saved in the ‘CURRENT DATA UPLOAD’ panel (Fig. 6). Briefly, the
dependent data are the data on which the model depends for learning, and
the target data are the annotation, i.e. what is going to be predicted. The dependent data
have attributes (also known as features) that take values (measurements/results) for each
contained object (i.e. each sample). Further details on data file formats and examples are
provided in Figures 7(a, b) and S2-S5.

 

•  Classifier selection 

After the data are uploaded, the user can select among 25 widely used
classification algorithms (Table 4), including logistic regression, perceptron, support
vector machines, k-nearest neighbors, decision tree, random forest, neural network multi-layer
perceptron, and more. The algorithms are integrated from the scikit-learn library,
and allow the user to train and test models using their own uploaded data. Each classifier
can be selected by clicking the corresponding classifier name in the ‘CLASSIFER
SELECTION’ panel. Once a classifier is selected, a brief description of the
classifier, an underlined clickable link that reads “Learn more” right next to the
classifier name (Fig. 8a), and the classifier parameters will populate. If “Learn more” is
clicked, the link directs the default web browser to scikit-learn’s online
documentation for the specific classifier, which explains each
parameter and its use, and how to tune/optimize each parameter to get the best
performance. ClassificaIO provides the user with an interactive point-and-click interface
to set, modify, and test the influence of each parameter on their data (Fig. 8c). The user
can switch between classifiers and parameters through point-and-click, which enables
fast comparisons within and across classifier models.
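
The kind of comparison within and across classifiers that the interface supports can also be sketched in a few lines of scikit-learn. The following is an illustrative sketch only: the choice of classifiers, the parameter values, the 10-fold setting, and the use of the bundled iris data are assumptions made for the example and do not reproduce any result reported in this chapter.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two settings of the same classifier (a "within" comparison) and one
# different classifier (an "across" comparison).
candidates = {
    "KNeighborsClassifier (n_neighbors=5)": KNeighborsClassifier(n_neighbors=5),
    "KNeighborsClassifier (n_neighbors=15)": KNeighborsClassifier(n_neighbors=15),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

mean_accuracy = {}
for name, estimator in candidates.items():
    # 10-fold cross-validated accuracy for each candidate.
    scores = cross_val_score(estimator, X, y, cv=10)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {mean_accuracy[name]:.3f}")
```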

 

 

 


 

•  Model training 

Both train-validate split and cross-validation methods (which are necessary to
prevent/minimize overfitting) populate with each classifier and can be used for
training (Fig. 8b). Training, validation and testing are all performed after pressing the
submit button.

 

•  Results output 

After model training and testing are completed, the confusion matrix, classifier accuracy
and error are displayed in the ‘CONFUSION MATRIX, MODEL ACCURACY &
ERROR’ panel (Fig. 9a, bottom). Model validation data results are displayed in the

‘TRAINING  RESULT:  ID  –  ACTUAL  –  PREDICTION’  panel  (Fig.  9b),  and  testing 

data results are displayed in the ‘TESTING RESULT: ID – PREDICTION’ panel (Fig. 

10b). 

 

•  Model export 

By  clicking  on  the  ‘Export  Model’  button,  Figure  10  bottom  left,  the  user  can  export 

trained  models  to  save  for  future  use  without  having  to retrain.  A  previously  exported 

ClassificaIO model can then be used for testing of new data in the ‘Already Trained My 

Model’  window  (Fig.  4b)  by  clicking  the  ‘Model  file’  button  in  the  ‘UPLOAD 

TRAINING MODEL FILE’ panel (Fig 11a). 

 

 


 

•  Results export 

Full results (trained models, both validated and tested data, and uploaded file names and
directories) for both windows (Fig. 4a, b) can be exported as CSV files for further
analysis, publication, sharing, or later use (for more details on the exported trained
model and data file formats, see S7 and S8).

 

Results: illustrative examples and data used 

To illustrate the use of the interface and classification, we use the following two examples.

 

•  Iris prediction using Iris dataset 

To demonstrate the interface and classification, we used the so-called Fisher/Anderson
iris dataset 21,22. This dataset is widely used as a prototype to illustrate classification
algorithms, not only for biological data but in machine learning implementations in general.
The dataset consists of fifty samples each for three different species of iris flowers
(Setosa, Versicolor and Virginica), with sepal length and width, and petal length and
width provided as measurements. For more details on the iris data file formats, see
Figures 7a and 7b and S2-S5.

 

•  Sex prediction using microarray gene expression data 

In this example, we used raw microarray gene expression data from the Gene
Expression Omnibus (GEO) 23 to predict each sample donor’s sex. This is often necessary
in metadata analyses that use publicly available gene expression datasets for reanalysis, as
sample annotations on GEO may be missing information, including the sample donor’s sex.
To illustrate the classification/sex prediction we used two datasets, GSE99039 24 (training
data) and GSE18781 25 (testing data). We used 121 samples from GSE99039 and 25 samples
from GSE18781, for which RNA from peripheral blood
mononuclear cells was assayed using the Affymetrix Human Genome U133 Plus 2.0 Array
(accession GPL570). The Y chromosome gene expression values were used in
ClassificaIO as training and testing data to predict each sample donor’s sex. Using the ‘Linear
SVC’ model with “k-fold cross validation” (10-fold) resulted in a model with 99%
accuracy for sample donor’s sex prediction (in the displayed example). For more details
on the pre-processing of the raw gene expression data, file formats, Y chromosome
probe ids, and final results, see Ex2, “Gene expression sex prediction using linear
support vector classifier”, and Figures 12-14.

 

Discussion 

We have presented ClassificaIO, a GUI that implements the scikit-learn supervised
machine learning classification algorithms. The scikit-learn package is one of the most
popular in Python, with well-written documentation, and many of its machine learning
algorithms are currently used for analyzing large and complex data sets in genomics. Our
interface aims to provide an interactive machine learning research, teaching and
educational tool for carrying out machine learning analysis with scikit-learn without requiring
advanced computational and machine learning knowledge. ClassificaIO is
provided as open source software, and its back-end classes and functions allow for
rapid development. We anticipate further development, aided by the scikit-learn library
developer community, to integrate additional classification algorithms and extend
ClassificaIO to include other machine learning methods such as regression, clustering, and
anomaly detection, to name but a few.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


 

ClassificaIO: setup, dependencies, installation, instructions, and step-by-step
working examples

 

Summary 

ClassificaIO is an open-source Python graphical user interface (GUI) for supervised
machine learning classification using the scikit-learn module 9. ClassificaIO aims to provide
an easy-to-use interactive way to train, validate, and test data on a range of classification
algorithms. The GUI enables fast comparisons within and across classifiers, and
facilitates uploading and exporting of trained models and of both validated and tested data
results.

 

Dependencies 

ClassificaIO is a Python package with the following external dependencies: 

•  Tkinter ≥ 8.6.7, Pillow ≥ 5.3.0, pandas ≥ 0.23.3, numpy == 1.15.3, scikit-learn ≥ 

0.20.0, and scipy ≥ 1.1.0 
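
Before installation, the minimum versions listed above can be checked against the current environment. The sketch below is not part of ClassificaIO; the helper names version_tuple and check_dependencies are hypothetical, and importlib.metadata is assumed to be available (Python 3.8 or later). Tkinter is distributed with Python rather than through pip, and numpy is pinned to an exact version, so both are left out of this minimum-version check.

```python
from importlib import metadata

# Minimum versions as listed in the Dependencies section.
MINIMUM_VERSIONS = {
    "Pillow": "5.3.0",
    "pandas": "0.23.3",
    "scikit-learn": "0.20.0",
    "scipy": "1.1.0",
}


def version_tuple(version):
    """Turn '0.23.3' into (0, 23, 3), keeping only leading digits of each part."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:  # pad '1.1' to (1, 1, 0)
        parts.append(0)
    return tuple(parts)


def check_dependencies(minimums=MINIMUM_VERSIONS):
    """Return {package: status} for every package in `minimums`."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[package] = "ok" if ok else f"too old ({installed} < {minimum})"
    return report


for package, status in check_dependencies().items():
    print(f"{package}: {status}")
```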

 

Prerequisites 

ClassificaIO requires Python version 3.6 or higher and can be used on Mac OS X High
Sierra, Linux (tested on Ubuntu), and Windows 10 operating systems. To avoid
system errors, crashes, and crude fonts, we recommend not installing ClassificaIO using
integrated environment package installers; native installation of ClassificaIO using pip is
highly encouraged. If you do not have pip installed, you must install it
first.


 

Installation instructions 

1.  Mac or Windows 

•  To install the current release use pip in the terminal: 

$ pip install ClassificaIO 

•  Alternatively, you can install directly from GitHub using: 

$ pip install git+https://github.com/gmiaslab/ClassificaIO/ 

2.  Linux 

•  First install the current release of tkinter and pip:  

$ sudo apt-get install python3-tk 

$ sudo apt-get install python3-pip 

•  To install the current ClassificaIO release use pip: 

$ pip3 install ClassificaIO 

•  Alternatively, you can install directly from GitHub using: 

$ pip install git+https://github.com/gmiaslab/ClassificaIO/ 

After installing ClassificaIO, please run it from the terminal using Python: 

$ python3 

>>> from ClassificaIO import ClassificaIO 

>>> ClassificaIO.gui()   

Note that the name is case sensitive (i.e. the ‘IO’ is capitalized). Once ClassificaIO’s
main window appears on your screen, you can click on the ‘Use My Own Training Data’
button and start your supervised machine learning classification project.

 


 

Iris dataset prediction using a logistic regression classifier 

 

•  Training data input 

You  first  need  to  select  (either  ‘Dependent  and  Target’  or  ‘Dependent,  Target  and 

Features’) from the ‘UPLOAD TRAINING DATA FILES’ panel to upload training data 

files. For this example, we select the ‘Dependent and Target’ button. 

 

To begin uploading data files, click the corresponding buttons in the ‘UPLOAD
TRAINING DATA FILES’ panel: a file selector (Fig. 5) directs you to upload the
dependent and target data files, one at a time (e.g. the dependent data file in Fig. 7a and the
target data file in Fig. 7b). Once a file is uploaded to ClassificaIO, the file name and
directory are automatically saved in the ‘CURRENT DATA UPLOAD’ panel (Fig. 6).
This updatable log allows for tracking the data files currently in use, and maintains a
history of all files uploaded to the software.

 

•  Data format 

Data formats are shown in Figure 7a for dependent data and Figure 7b for target data. 

The dependent data are the data on which the model depends for learning,
and the target data are the annotation, i.e. what is going to be predicted. In this example,
the dependent data have 4 rows and 105 columns. For the dependent data, each row is an
attribute (also known as a feature) and each column is an object (also known as an
observation or a sample). Thus, the header row enumerates the objects, and the header
column names the attributes. The values in the file represent the measurements made for
each of the objects (columns) for each of the attributes (rows). For the target data, we
have 105 rows and 2 columns (note: for the target data, the rows correspond to the
objects and the second column is the class per object). The values in the “ids” column of the
target data must match the Objects header row in the dependent data; otherwise an error
will occur. Hence, the number of columns (i.e. objects) in the
dependent data must also match the number of rows in the target data (i.e. each object has
a unique “id” and must be assigned a target class for training). Finally, the “target”
column in the target data must be numerically valued.
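
The format rules above can be made concrete with a small consistency check. The sketch below is illustrative only and is not ClassificaIO's own file reader: the two inline CSV snippets are tiny stand-ins with three objects instead of 105, the label in the top-left header cell and the object ids obj1 to obj3 are assumptions, and the helper names are hypothetical. Only the “ids” and “target” column names are taken from the description above.

```python
import csv
import io

# Tiny illustrative stand-ins for the two training files (3 objects, 4 attributes).
dependent_csv = """attribute,obj1,obj2,obj3
sepal_length,5.1,7.0,6.3
sepal_width,3.5,3.2,3.3
petal_length,1.4,4.7,6.0
petal_width,0.2,1.4,2.5
"""
target_csv = """ids,target
obj1,0
obj2,1
obj3,2
"""


def read_dependent(text):
    """Return (object_ids, {attribute: [one value per object]})."""
    rows = list(csv.reader(io.StringIO(text)))
    object_ids = rows[0][1:]  # the header row enumerates the objects
    attributes = {row[0]: [float(v) for v in row[1:]] for row in rows[1:]}
    return object_ids, attributes


def read_target(text):
    """Return {object_id: class_label}; int() enforces numeric targets."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["ids"]: int(row["target"]) for row in reader}


def check_consistency(object_ids, targets):
    """Raise ValueError if the objects and the target ids do not line up."""
    if len(object_ids) != len(targets):
        raise ValueError("number of objects and number of target rows differ")
    missing = [obj for obj in object_ids if obj not in targets]
    if missing:
        raise ValueError(f"objects without a target class: {missing}")


object_ids, attributes = read_dependent(dependent_csv)
targets = read_target(target_csv)
check_consistency(object_ids, targets)

# Transpose to one sample per row, the orientation scikit-learn estimators expect.
X = [[attributes[name][i] for name in attributes] for i in range(len(object_ids))]
y = [targets[obj] for obj in object_ids]
print(X[0], y[0])
```

Because the dependent file stores objects as columns, the transpose in the last step is what turns it into the samples-by-features layout used by scikit-learn.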

 

•  Classifier selection 

Once  you  have  uploaded  all  required  training  data  files,  you  can  select  between  25 

different machine learning classification algorithms in the ‘CLASSIFER SELECTION’ 

panel  (Table  4).  Here  are  all  classification  algorithms  in  order  of  appearance  in  the 

‘CLASSIFER  SELECTION’  panel.  Immutable  (unchangeable)  parameters  with  their 

default values are also listed for each classifier in the parentheses: 

Linear_model
    LogisticRegression (class_weight = None)
    PassiveAggressiveClassifier (class_weight = None, n_iter = None)
    Perceptron (class_weight = None)
    RidgeClassifier (class_weight = None)
    Stochastic Gradient Descent (SGDClassifier)

Discriminant_analysis
    LinearDiscriminantAnalysis (shrinkage = None, priors = None)
    QuadraticDiscriminantAnalysis (store_covariances = None, priors = None)

Support vector machines (SVMs)
    LinearSVC (class_weight = None)
    NuSVC (class_weight = None)
    SVC (class_weight = None)

Neighbors
    KNeighborsClassifier (metric_params = None)
    NearestCentroid
    RadiusNeighborsClassifier (metric_params = None)

Gaussian_process
    GaussianProcessClassifier (kernel = None)

Naive_bayes
    BernoulliNB (class_prior = None)
    GaussianNB (class_prior = None)
    MultinomialNB (class_prior = None)

Trees
    DecisionTreeClassifier (class_weight = None)
    ExtraTreeClassifier (min_impurity_split = None, class_weight = None)

Ensemble
    AdaBoostClassifier (base_estimator = None)
    BaggingClassifier (base_estimator = None)
    ExtraTreesClassifier (class_weight = None)
    RandomForestClassifier (class_weight = None)

Semi_supervised
    LabelPropagation

Neural_network
    MLPClassifier

 

The following will populate once you make a classifier selection: 
 
Figure 8a: The classifier definition with a clickable underlined link “learn more” in blue, 

which, when clicked opens an external web-browser to the scikit-learn documentation for 

the selected classifier. 

Figure  8b:  Interactive  way  to  select  between  train-validate  split  and  cross-validation 

methods  (radio  buttons),  which  are  necessary  to  prevent/minimize  training  model 

overfitting. 

Figure 8c: Classifier parameters, presented in a point-and-click interface to set,
modify, and test the influence of each parameter on your data.

 

•  Model training, evaluation, validation and result output 

You can now click ‘submit’ to train your classifier using the uploaded training data files
(‘Dependent and Target’ in this example) and evaluate your result. Alternatively, you
can upload testing data first, and then click ‘submit’ to train and test a classifier on your
uploaded data at the same time. For this example, we first train a selected classifier,
‘LogisticRegression’, using its default parameters and the default train-validate split method,
‘Train Sample Size (%)’, and then upload testing data to test the trained
model. After clicking ‘submit’, the selected classifier (‘LogisticRegression’ in this
example) is trained using the loaded training data (‘Dependent and Target’ in this
example).

 

Notes: ClassificaIO always shuffles your training data before splitting to eliminate mini
batch effects.
Internally, when the ‘Train Sample Size (%)’ method is selected, ClassificaIO uses the
scikit-learn train_test_split function to split the training data into training and
validation subsets. With this method, the parameter that is set is train_size, which takes the train
sample size set by you (e.g. Train Sample Size (%) set to 75% means train_size = 0.75
and test_size = 0.25). If the ‘K-fold Cross-Validation’ method is selected instead,
ClassificaIO uses the scikit-learn cross_val_predict function, where the training data are
split into k folds. The model is trained on k-1 of the folds, followed by a validation step on
the remaining fold. This is repeated for each of the k folds.
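
The two validation options described in these notes correspond to the following scikit-learn calls. This is a hedged sketch rather than ClassificaIO's internal code: the variable names, random_state=0, max_iter=1000, and the way 105 samples are drawn from the bundled iris data are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = load_iris(return_X_y=True)

# Draw a balanced set of 105 samples to play the role of the training data.
X_105, _, y_105, _ = train_test_split(
    X, y, train_size=105, stratify=y, random_state=0
)

# Option 1: 'Train Sample Size (%)' set to 75 maps to train_size = 0.75.
train_percent = 75
x_train, x_validation, y_train, y_validation = train_test_split(
    X_105, y_105, train_size=train_percent / 100, shuffle=True, random_state=0
)
print(len(x_train), len(x_validation))  # sizes of the two subsets

# Option 2: 'K-fold Cross-Validation' with k = 10. Every sample receives a
# prediction from a model that was trained on the other k-1 folds.
k = 10
cv_predictions = cross_val_predict(
    LogisticRegression(max_iter=1000), X_105, y_105, cv=k
)
print(len(cv_predictions))
```

With 105 samples, a 75% split yields 78 training and 27 validation data points, which agrees with the counts given in the caption of Figure 9.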

After training is completed, the confusion matrix, classifier accuracy and error are
displayed in the ‘CONFUSION MATRIX, MODEL ACCURACY & ERROR’ panel
(Fig. 9a). Model validation data results are displayed in the ‘TRAINING RESULT: ID –
ACTUAL – PREDICTION’ panel (Fig. 9b), with each data point ID shown 1st, the
actual target value 2nd, and the predicted target value 3rd, where the predictions
correspond to the iris flower species, with 0=setosa, 1=versicolor, and 2=virginica.
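
As a worked illustration of how a confusion matrix, accuracy and error relate to the actual and predicted values, consider the following sketch. The eight actual/predicted pairs are invented for the example, the helper names are hypothetical, and defining error as 1 minus accuracy is an assumption about the reported quantity rather than a statement of ClassificaIO's implementation.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy_and_error(matrix):
    """Accuracy is the diagonal sum over the total; error is its complement."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    accuracy = correct / total
    return accuracy, 1 - accuracy


# Invented validation results: 0 = setosa, 1 = versicolor, 2 = virginica.
actual = [0, 0, 1, 1, 1, 2, 2, 2]
predicted = [0, 0, 1, 2, 1, 2, 2, 1]

matrix = confusion_matrix(actual, predicted, labels=[0, 1, 2])
accuracy, error = accuracy_and_error(matrix)

for row in matrix:
    print(row)
print(f"accuracy = {accuracy:.2f}, error = {error:.2f}")
```

Here six of the eight predictions lie on the diagonal of the matrix, so the accuracy is 0.75 and the error is 0.25.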

 

•  Testing data input and result output 

To test your trained model, first upload the testing data file by clicking the ‘Testing Data’
button in the ‘UPLOAD TESTING DATA FILE’ panel (Fig. 10a). Once clicked, a file
selector directs you to upload the testing data file; the file name is automatically saved in
the ‘CURRENT DATA UPLOAD’ panel (outlined in the red box in the figure) to
indicate that your file has been uploaded. The testing data file format is the same as for
the dependent data file.

After clicking ‘Submit’, testing results are displayed in the ‘TESTING RESULT: ID –
PREDICTION’ panel (Fig. 10b), with each data point ID shown 1st and the corresponding
predicted target value shown 2nd, separated by a hyphen.

 

•  Result export 

Now you are ready to export your trained model to preserve it for future use without
having to retrain. Simply click the ‘Export Model’ button (Fig. 10, bottom left) and save your model.
Your exported ClassificaIO model can then be used for future testing on new data in the
‘Already Trained My Model’ window in ClassificaIO, shown below.

 

•  ClassificaIO model input 

You will need to upload a ClassificaIO model by clicking the ‘Model File’ button in the
‘UPLOAD TRAINING MODEL FILE’ panel (Fig. 11a). Once clicked, a file selector
directs you to upload a ClassificaIO trained model. You will also need to upload a
testing data file (the testing data file format is the same as explained above) by clicking
the ‘Testing Data’ button in the ‘UPLOAD TESTING DATA FILE’ panel (Fig. 11b).
Once the ClassificaIO model and testing data files are uploaded, the file names are
automatically displayed in the ‘CURRENT DATA UPLOAD’ panel (Fig. 11c).


 

 

After clicking ‘submit’, the uploaded model’s preset parameters will populate (Fig. 11d) to
show the classifier originally used to train the uploaded model. The confusion matrix,
classifier accuracy and error of the trained model are then displayed in the ‘CONFUSION
MATRIX, MODEL ACCURACY & ERROR’ panel (Fig. 11e). Testing data results are
displayed in the ‘TESTING RESULT: ID – PREDICTION’ panel (Fig. 11f), with the data
point ID shown 1st, followed by a hyphen and the predicted value.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


 

APPENDIX 


 

Figure 2. Workflow summary of ClassificaIO. 

 

The diagram summarizes the graphical user interface and backend
functionality/workflow for the ClassificaIO ‘Use My Own Training Data’ window and
‘Already Trained My Model’ window.

 

 

 

 

 

 

 


 

Figure 3. ClassificaIO main window. 

ClassificaIO  main  window  appears  on  the  screen  after  typing  ‘ClassificaIO.gui()’  in  a 

terminal or a Python interpreter. 

 

 

 

 

 

 

 

 


 

Figure 4. ClassificaIO user interface (Mac OS shown). 

As described in the ClassificaIO implementation section: a. An example ‘Use My Own
Training Data’ window with uploaded training and testing data files, a selected logistic
regression classifier, populated classifier parameters, and output classification results.
b. A corresponding ‘Already Trained My Model’ window with an uploaded ClassificaIO
logistic regression trained model and testing data file, and output classification results.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


 

Figure 5. Graphical control element dialog box. 

 

 

 

 

 

 

 

 

a.  Dependent  data file  selected  for  upload.  b.  Selected  target  data  file  to  upload.  N.B. 

each file selection has to be done one at a time. 

 	

									


 

Figure 6. Current data upload panel.	
			

 

 

 

 

 

 

 

 

Both dependent and target data file names shown (red boxes). Scroll down for uploaded 

data files directories. 

 

 

 

 

 

 

 

 

 


 

Figure 7.	Gene expression sex prediction using linear support vector classifier. 

[Figure 7 image: panel a is labeled with Objects (columns) and Attributes (rows); panel b is labeled with Objects (rows).]


a.  Dependent  data,  example  of  partial  dependent  data  file  format.  Testing  data  (not 

shown)  uses  the  same  format.  b.  Example  of  partial  target  data  file  format  where  the 

targets  correspond  to  setosa  =  0,  versicolor  =  1,  and  virginica  =  2.  Versicolor  and 

virginica are not visible in this screenshot. 

 

 

 

 


 

Figure 8. Selected logistic regression classifier. 

The interface for each selected classifier has uniform features. a. Classifier definition is

displayed, together with an underlined clickable link that reads “Learn more” next to the 

classifier name. b. Training methods with ‘Train Sample Size (%)’ method selected. c. 

The classifier parameters set to their default values. 

 

 

 

 

 

 

 

 

 


 

Figure 9. Trained logistic regression classifier. 

a.  Trained  model  using  78  data  points  (75%  of  105  data  points),  classifier  evaluation 

(confusion matrix, model accuracy and error). b. Model validated using 27 data points 

(25% of 105 data points). 

 

 

 

 

 

 

 

 

 

 


 

Figure 10. Tested logistic regression classifier. 

a. Upload testing data panel. b. Model tested using 45 data points. 

 

 

 

 

 

 

 

 

 

 

 

 


 

Figure 11. ‘Already Trained My Model’ window.

a. Upload ClassificaIO trained model panel. b. Upload testing data panel. c. Current data

upload panel with both model and testing data file names shown (red boxes). d. Model

preset  parameters.  e.  Trained  model  result  and  model  evaluation  (confusion  matrix, 

model accuracy and error). f. Model testing result. 

 

 

 

 

 

 

 

 

 


 

Figure 12. Training and testing using gene expression data. 

a. Selected k-nearest neighbors classifier, with data trained and tested using the
default parameter values. b. The same classifier, with data trained and tested
using different parameter values.

 


 

Figure 13. Trained linear support vector machine classifier. 

Model trained using 121 data points from GSE99039 and k-fold cross validation, with classifier
evaluation (confusion matrix, model accuracy and error). The model was validated, and then
tested using 25 data points from GSE18781.

 

 

 

 

 

 

 

 

 

 


 


Figure 14. Features data.  

 

 

 

 

 

 

 

 

 

 

Example of partial features data file format where each Affymetrix probe id corresponds
to a Y chromosome gene.

										


 

Table 3. ClassificaIO software information.

Current ClassificaIO Version: 1.1.5
Public Links to Executables: PyPI: https://pypi.org/project/ClassificaIO/; GitHub: https://github.com/gmiaslab/ClassificaIO
Distribution License: MIT license (MIT)
Operating Systems: Mac OS X, Linux, and Microsoft Windows
Software Installation Dependencies: Python 3 and Python libraries: Tkinter, Pillow, Pandas, NumPy, scikit-learn and SciPy
Supplementary Data Online Availability: https://github.com/gmiaslab/manuals/tree/master/ClassificaIO/Supplementary%20Files
Contact E-mail: gmiaslab@gmail.com

ClassificaIO is provided as open source software, and distributed on GitHub and PyPI. 

Up-to-date  code,  manuals  and  supplementary  example  material  will  be  maintained  on 

GitHub. 

	

 

 

 

 

 

 

 

 

 

 


 

Table 4. Classification algorithms included in ClassificaIO. 

A list of all 25 classification algorithms, their corresponding scikit-learn functions, and 

immutable (unchangeable) parameters with their default values. 

 

 

 


 

BIBLIOGRAPHY 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 


 

1  Mias, G. I. & Snyder, M. Personal genomes, quantitative dynamic omics and
personalized medicine. Quant Biol 1, 71-90, doi:10.1007/s40484-013-0005-3
(2013).

2  Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and
genomics. Nat Rev Genet 16, 321-332, doi:10.1038/nrg3920 (2015).

3  Buckberry, S., Bent, S. J., Bianco-Miotto, T. & Roberts, C. T. massiR: a method
for predicting the sex of samples in gene expression microarray datasets.
Bioinformatics 30, 2084-2085, doi:10.1093/bioinformatics/btu161 (2014).

4  Ohler, U., Liao, G. C., Niemann, H. & Rubin, G. M. Computational analysis of
core promoters in the Drosophila genome. Genome Biol 3, RESEARCH0087
(2002).

5  Degroeve, S., De Baets, B., Van de Peer, Y. & Rouze, P. Feature subset selection
for splice site prediction. Bioinformatics 18 Suppl 2, S75-83 (2002).

6  Heintzman, N. D. et al. Distinct and predictive chromatin signatures of
transcriptional promoters and enhancers in the human genome. Nat Genet 39,
311-318, doi:10.1038/ng1966 (2007).

7  Way, G. P. et al. A machine learning classifier trained on cancer transcriptomes
detects NF1 inactivation signal in glioblastoma. BMC Genomics 18, 127,
doi:10.1186/s12864-017-3519-7 (2017).

8  Lee, S. I. et al. A machine learning approach to integrate big data for precision
medicine in acute myeloid leukemia. Nat Commun 9, 42,
doi:10.1038/s41467-017-02465-5 (2018).

9  Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res
12, 2825-2830 (2011).

10  Ching, T. et al. Opportunities and obstacles for deep learning in biology and
medicine. J R Soc Interface 15, doi:10.1098/rsif.2017.0387 (2018).

11  Berthold, M. R. et al. KNIME: The Konstanz Information Miner. Stud Class Data
Anal, 319-326, doi:10.1007/978-3-540-78246-9_38 (2008).

12  Frank, E., Hall, M., Trigg, L., Holmes, G. & Witten, I. H. Data mining in
bioinformatics using Weka. Bioinformatics 20, 2479-2481,
doi:10.1093/bioinformatics/bth261 (2004).

13  Demsar, J. et al. Orange: Data Mining Toolbox in Python. J Mach Learn Res 14,
2349-2353 (2013).

14  Stack Overflow. The stack overflow python online community. (2018).

15  Ousterhout, J. K. Tcl and the Tk toolkit. (Addison-Wesley, 1994).

16  Scikit Learn Documentation. Scikit learn online documentation. (2018).

17  Help, K. D. a. How to use kaggle, <https://www.kaggle.com/docs> (2018).

18  McKinney, W. in Proceedings of the 9th Python in Science Conference. 51-56.

19  Oliphant, T. E. A guide to NumPy. Vol. 1 (Trelgol Publishing USA, 2006).

20  Olivier, B. G., Rohwer, J. M. & Hofmeyr, J. H. S. Modelling cellular processes
with Python and Scipy. Mol Biol Rep 29, 249-254,
doi:10.1023/A:1020346417223 (2002).

21  Fisher, R. A. The use of multiple measurements in taxonomic problems. Ann
Eugenic 7, 179-188, doi:10.1111/j.1469-1809.1936.tb02137.x (1936).

22  Anderson, E. The Irises of the Gaspe peninsula. Bulletin of American Iris Society
59, 2-5 (1935).

23  Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene
expression and hybridization array data repository. Nucleic Acids Res 30, 207-210
(2002).

24  Shamir, R. et al. Analysis of blood-based gene expression in idiopathic Parkinson
disease. Neurology 89, 1676-1683, doi:10.1212/WNL.0000000000004516 (2017).

25  Sharma, S. M. et al. Insights in to the pathogenesis of axial spondyloarthropathy
based on gene expression profiles. Arthritis Res Ther 11, R168,
doi:10.1186/ar2855 (2009).


 

Chapter 3 –  

Computational Meta-analysis of Gene Expression in 

Acute Myeloid Leukemia  


 

Abstract 

In 2018 alone, an estimated 20,000 new acute myeloid leukemia (AML) patients were
diagnosed in the United States, and over 10,000 of them are expected to die from the
disease. AML is primarily diagnosed among the elderly (median 68 years old at
diagnosis). Prognoses have significantly improved for younger patients, but among patients
older than 60 years, as many as 70% will die within a year of diagnosis. In
this study, we conducted a stratified computational meta-analysis of 2,213 acute myeloid
leukemia patients compared to 548 healthy individuals, using curated publicly available
data. We carried out analysis of variance of normalized batch-corrected data, including
considerations for disease, age, tissue and sex. We identified 964 unique differentially
expressed (DE) genes (974 DEPS) and 4 associated significant pathways involved in
AML. Additionally, we identified 69 unique sex-dependent (70 DEPS) and 372
age-dependent DE gene signatures relevant to AML. Finally, we used a machine learning
model (KNN model) to classify AML patients versus healthy individuals, achieving >
90% accuracy. Overall, our findings provide a new reanalysis of public datasets
that enabled the identification of potential new gene sets relevant to AML, which can
potentially be used in future experiments and possible stratified disease diagnostics.

 

 

A version of this chapter and results has been submitted for publication. 

 

 

 


 

Introduction 

Acute  myeloid  leukemia  (AML)  is  a  heterogeneous  malignant  disease  of  the 

hematopoietic  system  myeloid  cell  lineage.  AML  is  best  characterized  by  the  terminal 

differentiation  in  normal  blood  cells  and  excessive  production  and  release  of  cells  at 

various stages of incomplete maturation (leukemia cells). As a result of this faster than
normal and uncontrolled growth of leukemia cells, healthy myeloid precursors involved
in hematopoiesis are suppressed, which ultimately can lead to death within months of
diagnosis if untreated1,2. AML accounts for 70% of myeloid leukemia and nearly 80% of
acute leukemia cases, making it the most common form of both myeloid and acute
leukemia2,3. The number of new AML cases is increasing each year: in 2018 alone, there
were an estimated 20,000 newly diagnosed AML patients, over 10,000 of whom
are expected to die from the disease4.

 

AML can occur in people of all ages but is primarily diagnosed among the elderly (>60 

years),  with  a  median  age  of  68  year  at  diagnosis4.  Recent  advances  in  AML  biology 

expanded  our  understanding  of  its  complex  genetic  landscape  and  led  to  significant 

improvement in prognoses and therapeutic strategy for younger patients5,6. However, in 

patients older than 60 years old, prognoses remain grim and therapeutic strategy has been 

nearly the same for more than 30 years2,5-8. Approximately 70% of patients 65 years of 

age or older die within one year from diagnosis9. While it is apparent that the nature of 

AML changes with age, little is still known about the extent of these associations and

how they vary with patients’ age6,10,11. Taking age into consideration in

the identification of changes in AML global gene expression can lead to improved early 

diagnosis and improvement in treatment approaches for elderly patients. 

 

AML  prognosis  is  highly  dependent  on  cytogenetic  analysis  since  chromosome 

abnormalities, including translocations, deletions, duplications, and inversions, occur

frequently in AML6. Cytogenetic analysis, as a prognostic approach, is used to classify

AML patients who carry distinctive chromosomal abnormalities into favorable,

intermediate, or unfavorable risk groups. However, approximately 50% of AML patients

lack genetic abnormalities and present a normal karyotype12-15. In the last decade, many

frequently mutated genes in AML were identified, including NPM1, CEBPA, RUNX1, and

FLT3, and many studies have reported sets of genes and gene panels that can be used to

improve the prognostication of AML8,12,13. However, the impact of these findings on

AML prognosis in current clinical practice is still unclear8. 

 

Gene expression profiling is a powerful prognostic method for the detection of changes in 

gene  expression  due  to  genetic  abnormalities,  gene  fusion  and/or  mutations  in  AML 

patients. In the past, gene expression biomarkers were used to classify myeloid leukemia 

as  compared  to  lymphoid  leukemia  including  many  subtypes  within  each  of  the  two 

diseases16-18. Multiple gene expression analyses of AML have been carried out; 25 of

these have been systematically compared by Miller and Stamatoyannopoulos19, who

analyzed information on 4,918 genes and identified 25 genes reported across multiple

studies, with potential prognostic features. In this study, we performed a comprehensive

meta-analysis of 2,213 acute myeloid leukemia patients and 548 healthy subjects using 34

publicly available gene expression microarray datasets (following strict inclusion criteria) 

to identify disease, sex- and age-related gene expression changes and signaling pathways 

associated with AML. We identified sex- and age-related gene expression signatures that 

show  similar  alteration in  gene  expression  levels  and  associated  signaling  pathways  in 

AML and have used our results (gene sets) to predict AML or healthy status. We believe 

that our results may lead to improved AML early detection and diagnostic testing with 

target genes, as well as the identification of new targets for treatment with mechanisms of 

action  different  from  those  used  in  conventional  chemotherapy.  To  our  knowledge, 

given the body of published AML gene expression studies, our approach of joining

multi-study gene expression datasets for meta-analysis to identify disease-, sex- and age-

related signatures in AML has not been implemented before. 

 

Results 

•  Data curation and gene expression preprocessing 

By navigating the Gene Expression Omnibus (GEO) public repositories according to our 

systematic  workflow  and  inclusion  criteria  (Fig.  15a&b),  34  age-annotated  gene 

expression  datasets  from  32  different  studies  covering  2,213  AML  patients  and  548 

healthy  individuals  were  curated  and  selected  for  gene  expression  meta-analysis  and 

functional pathway enrichment analysis. Table 5 provides a description of each dataset

with  a  sub-table  summary  of  all  curated  data  used  in  our  current  study.  After  pre-

processing  each  individual  data  set  separately  according  to  Figure  15b,  we  performed 

analysis on 44,754 probe sets which were common across all samples (arrays). 
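
The restriction to probe sets shared by every array can be illustrated with a small set intersection. This is a minimal sketch only: the function name `common_probe_sets`, the dataset labels, and the probe-set identifiers below are illustrative placeholders, not the curated data or the actual pre-processing code used in this study.

```python
from functools import reduce


def common_probe_sets(datasets):
    """Return the sorted probe-set IDs that are present in every dataset.

    `datasets` maps a dataset label to an iterable of probe-set IDs.
    """
    id_sets = [set(probes) for probes in datasets.values()]
    if not id_sets:
        return []
    # Intersect all ID sets so that only probe sets measured everywhere remain.
    return sorted(reduce(set.intersection, id_sets))


# Toy input (made-up labels and IDs, for illustration only).
toy = {
    "dataset_A": ["1007_s_at", "1053_at", "117_at"],
    "dataset_B": ["1007_s_at", "117_at", "121_at"],
    "dataset_C": ["117_at", "1007_s_at"],
}
shared = common_probe_sets(toy)
print(shared)
```

Keeping only the intersection trades coverage for comparability: any probe set absent from a single platform is dropped for all samples.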

 

•  Classification of missing metadata annotation 

After the data curation step, 805 arrays (802 AML and 3 healthy) of the 2,761 curated arrays

were missing sex annotation, and 737 arrays (all AML patients) were missing

information regarding sample source (i.e., tissue, either bone marrow [BM] or peripheral

blood [PB] annotation). Classifying the missing annotations for these arrays (1,542

in total) was essential in our study to increase the sample size and statistical power20.

To predict the missing sex and sample source meta-data, we trained and validated various 

machine  learning  supervised  models,  including  logistic  regression  (LR)  and  k-nearest 

neighbor (KNN) classification models. The models were trained and verified using our 

annotated preprocessed expression data. Model training, the parameters used in training, and

validation for this analysis are discussed in the method section. Results from model

training,  including  confusion  matrix,  model  accuracy,  and  error  can  be  viewed  in 

Supplementary Table S1 online and results from classification for missing annotation are 

presented in Supplement file 1&2. 
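
To make the k-nearest neighbor idea concrete, the sketch below predicts a missing sex label from the labels of the closest annotated samples. It is an illustration under stated assumptions, not the model trained in this study: the two features (one hypothetical Y-linked and one hypothetical X-linked probe intensity), the toy values, and the choice of k = 3 with Euclidean distance are all made up for the example; the parameters actually used are described in the method section.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest
    training samples, using Euclidean distance."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy annotated arrays: (Y-linked probe, X-linked probe) intensities.
# All numbers are invented for illustration.
train_x = [(9.1, 3.0), (8.7, 3.4), (9.4, 2.8), (3.2, 9.0), (2.9, 8.6), (3.5, 9.3)]
train_y = ["male", "male", "male", "female", "female", "female"]

# Arrays with missing sex annotation.
print(knn_predict(train_x, train_y, (8.9, 3.1)))
print(knn_predict(train_x, train_y, (3.0, 8.8)))
```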

 

•  Batch correction 

Our pre-processed data, AML and healthy, were subjected to “dataset-wise correction”

(for details, see the method section, sub-section “Dataset-wise correction for batch

effects”). We used ComBat21 to correct for

confounding batch effects. Our datasets used in this study did not include within-study 

healthy  controls,  which  would  limit  variance  analysis,  and  the  ability  to  separate 

biological from batch effects. To address this, we implemented an iterative batch effect 

correction approach, essentially employing a weight-based method for correcting batch 

effects. Assuming the batch effect due to each dataset is a function of the number of

samples in the dataset (weight), normalizing sets of unevenly sized datasets may lead to

unbalanced batch correction. We used 5 additional datasets as a reference set, which we

refer to as “covariate” hereafter. Each of the covariate datasets included within-study

healthy controls. All 5 datasets together consisted of a total of 613 arrays (455 AML and

158 healthy) (Table 5), and were pre-processed exactly as our curated datasets. These were

used together with each of the remaining datasets to batch correct each dataset with

respect to the covariate reference using ComBat22. After this dataset-wise correction,

covariate datasets were removed, and our expression data were clustered using principal 

component  analysis  (PCA)  to  visually  examine  the  effect  of  covariate  datasets  on 

distributing  the  batch  weight  during  batch  correction.  Figure  16  shows  the  results  for 

batch effects correction, including batch corrected data without covariate datasets (Fig. 

16 a&b), as well as batch corrected data with covariate datasets (Fig. 16 c&d). 
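
The control flow of the dataset-wise correction (each dataset corrected against the same reference, with the reference dropped afterwards) can be sketched as follows. This sketch deliberately does not implement ComBat: `shift_to_reference` is a hypothetical stand-in that only aligns per-probe means, and a plain mean shift ignores the disease covariate that a real batch correction must preserve. All names and numbers are assumptions made for illustration.

```python
from statistics import mean


def shift_to_reference(dataset, reference):
    """Stand-in for a batch-correction call (NOT ComBat): shift each probe so
    that the dataset mean matches the reference mean for that probe."""
    corrected = {}
    for probe, values in dataset.items():
        offset = mean(reference[probe]) - mean(values)
        corrected[probe] = [v + offset for v in values]
    return corrected


def dataset_wise_correction(datasets, reference, correct=shift_to_reference):
    """Correct every dataset against the same reference, one dataset at a
    time, and return only the corrected datasets (the reference is dropped)."""
    return {name: correct(data, reference) for name, data in datasets.items()}


# Toy values: probe -> expression values per array (invented numbers).
reference = {"probe_1": [5.0, 7.0], "probe_2": [2.0, 4.0]}
datasets = {
    "study_A": {"probe_1": [1.0, 3.0], "probe_2": [6.0, 8.0]},
    "study_B": {"probe_1": [10.0, 12.0], "probe_2": [0.0, 2.0]},
}
corrected = dataset_wise_correction(datasets, reference)
print(corrected)
```

Passing the correction step as the `correct` argument keeps the loop independent of the method, so an actual ComBat call could be substituted without changing the iteration.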

•  Analysis 1: Gene expression meta-analysis and enrichment analysis of AML 

disease state compared to healthy individuals 

 

 

o  Gene expression meta-analysis of AML disease state 

Following  batch  correction,  we  performed  an  analysis  of  differential  gene  expression 

(DGE) on 34 data sets including 2,213 AML patients and 548 healthy controls. Analysis 

of  Variance  (ANOVA)23-25  was  performed  according  to  a  linear  model  (see  method 

section “Meta-analysis”). A total of 974 statistically significant differentially expressed probe sets

(DEPS) (with genes corresponding to 964 unique gene symbols) for AML versus healthy 

were selected based on a Bonferroni26 adjusted p-value < 0.01 (accounting for multiple 

hypothesis testing), in conjunction with a two-tailed 5% quantile selection27 based on the 

mean  difference  distribution  between  AML-healthy  group  comparisons  (post-hoc 

analyses using Tukey’s Honestly Significant Difference (HSD)). The heatmap (Fig. 17a) 

shows the gene expression with hierarchical clustering of the 974 DEPS, including 487 

up- and 487 down-regulated with respect to AML as compared to healthy. The clustering 

did  not  reveal  any  sub-clustering  or  structure  indicative  of  a  grouping  or  possible 

classification in the AML subjects (that would also be suggestive of necessary additional 

blocking design for a per-class analysis). From this analysis, WT1 (Wilms tumor 1), with a

mean difference of 0.26 and adjusted p-value < 4.11E-11, was the most up-regulated DE

gene, while CRISP3 (cysteine-rich secretory protein 3), with a mean difference of -0.52 and

adjusted p-value < 4.11E-11, was the most down-regulated DE gene. Figure 17b shows the top 10 up- and

down-regulated DEPS with corresponding gene symbols that resulted from this analysis

(also listed in Table 6, including mean differences and Bonferroni-adjusted p-values from

post-hoc analysis using Tukey’s HSD tests). The entire list of all 974 DEPS can be found 

as Supplementary Table S2 online. 
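
The two-part selection rule described above (a Bonferroni-adjusted p-value cutoff combined with a quantile cutoff on the mean differences) can be sketched as below. This is a hedged illustration rather than the analysis code: the function names are hypothetical, the toy values are invented, and `tail_fraction` is treated here as the fraction kept in each tail, which is an assumption about how the "two-tailed 5% quantile" was applied.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def select_deps(mean_diffs, p_values, alpha=0.01, tail_fraction=0.05):
    """Return indices of probe sets that pass both filters:
    Bonferroni-adjusted p-value < alpha, and a mean difference lying in the
    lower or upper tail of the mean-difference distribution."""
    adjusted = bonferroni_adjust(p_values)
    ranked = sorted(mean_diffs)
    n_tail = max(1, round(len(ranked) * tail_fraction))
    low_cut, high_cut = ranked[n_tail - 1], ranked[-n_tail]
    return [
        i
        for i, (diff, p_adj) in enumerate(zip(mean_diffs, adjusted))
        if p_adj < alpha and (diff <= low_cut or diff >= high_cut)
    ]


# Toy data for 20 probe sets (invented numbers).
mean_diffs = [-0.40, 0.30, 0.01] + [0.0] * 17
p_values = [1e-12, 1e-12, 1e-12] + [0.5] * 17
selected = select_deps(mean_diffs, p_values)
print(selected)
```

In the toy data the third probe set has a very small p-value but a negligible mean difference, so it is removed by the quantile filter even though it passes the p-value filter.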

 

o  Gene enrichment analysis of AML disease state DE genes

To identify signaling pathways associated with DEPS in AML, gene enrichment analysis was

performed on all 974 DEPS combined. Signaling pathways from the Kyoto Encyclopedia

of Genes and Genomes (KEGG)28-30 and Gene Ontology (GO) terms31,32 were tested

for over-representation of biological function using the Database for Annotation,

Visualization and Integrated Discovery (DAVID)33,34. Using Benjamini and Hochberg35

adjusted  p-value  <  0.05,  4  KEGG  signaling  pathways  were  identified,  including 

Hematopoietic  cell  lineage,    Cell  cycle,  p53  signaling  pathway,  and    Transcriptional 

misregulation in cancer (Fig. 17c). The 4 KEGG signaling pathways and associated DE 

genes  are  summarized  in  Table  7,  including  unadjusted  p-values  and  Benjamini  and 

Hochberg35  adjusted  p-values.  56  DEPS  including  27  up-  and  29  down-regulated  were 

associated with these signaling pathways, and the heatmap of their mean differences is 

shown in Figure 17d. Additionally, 61 DEPS were associated with 6 other KEGG signaling

pathways known to be involved in AML (Fig. 17e). From our gene enrichment analysis 

for overrepresented biological GO terms, 21 GO terms were statistically significant with 

727  DEPS  (335  up-  and  392  down-regulated).  GO  terms  included  protein  and 

microtubule  binding  for  the  molecular  function  (MF)  category,  inflammatory  and 

immune  responses,  mitotic  nuclear  division,  and  cell  proliferation  response  for  the 

biological process (BP) category, and finally, cytoplasm, extracellular exosome, cytosol, 

extracellular  space,  integral  component  of  plasma  membrane  immune  response,  and 

others,  for  the  cellular  component  (CC)  category  (Fig.  17f).  The  entire  list  of  our 

enrichment analysis results (statistically significant over-representation in KEGG and GO 

terms) can be found as Supplementary Table S3 online. 
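
For readers unfamiliar with the Benjamini and Hochberg adjustment applied to the enrichment p-values, the step-up procedure can be sketched as follows. This is a generic textbook implementation with invented p-values; it is not DAVID's code, and the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy enrichment p-values for four pathways (invented numbers).
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q < 0.03]
print(adj, significant)
```

Unlike the Bonferroni correction used for the probe-set selection, this procedure controls the false discovery rate, which is less conservative when many terms are tested.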

 

•  Analysis 2: Gene expression meta-analysis and enrichment analysis of sex- 

and age-related DE genes in AML 

Further analyses of gene expression and pathway enrichment were conducted to

characterize sex- and age-specific gene expression changes in AML patients compared to

healthy individuals: Analysis 2a, “Sex-relevance differential gene expression

meta-analysis and associated signaling pathways in AML”, and Analysis 2b,

“Age-dependent differential gene expression meta-analysis and associated signaling

pathways in AML”. We used the same filtering criteria in both analyses as those used in

analysis 1 for significant DE genes and signaling pathways between AML patients and 

healthy controls. In addition, DE genes were regarded as statistically

significant (up- or down-regulated) for each factor, sex and age, if they displayed a

p-value from Tukey’s HSD below the Bonferroni-corrected threshold of 2.2 x 10^-7 (= 0.01/44,754 probe sets used). 
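
The threshold quoted above follows directly from dividing the significance level by the number of probe sets tested. As a worked check, with $\alpha = 0.01$ and $m = 44{,}754$ (both taken from the text; the symbol names are ours):

```latex
\alpha_{\text{per test}} \;=\; \frac{\alpha}{m} \;=\; \frac{0.01}{44{,}754} \;\approx\; 2.23 \times 10^{-7},
\qquad
p_{\text{adj}} \;=\; \min(1,\; m\,p) \;<\; 0.01
\;\Longleftrightarrow\;
p \;<\; \frac{0.01}{m}.
```

The two forms are equivalent: comparing a raw p-value with 2.2 x 10^-7 selects the same probe sets as comparing the Bonferroni-adjusted p-value with 0.01.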

 

o  Analysis  2a.  Sex-relevance  differential  gene  expression  meta-analysis  and 

associated signaling pathways in AML 

Gene expression meta-analysis was also used to identify DEPS that show sex relevance 

with respect to male AML patients as compared to female AML patients. In total, 266 DEPS were

regarded as statistically significant (p-value < 2.2 x 10^-7). A list of all 266 DEPS (including

up-  and  down-regulated,  gene  title  and  symbol,  male-female  mean  difference,  and 

Bonferroni corrected p-value) can be found as Supplementary Table S4 online. 70 DEPS 

with 69 unique DE genes were found to overlap between analysis 1 (AML disease state) 

and  analysis  2a  (Fig.  18a).  Figure  18b  shows  these  70  DEPS  with  gene  symbol 

annotations, and their mean difference values in the heatmap, which displays differences

in significance for DEPS common to both analyses 1 and 2a. The top 10 up- and

down-regulated DEPS from this analysis are shown in Figure 18c. Figure 18d shows the gene

expression  heatmap  with  a  hierarchical  clustering  of  the  70  DEPS  (rows)  on  sex  and 

disease  state  of  all  2,213  AML  and  548  healthy subjects  (columns)  indicated  by  color 

bars above the heatmap. 

 

For enrichment analysis, we searched for common DEPS between the 70 DEPS from this 

meta-analysis  and  the  974  DEPS  from  AML  disease  state  meta-analysis,  for  KEGG 

pathways and GO terms. Four sex-relevant DE genes (3 up- and 1 down-regulated) were

found in 3 different signaling pathways (Table 8). Up-regulated genes and their

pathway memberships included FLT3 and CD34 in Hematopoietic cell lineage, FLT3 in

Transcriptional misregulation in cancer, and PMAIP1 in p53 signaling pathway; the

down-regulated gene was MS4A1 in Hematopoietic cell lineage (Fig. 18e). 

 

Figure 18f shows GO analysis results, where 15 overrepresented biological GO terms

were overlapped, including terms GO:0005615~extracellular space,

GO:0006955~immune response, GO:0005515~protein binding, GO:0005819~spindle,

and GO:0030496~midbody. The entire list of our enrichment analysis (statistically

significant KEGG and GO terms) can be found as Supplementary Table S3. 

 

o  Analysis  2b.  Age-dependent  differential  gene  expression  meta-analysis  and 

associated signaling pathways in AML 

Here we use the term “age-group” to indicate AML patients and healthy individuals in the

same age range assigned in our study. The subjects were binned in 8 groups: 0-19, 20-29, 

30-39, 40-49, 50-59, 60-69, 70-79, and 80-100 years old. From this meta-analysis, 1,395

probe sets across all age-groups were identified as statistically significant (Bonferroni

adjusted p-value < 2.2 x 10^-7) (Supplementary Table S5). Of these, 372 DE unique age-dependent

genes (375 DEPS) were found to overlap with the 964 DE unique genes (974

DEPS) from our AML disease state meta-analysis (Fig. 19a), with 137 up- and 238

down-regulated. The entire list of 375 DEPS can be found as Supplementary Table S6

online. Figure 19b shows the heatmap of the 375 DEPS (rows) across 18 age-groups

(columns) that were deemed statistically significant according to Bonferroni adjusted

p-value < 2.2 x 10^-7 (these age-groups include all 2,761 arrays (2,213 AML patients and

548 healthy individuals)). The top 10 up- and down-regulated DE genes from this

analysis  are  shown  in  Figure  19c.  Figure  19d  shows  75  DE  genes  identified  to  have 

appeared specifically in one age-group. 
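
The binning of subjects into the eight age-groups listed above can be sketched as a simple lookup. This is an illustrative sketch under two assumptions that are ours, not the study's: ages are recorded in years and fractional ages are truncated to whole years, and the function and constant names are hypothetical.

```python
from collections import Counter

# The eight age ranges (inclusive, in years) stated in the text.
AGE_BINS = [(0, 19), (20, 29), (30, 39), (40, 49),
            (50, 59), (60, 69), (70, 79), (80, 100)]


def age_group(age):
    """Return the age-group label (e.g. "60-69") for an age in years."""
    years = int(age)  # assumption: truncate fractional ages to whole years
    for low, high in AGE_BINS:
        if low <= years <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age!r} is outside the 0-100 range used here")


# Toy ages (invented) and the resulting group sizes.
toy_ages = [5, 19, 20, 68, 68, 83]
group_sizes = Counter(age_group(a) for a in toy_ages)
print(group_sizes)
```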

 

To  investigate  further,  pairwise  correlations  between  age-groups  were  computed  (Fig. 

19e). The “0 to 19” age-group was used as a common comparison reference with respect 

to other groups (Fig. 19f). Using this “0 to 19” group as a baseline, Figure 19g shows the 

mean difference of 25 genes that are DE with respect to the “0 to 19” baseline across all 

other groups; the mean difference values between AML and healthy are shown in the

right-most column of Figure 19a-g for reference. Using the KEGG signaling pathway

results from analysis 1, Figure 19 shows 17 DE genes identified in all 4

KEGG pathways according to age groups (also listed in Table 9). 
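
The pairwise comparison of age-groups mentioned above can be illustrated by correlating the per-gene mean-difference profiles of each pair of groups. The text does not state which correlation coefficient was used, so Pearson correlation is an assumption here; the function names and the toy profiles are likewise invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pairwise_correlations(profiles):
    """Correlate every pair of age-group profiles (group -> values per gene)."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }


# Toy mean-difference profiles over three genes (invented numbers).
profiles = {
    "20-29": [1.0, 2.0, 3.0],
    "30-39": [2.0, 4.0, 6.0],
    "40-49": [3.0, 2.0, 1.0],
}
correlations = pairwise_correlations(profiles)
print(correlations)
```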

 

o  Age-dependent genes analysis for drug to gene interaction 

We carried out two analyses in the Drug Gene Interaction Database (DGIdb)36 to identify

druggable genes or gene products from our results with known drug-gene interactions

(where available) for potential therapeutic use in AML. According to DGIdb, druggable

genes are defined as “genes or gene products that are known or predicted to interact with 

drugs, ideally with a therapeutic benefit to the patient”. The two analyses were performed 

in DGIdb using two different gene sets: (i) 25 DE genes common across age-groups relative to the baseline (0 to

19) age-group (Fig. 19g) and, (ii) 75 genes identified to be specific to one age-group (Fig. 

19d).  Table  10  lists  all  the  genes  and  their  corresponding  associated  druggable  gene 

categories. 

 

Discussion 

According  to  the  2016  World  Health  Organization  (WHO)  newly  revised  myeloid 

neoplasms and acute leukemia classification system37, AML prognostic

classification is highly dependent on the presence of chromosomal abnormalities,

including chromosomal deletions, duplications, translocations, inversions, and gene

fusions. AML is mostly diagnosed through microscopic, cytogenetic, and molecular

genetic analyses of patients’ blood and/or bone marrow samples. Microscopic

examination is used to detect distinctive features (e.g., Auer rods) in cell morphology,

cytogenetic analysis to identify chromosomal structural aberrations (e.g., t(8;21), inv(16),

t(16;16), or t(9;11)), and molecular genetic analysis to identify gene fusions (e.g.,

RUNX1-RUNX1T1 and CBFB-MYH11) and mutations in genes frequently mutated in

AML (e.g., NPM1, CEBPA, RUNX1, FLT3)8,12,13. These cytogenetic and molecular

genetic analyses are used to identify prognostic markers that can be used to classify AML

patients into three risk categories: favorable, intermediate, and unfavorable. The largest

group of AML patients (almost 50%), however, presents a normal karyotype and lacks

genetic abnormalities12-15. These patients are classified as intermediate risk, and often

have heterogeneous clinical outcomes with standard therapy, with a risk of AML relapse38. 

Additionally, AML prognosis worsens as age increases, and older patients respond less to 

current  treatments  with  poorer  clinical  outcomes  than  their  younger  counterparts5,39. 

Further complicating matters, AML has multiple driver mutations and competing clones that

evolve  over  time,  making  it  a  very  dynamic  disease40,41.  Identifying  differentially 

expressed  genes  and  associated  signaling  pathways  based  on  our  analysis,  that 

incorporates disease state, sex- and age-dependent meta-analysis, can provide global gene 

expression signatures, which collectively can potentially serve as sex- and age-dependent 

biomarkers for AML prognosis compared to healthy. 

 

In the present study, we aimed to establish disease, sex-linked, and age-dependent

biomarkers from genes with similar alteration in gene expression level and associated

signaling pathways in AML. Using microarray gene expression data combined

with various machine learning models, our biomarkers were indicative of a

prognostic signature for AML prediction compared to healthy with > 90%

accuracy. We took advantage of 34 publicly available microarray gene expression data

sets covering 2,213 AML patients and 548 healthy individuals to identify changes in

AML  gene  expression  associated  with  disease  state  (AML  compared  to  healthy),  sex-

linked  (male  compared  to  female),  and  age-dependent  (across  age-groups  compared  to 

baseline). We performed 3 differential gene expression and gene enrichment analyses:  

 

Analysis  1:  Gene  expression  meta-analysis  and  associated  signaling  pathways  of 

AML  disease  state  compared to healthy  individuals,  was  carried  out  to  identify  DE 

genes in AML disease state, followed by gene enrichment analysis on the identified DE 

genes to find signaling pathways associated with AML. The results from this analysis were

then used as a baseline indicator for AML disease state.  

 

Analysis 2a: Sex-dependent gene expression meta-analysis and associated signaling 

pathways  in  AML  compared  to  healthy  individuals,  was  performed  to  explore  the 

relevance  of  patients’  sex  on  gene  expression  and  to  identify  sex-linked  genes  and 

associated signaling pathways in AML.  

 

Analysis 2b: Age-dependent gene expression meta-analysis and associated signaling 

pathways  in  AML  compared  to  healthy  individuals,  was  carried  out  to  identify 

a common set of age-dependent genes and associated signaling pathways and to explore

age-dependent trends in gene expression in AML. 

 

•  Analysis 1 discussion: Gene expression meta-analysis of AML disease state 

From our meta-analysis for AML disease state, 964 DE unique genes (974 DEPS) (487

overexpressed and 487 underexpressed) were identified as significantly differentially

expressed between AML patients and healthy individuals (Bonferroni adjusted p-value <

0.01). Among these, 6 genes are known to be involved in AML functional pathways,

including 4 up-regulated, JUP (junction plakoglobin), CCNA1 (cyclin A1), FLT3 (fms-

related  tyrosine  kinase  3),  PIK3R1  (phosphoinositide-3-kinase,  regulatory  subunit  1 

(alpha)),  and  2  down-regulated,  CD14  (CD14  molecule),  CEBPE  (CCAAT/enhancer 

binding protein (C/EBP), epsilon). The top 10 up- and down-regulated genes from this 

analysis are listed in Table 6 with their respective Tukey’s HSD mean differences and

Bonferroni-adjusted p-values. As shown in Figure 17b for the top 10 up- and

down-regulated DEPS, WT1 (Wilms tumor 1) was found to be the most over-expressed and

CRISP3 (cysteine-rich secretory protein 3) the most under-expressed gene. From our

gene enrichment analysis for overrepresented biological GO terms, WT1 is associated

with protein binding and cytoplasm in the molecular function (MF) and cellular

component (CC) categories, respectively. WT1 is a transcriptional regulatory protein

essential to cellular development and cell survival, and it has been known to be highly 

expressed with an oncogenic role in AML42,43, in agreement with our findings. However, 

CRISP3’s  direct  role  in  AML  is  still  under  investigation.  CRISP3  is  a  member  of  the 

cysteine-rich secretory protein (CRISP) family with a major role in the female and male

reproductive tracts, and is mainly expressed in the salivary gland and bone marrow44.

In 2017, 80 genes were reported as “extracellular matrix specific genes” in

leukemia, and CRISP3 was among the down-regulated DE genes reported45. In GO terms

from  our  gene  enrichment  analysis,  CRISP3  is  associated  with  the  extracellular  space, 

specific granule, and extracellular exosome GO terms, all in the cellular component (CC) 

category.  These  findings  suggest  that  CRISP3  could  be  a  potential  candidate  as  a 

prognostic  biomarker  in  AML.  CRISP3  associations  with  these  cellular  components  in 

AML have not been previously reported, to the best of our knowledge, and merit further

investigation. 

 

The enrichment analysis for overrepresented biological GO terms of the 974 DEPS (up- 

and  down-regulated  combined)  is  shown  in  Figure  17f.  727  DEPS  (335  up-  and  392 

down-regulated) were enriched for 21 GO terms. 592 of which (257 up- and 335 down-

regulated)  were  enriched  in  the  cellular  component  (CC)  category  and  were  mainly 

associated  with  cytoplasm,  extracellular  exosome,  cytosol,  and  extracellular  space. 

The relatively high number of DEPS associated with these GO

terms might reflect the bone marrow or immunosuppressive microenvironment, which is

integral to AML development and progression46,47. In the biological process (BP)

category, GO terms were associated with inflammatory and immune responses, and cell

proliferation. This is reflective of AML characteristics. AML is characterized by terminal 

differentiation  of  normal  blood  cells  and  excessive  proliferation  and  release  of 

abnormally  differentiated  myeloid  cells.  This  faster  than  normal  cell  proliferation  and 

uncontrolled growth leads to accumulation of genetic abnormalities that very likely can 

affect many signaling pathways essential to the immune system.  

 

Figure  17c  shows  the  four  statistically  significant  KEGG  pathways  identified  in  the 

pathway enrichment analysis with the number of DEPS associated with each pathway, which

encompassed  56  DE  unique  genes  (Table  7).  Specifically,  Figure  17c  indicates  that 

Transcriptional misregulation in cancer was the most up-regulated pathway in AML (13 

up-regulated DE genes), while Hematopoietic cell lineage, and Cell cycle pathways were 

mostly  down-regulated,  and  the  p53  signaling  pathway  was  balanced  in  terms  of 

up/down-regulated DE genes. For the enriched pathways, Figure 17d shows the mean

difference values of the 56 DE pathway-associated genes, including 27 up- and 29

down-regulated. In addition, 61 DEPS from our AML disease state meta-analysis were

associated with 6 other KEGG signaling pathways that are known to be involved in AML

(Fig.  17e).  All  these  10  KEGG  pathways  are  known  to  be  involved  in  tumorigenesis. 

Additionally, the majority of the DE genes from the AML meta-analysis associated with the

identified  signaling  pathways  are  known  to  be  abnormally  expressed  in  AML.  These 

findings are consistent with findings from other studies and our current understanding of 

AML pathogenesis. 

 

•  Analysis 2a discussion 

To identify DE genes associated with sex in AML, we used post-hoc Tukey’s HSD tests 

for  comparison  between  male  and  female  subjects.  A  total  of  266  genes  were  found 

statistically significant in this analysis. Seventy of these were also found to overlap with the

DE genes from analysis 1, AML disease state meta-analysis (Fig. 18a&b). Figure 18c

shows the top 10 up- and down-regulated DE genes with respect to sex: DDX3Y

(DEAD-Box Helicase 3 Y-Linked), EIF1AY (Eukaryotic Translation Initiation Factor 1A

Y-Linked), KDM5D (Lysine Demethylase 5D), and RPS4Y1 (Ribosomal Protein S4 Y-Linked

1) were the most up-regulated genes, and XIST (X Inactive Specific Transcript), TSIX

(TSIX Transcript, XIST Antisense RNA), and PRKX (Protein Kinase X-Linked) were

the most down-regulated genes. These genes are known to be sex-specific and show such

differences and sex separation within the AML and the healthy groups respectively (Fig. 

18d).  The  role  of  these  genes  as  positive  controls  in  studies  with  AML  needs  to  be 

investigated further. We also report genes with known sex and AML associations that were statistically

significant in our analysis, including FLT3 and MAL. 

 

 

 

 

•  Analysis 2b discussion 

The age-dependent meta-analysis in AML using ANOVA identified 1,381 genes as

statistically significant based on Bonferroni adjusted p-value <0.01. We then evaluated 

the overlap of DE genes from this analysis to our findings of 964 DE unique genes (974 

DEPS) in AML disease state (analysis 1) to identify age-related DE genes in AML (Fig. 

19a). We identified an overlap of 372 DE unique age-dependent genes (375 DEPS),

including 137 up- and 238 down-regulated, between age-related genes and AML-associated

genes (Bonferroni adjusted p-value < 0.01). Figure 19c shows the top 10 most and

least expressed age-associated genes in AML according to the mean difference values

computed using Tukey’s HSD in seven age-groups, including their corresponding values

from AML disease state in column “AML - healthy” for comparison. Interestingly,

CRISP3 (cysteine-rich secretory protein 3) was among the down-regulated genes

specifically associated with younger age groups, 20 to 49 years of age, as compared to 0

to 19 years old. These findings, together with our previous finding, are suggestive of

CRISP3’s role in AML as well as its association with certain age-groups. Figure 19b

also  shows  a  number  of  up-regulated  genes  known  to  be  involved  in  AML,  including 

HOXA3, HOXA5 and HOXA10-HOXA9, which belong to the homeobox genes (HOX) 

family  of transcription  factors,  essential  to  embryonic  development  and  hematopoiesis, 

and associated with chromosomal translocations and over-expression in

AML48,49. Interestingly, ORM1 (Orosomucoid 1) was deemed significantly

down-regulated in both of our previous analyses. In fact, in analysis 1, ORM1 was

among the top 10 most under-expressed genes, and was also among the 70 DEPS from

analysis 2a. These results suggest that ORM1’s role in AML is independent of sex or age. 

ORM1’s direct role in AML also merits further investigation, given ORM1’s involvement

in immunosuppression and inflammation50.  

 

Of the 25 DE genes found using the “0 to 19” age-group as a baseline (Fig. 19g), 15

genes were categorized as “druggable genome” in our DGIdb36 analysis (i), including

TFF3, ORM1, CA4, CYP4F2, CYP4F3, CEACAM1, FLT3, CHIT1, OLR1, KCNJ15,

CAMP, CRISP2, CAPN3, SLC37A3, and FCRL1 (Table 10). Of these 15 genes, CA4

(carbonic anhydrase 4) showed an interaction record with the drug Topiramate, which is an

anticancer drug known to act as an inhibitor of CD451. Additionally, we have identified

75 statistically significant DE genes that show association with only one age-group,

to the exclusion of all other age-groups, suggestive of potential age-specific DGE. Finally,

in our DGIdb36 analysis (ii) of the 75 DE genes, 24 genes were categorized as

“druggable genome”, including CDH1, GPX3, CD14, DYRK2, SLPI, CCNA2, TGFBR3,

UGCG,  FCN1,  GZMA,  TCN1,  BPI,  S100A12,  CDK6,  IL12A,  P2RY13,  ADGRG3, 

DNMT3B, GUCY1A3, FGFBP2, PTPRJ, LRRK2, BCL2L15, STYX (Table 10). 

 

In  summary,  our  study  successfully  integrated  multiple  datasets  to  perform  a  study  of 

gene  expression  in  AML,  across  multiple  factors  that  included  disease,  sex  and  age 

considerations, and identified interesting genes, both known and not previously reported 

as  differentially  expressed  in  each  factor.  We  identified  964  DE  unique  genes  (974 

DEPS) and 4 associated significant pathways involved in AML, and 69 DE unique sex-

relevant genes (70 DEPS) and 372 DE unique age-related genes (375 DEPS). Using the 

964  DE  genes,  a  KNN  model  allowed  for  classification  of  AML  patients  with  >90% 

accuracy. We hope that these findings may provide additional relevant targets for further 

experimental  mechanistic  studies,  and  to  help  identify  new  markers  and  therapeutic 

targets for AML. 

 

•  Future research possibilities and study limitations 

We note that our study identified multiple potentially significant DE genes associated

with age- and sex-related differences in AML as compared to healthy, as

discussed above for analyses 1 and 2. While our results and analyses have identified

important expression changes relevant to AML, and many potential new gene targets, we need to

acknowledge the limitations of our data: primarily, the analysis of AML and healthy

subjects involved bone marrow and blood samples, respectively, in each group. We tried

to account for this disparity in the tissues, by utilizing tissue (sample source) directly as a 

factor  in  our  linear  model,  and  including  its  binary  interactions  with  all  other  factors. 

Other limitations included an unbalanced AML/healthy ratio, as well as the lack of

in-study healthy controls. To address these, we attempted to account for batch effects using

a dataset-wise iterative batch correction transformation, as discussed in the Methods

section, in the sub-section titled “Dataset-wise correction for batch effects”.

Finally,  a  general  limitation  of  utilizing  publicly  available  data  is  the  lack  of  uniform 

annotation: the majority of the sample data provided have no information on

chromosomal abnormalities, AML classification, or retrospective outcomes.

While we accounted for lack of annotations for sex and tissue information using machine 

learning, which greatly increased our sample size, we would recommend stricter, more 

extensive reporting requirements for the metadata of data deposited in

public databases.

 

Our findings may generate further data-driven investigations, including i) associations

between age-groups and changes in gene expression across different AML subgroups to

help improve AML risk stratification, and ii) age-dependent pseudo time-series models to

identify changes in gene expression with more specific AML patient age and sex.

However, such analyses would require many more well-annotated samples, which are

currently unavailable, particularly given the heterogeneity of the disease, and we hope

new studies will address this in the future. Additionally, the use of microarray data is

limiting,  in  that  the  transcriptome  is  not  fully  probed.  The  availability  of  more  RNA-

sequencing  data  can  address  this  in  future  expression  analyses,  additionally  involving 

considerations of allele-specific expression or alternative splicing. Finally, we hope our 

study will be a resource for the AML research community, as a starting point for new

hypothesis-driven  investigations,  that  can  further  probe  the  mechanistic  details  of  the 

genes identified as involved in AML, including their possible use as prognostic markers. 

 

 


Methods 

The generalized workflow consisted of five main steps: i) Curation of microarray gene 

expression data, ii) Preprocessing of raw data files followed by batch effect correction, 

iii)  Predictions  of  missing  annotation  data  using  supervised  machine  learning,  iv) 

Differential gene expression analysis, and v) Gene enrichment for pathway analysis that 

includes  gene  annotation,  and  finally  gene  expression-based  prediction  of  AML  (Fig. 

15a). 

 

•  Gene expression data curation and screening criteria 

Datasets used in this study were selected from the GEO database, maintained by the

National Center for Biotechnology Information (NCBI)52

(https://www.ncbi.nlm.nih.gov/geo/). GEO is a public repository at the NCBI

that functions as a hub for the storage and retrieval of high-throughput gene expression datasets

to promote data sharing between researchers. To speed up the search and keep up to date

with new and relevant datasets as soon as they were released, we used a Python

script that utilized functions from the Entrez Utilities in Biopython53. The

script navigated the GEO public database and downloaded publicly available microarray

gene expression datasets. We additionally utilized Python packages, including Pandas,

NumPy, and Matplotlib, for data structures, numerical computing for data processing, and

data visualization, respectively. We used strict inclusion criteria to maintain consistency

in dataset selection: we screened for the availability of both raw data and metadata annotation

files, for human samples from untreated subjects, and for a sample source that

was bone marrow (BM) and/or peripheral blood (PB). Inclusion criteria and

the data curation workflow are illustrated in Figure 15a-b. 
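
The search step described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function names, the keyword, and the e-mail argument are hypothetical, and the query fields (`[ACCN]`, `[ETYP]`, `[suppFile]`, `[Organism]`) are assumed from NCBI's GEO DataSets search syntax. Biopython is imported only inside the function that contacts NCBI, so the query can be inspected without network access.

```python
def build_geo_query(platform="GPL570", organism="Homo sapiens",
                    keyword="acute myeloid leukemia"):
    """Compose a GEO DataSets (db="gds") search term.

    The term restricts results to series (GSE) on one platform that
    provide raw CEL files as supplementary data. All defaults are
    illustrative assumptions, not the study's actual search terms.
    """
    parts = [
        f'"{keyword}"',
        f'"{organism}"[Organism]',
        f"{platform}[ACCN]",
        "gse[ETYP]",
        "cel[suppFile]",
    ]
    return " AND ".join(parts)


def search_geo(query, email, retmax=500):
    """Return GEO DataSets record ids matching the query (needs network access)."""
    from Bio import Entrez  # lazy import: requires Biopython

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="gds", term=query, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return list(record["IdList"])


query = build_geo_query()
print(query)
```

Calling `search_geo(query, email="you@example.org")` would return the matching record ids, which could then be screened against the inclusion criteria before any download.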

 

•  Gene expression data sets used in our analysis 

For our analysis we included 34 age-dependent datasets from 32 different studies; 16

included AML patients and 18 included healthy subjects. Of the 34 datasets, 32 were

produced on Affymetrix GeneChip Human Genome U133 Plus 2.0 (GPL570) arrays and 2

were produced on Affymetrix GeneChip Human Genome U133 Array Set (GPL96 & GPL97)

arrays. Table 5 provides detailed information about each data set, including the number

of samples used from each dataset, sample tissue source, as well as the total number of

AML patients and healthy subjects. Two studies, GSE1241754 and GSE3764255-58, were

originally conducted on two different Affymetrix array types (GPL570 and GPL96 &

GPL97), so each was separated into two subgroups, and each subgroup was considered an

individual dataset in our meta-analysis. For dataset GSE12417, (i) subgroup 1 included 73

BM and 5 PB samples, and (ii) subgroup 2 included 160 BM and 2 PB samples. For dataset

GSE37642, (i) subgroup 1 included 140 BM samples and (ii) subgroup 2 included 422 BM

samples (Table 5).

 

•  Datasets annotation and preprocessing 

Figure 15b outlines the workflow of our preliminary data analysis, including

preprocessing. For each dataset used in our analysis, raw microarray CEL files were

downloaded from GEO, metadata was reviewed, and the data was manually curated to

guarantee that each array, which corresponded to either an AML patient or a healthy

individual, was verified and correctly annotated for sample source (BM or PB), platform 

technology  used,  age,  sex,  and  disease  state  (AML  or  healthy).  Raw  CEL  files  from 

individual datasets were individually pre-processed using the RMA (Robust Multi-Array 

Average) algorithm59-61. Datasets with mixed sample sources, i.e., both BM and PB, were

pre-processed  together  irrespective  of  sample  source.  Preprocessing  consisted  of 

correction  for  background  noise  using  RMA  background  correction  on  perfect  match 

(PM) raw intensities, quantile normalization to obtain the same empirical distribution of 

intensities  for  each  array,  median  polish  summarization  of  probes  into  probe  sets  to 

estimate  gene-level  expression  value,  and  logarithm  base-2  transformations  of  gene 

expression values to facilitate data interpretation (normal distributions) and comparisons 

between arrays. Additionally, our expression data were first reduced to the 44,754 probe sets

that are common to all datasets. Data sets were z-score standardized across

all  probe  sets  and  arrays.  Finally,  each  pre-processed  dataset  was  visualized  with  box-

whisker plots to ensure similar gene expression data distribution across all datasets. 
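
Two of the preprocessing steps above, quantile normalization and z-score standardization, can be illustrated with a short NumPy sketch. This is not the RMA implementation used in the study (background correction and median polish are omitted), the intensities are simulated, and ties are broken by position, which production implementations handle more carefully.

```python
import numpy as np


def quantile_normalize(x):
    """Give every column (array) the same empirical distribution.

    x has shape (n_probesets, n_arrays). Each value is replaced by the
    mean, across arrays, of the values sharing its within-array rank.
    """
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    reference = np.sort(x, axis=0).mean(axis=1)
    return reference[ranks]


def zscore_all(x):
    """Standardize with a single mean and standard deviation for the whole matrix."""
    return (x - x.mean()) / x.std()


rng = np.random.default_rng(0)
# Simulated raw intensities: 1000 probe sets on 6 arrays
raw = rng.lognormal(mean=7.0, sigma=1.0, size=(1000, 6))
raw *= np.array([0.5, 1.0, 1.5, 2.0, 1.0, 0.8])  # hypothetical array-specific scales

log2_expr = np.log2(quantile_normalize(raw))
standardized = zscore_all(log2_expr)
```

After this step every array shares the same sorted intensity profile, which is the property the box-whisker plots are meant to confirm visually.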

 

•  Prediction  of  missing  sex-  and  sample  source  annotations  from  curated 

datasets 

A total of 805 arrays (802 from AML patients and 3 from healthy subjects) in the curated data were not

annotated  for  sex,  while  737  arrays  (all  AML  patients)  were  missing  sample  source 

information. Without these metadata, we would have to discard the data, which in turn 

would  limit  the  statistical  power  for  the  study,  and  our  ability  to  correct  for  biases 

stemming  from  individual  datasets20.  To  address  this,  we  used  supervised  machine 

learning classifiers to predict the missing metadata. For all predictions, we used ClassificaIO62, a

user interface for machine learning classification, which we recently developed, to carry

out the machine learning classification analyses utilizing the sklearn package in Python63.

 

To predict sex and sample source, the pre-processed data sets (1,956 arrays: 545 healthy

and 1,411 AML), comprising 44,754 probe sets and their annotated sex and sample source

information, were used to train logistic regression (LR) and k-nearest neighbor (KNN)

classification models.

 

The supervised machine learning LR classifier was used with the following parameters:

random_state = None, shuffle = True, penalty = l2, multi_class = ovr, solver = liblinear, 

max_iter= 100, tol = 0.0001, intercept_scaling = 1.0, verbose = 0, n_jobs = 1, C = 1.0, 

fit_intercept = True, dual = False, warm_start = False, class_weight = None 

 

The trained models for classification of missing sex and sample source annotation from 

curated  data  achieved  >  95%  classification  accuracy  with  ~  3-5%  classification  errors. 

Confusion matrix details, model accuracy and error for training and testing are presented 

in Supplementary Table S1 online, and results in Supplementary files 1 and 2. To guard against

overfitting during training, we used 10-fold cross-validation on all 1,956 gene expression data

arrays for training and validation. 
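
A minimal sketch of this kind of cross-validated logistic regression, written directly against scikit-learn rather than through ClassificaIO, is shown below. The expression matrix and labels are simulated, the two "informative" probe sets are hypothetical, and `random_state=0` is set only to make the sketch reproducible (the study lists `random_state = None`). The penalty is left at scikit-learn's default (l2), and the `multi_class` argument is omitted because recent scikit-learn releases deprecate it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_arrays, n_probesets = 200, 50

# Simulated labels (e.g. 0 = female, 1 = male) and standardized expression values
labels = rng.integers(0, 2, size=n_arrays)
expression = rng.normal(size=(n_arrays, n_probesets))
expression[:, 0] += 6.0 * labels  # hypothetical label-linked probe set
expression[:, 1] -= 6.0 * labels  # hypothetical label-linked probe set

# Solver and regularization settings mirror the parameter list above
model = LogisticRegression(solver="liblinear", C=1.0, tol=1e-4, max_iter=100,
                           fit_intercept=True, intercept_scaling=1.0)

# 10-fold cross-validation with shuffling
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, expression, labels, cv=folds, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"10-fold accuracy: {mean_accuracy:.3f}")
```

Once cross-validation is satisfactory, the same model would be refit on all annotated arrays and applied to the arrays with missing sex or sample source labels.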

 

•  Dataset-wise correction for batch effects

Batch correction was done using a dataset-wise correction. We use the term

“dataset-wise correction” to indicate performing batch correction iteratively on one

dataset at a time, against a reference set of datasets chosen to account for variability. We 

used this approach to account for the lack of within-study healthy controls in the curated

gene expression datasets. To address this issue, we used 5 additional datasets that included

within-study controls, GEO accessions: GSE107968, GSE6817264, GSE1705465,

GSE3322366, and GSE1506167 (Table 5). We refer to the latter datasets hereafter as

“covariate datasets”, as they were used as the reference datasets in the batch correction. Our

approach aimed to balance/distribute the weight of batch effects exerted by each dataset, 

as this is dependent on the number of observations within a given dataset. Combined, the 

covariate datasets included 613 arrays in total (455 AML and 158 healthy controls).

We used ComBat22 to correct for study batch effects, as its empirical Bayes-based

algorithm applies both mean-centering (location) and scale adjustments22.

Covariate datasets were treated as the covariate for batch during batch

correction, to improve performance in correcting for batch effects rather than biological

variation. After batch correction, we used principal component analysis (PCA),

visualizing components in both 2 and 3 dimensions, to compare the clustering results for

the corrections (Fig. 16a-d). Covariate data sets were removed after the batch correction

step and were not part of our downstream meta-analysis.
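
The PCA check described above can be sketched with NumPy alone. The data are simulated, and the per-dataset mean-centering used here is a deliberately crude stand-in for ComBat, included only so that the before/after comparison has something to show; it is not the correction applied in the study.

```python
import numpy as np


def pca_scores(expression, n_components=3):
    """Project arrays (rows) onto the leading principal components via SVD."""
    centered = expression - expression.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 40)            # two hypothetical datasets, 40 arrays each
expression = rng.normal(size=(80, 200))  # 80 arrays x 200 probe sets
expression += 3.0 * batch[:, None]       # simulated additive batch shift

scores_before, var_before = pca_scores(expression)

# Crude per-dataset mean-centering, standing in for ComBat for illustration only
corrected = expression.copy()
for b in np.unique(batch):
    corrected[batch == b] -= corrected[batch == b].mean(axis=0)
scores_after, var_after = pca_scores(corrected)

# Separation of the two datasets along the first principal component
gap_before = abs(scores_before[batch == 0, 0].mean() - scores_before[batch == 1, 0].mean())
gap_after = abs(scores_after[batch == 0, 0].mean() - scores_after[batch == 1, 0].mean())
print(f"PC1 gap before: {gap_before:.2f}, after: {gap_after:.2e}")
```

Plotting the first two or three columns of the score matrices, colored by dataset, gives the kind of view shown in Figure 16.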

 

•  Gene expression meta-analysis 

After the batch correction step, we performed gene expression meta-analysis for differential

expression on the merged datasets (34 data sets, 16 AML and 18 healthy), where the

expression values for all 44,754 common probe sets were aggregated. The effects of

disease state, patients’ age, sex, and sample source, including their pairwise interactions, were

investigated using an analysis of variance (ANOVA)13,68. For each probe set i,

where i = 1, …, 44,754, the gene expression Yi was modeled as a linear model:

Yi ~ (a + s + d + t) + (a:s + a:d + a:t) + (s:d + s:t) + (d:t) + ε 

 

where d is the disease state (AML or healthy), a is age (between 0 and 100 years), s is sex

(female or male), t is sample source (BM or PB), and ε is a random error term. We note 

that  the  model  includes  sample  source  and  its  interactions  to  address  comparisons 

involving different tissues in AML and healthy subjects (BM or PB respectively). 
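
To make the structure of this model concrete, the sketch below builds the corresponding design matrix (intercept, four main effects, six pairwise interactions) and fits it by ordinary least squares for one simulated probe set. It rests on simplifying assumptions that are not taken from the study: age is treated as a rescaled numeric covariate, the three categorical factors are coded 0/1, and the coefficients are invented for the simulation.

```python
from itertools import combinations

import numpy as np


def design_matrix(factors):
    """Return intercept, main-effect, and pairwise-interaction columns."""
    names = list(factors)
    n = len(factors[names[0]])
    columns = [np.ones(n)] + [np.asarray(factors[k], dtype=float) for k in names]
    labels = ["intercept"] + names
    for left, right in combinations(names, 2):
        columns.append(np.asarray(factors[left], dtype=float)
                       * np.asarray(factors[right], dtype=float))
        labels.append(f"{left}:{right}")
    return np.column_stack(columns), labels


rng = np.random.default_rng(2)
n = 500
factors = {
    "a": rng.uniform(0, 100, n) / 100.0,  # age, rescaled to [0, 1]
    "s": rng.integers(0, 2, n),           # sex (0/1 coding assumed)
    "d": rng.integers(0, 2, n),           # disease state (0 = healthy, 1 = AML)
    "t": rng.integers(0, 2, n),           # sample source (0 = PB, 1 = BM)
}
X, labels = design_matrix(factors)

# Simulated expression for one probe set with a disease effect and an age:disease effect
true_beta = np.zeros(X.shape[1])
true_beta[labels.index("d")] = 0.5
true_beta[labels.index("a:d")] = -0.3
y = X @ true_beta + rng.normal(scale=0.05, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
for label, estimate in zip(labels, beta_hat):
    print(f"{label:>9s}: {estimate:+.3f}")
```

In the full analysis the same fit would be repeated for each of the 44,754 probe sets, with an ANOVA table giving a p-value per term.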

 

From the ANOVA, genes were deemed statistically significant for disease state

(differentially expressed) if they displayed an ANOVA Bonferroni-adjusted

p-value < 0.01. Post-hoc analysis for significant genes was conducted for comparisons

(between groups) using Tukey’s Honestly Significant Difference (HSD) tests.

Additionally, we applied a quantile-based effect filter, where genes were deemed to

show biological effects in our analysis if they displayed mean difference values in the

< 5% and/or > 95% quantiles of the mean difference distributions of the binary group

comparisons. Based on the post-hoc analysis, genes were deemed to be statistically

significantly up- or down-regulated if their Tukey HSD p-value was below the

Bonferroni-adjusted cutoff of 0.01/44,754.
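
The two filters can be combined as in the sketch below. The cutoff 0.01/44,754 and the 5%/95% quantiles come from the text above; the p-values and mean differences are simulated, and the function name is hypothetical.

```python
import numpy as np

N_PROBESETS = 44_754
ALPHA = 0.01
bonferroni_cutoff = ALPHA / N_PROBESETS  # about 2.23e-07


def quantile_effect_filter(mean_diff, lower=0.05, upper=0.95):
    """Flag values in the lower or upper tail of the mean-difference distribution."""
    lo, hi = np.quantile(mean_diff, [lower, upper])
    return (mean_diff < lo) | (mean_diff > hi)


rng = np.random.default_rng(3)
mean_diff = rng.normal(scale=0.1, size=N_PROBESETS)  # simulated AML - healthy differences
p_values = rng.uniform(size=N_PROBESETS)             # simulated post-hoc p-values

# Plant 50 strongly up-regulated probe sets so the filters have something to find
p_values[:50] = 1e-12
mean_diff[:50] = 0.5

effect_mask = quantile_effect_filter(mean_diff)
selected = (p_values < bonferroni_cutoff) & effect_mask
print(f"cutoff = {bonferroni_cutoff:.3e}, selected = {int(selected.sum())}")
```

Requiring both conditions keeps probe sets that are statistically significant and also show a comparatively large effect.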

 

 


•  Functional and pathway enrichment analysis 

We carried out enrichment analysis for differentially expressed genes using DAVID33,34

(the Database for Annotation, Visualization and Integrated Discovery). The KEGG database28-30

was used for signaling pathways, and GO term functional annotation31,32 was used for

over-representation of biological function. Signaling pathways were deemed significant

based on a Benjamini-Hochberg adjusted p-value < 0.05.
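
For readers unfamiliar with the Benjamini-Hochberg adjustment, the procedure can be written in a few lines of NumPy. This is a generic sketch of the standard step-up adjustment, not DAVID's code, and the four p-values are made up for the example.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by n / k
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = adjusted < 0.05
print(adjusted, significant)
```

In this toy example all four adjusted values fall below the 0.05 threshold.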

 

•  Using k-nearest neighbor to predict AML 

Before the gene expression data were passed to the k-nearest neighbor (KNN) algorithm for training,

the gene expression signatures resulting from our meta-analysis were used to extract

expression values. KNN in ClassificaIO62 was used to carry out this analysis. All 34 data

sets (16 AML and 18 healthy) were used for training, and testing was done on all 5

covariate data sets, which include AML and healthy subjects. Dependent, target, and testing

data files were prepared in accordance with the ClassificaIO62 user guide. The KNN model

used the following parameters:

 

random_state = None, shuffle = True, metric = minkowski, weights = uniform, algorithm 

= auto, n_neighbors = 5, leaf_size = 30, n_jobs = 1, p = 2, metric_params = None 
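
The classifier settings listed above map directly onto scikit-learn's `KNeighborsClassifier`, as in the sketch below. The training and testing matrices are simulated stand-ins for the 34 curated data sets and the 5 covariate data sets; the number of genes and the size of the simulated AML shift are arbitrary choices for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)


def simulate(n_per_class, n_genes=30):
    """Simulated expression of signature genes: 0 = healthy, 1 = AML."""
    healthy = rng.normal(0.0, 1.0, size=(n_per_class, n_genes))
    aml = rng.normal(1.5, 1.0, size=(n_per_class, n_genes))
    features = np.vstack([healthy, aml])
    target = np.array([0] * n_per_class + [1] * n_per_class)
    return features, target


X_train, y_train = simulate(100)  # stands in for the curated training data
X_test, y_test = simulate(40)     # stands in for the held-out covariate data

knn = KNeighborsClassifier(n_neighbors=5, weights="uniform", algorithm="auto",
                           leaf_size=30, p=2, metric="minkowski",
                           metric_params=None, n_jobs=1)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

With `metric="minkowski"` and `p=2` the distance is Euclidean, and `weights="uniform"` means each of the 5 neighbors gets an equal vote.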

 

•  Online data availability 

Supplementary data, tables, figures and files are available online at

https://www.zenodo.org/record/1492796#.XA7iUC3Mw_U.


APPENDIX 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 


 

Figure 15. General approach, data curation, and analysis workflow summary. 

 

a. The five main steps that summarize our approach for this study. b. The

curation and screening criteria for raw gene expression and annotation data files,

data pre-processing, supervised machine learning for missing metadata prediction, and

batch effects correction. c. Meta-analysis using a linear model in analysis of variance

(ANOVA), coupled with post-hoc comparisons tested by Tukey’s Honestly Significant

Difference (HSD), and KEGG enrichment and GO term ontology for signaling pathway

and biological function annotations. Finally, classification of AML based on our results.

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Figure 16. Principal component analysis of all 2,761 subjects before and after batch

correction.

For all panels (a-d), the first two principal components are at the top and the first three

principal components are at the bottom. The data shown in all panels represent gene

expression data from 2,761 subjects (2,213 AML patients and 548 healthy individuals)

with 44,754 probe sets that have been pre-processed, logarithm base-2 transformed, and

z-score standardized across all data sets. Panels a and b show the principal component

analysis (PCA) of data not including the “covariate” datasets, while panels c

and d show the same data including the 5 “covariate” datasets. a.

Visualizations of the first two and three principal components of gene expression data

before batch correction. b. The data corrected without covariate datasets, resulting in

loss of biological effect information due to the lack of within-study controls. c. The

first two and three principal components of gene expression data including the 5 “covariate”

data sets (see legend: last 5 labels) that include within-study controls (455 AML and 158

healthy), and d. the first two and three principal components of the same data after

“dataset-wise correction” for batch effects using ComBat, as described in the Methods

section.

 

Figure 17. Functional classification of DEPS from AML disease state meta-analysis

and associated KEGG and GO enrichment analysis.

For heatmaps, normalized values are represented with blue for down-regulation and

red for up-regulation, while light red/gray represents no reported specific direction. For

the horizontal bar plots, the same values are represented with orange for down-regulation

and blue for up-regulation. a. Heatmap of 974 DEPS (964 unique genes) in rows on 2,761

arrays (columns), including 2,213 AML patients and 548 healthy individuals from the AML

meta-analysis, using unsupervised hierarchical clustering with Euclidean distance.

The age range of each age-group is displayed in the legend and illustrated in

the color bar on the top (labeled Age-group). The disease state (AML vs healthy) and sex

of each subject are also represented in color bars on the top. b. Horizontal bar plot of the

top 10 DEPS (gene symbols on vertical axis) from the AML meta-analysis with mean

difference values between AML and healthy (horizontal axis). c. The 4 KEGG

signaling pathways deemed significant in our AML disease state enrichment analysis,

with the number of up- and down-regulated DEPS enriched in each signaling pathway

(horizontal axis), also visualized as a heatmap (d) of DEPS mean difference values with

gene names (rows) identified in these 4 KEGG signaling pathways (columns). e. The

mean difference values with gene names (rows) of 61 DEPS enriched in 6 other

KEGG signaling pathways (columns) that are known to be involved in AML.

Finally, the GO enrichment analysis results are summarized in (f).

 

 

Figure 18. Sex-related gene expression meta-analysis in AML.

a. The Venn diagram shows the 70 DEPS (69 unique DE genes) identified as overlapping with the

DE genes from analysis 1, the AML disease state meta-analysis. b. Heatmap comparing mean

difference values of the 70 overlapping DEPS between Analysis

1 and Analysis 2a. c. Horizontal bar plot of the top 10 DE genes from the 70 genes; genes

are positioned on the y-axis, and the x-axis represents mean difference values. d. Heatmap of the

expression of the 70 DEPS (rows) on 2,761 arrays (columns), including 2,213 AML patients and

548 healthy individuals, from Analysis 2a of sex-relevance in AML (using unsupervised

hierarchical clustering with Euclidean distance). The disease state (AML vs

healthy) and sex of each subject are indicated in color bars at the top. e.

Pathway enrichment analysis using KEGG shows that 3 signaling pathways were

enriched by 4 unique genes from this meta-analysis. f. Enrichment analysis for

statistically significant overrepresented biological GO terms on the 70 DEPS.

 

 

Figure 19. Age-related gene expression meta-analysis in AML.

For all heatmaps, normalized values are represented with blue for down-regulation and

red for up-regulation, while light red/gray represents no reported specific direction.

Unsupervised hierarchical clustering with Euclidean distance was performed on

DEPS (vertical axis) mean difference values between each age-group comparison

(horizontal axis). a. The Venn diagram shows the 375 DEPS (372 unique DE

genes) identified as overlapping with the DE genes from analysis 1, the AML disease state

meta-analysis. b. Heatmap of the 375 DEPS (rows) across 18 age-groups (columns) that were deemed

statistically significant, with mean difference values of the 375 DEPS on the vertical axis and

age-groups on the horizontal axis. c. Heatmap of mean difference values of the top 10 DE

age-dependent genes (rows) clustered on 7 age-groups (columns); some genes appear in

multiple age-groups while others appear in only one age-group. d. The 75 DEPS that

are specific to a single age-group comparison. e. Age-group to age-group correlation

matrix showing strong correlation between age-groups compared to the “0 to 19”

age-group as a common reference. f. Heatmap of mean difference values of the 375 DEPS (rows)

across the “0 to 19” age-group (columns) as a common comparison

reference for baseline analysis. g. Heatmap of the mean difference values of 25 DE

genes common across the baseline (0 to 19) age-group compared to 7 other age-groups

that progress in age, to illustrate gene expression changes with aging. We note that the

mean difference values between AML and healthy cohorts from analysis 1 are shown in

the right-most column of panels (b-d), (f), and (g) for reference comparisons. h. Overlaps

over KEGG pathways of 17 DE genes identified in 4 KEGG pathways according to age

groups.

Table 5. Summary table of all 34 gene expression data sets used in our study.

Author, Year | GEO accession id | AML/Healthy | Affymetrix platform id: Number of samples used & Sample source | Refs.
Zatkova et al, 2009 | GSE10258 | AML | GPL570: 8 BM | 69
Tomasson et al, 2008 | GSE10358 | AML | GPL570: 300 BM | 70
Metzeler et al, 2008 | GSE12417 | AML | GPL570: 73 BM & 5 PB; GPL96/97: 160 BM & 2 PB | 54
Wouters et al, 2009, Taskesen et al, 2011 | GSE14468 | AML | GPL570: 482 BM & 43 PB | 71,72
Figueroa et al, 2009 | GSE14479 | AML | GPL570: 16 BM | 73
Klein et al, 2009 | GSE15434 | AML | GPL570: 231 BM & 20 PB | 74
Lück et al, 2011 | GSE29883 | AML | GPL570: 10 BM & 2 PB | 75
Li et al, 2013, Herold et al, 2014, Janke et al, 2014, Jiang et al, 2016 | GSE37642 | AML | GPL570: 140 BM; GPL96/97: 422 BM | 55-58
Bullinger et al, 2014 | GSE39363 | AML | GPL570: 11 BM & 2 PB | NYP
Opel et al, 2015 | GSE46819 | AML | GPL570: 8 BM & 4 PB | 76
TCGA et al, 2015 | GSE68833 | AML | GPL570: 183 BM | NYP
Cao et al, 2016 | GSE69565 | AML | GPL570: 12 PB | NYP
Bohl et al, 2016 | GSE84334 | AML | GPL570: 25 BM & 20 PB | NYP
Li et al, 2011 | GSE23025 | AML | GPL570: 21 BM & 13 PB | NYP
Warren et al, 2009 | GSE11375 | Healthy | GPL570: 26 PB | NYP
Green et al, 2009 | GSE14845 | Healthy | GPL570: 1 PB | 77
Wu et al, 2012 | GSE15932 | Healthy | GPL570: 8 PB | 78
Karlovich et al, 2009 | GSE16028 | Healthy | GPL570: 22 PB | 79
Krug et al, 2011 | GSE17114 | Healthy | GPL570: 14 PB | 80
Kong et al, 2012 | GSE18123 | Healthy | GPL570: 17 PB | 81
Sharma et al, 2009 | GSE18781 | Healthy | GPL570: 25 PB | 82
Rosell et al, 2011 | GSE25414 | Healthy | GPL570: 12 PB | 83
Schmidt et al, 2006 | GSE2842 | Healthy | GPL570: 2 PB | 84
Meng et al, 2015 | GSE71226 | Healthy | GPL570: 3 PB | NYP
Tasaki et al, 2017 | GSE84844 | Healthy | GPL570: 30 PB | 85
Leday et al, 2018 | GSE98793 | Healthy | GPL570: 64 PB | 86
Shamir et al, 2017 | GSE99039 | Healthy | GPL570: 121 PB | 87
Tasaki et al, 2018 | GSE93272 | Healthy | GPL570: 35 PB | 68
Clelland et al, 2013 | GSE46449 | Healthy | GPL570: 24 PB | 88
Lauwerys et al, 2013, Ducreux et al, 2016 | GSE39088 | Healthy | GPL570: 46 PB | 89,90
Xiao et al, 2011 | GSE36809 | Healthy | GPL570: 35 PB | 91
Zhou et al, 2010 | GSE19743 | Healthy | GPL570: 63 PB | 92
Jiang et al, 2018 | GSE107968* | 2 AML, 1 Healthy | GPL570: 3 BM | NYP
Greiner et al, 2015 | GSE68172* | 20 AML, 5 Healthy | GPL570: 25 PB | 64
Majeti et al, 2009 | GSE17054* | 9 AML, 4 Healthy | GPL570: 13 BM | 65
Bacher et al, 2012 | GSE33223* | 20 AML, 10 Healthy | GPL570: 30 PB | 66
Mills et al, 2009 | GSE15061* | 404 AML, 138 Healthy | GPL570: 542 BM | 67

Meta-analysis data sets summary
Disease state | AML: 2213 | Healthy: 548
Sample source | BM: 2090 | PB: 671
Affymetrix platform id | GPL570: 2177 | GPL96/97: 584
Unique probesets | GPL570: 54,675 | GPL96/97: 44,760

GEO, Gene Expression Omnibus; AML, acute myeloid leukemia; Ref., reference; NYP,

not yet published; GPL570, Affymetrix Human Genome U133 Plus 2.0 Array; GPL96,

Affymetrix Human Genome U133A Array; GPL97, Affymetrix Human Genome U133B

Array; BM, Bone Marrow; PB, Peripheral Blood. A summary table of all data sets

used in our meta-analysis and disease classification. *“Covariate data sets,” 5 data sets

used only during the batch correction step to balance/account for batch in our curated data.

 

 

Table 6. Top 10 up- and down-regulated DE genes in AML from disease state

meta-analysis.

DEG name | DEG Symbol | Tukey’s HSD mean difference | Bonferroni (p-adjusted)

Up-regulated
Wilms tumor 1 | WT1 | 0.255353 | < 4.11E-11
MAM domain containing 2 | MAMDC2 | 0.248983 | 5.47E-09
X inactive specific transcript (non-protein coding) | XIST | 0.230331 | < 4.11E-11
homeobox A3 | HOXA3 | 0.195790 | 1.1E-06
fms-related tyrosine kinase 3 | FLT3 | 0.193420 | < 4.11E-11
cyclin A1 | CCNA1 | 0.185050 | 1.35E-07
mex-3 RNA binding family member B | MEX3B | 0.181068 | < 4.11E-11
collagen, type IV, alpha 5 | COL4A5 | 0.177721 | 1.7E-05
neurexin 2 | NRXN2 | 0.166598 | < 4.11E-11
ATPase, Na+/K+ transporting, beta 1 polypeptide | ATP1B1 | 0.165197 | 5.47E-09

Down-regulated
cysteine-rich secretory protein 3 | CRISP3 | -0.51965625 | < 4.11E-11
olfactomedin 4 | OLFM4 | -0.489845396 | < 4.11E-11
orosomucoid 1 | ORM1 | -0.465232864 | < 4.11E-11
cytochrome P450, family 4, subfamily F, polypeptide 3 | CYP4F3 | -0.453467442 | < 4.11E-11
chitinase 3-like 1 (cartilage glycoprotein-39) | CHI3L1 | -0.421520435 | < 4.11E-11
annexin A3 | ANXA3 | -0.390688999 | < 4.11E-11
oxidized low density lipoprotein (lectin-like) receptor 1 | OLR1 | -0.35525472 | < 4.11E-11
carcinoembryonic antigen-related cell adhesion molecule 8 | CEACAM8 | -0.351181264 | < 4.11E-11
orosomucoid 1 | ORM1 | -0.336303304 | < 4.11E-11
tumor-associated calcium signal transducer 2 | TACSTD2 | -0.323939961 | < 4.11E-11

From the post-hoc Tukey’s test, genes with mean difference values between AML and

healthy (AML - healthy) in the < 5% or > 95% quantiles were deemed statistically significant for

AML. Genes were considered statistically significant for disease state from the analysis of all

2,761 cases (2,213 AML patients and 548 healthy controls). The p-values were

adjusted using Bonferroni correction. Significant DE

genes are listed in descending order of the absolute mean difference value for disease

state.

 

Table 7. KEGG functional analysis of 974 DEPS from meta-analysis of 34 gene

expression data sets.

Pathway | No. of genes | Down-regulated | Up-regulated | p-value (unadjusted) | Benjamini (p-adjusted)
Hematopoietic cell lineage | 11, 6 | IL1R2, CD59, GYPA, MS4A1, EPOR, CD24, CD14, EPOR, IL1R1, MME, CR1 | ITGA4, FLT3, CD34, IL3RA, ITGA5, CD44 | 2.3E-5 | 5.8E-3
Cell cycle | 12, 6 | CDC7, CDC6, CCNB1, CDC20, CCNA2, CCNE2, TTK, CDC14B, CDK1, BUB1, CCNB2, BUB1B | RB1, CCNA1, CDK6, ATM, TFDP2, CDKN2A | 1.4E-4 | 1.2E-2
p53 signaling pathway | 6, 7 | THBS1, CCNB1, CCNE2, CDK1, RRM2, CCNB2 | SIAH1, CDK6, ATM, SERPINE1, CDKN2A, PMAIP1, ZMAT3 | 1.0E-4 | 1.3E-2
Transcriptional misregulation in cancer | 7, 13 | IL1R2, GZMB, CD14, ELANE, MMP9, CEBPE, PBX1 | WT1, RUNX2, ETV5, MEIS1, JUP, EWSR1, ATM, HOXA10, MLF1, FLT3, CCNT2, MEF2C, SLC45A3 | 6.5E-4 | 4.1E-2

974 DEPS (487 overexpressed and 487 underexpressed) from analysis 1 were enriched

in 4 statistically significant KEGG pathways. Signaling pathways were deemed

significant based on a Benjamini-Hochberg adjusted p-value < 0.05. The two numbers in

each cell of the “No. of genes” column indicate the down-regulated (first) and up-regulated

(second) DE genes enriched in each pathway. Pathways are listed in order of

increasing Benjamini-Hochberg adjusted p-value.

 

 

Table 8. AML sex relevance (male - female) DE genes & associated signaling

pathways.

Pathway | No. of genes | High in Females | High in Males
Hematopoietic cell lineage | 1, 2 | MS4A1 | FLT3, CD34
p53 signaling pathway | –, 1 | – | PMAIP1
Transcriptional misregulation in cancer | –, 1 | – | FLT3

Common DEPS between the 70 DEPS from analysis 2a and the 974 DEPS from AML

disease state meta-analysis for KEGG pathways and GO terms. 4 sex-relevant unique DE

genes in AML were found in 3 different signaling pathways, including 1 highly

expressed in females and 3 highly expressed in males.

 

 


 

Table 9. AML age-dependent (AML - healthy) DE genes & associated signaling pathways.

Pathway: No. of genes
Hematopoietic cell lineage: 4, 1
Cell cycle: 3, 2
p53 signaling pathway: 1, 1
Transcriptional misregulation in cancer: 5, 4

(30 to 39) - (0 to 19), (40 to 49) - (0 to 19), 

(50 to 59) - (0 to 19) 

(30 to 39) - (0 to 19), (40 to 49) - (0 to 19), 

(50 to 59) - (0 to 19) 

MS4A1 

(40 to 49) - (0 to 19), (50 to 59) - (0 to 19), 
(60 to 69) - (0 to 19), (70 to 79) - (0 to 19), 

Down-regulated 

Age-group 

(30 to 39) - (0 to 19) 

CD14 

MME 

CD24 

(80 to 100) - (0 to 19) 

Cell cycle (down-regulated gene: age-group)
  CCNA2: (50 to 59) - (0 to 19)
  CDK6: (60 to 69) - (30 to 39)
  CDC14B: (30 to 39) - (0 to 19), (40 to 49) - (0 to 19), (50 to 59) - (0 to 19),
    (60 to 69) - (0 to 19), (70 to 79) - (0 to 19)

p53 signaling pathway (down-regulated gene: age-group)
  CDK6: (60 to 69) - (30 to 39)

Transcriptional misregulation in cancer (down-regulated gene: age-group)
  CD14: (30 to 39) - (0 to 19)
  MMP9: (20 to 29) - (0 to 19), (30 to 39) - (0 to 19), (40 to 49) - (0 to 19),
    (50 to 59) - (0 to 19), (60 to 69) - (0 to 19), (70 to 79) - (0 to 19)
  EWSR1: (60 to 69) - (50 to 59), (70 to 79) - (50 to 59)
  CEBPE: (20 to 29) - (0 to 19), (30 to 39) - (0 to 19), (40 to 49) - (0 to 19),
    (50 to 59) - (0 to 19), (50 to 59) - (20 to 29), (60 to 69) - (0 to 19),
    (70 to 79) - (0 to 19), (70 to 79) - (20 to 29), (80 to 100) - (0 to 19)
  CCNT2: (60 to 69) - (30 to 39), (70 to 79) - (30 to 39), (60 to 69) - (50 to 59)


 

Up-regulated (gene: age-group)

Hematopoietic cell lineage
  FLT3: (20 to 29) - (0 to 19), (30 to 39) - (0 to 19), (40 to 49) - (0 to 19),
    (50 to 59) - (0 to 19), (60 to 69) - (0 to 19), (70 to 79) - (0 to 19),
    (80 to 100) - (0 to 19)

Cell cycle
  CCNA1: (30 to 39) - (0 to 19), (40 to 49) - (0 to 19), (50 to 59) - (0 to 19),
    (60 to 69) - (0 to 19)
  CDKN2A: (40 to 49) - (0 to 19)

p53 signaling pathway
  CDKN2A: (40 to 49) - (0 to 19)

Transcriptional misregulation in cancer
  MEIS1: (50 to 59) - (0 to 19), (50 to 59) - (20 to 29), (60 to 69) - (0 to 19),
    (60 to 69) - (20 to 29), (70 to 79) - (0 to 19)
  WT1: (20 to 29) - (0 to 19), (30 to 39) - (0 to 19), (40 to 49) - (0 to 19),
    (50 to 59) - (0 to 19), (60 to 69) - (0 to 19), (70 to 79) - (0 to 19)
  FLT3: (20 to 29) - (0 to 19), (30 to 39) - (0 to 19), (40 to 49) - (0 to 19),
    (50 to 59) - (0 to 19), (60 to 69) - (0 to 19), (70 to 79) - (0 to 19),
    (80 to 100) - (0 to 19)
  HOXA10: (40 to 49) - (0 to 19), (50 to 59) - (0 to 19), (50 to 59) - (20 to 29),
    (60 to 69) - (0 to 19), (60 to 69) - (20 to 29), (70 to 79) - (0 to 19)

Common DE genes between the 375 DEPS from analysis 2b and the 974 DEPS from analysis 1,
for KEGG pathways and GO terms. 17 age-dependent unique DE genes in AML were found in
4 different signaling pathways. DE genes are listed according to associated age-groups
for each signaling pathway.
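To make the age-group notation of Table 9 concrete, the following Python sketch bins subject ages into the age-groups that appear in the table and evaluates one contrast such as (30 to 39) - (0 to 19). It assumes integer ages and reads "A - B" as the mean expression in group A minus the mean expression in group B; both are simplifying assumptions for illustration, and the function names and toy data are hypothetical rather than part of the analysis.

```python
from statistics import mean

# Age bins as written in Table 9: (lower, upper) bounds, both inclusive.
AGE_GROUPS = [(0, 19), (20, 29), (30, 39), (40, 49),
              (50, 59), (60, 69), (70, 79), (80, 100)]


def age_group(age):
    """Return the label of the bin containing `age`, e.g. '(30 to 39)'."""
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"({low} to {high})"
    raise ValueError(f"age {age} is outside the binned range")


def contrast(samples, group_a, group_b):
    """Mean expression in group_a minus mean expression in group_b."""
    by_group = {}
    for age, expression in samples:
        by_group.setdefault(age_group(age), []).append(expression)
    return mean(by_group[group_a]) - mean(by_group[group_b])


# Hypothetical (age, log2 expression) pairs for a single gene.
toy_samples = [(5, 6.0), (17, 6.4), (33, 7.1), (38, 7.3), (45, 7.9)]
difference = contrast(toy_samples, "(30 to 39)", "(0 to 19)")
```

A positive difference corresponds to higher expression in the older group relative to the baseline group for that contrast.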

 

 


 

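Table 10. Age-dependent genes show drug to gene interaction.

Druggable gene category: DRUGGABLE GENOME
  Matching gene count: 15, 24
  Matching gene(s), DGIdb analysis 1: TFF3, ORM1, CA4, CYP4F2, CYP4F3, CEACAM1, FLT3,
    CHIT1, OLR1, KCNJ15, CAMP, CRISP2, CAPN3, SLC37A3, FCRL1
  Matching gene(s), DGIdb analysis 2: CDH1, GPX3, CD14, DYRK2, SLPI, CCNA2, TGFBR3,
    UGCG, FCN1, GZMA, TCN1, BPI, S100A12, CDK6, IL12A, P2RY13, ADGRG3, DNMT3B,
    GUCY1A3, FGFBP2, PTPRJ, LRRK2, BCL2L15, STYX

Druggable gene category: KINASE
  Matching gene count: 3, 12
  Matching gene(s), DGIdb analysis 1: CEACAM1, FLT3, TCL1A
  Matching gene(s), DGIdb analysis 2: DYRK2, CCNA2, TGFBR3, S100A12, CDKN2A, CDK6,
    DIRAS3, GTPBP4, DEPTOR, PTPRJ, NME7, LRRK2

Druggable gene category: SERINE THREONINE KINASE
  Matching gene count: 2, 10
  Matching gene(s), DGIdb analysis 1: FLT3, TCL1A
  Matching gene(s), DGIdb analysis 2: DYRK2, CCNA2, TGFBR3, S100A12, CDKN2A, CDK6,
    DIRAS3, GTPBP4, PTPRJ, LRRK2

Druggable gene category: TUMOR SUPPRESSOR
  Matching gene count: –, 8
  Matching gene(s), DGIdb analysis 1: –
  Matching gene(s), DGIdb analysis 2: CTDSPL, CCNA2, CDKN2A, CDK6, IL12A, DIRAS3,
    GTPBP4, CCPG1

Druggable gene category: CELL SURFACE
  Matching gene count: 2, 5
  Matching gene(s), DGIdb analysis 1: CA4, CEACAM1
  Matching gene(s), DGIdb analysis 2: CD14, TGFBR3, FCN1, MYH10, PTPRJ

Druggable gene category: PROTEASE
  Matching gene count: –, 6
  Matching gene(s), DGIdb analysis 1: –
  Matching gene(s), DGIdb analysis 2: SLPI, FCN1, GZMA, ASPH, IGKV1-17, NRIP3

Druggable gene category: CLINICALLY ACTIONABLE
  Matching gene count: 1, 4
  Matching gene(s), DGIdb analysis 1: FLT3
  Matching gene(s), DGIdb analysis 2: CDH1, CDKN2A, CDK6, DNMT3B

Druggable gene category: TRANSPORTER
  Matching gene count: 4, –
  Matching gene(s), DGIdb analysis 1: CEACAM1, KCNJ15, SLC37A3, RBP7
  Matching gene(s), DGIdb analysis 2: –

Druggable gene category: TRANSCRIPTION FACTOR COMPLEX
  Matching gene count: –, 4
  Matching gene(s), DGIdb analysis 1: –
  Matching gene(s), DGIdb analysis 2: SMAD6, GFI1B, HOXA11, HOXB9

Druggable gene category: EXTERNAL SIDE OF PLASMA MEMBRANE
  Matching gene count: 1, 3
  Matching gene(s), DGIdb analysis 1: CA4
  Matching gene(s), DGIdb analysis 2: CD14, TGFBR3, FCN1

Druggable gene category: HISTONE MODIFICATION
  Matching gene count: –, 3
  Matching gene(s), DGIdb analysis 1: –
  Matching gene(s), DGIdb analysis 2: CCNA2, GFI1B, DNMT3B

Druggable gene category: CYTOCHROME P450
  Matching gene count: 2, –
  Matching gene(s), DGIdb analysis 1: CYP4F2, CYP4F3
  Matching gene(s), DGIdb analysis 2: –

Druggable gene category: DRUG METABOLISM
  Matching gene count: –, 1
  Matching gene(s), DGIdb analysis 1: CYP4F2
  Matching gene(s), DGIdb analysis 2: –

Druggable gene category: EXCHANGER
  Matching gene count: –, 1
  Matching gene(s), DGIdb analysis 1: SLC37A3
  Matching gene(s), DGIdb analysis 2: –

Two drug-gene interaction analyses were carried out using DGIdb: (i) on the 25 DE genes
common across the baseline (0 to 19) age-group, and (ii) on the 75 genes identified as
specific to one age-group. Functional classes for both analyses are shown here. The
“Matching gene count” entry gives the number of matching genes from DGIdb analysis (i)
(first number) and from DGIdb analysis (ii) (second number) in each category.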
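The pair of numbers reported as “Matching gene count” can be read as two set intersections, one per input gene list. The Python sketch below tallies such counts for three categories whose membership is copied from the corresponding entries of Table 10 (analysis 1 and analysis 2 genes merged). It is only an illustration of the bookkeeping: it does not query DGIdb, the function name is hypothetical, and the two input lists are small made-up subsets rather than the full 25-gene and 75-gene lists.

```python
# Category membership copied from three entries of Table 10; a real lookup
# would come from DGIdb itself rather than from a hard-coded dictionary.
CATEGORY_GENES = {
    "KINASE": {"CEACAM1", "FLT3", "TCL1A", "DYRK2", "CCNA2", "TGFBR3",
               "S100A12", "CDKN2A", "CDK6", "DIRAS3", "GTPBP4", "DEPTOR",
               "PTPRJ", "NME7", "LRRK2"},
    "CLINICALLY ACTIONABLE": {"FLT3", "CDH1", "CDKN2A", "CDK6", "DNMT3B"},
    "CYTOCHROME P450": {"CYP4F2", "CYP4F3"},
}


def matching_counts(genes_i, genes_ii):
    """Return {category: (matches in list i, matches in list ii)}."""
    return {
        category: (len(members & genes_i), len(members & genes_ii))
        for category, members in CATEGORY_GENES.items()
    }


# Hypothetical input lists: a few genes from each analysis, not the full sets.
analysis_i = {"FLT3", "CEACAM1", "CYP4F2", "CA4"}
analysis_ii = {"CDK6", "CDKN2A", "DNMT3B", "SLPI"}
counts = matching_counts(analysis_i, analysis_ii)
```

A count of zero in this sketch corresponds to the “–” entries of the table.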

 

 

BIBLIOGRAPHY 

147 

 

 

 

1 

 
2 

 
3 

 
4 

 
5 

 
6 

 
7 

 
8 

 
9 

 
10 

 
11 

 
12 

BIBLIOGRAPHY 

 
 

Kumar, C. C. Genetic abnormalities and challenges in the treatment of acute 
myeloid leukemia. Genes Cancer 2, 95-107, doi:10.1177/1947601911408076 
(2011). 

De Kouchkovsky, I. & Abdul-Hay, M. 'Acute myeloid leukemia: a 
comprehensive review and 2016 update'. Blood Cancer J 6, e441, 
doi:10.1038/bcj.2016.50 (2016). 

Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2018. CA Cancer J Clin 
68, 7-30, doi:10.3322/caac.21442 (2018). 

Institute, N. C. SEER Cancer Stat Facts: Acute Myeloid Leukemia (Percent of 
New Cases by Age Group). [https://seer.cancer.gov/statfacts/html/amyl.html].  
((accessed 11.30.18), 2011-2015). 

Short, N. J., Rytting, M. E. & Cortes, J. E. Acute myeloid leukaemia. Lancet 392, 
593-606, doi:10.1016/S0140-6736(18)31041-9 (2018). 

Dohner, H., Weisdorf, D. J. & Bloomfield, C. D. Acute Myeloid Leukemia. N 
Engl J Med 373, 1136-1152, doi:10.1056/NEJMra1406184 (2015). 

Reese, N. D. & Schiller, G. J. High-dose cytarabine (HD araC) in the treatment of 
leukemias: a review. Curr Hematol Malig Rep 8, 141-148, doi:10.1007/s11899-
013-0156-3 (2013). 

Dohner, H. et al. Diagnosis and management of acute myeloid leukemia in adults: 
recommendations from an international expert panel, on behalf of the European 
LeukemiaNet. Blood 115, 453-474, doi:10.1182/blood-2009-07-235358 (2010). 

Meyers, J., Yu, Y., Kaye, J. A. & Davis, K. L. Medicare fee-for-service enrollees 
with primary acute myeloid leukemia: an analysis of treatment patterns, survival, 
and healthcare resource utilization and costs. Appl Health Econ Health Policy 11, 
275-286, doi:10.1007/s40258-013-0032-2 (2013). 

Ferrara, F. & Schiffer, C. A. Acute myeloid leukaemia in adults. Lancet 381, 484-
495, doi:10.1016/S0140-6736(12)61727-9 (2013). 

Appelbaum, F. R. et al. Age and acute myeloid leukemia. Blood 107, 3481-3485, 
doi:10.1182/blood-2005-09-3724 (2006). 

Grimwade, D. & Hills, R. K. Independent prognostic factors for AML outcome. 
Hematology Am Soc Hematol Educ Program, 385-395, doi:10.1182/asheducation-
2009.1.385 (2009). 

148 

 

Dohner, H. Implication of the molecular characterization of acute myeloid 
leukemia. Hematology Am Soc Hematol Educ Program, 412-419, 
doi:10.1182/asheducation-2007.1.412 (2007). 

 
14  Walter, M. J. et al. Acquired copy number alterations in adult acute myeloid 

leukemia genomes. Proc Natl Acad Sci U S A 106, 12950-12955, 
doi:10.1073/pnas.0903091106 (2009). 

Suela, J., Alvarez, S. & Cigudosa, J. C. DNA profiling by arrayCGH in acute 
myeloid leukemia and myelodysplastic syndromes. Cytogenet Genome Res 118, 
304-309, doi:10.1159/000108314 (2007). 

Armstrong, S. A. et al. MLL translocations specify a distinct gene expression 
profile that distinguishes a unique leukemia. Nature Genetics 30, 41-47, 
doi:10.1038/ng765 (2002). 

Debernardi, S. et al. Genome-wide analysis of acute myeloid leukemia with 
normal karyotype reveals a unique pattern of homeobox gene expression distinct 
from those with translocation-mediated fusion events. Gene Chromosome Canc 
37, 149-158, doi:10.1002/gcc.10198 (2003). 

 
13 

 
15 

 
16 

 
17 

 
18 

 
20 

 
21 

 
22 

 
23 

 

Schoch, C. et al. Acute myeloid leukemias with reciprocal rearrangements can be 
distinguished by specific gene expression profiles. P Natl Acad Sci USA 99, 
10008-10013, doi:10.1073/pnas.142103599 (2002). 

 
19  Miller, B. G. & Stamatoyannopoulos, J. A. Integrative meta-analysis of 

differential gene expression in acute myeloid leukemia. PLoS One 5, e9466, 
doi:10.1371/journal.pone.0009466 (2010). 

Ramasamy, A., Mondry, A., Holmes, C. C. & Altman, D. G. Key issues in 
conducting a meta-analysis of gene expression microarray datasets. PLoS Med 5, 
e184, doi:10.1371/journal.pmed.0050184 (2008). 

Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray 
expression data using empirical Bayes methods. Biostatistics 8, 118-127, 
doi:10.1093/biostatistics/kxj037 (2007). 

Chen, C. et al. Removing batch effects in analysis of expression microarray data: 
an evaluation of six batch adjustment methods. PLoS One 6, e17238, 
doi:10.1371/journal.pone.0017238 (2011). 

Pavlidis, P. Using ANOVA for gene selection from microarray studies of the 
nervous system. Methods 31, 282-289, doi:10.1016/S1046-2023(03)00157-9 
(2003). 

149 

 

Pavlidis, P. & Noble, W. S. Matrix2png: a utility for visualizing matrix data. 
Bioinformatics 19, 295-296, doi:DOI 10.1093/bioinformatics/19.2.295 (2003). 

 
25  Mias, G. in Mathematica for Bioinformatics: A Wolfram Language Approach to 

Omics     193-226 (Springer International Publishing, 2018). 

24 

 
26 

 
28 

 
29 

 
30 

 
31 

 
32 

 
33 

 
34 

 
35 

 
36 

 

Neyman, J. & Pearson, E. S. On the use and interpretation of certain test criteria 
for purposes of statistical inference. Part II. Biometrika 20a, 263-294, doi:DOI 
10.1093/biomet/20A.3-4.263 (1928). 

 
27  Waltman, L. & Schreiber, M. On the calculation of percentile-based bibliometric 

indicators. J Am Soc Inf Sci Tec 64, 372-379, doi:10.1002/asi.22775 (2013). 

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new 
perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res 45, 
D353-D361, doi:10.1093/nar/gkw1092 (2017). 

Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a 
reference resource for gene and protein annotation. Nucleic Acids Research 44, 
D457-D462, doi:10.1093/nar/gkv1070 (2016). 

Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. 
Nucleic Acids Res 28, 27-30 (2000). 

Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene 
Ontology Consortium. Nat Genet 25, 25-29, doi:10.1038/75556 (2000). 

Carbon, S. et al. Expansion of the Gene Ontology knowledgebase and resources. 
Nucleic Acids Research 45, D331-D338, doi:10.1093/nar/gkw1108 (2017). 

Huang, D. W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment 
tools: paths toward the comprehensive functional analysis of large gene lists. 
Nucleic Acids Research 37, 1-13, doi:10.1093/nar/gkn923 (2009). 

Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative 
analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 
44-57, doi:10.1038/nprot.2008.211 (2009). 

Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate - a Practical 
and Powerful Approach to Multiple Testing. J Roy Stat Soc B Met 57, 289-300 
(1995). 

Griffith, M. et al. DGIdb: mining the druggable genome. Nat Methods 10, 1209-
+, doi:10.1038/Nmeth.2689 (2013). 

150 

 

37 

Arber, D. A. et al. The 2016 revision to the World Health Organization 
classification of myeloid neoplasms and acute leukemia. Blood 127, 2391-2405, 
doi:10.1182/blood-2016-03-643544 (2016). 

 
38  Martelli, M. P., Sportoletti, P., Tiacci, E., Martelli, M. F. & Falini, B. Mutational 
landscape of AML with normal cytogenetics: biological and clinical implications. 
Blood Rev 27, 13-22, doi:10.1016/j.blre.2012.11.001 (2013). 

Klepin, H. D., Rao, A. V. & Pardee, T. S. Acute myeloid leukemia and 
myelodysplastic syndromes in older adults. J Clin Oncol 32, 2541-2552, 
doi:10.1200/JCO.2014.55.1564 (2014). 

Cancer Genome Atlas Research, N. et al. Genomic and epigenomic landscapes of 
adult de novo acute myeloid leukemia. N Engl J Med 368, 2059-2074, 
doi:10.1056/NEJMoa1301689 (2013). 

 
41  Walter, M. J. et al. Clonal architecture of secondary acute myeloid leukemia. N 

Engl J Med 366, 1090-1098, doi:10.1056/NEJMoa1106968 (2012). 

 
39 

 
40 

 
42 

 
43 

 
44 

 
45 

 
46 

 
47 

 

Hou, H. A. et al. WT1 mutation in 470 adult patients with acute myeloid 
leukemia: stability during disease evolution and implication of its incorporation 
into a survival scoring system. Blood 115, 5222-5231, doi:10.1182/blood-2009-
12-259390 (2010). 

Ho, P. A. et al. Prevalence and prognostic implications of WT1 mutations in 
pediatric acute myeloid leukemia (AML): a report from the Children's Oncology 
Group. Blood 116, 702-710, doi:10.1182/blood-2010-02-268953 (2010). 

Udby, L., Calafat, J., Sorensen, O. E., Borregaard, N. & Kjeldsen, L. 
Identification of human cysteine-rich secretory protein 3 (CRISP-3) as a matrix 
protein in a subset of peroxidase-negative granules of neutrophils and in the 
granules of eosinophils. J Leukocyte Biol 72, 462-469 (2002). 

Izzi, V. et al. An extracellular matrix signature in leukemia precursor cells and 
acute myeloid leukemia. Haematologica 102, E245-E248, 
doi:10.3324/haematol.2017.167304 (2017). 

Buggins, A. G. et al. Microenvironment produced by acute myeloid leukemia 
cells prevents T cell activation and proliferation by inhibition of NF-kappaB, c-
Myc, and pRb pathways. J Immunol 167, 6021-6030 (2001). 

Rashidi, A. & Uy, G. L. Targeting the Microenvironment in Acute Myeloid 
Leukemia. Curr Hematol Malig R 10, 126-131, doi:10.1007/s11899-015-0255-4 
(2015). 

151 

 

Cock, P. J. A. et al. Biopython: freely available Python tools for computational 
molecular biology and bioinformatics. Bioinformatics 25, 1422-1423, 
doi:10.1093/bioinformatics/btp163 (2009). 

 
54  Metzeler, K. H. et al. An 86-probe-set gene-expression signature predicts survival 

in cytogenetically normal acute myeloid leukemia. Blood 112, 4193-4201, 
doi:10.1182/blood-2008-02-134411 (2008). 

48 

 
49 

 
50 

 
51 

 
52 

 
53 

 
55 

 
56 

 
57 

 
58 

 

Borrow, J. et al. The t(7;11)(p15;p15) translocation in acute myeloid leukaemia 
fuses the genes for nucleoporin NUP98 and class I homeoprotein HOXA9. Nature 
Genetics 12, 159-167, doi:DOI 10.1038/ng0296-159 (1996). 

Andreeff, M. et al. HOX expression patterns identify a common signature for 
favorable AML. Leukemia 22, 2041-2047, doi:10.1038/leu.2008.198 (2008). 

Fan, C., Stendahl, U., Stjernberg, N. & Beckman, L. Association between 
Orosomucoid Types and Cancer. Oncology 52, 498-500 (1995). 

Abbate, F., Casini, A., Owa, T., Scozzafava, A. & Supuran, C. T. Carbonic 
anhydrase inhibitors: E7070, a sulfonamide anticancer agent, potently inhibits 
cytosolic isozymes I and II, and transmembrane, tumor-associated isozyme IX. 
Bioorg Med Chem Lett 14, 217-223, doi:10.1016/j.bmcl.2003.09.062 (2004). 

Barrett, T. et al. NCBI GEO: archive for functional genomics data sets--update. 
Nucleic Acids Res 41, D991-995, doi:10.1093/nar/gks1193 (2013). 

Li, Z. et al. Identification of a 24-gene prognostic signature that improves the 
European LeukemiaNet risk classification of acute myeloid leukemia: an 
international collaborative study. J Clin Oncol 31, 1172-1181, 
doi:10.1200/JCO.2012.44.3184 (2013). 

Herold, T. et al. Isolated trisomy 13 defines a homogeneous AML subgroup with 
high frequency of mutations in spliceosome genes and poor prognosis. Blood 124, 
1304-1311, doi:10.1182/blood-2013-12-540716 (2014). 

Janke, H. et al. Activating FLT3 Mutants Show Distinct Gain-of-Function 
Phenotypes In Vitro and a Characteristic Signaling Pathway Profile Associated 
with Prognosis in Acute Myeloid Leukemia. Plos One 9, doi:ARTN e89560 
10.1371/journal.pone.0089560 (2014). 

Jiang, X. et al. Eradication of Acute Myeloid Leukemia with FLT3 Ligand-
Targeted miR-150 Nanoparticles. Cancer Res 76, 4470-4480, doi:10.1158/0008-
5472.CAN-15-2949 (2016). 

152 

 

Bolstad, B. M., Irizarry, R. A., Astrand, M. & Speed, T. P. A comparison of 
normalization methods for high density oligonucleotide array data based on 
variance and bias. Bioinformatics 19, 185-193 (2003). 

Irizarry, R. A. et al. Summaries of Affymetrix GeneChip probe level data. Nucleic 
Acids Res 31, e15 (2003). 

Irizarry, R. A. et al. Exploration, normalization, and summaries of high density 
oligonucleotide array probe level data. Biostatistics 4, 249-264, 
doi:10.1093/biostatistics/4.2.249 (2003). 

Roushangar, R. & Mias, G. I. ClassificaIO: machine learning for classification 
graphical user interface. bioRxiv, doi:10.1101/240184 (2017). 

Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res 
12, 2825-2830 (2011). 

Schneider, V. Z., L.; Markus, R.; Fekete, N.; Schrezenmeier, H.; Erle, A. ; Lars, 
B.; Hofmann, S.; Götz, M.; Döhner, K.; Ihme, S.; Döhner, H.; Buske, C.; Feuring-
Buske, M.; Greiner, J. Leukemic progenitor cells are susceptible to targeting by 
stimulated cytotoxic T cells against immunogenic leukemia-associated antigens.  
(2015). 

 
65  Majeti, R. et al. Dysregulated gene expression networks in human acute 

myelogenous leukemia stem cells. Proc Natl Acad Sci U S A 106, 3396-3401, 
doi:10.1073/pnas.0900089106 (2009). 

59 

 
60 

 
61 

 
62 

 
63 

 
64 

 
66 

 
68 

 
69 

 

Bacher, U. et al. Multilineage dysplasia does not influence prognosis in CEBPA-
mutated AML, supporting the WHO proposal to classify these patients as a 
unique entity. Blood 119, 4719-4722, doi:10.1182/blood-2011-12-395574 (2012). 

 
67  Mills, K. I. et al. Microarray-based classifiers and prognosis models identify 

subgroups with distinct clinical outcomes and high risk of AML transformation of 
myelodysplastic syndrome. Blood 114, 1063-1072, doi:10.1182/blood-2008-10-
187203 (2009). 

Tasaki, S. et al. Multi-omics monitoring of drug response in rheumatoid arthritis 
in pursuit of molecular remission. Nat Commun 9, 2755, doi:10.1038/s41467-018-
05044-4 (2018). 

Zatkova, A. et al. AML/MDS with 11q/MLL amplification show characteristic 
gene expression signature and interplay of DNA copy number changes. Genes 
Chromosomes Cancer 48, 510-520, doi:10.1002/gcc.20658 (2009). 

153 

 

Tomasson, M. H. et al. Somatic mutations and germline sequence variants in the 
expressed tyrosine kinase genes of patients with de novo acute myeloid leukemia. 
Blood 111, 4797-4808, doi:10.1182/blood-2007-09-113027 (2008). 

Taskesen, E. et al. Prognostic impact, concurrent genetic mutations, and gene 
expression features of AML with CEBPA mutations in a cohort of 1182 
cytogenetically normal AML patients: further evidence for CEBPA double mutant 
AML as a distinctive disease entity. Blood 117, 2469-2475, doi:10.1182/blood-
2010-09-307280 (2011). 

 
72  Wouters, B. J. et al. Double CEBPA mutations, but not single CEBPA mutations, 

define a subgroup of acute myeloid leukemia with a distinctive gene expression 
profile that is uniquely associated with a favorable outcome. Blood 113, 3088-
3091, doi:10.1182/blood-2008-09-179895 (2009). 

70 

 
71 

 
73 

 
74 

 
75 

 
76 

 
77 

 
78 

 
80 

 

Figueroa, M. E. et al. Genome-wide epigenetic analysis delineates a biologically 
distinct immature acute leukemia with myeloid/T-lymphoid features. Blood 113, 
2795-2804, doi:10.1182/blood-2008-08-172387 (2009). 

Klein, H. U. et al. Quantitative comparison of microarray experiments with 
published leukemia related gene expression signatures. BMC Bioinformatics 10, 
422, doi:10.1186/1471-2105-10-422 (2009). 

Luck, S. C. et al. Deregulated apoptosis signaling in core-binding factor leukemia 
differentiates clinically relevant, molecular marker-independent subgroups. 
Leukemia 25, 1728-1738, doi:10.1038/leu.2011.154 (2011). 

Opel, D. et al. Targeting inhibitor of apoptosis proteins by Smac mimetic elicits 
cell death in poor prognostic subgroups of chronic lymphocytic leukemia. Int J 
Cancer 137, 2959-2970, doi:10.1002/ijc.29650 (2015). 

Cao, Q. et al. BCOR regulates myeloid cell proliferation and differentiation. 
Leukemia 30, 1155-1165, doi:10.1038/leu.2016.2 (2016). 

Li, L. et al. Altered hematopoietic cell gene expression precedes development of 
therapy-related myelodysplasia/acute myeloid leukemia and identifies patients at 
risk. Cancer Cell 20, 591-605, doi:10.1016/j.ccr.2011.09.011 (2011). 

 
79  Warren, H. S. et al. A genomic score prognostic of outcome in trauma patients. 

Mol Med 15, 220-227, doi:10.2119/molmed.2009.00027 (2009). 

Karlovich, C. et al. A longitudinal study of gene expression in healthy 
individuals. BMC Med Genomics 2, 33, doi:10.1186/1755-8794-2-33 (2009). 

154 

 

Kong, S. W. et al. Characteristics and predictive value of blood transcriptome 
signature in males with autism spectrum disorders. PLoS One 7, e49475, 
doi:10.1371/journal.pone.0049475 (2012). 

Sharma, S. M. et al. Insights in to the pathogenesis of axial spondyloarthropathy 
based on gene expression profiles. Arthritis Res Ther 11, R168, 
doi:10.1186/ar2855 (2009). 

Rosell, A. et al. Brain perihematoma genomic profile following spontaneous 
human intracerebral hemorrhage. PLoS One 6, e16750, 
doi:10.1371/journal.pone.0016750 (2011). 

Schmidt, S. et al. Identification of glucocorticoid-response genes in children with 
acute lymphoblastic leukemia. Blood 107, 2061-2069, doi:10.1182/blood-2005-
07-2853 (2006). 

Tasaki, S. et al. Multiomic disease signatures converge to cytotoxic CD8 T cells 
in primary Sjogren's syndrome. Ann Rheum Dis 76, 1458-1466, 
doi:10.1136/annrheumdis-2016-210788 (2017). 

Leday, G. G. R. et al. Replicable and Coupled Changes in Innate and Adaptive 
Immune Gene Expression in Two Case-Control Studies of Blood Microarrays in 
Major Depressive Disorder. Biol Psychiatry 83, 70-80, 
doi:10.1016/j.biopsych.2017.01.021 (2018). 

Shamir, R. et al. Analysis of blood-based gene expression in idiopathic Parkinson 
disease. Neurology 89, 1676-1683, doi:10.1212/WNL.0000000000004516 (2017). 

Clelland, C. L. et al. Utilization of never-medicated bipolar disorder patients 
towards development and validation of a peripheral biomarker profile. PLoS One 
8, e69082, doi:10.1371/journal.pone.0069082 (2013). 

Ducreux, J. et al. Interferon alpha kinoid induces neutralizing anti-interferon 
alpha antibodies that decrease the expression of interferon-induced and B cell 
activation associated transcripts: analysis of extended follow-up data from the 
interferon alpha kinoid phase I/II study. Rheumatology (Oxford) 55, 1901-1905, 
doi:10.1093/rheumatology/kew262 (2016). 

Lauwerys, B. R. et al. Down-regulation of interferon signature in systemic lupus 
erythematosus patients by active immunization with interferon alpha-kinoid. 
Arthritis Rheum 65, 447-456, doi:10.1002/art.37785 (2013). 

Xiao, W. et al. A genomic storm in critically injured humans. J Exp Med 208, 
2581-2590, doi:10.1084/jem.20111354 (2011). 

81 

 
82 

 
83 

 
84 

 
85 

 
86 

 
87 

 
88 

 
89 

 
90 

 
91 

 

155 

 

Zhou, B. et al. Analysis of factorial time-course microarrays with application to a 
clinical study of burn injury. Proc Natl Acad Sci U S A 107, 9923-9928, 
doi:10.1073/pnas.1002757107 (2010). 

92 

 
 

156 

 

Chapter 4 – Summary and Outlook

 

Conclusion 

In this dissertation, we aimed to establish sex-related and age-dependent DE genes,
together with their gene expression patterns and associated signaling pathways, as
biomarkers in AML. Our approach utilized machine learning methods, which led to the
development of a graphical user interface to facilitate model training and testing for
classification (Chapter 2). Subsequently, we carried out 3 gene expression
meta-analyses and gene enrichment analyses on publicly available gene expression data,
accumulated from 2,761 subjects (2,213 AML patients and 548 healthy individuals). We
analyzed a total of 44,754 probe sets (corresponding to multiple genes) per subject.
Multiple statistical methods for microarray analysis were used to pre-process the raw
data, and we also implemented a “data-wise” batch effect correction to correct for
batch effects caused by study variability and sample processing. Following
normalization and batch effect correction across arrays, we used a statistical linear
model to study the effects of age and sex on gene expression in AML patients as
compared to healthy individuals (Chapter 3). Three downstream differential gene
expression analyses were carried out:

Analysis 1: Gene expression meta-analysis and associated signaling pathways of AML
disease state compared to healthy individuals. From this analysis we identified 964
unique DE genes (974 DEPS), including 56 unique DE genes (27 up- and 29
down-regulated) that were associated with 4 statistically significant KEGG pathways:
Hematopoietic cell lineage, Cell cycle, p53 signaling pathway, and Transcriptional
misregulation in cancer. Multiple genes identified do not have known associations with
AML and signaling pathways, and can provide new avenues of investigation and novel
hypothesis-driven mechanistic studies.

 

Analysis 2a: Sex-dependent gene expression meta-analysis and associated signaling
pathways in AML compared to healthy individuals. From this analysis we identified 70
DEPS (69 unique DE genes) that overlapped with analysis 1 (AML disease state). Four
sex-relevant DE genes were found in 3 different signaling pathways: FLT3 and CD34 in
Hematopoietic cell lineage, FLT3 in Transcriptional misregulation in cancer, and PMAIP1
in p53 signaling pathway, as well as the down-regulated gene MS4A1 in Hematopoietic
cell lineage.

 

Analysis 2b: Age-dependent gene expression meta-analysis and associated signaling
pathways in AML compared to healthy individuals. From this analysis we found 372 unique
age-dependent DE genes (375 DEPS) that overlap with the 964 unique DE genes (974 DEPS)
from our AML disease state meta-analysis (Chapter 3, Fig. 19a), with 137 up- and 238
down-regulated DEPS. We also found 25 DE genes common across a baseline (0 to 19)
age-group (Chapter 3, Fig. 19g), of which 15 were identified as potential therapeutic
drug targets, and 75 genes specific to one age-group (Chapter 3, Fig. 19d), of which 24
were categorized as “druggable genome”.
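The analyses summarized above rest on testing, for each probe set, whether mean expression differs between groups of subjects. As a deliberately simplified illustration of that idea, the Python sketch below computes a one-way analysis of variance F statistic for a single probe set and a single factor (disease state). The linear model used in Chapter 3 accounts for several factors at once, so this is a sketch under stated assumptions rather than the model itself; the function name and the expression values are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical log2 expression values of one probe set in two groups.
healthy = [5.0, 5.2, 4.8]
aml = [7.0, 7.1, 6.9]
f_statistic = one_way_anova_f([healthy, aml])
```

A large F statistic indicates that the difference between group means is large relative to the variation within groups; a p-value would then be obtained from the F distribution with (k - 1, n - k) degrees of freedom and adjusted for multiple testing across probe sets.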

 

Finally, we combined our results with a machine learning model (KNN model) and
implemented supervised machine learning for classification training. We tested our
model using 5 independent gene expression datasets (613 AML and healthy). Using our
trained model, we were able to classify AML patients compared to healthy individuals
with > 90% accuracy. Overall, our findings provide a new reanalysis of public datasets
that enabled the identification of potential new gene sets relevant to AML, which can
potentially be used in future experiments and in possible stratified disease
diagnostics.
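For readers unfamiliar with the KNN model mentioned above, the following Python sketch shows the core of k-nearest-neighbors classification: a new expression profile receives the majority label among the k closest training profiles, and accuracy is the fraction of held-out profiles labeled correctly. It is a from-scratch toy using only the standard library, not the implementation used in this work; the two-gene profiles, the choice of k = 3, and the function name are assumptions made for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    # Rank training samples by Euclidean distance to the query profile.
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-gene expression profiles (log2 scale) with known labels.
train_x = [(7.0, 2.0), (6.8, 2.3), (7.2, 1.9),
           (3.0, 5.0), (3.2, 5.1), (2.9, 4.8)]
train_y = ["AML", "AML", "AML", "healthy", "healthy", "healthy"]

# Held-out profiles standing in for an independent test dataset.
test_x = [(6.9, 2.1), (3.1, 5.2)]
test_y = ["AML", "healthy"]
predictions = [knn_predict(train_x, train_y, x) for x in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
```

Keeping the test profiles separate from the training profiles mirrors the use of independent datasets for testing described above.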

 

Outlook 

While our results and analyses have identified important gene expression signatures
relevant to AML, and many potential new drug-gene targets, our findings also raise
questions that should be considered in the future, including: i) associations between
age-groups and changes in gene expression across different AML subgroups, to help
improve AML risk stratification, and ii) age-dependent pseudo time-series models to
identify changes in gene expression with more specific AML patient age and sex.
However, these questions and analyses would require many more well-annotated AML
patient gene expression data that are currently unavailable, particularly given the
heterogeneity of the disease. We hope new studies will address this in the future, and
that our findings will lead to new findings that will help our understanding of AML,
and ultimately improve disease diagnosis, prognosis and treatment.
