COMPUTATIONAL MODELING OF GENOME-WIDE DNA BINDING AND PROTEIN INTERACTIONS BY THE ARYL HYDROCARBON RECEPTOR By David Filipovic A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Biomedical Engineering – Doctor of Philosophy Computational Mathematics, Science and Engineering – Dual Major 2023 ABSTRACT The aryl hydrocarbon receptor (AhR) is a ligand inducible transcription factor (TF) with multiple endogenous and exogenous ligands. AhR regulates many cellular processes including differentiation, development, and xenobiotic metabolism. Among its exogenous ligands 2, 3, 7, 8 tetrachlorodibenzo-p-dioxin (TCDD) is its most potent inducer. Upon ligand binding, inactive cytosolic AhR undergoes a conformational change ultimately leading to its nuclear localization. Within the nucleus, AhR is thought to primarily dimerize with AhR nuclear translocator (ARNT) to form a functional TF which binds to DNA at dioxin response elements (DREs) and regulates transcription of AhR target genes. Most DREs in accessible chromatin are not bound by AhR, and DREs accessible in multiple cell lines or type can be bound in some and unbound in others. Still, since AhR possesses a strong core binding motif 5'-GCGTG-3', it is suited for a motif-centered analysis of its binding. To investigate determinants of AhR binding I developed interpretable machine learning models predicting the binding status of DREs in MCF-7, GM17212, HepG2 cells, and primary human hepatocytes. I conclude that AhR binding is driven by a complex interplay of cell-agnostic DRE flanking sequence and cell-specific local chromatin context. On the other hand, AhR can bind DNA in absence of ARNT. Both, RelA and KLF6 have been shown to physically interact with AhR and together drive the activation of several genes. For example, the activation of 1) c-myc in breast cancer and 2) PAI-1, p21cip1, and E-cadherin genes is driven by AhR interacting with RelA and KLF6, respectively. However, it is unknown if these interactions with AhR occur genome-wide or if they are localized to a small number of genes. I developed a computational method to investigate protein-protein interactions at AhR-bound sites. Results confirm ARNT as the main dimerization partner of AhR genome-wide in TCDD-exposed MCF-7 cells. By contrast, in untreated HepG2 cells, KLF6 and RelA but not ARNT were the main dimerization partners of AhR. These findings indicate that the role of AhR is likely ligand- dependent and can potentially be explained through dimerization with different partners. Copyright by DAVID FILIPOVIC 2023 This thesis is dedicated to all who love me and whom I loved. v ACKNOWLEDGMENTS I would like to acknowledge and offer my kindest thanks to my advisor and mentor, Dr. Sudin Bhattacharya. I have learned invaluable research and life lessons under the kind tutelage of Dr. Bhattacharya. Thank you for providing me with this amazing opportunity. I appreciate your patience and understanding over the many years we have worked together. I would also like to thank the esteemed members of my committee, Dr. Adam Alessio, Dr. Christopher Contag and Dr. Jianrong Wang. Working with you was such a pleasure and I definitely could not have made it as far as I have without your kind guidance and insightful advice. Dr. Arjun Krishnan, it was an honor having you on my committee. I was sad to see you move, but I am happy to see the next chapter of your life unfold. Thank you all! Additionally, I would like to thank Dr. Rory Connolly, Dr. Almudena Veiga-Lopez, and Dr. Suresh Cuddapah for providing invaluable mentorship and support. Dr. Connolly, thank you for teaching me the ins and outs of pharmacokinetic modeling. Dr. Veiga-Lopez, thank you for teaching me responsibility and accountability, and for providing an amazing role model of a compassionate principal investigator. I have grown immensely for having crossed paths with you and I will be forever grateful. Dr. Cuddapah, thank you for your precise, pinpoint critiques and comments. Even today, I continue to learn from your teachings. Next, I would like to thank my family and friends for being there with me every step of the way. Special thanks to Evran Ural. Thank you for being there for me! I have gone through some very hard times, but they were all the more easier because you were there to support me. I hope you know how much I appreciate you. Thank you Dr. Wenjie Qi for being an amazing friend and always being there to hear me out. Thank you Dr. Xander Farnum, it was real. Thank you, Dr. Chima Maduka, it was so fun living with you, and I miss talking science with you daily. vi TABLE OF CONTENTS LIST OF ABBREVIATIONS ...................................................................................................... viii CHAPTER 1: INTRODUCTION ................................................................................................... 1 CHAPTER 2: AHR BINDING PREDICTION .............................................................................. 6 INTRODUCTION .............................................................................................................. 6 MATERIALS AND METHODS ...................................................................................... 11 RESULTS ......................................................................................................................... 16 CHAPTER 3: AHR BINDING PARTNERS ............................................................................... 56 INTRODUCTION ............................................................................................................ 56 MATERIALS AND METHODS ...................................................................................... 57 RESULTS ......................................................................................................................... 60 CHAPTER 4: BISPHENOL A AND BISPHENOL S PREGNANCY-SPECIFIC PHYSIOLOGICALLY-BASED TOXICOKINETIC MODELS ................................................. 80 INTRODUCTION ............................................................................................................ 80 MATERIALS AND METHODS ...................................................................................... 82 RESULTS ......................................................................................................................... 94 CHAPTER 5: DISCUSSION AND CONCLUSIONS ............................................................... 100 BIBLIOGRAPHY ....................................................................................................................... 106 vii LIST OF ABBREVIATIONS 3D – three dimensional 3-MC – methylcholanthrene ADME – administration, distribution, metabolism, and elimination AhR – aryl hydrocarbon receptor AhRE – aryl hydrocarbon response element AIP – AhR interacting protein ALDH – aldehyde dehydrogenase ARNT – aryl hydrocarbon receptor nuclear translocator auROC – area under receiver operating characteristic curve auPRC – area under precision recall curve B[a]P – benzo[a]pyrene bHLH – basic helix-loop-helix BPA – bisphenol A BPS – bisphenol S ChIP – chromatin immunoprecipitation CYP – cytochrome P450 DHS – DNase hypersensitive site DMSO - dimethylsufoxide DNA – deoxyribonucleic acid DNase-seq – DNase I hypersensitivity assay followed by sequencing DRE – dioxin response element EDC – endocrine disrupting chemical viii EGFR – epithelial growth factor receptor EMSA – electrophoretic mobility shift assay ERK - extracellular signal-regulated kinases GD – gestational day FP – false positive FPR – false positive rate GST - glutathione S-transferase HM – histone modification IV - intravenous KLF6 – Krüppel-like factor 6 NF-kB – nuclear factor kB NTS – nuclear translocation signal PAH – polycyclic/polyhalogenated aromatic hydrocarbon PAS – PER-ARNT-SIM PBTK – physiologically based toxicokinetic modeling PER - period PRC – precision recall QSAR – quantitative structure-activity relationship ROC – receiver operating characteristic SIM – single-minded SP1 – specificity protein 1 SRC - Proto-oncogene tyrosine-protein kinase Src TAD – transactivation domain ix TCDD – 2, 3, 7, 8 tetrachlorodibenzo-p-dioxin TF – transcription factor TP – true positive TPR – true positive rate VC – vehicle control XRE – xenobiotic response element XAP2 – hepatitis B virus X-associated protein x CHAPTER 1: INTRODUCTION Proteins are one of the main building blocks of cells. They are comprised of a chain of smaller units called amino acids that are folded into a functional 3D structure. The precise 3D structure of a protein is crucial to facilitate the performance of its specific cellular functions including enzymatic actions and involvement in providing and maintaining cellular structure (1). Transcription factors (TFs) are proteins that bind to DNA and regulate the transcription of genes, by either promoting or interfering with the recruitment of cellular transcription machinery (2). Some TFs are only activated by a ligand without which they are kept sequestered in the cytosol, and without which they do not bind to DNA nor actively promote transcription (3). Ligand binding determines the activity of such TFs. In that sense, ligands can be both agonist – transforming the TF into an active or DNA binding form; and antagonist – transforming the TF into an inactive form (4). The aryl hydrocarbon receptor (AhR) is a ligand activated transcription factor belonging to the basic helix-loop-helix (bHLH) PER-ARNT-SIM (PAS) superfamily of TFs which act as sensors of both internal and external cellular environments (5). The existence of the AhR was hypothesized as early as 1976 by Poland et al. (6). Early research on the AhR uncovered its role as a xenobiotic sensor that binds exogenous ligands - xenobiotics in the class of polyhalogenated and polycyclic aromatic hydrocarbons (PAHs), chief among them being 2, 3, 7, 8 tetrachlorodibenzo-p-dioxin (TCDD) (7). More recently it has been shown that the AhR can bind endogenous ligands as well. For example, certain tryptophan derivatives – such as kynurenine, tetrapyrroles, and metabolites of arachidonic acid (8–10). The structure of the AhR protein can be broken down into five main structural domains: the basic domain, two PAS domains, HLH domain and the transactivation domain (TAD). The 1 basic domain plays a role in DNA binding by the AhR. The two PAS domains - PAS A and PAS B, together with the HLH domain, play a role in dimerization with the canonical AhR dimerization partner – AhR nuclear translocator (ARNT). AhR ligands bind within the PAS B domain. Finally, the C-terminal of AhR contains the transcription activation domain (TAD), which is composed of an acidic, a Q-rich, and P/S/T subdomains. The TAD domain is responsible for the recruitment of co-activators and co-repressors, resulting in the activation or repression of gene expression, respectively (11–13). Activation of AhR by its exogenous ligands underlies its role as a xenobiotic sensor. The AhR signaling pathway resulting from such activation is referred to as the canonical AhR pathway. Prior to ligand activation, the AhR is localized to the cytosol where it is maintained in its inactive form through binding to its co-chaperone proteins. These proteins include hepatitis B virus X- associated protein (XAP2) – also known as the AhR interacting protein (AIP), a dimer of heat shock protein 90 (HSP90), prostaglandin E synthase 3 (p23), and protein kinase SRC. Upon ligand binding, the AhR is released from its co-chaperones, exposing its nuclear translocation signal (NTS). Subsequently, the AhR translocates to the nucleus where it forms a heterodimer complex with the AhR Nuclear Translocator (ARNT). The AhR-ARNT complex modulates the expression of its target genes by binding to DNA at specific dioxin response elements (DREs), also known as xenobiotic response elements (XREs) and defined by the core DNA sequence 5'-GCGTG-3' (5). This pathway (illustrated in Figure 1) fits within the initially discovered role of AhR as a xenobiotic sensor regulating the adaptive metabolic response. Many target genes in this pathway are xenobiotic metabolizing genes, such as phase I metabolic enzymes, namely cytochrome P450 1A1 (CYP1A1), CYP1A2, CYP1B1, as well as phase II metabolic enzymes, namely glutathione S- transferase (GST), and aldehyde dehydrogenase 3a1 (ALDH3A1) (14, 15). These metabolic 2 enzymes play a role in detoxifying xenobiotics, but can sometimes produce reactive metabolites, such as in the case of benzo[a]pyrene (B[a]P) (16). On the other hand, the AhR co-chaperone SRC participates in the non-genomic mechanisms of AhR signaling, where its disassociation from the activated AhR can result in the activation of other pathways, namely ERK1/2 and EGFR (17). Figure 1 – canonical AhR pathway adapted from (18). In addition to its role as a xenobiotic sensor, over time many other functions of the AhR have been discovered, such as its roles in differentiation (19), development (20), cancer (21), circadian rhythm (22), cell cycle progression (23), and immunity (24). In general, the AhR can be thought of as an integrator of dietary, metabolic, microbial, and environmental cues that initiates fine-tuned and selective transcriptional programs. These programs can be ligand-specific, cell type-specific and even context-specific (5). 3 The AhR appears to bind exclusively to DREs in an in vitro setting, i.e., on naked DNA outside of a cell nucleus. When examining the DNA binding of AhR in vivo, by looking at publicly available AhR binding experiments, I observed that most DREs in accessible chromatin are not bound by AhR, and DREs accessible in multiple cell lines or cell types can be bound in some and unbound in others. Nevertheless, the fact that AhR possesses a strong core binding motif 5'- GCGTG-3' – the DRE, facilitates a motif-centered analysis of its binding. To investigate the molecular determinants of AhR binding I developed interpretable machine learning models predicting the binding status of DREs in MCF-7, GM17212, HepG2 cells, and primary human hepatocytes. My results indicate that AhR binding is driven by a complex interplay of cell-agnostic DRE flanking sequence and cell-specific local chromatin context. Aside from ARNT, AhR has been shown to bind DNA by interacting with other transcription factors, both with and without ARNT. For instance, the AhR-ARNT heterodimer interacts with the specificity protein 1 (SP1) via the AhR-ARNT HLH/PAS domains and SP1 zinc finger domains. AhR-ARNT and SP1 synergistically enhance the transcription of the CYP1A1 gene by binding to their cognate binding motifs in the promoter of CYP1A1. The DRE and the GC-rich binding motif of SP1 in the promoter of CYP1A1 were shown to partially overlap (25). On the other hand, the AhR was shown to interact with several other TFs without ARNT. For example, TCDD activated AhR interacts with the retinoblastoma tumor suppressor protein (pRb) without ARNT to induce G1 cell cycle arrest. The pRb appears to preferentially associate with the ligand-bound form of AhR (26). Additionally, the RelA subunit of nuclear factor-kB (NF-kB) interacts with the AhR to activate the transcription of c-myc and IL-6 genes by binding together in their promoters (27, 28). The transcription of PAI-1, p21cip1, and E-cadherin genes is thought to be driven by AhR interacting with the Krüppel-like factor 6 (KLF6) (29). 4 However, these ARNT independent AhR-protein interactions have only been confirmed at a limited number of AhR bound loci. It is unknown if these protein interactions with AhR could be occurring genome wide. To address this question, I developed a computational method to investigate AhR-TF interactions at AhR-bound sites. My results confirm ARNT as the main dimerization partner of AhR genome-wide in TCDD-exposed MCF-7 cells. By contrast, in untreated HepG2 cells, KLF6 and RelA but not ARNT were the main dimerization partners of AhR. These findings indicate that the role of AhR is likely ligand-dependent and can potentially be explained through dimerization with different proteins. Additionally, unrelated to computational modeling of AhR binding and dimerization, I have developed a set of physiologically based toxicokinetic (PBTK) models for bisphenol A and S (BPA and BPS) in pregnant sheep (30). These chemicals are often used in the manufacturing of polycarbonate plastics, epoxy resins, dental sealants, and plastic and paper consumer products (31, 32). They are also known endocrine disruptors and can be found pervasively in the environment (32, 33). The PBTK models I developed and calibrated against available toxicokinetic data (34–36), demonstrated that BPS exhibited a higher potential for accumulation in the fetus with repeated daily maternal exposure. 5 CHAPTER 2: AHR BINDING PREDICTION INTRODUCTION The expression of genes is governed principally through the process of transcriptional regulation. This process represents the main mechanism by which crucial cellular processes such as differentiation, development, and response to exogenous stimuli are coordinated (37). Transcriptional regulation occurs in large part through direct or indirect binding of transcription factors (TFs) to DNA, and through the interactions of these TFs with transcriptional machinery (38). Altering the expression, function, or the DNA-binding ability of even a single TF can result in changes in expression of hundreds to thousands of genes (39, 40). Further, the removal of a single TF binding site, e.g., through an experimental procedure such as promoter bashing or targeted mutagenesis, often results in altered gene expression (41, 42). The problem of experimentally identifying TF binding sites across the genome is further complicated by the fact that TF binding is often highly tissue- and cell type- specific (37). The problem of computationally predicting DNA binding sites genome-wide for a particular TF, in a cell type- or tissue-specific manner, can be likened to finding the proverbial needle in a haystack. This problem is particularly difficult for TFs with a short core DNA binding motif such as the aryl hydrocarbon receptor (AhR). The AhR binds to a core 5-base pair (bp) sequence, 5'-GCGTG-3', referred to as the dioxin response element (DRE) (43, 44). Such short binding motifs occur millions of times in the human genome. However, a typical chromatin immunoprecipitation followed by sequencing (ChIP-seq) binding assay for AhR produces a list of AhR bound regions on the order of a few hundred to a few thousand regions. This discrepancy could be partially explained by the fact that most genomic DREs lie in inaccessible regions of the genome in a particular cell line or type. Additionally, the nucleotides flanking the core motif on 6 both 5' and 3' ends are suspected to form an extended active AhR binding site (45). However, the manner in which the exact identity of these nucleotides affects AhR binding is currently unknown. Further, these AhR bound regions have anywhere between zero and 29 DREs (46, 47). Therefore, the occurrence of a DRE in the genome is neither sufficient nor necessary to induce AhR binding. The problem of predicting AhR-DNA binding is additionally complicated by the fact that it can be hard to distinguish direct DNA binding from indirect binding through tethering with other TFs or through 3D looping of chromatin (48). Considered together, these findings indicate that both genomic and epigenomic characteristics likely play a role in determining AhR binding in vivo. The AhR is a ligand-activated transcription factor (TF) in the basic-helix-loop-helix (bHLH) PER-ARNT-SIM (PAS) family of TFs (49, 50). The AhR can be activated by both endogenous and exogenous ligands (5, 51). In the later class of ligands, the environmental pollutant 2, 3, 7, 8 tetrachlorodibenzo-ρ-dioxin (TCDD) is the prototypical AhR ligand (7, 51). Exposure to TCDD activates the xenobiotic response pathway of AhR. Initially, the AhR is constrained in the cytosol of the cell through binding with its co-chaperone proteins. These proteins include a dimer of the 90-kDa heat shock protein (HSP90) (52, 53), the AhR-interacting protein (AIP) (54, 55), the cochaperone protein p23 (56), and SRC (17). When bound by its ligand, AhR releases from its co-chaperones and translocates to the nucleus where it forms a heterodimer with the AhR Nuclear Translocator (ARNT) (57, 58). The AhR-ARNT heterodimer binds to DNA sequences containing the consensus 5'-GCGTG-3' core binding motif (59, 60). This binding motif has been named variously the xenobiotic response element (XRE), aryl hydrocarbon response element (AhRE), or dioxin response element (DRE) (61). In this thesis I will use the name DRE when to referring to AhR binding motif within potential AhR DNA binding sites. 7 The best understood function of the AhR is the direct regulation of its target genes, chief among them the cytochrome p450 1A1 (CYP1A1), 1A2 (CYP1A2) and 1B1 (CYP1B1). The AhR regulates these genes by binding to DREs in their proximal promoters or, potentially, distal enhancers (24, 62, 63). The first step towards reconstructing the AhR-mediated gene regulatory network is the accurate, cell type-specific identification of AhR binding sites. The construction of these gene regulatory networks is crucial for improving our understanding of the role of the AhR in xenobiotic-induced toxicity and disease, as well as in crucial physiological functions. These include the immune response (24), circadian rhythm (22), cell cycle progression (23), and embryonic development (20). Significant progress has been made with the development of high throughput molecular techniques for identification of TF-bound DNA fragments. These techniques often use a method for the enrichment of TF bound DNA complexes followed by sequencing of the enriched DNA fragments. Techniques such as ChIP-seq (64), ChIP-exo (65) and ChIP-nexus (66) have enabled a genome-wide view of TF binding. Over time, the binding of hundreds of TFs in multiple cell lines, primary cells, and whole tissues has been investigated genome-wide and the results of these experiments have been made publicly available. Likewise, DNA binding of the AhR has been probed in several human cell lines and primary cells. Nonetheless, the determinants of cell-specificity of AhR binding remain poorly understood. Recent years have seen the development of many computational approaches for genome- wide prediction of TF binding. The most widely used methods leverage the position weight matrix (PWM) corresponding to the TF of interest. A PWM is a statistical and quantitative representation of known and experimentally confirmed DNA binding sites for a TF of interest. The PWM effectively makes up the binding motif of the TF. PWMs are available in online databases such as TRANSFAC and JASPAR. These PWMs have been derived from experimental data and can also 8 be estimated de novo if binding data is available (67, 68). A PWM is used to calculate a score for each potential binding site as a sum of individual scores of each nucleotide making up the PWM and overlapping the potential binding site. PWMs are then used to scan the genome for TF binding sites, using a previously derived optimal threshold score as the cutoff to predict TF-bound sites (69, 70). PWMs are commonly derived from in vitro experiments. The most often used in vitro experiment is the high throughput systematic evolution of ligands by exponential enrichment (HT- SELEX) (71). Occasionally, PWMs are also derived from in vivo experiments such as ChIP-seq. However, when examining the binding of TFs in vivo, it is often noted that many TFs are bound to DNA sequences that do not possess the in vitro or even the in vivo derived binding motif (72). Eukaryotic TFs generally do not bind DNA in isolation but rather in dense, often tissue- specific, TF clusters. These clusters are characterized by the co-location of the TF binding sites for multiple different types of TFs in relatively short genomic regions (73, 74). Consequently, it is reasonable to assume that PWMs of co-bound TFs could be used to predict the binding of a TF of interest. Still, models that use PWMs of co-binding TFs have shown limited utility in improving model performance, with models of certain TFs seeing little to no improvement (72). Even so, given that TFs bind in dense clusters and that PWMs are not always representative of actual TF binding, I hypothesized that ChIP-seq signals of co-bound TFs, as a measure of their actual binding, could provide the information that PWMs could not. Further, I propose that interpretable machine learning combined with the measures of co-bound TFs would provide mechanistic insights into the molecular mechanisms underlying the cell specificity of AhR binding. TF binding prediction models based on PWMs have been extended over time to include other biological features demonstrated to be associated with TF binding, such as chromatin accessibility, histone modifications, evolutionary sequence conservation, PWMs of co-bound TFs, 9 and gene expression (72, 75). Similarly, a broad range of statistical and machine learning models ranging from unsupervised Bayesian mixture models (75) to deep learning (76–78) have been used to address the problem of tissue-specific TF binding prediction. Despite some of these models achieving high cross-tissue performance for select few TFs, most of them lack interpretability and do not translate into mechanistic insights. Most computational models predicting TF binding have been applied to constitutively active TFs. The binding of inducible TFs, on the other hand, remains largely computationally unexplored. In this chapter of the thesis, I applied a supervised machine learning algorithm, XGBoost (79), and developed machine learning models predicting the AhR binding status of DREs in open chromatin of a particular cell line or type. These models were trained to predict DREs in open chromatin, as either bound or unbound, and were applied to four cell lines and one primary cell type: two human breast cancer cell lines (MCF-7 and T-47D) (46, 47), primary human hepatocytes (80), human hepatocellular carcinoma cell line (HepG2) – data obtained from the ENCODE project (81, 82), and lymphoblastoid cell line (GM17212) (83). The cells in these experiments were treated with either TCDD, Methylcholanthrene (3-MC; an AhR ligand) or Dimethyl sulfoxide (DMSO; vehicle control) for a duration of either 45 minutes, 1 hour or 24 hours. By using these datasets and chromatin accessibility experiments corresponding to the cell line or type used, I first identified cell line and type-specific AhR- bound and unbound DREs in open chromatin. Then, I developed machine learning models that predict the binding status of DREs in open chromatin for each cell line or type individually. My results demonstrate highly accurate and robust models of within-cell line or cell type binding. I identified several TFs as predictive of AhR binding in individual cell lines or types, such as GATA3 in MCF-7 cells, MXI1 in HepG2 cells, and SP1 in primary human hepatocytes and GM17212 cells; as well as histone 10 modifications (HMs) – H3K4me1 and H3K4me3 in MCF-7 cells, H3K4me3 and H3K27ac in primary hepatocytes, and H3K27ac in GM17212 cells. My cell-specific models generalize well to the prediction of AhR binding sites without DREs, demonstrating the robustness of the models. In conclusion, I demonstrated that the patterns of TFs and HMs most predictive of AhR binding are consistent within cell lines or types but highly variable across them, which is suggestive of potentially different underlying cell-specific mechanisms of AhR binding. Additionally, I show that AhR binding is driven by a complex interplay of cell-agnostic DNA sequence flanking the DRE and cell-specific local chromatin context. The approach used here can be adapted to other inducible TFs, such as steroid hormone and nuclear receptors. MATERIALS AND METHODS Reference genome Unless otherwise specified the reference, genome used for sequence alignment in this part of the thesis was the human genome assembly version hg19. I opted for hg19 due to availability of data on ChIP-seq and DNase-seq data repositories such as GEO Datasets (84) ChIP-Atlas (85) and ENCODE (81, 82). Likewise, most other transcription factor binding prediction tools available at this time were trained on hg19. Visualization of ChIP-seq signal Bigwig files were used as inputs to deepTools version 3.5.1 (86) for visualization. DeepTools plotHeatmap function was used to create visualizations of ChIP-seq signal fold enrichment within a −1.5 to +1.5 kb region around the bound and unbound dioxin response elements (DREs), as well as to generate average profiles for ChIP-seq enrichment in the same region. 11 DREs in open chromatin I obtained DNase-seq data for all relevant cell lines (MCF-7, T-47D, primary hepatocytes, HepG2, GM12878) from ENCODE - https://encodeproject.org/. I downloaded the broadPeak DNase-seq files for the hg19 genome assembly, and if there were multiple replicates, I found the intersection of all replicates. Any DRE found under the peaks of DNase-seq intersection was considered to be in the open chromatin of the corresponding cell line and was used in the determination of bound and unbound DREs for the purposes of model training. DREs occurring in ENCODE blacklisted regions, namely the merged consensus blacklist (wgEncodeHg19ConsensusSignalArtifactRegions.bed.gz) and exclusion list regions (ENCODE accession ENCFF001TDO) were ignored in downstream analyses. AhR-bound and unbound DREs Firstly, I assembled a list of all DREs in the human genome by searching the hg19 human reference genome sequence for the occurrences of the core DRE sequence 5'-GCGTG-3' on either strand of the DNA. Only DREs in open chromatin, i.e., DREs overlapping DNase-seq broadPeaks from an ENCODE experiment for a given cell line, were considered for training. Additionally, bound DREs in closed chromatin were also considered for testing purposes. Secondly, I obtained the AhR ChIP-seq bed and bigwig files either from Gene Expression Omnibus (GEO) Datasets or from ChIP-Atlas (85) where the original sequencing files have been processed uniformly following a standard processing pipeline. Originally, AhR ChIP-seq data was generated in the following independent experiments, 1) AhR and ARNT ChIP-seq of MCF-7 cells treated with 10 nM TCDD for 45 minutes (47) – the binding data was obtained from GEO Datasets – accession GSE41820, 2) AhR and AhRR ChIP-seq of MCF-7 cells treated with 10 nM TCDD for 45 minutes and 24 hours (46) - the binding data was obtained from GEO Datasets – accession GSE90550, 3) AhR 12 ChIP-chip of T-47D cells treated with 1 µM 3-MC or 10 nM TCDD for 1 hour (87, 88) - the binding data was obtained from their respective publications and converted from hg18 to hg19 using the liftOver tool (89) 4) AhR ChIP-seq of primary hepatocytes treated with 1nM of TCDD for 24 hours – the binding data was obtained from GEO Datasets – accession GSE205502; 5) AhR (3xFLAG tagged AhR) ChIP-seq of untreated HepG2 cells from ENCODE – accession ENCSR412ZDC (81, 82); 6) AhR ChIP-seq of GM17212 cells treated with 1 µM 3-MC for 24 hours – accessible through GEO Datasets – accession GSE116632; however, the binding data was obtained from ChIP-Atlas – accessions SRX4342282, SRX4342283, SRX4342285, and SRX4342286. Details of these experiments are summarized in Table 1. Bound DREs for the purposes of model training were determined as DREs found in open chromatin and under AhR peaks where only one DRE was present under the AhR peak (referred to as singleton DREs). Isolated unbound DREs are DREs in open chromatin found at least 500 bps away from the boundary of any AhR peak, as well as 100 bps away from any other DRE. These DREs were selected as unbound for model training in order to minimize confounding of DRE contribution to binding. All other DREs in open chromatin were considered ambiguous and were not used in model training. Promoters and enhancers I obtained all annotated transcription start sites from Ensembl 105 BioMart (human genes; GRCh38.p13) (90) and considered regions ±200 and ±1500 bp around the TSS as stringent and relaxed promoters, respectively. I obtained all computationally predicted enhancers from ChromHMM (91) for samples that had ChromHMM data available – HepG2 and GM12878 (ENCODE) and MCF-7 (GEO Datasets – accession GSE57498). Both weak and strong enhancers (ChromHMM states 4 through 7) were considered as valid enhancers. 13 Sequence and genomic signal features For each DRE in the human genome, I obtained the genomic sequence of seven nucleotides 5' upstream and 3' downstream from the DRE (5'-GCGTG-3') from the hg19 human reference genome. These nucleotides were one-hot encoded and used as features in my machine learning models. In total there were around 1.6 million DREs spread across the human genome. However, only a small fraction of them fulfilled the criteria for bound and unbound DREs used in training and testing. DNase-seq, as well as all available histone mark and transcription factor ChIP-seq genomic signal (bigwig) files were downloaded for MCF-7, T-47D, primary hepatocytes, HepG2, GM12878 (as the closest match to GM17212 where AhR was ChIP-ed) from the ENCODE consortium. For each bound and unbound DRE and each genomic signal (bigwig) file, I extracted the value of the genomic signal 740 bps up- and 740 bps down- stream from the DRE, for a total of 1485 bps of signal (DRE width is 5 base pairs). The extracted signal was split into 15 bins of equal 99-bp size and the signal within each bin was averaged to produce 15 features corresponding to the particular DRE-genomic signal combination. During averaging, any areas of missing signal were replaced with zeros. Model architecture and training For each cell line and all the bound and unbound DREs appearing in open chromatin of that particular cell line, I created sequence features, as well as genomic signal features for all available DNase-seq, histone mark and transcription factor (TF) ChIP-seq experiments. I then performed hyperparameter tuning of an XGBoost model through a grid search of the hyperparameter space with the following values - max_depth = {3, 4, 5, 6, 7}, min_child_weight = {3, 4, 5, 6, 7}, subsample = {1.0, 0.9, 0.8, 0.7}, colsample_by_tree = {1.0, 0.9, 0.8, 0.7} and eta 14 = {0.05, 0.075, 0.1, 0.125, 0.15, 0.2, 0.3}. I reported the average performances over all five folds for the best performing models in terms of hyperparameter selection. Model evaluation In addition to evaluating the models through 5-fold cross validation, I also evaluated model performance on predicting the binding status of DREs that occurred under multi-DRE AhR peaks both in open and closed chromatin. For each such peak and each DRE under the peak I used the AhR binding prediction model to make a prediction regarding whether the DRE is bound or not. If at least one DRE under the peak was predicted as bound, the peak was considered recovered, and the total fraction of recovered peaks was reported. Similarly, I evaluated the model performance on predicting the binding status of AhR peaks without DREs. Briefly, for each 0- DRE peak I simulated five dummy DREs. These dummy DREs are not actually present in the genomic sequence and only represent the genomic location that was used as reference for the calculation of all non-sequence model input features. The center of the first dummy DRE is aligned to the center of the AhR peak and the other four dummy DREs are positioned -100, -50, +50, +100 bps relative to the center point of the first dummy DRE. A zero-DRE peak is considered generally recovered if at least one of the five dummy DREs is predicted as bound. The peak is considered centrally recovered if the central DRE is predicted as bound. Model performance metrics To calculate the area under Receiver Operating Characteristic (auROC) and area under Precision Recall (auPRC) curves, I used the output of the XGBoost algorithm in the form of probabilities of each particular observation (DRE) belonging to a particular output class (bound or unbound). By using different thresholds for these probabilities above which the model predicts a DRE as bound, I obtained the numbers of true and false positives for each threshold, as well as 15 true and false negatives relative to the ground truth of DRE binding obtained from the corresponding AhR ChIP-seq experiment. Each threshold produced a point on the ROC and PRC curves; the area under the curve was calculated using a line interpolated through all the points. Statistical analysis Statistical analysis was carried out in Python 3, using the scipy 1.8.0 package (92). ChIP- seq signals were analyzed using the Kruskal-Wallis test and post-hoc analysis performed with the Wilcoxon test for each pair. Results were considered significant if P-value was < 0.01. RESULTS AhR binding is cell line and cell type-specific The first part of my study was focused on improving the understanding of the molecular determinants underlying the binding of the human aryl hydrocarbon receptor (AhR) to DNA. To achieve this goal, I investigated the role of the core 5'-GCGTG-3' AhR binding motif in determining the cell-specificity of AhR binding. This core AhR binding motif is known as the dioxin response element – DRE. I compared AhR binding in human cells across previously published and publicly available AhR binding data in the form of chromatin immunoprecipitation (ChIP) experiments. These experiments were either followed by sequencing (ChIP-seq) or a microarray (ChIP-chip). The ChIP-seq experiments provided a genome-wide view of AhR binding, while the ChIP-chip experiments were focused only on determining AhR binding in gene promoters. Each experiment selected for the analysis of AhR binding was performed on a specific cell line or on primary cells. Experiments on the following cells were included in further analyses – 1) two epithelial breast cancer cell lines - MCF-7 and T-47D, 2) a hepatocellular carcinoma cell line - HepG2, 3) a lymphoblastoid cell line – GM17212, and 4) primary human hepatocytes. The cells in these experiments were treated with an AhR agonist (TCDD or 3-MC) or vehicle control 16 (VC) – dimethyl sulfoxide (DMSO) for a duration of 45 minutes, 1 hour or 24 hours, or were not treated with either – HepG2 cells. For the full list of experiments, including the total number of AhR peaks, as well as type, concentration, and duration of treatments, see Table 1. The breast cancer cell lines had data available from more than one experiment. Table 1 – AhR ChIP-seq and ChIP-chip experiments. To develop and train my machine learning models I used data from four AhR ChIP-seq binding experiments - 1) MCF-7 cells treated with 10 nM TCDD for 24 hours (referred to as MCF- 7 or MCF-7 24h), 2) primary hepatocytes treated with 1 nM TCDD for 24 hours (referred to as primary hepatocytes), 3) HepG2 cells without AhR agonist treatment (referred to as HepG2), 4) GM17212 cells treated with 1 µM 3-MC for 24 hours (referred to as GM17212). In addition, some analyses were performed on the remaining three AhR ChIP-seq and ChIP-chip binding experiments – 1) MCF-7 cells treated with 10 nM TCDD for 45 minutes (referred to as MCF-7 45m), 2) T-47D cells treated with 10 nM TCDD for 1 hour (referred to as T-47D TCDD), 3) T- 47D cells treated with 1 µM 3-MC for 1 hour (referred to as T-47D 3-MC). All AhR binding 17 experiments performed on T-47D cells were ChIP-chip and only reported the binding of AhR in gene promoters. First, I searched for the occurrences of DREs within the hg19 human reference genome and found approximately 1.6 million DREs in the human genome. Upon intersecting these DREs with the genomic locations of AhR peaks I identified the existence of AhR peaks with either (i) none (0-DRE peaks), (ii) exactly one (singleton peaks), or (iii) more than one (multi-DRE peaks) DREs (Figure 2). Figure 2 – AhR peaks with 0-, 1-, and multi-DREs under the peak. Next, I calculated the percentages of each type of peak – 0-DRE, singleton, and multi- DRE, across all AhR peaks within each AhR binding data set listed in Table 1. The percentage of singleton AhR peaks ranged between 22.3% and 33% and was similar across all data sets. The percentage of multi-DRE peaks, however, was markedly larger for the two types of liver cells – HepG2 cells and primary hepatocytes (Figure 3). Even though HepG2 cells were not treated with an AhR agonist, the HepG2 experiment resulted in 12,164 AhR peaks. These results suggest the possibility of basal induction of AhR in cells exposed to typical cell culture conditions. 18 Figure 3 – percentage of AhR peaks with at least one DRE. Then, I investigated the locations of singleton DREs relative to the mid-point of their corresponding AhR peaks. If these DREs are indeed functional and not only showing up at random under the AhR peaks I would expect them to be enriched at one central or two centrally symmetrical points. This is because the AhR-ARNT dimer can bind to either strand of the DNA – depending on where the DRE is located, and because the DNA binding domain of the dimer does not lie exactly in the middle of the protein complex. My results show that the majority of singleton DREs are located near the mid-point of their corresponding AhR peak (Figure 4). For example, in MCF-7 cells, approximately 50% and 80% of singleton peak DREs were found within 100 and 200 base pairs up-/down- stream from the midpoint of the peak, respectively. However, I observed that in HepG2 cells, even though singleton DREs appeared somewhat centrally enriched, it was not to the same degree as singleton DREs in MCF-7 cells, primary hepatocytes and GM17212 cells (Figure 4). 19 Figure 4 – histogram of DRE position relative to the mid-poing of singleton AhR peaks. Most TF binding peaks occur in areas of open chromatin, except for some pioneering factors that bind in areas of closed chromatin, and potentially recruit chromatin remodeling factors. Therefore, the TF binding peaks of most TFs overlap with DNase hypersensitive sites (DHSs) or similar experimentally verified regions of open chromatin (82). To examine the contribution of chromatin accessibility in determining cell specificity of AhR binding, I examined DNase-seq broadPeak files from the ENCODE database. Only DNase-seq experiments most closely corresponding to the cell line or primary cell type used in the AhR binding experiment were used (81, 82). I identified all DREs and AhR peaks appearing in open chromatin for each cell line or type and determined that the majority of AhR peaks can be found in open chromatin (Table 2). 20 Table 2 – number of singleton bound and isolated unbound DREs; percentage of AhR peaks in open chromatin. Generally, DNase-seq or other types of experiments probing chromatin accessibility available in databases such as ENCODE are done on cells under normal cell culture conditions. Consequently, none of the available DNase-seq experiments corresponding to the cell lines or types used in this thesis were treated with AhR agonists. Therefore, the DNase-seq and AhR binding experiments are matched in ligand treatment only in the HepG2 cells, since these cells were not explicitly exposed to an AhR ligand. However, even though I use DNase-seq experiments corresponding to a non-treated cellular state to determine bound and unbound DREs in a treated cellular state, I observed that the majority of AhR peaks do lie in open chromatin - as it is before treatment (Table 2) – between 83.6% and 92.9%. These findings are consistent with our ATAC- seq data in mouse primary hepatocytes treated with TCDD for 6 hours (unpublished). These results indicate that AhR activation does not result in extensive chromatin remodeling, therefore the use of the DNase-seq prior to treatment is justified. The only exception I found was the human primary hepatocyte AhR ChIP-seq experiment where only about 69% of AhR peaks lie in initially open chromatin. Unfortunately, none of the related human liver or hepatocyte DNase-seq experiments available on ENCODE could be closely matched to the primary hepatocytes used for the AhR 21 ChIP-seq experiment. Hence, primary hepatocytes were excluded from many of the subsequent analyses. Further analysis focused on contrasting bound and unbound DREs. I observed that the proportions and exact identities of unbound and bound DREs found in open chromatin appear to be highly cell line or type specific. Out of nearly 8,000 DREs found in open chromatin across each of the four relevant cell lines or types, about half are bound in at least one, while only 14 DREs are bound in all four (Figure 5). Figure 5 – Venn diagram of bound DREs in accessible chromatin of all four cell lines or types. In contrast, about half of these pervasively accessible DREs are unbound in all four cell lines or types (Figure 6). These results suggest that if a DRE is found in open chromatin of all four cell lines or types that DRE is much more likely to be unbound in all four than it is to be bound. Conversely, if such a DRE is found to be at all bound than it is most likely bound in only one or at most cell lines or types, since only 3.1% of bound DREs in Figure 5 are bound in 3 or all 4 cell lines or types. 22 Figure 6 – Venn diagram of unbound DREs in accessible chromatin of all four cell lines or types. The two breast cancer cell lines, MCF-7, and T-47D, have publicly available AhR binding experiments with similar treatment conditions, i.e., 45 minutes of 10 nM TCDD and 1 hour of 10 nM TCDD treatment, respectively. However, even between these two breast cancer cell lines most accessible DREs appear bound in only one of the two cell lines according to their respective AhR peak lists (Figure 7). A more detailed look at a heatmap of MCF-7 AhR binding signal strength surrounding DREs bound only in T-47D shows that as many as three quarters of these DREs also possess subthreshold peaks in MCF-7 cells (Figure 7C). Ultimately, a DRE that lies within open chromatin of two different cell lines or types is somewhat likely to be bound in one and unbound in the other. These results jointly suggest the existence of AhR binding determinants beyond DNA sequence and the accessibility of chromatin. 23 Figure 7 – Venn diagram of AhR peaks in MCF-7 and T-47D cells; MCF-7 AhR signal across T- 47D only peaks. 24 When examining AhR peaks shared between two cell lines or types, I observed that among AhR peaks found at the same genomic location in two binding experiments there was a higher percentage of peaks with DREs, when compared to AhR peaks that were unique to a single binding experiment (Figure 8). Conversely, this means that AhR peaks with DREs are more likely to appear in more than one cell line or cell type than 0-DREs peaks. Figure 8 – fraction of AhR peaks with DREs. Breaking down AhR peaks by the number of DREs under the peak, into 0-DRE, singleton, and multi-DRE peaks and then quantifying the AhR ChIP-seq signal across each group of peaks revealed significant differences between average signal strength between groups. Namely, the more DREs an AhR peak had, the higher the average ChIP-seq signal was under the peak (Figure 9). These results jointly suggest that DREs are likely participating in determination of AhR binding and that a DRE-centric approach to the investigation of cell-specificity of AhR binding could reveal important determinants and potential mechanisms driving AhR binding to DNA. 25 Figure 9 – average AhR signal across peaks with 0-, 1-, and multi-DREs. Machine learning models accurately predict AhR binding Next, I sought to improve our understanding of the likely molecular determinants of cell- specific AhR binding, beyond chromatin accessibility and the core DRE motif. To achieve this goal, I developed a set of interpretable machine learning models trained to predict the binary binding status of DREs in open chromatin, i.e., bound or unbound. The models were trained with increasingly complex combinations of input features for each cell line or type. The models were trained on the singleton bound DREs occurring under AhR peaks, as my bound (positive) training examples and isolated unbound DREs (see Methods) occurring in open chromatin but not under AhR peaks, as my unbound (negative) training examples. The DREs under multi-DRE peaks were considered ambiguous, as it was not possible to computationally determine which specific DRE(s) 26 among the cluster of DREs were responsible for AhR binding. However, I have used these DREs in model evaluation. All machine learning models presented in this thesis were developed using the gradient boosted tree algorithm of the XGBoost family of algorithms, which has been shown to handle non-linear data well (93). In addition, these algorithms also provide metrics of feature importance. Therefore, it is possible to evaluate the contribution of individual input features to improving the model performance (94). The models I developed use features centered on the DRE which were based on the local chromatin context. These models are trained on singleton bound and isolated unbound DREs found in open chromatin for the cell line or type of interest. Models were validated using the 5-fold cross validation procedure. Due to a limited number of bound singleton DREs, I have not created a dedicated test set to evaluate the models. Instead, as model hyperparameters are tuned with 5-fold cross validation (see Methods), the average performance of the models across the five folds is reported (Figure 10). This choice is further justified by the purpose of this thesis which was to create interpretable machine learning models and derive from them mechanistic insights regarding cell-specific binding of AhR. Figure 10 – schematic representation of model training. I included the following local chromatin input features in the trained models - 1) DNA sequence immediately flanking the DRE. The contribution of flanking sequence of up to 7 nucleotides directly up- and down- stream from the DRE was investigated. These nucleotides have 27 been previously proposed to be involved in AhR binding through an analysis of 13 bona fide AhR binding sites (45). The flanking sequences were one-hot encoded and used as model inputs; 2) Binned average values of bigWig signals of experiments performed on the cell line or type most closely corresponding to the one used in the AhR binding experiment. Namely, for the primary hepatocyte model I used bigWig signals from experiments done in hepatocytes originated from H9 cells, and for GM17212 I used bigWig signals from experiments done in GM12878. All other cell lines were matched exactly, e.g., MCF-7, and HepG2. To create model input features from these bigWig files I used the following publicly available sequencing experiments i) DNase-seq (as representative of chromatin accessibility), ii) histone modification, and iii) transcription factor ChIP-seq experiments from ENCODE – see methods for details (81, 82). I created 15 bins of width 99 base pairs for each bigWig signal and each DRE. Each bin was assigned a value that was the average bigWig signal across the width of that bin. The mid-point of the central bin was positioned at the middle nucleotide of the 5-bp DRE; 3) Indicator variables of whether the DRE is found in a strict (+/- 200 bp away from a transcription start site - TSS) or loose (+/- 1500 bp away from the TSS) definition of a promoter. To optimize model performance and prevent overfitting, I conducted an extensive hyperparameter search for each newly trained model (see Methods). Thus, for each subset of input features and for each cell line a new hyperparameter search was performed. Among all models trained during the hyperparameter search the model with the highest average performance across the five folds was selected as the representative model for the given subset of input features and given cell line. Unless otherwise stated, model performance was evaluated as the area under the Receiver Operating Characteristic (ROC) and Precision Recall (PRC) curves, averaged over five folds using the 5-fold cross validation procedure. Since the AhR binding data sets were largely 28 unbalanced – i.e., there was a much higher number of unbound than bound DREs (Table 2), the area under the PRC curve (auPRC) was considered as a more appropriate metric of model performance. Therefore, in each case the model producing the highest auPRC was selected as the best performing model. Nonetheless, the area under the ROC curve (auROC) was reported as it remains a useful metric to distinguish between poorly and well performing models when comparing between different cell lines or types (see Methods – Performance Metrics). First, to investigate different input feature sets and their influence on model performance, I developed and validated models with the following feature sets used as model inputs - 1) DNase- seq only (DNase model), 2) flanking sequence only (Seq model), 3) flanking sequence and DNase- seq (Seq + DNase model), 4) flanking sequence, DNase-seq and histone modifications (Seq + DNase + HMs model), 5) flanking sequence, DNase-seq, histone modifications and transcription factor binding (referred to as the full model or Seq + DNase + HMs + TFs model). For most cell lines and types, the performance of each successive model improved, except for the primary hepatocyte (not shown) and HepG2 cells. Here, the performance of the sequence only models was overall very low, even lower than the performance of the corresponding DNase model. Nonetheless, the performance of Seq + DNase model was slightly higher than the DNase model for primary hepatocytes and HepG2. These results indicate that the flanking sequence provides some additional useful information when put in the context of the extent of chromatin accessibility (Figure 11). 29 Figure 11 – ROC and PRC curves for model training. The performance of most full (Seq + DNase + HMs + TFs) models was high, the only exception being the primary hepatocyte full model. I propose that primary hepatocyte model underperformed for two reasons. Namely, 1) lack of publicly available sequencing data in liver primary hepatocytes, and 2) the difference in the nature of hepatocytes used in different experiments. The AhR ChIP-seq experiment was performed on human primary hepatocytes 30 derived from a single donor specific to that experiment. All other input features, inclusive of the list of DREs in accessible chromatin, were obtained from hepatocyte-like cells in vitro differentiated from H9 cells. This discrepancy is also evidenced in a lower percentage of AhR peaks occurring in open chromatin in the primary hepatocyte experiment (Table 2). Next, I investigated the contribution of individual chromatin context features to improving the performance of full models. For each cell line, I trained the full model on all available data with the model hyperparameters set to values previously determined to produce the best performing model. After the full models were trained, I used the information gain metric generated by XGBoost to determine the average feature importance of all features (Figure 12), the relative feature importance of sequence features per flanking sequence nucleotide position (Figure 13), and relative importance of individual bins of non-sequence features (Figure 14). Figure 12 – feature importance lists of all features for full models. 31 Figure 12 (cont’d) Figure 13 – feature importance of flanking sequence. 32 Figure 14 – feature importance of individual bins. Examining the feature importance scores of all features used in full models I observed that specific models are predominantly learning and making AhR-DRE binding predictions by relying on different features across different cell lines and types (Figure 12). Each cell line or type had three to six bigWig signals with feature importance 2-5 times higher than that of any other signal. These were: 1) in MCF-7 cells - H3K4me1, H3K4me3, GATA3, CTCF, ZNF217 and FOXA1; 2) in primary hepatocytes - DNase-seq, SP1, H3K27ac and RXRA; 3) in HepG2 cells - MXI1, MAX, NR2F2, ZEB1; and 4) in GM17212 cells - SP1, POLR2A, TAF1, H3K27ac, and CREB1. Some features appear to be ranked relatively highly across most cell line or types, such as the binding of 33 CTCF, Rad21, SP1, FOXA1, MAX and MAX related factors MAZ, and MXI1; as well as histone modification H3K27ac. Nevertheless, the relative level of importance of these features varied across different cell lines or types - e.g., CTCF ranked fifth in MCF-7 cells and 20th in primary hepatocytes (Figure 12). Additionally, when looking at the feature importance scores of individual bins across cell lines or types, the distribution of relative importance scores across bins varied between cell lines or type. For example, the central bins of H3K27ac in GM17212 cells exhibited the highest importance for this feature, whereas in MCF-7 cells the central bins were not used by the model and were, consequently, not assigned an importance score (Figure 14). To verify that the ordering of feature importance scores in Figure 12 was robust and reproducible I created ranked lists of features with highest feature importance - one ranked list for each of the five folds within the 5-fold cross validation. Next, for each feature I created a boxplot of rank distributions across the folds (Figure 15). I observe that the most highly ranked features always rank highly and thus exhibit low rank variability. For instance, H3K4me1 always ranks first in all five folds within the MCF-7 model. Figure 15 – ranks of features sorted by importance across five folds in 5-fold cross validation. To investigate the contribution to model performance of DNA sequence immediately flanking the DRE, I examined the importance scores of nucleotides flanking the DRE within 1) flanking sequence-only models (not shown), and 2) full models. Importance scores of nucleotides 34 produced by the sequence only models were highly variable between different cell lines and types. On the other hand, the importance scores of nucleotides flanking the DRE produced by the full models demonstrated similar profiles of nucleotide importance across different cell lines or types. In summary, the thymine residue at the flanking position directly 5' of the DRE (labelled as the -1 position) had the highest feature importance out of all nucleotides that could appear at that position in three out of four examined binding experiments (Figure 13). Additionally, for two out of four binding experiments the thymine at position -2 and cytosine and guanine at position +1 also had high feature importance (Figure 13). To examine the influence of individual TFs on model performance, I developed models that used only a single input feature – the 15 bins representing the average bigWig signal of a single TF. Sorting these models by performance, I determined that the relative ordering of transcription factors (Figure 16) was different when compared to the feature importance ranking of the full models shown in Figure 12. Notably, for MCF-7 cells (Figure 16), EP300 was the factor resulting in the second-best performing model, while in the corresponding full model, EP300 did not appear even among the top 20 features with highest importance (Figure 12). On the other hand, GATA3 was the most predictive factor both in the full model and individually (Figure 12 and Figure 16). These results point to a high likelihood of redundancy between binding of different TFs. 35 Figure 16 – single transcription factor model performance. In conclusion, the models predicting the AhR binding status of DREs in open chromatin developed here, generalized well within the cell line or type they were trained on, when evaluated on a subset of bound singleton DREs and unbound isolated DREs left out from the training dataset (i.e., a single fold in a 5-fold cross validation). Additionally, these models exhibited highly variable chromatin context specificity between different cell lines or types. The only exceptions were the DNA flanking sequence features. Most cell lines or types exhibited similarities in DNA flanking sequence specificity, potentially pointing to a common flanking sequence grammar. Singleton peak-trained models predict multi-DRE and 0-DRE AhR peak binding within the same cell line or cell type To assess the robustness of trained models and to investigate the extent of overtraining, I performed feature selection based on the feature importance rankings of the full models, for each cell line or type. Briefly, using the list of 300 features with the highest feature importance in Figure 12, I created several models with increasingly larger subsets of those features used as model inputs. I created models with a subset of top N features with highest importance scores in the full model (where N = 10, 25, 50, 75, 100, 200, 300). In the MCF-7 cells I observed that the performance plateaus already at around 100 top features used and that the performance of the model using only 36 the top 50 features, although slightly lower on average, is not significantly different than the performance of the full model (Figure 17). Figure 17 – reduced feature set model performance (all features). Similarly, I investigated the influence of the number of flanking nucleotides used in model training. It was previously indicated, based on computational analysis of 13 experimentally verified DREs, that up to 7 up- and down- stream DRE-flanking nucleotides might play a role in determining AhR binding (45). Accordingly, I have created sequence only models with N flanking nucleotides up- and down- stream of the DRE used in model training (N = 1, 2, 3, 4, 5, 6, 7). The results show that the sequence-only model performance plateaus at four flanking nucleotides (Figure 18). Therefore, all models using flanking sequence features were developed using exactly four flanking nucleotides up-/down- stream of the DRE. 37 Figure 18 – reduced feature set model performance (sequence features). Noting that the DRE is not a palindromic binding motif, it is possible that whether the DRE occurs on the forward or the reverse strand might influence AhR binding determinants. Mainly, DRE orientation might influence the spatial orientation of the AhR-ARNT heterodimer when binding DNA, and thus also influence the direction of interactions with other TFs. To account for this potential issue, I investigated whether correcting the orientation of local chromatin context features by aligning them with the orientation of the DRE influences model performance. Specifically, for DREs found on the forward strand, all features were left as-is, and for DREs found on the reverse strand, the bins of all features were flipped around the central bin (the bin containing the DRE), to match the DRE orientation. The results indicate that there are no differences between the original and strand-corrected models (Figure 19). 38 Figure 19 – model performance with and without strand correction. Next, I evaluated the ability of full models to predict the binding status of DREs that were not used in training, as a test of how well the models generalize. Model performance was first evaluated on multi-DRE AhR peaks. For each muli-DRE peak, the binding status of each DRE was predicted by the model. If at least one DRE was predicted as bound, the peak was considered generally recovered (Figure 20 - blue bars). If the DRE closest to the summit of the peak was predicted as bound, the peak was considered centrally recovered (Figure 20 - orange bars). General recovery of multi-DRE peaks in open chromatin resulted in true positive rates (TPR) between 80- 100% for most multi-DRE peaks (Figure 20). Figure 20 – model evaluation of multi-DRE AhR peaks in open chromatin. In the MCF-7 cells, AhR peaks containing more than four DREs were mostly not recovered by the models (TPR around 40% or lower), suggesting the possibility of a different mechanism 39 underlying AhR binding in areas of high DRE density in MCF-7 cells, possibly through cooperative binding (95). In the hg19 human reference genome approximately 1% of all DREs can be found in one of these high DRE density areas, which were defined as 5 or more DREs within a 500-base pair region (Figure 21). Figure 21 – number of high DRE density clusters in the human genome. On the other hand, multi-DRE peaks in closed chromatin were recovered at a much lower and variable rate of 25-60% (Figure 22), suggesting that AhR binding in closed chromatin might be governed by a distinct set of rules compared to the binding in initially open chromatin. Alternatively, since my models are trained to predict the binding status of DREs in open chromatin, the model might struggle when predicting the binding status of DREs in closed chromatin. Figure 22 – model evaluation of multi-DRE AhR peaks in closed chromatin. However, taking a closer look at DNase-seq signal in the vicinity of these closed chromatin DREs revealed no correlation between the normalized binding strength of DNase-seq signal and 40 the DRE binding status prediction probabilities (Figure 23). This result indicates that the lower performance of the models when predicting the binding status of bound DREs in closed chromatin was not lower due to lower DNase-seq signal alone. Figure 23 – DNase-seq signal vs. model prediction probabilities across multi-DRE peaks in closed chromatin. Contrary to the expectation that the more DREs a peak contains, the higher the likelihood of that peak being recovered by pure chance, I did not observe this trend in Figure 20 and Figure 41 22, with either general or central recovery. These results suggest that the models generalized well, and that the multiple-testing issue was not prevalent when assessing general recovery rates. I also evaluated the false positive rates (FPR) of the full models when predicting the binding status of 1) multi-DRE DNase-seq peaks, 2) 1+-DRE peaks of best performing TF in open chromatin, 3) 1+-DRE peaks of best performing TF in closed chromatin (Figure 24). “Best performing TF” refers to the TF that was ranked the highest in importance in Figure 12. Best performing TFs were GATA3, MXI1, and SP1 for MCF-7, HepG2 and GM17212 cells, respectively. The FPRs generally do not exceed 20% and are the lowest for multi-DRE DNase-seq peaks. 42 Figure 24 – model evaluation on non-AhR peaks. Similarly, to multi-DRE peaks, I evaluated the performance of the models on 0-DRE peaks. In this case, since there were no DREs to evaluate model performance on, models trained on all features excluding flanking DNA sequence (DNase + HMs + TFs) were used for evaluation. To calculate the values of all non-sequence input features I previously used the DRE genomic location as a reference. However, since 0-DRE peaks do not have any DREs, I simulated five “dummy DREs” for each 0-DRE AhR peak to create reference points for the calculation of input features. 43 Unlike actual genomic DREs, these dummy DREs are not present in the genomic sequence and only define the location to be used as reference for the calculation of input features. The center of the first dummy DRE was aligned to the mid-point of the AhR peak and the other four dummy DREs were positioned at -100, -50, +50, and +100 base pairs relative to the mid-point of the peak. I investigated up to five dummy DREs for each 0-DRE peak since the majority of DREs within singleton AhR peaks were located within –100 to +100 base pairs relative to the mid-point of the peak (as shown in Figure 4). Upon establishing the dummy DREs I applied the same procedure as described for predicting multi-DRE peaks. Specifically, a 0-DRE peak was considered generally recovered if at least one of the five dummy DREs was predicted as bound. The peak was considered centrally recovered if the central dummy DRE was predicted as bound. In MCF-7 cells, approximately 93.7% of the 0-DRE peaks are partially recovered and 91% are centrally recovered. Other cell lines exhibit a slightly lower rate of recovery, nevertheless, the majority of 0-DRE peaks was recovered (Figure 25). Figure 25 – model evaluation on 0-DRE AhR peaks in open chromatin. Similar to multi-DRE peaks, the central true positive rates for 0-DRE AhR peaks in closed chromatin are considerably lower – between 11.7% and 41.8% for GM17212 and MCF-7, respectively (Figure 26). 44 Figure 26 – model evaluation on 0-DRE AhR peaks in closed chromatin. Additionally, I evaluated the false positive rates (FPR) of my models when predicting the binding status of 1) 0-DRE DNase-seq peaks, 2) 0-DRE peaks of best performing TF in open chromatin, 3) 0-DRE peaks of best performing TF in closed chromatin. Central FPR for 0-DRE DNase-seq peaks is relatively low and does not exceed 1.3%, However, central FPR for 0-DRE best performing TF peaks in open chromatin can be high and ranges from 8.6% to 45% for HepG2 and MCF-7 cells, respectively (Figure 27). 45 Figure 27 – model evaluation on 0-DRE non-AhR peaks. Cross-cell models provide insights into cell-specificity of AhR binding To examine whether full feature models for different cell lines learn from different features simply because different features were available for different cell lines, I developed full feature models that only use features available in all evaluated cell lines - MCF-7, HepG2 and GM17212. The input features were limited to DNase-seq, and to only those TF and HM features for which ChIP-seq experiments were available in all three cell lines. The results still exhibit a highly variable set of the most important features determining AhR binding within different cell lines (Figure 28). 46 Figure 28 – feature importance lists of all features in full models for DREs in enhancers only. On the other hand, the binding of AhR to DREs in specific genomic locations, such as promoters or enhancers, might be governed by distinct molecular mechanisms. Therefore, I investigated whether there were any enhancer-specific binding rules and whether these rules might be similar between cell lines. To this purpose, I created full models predicting the occupancy of singleton bound and isolated unbound DREs found only in enhancers. Once again, the results display a highly variable sets of the most important features for each cell line. Nevertheless, I observed an increase in feature importance for some TFs, such as the EP300 transcriptional coactivator for all cell lines (Figure 29). Figure 29 – feature importance lists of features shared by all three cell lines in full models for DREs. 47 Examining the feature importance of individual factors in different cell lines, I noticed that that SP1 and MAX transcription factor have the highest feature importance in the full models for GM17212 and HepG2 cells, respectively. However, in MCF-7 cells, neither of these factors appeared within the top 20 factors with highest feature importance (Figure 12). To investigate whether the importance of these factors might be low in MCF-7 cells due to redundancy with other TFs or other features, I examined the discriminative power of a single feature derived from these factors. For both SP1 and MAX, I created a single feature that was the maximal normalized signal within a 100-bp region surrounding the DREs used in model training. Next, for each feature in each cell line I found the optimal threshold that produced the highest F1 score. The F1 scores of both SP1 and MAX were much lower in MCF-7 cells when compared to HepG2 and GM17212 cells – 7% vs. 35% and 60%, respectively (Figure 30). Figure 30 – within-cell line F1 scores for the optimal threshold of the central bin of the TF signal. These results indicate that SP1 and MAX are not predictive of AhR binding in MCF-7 cells but are highly predictive of AhR binding in GM17212 and HepG2 cells. On a similar note, GATA3 is the factor most predictive of AhR binding in MCF-7 cells. Since GATA3 is not expressed in many other cell lines or types, it is difficult to investigate whether the AhR binding dependence on GATA3 in MCF-7 cells is specific to that cell line. To further investigate the difference in 48 predictive capabilities of features based on the MAX transcription factor I compared the binding profiles of MAX centered on bound and unbound DREs in MCF-7 and HepG2 cells. I observed that the difference in MAX signal between bound and unbound DREs appears qualitatively less pronounced in MCF-7 cells than it is in HepG2 cells, which could explain the increased utility of MAX features in the HepG2 model (Figure 31). Figure 31 – heatmaps representing MAX binding across bound and unbound DREs in MCF-7 and HepG2 cells. Next, I evaluated the cross-cell performance of two models with similar treatments – MCF- 7 and primary hepatocyte models. Here I focused on the sequence-only models to examine the possibility of a cross-cell flanking sequence grammar. The sequence-only model trained on MCF- 49 7 cells and evaluated on primary hepatocytes did not perform any better than random. Conversely, the sequence-only model trained on primary hepatocytes performed better when evaluated cross cells on MCF-7 cells than within-cells in primary hepatocytes (Figure 32). Admittedly, the cells in these two experiments were both treated with TCDD for 24 hours, although the concentration of TCDD was different - 10 nM and 1 nM, for MCF-7 cells and primary hepatocytes, respectively. Figure 32 – cross-cell performance of sequence only models. To further evaluate cross-cell performance, I focused on two similar cell lines – MCF-7 and T-47D, which are both epithelial breast cancer luminal type A cell lines. Models comprising of 1) sequence-only, 2) GATA3-only, and 3) sequence and GATA3 features trained in the MCF- 7 cells exhibit high performance within cells (results not shown), where model performance rises with each successive model. Nonetheless, when these models were evaluated on all bound DREs within T-47D cells treated with either 3-MC or TCDD for one hour, I found that it was the sequence-only model that had the highest true positive rate of 90.67%. Any DREs that were bound in both MCF-7 and T-47D cells were excluded from model evaluation to avoid using training data in the testing phase. When evaluating the sequence-only model trained on MCF-7 cells separately in three subgroups of T-47D DREs, namely DREs bound in 1) TCDD treatment only, 2) 3-MC treatment only, 3) and both TCDD and 3-MC treatment, the true positive rates were 90.48%, 50 89.47%, and 92.86%, respectively (Figure 33). Therefore, the sequence-only model could not distinguish between AhR binding resulting from activation by different AhR ligands. These results indicate that AhR binding resulting from activation by different ligands in the same cell line might not be as different as Figure 7 seems to suggest. It is therefore possible that many of the AhR peaks unique to a single ligand were also sub-threshold peaks for the other ligand. Figure 33 – cross-cell performance of MCF-7 models applied on T-47D cells. Taken together, these results also suggest that cross-cell model predictions might be more accurate for binding experiments with similar cell lines or types (like MCF-7 and T-47D, both of which are breast carcinoma cell lines) and treatments, as opposed to binding experiments with dissimilar cell types (like MCF-7 and primary hepatocytes) but similar treatments. AhR binding models reveal positive and negative regulators of AhR binding Next, I focused on analyzing individual DRE binding status predictions. To this end, I used ELI5 - https://eli5.readthedocs.io/, an algorithm that summarizes the decision-making process underlying individual model predictions. ELI5 assigns a numerical weight to each feature the model used when making each DRE binding status prediction. These feature weights are a summary measure of how much the feature contributed to the final DRE binding status prediction across all decision trees used by the XGBoost model. The higher the weight the more the feature 51 contributed. Features can be both positively and negatively weighted and an example showcasing the top 10 positively and negatively weighted features for a single DRE binding status prediction is shown in Figure 34. In this example, the high average bigWig signal value within bin 0 of MAX binding is assigned the highest weight by ELI5 which means that this feature contributes the most to the model predicting the corresponding DRE as bound. Figure 34 – an example showcasing the top 10 positively and negatively weighted features for a single DRE. When analyzing the weights assigned to features that are the individual bins of bigWig signals across all correctly predicted bound DREs, i.e., true positives (TPs), I observed both model features whose weights increase with increasing feature values - termed positive regulators, as well as features whose weights decrease with increasing feature values – termed negative regulators (Figure 35). Figure 35 – scatterplot examples of positive and negative regulators. 52 Upon determining the direction of regulation – positive or negative regulator, for each feature in each cell line, I compared the direction of regulation of features in common to each combination of two cell lines. In total I found 102, 46, and 59 features used in common by 1) MCF-7 and HepG2, 2) MCF-7 and GM17212, and 3) HepG2 and GM17212 models, respectively. I observe that for each model combination approximately 41-62% of features appear as positive regulators in one cell line and negative in another (Figure 36). Figure 36 – positive and negative regulators in different pairs of cell lines. Further, a total of 15 features was used by all three models. Out of those, only three features had the same direction of regulation in all three models – bin 1 of MAZ, bin 1 of MAX, bin 3 of H3K27ac. All three of these features are positive regulators in all three models (results not shown). Even different bins of a single transcription factor, e.g., CTCF in HepG2 cells, can be both positive and negative regulators, albeit the bins that are negative regulator had very small feature weights (Figure 37). These results suggest that even though cell-specific models primarily learn from entirely different features, a small subset of those features shows similar patterns across cells. Additionally, different TFs might both facilitate or interfere with AhR binding in different cell lines. 53 Figure 37 – average weights assigned to different bins of CTCF. Lastly, ELI5 provided weights for flanking sequence features, as well. The flanking sequence feature were binary – i.e., a certain type of nucleotide either appeared at a specific position or it did not, e.g., nucleotide at position -1 was a thymine or not. The flanking sequence features can also be seen as positive or negative regulators, as the model produces either positive or negative weights when a specific nucleotide at a particular position is present. Such sequence features are classically represented in the form of a sequence logo (95). These representations indicate how informative the presence of a certain nucleotide at a particular position is when determining whether the given sequence is a binding site or not. However, when these logos are formed, usually only the bound sequences are considered. Conversely, the logo generated by my models is a combination of two motifs – one that describes bound DREs, and another that describes the unbound DREs (the upper and lower motifs in Figure 38, respectively). I propose that this type of motif is more informative that a standard TF binding motif. Figure 38 – motif logo representing bound and unbound DREs. 54 Unlike non-sequence features that are common to different cell line models, the direction of regulation for DNA flanking sequence features appears more stable across cell lines or types. For instance, between the MCF-7 cells and primary hepatocytes, all three nucleotide features (position -2 is thymine, -1 is thymine and 1 is guanine) that are used by both models have the same direction of regulation (results not shown). 55 CHAPTER 3: AHR BINDING PARTNERS INTRODUCTION Some transcription factors (TFs), such as the AhR, are incapable of binding DNA by themselves due to their incomplete DNA binding domain, and need to dimerize with other TFs to bind DNA. The primary and most widely investigated dimerization partner of AhR is the AhR nuclear translocator (ARNT) protein. The AhR-ARNT dimer tends to dominate the AhR-DNA binding landscape – hence ARNT has been regarded as the canonical dimerization partner of AhR (96). In addition, there is little evidence of in vivo AhR binding in the absence of ARNT. However, many dimerizing TFs have multiple possible dimerization partners, for instance ARNT can dimerize with itself (97), AhR (96) and HIF1α (98). It is thus possible that AhR also has multiple dimerization partners and could potentially even bind to different cognate sequences when dimerized with different partners. Recently, it has been shown that AhR binds certain loci in the absence of ARNT, even when treated with exogenous ligands, such as 2, 3, 7, 8 tetrachlorodibenzo-p-dioxin (TCDD). One such locus exists in the promoter of the plasminogen activator inhibitor 1 (PAI-1) gene. The AhR binds this locus in a TCDD-inducible manner, however the binding of ARNT is markedly absent. In addition, the promoter of PAI-1 possesses no dioxin response elements (DREs) – the 5'- GCGTG-3' core consensus binding motif of the AhR-ARNT dimer (99). Huang and Elferink investigated preserved sequences across species in the promoter of PAI-1 and identified two likely locations for the binding of AhR. By mutating these sequences and testing for binding via electrophoretic mobility shift assay (EMSA) they narrowed down AhR binding to a single region and identified several nucleotides within that region that influence AhR binding. This region was termed the nonconsensus DRE (NC-DRE) and it shared marked homology with the DNA binding 56 sequence of the Krüppel-like factor (KLF) family of TFs. Later, it was confirmed that KLF6 interacts with AhR and binds to the NC-DRE in the PAI-1 promoter in a TCDD-dependent manner. Furthermore, sequential deletion studies demonstrated that the C terminus of the AhR and the N-terminal domains of KLF6 are necessary to facilitate this interaction (29). The activities of AhR and NF-kB pathways have also been functionally linked (100). Tian et al. demonstrated that AhR and the RelA subunit of NF-kB associate physically in murine hepatoma cells. Additionally, such physical interactions between AhR and RelA in the absence of ARNT have also been shown in the IL-6 promoter of human lung cells (28), and c-myc promoter in breast cancer cells (27). However, both KLF6 and RelA interactions with AhR have only been demonstrated at a limited number of loci and it is currently unknown if they could be more widespread. To investigate this possibility, I have developed a computational method to assess the likelihood of AhR interactions across the entire genome, by using publicly available ChIP-seq data. My results indicate that while TCDD-activated AhR predominantly interacts with ARNT, at a subset of sites, TCDD-activated AhR appears to bind with RelA as well. On the other hand, in cells not explicitly treated with an AhR ligand, AhR does not seem to interact with ARNT, except for a small subset of AhR peaks with DREs (0.5% of all AhR peaks). In this case, the AhR does seem to interact with both KLF6 and RelA extensively, across the genome. MATERIALS AND METHODS Reference genome The reference genome used for sequence alignment in this part of the thesis was the human genome assembly version hg38. 57 Genomic locations of DREs A list of all DREs and their genomic locations in the human genome was compiled by searching the hg38 human reference genome sequence for all occurrences of the core DRE sequence 5'-GCGTG-3' on either strand of the DNA. Scatterplot of signal-to-signal correlation between two TFs Given 1) a list of genomic ranges, e.g., a list of TF binding peaks, and 2) two TFs - TF1 and TF2, and their bigWig binding strength signal files representing the genome-wide intensity of TF-DNA binding; a scatterplot of binding strength signal correlation is constructed in the following way. For each genomic range, described by the chromosome, start and end of the range, the maximum of the binding strength signal of both TF1 and TF2 is found. A point is plotted on a scatterplot, where the x-axis represents the binding strength signal of TF1 and the y-axis represents the binding strength signal of TF2. The procedure is illustrated below (Figure 39). 58 Figure 39 – construction of the scatterplot of signal-to-signal correlation between two TFs. Histogram of individual signal correlations between two TFs Given 1) a list of genomic ranges, e.g., a list of TF binding peaks, and 2) two TFs - TF1 and TF2, and their bigWig binding strength signal files representing the genome-wide intensity of TF-DNA binding; a histogram of signal-to-signal correlations is constructed in the following way. For each genomic range, described by the chromosome, start and end of the range, the binding strength signals of both TF1 and TF2 within the given genomic range are extracted and converted into a numerical series of values. The Pearson correlation coefficient for the two series is calculated and recorded. A histogram of all Pearson correlation coefficients for all genomic regions of interest is constructed. The procedure is illustrated below (Figure 40). 59 Figure 40 – construction of the histogram of individual signal correlations between two TFs. RESULTS Genome-wide investigation of protein-protein interactions for DNA-bound TFs Certain TFs, like the AhR, are generally considered incapable of binding DNA by themselves and need to dimerize with other proteins to do so. Many such factors have more than one possible dimerization partner. For instance, ARNT can dimerize itself (97), AhR (96) and HIF1α (98) and possibly other TFs as well. Two TFs that dimerize to bind DNA, e.g., TFs A and B, do so together and could, therefore, be considered a single new A-B TF that is bound to DNA. The dimerization reaction, as well as dimer-DNA binding, are reversible reactions, however the crosslinking procedure that is the first step of the chromatin immunoprecipitation (ChIP) type of experiments makes the dimer-DNA complex stable. Thus, two ChIP-seq experiments, one for TF A and another for TF B should appear as though they were two replicate ChIP-seq experiments for the same A-B TF. Namely, they would appear as two biological replicate experiments that were also performed with different antibodies (Figure 41). However, this would be true only within the context of DNA sites that were bound by the dimer and not by individual TFs, or by an individual TF dimerized with a different TF. 60 To investigate TF-TF interactions across TF-DNA bound sites I propose an analytical method based on two semi-qualitative metrics. These are 1) the scatterplot of signal-to-signal correlation and 2) the histogram of individual signal correlations. Both metrics are constructed starting from a list of peaks. This list could be (i) a list of peaks of TF A, (ii) list of peaks of TF B, or (iii) the intersection of list of peaks of TF A and TF B. Given the list of peaks, the scatterplot of signal-to-signal correlation is generated by calculating the maximum of TF A and TF B signal for each peak and plotting these two maximums as a point on the scatterplot (left panels in TF-TF interaction figures). Each point represents a single peak (Figure 39). The value of the Pearson correlation coefficient – r, for all points on the scatterplot was also reported (see Methods for more details). Similarly, given a list of peaks, the histogram of individual signal correlations is generated by calculating the Pearson correlation between the signal of TF A and TF B across each peak, by first transforming these signals into number series of equal length, and then calculating their correlation coefficient. All Pearson correlation coefficients are then plotted on a histogram (Figure 40). The percentage of peaks having Pearson correlation coefficient r>0.7, was reported on the graph as well (see Methods for more details). This analytical method possesses an advantage over experimental methods such as immunoprecipitation followed by mass spectrometry (IP-MS) which are used to investigate protein-protein interactions. The IP-MS method pulls down the protein of interest (POI) and then performs mass spectrometry to obtain a list of proteins interacting with the POI. However, this method assumes that the two proteins interact even when not bound to the DNA, which might not be the case. The proposed method should work even if the concentration of the dimerized protein was generally much lower than the concentration of individual proteins A and B across the cell. 61 This is because the method works by selecting for sites bound by the A-B dimer, enriching for the A-B dimer signal in the process. Figure 41 – ChIP-seq of dimerized TFs as replicate ChIP-seq of a single TF. To set some expectations for these metrics in different scenarios, I first generated the scatterplot and histogram for two replicate experiments for the same TF – FOXA1. Data was obtained from ENCODE. The results demonstrated that for replicate experiments one could expect the scatterplot correlation to be high, r=0.82 in case of FOXA1 replicate experiments (left panel Figure 42). Similarly, the proportion of peaks with signal correlations exceeding 70% (r>0.7) was very high – 70% in the case of FOXA1 replicate experiments, and considerably shifted to the right (right panel Figure 42). 62 Figure 42 – FOXA1 replicate experiment interaction assessment – example of interacting TFs. Then, I generated the scatterplot and histogram for two experiments of TFs that are known not to interact – FOXA1 and CTCF. Data for both obtained from ENCODE. The results demonstrate that for experiments with non-interacting factors one could expect the scatterplot correlation to be very low, r=0.08 in this case (left panel Figure 43). Similarly, the proportion of peaks with signal correlations exceeding 70% (r>0.7) was very low – 9% in this case, and the histogram is relatively flat (right panel Figure 43). Figure 43 – FOXA1 and CTCF interaction assessment – example of non-interacting TFs. AhR interactions with ARNT To confirm and further investigate AhR interactions with ARNT, I examined two pairs of AhR and ARNT experiments (Table 3). The first pair of AhR and ARNT experiments was 63 conducted in MCF-7 cells treated with 10 nM TCDD for 24 hours under the same conditions by the same lab. The second pair of experiments assessed AhR and ARNT binding in HepG2 cells not treated by an AhR ligand (data available on ENCODE portal). Table 3 – AhR and ARNT ChIP-seq experiment list. AhR-ARNT interactions in TCDD-treated MCF-7 cells across all AhR peaks. In scenarios of treatment with exogenous AhR ligands such as TCDD, the AhR is assumed to require dimerization with ARNT to interact with DNA. On the other hand, ARNT is known to bind DNA by dimerizing with other TF partners, such as with itself or Hif1a. Therefore, I focused on investigating AhR-ARNT interactions by using the list of 17,588 AhR peaks in MCF-7 cells treated with 10 nM of TCDD for 45 minutes. I observed that the scatterplot correlation was very high, r=0.88 (left panel Figure 44), comparable to results for the two replicate experiments. The proportion of peaks with signal correlations exceeding 70% (r>0.7) was 26%, and the histogram was slightly shifted to the right (right panel Figure 44). This result is higher than for non-interacting TFs, but lower than for two replicate experiments. Figure 44 – AhR-ARNT interactions in TCDD-treated MCF-7 cells across all AhR peaks. 64 AhR-ARNT interactions in TCDD-treated MCF-7 cells across AhR peaks with 1+ and 2+ DREs. Next, I investigated how the AhR binding motif, also known as the dioxin response element (DRE), influences the AhR-ARNT interaction results. I have again generated the scatterplots and histograms of TF interactions, but this time for (1) AhR peaks with 1 or more DREs – 1+ DREs, 3097 AhR peaks (left panel Figure 45), and for (2) AhR peaks with 2 or more DREs – 2+ DREs, 563 AhR peaks (right panel Figure 45). The results indicate that the scatterplot correlation increases with increasing number of DREs, r=0.90 and r=0.93 (top left and right panels Figure 45). I also observed the narrowing of the scatterplot, with less variation with increasing number of DREs under AhR peaks. The proportion of peaks with signal correlations exceeding 70% (r>0.7) also increased to 36% and 43%, for 1+ DREs and 2+ DREs AhR peaks, respectively (bottom left and right panels Figure 45). 65 Figure 45 – AhR-ARNT interactions in TCDD-treated MCF-7 cells across AhR peaks with 1+ and 2+ DREs. AhR-ARNT interactions in TCDD-treated MCF-7 cells across ARNT peaks that do not overlap AhR peaks. Further, I investigated the correlations between AhR and ARNT binding across ARNT peaks that do not overlap AhR peaks – ARNT-only peaks. As mentioned, ARNT readily dimerizes and binds DNA with TFs other than AhR, hence I expected to see lower degrees of correlation between AhR and ARNT binding. The scatterplot correlation decreased to r=0.63 (left panel Figure 46) compared to AhR peaks which was r=0.88. A trend similar to AhR peaks is still observable, however this is likely due to some AhR peaks not being called by the peak caller, despite possessing high AhR signal. On the other hand, the proportion of peaks with signal 66 correlations exceeding 70% (r>0.7) is low at 12%, making it more comparable to non-interacting factors (right panel Figure 46). Figure 46 – AhR-ARNT interactions in TCDD-treated MCF-7 cells across ARNT-only peaks. To investigate how AhR signal influences these results I have split the ARNT-only peaks into two groups. The first group is referred to as the low AhR signal group – with AhR signal lower than the AhR peak with the lowest AhR signal. The second group is referred to as the high AhR signal group – with AhR signal higher than the AhR peak with the lowest AhR signal. I observed that about 13% of ARNT peaks had no AhR signal at all, and that 61% fell into the low AhR signal group. Therefore, the left panel in Figure 47 contains more peaks than the right panel, even though that might not be obvious. The correlation coefficient of the signal scatterplot was much lower in the low AhR signal group than the high AhR signal group, r=0.11 vs. r=0.72, respectively (Figure 47). In addition, none of the ARNT peaks with low AhR signal had more than 2 DREs. These results are in line with the notion that ARNT binds DNA in the absence of AhR. Together, these results indicate that some ARNT-only peaks might also be overlapping sub- threshold peaks of AhR. 67 Figure 47 – AhR-ARNT interactions in TCDD-treated MCF-7 cells across ARNT-only peaks – low vs. high AhR signal. AhR-ARNT interactions in non-treated HepG2 cells across all AhR peaks. It is currently not known whether non-activated or endogenously activated AhR interacts with TFs other than ARNT and to what extent. By focusing on investigating AhR-ARNT interactions by using the list of approximately 15,000 AhR peaks in HepG2 cells, I observed that the scatterplot correlation was very low, r=0.09 (left panel Figure 48), comparable to the results for non- interacting TFs. The proportion of peaks with signal correlations exceeding 70% (r>0.7) was 10%, and the histogram was also flat, which was also comparable to the results for non-interacting TFs (right panel Figure 48). These results suggest that ARNT might not be the primary dimerization partner of AhR in non-treated or endogenously treated cells. 68 Figure 48 – AhR-ARNT interactions in non-treated HepG2 cells across all AhR peaks. AhR-ARNT interactions in non-treated HepG2 cells across a subset of AhR peaks. I found that there were only 75 AhR peaks (about 0.5% of all AhR peaks) with more than 1 DRE and with high correlation between AhR and ARNT signals (r>0.9). The scatterplot correlation coefficient was much higher for this subset – r = 0.7 (Figure 49). One of these peaks contains two DREs and was located in the upstream region of CYP1A1, approximately -1kb from the transcription start site (TSS). Figure 49 – AhR-ARNT interactions in non-treated HepG2 cells across a subset of AhR peaks. 69 AhR interactions with RelA To confirm and further investigate AhR interactions with RelA, I examined one pair of AhR and RelA experiments (Table 4), with both AhR and RelA experiments not treated by an AhR ligand in HepG2 cell line and available on ENCODE. Table 4 – AhR and RelA ChIP-seq experiment list. AhR-RelA interactions in HepG2 cells across all AhR peaks. Here I compared the binding of an untreated AhR experiment and an untreated RelA experiment in HepG2 cells. The scatterplot correlation was r=0.68 (left panel Figure 50), and the proportion of AhR peaks with signal correlations exceeding 70% (r>0.7) was 34%, with the histogram slightly shifted to the left (right panel Figure 50). Figure 50 – AhR-RelA interactions in HepG2 cells across all AhR peaks. AhR-RelA interactions in HepG2 cells across AhR peaks with 0 and 3+ DREs. Next, I investigated how the DRE, influenced the AhR-RelA interaction results. I have again generated the scatterplots and histograms of TF interactions, but this time for (1) AhR peaks with exactly 0 DREs (left panel Figure 51), and for (2) AhR peaks with 3 or more DREs – 3+ DREs (right panel 70 Figure 51). The results indicate that the scatterplot correlation decreased with increasing number of DREs, r=0.72 and r=0.51 (top left and right panels Figure 51). The proportion of peaks with signal correlations exceeding 70% (r>0.7) was 35% and 34%, for 0 DRE and 3+ DREs AhR peaks, respectively (bottom left and right panels Figure 51). Interestingly, the correlation between AhR and RelA binding decreased with increasing number of DREs under the peak, however the percentage of highly correlated peaks (peaks with r>0.7) remains the same. Figure 51 – AhR-RelA interactions in HepG2 cells across AhR peaks with 0 and 3+ DREs. AhR-RelA interactions in HepG2 cells across AhR peaks with high peak to peak signal correlation between AhR and RelA. Next, I investigated the scatterplot correlation between AhR and RelA, but only across AhR peaks that have high peak to peak signal correlation 71 between AhR and RelA (peaks with signal to signal correlation of r>0.9), i.e., the right-most portion of the histogram of individual signal correlations. The scatterplot correlation increased to r=0.89 (Figure 52) compared to r=0.62 across all AhR peaks. Taken together, these results indicate that AhR likely interacts with RelA at a subset of AhR peaks. Figure 52 – AhR-RelA interactions in HepG2 cells across AhR peaks with high peak to peak signal correlation between AhR and RelA. AhR interactions with KLF6 To confirm and further investigate AhR interactions with KLF6, I examined one pair of AhR and KLF6 experiments (Table 5). Both of these experiments were carried out in HepG2 cells that were not treated by an AhR ligand (data available on ENCODE portal). Table 5 – AhR and KLF6 ChIP-seq experiment list. AhR-KLF6 interactions in HepG2 cells across all AhR peaks. Here I compared the binding in untreated HepG2 cells between an AhR and a KLF6 binding experiment. The scatterplot correlation was r=0.77 (left panel Figure 53), higher than the scatterplot correlation for KLF6. The 72 proportion of AhR peaks with signal correlations exceeding 70% (r>0.7) was 33%, and the histogram was slightly shifted to the left (right panel Figure 53), similar to RelA results. Figure 53 – AhR-KLF6 interactions in HepG2 cells across all AhR peaks. AhR-KLF6 interactions in HepG2 cells across AhR peaks with 0 and 3+ DREs. Next, I investigated how the number of DREs under the AhR peak influenced the AhR-KLF6 interaction results. I have again generated the scatterplots and histograms of TF interactions, but this time for (1) AhR peaks with exactly 0 DREs (left panel Figure 54), and for (2) AhR peaks with 3 or more DREs – 3+ DREs (right panel Figure 54). The scatterplot correlation decreased with increasing number of DREs, r=0.81 and r=0.65 (top left and right panels Figure 51). The proportion of peaks with signal correlations exceeding 70% (r>0.7) was 33% and 34%, for 0 DRE and 3+ DREs AhR peaks, respectively (bottom left and right panels Figure 51). Similar to interactions with RelA, the correlation between AhR and RelA binding decreased with increasing number of DREs under the peak, however the percentage of highly correlated peaks (peaks with r>0.7) remained the same. 73 Figure 54 – AhR-KLF6 interactions in HepG2 cells across AhR peaks with 0 and 3+ DREs. AhR-KLF6 interactions in HepG2 cells across AhR peaks with high peak to peak signal correlation between AhR and KLF6. Next, I investigated the scatterplot correlation between AhR andKLF6, but only across AhR peaks that have high peak to peak signal correlation between AhR and KLF6 (peaks with signal to signal correlation of r>0.9), i.e., the right-most portion of the histogram of individual signal correlations. The scatterplot correlation increased to r=0.85 (Figure 55) compared to r=0.77 across all AhR peaks. These results jointly indicate that AhR likely interacts with KLF6 at a subset of AhR peaks and that this subset might be slightly larger than the subset of AhR peaks where AhR and RelA interact. An implication of these results is that at some AhR peaks, AhR likely interacts with both RelA and KLF6 at the same time. 74 Figure 55 – AhR-KLF6 interactions in HepG2 cells across AhR peaks with high peak to peak signal correlation between AhR and KLF6. Taking the genomic sequence of 0-DRE AhR peaks where AhR and KLF6 exhibit high peak to peak signal correlation (r>0.9) in the +/- 200-bp region around the mid-point of each peak, I ran the MEME-ChIP motif discovery pipeline. Surprisingly, the most enriched motif was the RE1 silencing transcription factor (REST) motif, shown in Figure 56. In addition to being the most enriched motif, it was also highly centrally enriched. REST is not known to interact with either AhR, KLF6 or RelA. Figure 56 – de novo motif discovery across 0-DRE AhR peaks with high KLF6 correlation in HepG2 cells. AhR-KLF6 interactions in HepG2 cells across AhR peaks without the REST motif. Since the role of REST in AhR-KLF6 interactions is unknown and unexpected, I investigated the correlation between AhR and KLF6 across AhR peaks that did not possess a REST motif. To achieve this, I first searched for the REST motif, JASPAR motif MA0138.2 (103), under AhR peaks, looking at sequences within the region of +/-200 from the mid-point of the peak. I used 75 FIMO with a q-value cutoff of 0.001 (104). This search generated 841 AhR peaks without the REST motif. The scatterplot correlation was lower than for all AhR peaks, r=0.56 (left panel Figure 57) compared to r=0.77 for all AhR peaks. The proportion of peaks with signal correlations exceeding 70% (r>0.7) was 33% (right panel Figure 57), similar to result obtained for all AhR peaks. Figure 57 – AhR-KLF6 interactions in HepG2 cells across AhR peaks without the REST motif. Performing another de novo motif search across AhR peaks without the REST motif, I found the KLF6-like motif under these peaks (Figure 58). The KLF6 motif contains little information overall and is generally degenerate, so exact matching with high confidence is difficult. The q-value reported by TomTom for the match between the KLF6 and the found motif was 0.0213. Notably, nucleotides at positions 5 and 11 do not match well between the two motifs. Figure 58 – de novo motif discovery across AhR peaks without the REST motif in HepG2 cells. 76 The de novo discovered motif looks remarkably like the sequence used to probe the NC- DRE. Huang and Elferink identified a potential binding site for AhR by examining a species- conserved sequence in the promoter of plasminogen activator inhibitor 1 (PAI-1) (99). They subsequently ran electronic mobility shift assays (EMSA) to probe the extent of AhR binding. They tested the wild-type (WT) sequence and five sequence mutants (labeled M1, M2, M3, M4 and M5). Mutants M2 through M5 lie in the portion of the sequence overlapping the de novo found motif. The motif, together with the WT and all the mutant sequences is shown in Figure 59. The mutated portions of the sequence in the mutant sequences are labelled by a red line drawn on top of the mutated sequence. 77 Figure 59 – comparison of de novo derived motif and functionally tested sequences in PAI-1 promoter. Next, I calculated the position specific scoring matrix (PSSM) of the de novo discovered motif using Biopython (105), with pseudocounts calculated using the motifs.jaspar.calculate_pseudocounts function. I applied that PSSM to calculate the score for the WT and each of the mutant sequences (Figure 60). 78 Figure 60 – PSSM scores of the putative AhR-KLF6 motif for five PAI-1 sequences. These results are only in partial agreement with the functional analysis performed by Huang and Elferink (99). In their in vitro experiments mutants M2 and M4 exhibited impaired binding, while M3 and M5 did not. Further, in their in vivo experiments, mutants M3 and M5 exhibited the ability to activate a luciferase promoter, whereas mutant M4 did not (mutant M2 was not tested). Motif scores for mutants M2 and M5 are in line with previous work, while scores for mutants M3 and M4 are not. To match the functional analysis results, the score of the M3 sequence should be higher and the score of the M4 sequence should be lower. Admittedly, with only 70 sequences used to construct the motif, I suspect that there is not enough power to resolve all binding sites properly. It is possible that the true motif is less sensitive to certain nucleotide alterations and more sensitive to others, which could explain the functional results of Huang and Elferink. The consensus sequence of the motif is TGGGAGGCTGAGGCGGGAGGG, and the score for this sequence is 27.82. 79 CHAPTER 4: BISPHENOL A AND BISPHENOL S PREGNANCY-SPECIFIC PHYSIOLOGICALLY-BASED TOXICOKINETIC MODELS INTRODUCTION Bisphenols are a large class of chemicals structurally identified as having two hydroxyphenyl rings. Many bisphenols are considered endocrine disrupting chemicals (EDCs) (106). They are widely used in the manufacturing of polycarbonate plastics, epoxy resins, dental sealants, and plastic and paper consumer products (31, 107), and are pervasively present in dust and soil (32, 33). Due to consumer concerns and heightened regulations regarding the use of bisphenol A (BPA) in some countries (108), industrial and consumer products producers have resorted to using less studied bisphenol alternatives in their products (109). Such BPA-alternatives include bisphenol S (BPS), which is structurally similar to BPA, and is becoming just as environmentally prevalent (110). As a consequence, BPS is the second leading bisphenol found in humans following BPA (107, 111, 112). Bisphenols can be detected in urine, blood, breast milk, amniotic fluid and cord blood, highlighting the ubiquitous exposure humans have to these chemicals (32, 111–116). Several studies have shown that even at low concentrations, exposure to BPA during gestation can result in negative effects on the development of the fetus (117, 118). The detection of BPS in human fetal cord blood (119), the positive association between BPS exposure and prolonged gestational length (120), and the fact that in mammals, fetal exposure to BPS can alter reproductive (121, 122), metabolic (123), and behavioral outcomes (124), warrant research into the precise toxicokinetic mechanisms of these emerging bisphenol chemicals during pregnancy. 80 Physiologically based toxicokinetic (PBTK) mathematical models integrate toxicokinetic processes such as chemical absorption, distribution, metabolism, and excretion (ADME). The main advantage of PBTK models over the classical compartmental approaches to understanding chemical toxicokinetics is the ability of PBTK models to extrapolate outside of the conditions or population that was evaluated experimentally (125). The quantitative predictive and extrapolative capabilities of PBTK models can inform health risk assessments for chemical and pharmaceutical exposure (126, 127). Chemical toxicokinetics during pregnancy are more complex with the inclusion of the maternal, placental, and fetal compartments (126). Moreover, ethical constraints do not allow for any toxicokinetic studies other than biomonitoring to be conducted in pregnant women. The use of refined fetal surgery techniques in a sheep animal model represents unique opportunities to monitor the maternal, amniotic, and fetal compartments; key elements of pregnancy-specific PBTK (p-PBTK) models (126). Importantly, sheep are excellent models to study placental function (128, 129) and have been used for the study of feto-maternal transfer of drugs (130, 131) and EDCs (34, 132), as they allow for the simultaneous and longitudinal characterization of the pregnancy multi-compartment model in real time. The toxicokinetics of BPA have been extensively studied and modeled in both animals and humans (133–138). Primary metabolism (conjugation) for BPA occurs in the liver and the intestine (139). In rodents, BPA undergoes substantial enterohepatic recirculation. However, in monkeys and humans, the rapid metabolism and extensive renal excretion of BPA metabolites means that a negligible amount of conjugated BPA is able to undergo enterohepatic recirculation (140, 141). In pregnancy, both conjugation and deconjugation reactions also occur in fetal tissues, primarily the fetal liver, but these processes occur at varying rates during different developmental windows. In the early developmental stages deconjugation dominates with conjugation barely occurring (142). 81 However, in the case of BPA, conjugation has been shown to increase from 512-fold lower to 13- fold lower when compared with maternal conjugation rates from early to late pregnancy (132). Despite the breadth of work on BPA, only a limited number of studies have investigated the toxicokinetics of BPS during pregnancy (35, 36). Of the two available BPS toxicokinetic models (134, 143) only one is physiologically-based (134) and it is based on a non-pregnant sheep dataset. This non-pregnant BPS model was derived by a substitution of parameters from a previously calibrated BPA PBTK model with parameter values derived from quantitative structure–activity relationships (QSARs) for BPS, but was neither formally calibrated, nor validated. Recently, BPS was reported to reach higher systemic concentrations than BPA in humans (144), representing a need to better distinguish toxicokinetic characteristics between bisphenols, for which PBTK models are uniquely suited. Therefore, the objective of my current study was to improve the understanding of pregnancy toxicokinetics for bisphenols through the development of physiologically relevant multi-compartment p-PBTK models for BPA and BPS. Both p-PBTK models were developed using three independent pair-matched maternal and fetal sheep exposure cohort datasets (34–36). The text and figures in this chapter have been published as a research paper and are reprinted here with the permission of the publisher (30). MATERIALS AND METHODS Datasets Experimental datasets used in this work were obtained from previously published bisphenol toxicokinetic studies in pregnant sheep (34–36). For model calibration, three independent datasets were used (two for each bisphenol). Dataset #1 from (35) , reported total (conjugated plus unconjugated) bisphenol concentrations for BPA and BPS in the maternal and fetal plasma and was used for calibrating both bisphenol models. In brief, toxicokinetic data was 82 obtained from pregnant Polypay × Dorset sheep (singleton pregnancies only) that underwent fetal catheterization surgery at gestational day (GD) 115. Females (n = 3) were injected with a single subcutaneous dose of BPS (0.5 mg/kg) or a combination of BPA and BPS (n = 3; 0.5 mg/kg for each chemical) and data were collected over a 72-h period. No differences in toxicokinetic parameters (maximum concentration reached, time of maximum concentration, half-life, area under the curve, area under the first moment curve, mean residence time, and total body clearance) between single-chemical exposure and mixture dosing were reported, so all BPS values (n = 6) were combined. Additionally, two other toxicokinetic studies in pregnant sheep which presented data for conjugated and unconjugated bisphenols (34, 36) were used during model calibration. For BPA, dataset #2 was obtained from (34), who used pregnant Lacaune sheep (unreported fetal number) that underwent fetal catheterization surgery between GD 108 and 117. In separate experiments, females (n = 8) and fetuses (n = 8, unreported sex) were dosed with an intravenous (IV) infusion over 24 h of unconjugated BPA or BPA-glucuronide (conjugated, BPA-G) at a dose of 2.0 and 3.54 mg/kg/day respectively in the mother, and 5.0 and 3.54 mg/kg/day respectively in the fetus, assuming a 2.5 kg fetus. Plasma concentrations were collected over a 46-h period and the steady state plasma concentration over the final 3-h of infusion was reported. For BPS, dataset #3 was obtained from (36) which included pregnant Lacaune sheep (unreported fetal number) that underwent fetal catheterization surgery between GD 109 and 113. A dual dosing strategy was used, where pregnant females (n = 8) and their fetuses received simultaneous IV doses. First the mother received a dose of 2.7 mg/kg BPS-glucuronide (conjugated, BPS-G) and the fetus was administered a dose of 5 mg deuterated BPS (BPS-d8). This procedure was followed by a simultaneous administration of 5 mg/kg BPS to the mother and 17.5 mg BPS-G-d8 to the fetus. 83 Plasma concentrations were reported over a 72-h period. Dataset #3 was collected at somewhat regular intervals, though not always at the exact same time point (36). As such, these data could not be directly aggregated to yield mean and standard deviation values. Instead, the plasma data for each animal was interpolated using a cubic spline. The most representative time points were selected, and all the interpolated time-concentration curves were sampled at the selected time points and aggregated together. The time points used were either those containing the most data points across animals for all time points except the first and the last ones, or time points lying within each sheep’s interpolation region, for the first and last timepoint, to prevent extrapolation. Model development To establish informative and useful p-PBTK models for BPA and BPS, I developed a minimal generic p-PBTK model for an unconjugated bisphenol and its conjugate metabolite that includes 6 compartments for the mother (liver, fat, kidney, placenta, blood and rest of the body) and 3 compartments for the fetus (liver, blood and the rest of the body) (Figure 61). All relevant biological processes were included, namely conjugation (metabolism) in the maternal and fetal livers, maternal urinary and biliary excretion, and deconjugation in the fetal liver (145). The two coupled sub-models of identical structure for the unconjugated and conjugated bisphenols were connected through liver metabolism in the mother and the fetus, as well as deconjugation in the fetal liver, with one sub-model used for the parent compound (BPA or BPS) and another for the conjugate (BPAconj or BPSconj). In the case of BPS, a duplicate model was developed for deuterated BPS and BPSconj to account for fetal administration. Subsequently, I determined the physiological parameters for an average pregnant sheep, with a single fetus, at the gestational age where the experimental data were generated. This was done for the generic model, as well as for separately parametrized and calibrated individual instances of the generic model for both unconjugated BPA 84 and BPS, and their respective conjugated metabolites. I based the model structure and types of compartments and processes to be inclusive of the only two, to my knowledge, published BPA p- PBTK models (135, 137). All compartments were considered perfusion limited for both unconjugated and conjugated bisphenol sub-models. The most common bisphenol conjugate is glucuronide, although others, such as sulfate, exist (146). Due to a lack of available data on non- glucuronide conjugates, all conjugates for each parent compound were combined into a single conjugate parameter (BPAconj/BPSconj) which was calibrated against glucuronide-conjugate data. Figure 61 – PBTK model schematic. Model equations The equations listed in this section describe both the BPA and the BPS p-PBTK models and are the same for both compounds. The model equations for the maternal unconjugated bisphenol models are described below (Figure 62). All transport equations were perfusion limited. 85 Equations for conjugation represented saturable metabolism in both maternal and fetal livers and the equation for deconjugation in the fetal liver, as well as maternal urinary and biliary excretion equations were modeled as first order processes. All physiological and biochemical parameter units can be found in Table 6, Table 7, and Table 9. Figure 62 – model equations for the maternal unconjugated bisphenols. VT is the volume of tissue T, QT is the blood perfusion, CT is the chemical concentration, PT is the blood:tissue partition coefficient, and CVT is the concentration in the venous blood exiting the tissue. CA is the chemical concentration in the arterial blood. kt1 and kt2 are the diffusion rates from maternal placental blood to fetal blood and fetal blood to maternal placental blood, respectively. Subscript _f denotes fetal tissues. T_all is used in the maternal blood compartment to describes the sum of all tissue compartments. FSC and ASC represent the bioavailability of subcutaneous administration and remaining unabsorbed subcutaneous dose, respectively, and Ka is the first order rate constant for subcutaneous absorption. KELR is the rate of renal excretion. Vmax is the maximum reaction rate, and Km the Michaelis-Menten constant. 86 The model equations for the maternal conjugated bisphenol models are described as follows (Figure 63). Equations that are the same as in the unconjugated bisphenol models have been omitted. Figure 63 – model equations for the maternal conjugated (c) bisphenols. Here, (c) in superscript denotes the conjugated compound, KELL is the rate of biliary excretion. All the other symbols have the same meaning as in the unconjugated bisphenol models. The model equations for the fetal unconjugated bisphenol models are described below (Figure 64). Figure 64 – model equations for fetal unconjugated bisphenol. Maternal liver partition coefficient (PL) was used in the fetus, as well. The equations for the conjugated bisphenol models in the fetus are described below (Figure 65). Figure 65 – model equations for fetal conjugated bisphenol. 87 Values listed obtained from references: 1(35), 2(147), 3(148), 4(149) Table 6 – physiological parameters in pregnant sheep. 88 Values listed obtained from references: (35), (134) BW: body weight, G: glucuronide, Ka: absorption rate constant, MW: molecular weight. Table 7 – BPA and BPS physicochemical parameters. Parametrization The generic bisphenol model was first partially parametrized with the pregnant sheep physiological parameters obtained from the literature (Table 6), inclusive of fractional blood flows and organ volumes. Following this procedure, two separate model instances were created for BPA and BPS using their respective physiochemical parameters (Table 7) and the tissue:blood partition coefficients (Table 8), which were calibrated within ranges of one order of magnitude around values either obtained from the literature (35, 134, 147–150), or estimated from the available log octanol:water partition parameters for compounds with similar partitioning (143, 151, 152) and calibration was performed within those ranges. Partition coefficients for the rest of the body for both BPA and BPS models were calibrated within the minimum and maximum values for all other tissues. All physiological parameters were assumed to be time-invariant due to the nature of the experimental data, which was collected over a short period of time during mid-late pregnancy. 89 Table 8 – passive biochemical parameters (tissue/blood partition coefficients). Calibration The calibration for both the BPA and the BPS models was carried out in four steps: (1) fetal conjugated bisphenol calibration, (2) maternal conjugated bisphenol calibration, (3) maternal complete calibration, and (4) feto-placental transfer and fetal complete calibration. Maternal body weight used was dependent on which of the three experimental datasets the model was being calibrated against. During the fetal conjugated bisphenol calibration, the appropriate fetal conjugated bisphenol IV administration experiment was used to partially calibrate the fetal model, namely the conjugated bisphenol partition parameters for the fetal liver and the rest of the body. The maternal conjugated bisphenol calibration relied on the maternal conjugated bisphenol IV administration data and was used to partially calibrate the maternal model, namely the remaining conjugated bisphenol partition coefficients (maternal kidney, fat, and rest of body), as well as urinary and biliary excretion rates. The complete maternal calibration relied on the maternal unconjugated bisphenol IV and total bisphenol subcutaneous administration data from all three datasets. These were used to fully calibrate the maternal model, namely the unconjugated bisphenol partition coefficients, and metabolism and urinary excretion rates (Table 9). During this step of the calibration, the feto-placental transfer of the unconjugated bisphenol was not accounted for to minimize the number of calibrated parameters. Feto-placental transfer and total fetal calibration relied on the maternal unconjugated bisphenol IV and subcutaneous administration to 90 fully calibrate the feto-placental diffusion rates and the remainder of the fetal parameters. These parameters were mainly unconjugated bisphenol partition coefficients for the rest of the body and metabolism and deconjugation rates in the fetal liver. Except for the rest of body partition coefficient, all other partition coefficients corresponding to the same tissue between the mother and the fetus were assumed equal. Table 9 – rate constants. 91 Our p-PBTK models required the use of blood-to-plasma partition coefficients as parameters, since the experimentally derived calibration datasets reported plasma concentrations. The blood-to-serum partition coefficient for BPA in rats has been experimentally determined as 1.10 (153), and blood-to-plasma partition coefficient for BPA and BPA-glucuronide in humans has been computationally estimated to be 1.05 and 0.83, respectively (154). Since calibrating the blood:plasma partition coefficient in the BPA model for both conjugated and unconjugated BPA within the range of 0.80 to 1.20 did not affect the model results in a significant way, blood:plasma partition coefficients for both conjugated and unconjugated BPA were fixed to 1, simplifying the modeling procedure. I observed similar results for BPS, and have thus fixed the blood:plasma partition coefficients of both conjugated and unconjugated BPS to 1. Calibration of unknown parameter values was performed using sequential least square quadratic programming with random restart (155). Sequential least squares quadratic programming is a formal optimization technique known to perform well for systems requiring constrained nonlinear optimization, which was the case for my developed models. Here, each calibration procedure was repeated 500 times, each time starting from a randomly selected point within the allowable ranges of the calibrated parameters. The calibration most closely matching the datasets, using the lowest mean absolute percentage error score as the selection criteria, was chosen as the final calibration. Extrapolation of maternal and fetal body burdens Dosing regimens simulating daily repeated maternal and fetal exposures to both BPA and BPS were run with the calibrated ovine models using the reference dose for BPA set by the U.S. Environmental Protection Agency (50 µg/kg/day) (156). Simulations were run over a two-week period. 92 Computing software The current model was coded in, and all simulations run using the Python programming language and the Python package Tellurium version 2.1.5 developed for reproducible dynamical modeling of biological networks (157). The full model code is available at https://github.com/BhattacharyaLab/BisphenolPBTK Sensitivity analysis Global sensitivity analyses of the fetal plasma compartment kinetics for both unconjugated and conjugated BPA and BPS were performed to identify the most influential parameters determining fetal bisphenol kinetics. Sensitivity analysis was performed using the variance-based Sobol method (158), as implemented within the SALib python library (159). Parameters determining fetal kinetics were examined between 50% and 150% of the nominal values listed in Table 8 and Table 9, and were sampled using the Saltelli sampling scheme with N = 1,000 generated samples (160). To examine simulated fetal kinetics with both a loading (absorption) and an elimination phase, subcutaneous dosing from Dataset #1 was selected, as described in Datasets. The sensitivity analysis was repeated every half-hour for 48 h of simulation time (excluding 0 h). The parameters included in the sensitivity analysis were the fetal hepatic metabolism parameters (Vmax_f and Km_f), deconjugation rate constant (Kd_f) and rest of body partition coefficients (PR_f and PR_f (c)). Additionally, I repeated the sensitivity analysis by also adding the feto-placental transfer parameters (kt1 and kt2). 93 RESULTS Calibration Simulations of the fully calibrated BPA model for both maternal and fetal compartments were compared to experimental dataset #1 following a single subcutaneous administration of BPA to the mother (Figure 66A maternal compartment, and Figure 66D fetal compartment). A full simulation was also performed for dataset #2 following either a 24-h IV infusion of BPA and BPA- G to the mother (Figure 66B and Figure 66C, respectively), or 24-h IV infusion of BPA and BPA- G to the fetus (Figure 66E and Figure 66F, respectively). All simulations matched the experimental data ± one standard deviation from the individual data points for total, unconjugated and conjugated BPA (34, 35). Figure 66 – simulated toxicokinetic plots of BPA for maternal and fetal circulation. Similar to BPA, simulations of the fully calibrated BPS model were compared to experimental dataset #1 following a single subcutaneous injection of BPS to the mother (Figure 67A - maternal compartment, and Figure 67D - fetal compartment), or dataset #3 following a 94 single IV bolus of BPA and BPA-G to the mother (Figure 67B and Figure 67C, respectively), or a single IV bolus of BPS-d8 and BPS-G-d8 to the fetus (Figure 67E and Figure 67F, respectively). Except for fetal IV boluses of BPS-d8 and BPS-G-d8, all data points were consistent with the experimental datasets ± one standard deviation from the individual data points for total, unconjugated and conjugated BPS (35, 36). Figure 67 – simulated toxicokinetic plots of BPS for maternal and fetal circulation. Due to its robustness, full simulations of dataset #1 (35), separated into total, conjugated, and unconjugated forms of bisphenols, were run for both BPA and BPS. This was necessary to estimate the breakdown of unconjugated and conjugated bisphenols, which was not available from the original dataset. Extrapolation of maternal and fetal body burdens in an ovine model Simulations showing repeated daily subcutaneous exposure to BPA and BPS are shown in Figure 68 and Figure 69, respectively. Maternal exposure was consistent with known 95 toxicokinetic parameters for BPA (35), where unconjugated BPA was cleared from circulation within a 24-h period (Figure 68A, right panel). In the fetal compartment, I observed a gradual accumulation of total, unconjugated and conjugated BPA (Figure 68B), plateauing around a mean of 0.28 ng/ml unconjugated BPA at 14 days of daily exposure (Figure 68B, right panel, solid black line). Like BPA, total, unconjugated and conjugated BPS also rapidly clears from maternal blood (Figure 69A) and accumulate in the fetal compartment, but total fetal BPS accumulation does not plateau within the 14-day exposure window (Figure 69B, left panel). The BPS model simulates fetal blood concentrations at a mean of 0.45 ng/ml unconjugated BPS by 14 days of exposure (Figure 69B, right panel). Figure 68 – simulation of BPA body burden following two weeks of daily dosing in an ovine model. 96 Figure 69 – simulation of BPS body burden following two weeks of daily dosing in an ovine model. Sensitivity analysis A global sensitivity analysis was run to investigate the main effect of all relevant fetal parameters over time and results are shown in Figure 70 (for BPA) and Figure 71 (for BPS). For both bisphenols, the main effect (%) of the placental to fetal transfer parameter kt1 were the highest among the parameters evaluated for both unconjugated and conjugated BPA and BPS. For BPA, the main effect of the fetal to placental transfer parameter kt2 increased over time while other parameters like fetal hepatic deconjugation (Kd_f) and the rate of enzymatic reaction (Vmax_f) remained constant. For BPS, the main effect of kt2 was lower than for BPA, but also increased over time. The contribution of other parameters that determine fetal plasma kinetics, such as metabolic (Vmax_f, Km_f) and deconjugation (Kd_f) parameters tended to increase over time. 97 Metabolic parameters (Vmax_f, Km_f) were more important for fetal plasma kinetics of unconjugated BPA, until ~ 15 h where they begin to plateau. For unconjugated BPS, the main effect of both Vmax_f and Km_f was higher than Kd_f throughout the 48-h period. The rest of body partition coefficient for unconjugated bisphenols (PR_f) had a minor contribution to output variance in determining both BPA and BPS fetal plasma kinetics, however the rest of body partition coefficient for conjugated bisphenols (PR_f (c)) was especially important for determining conjugated BPA and BPS plasma kinetics. Figure 70 – global sensitivity analysis of the fetal compartment for BPA model. 98 Figure 71 – global sensitivity analysis of the fetal compartment for BPS model. 99 CHAPTER 5: DISCUSSION AND CONCLUSIONS Over time many computational models predicting the DNA binding of transcription factors (TFs) have been developed and the binding of specific TFs has been studied experimentally quite extensively (93, 161, 162). However, the molecular determinants and mechanisms governing the cell-specificity of binding for many TFs remain elusive. The AhR is one such TF. The AhR is a ligand-inducible TF, and its DNA binding cannot be fully determined through chromatin accessibility, the extended binding motif of AhR, the motifs of other co-bound TFs, or any combination of these features. In vitro studies examining the DNA binding of AhR demonstrated that AhR binds exclusively to the 5'-GCGTG-3' DNA sequence, known as the dioxin response element (DRE). On the other hand, in vivo studies revealed that AhR was bound to many genomic regions that did not possess a DRE (68, 69). Thus, it is likely that some AhR-bound regions with DREs were not a result of AhR binding to those DREs. Instead, the DREs may have occurred under the AhR peaks by chance, and the AhR may have bound the DNA through some other mechanism. My results show that many AhR peaks still have DREs and that some peaks have more DREs than could be expected by chance (Figure 3). When examining the position of the DRE within the peak relative to the mid-point of the peak, I observed that the DREs appear centrally enriched in all AhR binding experiments (Figure 4). Additionally, AhR peaks with a higher number of DREs under the peak have a statistically higher average normalized binding signal strength than AhR peaks with a lower number of DREs (Figure 9). These results suggest that DREs, although not necessary for AhR binding, are useful in determining the intensity of AhR binding. 100 On the other hand, AhR has also been shown capable of binding to a GC-rich region without sequence homology to the DRE. Thus, the existence of a non-canonical DRE (NC-DRE) was hypothesized. It was later demonstrated that one such NC-DRE in the promoter of plasminogen activator inhibitor 1 (PAI-1) was bound by AhR without ARNT, but together with KLF6 instead (29, 99). My analysis discovered a GC-rich motif appearing under 70 out of 841 AhR peaks having high correlation of AhR and KLF6 binding signals in the HepG2 cell line (Figure 58). This motif shares homology with the KLF6 binding motif, as well as with the identified GC-rich sequence bound by the AhR in the promoter of PAI-1. Nonetheless, the DREs still seem to play a role in HepG2 cells, as the correlation between AhR and KLF6 binding decreases with an increasing number of DREs under AhR peaks (Figure 54). Interestingly, I did not find a similar motif under AhR peaks in any other AhR binding experiment. Since the AhR binding experiment in HepG2 cells was conducted without treatment with an AhR ligand, I propose that AhR-KLF6 dimers binding to DNA do so preferentially in non-treated, or potentially endogenously treated cells. To investigate the likely molecular determinants of AhR binding, I developed interpretable machine learning models predicting the binding status of DREs in open chromatin. These models were trained on singleton bound DREs as examples of bound DREs, but were able to predict the binding of 0-DRE AhR peaks with high accuracy. This result could be explained by a high level of indirect binding of AhR within singleton and 0-DRE AhR peaks. In this case, AhR would not be bound to the DNA directly, but instead the AhR could, for instance, be tethered to other TFs that are directly bound to DNA (161). If a sufficient number of singleton AhR peaks used in training were actually a result of such binding, then the AhR binding prediction models could learn 101 to recognize indirect binding of AhR and thus predict AhR binding of 0-DRE peaks. The 0-DRE AhR peaks are likely not directly bound – since they do not possess the AhR binding motif. Another explanation for high model accuracy when predicting the binding of 0-DRE AhR peaks is that some of the bound AhR might be part of 3D chromatin loops. In this scenario, the AhR could be directly bound to one or more DREs, in one loop anchor, such that the imprint of AhR binding could also appear in other loop anchors due to their physical proximity in 3D space. EP300 transcriptional activator was one of the factors most predictive of AhR binding in enhancers (Figure 29). The EP300 has been shown to be a marker of pre-established enhancer anchors that appear in enhancer-promoter loops formed after glucocorticoid receptor (GR) activation (162). Furthermore, EP300, H3K4me1, and H3K27ac jointly mark active enhancers (163). In this scenario, the AhR molecule that was directly bound to a singleton DRE could leave an impression of a 0-DRE AhR peak in another anchor of the same loop, due to their physical proximity in 3D (48). Thus, any chromatin complexes participating in these loops would then be associated with both singleton and 0-DRE peaks. In this case my models would be learning how to identify direct AhR binding. Interpretation of the model predictions demonstrated that binding of AhR is likely determined by 1) a common cross-cell flanking-sequence syntax (Figure 13 and Figure 38) and 2) cell-specific chromatin context syntax (Figure 12, Figure 28, Figure 29, and Figure 30). The chromatin context determinants differ vastly between cell lines or types (Figure 12, Figure 14, and Figure 36). Most commonly, one or two TFs appear to be the most important contributors to model performance in each model. I propose that these factors 1) play a significant role in determining cell identity, or 2) are a common AhR co-factor in that specific cell line or type. 102 The pioneering factor GATA3 is mutated in MCF-7 cells, but not in T-47D cells. The MCF-7 mutation of GATA3 is heterozygous, and results in a copy of the GATA3 protein that is more stable and resistant to turnover. Consequently, the mutated GATA3 in MCF-7 cells binds to DNA more strongly than its wildtype counterpart (164, 165). I propose that the increase in binding activity of GATA3 in MCF-7 cells makes GATA3 the most predictive factor of AhR binding in those cells. In addition, GATA3 was also shown to be the most commonly overlapping factor for binding of ERα, another inducible TF, in MCF-7 cells (166). Nonetheless, AhR peaks in T-47D cells were still correlated with GATA3 binding, albeit to a lesser extent. Therefore, wild type GATA3 might still play a role in determining AhR binding, however, this role appears to be less pronounced. Additionally, GATA3 and AhR binding have shown synergistic effects on the expression of GPR15 in human CD4+ T cells (167). Therefore, AhR-GATA3 interactions might not be confined to breast cancer cells. On the other hand, certain TFs have been shown to be functionally associated with AhR, such as ARNT, RelA and KLF6. However, none of these factors were ranked highly by my models. ARNT is considered the principal dimerization partner of TCDD-induced AhR. However, my HepG2 and GM17212 models rank ARNT features very lowly (Figure 12). Admittedly, in the GM17212 model, the ARNT binding experiment was performed on a different but similar cell line – GM12878, and, more importantly, without AhR ligand treatment. HepG2 cells were not treated with an AhR ligand in either the AhR or the ARNT experiment. Still, when tryptophan in cell culture media is exposed to light it produces a photoproduct which has been shown to be an AhR agonist (168). In this case one could consider the ARNT experiment as having been conducted under similar conditions as the AhR experiment. Still, ARNT features were not used by the HepG2 model at all. When looking at the correlation between AhR and ARNT binding across AhR peaks 103 it is not difficult to see why (Figure 48). I suspect that the tryptophan derivative-induced activation of AhR might be more similar to endogenous than exogenous activation of AhR. Some of AhR’s endogenous activities are likely mediated through AhR di- and multi- merization with its other known partners, such as KLF6, and RELA. Surprisingly, even though KLF6 and RELA features were used in the HepG2 model, and I have shown that the binding of AhR was highly correlated with KLF6 and RelA binding (Figure 50, Figure 53), the KLF6 and RelA features did not rank highly in feature importance. However, since I relied on a DRE-centered approach to predict AhR binding and since AhR-KLF6 dimers do not appear to directly bind DREs (Figure 58), it is possible that my selection of bound and unbound DREs in open chromatin of HepG2 cells was non- informative for the machine learning models. In summary, I developed highly accurate and robust predictive models of within-cell line or type AhR binding. My models dissected the cell-type specificity of AhR binding and showed that cell-type specific AhR binding is driven by a complex interplay of cell-type agnostic DNA sequence immediately flanking the DRE, and a highly cell-type specific local chromatin context. Additionally, I demonstrated that ARNT was the primary binding partner of AhR in TCDD treated cells, but not in untreated cells, where KLF6 and RelA appear to be the primary binding partners of AhR. Finally, my BPA and BPS PBTK models demonstrated chemical accumulation in the fetal compartment of a pregnant sheep experimental model; the majority of which is simulated as the bisphenol conjugate for both BPA and BPS. When considering extrapolation to daily exposure patterns in sheep, the accumulation of bisphenols in the fetal compartment has been observed in humans (169), with glucuronide conjugates being the predominant form detected (170). Using the U.S. Environmental Protection Agency’s reference dose for BPA (50 µg/kg/day), I simulated 104 repeated maternal dosing over a two-week period for both BPA and BPS in sheep, to evaluate fetal plasma chemical burden. Here, my simulations predicted that a pseudo steady-state of 0.28 ng/ml unconjugated BPA would be reached, which falls within the range of detection for unconjugated BPA (0–53 ng/ml) in cord blood (118). For BPS, my simulations predict that a pseudo steady-state of 0.45 ng/ml unconjugated BPS would be reached. Since biomonitoring of unconjugated BPS has not been reported for cord blood, a direct comparison to human exposure cannot be made. However, the simulated total BPS in the fetal compartment (12.5 ng/ml on day 14) is in excess of total BPS concentrations measured in cord blood (<0.03–0.12 ng/ml total BPS) from a Chinese cohort (171). Most of the BPS accumulated in the fetus is predicted to be in the form of BPS conjugated metabolites. Although these metabolites are generally considered non-bioactive, BPA- G has been shown to be bioactive, and has adipogenic potential in vitro (172) . Given the predicted accumulation potential of BPS-G in the fetal compartment, the bioactivity of BPS metabolites like BPS-G should be further examined. My simulations also demonstrate that, given a steady maternal intake of BPA, unconjugated BPA rapidly reaches a state where it no longer accumulates in fetal blood. Unconjugated BPS, on the other hand, continues to accumulate in fetal blood even after 14 days of daily administrations. These results highlight the need to further study the precise fetal toxicokinetics of BPS, as well as the fetal accumulation potential of other BPA analogs. 105 BIBLIOGRAPHY 1. Laskowski,R.A. and Thornton,J.M. (2008) Understanding the molecular machinery of genetics through 3D structures. Nat. Rev. Genet. 2008 92, 9, 141–151. 2. Latchman,D.S. (1997) Transcription factors: an overview. Int. J. Biochem. Cell Biol., 29, 1305–1312. 3. Sogawa,K. and Fujii-Kuriyama,Y. (1997) Ah Receptor, a Novel Ligand-Activated Transcription Factor. J. Biochem., 122, 1075–1079. 4. A Pharmacology Primer: Theory, Applications, and Methods - Terry Kenakin - Google Books. 5. Rothhammer,V. and Quintana,F.J. (2019) The aryl hydrocarbon receptor: an environmental sensor integrating immune responses in health and disease. Nat. Rev. Immunol., 19, 184– 197. 6. Poland+,A., Glover,E. and Kende,A.S. (1976) Stereospecific, high affinity binding of 2,3,7,8- tetrachlorodibenzo-p-dioxin by hepatic cytosol. Evidence that the binding species is receptor for induction of aryl hydrocarbon hydroxylase. J. Biol. Chem., 251, 4936–4946. 7. Mimura,J. and Fujii-Kuriyama,Y. (2003) Functional role of AhR in the expression of toxic effects by TCDD. Biochim. Biophys. Acta - Gen. Subj., 1619, 263–268. 8. Nguyen,L.P. and Bradfield,C.A. (2008) The Search for Endogenous Activators of the Aryl Hydrocarbon Receptor. Chem. Res. Toxicol., 21, 102. 9. Kaiser,H., Parker,E. and Hamrick,M.W. (2020) Kynurenine signaling through the aryl hydrocarbon receptor: Implications for aging and healthspan. Exp. Gerontol., 130. 10. Perepechaeva,M.L. and Grishanova,A.Y. (2020) The Role of Aryl Hydrocarbon Receptor (AhR) in Brain Tumors. Int. J. Mol. Sci., 21. 11. Wu,D., Potluri,N., Kim,Y. and Rastinejad,F. (2013) Structure and Dimerization Properties of the Aryl Hydrocarbon Receptor PAS-A Domain. Mol. Cell. Biol., 33, 4346. 12. Pandini,A., Denison,M.S., Song,Y., Soshilov,A.A. and Bonati,L. (2007) Structural and functional characterization of the aryl hydrocarbon receptor ligand binding domain by homology modeling and mutational analysis. Biochemistry, 46, 696–708. 13. Schulte,K.W., Green,E., Wilz,A., Platten,M. and Daumke,O. (2017) Structural Basis for Aryl Hydrocarbon Receptor-Mediated Gene Activation Article Structural Basis for Aryl Hydrocarbon Receptor-Mediated Gene Activation. Struct. Des., 25, 1025–1033.e3. 14. Watson,J.D., Prokopec,S.D., Smith,A.B., Okey,A.B., Pohjanvirta,R. and Boutros,P.C. (2014) TCDD dysregulation of 13 AHR-target genes in rat liver. Toxicol. Appl. Pharmacol., 274, 445–454. 106 15. Korashy,H.M. and El-Kadi,A.O.S. (2006) The role of aryl hydrocarbon receptor and the reactive oxygen species in the modulation of glutathione transferase by heavy metals in murine hepatoma cell lines. Chem. Biol. Interact., 162, 237–248. 16. Shimizu,Y., Nakatsuru,Y., Ichinose,M., Takahashi,Y., Kume,H., Mimura,J., Fujii- Kuriyama,Y. and Ishikawa,T. (2000) Benzo[a]pyrene carcinogenicity is lost in mice lacking the aryl hydrocarbon receptor. Proc. Natl. Acad. Sci. U. S. A., 97, 779–782. 17. Xie,G., Peng,Z. and Raufman,J.P. (2012) Src-mediated aryl hydrocarbon and epidermal growth factor receptor cross talk stimulates colon cancer cell proliferation. Am. J. Physiol. - Gastrointest. Liver Physiol., 302, G1006. 18. Rothhammer,V. and Quintana,F.J. (2019) The aryl hydrocarbon receptor: an environmental sensor integrating immune responses in health and disease. Nat. Rev. Immunol. 2019 193, 19, 184–197. 19. Park,R., Madhavaram,S. and Ji,J.D. (2020) The Role of Aryl-Hydrocarbon Receptor (AhR) in Osteoclast Differentiation and Function. Cells, 9. 20. Gialitakis,M., Tolaini,M., Li,Y., Pardo,M., Yu,L., Toribio,A., Choudhary,J.S., Niakan,K., Papayannopoulos,V. and Stockinger,B. (2017) Activation of the Aryl Hydrocarbon Receptor Interferes with Early Embryonic Development. Stem Cell Reports, 9, 1377. 21. Wang,Z., Snyder,M., Kenison,J.E., Yang,K., Lara,B., Lydell,E., Bennani,K., Novikov,O., Federico,A., Monti,S., et al. (2021) How the AHR Became Important in Cancer: The Role of Chronically Active AHR in Cancer Aggression. Int. J. Mol. Sci., 22, 1–22. 22. Shimba,S. and Watabe,Y. (2009) Crosstalk between the AHR signaling pathway and circadian rhythm. Biochem. Pharmacol., 77, 560–565. 23. Marlowe,J.L. and Puga,A. (2005) Aryl hydrocarbon receptor, cell cycle regulation, toxicity, and tumorigenesis. J. Cell. Biochem., 96, 1174–1184. 24. Esser,C., Rannug,A. and Stockinger,B. (2009) The aryl hydrocarbon receptor in immunity. Trends Immunol., 30, 447–454. 25. Kobayashi,A., Sogawa,K. and Fujii-Kuriyama,Y. (1996) Cooperative interaction between AhR.Arnt and Sp1 for the drug-inducible expression of CYP1A1 gene. J. Biol. Chem., 271, 12310–12316. 26. Ge,N.L. and Elferink,C.J. (1998) A direct interaction between the aryl hydrocarbon receptor and retinoblastoma protein. Linking dioxin signaling to the cell cycle. J. Biol. Chem., 273, 22708–22713. 27. Kim,D.W., Gazourian,L., Quadri,S.A., Raphaëlle, Sherr,D.H. and Sonenshein,G.E. (2000) The RelA NF-kappaB subunit and the aryl hydrocarbon receptor (AhR) cooperate to transactivate the c-myc promoter in mammary cells. Oncogene, 19, 5498–5506. 107 28. Chen,P.H., Chang,H., Chang,J.T. and Lin,P. (2012) Aryl hydrocarbon receptor in association with RelA modulates IL-6 expression in non-smoking lung cancer. Oncogene, 31, 2555– 2565. 29. Wilson,S.R., Joshi,A.D. and Elferink,C.J. (2013) The tumor suppressor Kruppel-like factor 6 is a novel aryl hydrocarbon receptor DNA binding partner. J. Pharmacol. Exp. Ther., 345, 419–429. 30. Gingrich,J., Filipovic,D., Conolly,R., Bhattacharya,S. and Veiga-Lopez,A. (2021) Pregnancy-specific physiologically-based toxicokinetic models for bisphenol A and bisphenol S. Environ. Int., 147. 31. Liao,C. and Kannan,K. (2013) Concentrations and profiles of bisphenol a and other bisphenol analogues in foodstuffs from the united states and their implications for human exposure. J. Agric. Food Chem., 61, 4655–4662. 32. Liao,C., Liu,F., Guo,Y., Moon,H.B., Nakata,H., Wu,Q. and Kannan,K. (2012) Occurrence of eight bisphenol analogues in indoor dust from the United States and several Asian countries: implications for human exposure. Environ. Sci. Technol., 46, 9138–9145. 33. Kwak,J. Il, Moon,J., Kim,D., Cui,R. and An,Y.J. (2018) Determination of the soil hazardous concentrations of bisphenol A using the species sensitivity approach. J. Hazard. Mater., 344, 390–397. 34. Corbel,T., Gayrard,V., Viguié,C., Puel,S., Lacroix,M.Z., Toutain,P.L. and Picard-Hagen,N. (2013) Bisphenol A disposition in the sheep maternal-placental-fetal unit: Mechanisms determining fetal internal exposure. Biol. Reprod., 89, 11–12. 35. Gingrich,J., Pu,Y., Ehrhardt,R., Karthikraj,R., Kannan,K. and Veiga-Lopez,A. (2019) Toxicokinetics of bisphenol A, bisphenol S, and bisphenol F in a pregnancy sheep model. Chemosphere, 220, 185–194. 36. Grandin,F.C., Lacroix,M.Z., Gayrard,V., Gauderat,G., Mila,H., Toutain,P.L. and Picard- Hagen,N. (2018) Bisphenol S instead of Bisphenol A: Toxicokinetic investigations in the ovine materno-feto-placental unit. Environ. Int., 120, 584–592. 37. Sonawane,A.R., Platig,J., Fagny,M., Chen,C.-Y., Paulson,J.N., Lopes-Ramos,C.M., DeMeo,D.L., Quackenbush,J., Glass,K. and Kuijjer,M.L. (2017) Understanding Tissue- Specific Gene Regulation. Cell Rep., 21, 1077–1088. 38. Todeschini,A.L., Georges,A. and Veitia,R.A. (2014) Transcription factors: specific DNA binding and specific gene regulation. Trends Genet., 30, 211–219. 39. Caetano,M.S., Hassane,M., Van,H.T., Bugarin,E., Cumpian,A.M., McDowell,C.L., Cavazos,C.G., Zhang,H., Deng,S., Diao,L., et al. (2018) Sex specific function of epithelial STAT3 signaling in pathogenesis of K-ras mutant lung cancer. Nat. Commun. 2018 91, 9, 1–11. 108 40. Warrick,J.I., Walter,V., Yamashita,H., Chung,E., Shuman,L., Amponsa,V.O., Zheng,Z., Chan,W., Whitcomb,T.L., Yue,F., et al. (2016) FOXA1, GATA3 and PPARɣ Cooperate to Drive Luminal Subtype in Bladder Cancer: A Molecular Analysis of Established Human Cell Lines. Sci. Reports 2016 61, 6, 1–15. 41. Kress,S., Reichert,J. and Schwarz,M. (1998) Functional analysis of the human cytochrome P4501A1 (CYP1A1) gene enhancer. Eur. J. Biochem., 258, 803–812. 42. Ye,W., Chen,R., Chen,X., Huang,B., Lin,R., Xie,X., Chen,J., Jiang,J., Deng,Y. and Wen,J. (2019) AhR regulates the expression of human cytochrome P450 1A1 (CYP1A1) by recruiting Sp1. FEBS J., 286, 4215–4231. 43. Denison,M.S., Fisher,J.M. and Whitlock,J.P. (1988) The DNA recognition site for the dioxin-Ah receptor complex. Nucleotide sequence and functional analysis. J. Biol. Chem., 263, 17221–17224. 44. Swanson,H.I., Chan,W.K. and Bradfield,C.A. (1995) DNA binding specificities and pairing rules of the Ah receptor, ARNT, and SIM proteins. J. Biol. Chem., 270, 26292–26302. 45. Sun,Y. V., Boverhof,D.R., Burgoon,L.D., Fielden,M.R. and Zacharewski,T.R. (2004) Comparative analysis of dioxin response elements in human, mouse and rat genomic sequences. Nucleic Acids Res., 32, 4512–4523. 46. Yang,S.Y., Ahmed,S., Satheesh,S. V. and Matthews,J. (2018) Genome-wide mapping and analysis of aryl hydrocarbon receptor (AHR)- and aryl hydrocarbon receptor repressor (AHRR)-binding sites in human breast cancer cells. Arch. Toxicol., 92, 225–240. 47. Lo,R. and Matthews,J. (2012) High-resolution genome-wide Mapping of AHR and ARNT binding sites by ChIP-Seq. Toxicol. Sci., 130, 349–361. 48. Liang,J., Lacroix,L., Gamot,A., Cuddapah,S., Queille,S., Lhoumaud,P., Lepetit,P., Martin,P.G.P., Vogelmann,J., Court,F., et al. (2014) Chromatin immunoprecipitation indirect peaks highlight long-range interactions of insulator proteins and Pol II pausing. Mol. Cell, 53, 672–681. 49. Denison,M.S. and Nagy,S.R. (2003) Activation of the Aryl Hydrocarbon Receptor by Structurally Diverse Exogenous and Endogenous Chemicals. Annu. Rev. Pharmacol. Toxicol., 43, 309–334. 50. Abel,J. and Haarmann-Stemmann,T. (2010) An introduction to the molecular basics of aryl hydrocarbon receptor biology. 391, 1235–1248. 51. Gutiérrez-Vázquez,C. and Quintana,F.J. (2018) Regulation of the Immune Response by the Aryl Hydrocarbon Receptor. Immunity, 48, 19–33. 52. Perdew,G.H. (1988) Association of the Ah receptor with the 90-kDa heat shock protein. J. Biol. Chem., 263, 13802–13805. 109 53. Denis,M., Cuthill,S., Wikström,A.C., Poellinger,L. and Gustafsson,J.Å. (1988) Association of the dioxin receptor with the Mr 90,000 heat shock protein: A structural kinship with the glucocorticoid receptor. Biochem. Biophys. Res. Commun., 155, 801–807. 54. Carver,L.A. and Bradfield,C.A. (1997) Ligand-dependent Interaction of the Aryl Hydrocarbon Receptor with a Novel Immunophilin Homolog In Vivo. J. Biol. Chem., 272, 11452–11456. 55. Meyer,B.K. and Perdew,G.H. (1999) Characterization of the AhR-hsp90-XAP2 core complex and the role of the immunophilin-related protein XAP2 in AhR stabilization. Biochemistry, 38, 8907–8917. 56. Grenert,J.P., Sullivan,W.P., Fadden,P., Haystead,T.A.J., Clark,J., Mimnaugh,E., Krutzsch,H., Ochel,H.J., Schulte,T.W., Sausville,E., et al. (1997) The Amino-terminal Domain of Heat Shock Protein 90 (hsp90) That Binds Geldanamycin Is an ATP/ADP Switch Domain That Regulates hsp90 Conformation. J. Biol. Chem., 272, 23843–23850. 57. Ikuta,T., Eguchi,H., Tachibana,T., Yoneda,Y. and Kawajiri,K. (1998) Nuclear Localization and Export Signals of the Human Aryl Hydrocarbon Receptor. J. Biol. Chem., 273, 2895– 2904. 58. Ikuta,T., Kobayashi,Y. and Kawajiri,K. (2004) Phosphorylation of nuclear localization signal inhibits the ligand-dependent nuclear import of aryl hydrocarbon receptor. Biochem. Biophys. Res. Commun., 317, 545–550. 59. Durrin,L.K., Jones,P.B.C., Fisher,J.M., Galeazzi,D.R. and Whitlock,J.P. (1987) 2,3,7,8- Tetrachlorodibenzo-p-dioxin receptors regulate transcription of the cytochrome P1-450 gene. J. Cell. Biochem., 35, 153–160. 60. Dere,E., Lo,R., Celius,T., Matthews,J. and Zacharewski,T.R. (2011) Integration of Genome- Wide Computation DRE Search, AhR ChIP-chip and Gene Expression Analyses of TCDD- Elicited Responses in the Mouse Liver. BMC Genomics 2011 121, 12, 1–19. 61. Nebert,D.W., Roe,A.L., Dieter,M.Z., Solis,W.A., Yang,Y. and Dalton,T.P. (2000) Role of the aromatic hydrocarbon receptor and (Ah) gene battery in the oxidative stress response, cell cycle control, and apoptosis. In Biochemical Pharmacology. Biochem Pharmacol, Vol. 59, pp. 65–85. 62. Sorg,O. (2014) AhR signalling and dioxin toxicity. Toxicol. Lett., 230, 225–233. 63. Beischlag,T. V., Morales,J.L., Hollingshead,B.D. and Perdew,G.H. (2008) The Aryl Hydrocarbon Receptor Complex and the Control of Gene Expression. Crit. Rev. Eukaryot. Gene Expr., 18, 207. 64. Barski,A., Cuddapah,S., Cui,K., Roh,T.Y., Schones,D.E., Wang,Z., Wei,G., Chepelev,I. and Zhao,K. (2007) High-Resolution Profiling of Histone Methylations in the Human Genome. Cell, 129, 823–837. 110 65. Rhee,H.S. and Pugh,B.F. (2011) Comprehensive Genome-wide Protein-DNA Interactions Detected at Single Nucleotide Resolution. Cell, 147, 1408. 66. He,Q., Johnston,J. and Zeitlinger,J. (2015) ChIP-nexus enables improved detection of in vivo transcription factor binding footprints. Nat. Biotechnol. 2015 334, 33, 395–401. 67. Wasserman,W.W. and Sandelin,A. (2004) Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 2004 54, 5, 276–287. 68. Khan,A., Fornes,O., Stigliani,A., Gheorghe,M., Castro-Mondragon,J.A., van der Lee,R., Bessy,A., Chèneby,J., Kulkarni,S.R., Tan,G., et al. (2018) JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res., 46, D260. 69. Staden,R. (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res., 12, 505. 70. Li,F. and Stormo,G.D. (2001) Selection of optimal DNA oligos for gene expression arrays. Bioinformatics, 17, 1067–1076. 71. Ogawa,N. and Biggin,M.D. (2012) High-Throughput SELEX Determination of DNA Sequences Bound by Transcription Factors In Vitro. Methods Mol. Biol., 786, 51–63. 72. Karimzadeh,M. and Hoffman,M.M. (2022) Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome. Genome Biol. 2022 231, 23, 1–23. 73. Gotea,V., Visel,A., Westlund,J.M., Nobrega,M.A., Pennacchio,L.A. and Ovcharenko,I. (2010) Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res., 20, 565–577. 74. Yan,J., Enge,M., Whitington,T., Dave,K., Liu,J., Sur,I., Schmierer,B., Jolma,A., Kivioja,T., Taipale,M., et al. (2013) Transcription Factor Binding in Human Cells Occurs in Dense Clusters Formed around Cohesin Anchor Sites. Cell, 154, 801–813. 75. Pique-Regi,R., Degner,J.F., Pai,A.A., Gaffney,D.J., Gilad,Y. and Pritchard,J.K. (2011) Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res., 21, 447–455. 76. Quang,D. and Xie,X. (2019) FactorNet: A deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods, 166, 40–47. 77. Keilwagen,J., Posch,S. and Grau,J. (2019) Accurate prediction of cell type-specific transcription factor binding. Genome Biol. 2019 201, 20, 1–17. 78. Srivastava,D. and Mahony,S. (2020) Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. Biochim. Biophys. Acta - Gene Regul. Mech., 1863, 194443. 111 79. Chen,T. and Guestrin,C. (2016) XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, pp. 785–794. 80. Filipovic,D., Qi,W., Kana,O.Z., Marri,D., LeCluyse,E.L., Andersen,M.E., Cuddapah,S. and Bhattacharya,S. (2022) Predictive Models of Genome-wide Aryl Hydrocarbon Receptor DNA Binding Reveal Tissue Specific Binding Determinants. bioRxiv, 10.1101/2022.05.13.491754. 81. Davis,C.A., Hitz,B.C., Sloan,C.A., Chan,E.T., Davidson,J.M., Gabdank,I., Hilton,J.A., Jain,K., Baymuradov,U.K., Narayanan,A.K., et al. (2018) The Encyclopedia of DNA elements (ENCODE): Data portal update. Nucleic Acids Res., 46, D794–D801. 82. Dunham,I., Kundaje,A., Aldred,S.F., Collins,P.J., Davis,C.A., Doyle,F., Epstein,C.B., Frietze,S., Harrow,J., Kaul,R., et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74. 83. Neavin,D.R., Lee,J.-H., Liu,D., Ye,Z., Li,H., Wang,L., Ordog,T. and Weinshilboum,R.M. (2019) Single Nucleotide Polymorphisms at a Distance from Aryl Hydrocarbon Receptor (AHR) Binding Sites Influence AHR Ligand–Dependent Gene Expression. Drug Metab. Dispos., 47, 983–994. 84. Barrett,T., Wilhite,S.E., Ledoux,P., Evangelista,C., Kim,I.F., Tomashevsky,M., Marshall,K.A., Phillippy,K.H., Sherman,P.M., Holko,M., et al. (2013) NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res., 41, D991–D995. 85. Oki,S., Ohta,T., Shioi,G., Hatanaka,H., Ogasawara,O., Okuda,Y., Kawaji,H., Nakaki,R., Sese,J. and Meno,C. (2018) ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data. EMBO Rep., 19, e46255. 86. Ramírez,F., Dündar,F., Diehl,S., Grüning,B.A. and Manke,T. (2014) deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res., 42, W187–W191. 87. Pansoy,A., Ahmed,S., Valen,E., Sandelin,A. and Matthews,J. (2010) 3-Methylcholanthrene Induces Differential Recruitment of Aryl Hydrocarbon Receptor to Human Promoters. Toxicol. Sci., 117, 90–100. 88. S,A., E,V., A,S. and J,M. (2009) Dioxin increases the interaction between aryl hydrocarbon receptor and estrogen receptor alpha at human promoters. Toxicol. Sci., 111, 254–266. 89. Navarro Gonzalez,J., Zweig,A.S., Speir,M.L., Schmelter,D., Rosenbloom,K.R., Raney,B.J., Powell,C.C., Nassar,L.R., Maulding,N.D., Lee,C.M., et al. (2021) The UCSC genome browser database: 2021 update. Nucleic Acids Res., 49, D1046–D1057. 90. Howe,K.L., Achuthan,P., Allen,J., Allen,J., Alvarez-Jarreta,J., Ridwan Amode,M., Armean,I.M., Azov,A.G., Bennett,R., Bhai,J., et al. (2021) Ensembl 2021. Nucleic Acids Res., 49, D884–D891. 112 91. Ernst,J. and Kellis,M. (2012) ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods, 9, 215–216. 92. Virtanen,P., Gommers,R., Oliphant,T.E., Haberland,M., Reddy,T., Cournapeau,D., Burovski,E., Peterson,P., Weckesser,W., Bright,J., et al. (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 2020 173, 17, 261–272. 93. Elith,J., Leathwick,J.R. and Hastie,T. (2008) A working guide to boosted regression trees. J. Anim. Ecol., 77, 802–813. 94. Gregorutti,B., Michel,B. and Saint-Pierre,P. (2017) Correlation and variable importance in random forests. Stat Comput, 27, 659–678. 95. Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097–6100. 96. Haidar,R., Henkler,F., Kugler,J., Rosin,A., Genkinger,D., Laux,P. and Luch,A. (2021) The role of DNA-binding and ARNT dimerization on the nucleo-cytoplasmic translocation of the aryl hydrocarbon receptor. Sci. Reports 2021 111, 11, 1–11. 97. Sogawa,K., Nakano,R., Kobayashi,A., Kikuchi,Y., Ohe,N., Matsushita,N. and Fujii- Kuriyama,Y. (1995) Possible function of Ah receptor nuclear translocator (Arnt) homodimer in transcriptional regulation. Proc. Natl. Acad. Sci. U. S. A., 92, 1936–1940. 98. Gassmann,M., Chilov,D. and Wenger,R.H. (2000) Regulation of the hypoxia-inducible factor-1 alpha. ARNT is not necessary for hypoxic induction of HIF-1 alpha in the nucleus. Adv. Exp. Med. Biol., 475, 87–99. 99. Huang,G. and Elferink,C.J. (2012) A Novel Nonconsensus Xenobiotic Response Element Capable of Mediating Aryl Hydrocarbon Receptor-Dependent Gene Expression. Mol. Pharmacol., 81, 338. 100. Tian,Y., Ke,S., Denison,M.S., Rabson,A.B. and Gallo,M.A. (1999) Ah receptor and NF- kappaB interactions, a potential mechanism for dioxin toxicity. J. Biol. Chem., 274, 510– 515. 101. Gupta,S., Stamatoyannopoulos,J.A., Bailey,T.L. and Noble,W.S. (2007) Quantifying similarity between motifs. Genome Biol., 8, 1–9. 102. Kulakovskiy,I. V., Vorontsov,I.E., Yevshin,I.S., Sharipov,R.N., Fedorova,A.D., Rumynskiy,E.I., Medvedeva,Y.A., Magana-Mora,A., Bajic,V.B., Papatsenko,D.A., et al. (2018) HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res., 46, D252. 103. Castro-Mondragon,J.A., Riudavets-Puig,R., Rauluseviciute,I., Berhanu Lemma,R., Turchi,L., Blanc-Mathieu,R., Lucas,J., Boddie,P., Khan,A., Perez,N.M., et al. (2022) JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res., 50, D165–D173. 113 104. Grant,C.E., Bailey,T.L. and Noble,W.S. (2011) FIMO: scanning for occurrences of a given motif. Bioinformatics, 27, 1017–1018. 105. Cock,P.J.A., Antao,T., Chang,J.T., Chapman,B.A., Cox,C.J., Dalke,A., Friedberg,I., Hamelryck,T., Kauff,F., Wilczynski,B., et al. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422– 1423. 106. Gore,A.C., Crews,D., Doan,L.L., Merrill,M. La, Patisaul,H. and Zota,A. (2014) Introduction to Endocrine Disrupting Chemicals (EDCs): A Guide for Public Interest Organizations and Policy Makers. 107. Liao,C., Liu,F., Alomirah,H., Loi,V.D., Mohd,M.A., Moon,H.B., Nakata,H. and Kannan,K. (2012) Bisphenol S in urine from the United States and seven Asian countries: Occurrence and human exposures. Environ. Sci. Technol., 46, 6860–6866. 108. Jalal,N., Surendranath,A.R., Pathak,J.L., Yu,S. and Chung,C.Y. (2018) Bisphenol A (BPA) the mighty and the mutagenic. Toxicol. Reports, 5, 76–84. 109. EPA (2015) Bisphenol A Alternatives in Thermal Paper. 110. Rochester,J.R. and Bolden,A.L. (2015) Bisphenol S and F: A systematic review and comparison of the hormonal activity of bisphenol a substitutes. Environ. Health Perspect., 123, 643–650. 111. Philips,E.M., Jaddoe,V.W.V., Asimakopoulos,A.G., Kannan,K., Steegers,E.A.P., Santos,S. and Trasande,L. (2018) Bisphenol and phthalate concentrations and its determinants among pregnant women in a population-based cohort in the Netherlands, 2004–5. Environ. Res., 161, 562–572. 112. Ye,X., Wong,L.Y., Kramer,J., Zhou,X., Jia,T. and Calafat,A.M. (2015) Urinary Concentrations of Bisphenol A and Three Other Bisphenols in Convenience Samples of U.S. Adults during 2000-2014. Environ. Sci. Technol., 49, 11834–11839. 113. Asimakopoulos,A.G., Xue,J., De Carvalho,B.P., Iyer,A., Abualnaja,K.O., Yaghmoor,S.S., Kumosani,T.A. and Kannan,K. (2016) Urinary biomarkers of exposure to 57 xenobiotics and its association with oxidative stress in a population in Jeddah, Saudi Arabia. Environ. Res., 150, 573–581. 114. Lehmler,H.J., Liu,B., Gadogbe,M. and Bao,W. (2018) Exposure to Bisphenol A, Bisphenol F, and Bisphenol S in U.S. Adults and Children: The National Health and Nutrition Examination Survey 2013-2014. ACS Omega, 3, 6523–6532. 115. Rocha,B.A., Asimakopoulos,A.G., Honda,M., da Costa,N.L., Barbosa,R.M., Barbosa,F. and Kannan,K. (2018) Advanced data mining approaches in the assessment of urinary concentrations of bisphenols, chlorophenols, parabens and benzophenones in Brazilian children and their association to DNA damage. Environ. Int., 116, 269–277. 114 116. Xue,J., Wu,Q., Sakthivel,S., Pavithran,P. V., Vasukutty,J.R. and Kannan,K. (2015) Urinary levels of endocrine-disrupting chemicals, including bisphenols, bisphenol A diglycidyl ethers, benzophenones, parabens, and triclosan in obese and non-obese Indian children. Environ. Res., 137, 120–128. 117. Gingrich,J., Ticiani,E. and Veiga-Lopez,A. (2020) Placenta Disrupted: Endocrine Disrupting Chemicals and Pregnancy. Trends Endocrinol. Metab., 31, 508–524. 118. Veiga-Lopez,A., Pu,Y., Gingrich,J. and Padmanabhan,V. (2018) Obesogenic Endocrine Disrupting Chemicals: Identifying Knowledge Gaps. Trends Endocrinol. Metab., 29, 607– 625. 119. Kolatorova,L., Vitku,J., Hampl,R., Adamcova,K., Skodova,T., Simkova,M., Parizek,A., Starka,L. and Duskova,M. (2018) Exposure to bisphenols and parabens during pregnancy and relations to steroid changes. Environ. Res., 163, 115–122. 120. Wan,Y., Huo,W., Xu,S., Zheng,T., Zhang,B., Li,Y., Zhou,A., Zhang,Y., Hu,J., Zhu,Y., et al. (2018) Relationship between maternal exposure to bisphenol S and pregnancy duration. Environ. Pollut., 238, 717–724. 121. Gingrich,J., Pu,Y., Roberts,J., Karthikraj,R., Kannan,K., Ehrhardt,R. and Veiga-Lopez,A. (2018) Gestational bisphenol S impairs placental endocrine function and the fusogenic trophoblast signaling pathway. Arch. Toxicol., 92, 1861–1876. 122. Kolla,S.D.D., Morcos,M., Martin,B. and Vandenberg,L.N. (2018) Low dose bisphenol S or ethinyl estradiol exposures during the perinatal period alter female mouse mammary gland development. Reprod. Toxicol., 78, 50–59. 123. Pu,Y., Gingrich,J.D., Steibel,J.P. and Veiga-Lopez,A. (2017) Sex-Specific Modulation of Fetal Adipogenesis by Gestational Bisphenol A and Bisphenol S Exposure. Endocrinology, 158, 3844–3858. 124. Catanese,M.C. and Vandenberg,L.N. (2017) Bisphenol S (BPS) Alters Maternal Behavior and Brain in Mice Exposed During Pregnancy/Lactation and Their Daughters. Endocrinology, 158, 516–530. 125. Tsamandouras,N., Rostami-Hodjegan,A. and Aarons,L. (2015) Combining the ‘bottom up’ and ‘top down’ approaches in pharmacokinetic modelling: fitting PBPK models to observed clinical data. Br. J. Clin. Pharmacol., 79, 48–55. 126. Ke,A.B., Greupink,R. and Abduljalil,K. (2018) Drug Dosing in Pregnant Women: Challenges and Opportunities in Using Physiologically Based Pharmacokinetic Modeling and Simulations. CPT Pharmacometrics Syst. Pharmacol., 7, 103–110. 127. Zhuang,X. and Lu,C. (2016) PBPK modeling and simulation in drug research and development. Acta Pharm. Sin. B, 6, 430–440. 128. Fowden,A.L., Forhead,A.J., Sferruzzi-Perri,A.N., Burton,G.J. and Vaughan,O.R. (2015) 115 Review: Endocrine regulation of placental phenotype. Placenta, 36, S50–S59. 129. Mourier,E., Tarrade,A., Duan,J., Richard,C., Bertholdt,C., Beaumont,M., Morel,O. and Chavatte-Palmer,P. (2016) Non-invasive evaluation of placental blood flow: lessons from animal models. Reproduction, 153, R85–R96. 130. Krishna,R., Riggs,K.W., Kwan,E., Wong,H., Szeitz,A., Walker,M.P.R. and Rurak,D.W. (2002) Clearance and disposition of indometacin in chronically instrumented fetal lambs following a 3-day continuous intravenous infusion. J. Pharm. Pharmacol., 54, 801–808. 131. Ngamprasertwong,P., Dong,M., Niu,J., Venkatasubramanian,R., Vinks,A.A. and Sadhasivam,S. (2016) Propofol Pharmacokinetics and Estimation of Fetal Propofol Exposure during Mid-Gestational Fetal Surgery: A Maternal-Fetal Sheep Model. PLoS One, 11, e0146563. 132. Corbel,T., Perdu,E., Gayrard,V., Puel,S., Lacroix,M.Z., Viguié,C., Toutain,P.L., Zalko,D. and Picard-Hagen,N. (2015) Conjugation and Deconjugation Reactions within the Fetoplacental Compartment in a Sheep Model: A Key Factor Determining Bisphenol A Fetal Exposure. Drug Metab. Dispos., 43, 467–476. 133. Fisher,J.W., Twaddle,N.C., Vanlandingham,M. and Doerge,D.R. (2011) Pharmacokinetic modeling: Prediction and evaluation of route dependent dosimetry of bisphenol A in monkeys with extrapolation to humans. Toxicol. Appl. Pharmacol., 257, 122–136. 134. Karrer,C., Roiss,T., von Goetz,N., Skledar,D.G., Mašič,L.P. and Hungerbühler,K. (2018) Physiologically based pharmacokinetic (PBPK) modeling of the bisphenols BPA, BPS, BPF, and BPAF with new experimental metabolic parameters: Comparing the pharmacokinetic behavior of BPA with its substitutes. Environ. Health Perspect., 126. 135. Kawamoto,Y., Matsuyama,W., Wada,M., Hishikawa,J., Chan,M.P.L., Nakayama,A. and Morisawa,S. (2007) Development of a physiologically based pharmacokinetic model for bisphenol A in pregnant mice. Toxicol. Appl. Pharmacol., 224, 182–191. 136. Poet,T. and Hays,S. (2017) Extrapolation of plasma clearance to understand species differences in toxicokinetics of bisphenol A. https://doi.org/10.1080/00498254.2017.1379626, 48, 891–897. 137. Sharma,R.P., Schuhmacher,M. and Kumar,V. (2018) The development of a pregnancy PBPK Model for Bisphenol A and its evaluation with the available biomonitoring data. Sci. Total Environ., 624, 55–68. 138. Vom Saal,F.S., Vandevoort,C.A., Taylor,J.A., Welshons,W. V., Toutain,P.L. and Hunt,P.A. (2014) Bisphenol A (BPA) pharmacokinetics with daily oral bolus or continuous exposure via silastic capsules in pregnant rhesus monkeys: Relevance for human exposures. Reprod. Toxicol., 45, 105–116. 139. Domoradzki,J.Y., Pottenger,L.H., Thornton,C.M., Hansen,S.C., Card,T.L., Markham,D.A., Dryzga,M.D., Shiotsuka,R.N. and Waechter,J.M. (2003) Metabolism and Pharmacokinetics 116 of Bisphenol A (BPA) and the Embryo-Fetal Distribution of BPA and BPA- Monoglucuronide in CD Sprague-Dawley Rats at Three Gestational Stages. Toxicol. Sci., 76, 21–34. 140. Doerge,D.R., Twaddle,N.C., Woodling,K.A. and Fisher,J.W. (2010) Pharmacokinetics of bisphenol A in neonatal and adult rhesus monkeys. Toxicol. Appl. Pharmacol., 248, 1–11. 141. Völkel,W., Colnot,T., Csanády,G.A., Filser,J.G. and Dekant,W. (2002) Metabolism and Kinetics of Bisphenol A in Humans at Low Doses Following Oral Administration. Chem. Res. Toxicol., 15, 1281–1287. 142. Lucier,G.W., Sonawane,B.R. and McDaniel,O.S. (1977) Glucuronidation and deglucuronidation reactions in hepatic and extrahepatic tissues during perinatal development. Drug Metab. Dispos., 5. 143. Oh,J., Choi,J.W., Ahn,Y.A. and Kim,S. (2018) Pharmacokinetics of bisphenol S in humans after single oral administration. Environ. Int., 112, 127–133. 144. Khmiri,I., Côté,J., Mantha,M., Khemiri,R., Lacroix,M., Gely,C., Toutain,P.L., Picard- Hagen,N., Gayrard,V. and Bouchard,M. (2020) Toxicokinetics of bisphenol-S and its glucuronide in plasma and urine following oral and dermal exposure in volunteers for the interpretation of biomonitoring data. Environ. Int., 138, 105644. 145. Nishikawa,M., Iwano,H., Yanagisawa,R., Koike,N., Inoue,H. and Yokota,H. (2010) Placental transfer of conjugated bisphenol A and subsequent reactivation in the rat fetus. Environ. Health Perspect., 118, 1196–1203. 146. Ho,K.L., Yuen,K.K., Yau,M.S., Murphy,M.B., Wan,Y., Fong,B.M.W., Tam,S., Giesy,J.P., Leung,K.S.Y. and Lam,M.H.W. (2017) Glucuronide and Sulfate Conjugates of Bisphenol A: Chemical Synthesis and Correlation Between Their Urinary Levels and Plasma Bisphenol A Content in Voluntary Human Donors. Arch. Environ. Contam. Toxicol., 73, 410–420. 147. Craigmill,A.L. (2003) A physiologically based pharmacokinetic model for oxytetracycline residues in sheep. J. Vet. Pharmacol. Ther., 26, 55–63. 148. Upton,R.N. (2008) Organ weights and blood flows of sheep and pig for physiological pharmacokinetic modelling. J. Pharmacol. Toxicol. Methods, 58, 198–205. 149. Makowski,E.L., Meschia,G., Droegemueller,W. and Battaglia,F.C. (1968) Measurement of umbilical arterial blood flow to the sheep placenta and fetus in utero. Distribution to cotyledons and the intercotyledonary chorion. Circ. Res., 23, 623–631. 150. NCBI 4,4’-Sulfonyldiphenol, CID=6626. 151. Chow,E.C.Y., Talattof,A., Tsakalozou,E., Fan,J., Zhao,L. and Zhang,X. (2016) Using Physiologically Based Pharmacokinetic (PBPK) Modeling to Evaluate the Impact of Pharmaceutical Excipients on Oral Drug Absorption: Sensitivity Analyses. AAPS J., 18, 117 1500–1511. 152. Lyons,M.A., Reisfeld,B., Yang,R.S.H. and Lenaerts,A.J. (2013) A physiologically based pharmacokinetic model of rifampin in mice. Antimicrob. Agents Chemother., 57, 1763– 1771. 153. Shin,B.S., Kim,C.H., Jun,Y.S., Kim,D.H., Lee,B.M., Yoon,C.H., Park,E.H., Lee,K.C., Han,S.-Y., Park,K.L., et al. (2004) PHYSIOLOGICALLY BASED PHARMACOKINETICS OF BISPHENOL A. J. Toxicol. Environ. Heal. Part A, 67, 1971– 1985. 154. Edginton,A.N. and Ritter,L. (2009) Predicting plasma concentrations of bisphenol A in children younger than 2 years of age after typical feeding schedules, using a physiologically based toxicokinetic model. Environ. Health Perspect., 117, 645–652. 155. Bonnans,J.F., Gilbert,J.C., Lemaréchal,C. and Sagastizábal,C.A. (2003) Numerical Optimization. 10.1007/978-3-662-05078-1. 156. EPA (2012) Bisphenol A, CASRN 80-05-7. IRIS (Integrated Risk Information System). Washingron, DC. 157. Medley,J.K., Choi,K., König,M., Smith,L., Gu,S., Hellerstein,J., Sealfon,S.C. and Sauro,H.M. (2018) Tellurium notebooks—An environment for reproducible dynamical modeling in systems biology. PLOS Comput. Biol., 14, e1006220. 158. Saltelli,A., Annoni,P., Azzini,I., Campolongo,F., Ratto,M. and Tarantola,S. (2010) Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Comput. Phys. Commun., 181, 259–270. 159. Herman,J. and Usher,W. (2017) SALib : An open-source Python library for Sensitivity Analysis. J. Open Source Softw., 2, 3873–3878. 160. Saltelli,A. (2002) Making best use of model evaluations to compute sensitivity indices. Comput. Phys. Commun., 145, 280–297. 161. Lonard,D.M. and O’Malley,B.W. (2006) The Expanding Cosmos of Nuclear Receptor Coactivators. Cell, 125, 411–414. 162. McDowell,I.C., Barrera,A., D’Ippolito,A.M., Vockley,C.M., Hong,L.K., Leichter,S.M., Bartelt,L.C., Majoros,W.H., Song,L., Safi,A., et al. (2018) Glucocorticoid receptor recruits to enhancers and drives activation by motif-directed binding. Genome Res., 28, 1272–1284. 163. Creyghton,M.P., Cheng,A.W., Welstead,G.G., Kooistra,T., Carey,B.W., Steine,E.J., Hanna,J., Lodato,M.A., Frampton,G.M., Sharp,P.A., et al. (2010) Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc. Natl. Acad. Sci., 107, 21931–21936. 164. Adomas,A.B., Grimm,S.A., Malone,C., Takaku,M., Sims,J.K. and Wade,P.A. (2014) Breast 118 tumor specific mutation in GATA3 affects physiological mechanisms regulating transcription factor turnover. BMC Cancer, 14. 165. Takaku,M., Grimm,S.A., De Kumar,B., Bennett,B.D. and Wade,P.A. (2020) Cancer- specific mutation of GATA3 disrupts the transcriptional regulatory network governed by Estrogen Receptor alpha, FOXA1 and GATA3. Nucleic Acids Res., 48, 4756–4768. 166. Jiang,G., Wang,X., Sheng,D., Zhou,L., Liu,Y., Xu,C., Liu,S. and Zhang,J. (2019) Cooperativity of co-factor NR2F2 with pioneer factors GATA3, FOXA1 in promoting ERα function. Theranostics, 9, 6501–6516. 167. Swaminathan,G., Nguyen,L.P., Namkoong,H., Pan,J., Haileselassie,Y., Patel,A., Ji,A.R., Mikhail,D.M., Dinh,T.T., Singh,H., et al. (2021) The aryl hydrocarbon receptor regulates expression of mucosal trafficking receptor GPR15. Mucosal Immunol. 2021 144, 14, 852– 861. 168. Öberg,M., Bergander,L., Håkansson,H., Rannug,U. and Rannug,A. (2005) Identification of the tryptophan photoproduct 6-formylindolo[3,2-b]carbazole, in cell culture medium, as a factor that controls the background aryl hydrocarbon receptor activity. Toxicol. Sci., 85, 935–943. 169. Pan,Y., Deng,M., Li,J., Du,B., Lan,S., Liang,X. and Zeng,L. (2020) Occurrence and Maternal Transfer of Multiple Bisphenols, including an Emerging Derivative with Unexpectedly High Concentrations, in the Human Maternal-Fetal-Placental Unit. Environ. Sci. Technol., 54, 3476–3486. 170. Andra,S.S., Austin,C., Yang,J., Patel,D. and Arora,M. (2016) Recent advances in simultaneous analysis of bisphenol A and its conjugates in human matrices: Exposure biomarker perspectives. Sci. Total Environ., 572, 770–781. 171. Liu,J., Li,J., Wu,Y., Zhao,Y., Luo,F., Li,S., Yang,L., Moez,E.K., Dinu,I. and Martin,J.W. (2017) Bisphenol A Metabolites and Bisphenol S in Paired Maternal and Cord Serum. Environ. Sci. Technol., 51, 2456–2463. 172. Boucher,J.G., Boudreau,A., Ahmed,S. and Atlas,E. (2015) In vitro effects of bisphenol A β- D-glucuronide (BPA-G) on adipogenesis in human and murine preadipocytes. Environ. Health Perspect., 123, 1287–1293. 119