ANALYSIS OF COOPERATIVE TRANSCRIPTION FACTOR BINDING AT THE SEQUENCE LEVEL

By Jacob

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Physics - Doctor of Philosophy

2015

ABSTRACT

Transcription factor binding to DNA binding sites is one of the primary causes of gene regulation. A common representation of transcription factor binding sites is at the DNA sequence level, partly because recurring sequence patterns occur throughout the genome for a given factor. The first chapter of this dissertation introduces gene regulation from the perspective of development. In addition, Chapter 1 discusses the mathematical-physics foundation for performing calculations on, and for representing, transcription factor binding sites at the sequence level. In Chapter 2 I explore the possibility that two distinct sub-types of binding sites may co-exist within a population of functional sites. This leads to a model that can be used for prediction of transcription factor binding sites. In Chapter 3 I explore modelling of the Dorsal-Ventral early development Gene Regulatory Network, using the tools built up in Chapters 1 and 2, namely 'Position Weight Matrices' that allow for prediction of binding energies for genomic segments of DNA.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1
  1.1 Mathematical Physics Introduction
    1.1.1 Phase space, a real vector space
    1.1.2 Thermodynamics
      1.1.2.1 Derivation of the chemical potential: $\mu = \mu_o + \ln(c)$
      1.1.2.2 Nucleoplasm genome ligand binding problem
      1.1.2.3 One binding site
      1.1.2.4 Two dependent binding sites
      1.1.2.5 A genome of n dependent binding sites
      1.1.2.6 Highly correlated systems, the k-mer sequence and recognition problem
      1.1.2.7 K-mers and PWM binding constants
      1.1.2.8 Single binding site protein systems in distinct environments
      1.1.2.9 An example of two k-mer binding sites system using a mixture
  1.2 Biological Introduction
    1.2.1 Tree of Life and the Theory of Life
      1.2.1.1 Comparative Anatomy and Physiology
      1.2.1.2 Classical evolution
      1.2.1.3 Modern Synthesis of evolution
      1.2.1.4 Comparative genomics
      1.2.1.5 Expression patterns
      1.2.1.6 The generalization of the theory of evolution; developmental genetics
    1.2.2 Development
      1.2.2.1 The origin of multicellularity; the evolution of development
      1.2.2.2 Fly Development
      1.2.2.3 The crowning jewel of Evo-Devo: the HOX genes
      1.2.2.4 Evolution of body plans in animals
    1.2.3 Gene regulation
      1.2.3.1 Conserved Gene Regulatory Networks
      1.2.3.2 Anterior Posterior (AP) axis formation
      1.2.3.3 Dorsal Ventral (DV) axis formation
      1.2.3.4 Modularity in gene regulatory networks
      1.2.3.5 Transcription factor binding sites
Chapter 2
  2.1 Introduction
    2.1.1 Position Weight Matrices
    2.1.2 In-vitro Biophysical PWMs
    2.1.3 Evolutionary PWMs
    2.1.4 Relation between biophysical PWMs and evolutionary PWMs
    2.1.5 Shortcomings of PWMs
    2.1.6 Physical Shortcomings of PWMs
    2.1.7 Dependencies within transcription factor binding sites
    2.1.8 Dependencies between transcription factor binding sites
    2.1.9 Conditional PWMs based on co-occurring factor binding sites
  2.2 Materials
    2.2.1 Data for known Dorsal binding sites in D. melanogaster Dorsal-Ventral network
    2.2.2 DNA sequence context of binding sites
  2.3 Methods
    2.3.1 Clustering Dorsal target loci based on co-occurring binding sites
    2.3.2 Classifying binding sites based on spacer window
    2.3.3 Energy estimation of a base
    2.3.4 Energy estimation of a sequence of bases
  2.4 Model detectors
  2.5 Results
    2.5.1 Optimal spacer window for the OR Gate detector
    2.5.2 The conditional and unconditional PWMs are significantly different
  2.6 Performance of optimal classifiers (detectors)
    2.6.1 The DC detector predicts sites proximal to 5'-CAYATG with better odds than the DU detector
    2.6.2 Both OR gate and CB detectors show high sensitivity with known sites as positives and CRM sequences as negatives
    2.6.3 The OR gate performs better than CB at predicting known sites at lower energies
  2.7 Discussion
    2.7.1 DC and DU Information logos and previous evidence
    2.7.2 The OR gate and the CB detector
    2.7.3 The CB dataset, merging and dividing clusters of binding sites
    2.7.4 Mixtures of asymmetric PWMs
    2.7.5 Comparing models with unequal parameters
    2.7.6 Information that detectors have about Dorsal binding sites
    2.7.7 Conditional detectors
    2.7.8 Forms of Conditional Detector's score
  2.8 Conclusion
  2.9 Methods Supplement
    2.9.1 Alignment of cis-regulatory modules and collection of $D_{CB}$
    2.9.2 CRM alignment using MUSCLE
    2.9.3 MUSCLE is both fast and accurate
    2.9.4 MUSCLE parameter sensitivity
    2.9.5 GEMSTAT modification for locus annotation of CRMs
    2.9.6 Overlapping site processing
    2.9.7 Error in estimating the spacer length between known Dorsal loci and Twist sites
    2.9.8 PWM Best predictions of binding site loci
    2.9.9 Expectation Maximization Alignment
    2.9.10 CB was designed to be an approximation to a mixture
    2.9.11 Conditional Distributions
    2.9.12 Detector energy thresholds, $E_c$
  2.10 Results Supplement
    2.10.1 Description of ranksum sampling distribution construction
    2.10.2 Log odds ratio test of DC and DU positive hits
    2.10.3 Mutual Information between known class tags and the conditional detector's predictions of class tags
  2.11 Additional Experiment Supplement, Rerunning Model on CACATG Twist Motif
    2.11.1 ROC curve
    2.11.2 Mutual Information between loci classes C and detector predictions of classes P
    2.11.3 Permutation test using ranksum statistic
    2.11.4 Propagation of error for estimating error in Entropy estimates
    2.11.5 Determining if differences in Information are significant
    2.11.6 Entropy bias from Maximum Likelihood estimation
    2.11.7 Entropy Bias
Chapter 3
  3.1 Model Background
    3.1.1 Fractional Occupancy of Morphogen Binding to DNA binding Site
    3.1.2 Fractional occupancy of CRMs containing multiple binding sites
    3.1.3 Segal's Hidden Markov Model
    3.1.4 Enumerating the configurations of a CRM sequence
    3.1.5 The configuration vector nomenclature
    3.1.6 An example of the hybrid configuration notation
    3.1.7 The pairwise interaction $\omega$ between bound factors
    3.1.8 Relating the number of mRNA transcripts to fractional occupancy of PolII
    3.1.9 Fractional occupancy of BTA
    3.1.10 Fractional occupancy of BTA from a binding reaction perspective
    3.1.11 Fractional occupancy of BTA in Cooperative Binding (CB) model in Xin He's GEMSTAT
    3.1.12 Fractional occupancy of BTA in Ay's model
  3.2 Dataset
    3.2.1 Sequence Part of the data
    3.2.2 Positional-dependent target gene response data $E_t$
    3.2.3 Positional-dependent morphogen data
    3.2.4 Collection of data from DV network of Dorsal, Twist, and Snail targets in Neuroectoderm and Mesoderm, and PWMs
  3.3 Nonlinear regression model
    3.3.1 Putting the data parts and free parameters together to form the nonlinear model of BTA occupancy
    3.3.2 Free parameters to be fit
  3.4 Annotation model of binding sites
    3.4.1 Discovering the binding sites within the CRM
    3.4.2 Annotation Model of Binding Sites without a PWM threshold
  3.5 Model fitting
    3.5.1 Covariance matrix of fit parameters
    3.5.2 The overdetermined and underdetermined problem
  3.6 Results
    3.6.1 Best fit of parameters for data from Section 3.2.4, Experiment 1
    3.6.2 Redesigning the parameters to be fit
    3.6.3 Analytic Jacobian
    3.6.4 Analysis of the effect of binding energy on occupancy
    3.6.5 Balanced datasets of mesoderm and neuroectoderm enhancers, Experiment 2
    3.6.6 Robustness analysis, Experiment 3
  3.7 Discussion and Background on Robustness
    3.7.1 Morphogen Gradients and Developmental Robustness
    3.7.2 Fine patterns
    3.7.3 of Border and Production Rate
    3.7.4 Uniform Shifts and Scale Invariance
    3.7.5 Molecular basis for Robustness of Morphogen gradients
  3.8 Conclusion
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Mutual Information between functional Dorsal binding site sequences and putative Twist sites that match 5'-CAYATG using a sliding spacer window scheme.

Table 2.2: Contingency table with the conditional detectors DC and DU represented along the rows and the class types distal and proximal represented along the columns. Each table element represents the number of sites predicted from each detector of each class type based on Twist sites (5'-CAYATG) and a CB energy cutoff $E(S) = E_c = 2.1$.

Table 2.3: Contingency table of DC and DU detector versus the class types distal and proximal. Elements of the table are the counts from predictions of each detector for a given energy and spacer in a given set of CRMs.

Table 2.4: Mutual Information and Log odds ratio test for all known Dorsal sites with the Twist motif 5'-CACATG based on the spacer windows denoted by the columns of the table.

LIST OF FIGURES

Figure 1.1: Circuit diagram of the Dorsal gene regulatory network in the neuroectoderm tissue of early development, designed using Strathopolis' template [116] provided at the Marine Biology Laboratory's Gene Regulatory Networks summer course directed by Eric Davidson and Mike Levine.

Figure 2.1: Logos generated for known Dorsal sites (the $D_{CB}$ data) tested for adjacency to 5'-CAYATG, used as the cooperative class if in the [0,30] bp
distance. Logo A corresponds to the cooperative class and displays the known 5'-AAATT core, with total information content 13.5 bits. Logo D is the exact same logo as A but with a single base-pair of flanking sequence at the start and end of the site (hence, this logo starts at position -1). Position 9 of this logo shows about two decibits of information relative to the background sequence in the nucleotide base 'C' (2 out of 10 functional DC sites have a 'C' at this position). Logo B is the 'uncooperative' class for the [0,30] bp window, which we calculated to have 9.1 bits of information relative to the background (uniform distribution of bases), and logo E has the flanking sites added to the 'uncooperative' class. Logo C is the CB motif with 9.6 bits of information relative to the background, which looks similar to the 'uncooperative' class at position 6 because many more sites prefer A to T at this position amongst all the Dorsal sites in the network. Logo F is the CB motif with the flanking sequence appended.

Figure 2.2: Histogram of p-values of a ranksum test of random partitions of the combined dataset $D_{CB}$. The binning is in units of $10\log_{10}$ of the p-value, rounded to the nearest integer. The p-value of the ranksum test between the DC and DU energy datasets based on their energy PWMs was 260 in log base ten units (scaled by 10), which is indicated by the red bar of arbitrary height.

Figure 2.3: ROC and Information. (A) False positive rate (FPR) vs. True Positive Rate (TPR) when varying the energy cutoff $E_c$. (B) shows the mutual information $I(I;O)$, Eq. (2.6.3), between the input and output of the detectors as a function of the cutoff energy.

Figure 2.4: Mutual information $I(C;P)$ between the actual classes $C$ and the predicted classes $P$ for Detectors DC and DU as a function of the threshold energy $E_c$ that is defined by each detector's conditional energy, Equation (2.15).

Figure 2.5: Logos generated for known Dorsal sites tested for adjacency to 5'-CACATG, used as the 'cooperative' class (DC) if in the [0,30] bp distance.
Logo A corresponds to the 'Dorsal Cooperative' class; we calculated its total information content at 13.4 bits. Logo D is the exact same logo as A but with one base-pair of flanking sequence appended onto the start and end of the site (hence, this logo starts at position -1). Position 9 of this logo shows about a couple of decibits of information relative to the background sequence, and position -1 contains a half bit of information. Logo B is the 'Dorsal Uncooperative' class for the [0,30] bp window, which we calculated to have 9.4 bits of information relative to the background (uniform distribution of bases), and logo E has the flanking sites added to the 'Dorsal Uncooperative' class. Logo C is the CB motif with 9.7 bits of information relative to the background, which looks similar to the 'Dorsal Uncooperative' class at position 6 because many more sites prefer A to T at this position amongst all the Dorsal sites in the network. Logo F is the CB motif with the flanking sequence appended.

Figure 2.6: ROC curves display the False positive rate (FPR) vs. True Positive Rate (TPR).

Figure 2.7: The mutual information $I(C;P)$ between the conditional detector's predictions of class types (distal or proximal) and the known class types, as the detection energy threshold is varied. DC shows about 0.3 bits of information at an energy of about 4. DU's performance suggests it is not much better than random guessing at predicting class types.

Figure 2.8: Histogram of p-values of random partitions of the combined dataset $D_{CB}$, where the histogram bins were in units of 10*log(p-value).
The p-value of the ranksum test for DC and DU median energies was about 205 in the scaled log units, which is the bar at the far left of the sampling distribution.

Figure 2.9: The estimate of the error, $dh(f_i) = \sigma_{h(f_i)}$, is plotted on the vertical axis for the case where the number of counts of event $i$ occurs at its expected value. The horizontal axis is the expected number of counts observed for event $i$, equal to $pN$.

Figure 2.10: The probability distribution denoted as $p$ was estimated from $N$ random deviates of a 'known' length 9 PWM that was built from $D_{CB}$ data. The entropy of $p$, $H[p] = -\sum_{i=1}^{9} \sum_{B} p(i,B) \log p(i,B)$, where $i$ runs over the nine positions of the aligned $N$ sequence deviates and $B$ runs over the bases, was computed for twenty replicates for each value of $N$, and the average entropy over the twenty replicates was plotted as a function of $N$. We computed the functional $H$ as a function of $N$ for values of its parameter in the set $\{10^{-5}, 0.1, 0.2, 0.25, 0.5, 1\}$. The 'known' CB PWM had an entropy of 5.6 bits, shown by the green horizontal line, and we found the empirical value of the parameter that best estimated this 'known' entropy to be 0.1, as shown by the red plot of the functional $H$ as a function of $N$. We similarly repeated this for the functional energy estimates, and found the least biased value of the parameter to be 0.1 for both entropy and energy estimates.

Figure 3.1: Widom, Segal, Nature Review Genetics; "motif" denotes path through the PWM [110].

Figure 3.2: The toy CRM sequence acggt is annotated at each of its loci (positions) to denote the configuration vector in row major ordering (the first five bits are Dorsal's row, the second five bits are Twist's row, etc.). In the language of HMMs, the value of a configuration vector reveals the hidden state of the sequence. Here there are 5 states, where the 'silent' state indicates that a transcription factor is bound to an upstream position of the sequence, causing the locus to be covered by an internal position of the planted factor.

Figure 3.3: Changing the spacing between motifs in modules changes the span of cells that are expressed. In this Figure the spacing between two sites was adjusted by Natural Selection in orthologs of the rhomboid gene's
CRM. When the ortholog CRMs were transgenically inserted into the mel species, the width of the expression pattern changes relative to the endogenous pattern width, suggesting that cooperativity between the sites is a function of the distance between sites that can be used by evolution to 'fine tune' the expression patterns in development. When each sequence or module is expressed in its respective species (lineage), the relative widths ($w/L$, where $L$ is the length of the major axis of the species' embryo and $w$ is the width of the tissue in nanometers that expresses the gene) of the tissues are the same. Different species' embryos have different sizes, hence there is a scaling law: how a characteristic (such as gene expression) changes with body size. This Figure is from Erives Crocker et al. 2008, "Evolution acts on enhancer organization to fine-tune gradient threshold readouts," PLoS Biology.

Figure 3.4: Diagram of the components of the fractional occupancy model, along with the parameters to be fit through expression data (estimates of the mRNA output of the gene bound by the BTA and estimates of input morphogen concentrations).

Figure 3.5: The legend is in the upper right corner of the table, denoting the observed profile ($E_t(z)$) as red, the model predictions as green along with the header above each profile denoting the CRM (gene target) in green, and the Dorsal morphogen profile ($E_{Dl}(z)$) as a dotted blue curve.

Figure 3.6: The legend is in the upper right corner of the table, denoting the observed profile ($E_t(z)$) as red, the model predictions as green along with the header above each profile denoting the CRM (gene target) in green, and the Dorsal morphogen profile ($E_{Dl}(z)$) as a dotted blue curve. The correlation coefficient between the observed pattern and the predicted pattern is denoted as CC for each gene (which is at most one), and the squared error between the observed pattern and the predicted pattern is denoted as SE for each gene. Each gene had 20 positions, $z$, along the DV axis, which as always (in the DV literature) is plotted such that ventral is at the zero position.

Figure 3.7: The CRMs and predicted binding sites from MPA for default parameters on all proteins. Dorsal is annotated blue, Twist green. The
column d+ denotes added noise to the Dorsal concentration (which was zero in this case, hence d+ should not be there). A bug in the printing code caused one of the vnvir sites to not appear in the CRM sequence highlighting.

Figure 3.8: Here the target gene is denoted in the left column, and the cell along the DV axis is denoted in the second column. Twist sites are annotated green, Dorsal blue, Snail red, and brown denotes overlaps. The sites annotated at the mesoderm bottom border were used to annotate the sequence. For example, the first gene is rhomel, for rhomboid in the species melanogaster.

Figure 3.9: Here the target gene is denoted in the left column, the second column contains d+ to denote an increase (+) in the Dorsal (d) gradient along the DV axis, and the cell along the DV axis used for extracting concentrations for the annotation model MAP is denoted in the third column. Twist sites are annotated green, Dorsal blue, Snail red, and brown denotes overlaps.

Figure 3.10: The green gradient represents the concentration of Bicoid and the orange nuclei denote Hb expression, while blue nuclei represent no Hb expression. This was modified from Figure 1 of [125], and Alon's text [5].

Chapter 1

1.1 Mathematical Physics Introduction

This dissertation is a description of transcription factor binding that allows for prediction of the behavior of the Dorsal-Ventral patterning gene regulatory network of Drosophila early development. Although Drosophila, the fruit fly, is much simpler than a human, its molecular biology contains many similar, to almost identical, mechanisms for controlling genes, and hence it is in the premier league of model systems for understanding how human genes are molecularly controlled.

Genes, in a broad sense, are the particles that are passed on from parents to progeny and that contain the information about the characteristics of the parent. The characteristics of the parents are their 'traits', like eye color, susceptibility to type II diabetes, or fecundity (a proxy for 'fitness'). Hence knowledge of one's genes, or of the genes of a population of people (a gene pool), allows one to make predictions about what traits will exist in future generations.
The 'traits' of interest in this dissertation are not at the level of the adult, or even at the level of a recognizable animal. They are at the cellular level, where 'development' builds an adult by 'developing' different cell types that are arranged together to form an adult body plan. The initial steps turn a totipotent cell (the zygote) into a ball of thousands of cells, the embryo, where each cell has its very own genome and becomes fated to be the brain, the heart, etc. of the fly through gene regulation.

The aspect of gene regulation that I focus on is at the level of transcription, the first step in the 'central dogma', where DNA is copied to an RNA sequence. It is the control of this step that we will consider 'regulation', where control is in the sense of how many RNA 'transcripts' should be produced.

The reason we focus on transcription and on development is that 'development' produces, in tandem with the production of RNA, a set of controlled experiments. For example, the embryo produces sets of cells at gross anatomical positions, 'regions', where at each region the cells are all under the same conditions. Gross positions are like a binary north and south, top and back, or dorsal and ventral. Each of these regions provides a controlled experiment, allowing us to take advantage of the trusted reproducibility of animal development [55]. Hence each region in the embryo is under a strict set of controls, namely the physiological environment unique to that region (e.g. high or low doses of proteins and other factors), just as a scientist would set up a petri dish full of cells under similar conditions in order to reliably cause gene expression. In a sense a colony of bacteria in a petri dish behaves similarly to the embryonic regions that fate certain tissue types. Transcription allows us to observe the particles emitted from a cell or from the genome: the transcripts or the proteins. It is these particles that allow for deciphering the mechanism of control of genes.
Maternally controlled molecules are controlled in the sense that they are positioned at different locations of the embryonic shell, or are actively transported to the different regions of the embryo, where they eventually control specific genes in the genomes (in the cells) that reside, or will reside, in that region. In early fly development the cells of the embryo, each with their own genome, form a monolayer around a spherical yolk, like corn on the cob. By observing both the 'morphogens' (transcription factors) and the emitted particles from the transcription factor's target genes, the mRNA and their translated proteins, we can decipher what is occurring at the gene regulatory level. The obvious inference is that the input molecules must somehow pass a message to the gene that is being controlled. The mechanism of gene regulation that I study is one where the input molecule 'binds' to a location-specific segment of DNA, such as the DNA sequence 5'-GGAAAATCC, thereby passing a message to the sequence of the DNA 'binding site'.

Chemically, the message may begin with the bound protein adding a chemical motif to a protein that was already bound to or wrapped up in the DNA, like a 'histone'. This additional chemical motif may set in motion a cascade of further steps that ultimately lead to a clear modification of transcription levels.

The easiest message to observe is 'turn on' or 'turn off', which is seen through transcription, because we can easily observe proteins and RNA through common lab techniques. But broadly, regulation of genes by the 'binding' process means controlling the inheritable sequences of the docking site¹.

How the morphogen binds to a location-specific position of the genome, i.e. how it finds the gene of interest, seems to be a complicated process. For example, random binding or sampling of sites in the genome, in the time allocated during development, would not allow the morphogen sufficient time to find the target, even with the mass action of multiple morphogens.
To help understand how the binding process works, a central problem is understanding how the protein recognizes a specific location in the genome. By better understanding how proteins or the morphogens recognize specific binding sites within the genome, problems such as the diffusion of maternal molecules or the active transport of the morphogens to a specific location in the genome may be better understood. In a broader sense, any cellular message that requires the modulation of transcription levels will be better elucidated by a well formulated understanding of the protein-DNA interactions, the recognition problem.

¹ The controlled sequence may not be a dogmatic 'gene' that encodes for a protein; it could be anything that is useful for the organism and is therefore selected by evolution. In this sense, the binding site itself is like a gene if it is under selection. Hence the molecular phenotype 'to bind' is expressed by the binding site sequence, the genotype.

A central assumption in this work is that the recognition is encoded in the DNA. Hence the binding sites are more than just a surface upon which the protein deposits. Just as vapor deposition can be controlled by placing specific types of high affinity surfaces mixed with low affinity surfaces, so too one could imagine the protein binding to regions of the genome solely due to high affinity surfaces that consist of material that is not DNA, such as binding to a histone or histone tails (histones are proteins always present that occupy a large fraction of the genome), or binding to other proteins that are already bound to the genome. I am solely interested in protein-DNA binding, and hence the differential affinity of the surface is only of interest for regions of the genome that have the DNA exposed and that support specific protein-DNA binding under selection (i.e. functional DNA).

The natural mathematical physics framework in which to discuss this problem is the deposition of particles onto a one dimensional lattice, a representation of the genome (see T. Hill's chapter on lattices for a general introduction to lattice statistics [63]). Hence, I will introduce the mathematical physics necessary for performing calculations of binding energies and occupation numbers of lattice sites. This machinery is very general, and is not specific to the recognition problem.
Once I have presented how 'binding' is represented, I will introduce k-mers and the recognition problem, where we will account for the specific binding to ordered arrays of bases, that is, sequences of DNA. This leads to applications in bioinformatic sequence alignment, where known binding energies to specific sequences of DNA can be used in a computational search for unannotated binding sites (i.e. sites not in a database) within a sequenced genome, and to the inverse problem, where known sequences of binding sites can be used to infer the binding energy. In particular, this dissertation will introduce a 'mixture model', where a mix of binding sites for the same factor, Dorsal, is used for prediction of unknown sites. This is also used to explore the possibility of epistasis and physical cooperativity between Dorsal and co-occurring Twist binding sites. This interaction is encoded in Dorsal binding sites, where the cooperatively encoded binding sites form one component of the 'mixture model'. Given the foundation of protein-DNA recognition, Chapter 3 then considers further the prediction of unknown sites for the trio Dorsal, Twist, and Snail, three of the dominant 'morphogens' or transcription factors in Dorsal-Ventral patterning in early development of Drosophila. Here I explore the possibility that recognition is a function not only of the preferred k-mer sequence for a given factor, but also of the location in the embryo (such as the neuroectoderm or mesoderm). Different locations contain different concentrations of these factors, and therefore the factors' recognition of specific sequences of DNA in those regions of the embryo is modulated, which is manifested by the differential expression of their target genes. Here I also explore the prediction of unknown sites as a function of the spacer between co-occurring k-mer sites. This is important in Drosophila, where it has been extensively documented as the primary mechanism of 'repression' utilized by Snail, a so-called short range repressor factor that turns off genes that would otherwise be activated by the activator transcription factor Dorsal, which is in high concentrations in the ventral location of the embryo. Furthermore, this function that predicts binding sites as a function of the spacer is also explored in terms of the cooperativity between Dorsal and
Twist factors, both known activators, that are known to act synergistically when their k-mer binding sites co-occur within a specific 'window' of spacer values (e.g. the sites must be about 2 to 30 base-pairs from each other).

The occupancy of the DV network of enhancers by these factors is calculated as a subproblem in the optimization of a gene expression model for the DV network of genes that is location specific within the embryo (thereby accounting for location-specific concentrations). The input to the model is a two-dimensional profile of the concentration of the transcription factors along the Dorsal-Ventral position axis of the embryo, along with the corresponding profiles for the target genes whose mRNA expression is modulated due to regulation by the factor binding. The binding is accounted for by additionally inputting the cis-Regulatory Modules, or sequences of DNA near the target genes where the factors are known to bind. The model contains unknown constants that are fit by minimizing a squared-error objective function $F$ (equivalent to minimizing the root mean square error), where $F = \sum_i \sum_j (M_{ij} - O_{ij})^2$. Here $M$ is the model output for a given evaluation of the objective function (i.e. for given values of the parameters) and $O$ is the observed data, where $i$ runs over the positions of the Dorsal-Ventral axis, and each $j$ is a particular gene in the network. In addition, the objective function has an option of running a multi-objective fit that adds an additional objective to fit the occupancies of the factors to ChIP-Seq or ChIP-chip data for the trans-factors. Hence in the multi-objective case one has $F = \sum_i \sum_j (M_{ij} - O_{ij})^2 + \sum_k \sum_l (\mathrm{occ}_{kl} - I_{kl})^2$, where the new term calculates the model occupancy $\mathrm{occ}_{kl}$ for the genomic segment $k$ for transcription factor $l$, which was observed with intensity $I_{kl}$.
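The two objective functions described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the fitting code used in this work: the function names, the nested-list layout (rows are DV positions or genomic segments, columns are genes or factors), and the toy numbers are all hypothetical.

```python
def sum_squared_error(predicted, observed):
    """Return the sum over all rows and columns of (predicted - observed)^2."""
    return sum(
        (p - o) ** 2
        for p_row, o_row in zip(predicted, observed)
        for p, o in zip(p_row, o_row)
    )


def objective(model_expr, observed_expr, model_occ=None, chip_intensity=None):
    """Expression-only objective F; adds the occupancy term when binding data are given."""
    f = sum_squared_error(model_expr, observed_expr)
    if model_occ is not None and chip_intensity is not None:
        f += sum_squared_error(model_occ, chip_intensity)
    return f


# Toy expression data (hypothetical): 2 DV positions x 2 genes.
M = [[0.9, 0.1], [0.5, 0.4]]
O = [[1.0, 0.0], [0.5, 0.5]]
# Toy occupancy data (hypothetical): 2 genomic segments x 1 factor.
occ = [[0.8], [0.2]]
I = [[1.0], [0.0]]

F_single = objective(M, O)          # expression term only
F_multi = objective(M, O, occ, I)   # expression term plus occupancy term
```

In this sketch the two terms are simply added, as in the equation in the text; whether the expression and occupancy terms should carry relative weights is left open here.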
1.1.1 Phase space, a real vector space

Classically one can calculate the statistical properties of many body systems, such as a gas with Avogadro's number of particles, by transforming Newton's second order differential equation for each particle into two first order differential equations for each particle, and then, upon solving the equations of motion, one can calculate time averages of quantities of interest. Another approach is through Hamilton's principle, which is a principle of parsimony, dictating that motion follows a path of 'least action' (the shortest path, where 'shortest' has a special formulation). This approach also consists of two first order differential equations for each positional degree of freedom, and consists of constructing the 'Hamiltonian' of the system, which is effectively the total energy of the system (at least for systems that we are interested in). For example, for an isolated one dimensional system (such as a stretched out genome) of $M$ particles (or $M$ genomic units) one has:

$$H = \sum_{i}^{M} \frac{P_i^2}{2 m_i} + U(X_1, X_2, \ldots, X_M) \tag{1.1}$$

$H$ is the Hamiltonian, $i$ indicates the particle label, $P$ is the momentum², and $U$ is the potential energy of the system due to the interactions of the particles, which is a function of each particle's location $X$.

The joint distribution of the $2M$ random variables for the isolated system describes the occupancy of one point in phase space at any particular instant, $(X_1 = x_1, X_2 = x_2, \ldots, X_M = x_M, P_1 = p_1, P_2 = p_2, \ldots, P_M = p_M; t = 0)$. Given that some time $t$ has elapsed, the distribution will be found to occupy some other point, $(X_1 = x_1(t), X_2 = x_2(t), \ldots, X_M = x_M(t), P_1 = p_1(t), P_2 = p_2(t), \ldots, P_M = p_M(t))$; this is a Dirac delta distribution for each particle's position and momentum [123]. Hence this is a real vector space in $\mathbb{R}^{2M}$. This construction is an example of the microcanonical ensemble. By loosening the isolation constraint (i.e. allowing the energy of the system to vary), we have the canonical ensemble, which has nonzero variance for many of the random variables.
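To make Eq. (1.1) concrete, the following Python sketch evaluates the Hamiltonian at a single phase space point of a small one dimensional chain. The harmonic nearest-neighbor potential, its `coupling` and `rest_length` parameters, and the numerical values are assumptions chosen only for illustration; the text does not specify a form for $U$.

```python
def nearest_neighbor_potential(positions, coupling=1.0, rest_length=1.0):
    """Hypothetical U: harmonic springs between neighboring units of the chain."""
    return sum(
        0.5 * coupling * ((x_next - x) - rest_length) ** 2
        for x, x_next in zip(positions, positions[1:])
    )


def hamiltonian(momenta, masses, positions, potential):
    """Eq. (1.1): kinetic energy summed over particles plus U(X_1, ..., X_M)."""
    kinetic = sum(p ** 2 / (2.0 * m) for p, m in zip(momenta, masses))
    return kinetic + potential(positions)


# One point in the 2M-dimensional phase space for M = 3 units (toy values).
momenta = [1.0, 2.0, 0.0]
masses = [1.0, 1.0, 2.0]
positions = [0.0, 1.0, 2.5]

H = hamiltonian(momenta, masses, positions, nearest_neighbor_potential)
```

Each choice of the $2M$ numbers in `momenta` and `positions` is one phase space point, and `hamiltonian` assigns it an energy, which is what the ensembles described above are built on.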
If we label our phase space points with an index $i$, then the occupation of point $i$ in phase space, $n_i$, can be normalized by the occupation of all states (points in phase space), and we would find:

$$\frac{n_i}{n} = \frac{\exp\left(-\frac{H_i}{kT}\right)}{Q} \tag{1.2}$$

² We use capital letters for the momenta and coordinates, since we will think of those as random variables, where we can translate Hamilton's deterministic equations to the Kolmogorov-Chapman equations, leading to a deterministic equation of motion for the joint distribution, a master equation; for example see page 10 of Van Kampen [123].

Here $H_i$ is the Hamiltonian evaluated at the phase space point $i$ (just plug the corresponding positions and momenta of each particle for that state into the Hamiltonian). The number of phase space points is determined by how well we can resolve our subspaces for each particle. For example, the configuration space, the vector space over the position coordinates $X$, would be meshed no finer than our ability to resolve different distances. Here, $Q$ is the partition function, which can be shown by the maximum entropy principle to equal:

$$Q = \sum_i \exp\left(-\frac{H_i}{kT}\right) = \sum_i \exp\left(-\frac{\sum_{j}^{M} \frac{P_j(i)^2}{2 m_j} + U(X_1(i), X_2(i), \ldots, X_M(i))}{kT}\right) \tag{1.3}$$

By using the size of a unit of the genomic biopolymer as the unit of our length scale for each particle's position subspace, we can mesh out phase space's 'configuration space' such that each base at each position along the polymer chain occupies a given location (mesh point) in 'configuration space'. The method used to mesh out momentum space is irrelevant, since we are about to 'project out' or 'marginalize out' that portion of phase space.

By asserting that the biopolymer is stretched out, and that its length and hence its number of 'units' is fixed, we can reduce configuration space to contain $M$ mesh points (one for each unit of the polymer). This is a 'one dimensional' lattice, where the dimensionality is now in reference to the fact that each unit of the polymer lies along a linear array, like a thin spaghetti noodle that is still solid and has been notched for each unit of the polymer.
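Equations (1.2) and (1.3) amount to normalizing Boltzmann weights over a discrete set of states. The sketch below assumes the states have already been enumerated and their energies $H_i$ tabulated in units where $kT$ is supplied explicitly; the function names and the two-state example are hypothetical.

```python
import math


def partition_function(energies, kT=1.0):
    """Eq. (1.3): Q = sum_i exp(-H_i / kT) over the enumerated states."""
    return sum(math.exp(-e / kT) for e in energies)


def boltzmann_occupancies(energies, kT=1.0):
    """Eq. (1.2): n_i / n = exp(-H_i / kT) / Q for every state i."""
    # Subtracting the minimum energy avoids overflow; the shift cancels in the ratio.
    shift = min(energies)
    weights = [math.exp(-(e - shift) / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


# Two states separated by kT * ln(2): the lower state is twice as occupied.
energies = [0.0, math.log(2.0)]
Q = partition_function(energies)
probs = boltzmann_occupancies(energies)
```

With these toy energies the weights are 1 and 1/2, so $Q = 3/2$ and the occupancies are 2/3 and 1/3.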
We are interested in 'binding', where a distinct species, like a type of transcription factor, 'binds' to the lattice. This requires more complexity, as we now must introduce more particles than the original $M$ units of the polymer. Before we introduce more particles, we will first get rid of the momenta. We define a reference system with no interactions ($U = 0$), which has a corresponding partition function $Q_o$; we would then expect that for small interactions ($U$ is small) the velocity distribution (the Maxwell-Boltzmann distribution) would effectively remain invariant. Hence we have an effective partition function, $q$, as:

$$q \equiv \frac{Q}{Q_o} \approx \sum_i \exp\left(-\frac{U(X_1(i), X_2(i), \ldots, X_M(i))}{kT}\right) \tag{1.4}$$

Classically this is called the configuration integral. It is effectively a 'projection' or marginalization over the momentum coordinates.

1.1.2 Thermodynamics

Our interest is not in isolated physical systems, such as the $M$ particle system above; rather we are interested in closed systems (energy is exchangeable, but particles are not exchangeable) and open systems (energy is exchangeable, and particles are exchangeable). The Victorian founders of the field of open and closed systems, of thermodynamics, were contemporaries of Charles Darwin, which is very fitting seeing that their insights into heat and particle exchange (a form of work) supply the essential mechanics to describe biological systems.

Conservation of energy for isolated systems is a consequence of Newton's second law. A more interesting statement is that the energy of a closed system in equilibrium cannot spontaneously change; this is the statement of the first law of thermodynamics. Hence we must do work on the system or heat the system to change its internal energy (footnote 3).

$$dU = q - w \tag{1.5}$$

$$dU = T\,dS - P\,dV$$

Footnote 3: The Clausius convention for the form of the first law: $dU = q - w$ = heat supplied to the system ($q$) minus work done by the system ($w$).

Here $U$ is the internal energy; the internal energy of the system is the Hamiltonian, $H$, hence $H = U$. In (1.5) $q$ is the heat ($q$ has nothing to do with the partition function, which uses the same symbol), and $w$ is the work (the work also accounts for particle exchange processes).
The system above only has one type of particle; for systems that allow a transcription factor to bind to DNA segments it is necessary to introduce multiple particle species (the protein and the DNA). By Gibbs' phase rule, we know that for a binary system we will need four independent variables, one of which is intensive. Hence we have many options available for constructing binary particle ensembles. For a species that can vary in particle number the natural variable is the chemical potential, while for a species that is closed the natural variable is just the particle number. Hence, for a binary system composed of $M$ components of $S'$ (sequences) and $N$ components of $P'$ (protein), we could define the open-open system with the coordinates $(\mu_{S'}, \mu_{P'}, V, T)$, while the closed-closed system has the following form:

$$dU = T\,dS + \mu_{S'}\,dM + \mu_{P'}\,dN - P\,dV \tag{1.6}$$

We will think of binding sites interacting with proteins as solute-solute interactions that occur in a solvent. The nucleoplasm represents a very complicated solvent, a mixture in the liquid phase of all sorts of complicated biopolymers, water molecules and the many other inorganic substances in a biological nucleus. Of course, the protein that must find a specific binding site within the genome must compete with all the other molecules in the nucleus for occupying any point in the spatial grid of the nucleus. However, my aim is to represent and describe the recognition of a particular binding site by a protein, hence I would like to make progress specifically on the recognition problem without being hampered by the solvent properties. Hence, I will introduce an argument found in Hill's text [62], and also discussed by Landau [75], the so-called 'dilute limit'. This will allow us to rigorously account for the fact that the protein and DNA are embedded in a physiological solvent while, to a large degree, hiding the solvent molecular details and the solvent-solute details by introducing an effective molecular partition function $q$, which will be similar to Eq. 1.4 in two ways: first because the momentum will not be of interest, and second because we will be taking the ratio of two types of systems to define the 'effective' partition function. However, $q$ is different from Eq. 1.4 in that a molecular partition function is about one type of particle (one molecule), whereas for large systems, like a gas of $n$ identical and distinguishable molecules, one commonly denotes the partition function as $Q = q^n$; hence lower case denotes 'one molecule', and upper case denotes large systems. By defining the effective partition function we will then proceed to the relevant problem of the solute-solute (protein-binding site) interaction, where the physiological environment (solvent) will be accounted for in both the DNA and the protein by an effective partition function for each solute (e.g. $q_{DNA}$ and $q_{pro}$).

Solvent-solute, dilute limit. The solvent-solute system is just a binary system. Here we will follow Hill's approach (see chapter 1 of [62]), which is to start by organizing the liquid solution with the two-component grand canonical system (footnote 4):

$$\exp\left(\frac{PV}{kT}\right) = \Xi(\mu_S, \mu_P, V, T) = \sum_{N_1} \sum_{N_2} Q(N_1, N_2, V, T) \exp\left(\frac{N_1 \mu_1 + N_2 \mu_2}{kT}\right) \tag{1.7}$$

Here we have relabelled $S$ as 1 and $P$ as 2, and denoted the two-component canonical ensemble as $Q(N_1, N_2, V, T)$ (footnote 5). Now we know the average solute particle number:

$$\bar{N}_2 = \sum_{n_2} n_2 P(n_2) = \lambda_2 \frac{\partial \log \Xi}{\partial \lambda_2} \tag{1.8}$$

By taking the derivative with respect to the absolute activity of the solute, $\lambda_2 = \exp\left(\mu_2 / kT\right)$, we can calculate the average number of solute particles. To take this derivative, notice that in the dilute limit of the solute, all but the leading terms of the summation over the solute number may be neglected, since the solute activity approaches zero. We define the pure solvent grand partition function as $\Xi_0 = \sum_{N_1} Q(N_1, 0, V, T) \lambda_1^{N_1}$, and $\Xi_1 = \sum_{N_1} Q(N_1, 1, V, T) \lambda_1^{N_1}$ as the grand partition function for the solvent with one solute embedded in it (footnote 6). Then the derivative will appear as:

$$\lambda_2 \frac{\partial \log \Xi}{\partial \lambda_2} = \frac{\Xi_1 \lambda_2 + 2\,\Xi_2 \lambda_2^2 + 3\,\Xi_3 \lambda_2^3 + \ldots}{\Xi_0 + \Xi_1 \lambda_2 + \Xi_2 \lambda_2^2 + \Xi_3 \lambda_2^3 + \ldots} \tag{1.9}$$

Footnote 4: The grand canonical ensemble is related to the thermodynamic grand potential by the Legendre transform of Eq. 1.6, which exchanges the particle numbers $M$ and $N$ for the chemical potentials $\mu_{S'}$ and $\mu_{P'}$ as independent variables.

Footnote 5: In the case of non-interacting solvent and solute, $Q(N_1, N_2, V, T)$ factorizes into two 'ideal gas' ensembles, namely $Q(N_1, N_2, V, T) = \frac{Q(N_1, V, T)^{N_1}}{N_1!} \frac{Q(N_2, V, T)^{N_2}}{N_2!}$, while for interacting solvent and solute the interaction energy is contained in the potential energy of the Hamiltonian, and hence $Q(N_1, N_2, V, T)$ cannot be factorized. Regardless of this interaction we can proceed to the 'recognition problem' and the construction of an effective partition function, being aware that an explicit representation of the solvent-solute interaction would require writing a potential between the solvent and solute in the Hamiltonian of Eq. 1.3, which are details we wish to hide.

Footnote 6: Note that $\Xi_1$ will possess a solute-solvent interaction inside of $Q(N_1, 1, V, T)$ only if the solute interacts with the solvent; otherwise $Q(N_1, 1, V, T)$ is simply $Q(1, 0, V, T)^{N_1} Q(0, 1, V, T)$ for the case that the solvent particles are identical and distinguishable.

As $\lambda_2 \to +0$, we find that:

$$\bar{N}_2 = \frac{\Xi_1}{\Xi_0} \lambda_2 \tag{1.10}$$

Hence, just as in Eq. 1.4 where we defined the effective partition function, $q$, the configuration integral, here again we will define an effective partition function, now for a solution system. For a solute immersed in a solvent we define the effective partition function for the solute as:

$$q(N, V, T) \equiv \frac{\Xi_1(\mu_1, N_2 = 1, V, T)}{\Xi_0(\mu_1, N_2 = 0, V, T)} \tag{1.11}$$

We can approximate the grand canonical partition functions on the right-hand side of the above equation by using the maximum term of each partition function:

$$q = \frac{\Xi_1(\mu_1, N_2 = 1, V, T)}{\Xi_0(\mu_1, N_2 = 0, V, T)} \approx \frac{Q_m(\bar{N}, 1, V, T)}{Q_m(\bar{N}', 0, V, T)} \lambda_1^{(\bar{N} - \bar{N}')} \tag{1.12}$$

Here, $Q_m(\bar{N}, 1, V, T)$ is the canonical partition function for one solute in a box of size $V$ of solvent particles, where the subscript $m$ indicates we have taken the largest term of the grand canonical ensemble. The maximum term's particle number occurs at the expected particle number $\bar{N}$, due to the Gaussian property that the maximum occurs at the expected value ($\bar{N}'$ is a slightly different maximum for the constant volume case of no solute molecules, because the now evacuated space leaves more room for solvent to fill). The average solvent particle numbers are nuisance variables that we can, in a sense, transform out of the effective partition function. Hence instead of using the two-component grand canonical system we will use the two-component system with fixed pressure and fixed solvent particle number (footnote 7). Hence the new partition function is

$$\Xi'(N_1, \mu_2, p, T) = \sum_{N_2} \sum_{V} Q(N_1, N_2, V, T) \exp\left(-\frac{pV}{kT}\right) \exp\left(\frac{N_2 \mu_2}{kT}\right), \tag{1.13}$$

where the summation over the volume only works for discrete physical spaces, such as lattices (see Eq. 2.23 of Hill [62]). Now if we repeat the exact steps above from Eq. 1.8 to Eq. 1.12 using the new partition function $\Xi'$, we will arrive at:

$$q = \frac{Q_m(N_1, 1, V_m, T)}{Q_m(N_1, 0, V'_m, T)} \exp\left(-\frac{p\,(V_m - V'_m)}{kT}\right), \tag{1.14}$$

where again we have used the maximum term method, isolating the canonical ensembles that have the largest probability in the partition functions $\Xi'_0$ and $\Xi'_1$. For example, the pure solvent partition function in this case is $\Xi'_0 = \sum_V Q(N_1, 0, V, T)\, e^{-pV/kT}$; we denote the volume from the maximum term in this summation as $V'_m$, and similarly $V_m$ is the volume of the maximum term in the partition function (which sums over volumes) of the solvent with one solute particle present (footnote 8).

Footnote 7: For example, imagine a beaker filled with a fixed number of liquid water molecules; if we drop a rock (one solute molecule) inside the beaker the water will rise (i.e. the volume will change). This is unlike the grand canonical case, where we have to dispose of the excess water molecules displaced by the rock (those that don't fit in the volume $V$), since we must keep the volume fixed. (Albeit this could be done by allowing the volume of interest to be the beaker filled to the brim, so that any excess water molecules simply flow over the rim; but we're interested in a simple mathematical tool, not an experimental design.)

Footnote 8: We can think of $p\,(V_m - V'_m)$ as the mechanical work $w$ done by inserting a solute into a solvent, namely $w = P \Delta V$; for example see 1.1 of Hill [62]. Furthermore, noting that the Helmholtz free energy ($A$) is the free energy of the canonical partition function, $Q = \exp(-A/kT)$, we see that $q = \exp(-\beta(\Delta A + p \Delta V)) = \exp(-\beta \Delta G)$.

1.1.2.1 Derivation of the chemical potential: $\mu = \mu^o + \ln(c)$

On page 6 of T. Hill's text [62], he derives the dilute limit formula for the chemical potential, $\mu = \mu^o + \ln(c)$, using a vacuum as a solvent. After his argument he points out that the derivation he presented could take, and does take, different forms (depending on the text and purposes; see footnote 9). However, he states that for a solute-solvent system (a biological system) this approach, which I'll now recapitulate, is necessary (if you want to keep things simple).
In Eq. 1.10 we have the absolute activity of the solute, which by definition is $\lambda_2 = \exp\left(\mu_2 / kT\right)$; hence using the definition of the activity along with Eq. 1.10, the average solute particle number, we can derive the standard formula for the thermodynamic chemical potential. First we must define the concentration $c$ (particle density) as:

$$c = \frac{\bar{N}_2}{V}, \tag{1.15}$$

Now dividing both sides of Eq. 1.10 by $V$, we can rewrite Eq. 1.10 in terms of the density, which results in:

$$c = \frac{\bar{N}_2}{V} = \frac{\lambda_2}{V} \frac{\Xi_1}{\Xi_0}. \tag{1.16}$$

Now plugging in $\exp\left(\mu_2 / kT\right)$ for $\lambda_2$, and taking the logarithm of both sides of Eq. 1.16, results in:

$$\log c = \log\left[\frac{\exp\left(\mu_2 / kT\right)}{V} \frac{\Xi_1}{\Xi_0}\right]. \tag{1.17}$$

Now let us define $q$ based on the right side of Eq. 1.14; then if we rewrite the right side of Eq. 1.17 as two terms we have $\log c = \frac{\mu_2}{kT} + \log\frac{q}{V}$. Upon rearranging terms, multiplying through by the thermal energy $kT$, and writing $-kT \log\frac{q}{V} = \mu^o$ (which acts as a reference or a standard state; see footnote 10), we have our desired result:

$$\mu = \mu^o + kT \ln(c), \tag{1.18}$$

Footnote 9: For example, given that $dG = V\,dp - S\,dT$, then for a constant temperature ideal gas system (a solvent-solute system with vacuum as the solvent) we have $G = G^o + \int V\,dp = G^o + \int \frac{RT}{p}\,dp$, where we have used the gas constant $R$ in the ideal gas law in the last expression. This results in $G = G^o + RT \ln\frac{p}{p^o}$, where $p^o$ is the reference (standard state) of one bar, which is usually omitted. Furthermore, using $G = N\mu$, we can divide the equation for $dG$ through by $N$, and after integration this results in $\mu = \mu^o + kT \ln p$. Using the ideal gas law ($p = ckT$), and keeping in mind that we omitted the reference pressure, we can rewrite this in terms of a reference concentration of one particle per unit volume, resulting in $\mu = \mu^o + kT \ln c$.

1.1.2.2 Nucleoplasm genome ligand binding problem

The binding site is the main component of our physical system; we will let the number of binding sites be fixed in the genome (i.e. the system is closed with respect to the number of binding sites). Let $M$ be the number of binding sites in the genome, each site being of the same energy. Let the system be open with respect to factor binding. Hence, each particular locus (each site) is not just either bound or not bound; rather it will have an occupancy. In equilibrium, we can define the equilibrium binding constant as a function of the concentrations of the components of the system.
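Returning briefly to the dilute-limit result of Eq. 1.18, the short sketch below checks the algebra numerically: starting from a chemical potential, it computes the concentration through Eq. 1.16 (with the ratio $\Xi_1/\Xi_0$ replaced by the effective partition function $q$) and then recovers the same chemical potential from $\mu = \mu^o + kT \ln c$ with $\mu^o = -kT \log(q/V)$. The function names and all numerical values are illustrative assumptions.

```python
import math

def concentration_from_activity(mu, q, V, kT=1.0):
    """Eq. 1.16 with Xi_1/Xi_0 = q: c = lambda * q / V, lambda = exp(mu/kT)."""
    lam = math.exp(mu / kT)
    return lam * q / V

def standard_potential(q, V, kT=1.0):
    """Reference (standard state) potential, mu^o = -kT log(q / V)."""
    return -kT * math.log(q / V)

def chemical_potential(c, mu0, kT=1.0):
    """Eq. 1.18: mu = mu^o + kT ln(c)."""
    return mu0 + kT * math.log(c)

# Hypothetical values in units where kT = 1.
mu, q, V = 0.7, 3.0, 10.0
c = concentration_from_activity(mu, q, V)
mu_back = chemical_potential(c, standard_potential(q, V))
print(f"c = {c:.4f}, recovered mu = {mu_back:.4f}")
```

The recovered value equals the starting chemical potential, which is just the statement that Eq. 1.16 and Eq. 1.18 are the same relation written in two ways.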
The change in free energy per particle, $\Delta\mu$, of the binding process is zero in equilibrium. Recall that each species in each phase has its own chemical potential:

$$\mu = \mu^o + \ln c, \tag{1.19}$$

Here $\mu^o$ is the reference energy (standard state), and $c$ is the concentration or density of the chemical species relative to a standard concentration of '1' in the units of interest. Hence we also have:

$$\mu_{SP} - \mu_S - \mu_P = 0 \tag{1.20}$$

Footnote 10: This statistical mechanics approach is in contrast to the thermodynamic approach of the footnote above, where one integrates $dG = V\,dp$ from some standard state (an arbitrary reference) to a desired point (in thermodynamic space, with dimensions like pressure). Here the point we reference is, in a sense, a point in phase space (a finer level of detail than the coarse-grained thermodynamic variables).

Now if we group the common standard states and concentrations, and rearrange:

$$\mu^o_{SP} - \mu^o_S - \mu^o_P = -\ln\left(\frac{[SP]_e}{[S]_e [P]_e}\right) \tag{1.21}$$

Here the subscript $e$ on the concentrations is to remind us that the concentrations are no longer variable, but fixed by the equilibrium constraint, and we assume units of $kT = 1$. Our chemical potentials are linked to the molecular energies through the logarithm of the dilute limit partition function of Eq. 1.14 (if the system is in equilibrium), hence we also have:

$$\mu^o_{SP} - \mu^o_S - \mu^o_P = -\ln\left(\frac{q_{SP}}{q_S\, q_P}\right) \tag{1.22}$$

Here we will assume that the $p \Delta V$ factors from Eq. 1.14 that arise from the effective molecular partition functions on the right side of Eq. 1.22 all cancel. This is because the pressure is constant, and we will assume the volume of the complex $SP$ is the additive volume of the molecules $S$ and $P$ in isolation in the solvent. Now we see that the binding energy emerges from the ratio of partition functions; hence, we define a new partition function as:

$$q = \frac{q_{SP}}{q_S\, q_P} \equiv e^{-\beta E_b}. \tag{1.23}$$

Here, the binding energy, $E_b$, is equal to the work done to separate the bound complex of protein and DNA (denoted as the $SP$ particle). It is the solute-solute interaction. It can also be thought of as the energy to lift an adsorbate out of the potential well of depth $E_b$ that describes the effect of the sequence on the adsorbate, or it could be thought of as the parameter $\epsilon$ in the pairwise Lennard-Jones (LJ) potential (the depth of the LJ potential). It also determines the potential energy term $U(X_S, X_P)$ that we would have added to our Hamiltonian in equation (1.1). The emergence of $E_b$ upon taking the ratio of the effective partition functions is a consequence of the assumption that the molecular degrees of freedom, such as rotation and vibration, are unperturbed by the binding process. For example, for the molecule $S$ we have $q_S \approx q_{S_r} q_{S_v}$, and similarly for the molecule $P$. The complex $SP$ contains all of these molecular states too; however the complex also contains an additional factor due to the interaction (such as an LJ potential). Assuming the complex is stable, we can assume we are at the minimum of the pairwise potential, which we call the binding energy (footnote 11).

Linking the statistical mechanics partition functions to the thermodynamic binding constant we have:

$$K = e^{-\beta E_b}, \tag{1.24}$$

where the binding constant is defined by:

$$K = \frac{[SP]_e}{[S]_e [P]_e}. \tag{1.25}$$

Experimentally this can be determined by binding titration curves, which allow one to express the binding constant as a function of the fractional occupancy.

Footnote 11: An example of the cancellations of the partition functions: let $q_{S_r}$ be the rotational partition function over the eigenvalues of the Hamiltonian for the rotational degrees of freedom, e.g. $q_{S_r} = \sum_i \exp(-\beta H_{S_i})$, where $i$ runs over the eigenvalues of the Hamiltonian for the $S$ molecule, and similarly for the other degrees of freedom (all the variables are assumed classical, hence we can work in a real vector space, as opposed to a complex vector space). Then $\frac{q_{PS}}{q_P\, q_S} = \frac{q_{P_r} \prod_d q_{P_d}\; q_{S_r} \prod_f q_{S_f} \exp(-\beta U)}{q_{P_r} \prod_d q_{P_d}\; q_{S_r} \prod_f q_{S_f}}$, where $d$ and $f$ run over all remaining 'degrees of freedom' for the molecules $S$ and $P$, and where the form of each degree of freedom's Hamiltonian will determine the eigenvalues and hence the partition functions (the 'momenta' and 'position' random variables of Eq. 1.1 are seen as 'degrees of freedom' [64] in this context, hence the variables of Eq. 1.1 can be seen as generalized coordinates in phase space, where the random variable $X$, for example, may represent a rotation). Whatever the form of these Hamiltonians, all of these partition functions cancel if they are unperturbed when $P$ and $S$ form a complex or 'bind', and all that remains is the interaction between $S$ and $P$ denoted as $U$, which at equilibrium has a value $E_b$.

As a consistency check, we see that if the binding energy is zero (no interaction between sequence and protein), then the bound complex is just as likely as the unbound state, while complete binding requires the binding energy to be negative infinity, and for particles that repel such that the bound complex never forms the binding energy must be plus infinity. For zero binding energy we then have:

$$q_{SP} = q_S\, q_P. \tag{1.26}$$

Hence, we find that partition functions behave almost identically to joint distributions. The beauty of partition functions is that we maintain the molecular link to the Hamiltonian, and a link to thermodynamics.

The above analysis lays the foundation for understanding the behavior of a transcription factor (protein) that binds in a solution with identical sequences (DNA oligos, for example). One can imagine the solvent as the oligos, and use standard partition functions, or one can work in a frame where the oligos themselves are another solute particle (like the protein), where both solutes are bathed in a solvent like water or milk. I now extend the analysis to the case of a protein binding to a single site within the genome. First, we make the observation that for a genome that contains an array of identical binding sites, the problem effectively reduces to a protein binding to a solution of oligos.

For now we will assume that the $n$ sites are independent of one another; hence, we can work with a system of just one site, and realize that extending the system to all $n$ sites simply requires scaling the free energy by $n$, and raising the partition function to the power of $n$. Hence, although each individual site will have fluctuations between being bound and unbound, we can use the $n$ sites as effective data to increase the power of our statistics for learning about the binding energy.

1.1.2.3 One binding site

For the case that the genome can be modeled as $n$ identical binding sites, we can construct a system with $n$ fixed sites where the binding protein number is allowed to vary (an open-closed system):

$$\Xi = \frac{\xi_o^{\,n}}{n!} \tag{1.27}$$

Here, $\Xi$ is the grand canonical partition function for the $n$ sites. The independence of the sites means we can simply work with just the grand canonical partition function for a single binding site, $\xi_o$, where the factorial is due to the indistinguishability of the $n$ sites. Hence, we can work with a single site system, and simply note that extensive quantities (such as the binding energy) will be multiples of the single site system (e.g. the binding energy of 10 bound proteins is simply ten times larger than the binding energy of a single protein, and the one-dimensional volume of one site simply increases by a factor of 10 for ten sites, i.e. the length, which is $n$ in units of the binding site length).

The single binding site is fixed (closed) while the adsorbate is open, hence the single site partition function is:

$$\xi_o = q_P + \lambda\, q_{SP} \tag{1.28}$$

Here the $q$'s are effective partition functions for solutes in solvent (the dilute limit). We can renormalize the partition function:

$$\xi = 1 + \lambda q \tag{1.29}$$

Here $q$ is the effective partition function of the bound complex, $q = q_{SP} / q_P$. Clearly $\xi_o$ and $\xi$ are different numerically. However, for relative probabilities the difference between the two forms is irrelevant. Hence the occupancy (the relative probability of bound to unbound) is:

$$P_b = \frac{\lambda q}{\xi} = \frac{\lambda q}{1 + \lambda q}. \tag{1.30}$$

The absolute activity ($\lambda = \exp(\mu / kT)$) contains the chemical potential, which is equal to the potential of both the free protein (the protein floating around in the nucleoplasm) and the protein that is bound on the site (i.e. that is in our system). This is because in equilibrium, the chemical potential of the reservoir of particles (free protein) must equal the chemical potential of the bound protein. This is simply the definition of equilibrium. If the potentials are unequal, which certainly occurs in development, then there will be a net flux into or out of our system (the binding site) until the potentials equilibrate. Utilizing the fact that the chemical potentials of the reservoir and system are equal gives us two potential equations, one in the form of a controllable parameter (the free protein with concentration $c_P$, which is related to the potential by Eq. 1.19) and another in the form of the partition function of the grand canonical ensemble (i.e. solve for the potential in Eq. 1.30). Relating these allows us to rewrite the occupancy as:

$$P_b = \frac{c_P K}{1 + c_P K} = \frac{c_{PS}}{c_S + c_{PS}}, \tag{1.31}$$

which is utilized extensively in chapter 3 of my dissertation for calculations of the occupancy of factors on multiple heterogeneous DNA binding sites of cis-regulatory modules. For further details on this topic see also chapter 2 of Hill [62].

1.1.2.4 Two dependent binding sites

For a system with two identical independent binding sites we have:

$$\xi = 1 + 2qc + q^2 c^2 \tag{1.32}$$

We will be interested in cooperativity between the two bound adsorbates, a dependency, which will modify the above function to:

$$\xi = 1 + 2qc + y\,q^2 c^2 \tag{1.33}$$

Here $y$ is the exponential of the work required to perform the following reaction: 10 + 01 = 11 + 00, where 00 is the unbound-unbound configuration, etc. This process requires no energy unless there is an interaction between the adsorbates. Using Hill's formalism we have in general:

$$\xi = y_{00} + qc\,(y_{10} + y_{01}) + y_{11}\, q^2 c^2 \tag{1.34}$$

For example, $y_{11} = y$ contains the cooperativity for bound protein-protein interactions, while $y_{00}$ refers to an interaction that occurs between the two binding sites (an interaction that occurs in the 00 configuration).

Furthermore, to compute the probability of any configuration occurring we simply extract the corresponding term from the partition function and compute the relative probability. For example, the probability of the configuration 11, where both proteins are bound, is:

$$P = \frac{y\, q^2 c^2}{\xi}. \tag{1.35}$$

1.1.2.5 A genome of $n$ dependent binding sites

For $n$ binding sites, where the bound proteins interact, there are a total of $2^n$ different types of possible interactions that may be accounted for. For the case that only nearest neighbors interact, we have $\Xi = \frac{\xi^n}{n!}$, where $\xi = 1 + yqc$ such that $y$ contains the interaction energy. For the case that all $n$ sites are different, yet there are still nearest neighbor interactions of bound factors, we have $\Xi = \prod_i^n \xi_i$, where $\xi_i = 1 + y_i q_i c$.

1.1.2.6 Highly correlated systems, the k-mer sequence and recognition problem

We've been treating sequences $S$ as if they were simply particles. DNA sequences consist of units of bases: A, C, G, T. Proteins, like transcription factors, may prefer one base over another, and in general may prefer a specific ordering of specific bases; for example 5'-AAAT may not be equivalent to 5'-TAAA. Notice these are not genetic complements (e.g. 5'-AAAT complements 5'-ATTT, where I will always write DNA sequences in the 5' to 3' direction). Rather they are mathematical permutations of one another.

For the case that one considers a k-mer, a binding site that contains k consecutive component sites that are all bound or all unbound, the k component sites can be aggregated into just one binding site (since they're completely correlated). This is the form I use for the representation of transcription factor binding sites, where each component of the k internal sites represents a DNA base (footnote 12), and where base really means base-pair, since proteins bind to double-stranded DNA; we only use one strand for the representation of the binding site to keep the notation simple. For example, a specific 3-mer of DNA is 5'-AAA, where the protein is really binding to the complex of 5'-AAA and 5'-TTT. Rather than constructing a closed-open system for the DNA and adsorbate for each component of the sequence, we instead construct a closed-open system for the aggregate. Hence the configurations for the closed-open system would simply be bound or unbound, identical to the problem of a single binding site with a variable number of adsorbate. Another example is two dimers, for example 5'-AA and 5'-AA, which together make up the sequence 5'-AAAA. This is simply considered as two binding sites, and hence has four configurations: 00, 10, 01, 11, where the 01 configuration indicates that the first two bases of AA are unbound, while the last dimer is bound by the adsorbate. Hence this can be treated identically to how two binding sites were treated above.

Footnote 12: This is a form widely accepted, possibly due to natural selection acting at the units of the bases (which are roughly the chemical functional groups).
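The four configurations 00, 10, 01 and 11 can be enumerated directly. The sketch below is a minimal illustration of Eq. 1.31 and Eq. 1.33 to Eq. 1.35, assuming $y_{00} = y_{10} = y_{01} = 1$ and $y_{11} = y$; the function names and the numerical values of $q$, $c$ and $y$ are illustrative assumptions, not fitted parameters.

```python
def single_site_occupancy(c, K):
    """Occupancy of one site, P_b = cK / (1 + cK), as in Eq. 1.31."""
    return c * K / (1.0 + c * K)

def two_site_probabilities(q, c, y=1.0):
    """Probabilities of the configurations 00, 10, 01, 11 for two identical sites.

    Statistical weights follow Eq. 1.33: 1, qc, qc and y (qc)^2, where y is
    the cooperativity factor of the doubly bound configuration.
    """
    weights = {"00": 1.0, "10": q * c, "01": q * c, "11": y * (q * c) ** 2}
    xi = sum(weights.values())  # the two-site partition function
    return {config: w / xi for config, w in weights.items()}

# Hypothetical parameters: q*c = 1, so every weight equals 1 when y = 1.
independent = two_site_probabilities(q=2.0, c=0.5, y=1.0)
cooperative = two_site_probabilities(q=2.0, c=0.5, y=10.0)
print("independent:", independent)
print("cooperative:", cooperative)
```

With $y = 1$ the probability of the 11 configuration is the square of the single-site occupancy, as expected for independent sites; with $y > 1$ the doubly bound configuration is favored at the expense of the others.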
An additional complexity will be to not only introduce each base as having a specific binding energy (so 4 distinct binding energies), but each base within the k-mer as having a specific binding energy. Hence the binding energy will be based on a function of 4*k possible energies. This means that for a lattice of k sites, we treat each site independently in terms of its binding energy, yet in terms of the binding to the k sites, the sites are completely correlated. Hence the binding energy of a k-mer $S$ to a specific adsorbate is:

$$E_b = E(S) = \sum_i^k E(S_i), \tag{1.36}$$

and the binding constant for the k-mer to the adsorbate is:

$$K(S) = \prod_i^k K(S_i). \tag{1.37}$$

This complexity can be increased by considering a hierarchy of possible internal interactions, or cooperativity within the binding site, such that the top of the hierarchy has $4^k$ possible energies. This hierarchy is commonly explored in the interdisciplinary literature through different probabilistic models of sequences.

In 1987, Berg and von Hippel (BvH) introduced their evolutionary selection model of protein-DNA regulatory sequences, which effectively unites the idea of the highly correlated binding problem with Multiple Sequence Alignment (MSA). Staden, three years earlier, had introduced the idea of making a table to organize the count data from an MSA. At the time, Multiple Sequence Alignment was an emerging field (Smith-Waterman's local pairwise alignment was only invented 3 years earlier), BLAST does not appear until 1990, and the named Hidden Markov Model (HMM) applied to sequences, according to Durbin, dates to 1994 [38]. The k-mer binding sites in the 1987 BvH paper [16] were modeled by what are called Position Weight Matrices, PWMs, which would eventually be recognized as an HMM in Durbin's text [38] (see chapter 5), where the PWMs are called Position Specific Scoring Matrices, PSSMs.
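The additive model of Eq. 1.36 and Eq. 1.37 can be sketched as a lookup in a table with one column of four energies per position. In this minimal sketch the 3-mer energy table is made up for illustration, energies are in units of $kT$, and each per-position constant is assumed to be $K(S_i) = e^{-E(S_i)}$ (an assumption that drops any constant prefactor).

```python
import math

# Hypothetical per-position energies (units of kT) for a 3-mer binding site.
ENERGY_PWM = [
    {"A": 0.0, "C": 1.2, "G": 0.8, "T": 2.0},
    {"A": 0.0, "C": 0.5, "G": 1.5, "T": 1.0},
    {"A": 0.3, "C": 2.0, "G": 1.1, "T": 0.0},
]

def binding_energy(seq, pwm=ENERGY_PWM):
    """Eq. 1.36: sum the energy contribution of the base at each position."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal the number of PWM columns")
    return sum(column[base] for column, base in zip(pwm, seq))

def binding_constant(seq, pwm=ENERGY_PWM):
    """Eq. 1.37: multiply the per-position constants K(S_i) = exp(-E(S_i))."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal the number of PWM columns")
    return math.prod(math.exp(-column[base]) for column, base in zip(pwm, seq))

for kmer in ("AAT", "CAT", "TTC"):
    print(kmer, binding_energy(kmer), binding_constant(kmer))
```

Because the exponential turns a sum into a product, the two functions always agree through $K(S) = e^{-E(S)}$; this is why an additive energy table and a multiplicative table of binding constants carry the same information.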
1.1.2.7 K-mers and PWM binding constants

We know that we can treat the binding of a protein to a k-mer (a sequence $S$) using standard thermodynamics; for example, in units of the thermal energy, we have:

$$K(S) = \frac{[PS]}{[P][S]} = e^{-\beta E(S)}, \tag{1.38}$$

where $E(S)$ is the binding energy, which we will assume obeys the additive form of Section 1.1.2.6 (Eq. 1.36). We would like to know the binding constant of the adsorbate to all possible $4^k$ sequences $S$. Measuring these is tedious and possibly experimentally infeasible. We can use the binding to the highest affinity binding sequence (the 'consensus' sequence, $S_o$) as a reference point. Then we can define $E(S)$ as the perturbation energy, or the amount the binding energy increases when one perturbs the system from $S_o$ to sequence $S$. Hence we define a reference energy or zero of energy as $E(S_o) = 0$, which is an arbitrary choice of reference (footnote 13).

From the perspective of Hill's perturbation theory (see page 29 of Hill [62]), we see that we could model perturbations from a reference system by:

$$K(S) = K_o\, e^{-\beta E(S)} \tag{1.39}$$

Here we have generalized the binding constant of a protein to any sequence $S$ of length k. Physically this is hard to imagine experimentally in real time for a bound protein-DNA system, as we're talking about changing just part of the genetic material (a base or two) bound to the protein while keeping intact the bulk of the binding site. However, since the end result of the process is 'path independent', the method used to cause the perturbation is irrelevant; hence the perturbation may even be an evolutionary mutation of a binding site.

Assuming that each position within the binding site is independent, we can then construct a table of all the single mutation perturbations away from the consensus, thereby allowing us to estimate the binding energy for all possible k-mers. This table contains the matrix elements of the so-called energy Position Weight Matrix, discussed in more detail in Chapter 2, which is used in computational algorithms that 'search' for binding sites for transcription factors.
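The construction just described can be sketched in a few lines: build the energy table from single-mutation binding constants measured relative to the consensus, predict $K(S)$ for any k-mer through Eq. 1.39, and slide the table along a longer sequence to 'search' for the lowest-energy window. This is a hedged illustration rather than the algorithm used in later chapters; the function names, the dimer consensus and all binding constants are hypothetical, and energies are in units of $kT$.

```python
import math

BASES = "ACGT"

def energy_table(consensus, K0, single_mutant_K):
    """Per-position energies relative to the consensus, E = -ln(K_mutant / K0).

    single_mutant_K maps (position, base) to the binding constant of the
    consensus carrying only that one substitution (hypothetical inputs).
    """
    table = []
    for i, ref in enumerate(consensus):
        column = {}
        for base in BASES:
            if base == ref:
                column[base] = 0.0  # reference: E(S_o) = 0
            else:
                column[base] = -math.log(single_mutant_K[(i, base)] / K0)
        table.append(column)
    return table

def predicted_K(seq, table, K0):
    """Eq. 1.39 with additive energies: K(S) = K0 * exp(-E(S))."""
    return K0 * math.exp(-sum(col[b] for col, b in zip(table, seq)))

def best_site(dna, table):
    """Return (energy, start) of the lowest-energy window of length k in dna."""
    k = len(table)
    windows = [
        (sum(col[b] for col, b in zip(table, dna[i:i + k])), i)
        for i in range(len(dna) - k + 1)
    ]
    return min(windows)

# Hypothetical dimer example: consensus AA, every substitution lowers K tenfold.
K0 = 100.0
mutants = {(i, b): 10.0 for i in range(2) for b in "CGT"}
table = energy_table("AA", K0, mutants)
print(predicted_K("CC", table, K0))
print(best_site("GGAAGG", table))
```

Only $3k$ single-mutant measurements plus the consensus are needed to fill the table, against $4^k$ measurements for the full set of k-mers; the price is the independence assumption between positions.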
Footnote 13: Note, we are not saying $K(S_o) = \frac{[PS_o]}{[P][S_o]} = e^{-\beta E(S_o)} = 1$. Rather, in a sense, we are defining the 'binding energy' to be a perturbation energy, by $E(S) = -\ln\frac{K(S)}{K(S_o)}$, where it is clear that to compute the work to separate $S$ bound by $P$ would require a priori knowledge of $K(S_o)$, which must be experimentally determined, in addition to the experimental determination of $K(S)$. Hereafter, unless explicitly stated, this perturbation energy $E$ will simply be called the binding energy. Furthermore, in chapter 2, we denote $E(S)$ as a bioinformatic binding energy that is inferred from sequence data, and $G(S)$ as a Gibbs free energy. In this chapter, it is best to think of $E(S) = G(S)$, namely as a biochemical energy.

1.1.2.8 Single binding site protein systems in distinct environments

We can further imagine perturbations to the binding energy due to the environment of the binding site. For example, if we have two distinct environments, $c$ and $u$, which will become useful ideas in Chapter 2, we could construct two distinct binding constant tables, where the binding constant for any given sequence in environment $c$ would be:

$$K(S)_c = K^c_o\, e^{-\beta E(S)_c}, \tag{1.40}$$

and similarly in environment $u$:

$$K(S)_u = K^u_o\, e^{-\beta E(S)_u} \tag{1.41}$$

In environment $c$, we imagine the binding constants of all $4^k$ sequences $S$ (all DNA k-mers) being measured to be $K(S)_c$. Similarly, in environment $u$, we imagine the binding constants of all $4^k$ sequences $S$ being measured to be $K(S)_u$. Statistically we are assuming $P(S \mid c) \neq P(S)$, and similarly for environment $u$, where $P(S)$ is the probability that sequence $S$ is bound by the adsorbate when all possible $4^k$ k-mer sequences compete for binding with the adsorbate. Hence $P(S)$ is the occupancy of sequence $S$ by the adsorbate, normalized by the sum over the occupancies of all $4^k$ k-mer sequences, where the occupancy is calculated for all k-mer sequences under the same concentration of the adsorbate.

These environments could be considered at the cis-level, that is, at the level of the genome.
Hence, the environments $c$ and $u$ could be determined by whether or not a co-occurring binding site is near the sequence $S$. For example, one could imagine that the environment $c$ is due to cooperative interactions that have evolved between the co-occurring binding sites, while environment $u$ is due to uncooperative, or just plain independent, binding to the co-occurring binding sites. We explore this problem in detail in chapter 2, where we consider the possibility that Dorsal binding sites have evolved as a mixture of distinct motifs due to a '$c$' and '$u$' cis-environment acting as a selective force to maintain the component motifs. This is an example of 'epistasis', where multiple genes (i.e. the co-occurring binding sites) are selected for jointly (i.e. the sites are not evolutionarily independent).

1.1.2.9 An example of two k-mer binding sites system using a mixture

Now we can imagine two k-mer binding sites adjacent on a lattice (there may be intervening nonspecific lattice sites between the two k-mer sites that act as 'spacers'); the important point is that we label two distinct locations (each of length k) on the lattice to be sites. For example, consider two dimers separated by a spacer: $S'$ = AANNNNCC, where AA is the first dimer, CC is the second dimer, and NNNN is a spacer of DNA that the adsorbate does not recognize as a binding site. The two k-mer sites each bind distinct adsorbates; for example the adsorbate Dorsal binds AA and the adsorbate Twist binds CC. We can explicitly account for lateral interactions between the sites using the '$y$' factors previously introduced. The lateral interaction can be adsorbate-adsorbate or sequence-sequence.
For example, there may be a sequence-sequence interaction between the particles AA and CC. This can be accounted for by:

$$y_{11} = \exp\left(-\beta\, w(S')\right) = \exp\left(-\beta\left[E(S)_c - E(S)_u\right]\right) = \exp\left(-\beta\left[E(AA)_c - E(AA)_u\right]\right), \tag{1.42}$$

where we have introduced the energy terms from above that were from the environments $c$ and $u$. Here $y_{11}$ is an exponential factor that contains the cooperative energy $w(S')$, which is a function of the underlying sequence $S'$; it is defined by our equation (1.35), and is further discussed on page 100 of Hill [62] for generalized binding site systems. If there is a sequence-sequence interaction, then the binding energies $E(S)_c$ and $E(S)_u$ may be different, although it is not a necessary condition for a sequence-sequence interaction to manifest itself in this form.

An implicit form of cooperativity at the sequence-sequence level is to just use $E(S)_c$ for the case that a Dorsal binding site k-mer co-occurs with a Twist binding site k-mer. This form contains the binding energy to the sequence along with a shift in the binding site energy due to the interaction with its cis-environment. The shift is modeled as a Kullback-Leibler divergence between the cis-specific environment and the case in which the binding site is independent of its environment.

1.2 Biological Introduction

1.2.1 Tree of Life and the Theory of Life

The tree of life is both the organizing structure of the life sciences (like its predecessor, Linnaean classification) and a representation of the theory of life, evolution. On a short time-scale it is a picture of biological reproduction, or unfaithful cloning: a pedigree. On long time-scales, the tree is a representation of the process of species evolution, which in many ways can be thought of as a pedigree of species, where each node represents a population of a particular species, descent down the branches represents time, and division of a node into two nodes (reproduction in a standard pedigree, i.e. parents having children) represents speciation events from a common ancestor (reproduction at a population level on geological time-scales).
29 1.2.1.1ComparativeAnatomyandPhysiology Previously,intheeighteenthcentury,Linneanbybruteforceclusteringtechniquesorganized lifeaccordingtoanatomicalsimilarities,leadingtoasystemforlife.Thisorga- nizationwasnotjustadatabasethatorganizedcollectionsoflivingobjectsbycomparative anatomy;theorganizationexplainedwhytheobjectswereditorwhytheyweresimilar becausecomparativeanatomybegsthequestionof'function',ofphysiology,andphysiology isatheoryoflife.Inthissense,Linnaensystematicorganizationoflifewasasimpletheory oflife,inthatitdidexplainlife,asdoesallphysiology,becauseitexplainedthepurpose (function)ofeachanatomicalfeature 14 .Forexample,thepurposeofalegislocomotion, ofajawistobite,ofarootistostabilizeastandingplant(amongotherfunctions),ofa circulatorysystem(heart)istocirculatenutrientsthroughouttheorganism,ofastamenis forsexualreproductionetc.;andLinnaeusknewthesetrivialrelationshipsbetweenstruc- tureandfunction,whichundoubtedlyhelpedinhisgroupingoforganisms.Interpretingthe organizationoflifethroughatheoryofstructureandfunctionisverypowerful,asatheory intentionallycomplexpatternssothattheycanbeunderstoodandcomprehended. Hencethetheoryof'structureandfunction'hasoneofthemostimportantaspectsofbiolog- icaltheory,andthatistosimplifylifeinawaythatwecancomprehenditsrichdiversity.It alsohasanobviouspredictivepower,forexampleifananimallosesitslegsyoucanpredict thatitwillloselocomotion.AlthoughmanyofLinneangroupsarebasedonreproductive anatomicalfeatures,a'structureandfunction'theoryisveryshort-sightedintermsofhow lifereproduces,asitcanonlyexplainhowtomaintainexistingpopulationofspecies(by 14 Inbiologythe'function'ofatraitoranatomicalstructureiswhatthetraitdoesoraccomplishes,hence theword'purpose'seemstobeasynonymtofunction.However,thismaybeconfusingseeingthatevolution hasno'purpose',hence'purpose'shouldnotbeinterpretedasifthepopulationorlineagethatevolvedthe traithadforesightandintendeditscreation. 
having the reproductive structures do what they do). What's clearly missing is the origin of life, how to make life from basic physical chemical principles; and the origin of all the different kinds or species once basic life forms (i.e. species meaning a group that can form viable offspring).

The theory of evolution by Charles Darwin, which I'll summarize in two pieces, 'descent' and 'modification', added more structure to the theory of life. Darwin recognized (hypothesized) that common anatomical structures were due to common 'descent', indicating that common features need not be derived anew, possibly using different ingredients each time; the entire structure was passed on during reproduction. He also recognized that different structures between two groups of animals were due to 'modification' from a common ancestor between the two groups of animals.

1.2.1.2 Classical evolution

Darwin's theory in the late 1800s united life through one common lineage. However, explaining the biological mechanisms of how to produce life from molecules (e.g. how are completely novel and complex features derived in the first place) was not possible during Darwin's time, as the observational tools such as Leeuwenhoek's microscope seemed to be insufficient. Furthermore, to show that one species of organism could be related to another (through the common ancestor) would require an evolution experiment that takes longer than one's lifetime (usually). Albeit, as suggested by Darwin's book's title, The Origin of Species, showing that one species could be related to another was precisely his point.
In parallel with Darwin's work, Mendel's breeding experiments with peas would set in motion the molecular basis of evolution, and set the stage for two contributions from molecular biology that explain the molecular basis of inheritance. Before the advent of molecular biology, from a reductionist point of view, the proof (evidence) of the evolutionary tree of multicellular organisms can be thought of as a hierarchy of conservation of features or modules. The importance of conservation cannot be stated enough, as Darwin's central tenet is 'descent', meaning common features between our ancestors and ourselves are due to conservation from descent: inheritance. At the highest level of modularity (and by far the most useful for a big picture resolution of the tree of life for multicellular organisms) are the anatomical structures (e.g. leg, eye, heart, nervous system). Comparative embryology during Darwin's time elucidated that all these anatomical structures, from organs to gross features like a leg (containing multiple organs, like skin), are derived from possibly just three tissue types (germ layers). For example, the heart is derived from the mesoderm tissue, and all multicellular organisms without a mesoderm in their embryonic stage do not develop a heart. These three tissue types are composed of largely undifferentiated cells, and during development they work together and sometimes work independently to fate the array of different cell types that make up the anatomical features of multicellular organisms (such as epidermal cells, hematopoietic cells, blood cells). Hence, before molecular biology (before the 'gene' picture), there were at least three basic conserved traits that could be used as evidence for resolving tree branching in multicellular organisms (anatomical features, germ layers, cell types).
1.2.1.3 Modern Synthesis of evolution

Molecular genetics, the first contribution from molecular biology, would show that animals possess genes that are contained in chromatin and that those genes were passed on from parents to children during reproduction; this culminated in the 'modern synthesis' of evolutionary biology in the early 1900s. This would gain further support by R. Franklin's crystallization of a chunk of chromatin, DNA. DNA was found to be a polymer strand that would complement with another self assembling strand, which Watson and Crick saw could serve as a 'copy' for a replicating cell's progeny cell. The 'modern synthesis' is largely about the molecular dynamics of populations, 'population genetics': a population of organisms' genomes (the frequency of genotypes), how those change in time (in units of generations), and how they can be affected by natural selection, mutation, gene flow, migration and isolation (E. Mayr type speciation - allopatry). Most significantly, the gene centred picture that had arisen in evolution now had given a fourth feature or module that was conserved: the gene, which encodes for a protein.

1.2.1.4 Comparative genomics

With the advent of fast 'sequencing' technology that can determine the genetic sequence of whole genomes, genetic conservation (by simply comparing genomes) has become a powerful tool in helping resolve the various branches of the evolutionary tree. With the knowledge of the vast array of processes of genome evolution (how genomes change in time) we infer that the enormous genomes of complex organisms like humans, which share genes in common with the much smaller bacterial genomes, share these in common due to inheritance, through the process of natural selection that balances the randomizing forces of mutation, as generation of genes de novo (by random mutation) is far less likely (probabilistically) than inheriting these modules (genes) (e.g. a gene of length 1000 nucleotides (hence about 333 amino acids) has $4^{1000} = 2^{2000} > 10^{300}$ possible sequences, which means an organism that reproduces 1000 times per year would still require vastly more time than the time available since the Big Bang to generate one gene (see Maynard Smith's chapter one [88])). The cause
of the genetic inheritance through natural selection can be due to standard reproduction along a lineage (i.e. meiosis and sexual reproduction), such as the human lineage, or it can be due to more exotic forms of homology (inheritance), where viruses or transposable elements can insert genetic material in host genomes resulting in genome expansion, where the newly inserted genetic material may, over time, become a beneficial resource to the host. Perhaps the most significant source of genome evolution is gene duplication and whole genome duplication (the human lineage is thought to have had two whole genome duplications since the evolution of the deuterostomes - the 2 rounds (2R) hypothesis 15), which can have enormous effects such as those seen in the mustard family of plants. These duplication events cause each original gene to have a copy whose purpose is in a sense unfated, and which hence is free to evolve a new function. Hence although humans have about 25,000 genes, many of these are just slight variants of one another - a gene family - inherited through duplication events. Hence, the archaebacteria that originated about four billion years ago evolved gene templates (a basic design) of some of the gene families through natural selection for genes necessary to process the oxygen depleted environment; similarly their descendent cyanobacteria evolved the processes necessary for photosynthesis, thereby transforming the atmosphere to be amenable for evolution of the gene networks in aerobic respiration found in bacteria. It can then be interpreted that most of our genes (the genes found in eukaryotes) that we share in common with bacteria are inherited from the bacteria. Hence gene phylogeny has become the standard technique to resolve (determine the branching order of) the tree of life. Of course, it has been shown that a gene phylogeny almost always agrees with an anatomical grouping (such as Linnaeus's groupings), because genes encode the anatomical structures 16.

15 This 2R hypothesis is supported by whole genome analysis between vertebrates and invertebrates. In particular the HOX genes, as seen on four chromosomes, occur on just one chromosome in the fly.
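The sequence-space estimate quoted in Section 1.2.1.4 (a gene of 1000 nucleotides has $4^{1000}$ possible sequences) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant names are made up here, and the age of the universe is an assumed round value of about 13.8 billion years rather than a number taken from the text.

```python
import math

GENE_LENGTH = 1000               # nucleotides, as in the text
ALPHABET_SIZE = 4                # A, C, G, T
TRIALS_PER_YEAR = 1000           # reproductions per year, as in the text
AGE_OF_UNIVERSE_YEARS = 1.38e10  # assumption: approximate value, not from the text

# Exact number of distinct sequences of this length (Python integers are unbounded).
n_sequences = ALPHABET_SIZE ** GENE_LENGTH

# Orders of magnitude: size of sequence space versus number of random trials
# a single lineage could have made since the Big Bang.
log10_sequences = GENE_LENGTH * math.log10(ALPHABET_SIZE)
log10_trials = math.log10(TRIALS_PER_YEAR * AGE_OF_UNIVERSE_YEARS)

print(f"sequence space is about 10^{log10_sequences:.0f}")
print(f"trials since the Big Bang are about 10^{log10_trials:.0f}")
```

Under these assumptions the sequence space has roughly 600 decimal orders of magnitude while the number of trials has about 13, so the bound $> 10^{300}$ used in the text is a comfortable underestimate.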
16 One must still be careful using just anatomical features for determining the topology of the tree, as 'mimics', commonly seen in the insect world in predator prey interactions to deceive their opponent, clearly demonstrate that anatomical features can be evolved independently 17. However, internal anatomy structures like the central nervous system are much stronger anatomical features for determining evolution, since these internal features are very complex and much harder to evolve de novo, and harder to co-opt due to high levels of epistasis and pleiotropy.

An obvious question about gene templates (gene families) is how does one infer that a particular gene locus in human (for example) corresponds to the same locus in, say, a fly? The gene of interest in the two organisms may be the direct descendent in each lineage since the most recent common ancestor of the organisms, or it may be the indirect descendent in each lineage due to gene and genome duplications or viral and transposable element insertions - this is the distinction of orthologs and paralogs 18. Of course if the genomes are from a recent common ancestor, like chimp and human, it's relatively easy, since the 'order' of the genes is preserved (synteny), hence the genomes (chromosomes) roughly match or align from start to end (barring the large scale inversions and the famous fusion of two of the chimp's 24 chromosomes forming chromosome 'number 2' in human, which has two centromeres and two additional telomeres at the fusion site of the chromosomes, hence humans have 23 chromosomes). However, for more distantly related organisms, such as fly and human, this is more difficult due to chromosome inversions and the many indels since the most recent common ancestor.

1.2.1.5 Expression patterns

The answer is that we need to know where in the body (where in the anatomy) the gene is being expressed. If we are interested in a protein in the brain (say chordin), we would expect that the homolog (here I use homolog to mean direct common ancestry, so I mean ortholog) of that gene be expressed in the insect brain (say dpp). This expectation is in a sense obvious,
as you may know through your own experience with food and taste that different organs and tissue of an animal or plant 'body plan' taste different due to the different proteins in those organs (of course some tissues are full of sugars and lipids possibly overwhelming any taste from the proteins). If the genes are indeed co-expressed in the same tissue, we next need to know if the gene functions the same way: is its protein being used for the same purpose? Given that the gene is co-expressed in the same tissue and functions the same, this, by parsimony, would suggest the genes are orthologs rather than paralogs.

How do we see 'where' a gene is expressed in a body (like is it expressed in the brain, heart or muscle etc.)? This question could be answered partly by classical genetics (breeding using Mendel's laws) by determining if a known genotype resulted in embryonic lethality (or when in development the fetus or larva died, or what environmental pressures would 'induce' death - presumably because of the mutant gene). However, using classical genetics alone to determine where a gene is expressed in a body is extremely tedious and indirect, has major limitations, and requires a lot of interpretation. A much faster technique would be a visual tag on the gene's protein or mRNA (such as a fluorescent marker or dye), albeit classical genetics is still required to control for the correct allelic version of the gene of interest among other reasons (like genomic background). This microscopic technology became available in the 1980s and revolutionized our understanding of evolution by using the technology coupled with time-shots of development, thereby also telling us 'when' a gene is expressed during an embryo's development.

18 Convergence indicates the organism generated the feature anew, which is of course a possibility for any material being obeying the laws of physics, but is so much less likely than inheritance that whenever inheritance can be demonstrated possible this is the most probable origin.
1.2.1.6 The generalization of the theory of evolution; developmental genetics

Prior to the 1980s, embryologists and developmental biologists had carefully observed the different stages of development of many multicellular organisms. Maps were made of the different cell types that would be produced from the single fertilized egg, a cell lineage, as was mapped in detail for the first time in the development of the nematode. These cell lineages that were derived from the fertilized egg, and then germ layers, would mark the differentiation of cells (or cell specification). By determining 'where' a gene was expressed in a developing animal embryo it became evident that the process of cell differentiation was linked with co-expression of whole sets of genes, where at each step of differentiation new sets of genes were being expressed (and others turned off) 19. Hence development in animals, in a large part, can be seen as a result of 'gene regulation'.

The second part of molecular biology that supports macroevolution is developmental genetics. Developmental genetics would show that when the regulatory targets or binding sites of master genes (transcription factors) evolved, whole developmental networks (i.e. the genes necessary to build an anatomical structure) could be redeployed to a different position of a developing body (e.g. the position that expressed the activator master gene), or lead to modification of current body parts 20.
19 Plant development is different than animals. Plants ubiquitously display 'phenotypic plasticity', which in development is called 'developmental plasticity'. Plant development is not reproducible if the environment of the plant changes (e.g. change the temperature or light and the plant will respond). Plants respond to their environment, while animals do not; as Gilbert puts it: some organisms (like animal development) are ruled by a tyrant genome, where the phenotype plays the passive permissive role of the ruled, while other organisms (like plants) are ruled by the environment, where now the genotype plays the passive role of the ruled (see chapter 22 of Gilbert [51]). Animals have largely evolved plasticity as part of their development; this is a little different than homeostasis. Hence, the 'developmental plasticity' seen in plants is incorporated into animal development. Animals have stably generated the internal environmental cues to turn genes on and off at different anatomical positions of varying molecular environments; where a plant may develop roots if cued by low light levels or various chemical triggers, an animal has put the cues for development as part of its internal system. For example, animals always develop, say, a neuroectoderm and mesoderm because the environmental cues that are a part of the embryonic environment are stable (such as in a mother's womb that sets up concentration gradients of key transcription factors, which lead to the Dorsal gradient, or the egg casing of fly or chicken) and are processed by components of the genome (e.g. CRMs) [52].

20 As suggested by the phrase 'developmental genetics', this would seem to be a sub-discipline of 'molecular genetics' and hence not an extension or an enriching of the modern synthesis, but rather a subfield. However, the modern synthesis was a gene centred theory; it was about genes that encode for proteins. Developmental genetics is a gene 'regulatory' theory; it is about the parts of the genome that turn traditional genes on and off, and gene regulatory elements (transcription factor binding sites) are not really a part of the modern idea of a gene. A gene encodes for a protein. A genetic regulatory element encodes for something altogether different.
Developmental genetics to a large degree is the field of study that shows at a molecular level how whole anatomical features evolve, and how the diversity of anatomical features and their arrangement (body plans) has evolved. This is because the detailed molecular mechanisms of how animals develop have revealed the simplicity of anatomical scale evolution (sometimes crudely called 'macroevolution'). Developmental mechanisms are observed in the field of developmental genetics through powerful experiments such as transplantation experiments and gain and loss of function experiments, along with the powerful tools of genetics and microscopy techniques. In a sense developmental genetics elucidates the 'Gene Regulatory Networks' that form the causal coarse grained molecular basis of self assembling anatomical features.

1.2.2 Development

1.2.2.1 The origin of multicellularity; the evolution of development

About two billion years ago, single-celled algae started to cooperate by developing cell-cell interactions, forming the eukaryotic multicellular organisms, like algal mats. Such constructs eventually led to the important innovation of multicellular organisms that is found in animals (and other kingdoms): sexual reproduction through meiosis and fertilization. Like endosymbiosis, which is possibly the origin of eukaryotes from bacteria, where two cells would merge and partner to share their particular genes and hence traits (which may be different traits), in sexual reproduction cells not only merge, or fertilize (merging of two gametes into a zygote), but they also go through meiosis, which is significantly different from the results of endosymbiosis (which replicate through mitosis of each fused component), significantly different due to the 'recombination' of traits, or crossing over, leading to great diversity in progeny, thereby leading to faster evolution (by Fisher's Fundamental Theorem) and preventing Muller's ratchet (by creating gametes free of detrimental mutations). Furthermore, the fusion of two genomes (e.g. the two gamete genomes) is a form of 'gene duplication'.
Duplication events of genes are a source of 'gene families', where the copied gene (paralog) can extend the diversity in function of a particular template gene, which leads to a whole family of genes (a paralog phylogeny, see for example chapter 7 of Durbin et al. [38]) 21.

21 Two bacterial cells could fuse leading to an $n \to 2n$ genotype, similar to diploidy, and allowing for greater genomic diversity as one of the duplicate copies of the gene is now free to serve a new function. However, this genome duplication event is distinct from sexual reproduction (and hence is not diploidy, in my opinion), as the new bacteria with 2n genes can be said to now have simply n' genes (a haploid with n' genes); when the n' genes are replicated (through mitosis), the progeny are clones (if the mutation rate is slow enough), which distinguishes it from sex: in sexual reproduction the progeny are not clones due to recombination (assuming the mutation rate is high enough that there exist some genetic polymorphisms in the population, such that the crossover pieces are not identical).

These early sexual reproducing algae are the origin of eukaryotic early development (which I define as eukaryotic cellular interactions that lead to a multicellular structure, such as an algal mat (a plane of cells), or a 'blastula, like the protistan volvox' (a ball of cells) [126]). These primitive organisms are modern representatives and typical examples of the evolution of multicellularity.

In the diverse domain of eukaryotes the most famous groups are multicellular organisms, due to their gross anatomical features that capture our attention with their beauty and diverse functions. In the multicellular organisms there is a remarkably conserved early development, yet there are also marked differences in development suggesting that multicellularity has independently evolved in plants, animals, and fungus.

Examples of precursors to multicellularity can even be seen in bacteria in quorum sensing (chemical communication/signalling) and in the primitive fungus yeast through their form of sex using 'mating types' (mating 'types' are analogs of males and females) [126]. An array of protists (eukaryotes that are not animals, plants, or fungus) show multicellularity,
notably volvox and slime molds. All these 'primitive' organisms are good starting points for the study of multicellularity.

Development, to a large degree, focuses on multi-cell types - cellular differentiation (a skin cell 'functions' differently than a germ cell and functions differently than a stomach cell, which is a digestive cell that can 'eat' absorbed surroundings). That is the division of labor through specialized cells, which is not seen in many primitive organisms; rather these simple organisms are displaying 'colonies', which are the aggregates of cells due to mitosis resulting in progeny cells being proximal to one another (which is physically important in development, but it doesn't display the division of the aggregate into different cell types).

The starting point of development is the innovation of meiosis, invented by the protists, and passed on to its progeny lineages of plants, animals, and fungus. Hence, plants and animals didn't derive sex anew for themselves; they were the beneficiaries of it from a protistan ancestor. By the union of meiosis with fertilization (the opposite of meiosis, in a sense) the ability to always have an extra copy of a gene, and therefore the ability of a gene to evolve a new function, becomes a stable component of life forms (regardless of whether some life forms display a 'haploid dominant' life form, like bryophytes in plants (e.g. moss and liverwort), where the plant one normally sees is the haploid, because most of the bryophytes' life cycle exists in the haploid stage).

The great lineages of animals, fungus, and plants all develop from the fertilized 'egg' (the fusion of the two gametes). 'Egg' means oocyte here (one of the gametes), while in the development literature egg may also mean the structure that encases a baby, like a chicken egg. Initially in evolution, possibly, there was no distinction between gametes, egg and sperm (technically plants were first, so egg and pollen): both gametes were equal in form, just haploid cells, like in yeasts. Regardless of the controversial origin of meiosis, we see here a clear case of different cell types and an abstract case of the division of labor at the cell level, the emergence of multi-cell types (which, again, is different than simply cell aggregation).
The division of cell types for the multi-cell type organisms is speculative, but it seems the haploid cell's role in the life cycle of the meiotic cells was to provide a means to generate unique progeny that were not clones of the parents (through recombination).

Cell aggregation in the form of colonies or even in the form of complex structures (the rudiments of a body plan), such as seen in the protist volvox or the 'transitional form' between fungus and animals in the choanoflagellates, possibly evolved independently of meiosis (which I consider the origin of different cell types). Cell aggregation is caused by gene products that connect cells together such as cadherins, actin, and microtubules. Hence one would suspect that the cell adhesion genes would be conserved among the multicellular lineages; however these great multicellular lineages of plants, animals and fungus do not conserve some of these genes, such as one of the largest gene families, the "receptor tyrosine kinase", and hence they have each evolved development 'from scratch' (by convergence), which suggests that the origin of development cannot simply be tied into the origin and evolution of body plans (plants have body plans in the form of features like roots, shoots, and leaves, which can be laid out in many ways. Are plant body plans related to the layout of animal body plans (like head, thorax, and tail)? For an excellent developmental comparative anatomy between plants and animals see Alberts [4], who shows there is a similarity of germ layers to the fundamental tissues in plants: epidermal cells, ground tissue cells, vascular tissue cells). Independent evolution of body plans between plants, animals and fungus does not mean we should not bother comparing their different life cycles and body plans or comparing their anatomies and physiology. The potential fact that these great lineages are all independent simply means that the powerful inferences (predictions) that can be made based on common ancestry are not valid (since their common ancestry is irrelevant if each one independently evolved their anatomical structures, hence the only constraints on them would be basic physics and chemistry and the constraints imposed by descending from their common protistan ancestor). However, their potential multicellular independence
is not as independent as some may suggest; there is still the deep homology in that they all still use meiosis and they all must deal with the complexities of transcription in the face of chromatin. Albeit, homology in meiosis may be of limited help in understanding their origins because plants have a 'cell wall' that is rigid, unlike the animal's permeable and agile cell membrane, suggesting that the production of gametes, and the form of the gametes themselves, is vastly different between plants and animals (and indeed pollen and how it 'grows' into the female pistil leading to the ovum is very different than sperm fertilizing an egg in animals).

1.2.2.2 Fly Development

The coarse-grained (in space and time) molecular dynamics of how a fly is self assembled from a maternally laid egg, a single cell, is coarsely understood at the molecular level thanks to developmental genetics. This does not explain or prove anatomical evolution; however, if gain or loss or modification of master genes (which encode transcription factors acting in early development) or of gene regulatory binding sites that interact with the master genes did result in modification or gain or loss of anatomical features, then this would convincingly explain the evolutionary mechanism of anatomical features 22.

22 This type of mechanism that has huge leaps in modification of an organism was observed in the fossil record by Simon Conway Morris (among others), a palaeontologist, which he called the Cambrian Explosion, and is in contrast (contrast does not mean conflicting or contradictory) to the ubiquitously accepted 'gradual accumulation' of beneficial mutations that results in adaptations between particular species of organisms (this was even known by plant and animal breeders before Darwin's theory). The idea of saltation was called 'punctuated equilibrium' by Stephen Jay Gould, where periods of gradual accumulation are punctuated by bursts of major anatomical feature novelty.

In the 1960s J. Monod, F. Jacob, and others working with the bacteria E. coli discovered that transcription factors can affect gene expression by activating or repressing a gene, where the transcription factor itself was activated by environmental cues (such as sugar) [68]. This would mark the start of the field of gene regulation, by identifying certain types of proteins, and hence genes, that are not enzymes or structural in nature. Rather, these regulatory proteins, in a sense encoded in master genes, act as factors that modulate transcription and hence gene expression. This would be the central theme in developmental genetics, which was quickly recognized after Monod's discovery by E. Davidson, the developmental biologist, working in collaboration with a nuclear physicist, R. Britten [23].

In Drosophila development it is known that after fertilization of the egg and after cleavage (mitosis without a G phase (cells do not Grow bigger)) within the blastocyst, maternally laid transcription factors (such as Bicoid) zygotically regulate other transcription factors (such as gap genes (which are transcription factors), which in turn regulate pair-rule genes (which are transcription factors), which in turn regulate genes (like HOX genes, some of which are transcription factors)) that ultimately feed into signalling networks and paracrine factors that lead to the construction of anatomical features.

The initial maternally laid transcription factors are not ubiquitously expressed throughout the egg chamber; rather, like a Fourier series, the maternal transcription factor protein forms a coarse pattern across the embryo, where the protein concentration is like a square wave of concentration as a function of space that forms a term in a Fourier series (I mean discrete summation of a few signals). This transcription factor protein may act alone to activate a gap gene (like hunchback) or may act with another maternally laid transcription factor whose pattern is also like a square wave (with a phase shift), thereby adding another term to the series. The addition of the two input signals results after a bit of time in an additional new pattern across the embryo in the form of a gap gene's protein concentration.
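The square-wave picture above can be made concrete with a toy calculation. The following is a minimal sketch, not a model of any measured gradient: the two input patterns, their boundaries, the 20-point grid, and the activation threshold are all hypothetical choices made only to show how adding two phase-shifted coarse patterns yields a new, narrower pattern.

```python
def square_wave(x, start, stop):
    """Return 1.0 inside [start, stop) and 0.0 elsewhere (a crude coarse protein pattern)."""
    return 1.0 if start <= x < stop else 0.0

N_POINTS = 20
# Position along the embryo axis as a fraction of its length (0 and 1 are the two ends).
positions = [i / N_POINTS for i in range(N_POINTS)]

# Two hypothetical maternal inputs: the second is the first shifted in phase.
factor_a = [square_wave(x, 0.00, 0.50) for x in positions]
factor_b = [square_wave(x, 0.25, 0.75) for x in positions]

# "Addition of the two input signals": the summed input has three levels (0, 1, 2).
combined = [a + b for a, b in zip(factor_a, factor_b)]

# Assumption: the hypothetical gap gene is switched on only where both inputs overlap.
THRESHOLD = 2.0
gap_gene = [1 if level >= THRESHOLD else 0 for level in combined]

print("".join(str(g) for g in gap_gene))  # one character per position along the axis
```

With these choices the output pattern is a single stripe between 25% and 50% of the axis, narrower than either input, which is the sense in which each added term refines the spatial pattern.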
Hence as time progresses more complex patterns appear as the gap genes interact (primarily like addition of waves in a Fourier series), resulting in patterns like a sine wave (where, again, the amplitude is the amount of protein at a particular time and the horizontal axis is a spatial axis of the embryo (the wave is in space not time), where the embryo's axis length is fixed in time). Hence, after cleavage in development, the totipotent cells of the blastocyst are in a sense transforming into more specialized cell types, where the cell is defined by the amount of each specific gene product's protein concentration at a particular position of the embryo, where position is emphasized to stress that the concentration patterns are over space (the location in the embryo). These differentiated cells will divide further and differentiate further as the embryo starts to take the form of an adult segmented fly. The chain reaction of gene interactions that begins with the maternal transcription factors that activate other transcription factors, which in turn activate other factors, gives a molecular dynamics description of early development, and hence answers the question of how to build 'from scratch' a body plan (the arrangement of anatomical features).

'From scratch' is misleading and ambiguous when discussing body plans of animals.
There are two issues the phrase needs to invoke, ontogeny (development) and phylogeny (evolution). Disentangling the ambiguity: first, we see how an adult fly arises from a single fertilized egg using the model system Drosophila. Second, how did single celled organisms evolve into the diversity of all multicellular organisms? It is in phylogeny and evolution that the question of how to evolve an animal from scratch is misleading. Evolution rarely builds from scratch (i.e. independent evolution such as convergence or parallelisms); body plans are thought to derive through the reconstruction of a set of modules (developmental networks of genes) that each encode for anatomical features (hence all eyes, for example, are homologies at a developmental level through the master gene (transcription factor) pax6 that binds to a set of genes that set a 'totipotent' or partially programmed cell to start to 'develop' the eye imaginal disc (the set of cells that form the basic outline of an eye, which when induced will form an eye)). This induction event can occur at any part of a body (even in the tail if so induced) [51].

The ideas of a totipotent cell and cell specification are necessary to explain development, and hence are a necessary component of an extension to the modern synthesis. I discussed cell differentiation and cell lineages above, in the section on the origin of multicellularity. Of course, the initial body with many of the modules (urbilateria, the ancestor of all animals) or its fungal ancestor still needs to be explained, as in how one builds it from scratch (this is being done using choanoflagellates and brine shrimp - artemia), but most people are making inferences with a hypothetical urbilateria that can explain the diversity of animals by the evolution of gene regulatory networks. Furthermore, the statement that anatomical features are modular at the genetic level is not obvious, as many believe complex features (anatomical structures) contain highly correlated genetic networks, such that any genetic mutation would be deleterious and probably lethal. The discovery of the extent of modularity (or the extent of non modularity through 'induction' by signalling and paracrine factors) I will discuss in the section below on 'modularity'.
1.2.2.3 The crowning jewel of Evo-Devo: the HOX genes

The work of Ed Lewis (a PhD student of Alfred Sturtevant) on body-transformative genes that were 'saltationary' 23 helped provide evidence and a theoretical framework (extending evolution theory) for the statement that diversity in the animal kingdom (in particular the segmented insects) is due to evolution of gene regulatory networks. The "Hox" genes were a group of genes that caused 'Homeotic transformations', which means a transformation of a body part of an animal 'into' another part, like changing a pair of legs into a pair of antennae. These mutant transformations were known as early as the late nineteenth century to some biologists such as Bateson (who also coined the word 'genetics'), who had catalogued them in various animal groups such as crabs and flies etc. An exemplary gene in the HOX group is 'bithorax', which was discovered in Thomas Hunt Morgan's lab around 1915, and a lineage of these mutants has been preserved since then. The gene is named after its mutation that led to a mutant with two (bi) thoraxes (like an abdomen). Lewis found that further crosses with other mutant lineages led to double the number of wings of the fly (since the thorax is where the wing's developmental source occurs, that is, the pool of cells that each have the right genes turned on, a form of programming; these 'programmed' cells during early development are the so-called 'imaginal disc', and in this case the wing imaginal disc 24), and hence is a homeotic transformation. The Morgan lab didn't know the molecular genetic basis behind bithorax (what DNA sequence had changed (or possibly networks of DNA sequences) from the wild type to the mutant), but they did know the mutation was inheritable, and hence was genetic. They also were able to 'map' the location of these genes (before they even knew genes were made of DNA) to locations on the chromosome due to a technique developed by Morgan's undergrad student (Sturtevant), who had inferred that genes must be linearly arrayed along the chromosome, which would suggest 'linked' genes (or linearly close genes along the chromosome) would not recombine that frequently during meiosis [120].

23 Saltation is the idea that within just one generation major evolutionary transformations could occur, possibly even speciation; taking the idea to its extreme would be like an ape having a baby human, a 'hopeful monster', hence in one generation a speciation event occurred. Of course this example uses the least derived (the organism with the fewest features) as the poster child for the ancestor (e.g. algae for protist, sponge for monoblasts, jellyfish for diploblasts, worms (like nematode) for triploblasts, ape for Hominoidea...).

24 Place a 'wing imaginal disc' in the location of the head, and you get wings growing out of the head; or place a 'leg imaginal disc' (the cells 'programmed' to build a leg) where a wing is supposed to occur and you'll see legs 'grow' or further develop where the wings were supposed to be.

Armed with the knowledge of the location of the genes that contributed to the so-called bithorax complex (due to Sturtevant's mapping technology), Lewis created a model of how the bithorax HOX genes evolved from an ancestral gene through tandem gene duplications (which is largely thought correct), which culminated in his theoretical paper in 1978 [82]. Lewis knew through his own genetic gain and loss of function assays the behavior of HOX genes when crossed with various mutant lineages (due to classical genetics Lewis could create hybrid flies, through recombination during meiosis, to create crosses of known mutant HOX lineages of flies such as the bithorax line). From these experiments he constructed a Wolpert-like gradient model (e.g. the French Flag Model) of how the HOX genes would interact like a Fourier series: one HOX gene would be coarsely expressed, thereby turning on another HOX gene in the complex, then the two of these would work in tandem to turn on a third HOX gene, then the three would all work in unison to turn on the fourth HOX gene. Central to his hypothesis was not just that the consecutive combinations of the HOX gene products activate new HOX genes in the complex, but also that each newly activated HOX gene would repress the most recently activated gene, thereby creating specific patterns of gene expression within the cells of the embryo. These unique gene products across the embryo were isolated in segments in space (he could literally see the segments, as these were gross anatomical features each containing 1000's of cells in early development).
Hence, Lewis had proposed a developmental mechanism for how each segment contained different sets of gene products, and thereby explained how each segment would set in motion the signalling and paracrine factors that would eventually cause the segment to form an anatomical feature like a leg, wing, head, or tail. The HOX genes are transcription factors (master genes) that target the genes necessary in signalling pathways and paracrine induction. In short, Lewis had a hypothesis for how anatomical features were built from the segmented larva, but more importantly, he saw an evolutionary mechanism for saltation or macroevolution in his model of the HOX genes; the basis of HOX genes is that if you mutate them, then the segments of the fly would change in such a way as to suggest that the anatomical features that decorate the segments (antennae, wings, legs, eye, abdomen etc.) were all just one ancestral anatomical feature, like a leg for locomotion, that could be adapted or modified for further purposes, like an antenna for sensing the environment, or a mandible or claw for killing prey.

It is now known that there are just three HOX genes in the bithorax complex, while Lewis had proposed more than this because of various mutations in different locations of the same gene. The HOX genes display various splice sites and various gene regulatory elements, where mutations in a particular exon or particular transcription factor binding site would cause different phenotypes. Although Lewis didn't get everything right, to a large degree he had proposed correctly one of the greatest theories of biology: how genes 'regulate' development (by master genes turning on other genes necessary for building the anatomical features), and how these genes, when mutated, can lead to major rearrangements of anatomical features.
The question of how the segmented larva developed from the fertilized egg was being elucidated in parallel with Lewis' work, through experiments on early development to identify genes involved in forming the body structure. Partly through the extensive genetic mutation experiments of Christiane Nüsslein-Volhard and Eric Wieschaus that identified embryonic lethal mutants (and therefore genes active in development), it became clearer that there were maternally laid and zygotic genes (transcription factors) that caused Anterior Posterior segmentation in the larva from the humble beginning of the fertilized egg. These genes are known as the gap genes and the pair-rule genes, due to the mutant patterns that they would cause in the early embryo and larva [97].

Although a genetic description was emerging of how the fly develops, it was not possible to isolate the gene products of flies at tissue-specific locations at the time of Wieschaus, Nüsslein-Volhard, and Lewis, due to the gene's protein being inside of a multicellular environment^25. However, primitive techniques for isolating genes were being developed thanks to recombinant technology, which grew largely out of the discovery of bacterial enzymes that could cut and paste genes from chromosome segments (which are part of bacteria's primitive immune system). This allowed foreign DNA to be designed and inserted in bacteria, creating vast amounts of it by simply allowing the bacteria to divide (to grow colonies): genetic cloning. In addition FISH, fluorescent in situ hybridization, was being recognized as a method to isolate genes through cDNA.

Work by McGinnis and Levine in Walter Gehring's lab, and work by others, in the early 1980s, possibly motivated by Lewis's hypothesis that all the HOX genes are paralogs, led to the development of a molecular optical in situ hybridization technique that allowed one to visually identify and isolate the region (tissue specificity) of expression of each HOX gene in specific segments of the fly. They wanted to know, for instance, whether a HOX gene like Antennapedia was really just expressed in the head (where the antennae appear) [89]. As Lewis had hypothesized, the HOX genes were closely related (they were paralogs), and all of them contained the homeobox, a stretch of about 180 base pairs that encodes a DNA binding domain of about 60 amino acids (the homeodomain).
25 For example, Lewis knew the locations on the chromosome of the genes that caused homeotic mutations, and what segments of a larva were affected by the mutation; but Lewis simply didn't know to what extent that gene's product caused the mutation (it obviously could have been through some convoluted web of interactions).

A combination of the new microscopic technique, along with genomic sequencing and bioinformatic pattern recognition techniques, has now allowed developmental biology to advance quickly by elucidating each gene expressed in each cell and, if that gene's product is a transcription factor, where that factor binds within the genome, which determines the targets of the factor.

1.2.2.4 Evolution of body plans in animals

The diversity of the animal kingdom cannot be explained by standard gene molecular evolution (evolution of genes encoding proteins); the differences between proteins in animals are not sufficient to explain the diversity in animal phenotypes [73]. It is thought there are about 25,000 proteins, and these proteins (for example, hemoglobin) are largely conserved across the entire kingdom. The diversity is now thought, to a large extent, to be due to the way these proteins are used in different amounts and in different combinations in different cells and body parts of different multicellular organisms. Changing the amount, or creating combinations, of various proteins in a particular cell in an organism is accomplished by gene regulation, along with expansion of genome sizes (from the early bacteria and protists) through gene duplication, allowing roughly the same protein to adapt to a new niche in the organism. Hence, in short, the diversity is due to the evolution of the elements of gene regulatory networks, where the transcription factor binding site is THE fundamental unit.
1.2.3 Gene regulation

1.2.3.1 Conserved Gene Regulatory Networks

A major aim of biology is to understand what elements of biological systems are conserved between distantly related organisms. Similar to the predictive power of theoretical physics, where general theory describes matter through a consistent web of rules and laws that can be combined in various ways to derive new laws with new predictions, in theoretical biology the general theory is evolution, which has various rules (statements) that can be consistently combined to arrive at various conclusions and predictions. Unfortunately, evolution's predictive power (in terms of comparison) is limited, to a large degree, to cases where genetic inheritance governs a particular group of species. This is why modern biologists are cladists: genetic inheritance allows one to interpolate that all the unobserved organisms since the last common ancestor of a data set of organisms have these features in common. Therein lie the powerful predictions found in biology: not only are the few sparse observations descriptive of the data, but they also provide information about all of the unobserved organisms since the last common ancestor of the data.

A theory of life could have been such that each organism (species) has no relationship to any other species. Imagine life forms independently forming in parallel: a many roots hypothesis. Such a world has no tree of life, but a disparate graph of life. In this world every species originated by self-assembly following the laws of physics. In such a world it would be very hard to analyze how each of the complex life forms works, as each organism would possibly have vastly different sets of physical and chemical processes governing its dynamics. Such mind-boggling complexity is not the world we live in. Rather, the evolutionary theory of life that governs our world is inheritance through genetic material, which results in conserved features that are encoded in the genes.
Hence, in development a major aim is to understand what gene regulatory networks are conserved, thereby providing laws and rules of regulatory networks that are not just specific to a particular model organism, but apply to entire clades. A number of gene regulatory networks have been shown to be conserved, along with particular strategies in terms of circuit types (e.g. the feed forward circuit). Briefly, I will review two major themes in gene regulatory networks (axis polarity and patterning) and their extent of conservation.

1.2.3.2 Anterior Posterior (AP) axis formation

The AP HOX genes are conserved. However, the maternal element Bicoid is derived, which seems contradictory to the idea that, for master genes, the earlier they are turned on in development the more effect they will have on downstream components of developmental pathways. The Bicoid protein gradient is formed during oogenesis (the maternal process of producing the unfertilized egg), where maternally laid mRNA becomes localized to the anterior portion of the egg casing. The egg casing is a multicellular structure in which, upon completion of oogenesis, all of the cells but one (the egg) dissipate and form part of the yolk surrounding the egg's nuclei, all of which is inside the egg casing^26.

26 I'm considering the 'nurse' and 'follicle' cells as a part of what I'm calling the 'egg casing'.

Master genes such as Bicoid are usually thought to be highly pleiotropic and epistatic (they form network hubs) in that they causally interact with many target genes (which in turn will interact with further genes). The fact that Bicoid is not a conserved element of development across the animal kingdom suggests that even the upstream elements of developmental pathways are modular (can be substituted), or that there are multiple ways (paths) that lead to the same developmental outcome.
Although Bicoid is not conserved, some of the downstream elements of the AP developmental pathway are conserved, most notably the HOX genes, which have been shown to be present even in chordates (which include the vertebrates, and some invertebrates such as tunicates). The conservation of the HOX genes in mammals and insects was confirmed not only by sequence similarity (i.e., homeobox sequence conservation), but also in that the HOX genes' expression patterns are similar during development and play the same functional role in laying out the location of anatomical features such as head and tail. This indicates that the genetic instructions for how the embryo 'knows' where to form the body structures form an ancient conserved network that has been so useful to the animal kingdom that it has been preserved since the Cambrian explosion^27. One of the seminal experiments that confirmed the genetic role of the genes that shape body plans in the animal kingdom was with eyeless and pax6 in 1994 in Walter Gehring's lab [105], where an insect gene acting early in development that is known to cause loss of compound eye formation (eyeless) was replaced by a similar gene found in mouse (which controls where the mouse eye forms, an eye of a different form than the compound eye). Using molecular biology techniques the endogenous insect gene was replaced with the mouse gene, and it was found that the fly still formed compound eyes using the mouse gene, suggesting that eyeless is an ancient gene that instructs the embryo to form an eye. What kind of eye? That is not a question answered by body plan genes. The developmental genes necessary to instruct the embryo where each anatomical feature goes (the master genes) are fundamentally distinct from the genes that are critical components of the actual anatomical structure. For example, the gene eyeless is found (expressed) during the development of the eye (i.e., it is expressed during development in locations of the embryo where an eye will form), but it is not found (not expressed) in the fully functional eye (the adult structure). Hence, the body plan genes are similar to bankers or stockbrokers in a financial system, who don't actually produce any tangible products like music or a cup of coffee; rather, they help set in motion the complicated web of interactions that can bring these commodities together in a harmonious manner^28.

27 Using the construction of a home by a master builder (maybe an architect) as a metaphor, the master genes (i.e. the HOX genes) don't tell the developmental program 'how' to build a wing; they tell the program 'where' to put the wing in the home. Hence 'master builder' is a bit misleading as a metaphor: they don't build anything, they simply give crude locational cues of where certain features should begin to be built.

28 Some patterning genes, like dpp, are master genes that play the role of putting disparate objects together in harmony, but they also play other roles later in development, such as in the literal construction of body parts (such as the nervous system for dpp).

Further evidence of this finding came from modifying the expression pattern of this gene (ectopic expression), which would cause eyes to form at different locations of the body [48]. Hence, the HOX genes are all examples of master genes (transcription factors) whose role is in regulating how the animal is to build itself into the arrangement of body parts that helps each animal species.

1.2.3.3 Dorsal Ventral (DV) axis formation

The early steps necessary in forming the Dorsal Ventral axis of the fly embryo, much like the AP network, are set in motion during oogenesis. In the case of DV, at the same time that the Bicoid mRNA is producing a gradient in the unfertilized egg (oocyte), in parallel a gene called gurken^29 polarizes the DV axis of the encasing egg by localizing to one side of the future embryo. The side that has the localized mRNA becomes the 'end' of the Dorsal axis, and the side directly opposite becomes the Ventral side of the future embryo.

29 The mRNA of gurken is transcribed from the egg's nucleus, unlike Bicoid, where the mRNA was maternal mRNA.

Shortly after Gurken localization and DV axis polarization, a maternal gradient of Spätzle, created from a series of protein complex degradations or splits, induces the 'Toll pathway', where Spätzle binds to the Toll transmembrane receptors that signal to a maternal complex of
Cactus and Dorsal protein, which in turn causes the complex to split; the protein Dorsal then displays a nuclear localization signal that allows it passage into nuclei, where Dorsal acts as a transcription factor. Hence the more Spätzle, the more Dorsal is unlocked from the complex, leading to a Dorsal nuclear concentration gradient that mirrors the Spätzle gradient.

In early development Dorsal targets a number of genes that are actively turned on and turned off in the neurogenic ectoderm, as shown in Figure 1.1. Unlike most master genes, Dorsal in turn is activated by a signalling network (not by other upstream master genes). Furthermore, Dorsal interacts in early development with cactus, one of the 'maternal effect genes', a set of genes discovered by Nüsslein-Volhard and Anderson in 1984; the initial egg's Cactus protein is translated from maternally laid mRNA of the cactus maternal effect gene.

Figure 1.1: Circuit diagram of the Dorsal gene regulatory network in the neuroectoderm tissue of early development, designed using Stathopoulos' template [116] provided at the Marine Biological Laboratory's Gene Regulatory Networks summer course directed by Eric Davidson and Mike Levine.

Dorsal acts as a coarse-grained regulator of the DV axis by forming roughly three different cell types from the totipotent blastula.

These three cell types are components of the 'germ layers', or territories of transient tissue (three primitive tissues seen in all 'triploblasts': the animals with bilateral symmetry that have both a mouth and an anus, i.e. that go through gastrulation or gut formation). Dorsal helps form the mesoderm, neuro-ectoderm and dorsal-ectoderm, whereas the endoderm is partly constructed by AP genes. The endoderm is the first evolved germ layer that distinguished inside from outside in primitive 'diploblastic' organisms: animals with radial symmetry, like cnidarians, that only have one hole (no full formation of a gut) that serves as both a mouth and an anus (like jellyfish), where the ectoderm served as the 'outside' of these primitive organisms (hence the mesoderm and the 'triploblast' clade are a more recent evolutionary invention than the 'diploblasts').

Just as Bicoid is a derived feature in flies (arthropods), similarly Dorsal's use as a master gene in flies is a derived feature; neither of these patterning mechanisms is utilized as a component of developmental pathways of other organisms such as chordates. The seminal experiments on germ layer creation in chordates were those of Spemann and Mangold, which revealed that the source of the germ layers was localized to one side of the vertebrate embryo (they used translucent newt embryos). A transplantation experiment revealed that the analog of the germ layers constructed from the Dorsal transcription factor's gradient, where Dorsal binds to CRMs of DV target genes, was occurring in a small localized region of the newt embryo (completely different than in insects). Spemann called this region the 'organizer'. What Spemann and Mangold had discovered was that, to a large degree, vertebrate embryos form germ layers through a developmental process called induction (tissue-tissue interactions). Previous work by Spemann using an artificial contractile ring to split the embryo in half had revealed that the embryo's almost unobservable (at that time) internals had an asymmetric feature. This feature was the localization of the mesoderm, the only germ layer defined early on in chordate development. The mesoderm cells secrete a ligand that is received and recognized by the adjacent totipotent cells, 'inducing' these cells to become the endoderm. Hence induction, at a cellular level, is simply signalling (i.e. cell-cell interactions, where one cell synthesizes and secretes a molecule that is received by the other cell, wherein it triggers a transcription factor to target specific genetic loci). Hence in chordates (and basically in everything but arthropods), the germ layers are built from cell-cell signalling.
Although I believe it is still unknown whether a homolog of Dorsal is active in early vertebrate development, the targets of the Dorsal transcription factor are also found active in vertebrates [122]. Twist and Snail are found active in the mesoderm (the Spemann Organizer), and most famous are dpp and sog, whose homologs form an antagonistic gradient in vertebrates that patterns the notochord (albeit in chordates there has been an 'inversion', which, for example, leads to hearts being ventral in vertebrates while dorsal in most invertebrates). The notochord is a derived chordate feature that forms just below the neural tube (the future brain and central nervous system that becomes decorated by the vertebrae)^30.

The fact that Dorsal is not found to be the master gene that patterns the vertebrate organizer (the germ layers) is very perplexing, just as confusing as why Bicoid is not found upstream of the developmental pathway that leads to activation of the HOX genes in vertebrates. Furthermore, vertebrate development, to a large degree, seems very different from the localization of transcription factors to specific regions of the embryo (protein gradients), where the different germ layers emerge at specific locations in space due to the various interactions of the protein (morphogen) gradients. Vertebrates use 'induction' (cell-cell interactions) early on in development as a patterning mechanism, to induce the formation of various tissue types. Hence, unlike in flies, where all three layers form somewhat simultaneously, in vertebrates the mesoderm forms first, which in turn induces the development of the endoderm (albeit the mesoderm targets of Dorsal in fruit flies are activated first, where Snail's expression border helps differentiate neuroectoderm tissue from mesodermal tissue).
30 Tunicates develop a gelatinous notochord in early development, suggesting these primitive sea organisms are closer to humans than flies are, showing that not all paths of evolution lead to complexity.

The resolution of the mysterious fact that the earliest events in embryogenesis for flies are not conserved among other animals (i.e. Bicoid and Dorsal are believed to be derived) comes from the observation that 'eggs' (oocytes) are under high selection pressure. Hence evolution has acted on eggs such that they can take advantage of their particular niche or environment (see page 90 of Davidson, which discusses 'Maternal Anisotropy and External Cues for Axial Polarity' [35]). Of course sperm is also susceptible and exposed to evolutionary forces; however, the strategy found in animals is that one of the gametes is much larger (and hence much more important for quantitative/numerical reasons) than the other gamete. Within almost all eggs, before fertilization by the sperm, the egg has a cytoplasmic structure that localizes various molecules. Hence the polarity of the embryo in many cases is simply inherited through the egg. The cytoplasmic structure of the egg affects how it (the egg cell) will divide once fertilized. These early cell divisions (cleavage) in turn give rise to various strategies for cells to divide while avoiding the thick viscous yolk of the cytoplasm, and these different strategies can affect the early events that lead up to gastrulation. Hence, as Davidson points out, before gastrulation many gene regulatory networks are not expected to be conserved, and indeed are not.
Drosophila's early nuclear cleavages in the syncytium are a unique construction of the blastoderm (the pre-gastrulation structure). They stand in immediate contrast to almost all other animals' strategy for building a blastoderm. Almost all animals form a blastoderm by cell divisions, where the cell-cell juxtapositions frequently lead to signalling and inducing events before gastrulation. This makes Drosophila, in a sense, a terrible model for understanding pre-gastrulation development (since it is unrepresentative of animals at this point of development). However, by pushing Drosophila embryogenesis back to the oogenesis stage, more resemblance can be found between Drosophila development and that of other animals. For example, Dorsal becomes nuclearly active due to Spätzle, which is a signalling molecule. Spätzle is synthesized and secreted by nurse and follicle cells. By thinking of the mother's follicle and nurse cells as simply a tissue, we see that Dorsal activation is simply an 'induction' event, where one tissue induces another 'tissue' (the egg). Induction is typical of vertebrate development (see page 175 of Davidson's text [35]). The reason that cell-cell signalling has, in a sense, been pushed back to oogenesis in Drosophila for axis specification is, I believe, unknown. However, it is conceivable that ancestral arthropods were frequently exposed to immune insults while laying their eggs, leading to innate immunity activation (which utilizes Rel proteins, like Dorsal), and that the frequent coexpression of Dorsal in early development was then coopted into patterning the germ layers (what Davidson calls 'territories'). However, the importance of Rel pathways in early development has been shown to have some conserved elements between Drosophila and Xenopus (the African clawed frog, a vertebrate model system), where Rel knockouts of Xenopus became dorsalized [6] in the frog's early development. This suggests that Dorsal's use as a master gene in arthropod development is not derived from scratch (divergence). It is indeed not that easy for evolutionary forces to modify the very first steps (early master genes) in the development of animals.
1.2.3.4 Modularity in gene regulatory networks

The time point of development where Dorsal begins to act like a master gene (a transcription factor) is the same point in time when Bicoid is causing its targets to become expressed. How is it that each developmental pathway for axis specification (DV and AP) is simultaneously operating in a given cell of the embryo? The modularity of the transcription factor binding sites allows AP target genes to only have DNA binding sites for Bicoid, and similarly DV target genes to only have DNA binding sites for Dorsal. Likewise, Bicoid works in tandem with some of the genes that it activated in the first place (such as Hunchback and Giant) in feed forward circuits, where these AP transcription factors act together combinatorially by binding to their target DNA binding sites, which co-occur in short segments of DNA (about 300 bp, the cis regulatory modules (CRMs)) that are near the target gene. By excluding Dorsal and other DV master transcription factor binding sites from these CRMs, the AP developmental pathway can proceed in a particular cell in the embryo independently of the DV developmental pathway that is also operating in that same cell. Hence, these circuits, or components of the developmental pathways that are 'patterning' the embryo (fating the cells through cellular differentiation), are modular. Late in development, once the nervous system has grown, the modularity seen in laying out the body plan that has organized the anatomical features largely disappears, as the body becomes an integrated system that is highly interdependent and controlled by the nervous system.

1.2.3.5 Transcription factor binding sites

Gene regulation occurs at various temporal and spatial levels. Within a cell, the amount of a particular protein can be regulated at the 'translation' level by regulating the amount of mRNA transcripts that are translated. However, transcription is the process that is predominantly regulated during development.
At the finest spatial level, transcription factor binding sites provide the docking area for transcription factors to temporarily bind and in various ways modify the transcription of nearby genes. The binding process physically depends on the concentration of the transcription factor and the number of binding sites that reside within a CRM. The flanking sequence of a CRM acts as a competing location for the binding of the transcription factor, hence knowing the size of a genome (i.e. the number of binding sites) is equivalent to knowing the chemical potential. Thermodynamically a transcription factor follows the chemical potential gradient, hence unoccupied sites in the genome compete for binding of a given transcription factor.

CRMs have evolved strategies to recruit transcription factors by creating highly specific binding sites that reside within the CRM. These very specific binding sites provide a potential energy 'well' that captures the transcription factor due to the very low binding energy of the well (compared to the high energy of docking at the flanking sequence of the CRM). Furthermore, 'binding cooperativity' provides a further drop of the potential well for a given transcription factor if a particular cooperating factor is bound nearby.

Chapter 2

2.1 Introduction

The 'particle' abstraction of classical mechanics reduces the many degrees of freedom of an extended material into a single point in space and time [75]. A similar abstraction is useful for treating the process of regulation by transcription factor proteins of gene regulatory networks. In such a model, the entire genome is seen as a one-dimensional lattice where each lattice 'site' is like a type of static particle with a coordinate along the genome, and where the site is a short sequence of DNA, ranging from a single base-pair to a coarse-grained extended sequence of DNA. Each such site can be defined by its specific logic, given by the interactions that are relevant for regulating transcription [33,7,76]. This logic, encoded in the type of site, is an inheritable trait. Furthermore, evolution of regulatory sites changes the logic, which is known to cause major transformations of animal body plans [30,127].
Understanding this logic, at a sequence level, has produced state of the art phylogenetic models at the phylum level that allow us to better understand our deepest homologies with the rest of the kingdom.

2.1.1 Position Weight Matrices

Commonly, estimating the nucleotide frequencies of functional transcription factor binding sites is achieved by aligning experimentally verified functional sites of length $s$, and counting the frequency of each nucleotide at each position. These counts can then be used to infer the distribution of functional binding site sequences. The inferred distribution of functional sequences is called a Position Weight Matrix [118]. For example, for a length $s$ binding site, the probability that the binding site has the sequence $S$ is:

$$P(S) = \prod_{i=1}^{s} \prod_{j=0}^{3} P_{ij}^{S_{ij}}, \tag{2.1}$$

where the sequence $S$ is represented by the matrix of indicator variables $S_{ij} \in \{0, 1\}$ (Boolean variables), and $P_{ij}$ is the probability (maximum likelihood estimate from the frequencies) to find base $j$ at position $i$, with $i \in \{1, 2, \ldots, s\}$ and $j \in \{0, 1, 2, 3\}$, such that each integer represents a letter from the alphabet A, C, G, T.

Information-theoretic and related methods can then be used to relate these probabilities to linear (additive) logarithmic models or a discrimination function. For example, the energy PWM gives a bioinformatic score $E(S)$ for any sequence $S$. The energy of the sequence can be decomposed into a sum over each internal position of the sequence:

$$E(S) = \sum_{i=1}^{s} \sum_{j=0}^{3} E_{ij} S_{ij}, \tag{2.2}$$

where the binding site sequence $S$ is again represented by the indicator variable $S_{ij}$ for each position $i$ in the sequence and base-pair $j$, which selects the appropriate transcription factor-DNA interaction energies $E_{ij}$. We define the interaction energies $E_{ij}$ mathematically in Equation (2.10) below.
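As a minimal sketch of Equations (2.1) and (2.2), the following Python builds the position probabilities from a toy alignment and scores a sequence. The function names, the pseudocount, and the toy sites are illustrative assumptions, not taken from the text. In particular, the energy convention used here, $E_{ij} = -\log(P_{ij}/\max_j P_{ij})$, is only one common choice; the document's own definition is the one given in Equation (2.10).

```python
import math

BASES = "ACGT"  # integer j = 0..3 maps to A, C, G, T as in Equation (2.1)


def position_probabilities(sites, pseudocount=1.0):
    """Estimate P_ij from aligned sites of equal length.

    The pseudocount is an assumption added to avoid zero probabilities.
    """
    length = len(sites[0])
    probs = []
    for i in range(length):
        counts = [pseudocount] * 4
        for site in sites:
            counts[BASES.index(site[i])] += 1
        total = sum(counts)
        probs.append([c / total for c in counts])
    return probs


def sequence_probability(seq, probs):
    """Equation (2.1): product over positions of P_ij for the observed base."""
    p = 1.0
    for i, base in enumerate(seq):
        p *= probs[i][BASES.index(base)]
    return p


def energy_matrix(probs):
    """Assumed convention: E_ij = -log(P_ij / max_j P_ij), zero for the top base."""
    return [[-math.log(p / max(row)) for p in row] for row in probs]


def sequence_energy(seq, energies):
    """Equation (2.2): sum over positions of E_ij for the observed base."""
    return sum(energies[i][BASES.index(b)] for i, b in enumerate(seq))


if __name__ == "__main__":
    toy_sites = ["GGAA", "GGAA", "GGAT", "GGAA"]  # hypothetical aligned sites
    P = position_probabilities(toy_sites)
    E = energy_matrix(P)
    for candidate in ("GGAA", "GGAT", "TCAA"):
        print(candidate, sequence_probability(candidate, P), sequence_energy(candidate, E))
```

With this convention the consensus sequence scores zero and every mismatch adds a positive penalty, so $e^{-E(S)}$ equals the probability of $S$ relative to the consensus.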
2.1.2 In-vitro Biophysical PWMs

The energy weight matrix elements used in Equation (2.2) can be determined for each of the $4s$ matrix elements using an affinity assay. This assay is purely based on physical principles, completely blind to notions of "functional" (meaning adapted) binding sequences. The key measurement is the relative change in affinity to the transcription factor for all possible single mutation sequences from the highest affinity sequence [118,45,62,119,44]. Such an assay assumes that the highest affinity sequence (which we denote as $S_0$) is known. By choosing the highest affinity sequence as the reference DNA-transcription factor interaction, one can then construct the full set of relative affinities for full sequences (all $4^s$). Just as a key assumption of the PWM model was linearity in sequence, so too in this experiment we must assume that the binding energy is a linear function of the sequence. This assumption enables each of the three possible DNA mutations from the reference sequence at a particular position within the DNA binding site to be tested independent of the genetic background of the remaining positions within the binding site.

The theoretical justification that the binding energy is a linear function of the sequence is that the binding affinity constant $K(S)$ is equal to the exponential of the negative binding energy in units of $kT$, where $k$ is Boltzmann's constant and $T$ is temperature. The free energy, being a state function (i.e., an exact differential), then would result in the following relation for the displacement reaction: $\log K(S) = \log K(S_0) - \Delta G$, where the transcription factor was originally bound to sequence $S_0$ (the reference sequence) and then (by any physical process) is displaced and binds to sequence $S$. If we set the energy scale such that the highest affinity sequence bound to the protein has zero energy, then all other bound complexes have higher energies $G(S)$, hence $\Delta G = G(S) - G(S_0) = G(S)$.

Using the physical approach above, one can treat each mutation of a base from the reference sequence (the highest affinity sequence) as a perturbation of the reference sequence $S_0$.
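The bookkeeping behind such an assay can be sketched as follows. This is only an illustration under the additivity assumption stated above: the function names and the relative affinity values are hypothetical, energies are in units of $kT$, and the reference base at each position is assigned zero energy.

```python
import math

BASES = "ACGT"


def energy_matrix_from_single_mutants(reference, relative_affinity):
    """Build an s x 4 energy matrix (units of kT) from single-mutant data.

    relative_affinity[(i, base)] is K(mutant) / K(reference) for the sequence
    that differs from the reference only by `base` at position i, so only
    3 * s measurements are needed. Reference bases get zero energy.
    """
    matrix = []
    for i, ref_base in enumerate(reference):
        row = []
        for base in BASES:
            if base == ref_base:
                row.append(0.0)
            else:
                row.append(-math.log(relative_affinity[(i, base)]))
        matrix.append(row)
    return matrix


def predicted_log_relative_affinity(seq, matrix):
    """log K(S) - log K(S_0) = -sum of per-position energies (additive model)."""
    return -sum(matrix[i][BASES.index(b)] for i, b in enumerate(seq))


if __name__ == "__main__":
    reference = "GA"  # hypothetical highest affinity sequence S_0
    # Hypothetical measurements: every single mutant binds half as well.
    measured = {(i, b): 0.5 for i, ref in enumerate(reference) for b in BASES if b != ref}
    dG = energy_matrix_from_single_mutants(reference, measured)
    print(predicted_log_relative_affinity("TC", dG))  # double mutant, never measured
```

The point of the sketch is that $3s$ single-mutant measurements suffice to predict all $4^s$ sequences, but only if the positions really do contribute independently.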
By expanding the free energy in sequence space, we have

$$G(S) = \sum_{i=1}^{s} \sum_{k=0}^{3} G_{ik} S_{ik} + \sum_{i,j=1}^{s} \sum_{k,l=0}^{3} w_{ijkl} S_{ik} S_{jl} + \ldots \tag{2.3}$$

$$\approx \sum_{i=1}^{s} \sum_{k=0}^{3} G_{ik} S_{ik}. \tag{2.4}$$

The pairwise interaction term, $w_{ijkl}$, is a function of four indices, where indices $i$ and $j$ run over the positions of the sequence $S$, and the indices $k$ and $l$ run over the nucleotide bases. The indicator variables $S_{ik}$ and $S_{jl}$ select the appropriate pairwise interaction term $w$. The expansion in sequence space has a total of $2^s$ interactions; the approximation assumes all of these are negligible except the first order terms.

2.1.3 Evolutionary PWMs

Just as a phylogenetic analysis of genes can reveal subsequences that are important for the function or enzymatic activity of the protein, so too can phylogenetic analysis of binding sites reveal subsequences that are important for the binding function (affinity) of the sequence. Unlike cladistics, where a binding site alignment would only include a monophyletic group (sequences evolved from a common ancestor), and hence be hampered by patterns of conservation that are due to inheritance as opposed to adaptations, here we use a phenetic approach to alignment, based on Berg and von Hippel's phenetic approach [16], where convergent sites, paralogs, and orthologs are all used in the alignment to reveal conserved patterns in the DNA binding sites that are a consequence of the molecular properties that provide the binding phenotype.

A basic molecular evolution principle, initially formulated by Zuckerkandl and Pauling and later utilized by Dayhoff and by Kimura, is that neutral DNA accumulates substitutions at a reliable rate, such that neutral DNA can be used as a molecular clock. However, functional DNA's substitution rates (what Berg and von Hippel called the "base-pair choices") are correlated with the functionality of a site [16]. Hence, functional DNA under purifying selection evolves slower (if at all) than neutral DNA, enabling a comparative analysis of regulatory sequences by screening for conserved blocks of sequences, or "phylogenetic footprints" [115].
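Screening for conserved blocks can be illustrated with a deliberately simple sketch. It assumes the orthologous sequences are already aligned to equal length and treats a column as conserved only if every sequence carries the same base; real footprinting tools use substitution models and statistical thresholds instead. The function name, the `min_len` parameter, and the toy alignment are hypothetical.

```python
def conserved_blocks(alignment, min_len=6):
    """Return half-open (start, end) column intervals that are perfectly conserved.

    A column counts as conserved when it is not a gap and all aligned
    sequences have the same character there. Runs shorter than min_len
    columns are discarded.
    """
    length = len(alignment[0])
    blocks = []
    start = None
    for col in range(length + 1):  # one extra step to close a trailing run
        conserved = (
            col < length
            and alignment[0][col] != "-"
            and all(seq[col] == alignment[0][col] for seq in alignment)
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                blocks.append((start, col))
            start = None
    return blocks


if __name__ == "__main__":
    toy_alignment = [  # hypothetical aligned orthologous fragments
        "ACGTGGGAAATTCCAT",
        "TCGAGGGAAATTCCTA",
        "AGGTGGGAAATTCCGG",
    ]
    for start, end in conserved_blocks(toy_alignment):
        print(start, end, toy_alignment[0][start:end])
```

Requiring perfect identity is the strictest possible criterion; relaxing it (for example, allowing a fraction of mismatching sequences per column) trades specificity for sensitivity.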
Berg and von Hippel used these assumptions in 1987 to relate the empirical nucleotide counts from an alignment to theoretical binding site sequences under mutation-selection balance [16]. Theoretically, they assumed a binding site was constrained by the binding affinity necessary for binding (i.e., binding that affects gene expression) [18]. This constraint allowed them to use Jaynes's principle to derive a theoretical distribution known in physics as the Boltzmann distribution, which they could then equate to the empirical normalized counts from Equation (2.1). In this context, Jaynes's principle states that the information content of the set of binding site sequences (i.e., binding site sequence data in the form of Equation (2.1) and knowledge of the genome-wide frequencies, the prior, or GC content of the genome) should be minimized subject to the binding energy constraint [64].

For a simple example, consider a binding site of just one base-pair^1. The "Lagrangian" for the constrained minimization problem can be written as (the sum is over the nucleotides that base $B$ can take on)

$$\sum_{B} P(B) \log \frac{P(B)}{P_0(B)} - \lambda_0 \Big( \sum_{B} P(B) - 1 \Big) - \lambda_1 \Big( \sum_{B} P(B)\, G(B) - \langle G \rangle \Big). \tag{2.5}$$

1 Binding sites are frequently about 10 base-pairs long. A binding site of length one base-pair is not realistic for transcription factors, as most proteins would cover more space than one base-pair (about 0.34 nanometers along the helix axis). For an evolutionary argument for why binding sites are about 10 bp in length see [117], and for another argument see [111].

The first term is the information content of the steady state probabilities $P(B)$ relative to the genome-wide frequencies $P_0(B)$, the prior. The second term represents the normalization constraint over the probabilities (where the prior is assumed fixed), and the last term is the constraint that the average binding energy be fixed. Minimizing the Lagrangian leads to the theoretical estimate of the steady state distribution, $P(B)$, which takes the form of a Boltzmann distribution (e.g., see Equation (2.9) for a definition of the Boltzmann distribution).

The equilibrium frequencies, $P_0(B)$, are those expected of neutral DNA (e.g., the frequencies estimated from the Jukes-Cantor substitution model). Sites under selection are forced away from equilibrium, and form a steady state distribution $P(B)$. For a physical example, the relative frequency of a particular base $B$ is like a concentration, which when in thermodynamic equilibrium will be equal to the concentration of this molecule in the background. Assuming the background can be modeled as chemically random bases (A, C, G, T) [124], then in thermodynamic equilibrium the concentration of base $B$ will equal the background concentration of the respective base. In a steady state, however, the base frequency is forced to a concentration unequal to the background. Similarly, in an evolutionary steady state, there is a flux of mutations driving the population of binding sites to the random frequencies, but this is balanced predominantly by the flux from the selective pressure. In the population genetics sense, the steady state frequencies are the result of mutation-selection balance.

2.1.4 Relation between biophysical PWMs and evolutionary PWMs

As a consequence of their hypothesis that the normalized frequencies from an alignment of binding sites could be equated to the theoretical distribution of sequences under mutation-selection balance (the Boltzmann-like distribution) [16], Berg and von Hippel were able to derive a simple relation between their information theoretic logarithmic score $E(S)$ from Equation (2.2) and the known binding energies $G(S)$ of the binding sites to the transcription factor from Equation (2.3). Using the standard statistical mechanics relation $\log \frac{K(S)}{K(S_0)} = \log \frac{P(S)}{P(S_0)}$, where $K(S)$ is the binding constant and $P(S)$ is the Boltzmann-like distribution (see Equation (2.9) for details), observing that $\log \frac{P(S)}{P(S_0)}$ can be replaced by the normalized frequencies from the alignment, and defining the information theoretic score from Equation (2.2) as $E(S) = \log \frac{P(S)}{P(S_0)}$^2, one then obtains:

$$\log K(S) = \log K(S_0) - \frac{\Delta E(S)}{\lambda_1}, \tag{2.6}$$

where $E(S)$ is estimated from an alignment and $\lambda_1$ is the Lagrange multiplier of the binding energy constraint in Equation (2.5), which fixes the scale and sign convention between the alignment score and the free energy. ($E(S)$ is fully explained in our Methods section, where $\Delta E(S) = E(S)$, and similarly $\Delta G(S) = G(S)$, by choosing $S_0$ to be a reference.) The
linear relation above is of the same form as the thermodynamic perturbation expansion, Equation (2.3):

$$\log K(S) \approx \log K(S_0) - \Delta G. \tag{2.7}$$

This gives us a linear relation between the evolutionary substitution pattern (data from an alignment), $\Delta E$, and the free energy, $\Delta G$ (in units of $kT$).

2 Here we overload our notation for $P(S)$, which in one case is the empirical normalized frequencies from the alignment (which Berg and von Hippel denoted as $f(S)$), while in the other case of statistical mechanics $P(S)$ is a theoretical distribution parameterized by the Lagrange multipliers (which can be shown to be the thermodynamic temperature for systems like an ideal gas [8]). Here we do keep the derived variables $E(S)$ and $G(S)$ separate, in order to clearly see the relation between the bioinformatic score $E(S)$ and the free energy $G$.

2.1.5 Shortcomings of PWMs

Analyzing typical functional binding site sequences for a particular transcription factor reveals signs of a conserved pattern of nucleotides at specific positions within the binding site. However, because the sequences are short, false-positive matches to the pattern are expected to occur frequently in large genomes, too frequently for the time available for the protein to find all the sites. This kinetic search problem was also analyzed by Berg and von Hippel using one- and three-dimensional diffusion models [17], and has since been reinterpreted several times. In particular, Sela et al. showed that symmetries in DNA sequences flanking functional binding site loci can dramatically affect binding [111], which was later verified experimentally [3]. In the same manner, bioinformatic searches that use only the conserved patterns to discover new binding sites often result in poor predictions on a genomic scale [24].
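The scale of the false-positive problem can be estimated with a back-of-the-envelope sketch. It assumes a random genome with equal base frequencies and independent positions, which real genomes violate, and it counts windows that match a consensus with at most a given number of mismatches. The function name and its parameters are hypothetical.

```python
from math import comb


def expected_chance_matches(genome_length, site_length, max_mismatches=0, both_strands=True):
    """Expected number of chance matches to a consensus in a random genome.

    Assumes each base occurs with probability 1/4 independently. A window
    matches if it differs from the consensus at no more than
    max_mismatches positions.
    """
    # Number of distinct sequences within the allowed mismatch distance.
    matching = sum(comb(site_length, k) * 3**k for k in range(max_mismatches + 1))
    p_match = matching / 4**site_length
    windows = genome_length - site_length + 1
    strands = 2 if both_strands else 1
    return strands * windows * p_match


if __name__ == "__main__":
    # Hypothetical genome of 10^8 bp and a 10 bp site.
    print(expected_chance_matches(10**8, 10))                    # exact matches
    print(expected_chance_matches(10**8, 10, max_mismatches=1))  # one mismatch allowed
```

For a hypothetical genome of $10^8$ bp and a 10 bp site, this gives roughly 190 exact chance matches on both strands, and the count grows about thirtyfold once a single mismatch is tolerated.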
Another limitation of the model is that in development, heterotypic clusters of binding sites (rather than isolated sites) govern gene expression. Hence, binding site sequence matches to a motif that occur in an isolated locus within a genome (i.e., not within a cluster of other binding sites) are incapable of recruiting the complexes necessary for transcription, and these isolated loci are unlikely to be functional. The functional sequence distribution therefore simply does not contain enough information to make a one-to-one map to the functional loci [107]. Furthermore, in eukaryotes, binding is modulated by the chromatin state of a locus and the cellular state that the genome resides in. These epigenetic cues and other external variables that affect binding are not usually encoded into the binding site sequences, and give rise to departures from the linear assumption inherent in the PWM model.

Evolution in development has repeatedly evolved new combinations of binding sites, producing new types of logic regulating gene expression [49,34,35]. Traditional bioinformatic sequence tools for discovering binding sites in developmental systems can discover the low resolution segments (500 bp) of regulatory DNA that contain clusters of coevolving binding sites, CRMs, by simply using clusters of motifs [25]. However, determining which sequences within the CRM are functional is difficult. For example: is the spacing between sites functional, is the ordering of sites functional, what about 'half sites' or sites with mismatches, and what is the number of mismatches allowable before a sequence is not functional? Tedious genetic experiments must be conducted in order to discover which sites significantly contribute to gene expression [35].
For example, the in vivo binding site contribution to gene expression can be understood by comparing the expression of a target gene driven by a wild-type CRM with a knockout of a putative binding site. However, this is complicated for a number of reasons: first, binding site turnover within CRMs leaves remnants of functional sites such as "half sites" that have partial matches to motifs [31]; second, multiple half sites (which are easier to evolve) may be able to compensate for a strong full site. Therefore, even with a defined functional CRM, functional binding site discovery is a daunting task, due to vestigial sites that have fuzzy or poor matches to bioinformatic motifs.

2.1.6 Physical Shortcomings of PWMs

2.1.7 Dependencies within transcription factor binding sites

The linear relation in Equation (2.6) becomes nonlinear if there are cooperative interactions between positions within a binding site (or if there are context dependent base-pair dependencies). For example, cooperativity at the biochemical level tends to cause the linear relationship between the first order Gibbs free energy and the binding constants to become nonlinear as a function of sequence, thereby decreasing the ability of linear models (or first order thermodynamic perturbations) to capture the relationship [62,20]. Furthermore, some DNA-protein interactions require specific nucleotides at various positions to occur jointly, such that the additive sum of the interactions of each nucleotide with the protein is not what would be expected under the linear model. In such cases it becomes important to consider higher-order interactions, such as dinucleotides or other jointly occurring nucleotides [114,112].

2.1.8 Dependencies between transcription factor binding sites

If the base-pair preferences for a particular transcription factor are contingent on a cooperating factor, then evolution will have fixed the co-occurring sites jointly. For example, the transcription factor NF-κB is known to have a specificity that is dependent on co-occurring binding sites [79], and similarly the binding sites of the Glucocorticoid Receptor are specific to their context [90]. The binding sites of the NF-κB homolog Dorsal have also been shown to encode specificity when active in different innate immunity pathways [28], or to signal Dorsal's role as an activator or a repressor [94].
2.1.9 Conditional PWMs based on co-occurring factor binding sites

Here we present a model that incorporates locus-specific information into PWMs. We call these "conditional" PWMs; they improve binding site discovery within CRMs by incorporating flanking information of each binding site locus into the functional binding site sequence distribution. This is useful for transcription factors that display specialized behavior based on their cis-environment. Our PWM approach accounts for DNA-DNA epistasis (hard-wired cooperativity) that is a function of the DNA spacer between the target binding site and a putative cooperating transcription factor's site. The hypothesis is that base-pair preferences between known cooperating proteins will be a function of the spacer between the known sites (assuming that sites separated by large spacers are effectively non-interacting). If the base-pair preferences change as the spacer changes, then evolution will have fixed the co-occurring sites jointly rather than independently. As a consequence, we expect different PWMs for binding sites separated from a putative interacting site as a function of spacer size. This model is similar to the cooperative nucleotide model in Ref. [16], but now we effectively have a spacer model between binding sites.

Furthermore, Berg and von Hippel in Ref. [16] introduce a spacer dependent interaction energy, which similarly addresses the fact that spacing between co-occurring transcription factor binding sites affects the total binding energy between the two separated sites. However, in their spacer dependent interaction energy, these authors kept the PWM for each binding site constant, regardless of its interaction with co-occurring binding sites, and focused only on the spacing between the co-occurring binding sites. Our model, in a sense, encodes the spacer dependent interaction energy into the different conditional PWMs constructed for different spacer windows.
2.2 Materials

2.2.1 Data for known Dorsal binding sites in the D. melanogaster Dorsal-Ventral network

The initial development of the fruit fly is partly based on maternally laid morphogens that form a gradient across the blastoderm, thereby causing differential target gene expression [78,65,80]. The Dorsal-Ventral (DV) network of genes active in the Drosophila embryo is largely conserved across the Drosophila genus; furthermore, their coarse-grained expression patterns in terms of percent egg length along the DV axis are also largely conserved [103]. The transcription factor Dorsal regulates the genes responsible for patterning the DV axis of embryogenesis leading to gastrulation [93,116,128]. Hence Dorsal transcription factor binding sites within and across Drosophila species represent a large set of binding sites that are amenable to constructing a PWM.

We collected Dorsal binding sites active in the Drosophila melanogaster neuroectoderm region of the DV axis that cooperate with a bHLH (basic helix-loop-helix) dimer with Twist. These sites are the Dβ sites of Table S2 of Crocker et al. [31], the Dorsal sites from Figure 2 of Crocker et al. [32], as well as the "specialized" NEE (Neurogenic Ectoderm Enhancers) and NEE-like Dorsal binding sites of Erives et al. and Crocker et al. [42,32]. Those sites are specialized in the sense that they have been shown to evolve slower than flanking Dorsal binding sites in homotypic clusters of Dorsal binding sites in the NEE [31], and are possibly specialized to the cooperative interaction with Twist (which we aim to characterize through information techniques).

There is ample evidence and a long-standing history in the literature for Dorsal sites cooperating with a bHLH dimer, see [87,74,129,121,71,67] and references therein. In those cases, the bHLH dimer is likely a Twist:Daughterless heterodimer. Daughterless is a ubiquitously expressed and obligate partner in tissue-specific bHLH dimers, such as Twist. The 'specialized' Dorsal data set is labelled as $D_{\mathbf{DC}_{mel}}$, where $D$ represents a data set, the subscript $\mathbf{DC}$ means 'Dorsal Cooperative', and mel stands for the species melanogaster.
We also collected Dorsal binding sites from the REDFLY footprinting database [46] for target sites active in embryogenesis. This data set is labeled as $D_{\mathbf{DU}_{mel}}$, where $\mathbf{DU}$ means 'Dorsal Uncooperative'. We did not find Dorsal footprinted sites from REDFLY for the Dorsal target gene snail in the CRM of snail, hence these Dorsal binding sites were omitted from our data set (our CRM data are described below). The set $D_{\mathbf{DU}_{mel}}$ is a subset of the full REDFLY Dorsal binding sites, where we filtered out any sites that had already been collected in our $D_{\mathbf{DC}_{mel}}$ data set, sites that were not active in the DV network, and binding site loci that overlapped.

2.2.2 DNA sequence context of binding sites

Our aim is to characterize the Dorsal sites based on patterns in the flanking sequence of the loci. The regulatory regions (the cis-regulatory modules) of DNA that contain the $D_{\mathbf{DC}_{mel}}$ and $D_{\mathbf{DU}_{mel}}$ binding sites consisted of the following melanogaster CRMs: rho, brk, sog, sogS, vn, vnd, twi, zen, dpp, tld. In that list each CRM is labeled by the gene it targets, and the sog gene had its Dorsal binding sites in two distinct CRMs labeled sog and sogS (where sogS is a 'Shadow' enhancer).

These CRMs have been collected centrally by Papatsenko et al. [100]. Additionally, these authors collected known melanogaster modules from the literature and, using a BLAST approach, predicted the remaining 11 Drosophila orthologs of the known melanogaster regulatory regions (at that time there were 12 sequenced genomes for Drosophila). The orthologs were not 'known' with the same certainty as the melanogaster data. However, we will still classify these as known for our purposes, because conservation of synteny (order of sites), together with each module containing multiple conserved blocks where sequence matches to binding sites reside, renders these predictions accurate. These modules are usually minimal modules that are on average about 300 base pairs in length.

We aligned the 12 orthologs of each CRM, and extracted only the aligned blocks that contained our $D_{\mathbf{DC}_{mel}}$ and $D_{\mathbf{DU}_{mel}}$ binding sites; see Supplement section 2.9.1 for details.
The enlarged set of combined data we call $D_{CB} = D_{\mathbf{DC}} \cup D_{\mathbf{DU}}$, where the removed subscript mel on DC and DU denotes that all 12 orthologs of a given binding site sequence are in the data set, and CB stands for combined.

2.3 Methods

2.3.1 Clustering Dorsal target loci based on co-occurring binding sites

Given the locations of the Dorsal binding sites within a given CRM (see Supplement section 2.9.5 for details) and the predicted sites of another factor (a putative cooperating factor), we are able to construct a distance matrix where each row $i$ is a known Dorsal locus (base-pair coordinate), and each column represents a predicted co-occurring factor's locus $j$ within the CRM. The matrix elements of the distance matrix are the spacer lengths (denoted as $d(i,j)$) in base-pairs between any row $i$ (Dorsal binding site locus) and column $j$ (co-occurring binding site locus), a difference of the coordinates $z$ of the loci:

\[ d(i,j) = z_i - z_j - w_i, \tag{2.8} \]

where we assume that the $i$th Dorsal site appears upstream from the $j$th co-occurring site, and that both sites are annotated as on the positive strand of the CRM, where $w_i$ is the width (length) of the $i$th site, and $z_i$ and $z_j$ are the CRM coordinates of sites $i$ and $j$ respectively. Here we define the spacer as the base-pair distance of neutral DNA between two binding sites (hence the internal positions within either site are not counted as part of the spacer). For cases where the Twist and Dorsal sites overlap, the spacer is valued at 0 bp regardless of the amount of overlap. For cases where a CRM did not contain a predicted co-occurring site, we set the spacer to a maximum value such that the corresponding Dorsal site was guaranteed to be classified as "Uncooperative".
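The spacer bookkeeping of Section 2.3.1 can be sketched in a few lines. This is a minimal illustration, not the code used for the thesis: it assumes each site is annotated by a start coordinate and a width on the positive strand, and it returns the unsigned gap of neutral DNA between the two sites, so it does not depend on the sign convention of Equation (2.8). The names `spacer`, `distance_matrix`, and the sentinel `NO_SITE` are hypothetical.

```python
NO_SITE = 10**9  # hypothetical "maximum" spacer for CRMs without a co-occurring site


def spacer(dorsal_start, dorsal_width, other_start, other_width):
    """Unsigned gap (bp) between two annotated sites; 0 if they touch or overlap."""
    dorsal_end = dorsal_start + dorsal_width  # first coordinate past the Dorsal site
    other_end = other_start + other_width     # first coordinate past the other site
    gap = max(dorsal_start, other_start) - min(dorsal_end, other_end)
    return max(0, gap)                         # overlapping sites are valued at 0 bp


def distance_matrix(dorsal_sites, other_sites):
    """Rows: Dorsal loci; columns: predicted co-occurring loci; entries: spacers."""
    rows = []
    for z_i, w_i in dorsal_sites:
        row = [spacer(z_i, w_i, z_j, w_j) for z_j, w_j in other_sites]
        rows.append(row if row else [NO_SITE])  # no predicted site in this CRM
    return rows
```

With a Dorsal site of width 9 starting at coordinate 100 and a downstream site starting at 120, the spacer is 120 - 100 - 9 = 11 bp, and internal positions of either site are never counted.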
2.3.2 Classifying binding sites based on spacer window

We define a partitioning of the flanking sequence of any given Dorsal locus; hence we use the reference frame of the Dorsal locus with both upstream and downstream sequence. We partition the upstream sequence by the minimum distance $d_{\min}$ and a maximum distance $d_{\max}$ away from the locus using Equation (2.8). Similarly, we define a symmetric partition of the downstream sequence by the minimum distance $-d_{\min}$ and a maximum distance $-d_{\max}$ away from the locus. We then define a coarse-grained binning of all the flanking sequence into just two bins, where a 'spacer window' represents the bin that contains the interval $[d_{\min}, d_{\max}] \cup [-d_{\max}, -d_{\min}]$, and the other bin contains all the rest of the flanking sequence. Once the bin borders have been defined by the spacer window, we define a Boolean class variable $C$, which classifies each Dorsal locus as $C = 1$ if the co-occurring binding site of interest is present in the spacer window, and $C = 0$ if the co-occurring binding site sequence of interest is absent in the spacer window. Hence, the class variable is based entirely on the patterns that occur within the spacer window, as the class value of each locus is determined solely by co-occurring sites in the spacer window. Using Equation (2.8) we classify the Dorsal loci that fall within a defined window. Once each Dorsal binding site's locus is assigned a class, we can align the loci of a class and estimate the conditional PWM.
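The classification rule above amounts to a single test per Dorsal locus. The sketch below assumes the spacers for a locus are available as a list of base-pair distances (signed or unsigned); the function names and the default window of [0, 30] bp are illustrative choices, not a prescribed interface.

```python
def classify_locus(spacers, d_min=0, d_max=30):
    """Return C = 1 if any spacer falls in the symmetric window, else C = 0.

    Taking abs(d) treats the upstream interval [d_min, d_max] and the
    downstream interval [-d_max, -d_min] as one spacer window.
    """
    return int(any(d_min <= abs(d) <= d_max for d in spacers))


def classify_all(distance_rows, d_min=0, d_max=30):
    """Apply the window test to every row of a Dorsal-by-cofactor distance matrix."""
    return [classify_locus(row, d_min, d_max) for row in distance_rows]
```

A locus with spacers of 45, 12, and 200 bp is assigned C = 1 because one spacer (12 bp) lies in the window, while a locus whose only spacer is the large "no site" value is assigned C = 0.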
2.3.3 Energy estimation of a base

The theoretical steady-state Boltzmann-like distribution is the solution to minimizing the Lagrangian with respect to $P(B)$ in Equation (2.5). The Boltzmann-like distribution in units of the second Lagrange multiplier is:

\[ P(B) = \frac{P_0(B)\,\exp\left(-E(B)\right)}{Z}, \tag{2.9} \]

where the normalization $Z$ is related to the Lagrange multiplier $\lambda_0$, and we have assumed calibration of the energy $E(B)$ by estimating the shift and scaling factors from Equation (2.6). Assuming our frequencies from Equation (2.1) are governed by the Boltzmann-like distribution, we can construct an energy PWM by inverting the distribution, arbitrarily choosing the consensus base $B_0$ to be the zero of the interaction energy between transcription factor and bases. The consensus base is the base at a position with the most counts from the alignment, hence this choice of reference leads to all other bases contributing a higher energy (or zero for degenerate cases). We can then calculate the interaction energy of the remaining bases $B$ as:

\[ E(B) \approx \log \frac{P(B_0)}{P(B)} = \log \frac{n_{B_0} + \alpha}{n_B + \alpha}. \tag{2.10} \]

Here we have made the approximation that the degeneracy factors $P_0(B) = g(B)/L$ are negligible (this is the prior or background DNA frequency), where $g(B)$ is the multiplicity or number of times that base $B$ occurs in a genome of length $L$ [15], $n_{B_0}$ are the counts of the reference base $B_0$ and similarly $n_B$ are the counts of base $B$ from the contingency table estimated from the alignment of $n$ known sites, and $\alpha$ is a pseudocount $> 0$. The joint energy of a given base $B$ with co-occurring flanking sequence $S'$ (that may or may not contain a co-occurring binding site of another factor) is defined as $E(B,S') = E(B) + E(S') + w(B,S')$. By setting a spacer threshold (spacer window) and an energy threshold on the potential cooperating factor we effectively create a Bernoulli variable for the flanking sequence, such that $S'$ is aggregated into the class variable $C$. Hence, we have $E(B,C) = E(B) + E(C) + w(B,C)$, where $w(B,C)$ is an interaction energy that is shared between the systems $B$ and $C$. Once we have determined what class a Dorsal locus belongs to, we are no longer interested in the energy of the co-occurring site in the flanking sequence $S'$. Hence we define a conditional energy that is the standard PWM energy from Equation (2.2) for a particular position and base plus the interaction term:

\[ E(B \mid C) = E(B) + w(B,C). \tag{2.11} \]

The interaction term shifts the standard energy of a sequence if $P(B \mid C) \neq P(B)$. We define our context $C$ for Dorsal sites $B$ based on proximity to Twist (the spacer window), thereby placing a class tag $C$ on each of the Dorsal binding site bases $B$. We calculate the shift $w$ as:

\[ w(B,C) = -\log \frac{P(B,C)}{P(B)\,P(C)} = -\log \frac{P(B \mid C)}{P(B)}. \tag{2.12} \]

The shift $w$ is the pointwise log-ratio of the conditional distribution $P(B \mid C)$ to the marginal distribution $P(B)$; averaged over $P(B \mid C)$, its negative is the Kullback-Leibler divergence between the two distributions.

2.3.4 Energy estimation of a sequence of bases

We now extend the model from a single position to a binding site sequence. The total shift for a particular binding site sequence $S$ and its flanking sequence is $w(S,S') = E(S,S') - E(S) - E(S')$, where the shift is calculated as:

\[ w(S,S') = -\log \frac{P(S,S')}{P(S)\,P(S')}. \tag{2.13} \]

The sequence $S$ is the Dorsal binding site at a particular locus, and is a sequence of bases $B$, while the sequence $S'$ is effectively a Bernoulli variable $C$, which means the flanking sequence $S'$ of the Dorsal site either has a Twist site (in which case $C$ = proximal) or not (in which case $C$ = distal). Hence, $w(S,S') = w(S,C)$, which we define as:

\[ w(S,C) = \sum_{i=1}^{s} w(B_i, C), \tag{2.14} \]

where we have defined $S$ as the sequence $\{B_i\}$, where $i \in \{1, 2, 3, \ldots, s\}$, and $s$ is the length of the binding site sequence $S$. Equation (2.14) uses a standard PWM to calculate $P(S)$ (as opposed to using the marginal of $S$ over $C$), because the marginal distribution is a "mixture model" that cannot be factorized into a product of base-specific probability factors [13].
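Equations (2.10) to (2.12) and the sum in Equation (2.14) can be checked numerically with a short sketch. Everything below is illustrative: the counts are made-up toy numbers rather than thesis data, the function names are hypothetical, and probabilities are assumed to be estimated from counts with the same symmetric pseudocount $\alpha$ for both the pooled alignment and the class-specific alignment.

```python
import math

BASES = "ACGT"


def probabilities(counts, alpha=0.1):
    """Pseudocount-smoothed base frequencies for one alignment column."""
    total = sum(counts[b] for b in BASES) + 4 * alpha
    return {b: (counts[b] + alpha) / total for b in BASES}


def base_energy(counts, alpha=0.1):
    """E(B) = log[(n_B0 + alpha) / (n_B + alpha)], Eq. (2.10); consensus has E = 0."""
    n0 = max(counts[b] for b in BASES)
    return {b: math.log((n0 + alpha) / (counts[b] + alpha)) for b in BASES}


def shift(all_counts, class_counts, alpha=0.1):
    """w(B, C) = -log[P(B|C) / P(B)], Eq. (2.12)."""
    p = probabilities(all_counts, alpha)
    p_c = probabilities(class_counts, alpha)
    return {b: -math.log(p_c[b] / p[b]) for b in BASES}


def sequence_shift(seq, all_columns, class_columns, alpha=0.1):
    """w(S, C) = sum_i w(B_i, C), Eq. (2.14), one column of counts per position."""
    return sum(shift(a, c, alpha)[b]
               for b, a, c in zip(seq, all_columns, class_columns))


# Toy counts for a single position (hypothetical numbers).
all_counts = {"A": 30, "C": 5, "G": 5, "T": 10}
class_counts = {"A": 4, "C": 1, "G": 0, "T": 5}

E = base_energy(all_counts)
w = shift(all_counts, class_counts)
E_cond = {b: E[b] + w[b] for b in BASES}  # conditional energy, Eq. (2.11)
```

In this sketch the conditional energy reduces to $\log[P(B_0)/P(B \mid C)]$, with $B_0$ the consensus base of the pooled column, which is the identity the conditional energy PWM relies on.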
Computationally, for energy PWMs there is a matrix $w$ for each class value of $C$. By adding the $w$ matrix to the energy matrix $E$ (matrix elements defined by Equation (2.10)) we obtain a conditional energy. We define a conditional detector, or a conditional energy PWM, which we use for bioinformatic predictions and annotations of binding site sequences. The detector trained from sequences of class $C$ will then score each sequence $S$ as:

\[ E(S \mid C) = E(S) + w(S,C). \tag{2.15} \]

Here $E(S)$ is from Equation (2.2), where the matrix elements $E_{ij}$ are equal to $E(B_i^{(j)})$ from Equation (2.10). The function $B(j)$ is a map between the alphabet A, C, G, T of base $B$ and the values of the matrix index $j$: 0, 1, 2, 3, where we define the 0 index to be the consensus base and therefore the reference energy level (the ground state). The matrix index $i$ denotes the position of the base, which we previously denoted as $B_i$ in Equation (2.14), where it was clear which particular base $B$ resided at position $i$ of sequence $S$. Hence the conditional energy is $E(B \mid C) = \log \frac{P(B_0)}{P(B \mid C)}$, where $B_0$ is the consensus base of the position independent PWM from Equation (2.9).

2.4 Model detectors

We define two types of Dorsal binding site sequence models ("detectors") that we use for detection and classification. The first detector is conditioned on flanking sequence motifs, and hence can potentially better resolve functional loci. The second detector is simply a standard (unconditional) PWM model, which we use as a baseline for model comparison.
First we define the detector that incorporates flanking sequence information. As we will see, the detector acts like a logic gate that we call the "OR gate", due to its similarity with a standard digital OR gate used in electronics. The input to the gate is a k-mer, and the output is a decision on whether the k-mer is a Dorsal binding site or just random background DNA. The detector's decision is based on the conditional energy PWM scores from Equation (2.15) described above; that is, its output depends on the output of two distinct "subdetectors", which we call DC ("Dorsal Cooperative") and DU ("Dorsal Uncooperative"). The DC component of the OR gate scores all incoming k-mers based on the conditional energy for a sequence with class type 'proximal', while the DU component scores all incoming k-mers based on the conditional energy for the class type 'distal'. The "OR gate" detector fires (that is, predicts a Dorsal site) if either the DC or the DU detector (or both) fires. In general, any energy PWM model (and hence our conditional energy PWMs) can be used as a linear classifier for binding site sequences. This classification is based on the following linear equation for any given k-mer:

\[ y(S) = E_c - \mathbf{E} \cdot \mathbf{S}. \tag{2.16} \]

Here $\mathbf{E}$ and $\mathbf{S}$ are vectors from a $4k$ dimensional real vector space, where we elevated the matrix of indicator variables from Equation (2.10) to be a bona fide vector. $E_c$ acts as a bias that shifts the hyperplane that separates putative functional sites from non-functional sites. The Euclidean dot product between the two vectors, $\mathbf{E} \cdot \mathbf{S}$, is defined as the sum over element-wise multiplications, where the energy $\mathbf{E}$ is now another vector in the space that projects each k-mer $S$ onto a line of length $E(S)$ (i.e., Equation (2.2)). The so-called bias or energy threshold is a positive real number ($E_c$), and represents a partitioning of the line defined by $y$ into positive and negative real numbers. All k-mers with a positive value of $y$ have energy less than $E_c$, and are classified as a binding site. All k-mers with a negative value of $y$ have energy greater than $E_c$, and are classified as random DNA sequence.
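The linear classifier of Equation (2.16) and the OR-gate decision can be written out as follows. This is a sketch under stated assumptions: the two energy matrices are hypothetical 3-position toys (rows are positions, columns are ordered A, C, G, T rather than consensus-first), and `one_hot`, `y`, and `or_gate` are illustrative names.

```python
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot(kmer):
    """Indicator vector S of length 4k: one entry set per position."""
    vec = [0] * (4 * len(kmer))
    for i, base in enumerate(kmer):
        vec[4 * i + INDEX[base]] = 1
    return vec


def y(kmer, energy_matrix, e_c):
    """y(S) = E_c - E . S, Eq. (2.16); positive y means energy below threshold."""
    e_vec = [e for row in energy_matrix for e in row]  # flatten to a 4k vector
    s_vec = one_hot(kmer)
    return e_c - sum(e * s for e, s in zip(e_vec, s_vec))


def or_gate(kmer, dc_matrix, du_matrix, e_c):
    """Fire if the DC subdetector, the DU subdetector, or both fire."""
    return y(kmer, dc_matrix, e_c) > 0 or y(kmer, du_matrix, e_c) > 0


# Hypothetical conditional energy matrices (not estimated from real sites).
DC = [[0.0, 2.0, 2.0, 1.0], [0.0, 3.0, 3.0, 0.5], [1.5, 2.0, 2.0, 0.0]]
DU = [[0.0, 2.0, 2.0, 2.0], [0.0, 3.0, 3.0, 2.0], [0.0, 2.0, 2.0, 1.0]]
```

With the threshold set to 1.0, the k-mer AAA scores 1.5 under DC (no hit) but 0.0 under DU (hit), so the OR gate still fires; this is the sense in which either subdetector alone is sufficient.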
The OR gate detector is partially defined once the flanking sequence feature (the co-occurring binding site motif) and the spacer window have been set (or optimized), as described in the Methods section above. These settings allow us to estimate the conditional probabilities. Hence, using only Dorsal binding site sequences from the data set $D_{CB}$ we are able to train and define an OR gate that is not mixed with binding sites based on purely bioinformatic matches. The second model is the standard PWM, which we call the CB detector, where CB stands for the "combined" set (meaning the conditional and unconditional data sets combined, which we denote by $D_{CB}$). The CB model assigns an energy score $E(S)$ to each sequence $S$ as in Equation (2.2), which has a corresponding probability $P(S)$ as in Equation (2.1).

2.5 Results

2.5.1 Optimal spacer window for the OR gate detector

In order to calibrate our conditional detectors we must find an optimal interval of the spacer window by calculating the mutual information between the known Dorsal binding sites and the potential cooperator's binding site (e.g., co-occurring Twist sites). The spacer window that leads to the maximum mutual information determines an optimal clustering of the Dorsal loci into two classes, which we can then use to build the OR gate.

We predict 5'-CAYATG loci (putative Twist sites) within the CRMs by scoring the CRMs with an energy PWM and a threshold that corresponds to exact matches of the Twist motif 5'-CAYATG, which we found to have the highest mutual information with Dorsal binding site sequences. In Supplement section 2.10 we show a similar analysis with the alternative Twist motif 5'-CACATG, and some results for the motif's restricted form 5'-CACATGT.
Upon construction of the spacer distance matrix we are able to classify all annotated Dorsal sites as 'Cooperative' or 'Uncooperative', based on whether any of the spacers for a given Dorsal locus was within the bin border defined by $d_{\min}$ and $d_{\max}$. For example, a CRM annotated with one Dorsal site and three Twist sites will have three spacers. If any of those spacers are within the spacer window, then the Dorsal site is classified as 'Cooperative'. We define the spacer window as a 30 base-pair closed interval, which starts at [0,30] bp relative to each Dorsal coordinate within the CRM (not counting the body of the binding site as a part of the spacer).

All known Dorsal loci of a given class are then aligned (see Supplement section 2.9.9 for details) to construct the conditional Dorsal binding site sequence distribution (conditional PWM). Given the class labels on the Dorsal sites, we are able to estimate the probability of a given class as simply the fraction of Dorsal sites that belong to each class $C$. With these distributions we are then able to calculate the mutual information $I(S;C)$ between the Dorsal site sequence variable $S$ and the class $C$ as

\[ I(S;C) = \sum_{S} \sum_{C} P(S \mid C)\, P(C) \log \frac{P(S \mid C)}{P(S)}, \tag{2.17} \]

where $P(S \mid C)$ is the conditional PWM, and $P(S) = \sum_C P(C) \prod_i P(S_i \mid C)$ is the marginalized distribution of sequence over class labels $C$ (note this is not the same as the CB detector's probability). As stated above, the initial $d_{\min}$ was set at zero and $d_{\max}$ at 30 bp, and then both parameters are incremented by 30 bp to shift the window to a new position. For each shift of the spacer window we classify all Dorsal loci, align each class to a length 9 motif, and then calculate the mutual information. The result is shown in Table 2.1 and implies that the information between sequence and class label is highest if the spacer is between 0 and 30 bp, as expected for binding sites that interact via molecular interactions. Furthermore, we appended one nucleotide of flanking sequence on each side of each binding site sequence to see if we were missing parts of the conditional binding sites.

Spacer                                    [0,30] bp   (31,60] bp   (61,90] bp
Mutual information, Equation (2.17)       0.49        0.29         0.04

Table 2.1: Mutual information between functional Dorsal binding site sequences and putative Twist sites that match 5'-CAYATG using a sliding spacer window scheme.

We show the conditional Dorsal binding site sequence logos for functional binding sites generated for this first spacer window in Figure 2.1. The information content of each position of the binding site corresponds to the height of the logo, where we used a symmetric hyperparameter value of $\alpha = 0.1$ as discussed in Supplement sections 2.9.11 and 2.11.7.

Figure 2.1: Logos generated for known Dorsal sites (the $D_{CB}$ data) tested for adjacency to 5'-CAYATG used as the cooperative classifier in the [0,30] bp distance. Logo A corresponds to the cooperative class, and displays the known 5'-AAATT core, with total information content 13.5 bits. Logo D is the exact same logo as A but with a single base-pair of flanking sequence at the start and end of the site (hence, this logo starts at position -1). Position 9 of this logo shows about two decibits of information relative to the background sequence in the nucleotide base 'C' (2 out of 10 functional DC sites have a 'C' at this position). Logo B is the 'uncooperative' class for the [0,30] bp window, which we calculated to have 9.1 bits of information relative to the background (uniform distribution of bases), and logo E has the added flanking sites for the 'uncooperative' class. Logo C is the CB motif with 9.6 bits of information relative to the background, which looks similar to the 'uncooperative' class at position 6 due to there being many more sites that prefer A to T at this position amongst all the Dorsal sites in the network. Logo F is the CB motif with the flanking sequence appended.

2.5.2 The conditional and unconditional PWMs are significantly different

Here we test the training data energy scores of the optimal DC and DU detectors to see if the median energy of DC is significantly different from the median energy of DU. The optimal detectors were based on the 5'-CAYATG Twist motif and the [0,30] bp window. The rank sum test rejected the null hypothesis that the median energies are equal with $p = 10^{-26}$. The median energy of the DC PWM was 0.27, while the median energy of the DU PWM was 2.7.
It is possible that any random partitioning of a set of binding sites that are used to build detectors using our technique would produce p-values consistent with significance. To calibrate the p-value, we used our original data set of Dorsal sites $D_{CB}$ to construct a sampling distribution of p-values for the rank sum test from 1000 repetitions, where at each repetition the combined data $D_{CB}$ were randomly partitioned into two data sets. PWMs were constructed for each partition. The energy of each sequence within a partition was calculated as $E(S) + w(S,P)$, where $P$ is the partition, $S$ is a sequence in the partition, and $E(S)$ is the CB energy. We then determined the corresponding rank sum p-value between the random sets. We found that the p-value of the rank sum test between the DC and DU models fell well beyond the right tail of the random sampling distribution (shown in Figure 2.2), indicating that the median energies of the DC data set and the DU data set are significantly different from any random partitioning of the combined data set. More details are in Supplement section 2.10.1.

Figure 2.2: Histogram of p-values of a rank sum test of random partitions of the combined data set $D_{CB}$. The binning is in units of $-10 \log_{10}$ of the p-value, rounded to the nearest integer. The p-value of the rank sum test between the DC and DU energy data sets based on their energy PWMs was 260 in log base ten units (scaled by 10), which is indicated by the red bar of arbitrary height.

2.6 Performance of optimal classifiers (detectors)

All detectors were built from length 9 alignments (see Supplement section 2.9.9 for details of the alignment procedure). The OR gate is based on the DC detector built from the data set $D_{DC}$, which contains Dorsal loci from $D_{CB}$ that were tagged with class labels from the optimal spacer window of [0,30] bp with the 5'-CAYATG motif. Similarly, the DU detector is built from the data set $D_{DU}$, which contains the remaining Dorsal loci from $D_{CB}$ that did not have Twist sites in the spacer window. The unbolded subscripts DC and DU on the data sets denote that these sets of Dorsal sites were based on our clustering scheme (not on literature annotation).
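The random-partition calibration of Section 2.5.2 can be sketched with the standard library alone. Two simplifications are assumed here and are not part of the procedure described above: the energies are held fixed and only shuffled between the two groups (the thesis rebuilds a PWM for each partition), and the rank sum p-value uses the normal approximation without a tie correction. All function names are hypothetical.

```python
import math
import random


def rank_sum_z(x, y):
    """Wilcoxon rank-sum z statistic for sample x versus y (normal approximation)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2  # average of ranks i+1 .. j for tied values
        i = j
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12
    return (w - mean) / math.sqrt(var)


def two_sided_p(z):
    """Two-sided p-value of a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def random_partition_pvalues(energies, n_first, repeats=1000, seed=0):
    """Sampling distribution of rank-sum p-values under random two-way splits."""
    rng = random.Random(seed)
    pvalues = []
    for _ in range(repeats):
        shuffled = energies[:]
        rng.shuffle(shuffled)
        z = rank_sum_z(shuffled[:n_first], shuffled[n_first:])
        pvalues.append(two_sided_p(z))
    return pvalues
```

The observed DC-versus-DU p-value is then compared with the tail of the list returned by `random_partition_pvalues`, for example after converting both to the $-10\log_{10}$ scale used in Figure 2.2.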
We now present three experiments that test the performance of our OR gate detector and the conditional detectors using the CB PWM as a benchmark.

2.6.1 The DC detector predicts sites proximal to 5'-CAYATG with better odds than the DU detector

We expect that DC should predict Dorsal binding site sequences that are adjacent to Twist more precisely than DU (since we showed earlier that the Dorsal site sequences contain information about adjacency to Twist). In Table 2.2 we collected all the hits (all the positives) of the detectors. We test whether the DC conditional energy PWM is actually predicting Dorsal sites within the CRMs that have the correct flanking sequence feature (presence or absence of the Twist motif) with better odds than the DU detector. The odds of DC for predicting binding site sequences that belong to the proximal class were 61/39 = 1.6. The odds of DU for predicting sequences of the proximal class were 280/345 = 0.81, hence the odds ratio is approximately 1.9 (2.0 when computed from the rounded odds). The one-sided p-value for this table's log odds ratio test is $p = 0.001$ for the chances of seeing a DC detector with better odds relative to DU at predicting correct flanking sequence features. Increasing the energy cutoff $E_c$ increases the total counts of the table, and we obtain similarly significant tables up until about $E_c = 5$.

        proximal   distal
DC      61         39
DU      280        345

Table 2.2: Contingency table with the conditional detectors DC and DU represented along the rows and the class types proximal and distal represented along the columns. Each table element represents the number of sites predicted from each detector of each class type based on Twist sites (5'-CAYATG) and a CB energy cutoff $E(S) = E_c = 2.1$.
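The arithmetic behind Table 2.2 can be reproduced directly from the four counts. The sketch below assumes a Wald test on the log odds ratio with a normal approximation; the text does not state which variant of the log odds ratio test was used, so this choice and the function name are assumptions.

```python
import math


def odds_ratio_test(a, b, c, d):
    """2x2 table [[a, b], [c, d]]: odds ratio, z score, one-sided Wald p-value."""
    ratio = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(ratio)
    z = math.log(ratio) / se
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) for a standard normal
    return ratio, z, p_one_sided


# Counts from Table 2.2: rows DC, DU; columns proximal, distal.
ratio, z, p = odds_ratio_test(61, 39, 280, 345)
```

From the unrounded counts the odds ratio is about 1.93, and under this approximation the one-sided p-value comes out near 0.001, in line with the value quoted above.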
2.6.2 Both OR gate and CB detectors show high sensitivity with known sites as positives and CRM sequences as negatives

In order to test the sensitivity and the specificity of the detectors we used the Receiver Operating Characteristic (ROC), which displays the trade-off between optimizing predictive performance for 'positives' and optimizing for not detecting known 'negatives'. The True Positive Rate (TPR) is defined as $TPR = \frac{TP}{TP + FN}$, where the denominator is the total count of True Positives (TP) and False Negatives (FN). The False Positive Rate (FPR) is defined as $FPR = \frac{FP}{FP + TN}$, where the denominator is the total count of True Negatives (TN) and False Positives (FP).

We use the data set $D_{CB}$ as our training set of 'positives' ($TP + FN$) for both the CB detector and the OR gate. The 'negative' data set ($TN + FP$) is the set of all CRMs that contained a known binding site (i.e., the CRMs associated with $D_{CB}$), where the bona fide sites (the functionally verified sites) are masked out. Furthermore, within the CRMs we also mask out overlapping predicted binding sites based on the algorithm in Supplement section 2.9.6, hence the negative data (the CRMs with known sites masked and overlapping hits masked) is at least ninefold smaller than the concatenated length of the CRMs due to the binding sites being nine base pairs in length.

For a given energy threshold, $E_c = E(S)$, set by the CB energy PWM for both the OR gate and the CB detector, each detector 'scans' the CRM using a sliding window approach, where each 'hit' of the detector is classified as a TP if the hit overlaps a known binding site locus in $D_{CB}$, and as a FP if the detector fires in the background of the CRM. Similarly, known sites (loci) from $D_{CB}$ that were not called hits by the detector are classified as FN, while TN are the k-mers from the CRM background sequence that the detector did not call a hit.
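The ROC construction above can be illustrated with a small sketch. It assumes the scanning step has already been done, so that each known site and each background k-mer is represented simply by its energy score; the energies listed are hypothetical and the function names are illustrative.

```python
def rates(positive_energies, negative_energies, e_c):
    """Return (TPR, FPR) when a k-mer is called a hit if its energy is below e_c."""
    tp = sum(e < e_c for e in positive_energies)   # known sites that were hit
    fn = len(positive_energies) - tp               # known sites that were missed
    fp = sum(e < e_c for e in negative_energies)   # background k-mers that were hit
    tn = len(negative_energies) - fp               # background k-mers left silent
    return tp / (tp + fn), fp / (fp + tn)


def roc_curve(positive_energies, negative_energies, thresholds):
    """List of (FPR, TPR) points, one per energy threshold."""
    points = []
    for e_c in thresholds:
        tpr, fpr = rates(positive_energies, negative_energies, e_c)
        points.append((fpr, tpr))
    return points


# Hypothetical energies for known sites and masked-CRM background k-mers.
known_sites = [0.5, 1.0, 2.5, 6.0]
background = [3.0, 5.0, 7.0, 9.0, 11.0]
```

Sweeping the threshold from low to high energy traces the curve from (0, 0) toward (1, 1); at a threshold of 4.0 in this toy example three of four known sites are recovered at the cost of one of five background hits.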
The ROC of the OR gate (shown in Figure 2.3A) tends to perform better than the CB detector at low energies up until the energy cutoff reaches about $E(S) < 8$ (the last point (FPR, TPR) displayed in the figure), after which the CB detector tends to do better. The OR gate in the region of ROC space displayed shows better performance than the traditional CB detector. This is clearer quantitatively: we found the OR gate had a higher area under the curve (AUC) integrated from the minimum energy to CB's energy cutoff of $E(S) < 8$ (which is the last point displayed in ROC space). The OR gate and CB detector both perform well for strong sites (low energy sites), which is indicated by their good TPR (almost 80%) before a noticeable fraction of negatives start to be detected as positive.

2.6.3 The OR gate performs better than CB at predicting known sites at lower energies

Another metric of performance of the classifier is the mutual information between the type of k-mer (Dorsal or not Dorsal) and the classification by the detector. For example, if the input is not a Dorsal binding site, the detector should stay silent, while it should fire if it is a Dorsal site (either adjacent to Twist or not). We can write this mutual information as

\[ I(\mathcal{I};\mathcal{O}) = H(\mathcal{I}) - H(\mathcal{I} \mid \mathcal{O}), \tag{2.18} \]

where $\mathcal{I}$ is the binary random variable holding the true identity of the 'Input' k-mer received by the detector, while the 'Output' variable $\mathcal{O}$ is the binary variable given by the detector's decision. The entropy $H(\mathcal{I})$ is in principle given by the relative likelihood to find Dorsal binding sites within the ensemble of CRMs, which is of course heavily biased towards negatives (non-Dorsal sites). However, this Bayesian prior is not available to the transcription factor; in other words, for each decision to bind, the factor has its own Bayesian prior $p$, which we will set to $p = 1/2$ (maximum entropy Bayesian prior) below.

Figure 2.3: ROC and Information. (A) False Positive Rate (FPR) vs. True Positive Rate (TPR) when varying the energy cutoff $E_c$. (B) shows the mutual information $I(\mathcal{I};\mathcal{O})$ (Section 2.6.3) between the input and output of the detectors as a function of the energy cutoff.
The conditional entropy $H(\mathcal{I} \mid \mathcal{O}) = -\sum_{i,o=0}^{1} p(i)\, p(i \mid o) \log p(i \mid o)$ quantifies the remaining uncertainty about the identity of the k-mer given the decision of the detector, and can be calculated using the false positive and true positive rates introduced earlier. In particular, the conditional probability $p(i \mid o)$ is obtained as

\[ p(1 \mid 1) = p(\mathcal{I}=1 \mid \mathcal{O}=1) = TPR \tag{2.19} \]
\[ p(1 \mid 0) = p(\mathcal{I}=1 \mid \mathcal{O}=0) = 1 - TPR \tag{2.20} \]
\[ p(0 \mid 1) = p(\mathcal{I}=0 \mid \mathcal{O}=1) = FPR \tag{2.21} \]
\[ p(0 \mid 0) = p(\mathcal{I}=0 \mid \mathcal{O}=0) = 1 - FPR, \tag{2.22} \]

while $p(i)$ is the Bayesian prior (density of Dorsals/non-Dorsals in the CRM). Using an arbitrary prior $p$, we can rewrite the mutual information from Equation (2.18) as:

\[ I(\mathcal{I};\mathcal{O}) = H[p] - p\,H[TPR] - (1-p)\,H[FPR], \tag{2.23} \]

where $H[*]$ is the usual binary entropy function of a Bernoulli distribution characterized by $*$, so for example

\[ H[TPR] = -\frac{FN}{TP+FN} \log \frac{FN}{TP+FN} - \frac{TP}{TP+FN} \log \frac{TP}{TP+FN}, \tag{2.24} \]

with a similar expression for $H[FPR]$. We show the mutual information $I(\mathcal{I};\mathcal{O})$ in Figure 2.3B using the maximum entropy Bayesian prior $p = 1/2$. Compared to the information the CB detector has about Dorsal sites, the OR gate's information is shifted to lower energies, implying that at fixed energy it knows Dorsal sites better than CB.

DC conditional detector is able to predict that Twist is nearby

The conditional detectors are expected to make predictions not only about what is a Dorsal site relative to the background, but also about whether Dorsal is in the vicinity of Twist. By partitioning all the known sites into the two class types (e.g., 'distal' and 'proximal') as determined from the spacer window of [0,30] bp and Twist motif 5'-CAYATG, we can test how well each detector can resolve the class type of a Dorsal site (Dorsal with Twist or without).
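Returning to Equations (2.23) and (2.24), the information measure can be evaluated from a prior and a pair of rates as sketched below. The first function implements Equation (2.23) as it is written in the text. The second is included only for comparison and is an addition not taken from the thesis: it is the textbook mutual information of a binary channel, which uses the entropy of the detector's output in place of $H[p]$. Logarithms are taken base 2 (bits), which is an assumption, and the function names are hypothetical.

```python
import math


def binary_entropy(q):
    """H[q] for a Bernoulli(q) variable, in bits; H[0] = H[1] = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def info_eq_2_23(p, tpr, fpr):
    """Equation (2.23) as written: H[p] - p H[TPR] - (1 - p) H[FPR]."""
    return binary_entropy(p) - p * binary_entropy(tpr) - (1 - p) * binary_entropy(fpr)


def channel_information(p, tpr, fpr):
    """Textbook I = H(O) - H(O|I) for a binary channel (comparison only)."""
    p_fire = p * tpr + (1 - p) * fpr  # marginal probability that the detector fires
    return binary_entropy(p_fire) - p * binary_entropy(tpr) - (1 - p) * binary_entropy(fpr)
```

At $p = 1/2$ the two expressions coincide whenever $TPR = 1 - FPR$ (the detector then fires half the time), and both give 1 bit for a perfect detector and 0 bits for a detector with $TPR = FPR = 1/2$.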
Foragivenenergythresholdwescannedthecombineddataset D CB withtheDCas wellastheDUdetector,andaskedhowmuchthedetectorknowsabouttheclassvariable C (furtherdetailsofthisexperimentareinSupplementsection2.10.3).Weshowthismutual information I ( C ; P )inFigure2.4,where P isthebinaryrandomvariableencodingthe detector'sdecisionaboutthecontext.WeseethattheDCdetectorhasupto0.3bitsof informationabouttheproximityofTwistinanyparticularDorsalsite,whiletheDUdetector hasvirtuallynoinformationaboutthisvariable. 2.7Discussion 2.7.1DCandDUInformationlogosandpreviousevidence Thebindingsitesequencelogosdisplaytheinformationcontentofourbindingsitedata relativetoauniformdistribution.ByinspectionoftheDClogotheconsensussequence (highestinformationscoringsequence)ispartiallyconsistentwithTableS2ofCrockeret al.[31].The5'-AAATTcoreisreproducedasourDCconsensussequence,whilethe sequenceforthelength11bindingsitesarenotenrichedwithGatthestartofthesiteand 93 Figure2.4:Mutualinformation I ( C ; P )betweentheactualclasses C andthepredictedclasses P forDetectorsDCandDUasafunctionofthethresholdenergy E c thatisbyeach detector'sconditionalenergyEquation(2.15). 
a C at the end of the site. Similarly we can see that our DU also conforms roughly to A-tract Dorsal binding sites, which are Dorsal binding sites that have four or more contiguous Adenines. Mrinal pointed out that A-tract binding sites have certain physical chemical properties not seen in 5'-AAATT core Dorsal sites [94], namely that A-tract Dorsal binding sites encode a mechanism (like an extra hydrogen bond between the protein and DNA) for Dorsal to switch roles from an activator of gene expression to a repressor of expression based on the binding site Dorsal was occupying. Of course, as mentioned by Mrinal, these sites are still context dependent, namely the context of a site may override any preference a binding site sequence has for causing activator or repressor roles [99]. Inspection of our DU detector's data set shows that it is more than 50% enriched with Dorsal sites that are known to be from repression cis-regulatory elements (zen, tld, dpp), hence the DU logo with a 5'-AAAAT core is not surprising.

Our known binding sites, to a degree, come with the class labels already attached. The $D_{DC}^{mel}$ data is the known Dorsal binding site data set based on the definition of D or 'specialized' sites, or NEE-like Dorsal binding sites (neuroectoderm Dorsal sites that were linked to Twist sites, but were not linked to the canonical 5'-CACATGT Twist sites) [31,42]. However, our DC detector is different than a detector built strictly from the $D_{DC}$ data set (the set of all 12 orthologs for each melanogaster locus), since we included additional ortholog CRMs of the NEEs.

Furthermore, within the NEEs one could imagine that the spacer has diverged in species that we analyzed that were not analyzed previously, and our choice of the spacer window is an interval not the same as previous choices. For example, Papatsenko et al. [101,100] showed, by binning the spacers between Dorsal and Twist, that there were various optimal bins (namely 14 bp, 20 bp, and 53 bp). It is also possible that the spacer (the distance between Dorsal and Twist) in D. melanogaster has further diverged in its ortholog species, in particular those not previously analyzed and annotated.

Szymanski et al.
[121] used DU-like Dorsal sites in their systematic study of the role spacing has between Dorsal and Twist sites, suggesting that Dorsal and Twist still cooperate if one uses a DU-like binding site, which is further corroborated by systematic studies from Fakhouri et al. [43] that also used A-tract Dorsal sites for the primary Dorsal sites. These studies suggest evolution could have fixed either a DC or a DU type site at an NEE locus utilizing Dorsal-Twist linked sites for synergy, which would weaken our claim that DC and DU are really different types of Dorsal sites. However, it is highly unlikely that all these sites would have become fixed with the same sequence unless they were functional or else if the CRMs containing them were duplications.

2.7.2 The OR gate and the CB detector

The OR gate scores any input k-mer with both conditional detectors DC and DU, and then outputs simply the lowest energy score. Similar detectors have been represented in the literature as a Hidden Markov Model or as a mixture model [95,59]. Each component of the mixture is simply a conditional PWM, where the mixing frequencies are estimated as the fraction of training data that is associated with a particular component (or class) of the mixture. The mixture is defined as:

$$P(S) = \sum_c \frac{\exp\left(-E(S|C=c)\right)}{Z_c}\,P(C=c), \tag{2.25}$$

where $E(S|C)$ is in units of $\beta^{-1}$ (which is further assumed to have been calibrated to thermal energy units), and $Z_c = \sum_{S\in\mathcal{S}} \exp\left(-E(S|C=c)\right)$, where $\mathcal{S}$ is the set of all possible k-mers, $|\mathcal{S}| = 4^k$.
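The OR gate and the mixture of Equation (2.25) can be sketched in a few lines. This is an illustrative sketch only: the two length-2 energy matrices below are made-up toy values (not the trained DC and DU PWMs), and the function names are hypothetical. It relies on the fact that, for a position-independent energy model, $Z_c$ factorizes into a product of per-position sums, so the $4^k$ k-mers never need to be enumerated.

```python
import math
from itertools import product

BASES = "ACGT"


def pwm_energy(energy_matrix, kmer):
    """Sum of position-specific energies (thermal units) for one k-mer."""
    return sum(energy_matrix[i][base] for i, base in enumerate(kmer))


def or_gate_energy(components, kmer):
    """OR gate: score with every component and keep the lowest energy."""
    return min(pwm_energy(matrix, kmer) for matrix in components)


def partition_function(energy_matrix):
    """Z_c over all 4^k k-mers; factorizes over positions for a PWM."""
    z = 1.0
    for column in energy_matrix:
        z *= sum(math.exp(-column[base]) for base in BASES)
    return z


def mixture_probability(components, priors, kmer):
    """Equation (2.25): sum_c exp(-E(S|c)) / Z_c * P(c)."""
    return sum(prior * math.exp(-pwm_energy(matrix, kmer)) / partition_function(matrix)
               for matrix, prior in zip(components, priors))


# Toy energy matrices (hypothetical values, for illustration only).
dc = [{"A": 0.0, "C": 2.0, "G": 2.0, "T": 1.0},
      {"A": 1.0, "C": 2.0, "G": 2.0, "T": 0.0}]
du = [{"A": 0.5, "C": 1.5, "G": 1.5, "T": 1.5},
      {"A": 0.5, "C": 1.5, "G": 1.5, "T": 1.5}]
priors = [0.6, 0.4]

total = sum(mixture_probability([dc, du], priors, "".join(kmer))
            for kmer in product(BASES, repeat=2))
```

A k-mer would then be called a site when `or_gate_energy` falls below the chosen energy threshold; `total` checks that the mixture is normalized over all k-mers.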
The CB detector is the traditional position independent probability model (PWM) of binding sites, where the PWM is constructed by aligning all of the sites in the $D_{CB}$ data simultaneously. Recall from Equation (2.1)

$$P(S) = \prod_i P(S_i), \tag{2.26}$$

where, as a consequence of the law of total probability, $P(S_i) = \sum_c P(S_i|C=c)\,P(C=c)$. However, for a sequence of bases,

$$\prod_i P(S_i) = \prod_i \sum_c P(S_i|C=c)\,P(C=c) \neq \sum_c \prod_i P(S_i|C=c)\,P(C=c),$$

where the last expression is the mixture of Equation (2.25), and is equivalent to a marginalization of the sequence over the classes. The mixture distribution of the sequence over classes can only be factorized as a product of position distributions given the class. We justify our approximation of the marginal sequence distribution over classes as a PWM (the CB PWM) in the Supplement section 2.10.3.

The mixture model was used by Hannenhalli et al. [59] in a similar form as the OR gate, where a given transcription factor's binding preference was described by two PWMs. There the authors scanned a given CRM or promoter with both PWMs and selected the highest scoring sites as hits, where the threshold for a hit was determined by the mixing frequencies, i.e., the proportion of known sites that are used in constructing each PWM. Upon scoring all the sites within their promoters, the scores were ranked for a given PWM, and then the fraction of sites equal to the mixing frequency were considered positives. This method is different than the OR gate presented here in that we do not use the mixing frequencies in discriminating Dorsal binding sites from background DNA. The OR gate discriminates sites from non-sites by checking if the minimum (i.e., best) score of the component detectors is below the energy threshold. By always choosing the lowest energy score among the given components as the detector's overall energy score, the benefit of an increased True Positive Rate of the detector is partially cancelled by the cost of an increased False Positive Rate. However, this cost is only in effect at high energies (non-specific sites), where it is unlikely that evolution or physical binding is having any functional effect on the organism. Hence, the OR gate is a useful model for increased sensitivity in the low energy regime.
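The distinction drawn above between a product of class-averaged position distributions (the CB-style PWM) and a mixture of class-conditional products (Equation (2.25)) is easy to see numerically. The sketch below uses a deliberately extreme, hypothetical two-class example in which one class always emits "AA" and the other always emits "TT"; the function names are illustrative.

```python
def product_of_marginals(class_pwms, priors, kmer):
    """CB-style PWM: prod_i sum_c P(S_i | c) P(c)."""
    prob = 1.0
    for i, base in enumerate(kmer):
        prob *= sum(prior * pwm[i][base] for pwm, prior in zip(class_pwms, priors))
    return prob


def mixture_of_products(class_pwms, priors, kmer):
    """Mixture model: sum_c P(c) prod_i P(S_i | c)."""
    total = 0.0
    for pwm, prior in zip(class_pwms, priors):
        site_prob = 1.0
        for i, base in enumerate(kmer):
            site_prob *= pwm[i][base]
        total += prior * site_prob
    return total


# Hypothetical classes: class 1 is always "AA", class 2 is always "TT".
only_a = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
only_t = {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0}
class_pwms = [[only_a, only_a], [only_t, only_t]]
priors = [0.5, 0.5]

# The marginal PWM leaks probability onto "AT", which neither class can emit.
leak = product_of_marginals(class_pwms, priors, "AT")
true_mix = mixture_of_products(class_pwms, priors, "AT")
```

In this toy case the marginal PWM assigns the same probability to "AA" and "AT", while the mixture assigns zero probability to "AT"; this is the information that is lost when the classes are lumped into one PWM.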
2.7.3 The CB data set, merging and dividing clusters of binding sites

The mixture distribution of Equation (2.25) implies the data set $D_{CB}$ over all loci cannot be aligned simultaneously to form an estimate of $P(S)$, as that only makes sense if one is constructing the CB PWM, which assumes no mixture. However, a priori, one may not know whether their set of binding site loci is a mixture of different types. To determine if there is a mixture in the data one must decide on how one will align the mixture model, and whether that alignment should be related to the case that one combines all the data indiscriminately to form an alignment for the CB PWM. Will all the training data over all classes be lumped together to estimate $P(S)$ for CB, and then the conditional probabilities estimated by partitioning their respective set of aligned sites? This technique is commonly used in the case that one is given a set of aligned data, and one wishes to find mixtures within the aligned data. Alternatively, will the training data be partitioned into the classes, and then each class aligned individually, and then these class-specific alignments simply be 'merged' to form an estimate of $P(S)$ for CB?

This additional complexity is analogous to the decision made in clustering motifs as to whether one wants a top-down approach (start from the root and partition), or a bottom-up approach (start from the leaves of the tree and merge (i.e. combine)), see for example [57][36][69]. We presented results for a bottom-up approach that aligns the training data $D_{DC}$ and $D_{DU}$ separately to estimate each conditional PWM DC and DU, then CB was based on merging the count matrices of the DC and DU data sets.
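The bottom-up construction described above (align each class separately, then merge the count matrices to obtain CB) can be sketched as follows. The aligned sites are hypothetical toy sequences built around the 5'-AAATT and 5'-AAAAT cores, and the pseudocount of 1 is an assumption made for illustration rather than the value used in this work.

```python
BASES = "ACGT"


def count_matrix(aligned_sites):
    """Per-position base counts for equal-length, already aligned sites."""
    k = len(aligned_sites[0])
    counts = [{base: 0 for base in BASES} for _ in range(k)]
    for site in aligned_sites:
        if len(site) != k:
            raise ValueError("all aligned sites must have the same length")
        for i, base in enumerate(site):
            counts[i][base] += 1
    return counts


def merge_counts(counts_a, counts_b):
    """Bottom-up merge: add two class-specific count matrices column by column."""
    return [{base: col_a[base] + col_b[base] for base in BASES}
            for col_a, col_b in zip(counts_a, counts_b)]


def frequencies(counts, pseudocount=1.0):
    """Convert counts to per-position probabilities with a flat pseudocount."""
    freqs = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        freqs.append({base: (column[base] + pseudocount) / total for base in BASES})
    return freqs


# Hypothetical aligned training sites (illustration only).
dc_sites = ["GGAAATTCC", "GGAAATTCC", "GGGAATTCC"]
du_sites = ["GGAAAATCC", "GAAAAATCC"]

dc_counts = count_matrix(dc_sites)
du_counts = count_matrix(du_sites)
cb_counts = merge_counts(dc_counts, du_counts)   # CB from merged counts
cb_pwm = frequencies(cb_counts)

# Mixing frequency of the DC class, estimated as its share of the training data.
p_dc = len(dc_sites) / (len(dc_sites) + len(du_sites))
```

Note that the merge is only meaningful if the two class alignments are in register; if they are shifted relative to each other, the merged columns mix different positions of the site.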
The top-down approach aligns all the sites together, then partitions out the classes, and from those partitions builds (without further alignment) the conditional probability PWMs, $P(S|C)$. The bottom-up method is guaranteed to achieve higher Mutual Information than the top-down approach. This is because the DC alignment will not necessarily be 'in register' with the independent DU alignment; for example the DU alignment may tend to have the first character with more than 0.5 bits of information shifted relative to the DC alignment (i.e. the binding start sites of these two sets are shifted). This in turn causes their marginalization to have an increased entropy due to mixing alignments out of register, which in turn causes the mutual information to be boosted, since the conditionals are both now substantially different than the marginal due to registration of the start sites. Based on results not presented, we found that the top-down approach still preserves the overall trend in Mutual Information versus spacer window, although the signal in the [0,30] bp window for distances of known Dorsal sites from the 5'-CAYATG motif is reduced by about 40%.

From a model comparison viewpoint, one may assume the strategy should be to align DC, DU and CB all separately, taking neither a top-down nor a bottom-up approach. This strategy does not bias either the mixture model (OR gate) or the CB PWM model. However, from results not presented, we found this has little effect on the logos for a length 9 bp alignment.
Albeit, some alignments (in an ensemble of alignments derived from Gibbs sampling) do choose a 'T' rich motif, as opposed to an 'A' rich motif. For our logos, in the case that a 'T' rich motif was found, we presented the logo derived from the reverse complement of each aligned sequence within the training data set so that all logos would be easily comparable by visual inspection. For longer than 9 bp length alignments of Dorsal binding sites, we found that the conditional Dorsal motifs frequently were not in 'register', where registration of the start sites of motifs is based on a motif-motif alignment program like STAMP [86], but can sometimes also be seen by inspection of the logos. For example, by inspection, the DC motif for a length 11 bp alignment may have had the first position in the alignment with greater than 0.25 bits of information content at position one (in a zero based coordinate system), while DU would have its first position in the logo with more than 0.25 bits at position zero (the start of the logo).

From a physical standpoint, it may be that the conditional binding sites do tend to have a shift in their binding start site positions. For example, Dorsal may bind to 5'-AAGGAAATTCC in a DC favored environment, while in the DU favored environment it binds 5'-GGAAATTCCAA. If the flanking A's really are a signal, then one must conclude the best representation for CB would be 5'-AAAGGAAATTCCAAA, which is a motif that is 3 nucleotides longer than either conditional. The bottom-up approach to clustering would miss this signal, since it would merge the two classes such that the CB PWM would contain a fraction of 'A's at the first position based on the proportion of the DC data relative to the CB data, and another fraction of 'G's at the first position due to the DU motif; thereby not only missing some of the signal, but also interfering with the captured portions of the signal. This particular case shows that trying to compare motifs based on having the starting position of the motif in register will potentially truncate a signal for the CB model. The logos in Figure 2.1 are in register (the starts of the binding sites are the same), which is based on our finding that length 9 alignments are the most reproducible in terms of registration. Once we had aligned the length 9 binding sites, we then appended the flanking sequence to each
already aligned locus, thereby having greater assurance that the CB PWM would not have this type of interference.

However, if one starts with an alignment that has flanking sequence to begin with (such as an alignment of length 15 bp), then one could try and discover if the aligned sites do contain a mixture of motifs without having to worry about the problems associated with choosing a strand (such as the 'A' rich strand), or whether the start sites are in register. Such an approach was used in Figure 2 of Barasch et al. [13]. However, we found that setting the Gibbs sampler alignment's length parameter strongly affects the alignment. For example, for a length 9 bp alignment, the starting position of the alignment may contain maximum information (2 bits); however, if one creates a length 15 alignment this signal (the information) is at least conserved, but may spread into the flanking sequence. This spreading changes the DNA makeup of the logo at various positions. This is partly due to there being so many more ways to spread out information among the positions of the alignment when one is using an objective function that runs over 15 positions as opposed to 9. For example, a subsequence of 5'-AAA that is completely conserved in a length 3 bp alignment, when allowed to be a length 5 bp alignment, may converge on 5'-AAAAA with the same information content as the 3 bp alignment (or slightly larger information content than the length 3 bp alignment, while having a smaller per position information content). This is due to a mixture over loci of the form: 5'-NAAAN, 5'-AAANN, 5'-NNAAA.
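The per-position information content referred to above (at most 2 bits for a four-letter alphabet, measured relative to a uniform background) can be computed from an alignment as sketched below. This is a minimal illustration with hypothetical function names; it uses raw frequencies and omits the small-sample correction that logo programs usually apply.

```python
import math

BASES = "ACGT"


def column_information(column_freqs):
    """Information content (bits) of one column relative to a uniform background."""
    entropy = -sum(f * math.log2(f) for f in column_freqs.values() if f > 0.0)
    return 2.0 - entropy


def alignment_information(aligned_sites):
    """Per-position information content for equal-length aligned sites."""
    n = len(aligned_sites)
    k = len(aligned_sites[0])
    info = []
    for i in range(k):
        freqs = {base: sum(site[i] == base for site in aligned_sites) / n
                 for base in BASES}
        info.append(column_information(freqs))
    return info


# A fully conserved 5'-AAA alignment carries 2 bits at every position.
conserved = alignment_information(["AAA", "AAA", "AAA"])

# Widening the window over loci of the form NAAAN, AAANN, NNAAA (hypothetical
# flanks) lowers the per-position information, illustrating the spreading.
spread = alignment_information(["CAAAG", "AAATC", "GTAAA"])
```

Comparing `sum(conserved)` with `sum(spread)` for real alignments is one way to check how much of the signal has moved into the flanking positions.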
2.7.4 Mixtures of asymmetric PWMs

In the case that one's data set of binding sites is primarily asymmetric, yet completely conserved at each position, then one can construct a two component mixture of PWMs that are significantly different by simply choosing each component PWM to represent each consensus of the two asymmetric motifs. For example, if one's data set is simply a collection of loci from a database where the sequences from the database were all identical of the form 5'-GGAAACC, then one could take half the loci to be 5'-GGTTTCC, and the other half 5'-GGAAACC. We found that when training the mixture model where one has no knowledge about what strand to use in the component PWMs (i.e. DC and DU), DC and DU frequently would converge on motifs that appeared to be reverse complements of one another. To resolve this problem one can symmetrize the PWM by using both strands in the construction of the PWM (see for example the paper by Bailey et al. [11] and the discussion of MNT operator sites by Fields et al. [44]).

A hypothetical DC consensus 5'-GGAATTTCC may seem to be a problem for symmetrization if the 'T' at position 4 (where the start site is position zero) would lose some of its signal by symmetrization. However, in a sense, this loss would be compensated by a gain in its complement's 'A' signal if one considers the symmetrical PWM to contain a smaller alphabet of symbols.³ Even for highly non-palindromic sites such as the 'A-rich' core in the Dorsal DU sites, one could still symmetrize the motif and hence not have to deal with issues of strandedness of sites when forming mixtures of PWMs. However, there are a number of reasons why mixtures of asymmetric PWMs are important, assuming one can deal with the potential artefacts such as a loss of the signal due to choosing opposite strands (such as in the 5'-GGAAACC data set example above, which would boost the mutual information).

³ This compensation may seem to be violated due to the loss of information content when one symmetrizes a pure 'T' signal at position 4 of the DC motif. However, the source of this apparent violation is due to comparing signals from two different sample spaces. For example, the original alphabet of 'A, C, G, T' has at most 2 bits of information. However, when one uses a symmetric PWM the alphabet size is no longer over 4 characters, but simply over 2 characters, hence the most the information can be is now 1 bit. Hence if one compares normalized information contents (normalized by alphabet size), then the normalized information contents are preserved. From another perspective, the choice to symmetrize is an additional bit of information gained that compensates the bit of information lost by the original 2 bit 'T' signal.
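Symmetrizing a PWM by using both strands amounts to adding the reverse complement of every site to the alignment before counting. The sketch below illustrates this with the 5'-GGAAACC example from the text; the function names are hypothetical.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(site):
    """Sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(site))


def symmetrized_counts(aligned_sites):
    """Count matrix built from every site together with its reverse complement."""
    both_strands = list(aligned_sites) + [reverse_complement(s) for s in aligned_sites]
    k = len(aligned_sites[0])
    counts = [{base: 0 for base in BASES} for _ in range(k)]
    for site in both_strands:
        for i, base in enumerate(site):
            counts[i][base] += 1
    return counts


sites = ["GGAAACC", "GGAAACC"]
counts = symmetrized_counts(sites)
```

By construction, column `i` of the symmetrized matrix mirrors column `k - 1 - i` with complementary bases, so the central 'A' column of 5'-GGAAACC becomes an even A/T column, which is the loss of strand information discussed above.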
Our decision to treat the component PWMs as 'asymmetric' PWMs was based on the assumption that the orientation of a site relative to cooccurring sites may be important, hence by using asymmetric PWMs we would allow ourselves the potential to decode any orientation information if it were encoded in the PWM (see Shultzaberger et al. for more ideas on this topic of strand [113]). The potential to decode context information based on the strandedness is equivalent to making predictions that a particular type of cooccurring binding site would be either in the 5' or in the 3' direction of the annotated binding site. This decodable information would also be physically related to cooperative or antagonistic interactions that are contingent on the binding site sequence. An example of this is the asymmetric extended Twist site 5'-CACATGT (extended from the E-box sequence CACATG, where a 'T' is concatenated to the 5' end of the E-box); this site has evidence that the 'T' at the 5' end must cooccur with a Dorsal site downstream of it (i.e. where downstream means in the 5' to 3' direction) [129]. For example, one would always find that evolution selected 5'-CACATGTNNNNGGAATTTCC and never 5'-GGAATTTCCNNNNCACATGT. Furthermore, one could imagine that the Dorsal site that cooccurs on the same strand in this case is 5'-GGAATTTCC, as opposed to 5'-GGAAATTCC, where position 4 is strand dependent. In this case, had one symmetrized the DC PWM they would have lost the information at position 4. Hence the higher information content of the asymmetric PWM is not an artefact of the asymmetric PWM⁴; rather, it is telling us that this is physically due to the preference of the 'A' at position 4 of the DC motif cooccurring on the same strand as the annotated Twist site (i.e. 5'-CACATGT). Similarly, the DU consensus 'core' 5'-AAAAT may be cooccurring on the same strand as other coregulating trans factors, and hence cooccurring with other motifs on the same strand.

⁴ This may be an artefact in that asymmetric PWMs always have greater (or at least equal) information content than symmetric PWMs.
2.7.5 Comparing models with unequal parameters

For optimization and model comparison, models that have more parameters may be guaranteed to have a higher likelihood value for a given data set, just as a Taylor expansion approximation of a function improves when one includes higher order terms. Assuming one has sufficient data to stay clear of fitting parameters to noise, and that the more complicated models are a more accurate description, a more complicated model may be useful. However, it is not true that the OR gate must perform better than CB due to the increased number of free parameters.

In detection, ensemble classifiers (such as the OR gate) can be used to increase the detection performance [22]. However, it is not true that ensemble detectors will always perform better than a single detector. The OR gate is a form of the more general ensemble detector, a machine that produces a decision based on component detector decisions, where the component detectors' decisions are pooled and processed in order to come to one overall decision of the ensemble detector. One example of processing the individual detectors' decisions is 'majority vote', where the most frequent decision among the component detectors is used as the machine's decision; the OR gate is another example of an ensemble detector. Ensemble classifiers have a necessary condition to perform better than a single classifier: the component classifiers (e.g. DC and DU) should be independent of each other [22].

DC and DU are independent if we use a conditional independence model for the joint distribution of an extended sequence $S$, where we now think of the extended sequence $S$ as a dyad (which can be decomposed into three variables that denote two binding sites separated by a variable spacer, like in the framework of BioProspector). In this probability model over the joint we partition an extended sequence $S$ (like a CRM) into two parts, $S'$ and $C$, where $S'$ is the k-mer sequence to be tested as a Dorsal site or background, and $C$ is the class determined by the spacer window, which either contains Twist or it doesn't.

Given that we observe the class $C$, then we can factorize the joint distribution of the sequence $S'$ being tested as a Dorsal site (i.e.
$S'$ is no longer a position dependent distribution). Without the above assumption DC and DU are dependent, hence the OR gate, which does not observe the class $C$, does not have independent component detectors. A consequence of this dependency is that the component detectors' errors are correlated (i.e. classification errors are correlated). For example, each component classifier's probability of a base 'G' at a given position in a binding site should be just as likely to be above the true value of the probability of G, $P(G)$, as it is below the true value; while for correlated detectors this does not occur.

Hence, it is not always true that increased complexity of a model will increase its performance on training data. Mixture models of binding sites are more complex than a standard PWM, and are best at capturing broad dependencies among the component classes. To a large extent, the difference of DC and DU is a localized dependency at a particular position (which can be seen as the difference of their cores 5'-AAATT and 5'-AAAAT), and a difference in the 'strength' of the sites (or the energy level spacing). The OR gate seems to perform better than CB for the low energy regime, where the localized difference between the cores of DC and DU is sufficient in distinguishing the energies of the consensus sequence of each conditional PWM. However, for the entire energy spectrum, where AUC can be used as a measure of performance, it seems the OR gate behaves about the same as CB, suggesting that DC and DU are not independent models. This is not surprising, as nonlinear models (like a mixture) are sensitive to certain regimes over their input, and similarly insensitive to certain values of their input (for example see Section 3.5.2).

2.7.6 Information that detectors have about Dorsal binding sites

In a physical NVE ensemble (fixed particle number $N$, volume $V$, energy $E$) the information content of the distribution of momentum and positions (the distribution function) is conserved. This means the number of bits necessary to store the position and momentum information is conserved in time relative to the maximum storage capacity defined by a lattice over phase space (the space of coordinates). For example, if the distribution function is a uniform distribution over phase space, it has zero information content.
Similarly, evolutionary systems under adaptive maintenance (purifying selection) conserve information stored in their genes [1]. The inheritance of information implies that parents pass a fixed number of bits to their progeny. And just as coordinates and momentum are not conserved in the NVE ensemble, in evolution sequences are variable, but the sequence's information content is conserved. However, when the fixed energy constraint of the NVE system is relaxed and the system exchanges energy with a much larger environment, the system's original information content may deteriorate until the system equilibrates with its surroundings. Biological systems harness energy from their environment to maintain their information content in the never-ending fight against the second law [2,29].

The mutual information between sequences and the OR gate's predictions in Figure 2.3 suggests that the conditional distributions of functional Dorsal binding sites have encoded synergistic and antagonistic information about flanking sequence features (presence of Twist) that causes the likelihood to correctly predict the presence of Dorsal to shift downwards in energy (as observed by the shift of the mutual information of the OR gate relative to CB in Figure 2.3). This shift may have been a necessary adaptation in the way Dorsal regulates its targets. For example, it is possible that at the phylum level, possibly before the neuroectoderm evolved, Dorsal only needed to regulate the mesoderm and ectoderm. When the neuroectoderm evolved, Dorsal evolved the ability to recognize two subtypes of binding site ensembles, a function that would help to resolve the neuroectoderm Dorsal targets from the more ancient germ layers (mesoderm and ectoderm). In this sense, Dorsal's adaptation to its local environment is seen as the shifted mutual information relative to the CB detector (which just treats all binding loci identically). Dorsal could then use this information to its advantage, in Dorsal real time so to speak, to make better decisions about binding.
The shift in the mutual information plot in Figure 2.3B is not as visible in the ROC curves in Figure 2.3A, in which we used the same TPR and FPR for the detectors. This is because, in general, energy level spacing is not accounted for in an ROC curve, implying that detectors with similarly ranked sequences may actually have different spacings between their energy levels, and the minimum energies of the scales may be shifted relative to one another. For example, DC's ground state is below CB's ground state, which is why the OR gate contains some information at negative energy (as DC's ground state is at about -0.8 in energy units as seen from the horizontal axis of Figure 2.3).

The degree to which the OR gate's ROC does appear shifted relative to CB's ROC in Figure 2.3A is partly due to the fact that the ranking of sequences of the DC detector and DU detector is very similar; it is the energy level spacing that is dramatically different between the conditional detectors. For example, using a substitution model that penalizes all mismatches from the consensus sequence with the same energy score (see the Appendix of Ref. [16] for details) leads to the elegant formula that a consensus base occurs with probability $1 - \frac{m}{3k}$, and that an error or substitution occurs with probability $\frac{m}{3k}$, where $k$ is the length of the sequence and $m$ is the number of mismatches from the consensus (the 3 in the denominator is due to the three ways a mismatch from a consensus DNA nucleotide can occur). Weak sites will be seen to have large $m$, which to a degree can be seen as the DU training data. Similarly, strong sites will have small $m$, which can be seen as the DC data. Hence in this substitution model, the difference between DC and DU is not in the ordering of their ranked sequences; rather, the difference lies in their energy level spacings (which can be seen by changing $m$, which affects the energy spacing formula, Equation (2.10)).

This picture of DC functional sequences being a strong version of DU's sequences is consistent with our finding that their median energies differ by almost two units, and with Papatsenko et al.
's [101] finding that Dorsal binding sites necessary in limiting concentrations of Dorsal protein (such as in the neuroectoderm) tend to have higher information scores (lower energy scores) than other Dorsal sites such as sites active in the mesoderm [101]. It is also consistent with the mathematical definition of "specialized" sites from Erives et al. [42] and the D sites of Crocker et al. [31], who defined these sites based on how they were detected (similar to MEME's One Occurrence Per Sequence setting (OOPS) [11], the specialized sites were one site per NEE CRM sequence, where each discovered site shared the highest sequence similarity between the selected sites between the CRMs), which in a sense is the Dorsal site that had the slowest mutation rate (i.e., under the strongest purifying selection).

2.7.7 Conditional detectors

In Figure 2.4 we see that the DC detector can resolve whether a Twist site is in the spacer window or not in the regime where $E(S|C) < 3$ (see Equation (2.15)). The resolution is not perfect in this regime: the DC detector still has an error rate, which we define as $1 - 2^{-H(C|P)}$, where the conditional entropy is defined as:

$$H(C|P) = H(C) - I(C;P). \tag{2.27}$$

The conditional entropy, $H(C|P)$, is simply the uncertainty of $C$ given $P$. But what does this mean for a DC detector? We interpreted this conditional uncertainty as a measure of the detector's uncertainty about the underlying Dorsal binding site sequence given how well it predicted its context. For example, if we assume $H(C) = 1$ bit while DC's information is $I(C;P) = 0.3$, then plugging into Equation (2.27) we have

$$H(C|P) = 1 - 0.3 = 0.7 \text{ bits}, \tag{2.28}$$

and hence Dorsal has decreased its uncertainty about its context.
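The quantities in Equations (2.27) and (2.28) can be computed from a table of actual versus predicted classes. The sketch below is illustrative only: the contingency counts are hypothetical, the function names are not from the original analysis, and reading the error rate as $1 - 2^{-H(C|P)}$ and $2^{H(C|P)}$ as an effective number of guesses are assumptions about how the text's expressions are meant.

```python
import math


def mutual_information_from_counts(table):
    """I(C;P) in bits from a contingency table: rows = actual C, columns = predicted P."""
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]
    col_p = [sum(col) / total for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            if n > 0:
                p = n / total
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi


def conditional_entropy(h_c, mi):
    """Equation (2.27): H(C|P) = H(C) - I(C;P)."""
    return h_c - mi


def effective_guesses(h_cond):
    """Assumed reading: 2**H(C|P) as the effective number of guesses."""
    return 2.0 ** h_cond


def error_rate(h_cond):
    """Assumed reading of the text's error rate: 1 - 2**(-H(C|P))."""
    return 1.0 - 2.0 ** (-h_cond)


# Worked example from the text: H(C) = 1 bit and I(C;P) = 0.3 bits.
h_cond = conditional_entropy(1.0, 0.3)

# Hypothetical contingency counts (proximal/distal versus predicted class).
mi_example = mutual_information_from_counts([[30, 10], [12, 28]])
```

With no information (`h_cond = 1.0`) the sketch gives two effective guesses, matching the random-guessing limit for a binary class variable.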
If the mutual information $I(C;P)$ was maximal (1 bit), then Dorsal could predict with perfect accuracy whether Twist was proximal or distal. At the opposite extreme where the Dorsal detector does no better than random guessing, we see that it would take about two guesses on average to predict if Twist will be near a binding site sequence. From an evolutionary point of view, the information $I(C;P)$ encoded in Dorsal binding sites can be seen as a message passed from an ancestral population of flies to its descendants. Here, the message instructs Dorsal to interact with Twist, and is encoded in the DNA of Dorsal binding sites.

2.7.8 Forms of Conditional Detector's score

We have defined the conditional energy as the CB energy plus a shift depending on the cis-context of the binding site locus. A similar bioinformatic measure is to solely use the conditional probability distribution over sequences to construct a bioinformatic scoring function as $\log\frac{P(S^o_C|C)}{P(S|C)}$, where $S^o_C$ is the consensus sequence under condition $C$. This form of the discriminant function can be transformed to our form, $E(S|C) = \log\frac{P(S^o)}{P(S)} - \log\frac{P(S,C)}{P(S)P(C)}$, which differs from the bioinformatic score by the simple additive term $\log\frac{P(S^o_C|C)}{P(S^o)}$, where $S^o$ is the consensus sequence of the CB distribution. The advantage of our technique is that mixtures of data sets of different sub-types of binding sites that are heavily weighted by the consensus sites of the mixture component distributions can be shifted relative to the consensus site of the lumped data set, such as $D_{CB}$. Hence, we are able to resolve quantitatively the amount of energetic shift, while purely conditional based scoring functions will set each detector's consensus score to zero. Information scores $I(S) = \log\frac{P(S)}{P_o(S)}$, or log likelihood ratios, are also commonly used for bioinformatic detection, and can be transformed to our score by adding an additional conversion term that cancels the logarithm of the background distribution. Information scores have the advantage over purely conditional energy scores in that they allow all bioinformatic scores to be based on a common scale (the background), which is equivalent to an energy score with a common reference point in the case that one uses a
uniform distribution as the background. This is because in sequence space⁵ one can imagine a point that represents a uniform linear combination of all sequences; then the information scores can be seen as Hamming distances away from that point, just as energy scores are basically Hamming distances measured from a reference point in sequence space. Our aim was more than just discriminating Dorsal sites from random k-mers; we wanted to resolve differences within Dorsal binding sites. Our score has the advantage of allowing one to resolve the most important region of that scale (shifts away from the CB ground state), which cannot be done with the pure information score that uses a uniform background.

⁵ In this context each position of a sequence can be thought of as a 4 dimensional real vector space. Then the length $k$ sequence is a $4k$ dimensional real vector space, where each point simply determines the scaling along each dimension (such as the counts that a particular nucleotide base is observed in an alignment at a given position, or the energy score for that base at a given position).

2.8 Conclusion

Position Weight Matrices represent a linear coarse-grained physical lattice model of DNA-transcription factor binding. At the DNA sequence level and at the level of Darwinian selection, PWMs represent one of the simplest possible linear models. In the case that each position within a binding site is independently interacting with the protein binding domain, it makes sense to use a simple model for binding since the affinity (the phenotype) is linear, and hence natural selection may behave as if it were a linear model. However, binding site sequences may be dependent, and hence linear models will miss important information. By conditioning PWMs based on the variables that are causing the dependency structure within binding sites, it is possible to resolve the binding sites into independent classes that can then each be modeled as conditionally independent PWMs.
The introduction of nonlinear sequence models into binding site sequence models is known to help improve binding site sequence detection, and to give a more realistic perspective to binding site models. A number of groups have introduced similar models for discovery of co-occurring motifs [84,12,50,27,13,58,102,56,83,92]. In addition, others have looked at the effect of symmetries in the sequence of binding sites [111,3]. Here we placed our analysis in the context of Berg and von Hippel's population genetics model that is related to thermodynamics, and hence the interaction term could be placed inside of thermodynamic occupancy models of transcription factors.

Our conditional PWMs account for epistatic interactions between Dorsal binding sites and their cis-context. We showed that Dorsal binding sites contain on average around 0.5 bits of information about the presence of Twist in the flanking sequence of each Dorsal site (see Table 1), thereby contributing to disentangling the dependency structure of Dorsal binding sites active in development. In the future, our model can be incorporated in the annotation of binding sites of regulatory regions, and could be used for modeling cooperativity and antagonistic interactions directly from the sequence level. Such models could be used by occupancy models of transcription factors that predict gene expression, such as those in Refs. [61,43].

2.9 Methods Supplement

2.9.1 Alignment of cis-regulatory modules and collection of $D_{CB}$

We used the sequence editor SEAVIEW's default MUSCLE multiple sequence alignment settings [47] to align a given gene's 12 orthologous CRMs [38]. Given the alignment we then manually extracted the blocks that contained the known D. mel binding sites, which allowed for flanking sequence to be extracted (these blocks on average spanned about 15 bp with no gaps across the 12 species). The data set of all 12 extracted orthologs (sometimes less than 12 if a binding site was not in the block) for all the D. mel Dorsal binding site loci is labelled as $D_{CB}$. The $D_{CB}$ data set is available at: https://github.com/jacob and the raw CRM data (the fasta of each of the 12 orthologous CRMs) is available at: https://github.com/jacobt.
2.9.2 CRM alignment using MUSCLE

MUSCLE (multiple sequence comparison by log-expectation) is an alignment tool for protein sequences, which we co-opted for alignment of cis-regulatory modules. MUSCLE uses a similar algorithm to CLUSTALW, both of which use the 'sum of pairs' (SP) objective function to determine the best multiple sequence alignment (MSA). I will first outline the SP score, and then discuss how MUSCLE is different than CLUSTALW.

Fundamentally, sequence alignment for strings of amino acid symbols or strings of DNA symbols rewards symbol matches and penalizes mismatches. A scoring system is designed to determine the rewards and penalties based on the goals and purpose of the alignment, where the Altschul (the BLAST cofounder) and Karlin system is commonly adopted due to it having an analytic solution for calculation of the p-value for hypothesis testing (where the null hypothesis is that the sequences are evolutionarily independent (random)).

Regardless of whether one has a sequence of amino acids or DNA, the match and mismatch score is based on a 'similarity matrix' such as a Dayhoff matrix (a 20x20 matrix for amino acids) or a Henikoff matrix, BLOSUM, which is essentially a look-up table that determines the score for aligning any two symbols (the table treats all positions of the sequences the same, unlike PWMs). For example, Dayhoff estimated her similarity matrix partially based on recently diverged proteins from a number of known protein families (gene families) that were available at the time of construction (around 1970) with sequences that had a known alignment. By counting the frequency of alignment between any two symbols, and hence substitutions between symbols, the joint probability, $p(x,y)$, of seeing symbol $x$ at some arbitrary time and symbol $y$ at some different time could be determined. Hence the score is simply $s(x,y) = \log\frac{p(x,y)}{p(x)p(y)}$, where $p(x)$ is the probability of symbol $x$ in any protein sequence (similarly for $p(y)$).
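The log-odds score $s(x,y) = \log\frac{p(x,y)}{p(x)p(y)}$ can be estimated from symbol pairs observed in trusted alignments. The following is a simplified sketch of that counting procedure, not a reconstruction of how the Dayhoff or BLOSUM matrices were actually built; the function name and the toy pairs are hypothetical, and pairs that never occur in the input are simply left unscored.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Estimate s(x, y) = log2 p(x, y) / (p(x) p(y)) from aligned symbol pairs."""
    pair_counts = Counter()
    symbol_counts = Counter()
    for x, y in aligned_pairs:
        # Count both orders so that the score matrix is symmetric.
        pair_counts[(x, y)] += 1
        pair_counts[(y, x)] += 1
        symbol_counts[x] += 1
        symbol_counts[y] += 1
    n_pairs = sum(pair_counts.values())
    n_symbols = sum(symbol_counts.values())
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = symbol_counts[x] / n_symbols
        p_y = symbol_counts[y] / n_symbols
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy aligned columns (hypothetical data): mostly identities, one substitution.
pairs = [("A", "A"), ("A", "A"), ("T", "T"), ("A", "T")]
scores = log_odds_scores(pairs)
```

Pairs that co-occur more often than expected from the symbol frequencies get positive scores (rewards), and pairs that co-occur less often get negative scores (penalties).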
For pairwise alignment (two sequences being aligned), given a scoring system, there are two exhaustive methods of alignment (i.e. that look at all possible ways to align the two sequences). These two methods are distinguished based on their 'boundary conditions', and are called global and local. Global alignment means all the symbols of the sequences must be aligned and the score across all the aligned symbols must be accounted for in the pairwise alignment score, where the best pairwise alignment (the optimum) is the maximum over all possible alignments. Local alignment, in contrast, does not require that all the symbols of the sequences be aligned; the best subsequence pairwise alignment is considered to be the alignment, where now the pairwise alignment score simply accounts for the length of the subsequences, while again the best scoring aligned sequences are taken as the optimum. Both methods, global and local, assume that no transpositions (or crossing over) occur (e.g. one must align consecutive characters of a sequence to consecutive characters of the other sequence or to a so-called indel character (gap or insertion)). In short, global alignment means all characters between the two sequences must be aligned (with the possibility of gaps and insertions), while local alignment means only a subset of the characters between the two sequences must be aligned.
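The difference between the global and local boundary conditions can be made concrete with the standard dynamic programming recursions (Needleman-Wunsch for global, Smith-Waterman for local). The sketch below assumes a simple match/mismatch score and a constant (linear) gap penalty; the default parameter values and function names are illustrative choices, not the settings used by MUSCLE or CLUSTALW.

```python
def global_alignment_score(x, y, match=1.0, mismatch=0.0, gap=1.0):
    """Needleman-Wunsch score: every symbol of x and y must be aligned."""
    n, m = len(x), len(y)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -gap * i          # prefix of x aligned entirely to gaps
    for j in range(1, m + 1):
        S[0][j] = -gap * j          # prefix of y aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,   # align x_i with y_j
                          S[i - 1][j] - gap,     # x_i aligned to a gap
                          S[i][j - 1] - gap)     # y_j aligned to a gap
    return S[n][m]


def local_alignment_score(x, y, match=1.0, mismatch=-1.0, gap=1.0):
    """Smith-Waterman score: best-scoring pair of subsequences."""
    n, m = len(x), len(y)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0.0,                   # start a new local alignment
                          S[i - 1][j - 1] + s,
                          S[i - 1][j] - gap,
                          S[i][j - 1] - gap)
            best = max(best, S[i][j])
    return best
```

The only structural differences are the boundary initialization and the extra `0.0` option in the local recursion, which lets an alignment restart anywhere instead of being charged for unaligned flanks.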
The pairwise alignment score of two sequences $x$ and $y$ that are clones of one another is $d(x,y) = \sum_{i}^{L} s(x_i, y_i)$, where $i$ is the internal position of each sequence, and $L$ is the length of the sequence. If we now allow gaps between these sequences there are a total of $\binom{2n}{n}$ possible scores (many of which are degenerate). The gap penalty is usually a constant (independent of the nucleotide or amino acid symbol aligned to the gap). For example, for a simple scoring system where $s$ is 'one' for a match and 'zero' for a mismatch (the identity matrix), and we only have a constant gap penalty, we see that there are simply $L+1$ possible pairwise alignment scores from the $\binom{2n}{n}$ possible alignments. In general, the computation of $d(x,y)$ is accomplished through dynamic programming, where one progressively computes the globally optimal $d(x,y)$ alignment by using the recursion relation $S(x_i, y_j) = \max\big(S(x_{i-1}, y_{j-1}) + s(x_i, y_j),\; S(x_{i-1}, y_j) - g,\; S(x_i, y_{j-1}) - g\big)$, where $S(x_i, y_j)$ is the partial sum of scores up to position $i$ of sequence $x$ and position $j$ of sequence $y$, and $g$ is the gap penalty.

Once one has chosen what type (global or local) of pairwise alignment (i.e. $d(x,y)$ score) to use, then to estimate a MSA for $n$ sequences using a SP objective function one must create an $n \times n$ distance matrix (same idea as in 2.9.5), where the elements of the matrix are the pairwise alignment scores between any two sequences ($d(x,y)$). Once one has a distance matrix any number of clustering algorithms can be used to begin to group sequences based on similarity; in particular a minimum spanning tree must be built. Upon construction of the minimum spanning tree a pairwise alignment is arbitrarily chosen and sets the foundation of the MSA (a pair of leaves in the tree), then additional sequences are systematically 'added' to this foundation, where upon each addition the MSA grows until all $n$ sequences have been 'added' to the MSA. The addition of sequences to the initial foundation is not arbitrary: the entire point of computing the minimum spanning tree was to guide precisely when each sequence should be added to the growing MSA (see Figure 1 of Edgar [39]). Hence the MSA grows (is built) by following the branching of the tree, where at each branch point an alignment occurs. This is called progressive alignment, and is the basic idea behind CLUSTALW.

For example, given the minimum spanning tree (suggestively called a guide tree) the first alignment is simply the pairwise alignment already calculated; the rest of the alignments are either sequence-profile alignments, where a profile is an existing alignment (it is like a PWM with the possibility of gaps as a symbol), or 'profile-profile' alignments (where the columns of a profile are now regarded as a symbol, and standard global or local alignment is performed on these symbols, similar to STAMP for PWM-PWM alignments, where the 'match' function between two columns is called the profile function, which is called the log expectation score in MUSCLE).
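The global recursion relation for $S(x_i, y_j)$ can be sketched as a short dynamic program. This is a minimal illustration assuming a constant (linear) gap penalty $g$ and the identity scoring matrix as defaults; the function name `global_alignment_score` is hypothetical and this is not the MUSCLE or CLUSTALW implementation.

```python
def global_alignment_score(x, y, match=1, mismatch=0, gap=1):
    """Optimal global pairwise score d(x, y) with a constant gap penalty."""
    rows, cols = len(x) + 1, len(y) + 1
    # S[i][j] holds the best score for aligning x[:i] with y[:j].
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = -gap * i          # x[:i] aligned entirely to gaps
    for j in range(1, cols):
        S[0][j] = -gap * j          # y[:j] aligned entirely to gaps
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + s,  # align x_i with y_j
                S[i - 1][j] - gap,    # align x_i with a gap
                S[i][j - 1] - gap,    # align y_j with a gap
            )
    return S[-1][-1]

identical = global_alignment_score("acggt", "acggt")
one_deletion = global_alignment_score("acggt", "acgt")
```

Filling an $n \times n$ table with such scores for every pair of sequences gives the matrix that the clustering step then consumes.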
A novelty of MUSCLE, as suggested by the LE in MUSCLE, is the log expectation score used for profile-profile alignment (see Eq. 1 from Edgar [40]). Furthermore, MUSCLE also uses a $k$-mer counting scheme to determine the matrix elements of the distance matrix (as opposed to the global or local pairwise alignment score), where it is thought that $k$-mer co-occurrences correlate well with sequence similarity (see Eq. 1 of Edgar [39]). The $k$-mer counting scheme is similar in spirit to the BLAST algorithm (local alignment between a short and long sequence), which is popular for its speed and high throughput (time and space computational complexity). It appears MUSCLE looks at all possible 6-mers when constructing the similarity measure (and doesn't count overlaps).

The MUSCLE algorithm is shown coarsely in Figure 1 of Edgar []. The initial step is identical to CLUSTALW with the exception that MUSCLE uses $k$-mer counting to define its distance matrix elements. MUSCLE also uses UPGMA clustering to build the guide tree, while CLUSTAL (CLUSTer ALignment) uses neighbor-joining, for reasons discussed by Edgar []. The initial guide tree is then used to progressively align the data based on the branch ordering of the tree, where MUSCLE uses the LE score while CLUSTAL uses a LA score that is a slightly different profile score (the difference between LE and LA is the same as our using a CB PWM to define P(S) and the true mixture to define P(S), where a true mixture cannot be 'additive' when the logarithm is used for scoring).

Upon construction of the MSA based on the guide tree, MUSCLE repeats the entire procedure using the Kimura distance to build a distance matrix (the Kimura distance requires an alignment, which is possible, since the initial iteration in MUSCLE constructs an alignment). Clustering is again performed on the distance matrix (with matrix elements based on the Kimura distance between any two sequences), and again the guide tree (representation of clustering) is used to progressively build the new MSA.
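To make the $k$-mer counting idea concrete, here is a minimal sketch of an alignment-free distance. It assumes the similarity is the fraction of shared $k$-mers, normalised by the number of $k$-mers in the shorter sequence, and that the distance is one minus that fraction; this follows the spirit of the counting scheme described above but is not claimed to reproduce MUSCLE's exact measure (overlapping $k$-mers are counted here, for instance). The names `kmer_counts` and `kmer_distance` are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every (overlapping) k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(x, y, k=6):
    """Distance = 1 - (shared k-mers) / (k-mers in the shorter sequence)."""
    shortest = min(len(x), len(y))
    if shortest < k:
        raise ValueError("both sequences must be at least k symbols long")
    counts_x, counts_y = kmer_counts(x, k), kmer_counts(y, k)
    # A k-mer is shared as many times as it occurs in both sequences.
    shared = sum(min(n, counts_y[word]) for word, n in counts_x.items())
    return 1.0 - shared / (shortest - k + 1)

same = kmer_distance("acggtacggt", "acggtacggt", k=3)
disjoint = kmer_distance("aaaaaa", "cccccc", k=3)
```

Because no alignment is computed, every entry of the distance matrix costs time linear in the sequence lengths, which is the speed advantage the text refers to.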
A third type of 'iteration' occurs in the MUSCLE algorithm (after the newly generated MSA from the Kimura distance), where the guide tree is randomly partitioned into two subtrees, and a profile (MSA) is built using each subtree as a guide (the profile is progressively built, and hence is an MSA). Then profile-profile alignment is performed (similar to STAMP PWM-PWM alignment). The number of random partitions of the guide tree, and hence of profile-profile alignments, is a parameter the user can adjust; otherwise MUSCLE will automatically halt once the partitions fail to improve the alignment.

2.9.3 MUSCLE is both fast and accurate

There is almost always a tradeoff between speed and accuracy in computation. According to the online MUSCLE documentation at http://www.drive5.com/muscle/manual/accurate.html: "The default settings in MUSCLE were chosen for best accuracy rather than making any compromises for speed, so if you want the best accuracy you should probably stick with the defaults unless you have a new and better parameter tuning method." This suggests that our use of default parameter settings for CRM MSA would result in the best possible alignments from the MUSCLE tool. However, MUSCLE was designed for protein alignments, and even for protein alignments, the limited availability of high quality benchmark data sets means the MUSCLE parameters were tuned to a small subset of protein families and their peculiarities. Hence the generality of the MUSCLE parameter settings beyond the few protein families in the benchmark data sets is questionable.

The speed and accuracy of MUSCLE for a given data set is determined by the number of 'iterations'. Faster alignments are achieved by simply reducing the number of iterations, and more accurate alignments are achieved by simply increasing the number of iterations. The 'iterations' that can be adjusted (as a parameter) are the 'refinement' iterations, each of which randomly deletes a branch of the guide tree, and then does a profile-profile alignment on the two subtrees (i.e. the two data sets of sequences resulting from the partition (branch deletion)).
2.9.4 MUSCLE parameter sensitivity

A possible parameter that our CRM alignments, and hence our CB data set $D_{CB}$, are sensitive to is the choice of substitution matrix. An amino acid 'substitution' doesn't even make any sense for CRMs, albeit using a substitution matrix for aligning regulatory sequences is unlikely to do harm, seeing that in all cases a substitution matrix rewards exact matches; the problem is that it will also reward certain cases of mismatches or at least fail to penalize the mismatch. However, by default, if a DNA sequence is given to MUSCLE the substitution matrix is not used (presumably because the reading frame is not known, albeit some MSA tools (unlike MUSCLE) will generate all possible reading frames automatically (there's only three of them)). Furthermore, although MUSCLE leaves the option of using BLOSUM or PAM matrices for constructing distance matrices, as I understand it, the default of MUSCLE uses $k$-mer counting as the distance measure, and I see no a priori reason why that cannot be used for CRMs. An exhaustive list of parameters (options) for MUSCLE is described in the Command Line Reference section at: http://www.drive5.com/muscle/manual/valueopts.html. The Seaview editor allows these options (parameters) to be varied by setting them in the 'Alignment Options' tab available from the 'Align' pulldown menu available through Seaview's toolbar at the top of Seaview's GUI window.

MUSCLE uses an 'affine' gap score (penalty) that penalizes for opening, extending, and closing a gap. This is the same scoring technique as CLUSTALW, and like in CLUSTALW, increasing the gap score by a factor of ten will usually have visible effects by causing fewer 'blocks' to appear in the alignment, where the blocks that do appear are commensurately longer. The default gap open score was 10, and I found that in some cases this value could possibly be increased by an order of magnitude, as some known binding sites that have had indels will likely have a gap inserted in the alignment. This creates more work for me, as I prefer the binding site block to have no gaps, which allows for faster 'sites' file creation where we extract the block of putative sites and the known mel site. This is done by exact search in the Seaview editor for a known mel site in the CRMs. Once found, the site and everything it had aligned with/to are extracted and placed in their own fasta file, which is done in Seaview by using the 'site' pulldown menu and selecting 'create set', which allows for highlighting blocks (subsegments of an MSA) of aligned sequence that can then be automatically saved as new fasta files.

The 12 CRMs that are associated with each target gene along with shadow enhancers (a distinct set of 12 CRMs associated with some target genes, such as vndV (vndVentral) and sogS and brkS (brkShadow), where many of the shadows we did not use, such as vndV and brkS) are available at: https://github.com/jacobment.

2.9.5 GEMSTAT modification for locus annotation of CRMs

Given the known sites, $D_{CB}$, with their sequence defined from the block alignments, we then created an exact search algorithm that allowed us to estimate the coordinate of the known site within the regulatory region for data structures used by GEMSTAT, a platform developed in the Sinha lab [61]. This program has various inputs, most of which are irrelevant for our current purposes; the relevant input is a set of PWMs that represent transcription factors, and the set of CRMs regulated by these factors' motifs (PWMs). We extended the GEMSTAT input to allow for a raw Fasta file of variable length binding sites (namely our Fasta file for the data set $D_{CB}$). Each of these sites and its reverse complement was transformed to a probability PWM 'singleton' representation (the probability is equal to one for the observed nucleotide and zero for the other nucleotides). The longest binding site of length $n$ in the Fasta file determined the length of all the singletons. Binding sites in the Fasta file that were of length $k$, where $k$

$$\sigma^2_{P(B)} = \frac{\alpha_B(\alpha_0 - \alpha_B)}{\alpha_0^2(\alpha_0 + 1)}, \qquad (2.32)$$

where $\alpha_B$ are the concentrations (hyperparameters) for $B = A, C, G, T$, and $\alpha_0 = \sum_B \alpha_B$. After we observe the sample (the alignment), the variance changes because we've gained new information. We can (as a consequence of 'conjugacy') simply recycle the formula above with a change of variables $\alpha'_B = \alpha_B + n_B$, $\alpha'_0 = \alpha_0 + n$; this leads to the posterior variance:

$$\sigma^2_{P(B)\,\mathrm{post}} = \frac{(\alpha_B + n_B)(\alpha_0 + n - \alpha_B - n_B)}{(\alpha_0 + n)^2(\alpha_0 + n + 1)}. \qquad (2.33)$$

We chose to use a symmetric hyperparameter, $\alpha$, where $\alpha_B = \alpha\ \forall B$, which can be thought of as a 'pseudocount'. Berg and von Hippel used the same analysis with the standard maximum entropy prior, which they detail in their appendix [16].

2.9.12 Detector energy thresholds, $E_c$

From a bioinformatic perspective, a detector's conditional PWMs or the CB PWM must test each potential 9-mer in a CRM by making a prediction as to whether the 9-mer is a binding site or random background DNA. Hence the prediction is a binary classification that labels each 9-mer as positive or negative. The positive sites indicate the 9-mer's energy is below an energy threshold, $E_c$ (critical energy), while negative sites have 9-mer sequences with energy above the energy threshold.

We define the bioinformatic specificity as the number $n$ of sequences of length 9 bp that are considered a positive binding site due to their energy being below the critical energy, divided by the total number $N$ of possible sequences of length 9 bp, where $N = 4^9$. Hence the bioinformatic specificity is $\frac{n}{N} = \sum_{S \in \mathcal{S}} P(S)\,\Theta(E_c - E(S))$, where $\mathcal{S}$ is the set of all possible sequences (i.e. the set of cardinality $N$), and $\Theta$ is the step function, which acts as an indicator variable that has a value of 'one' when $E(S)$ is below the threshold energy $E_c$ and a value of 'zero' otherwise. Once a bioinformatic specificity is set, we use the estimated cumulative distribution function, cdf, of the CB energy over the $4^9$ sequences to calculate the energy threshold that matches that particular value of the cdf (where we assume these $4^9$ sequences occur based on the background probability).
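The threshold calculation can be sketched by brute-force enumeration. This is a minimal illustration assuming a position-independent energy PWM stored as a list of per-position dictionaries and a uniform background (every $k$-mer equally likely); the names `sequence_energy` and `energy_threshold` and the toy energies are hypothetical, not the dissertation's code or data.

```python
import itertools

def sequence_energy(seq, energy_pwm):
    """Position-independent energy: sum of per-position contributions."""
    return sum(column[base] for column, base in zip(energy_pwm, seq))

def energy_threshold(energy_pwm, specificity):
    """Return E_c such that a fraction `specificity` of all k-mers
    (uniform background assumed) have energy at or below E_c."""
    k = len(energy_pwm)
    energies = sorted(
        sequence_energy(kmer, energy_pwm)
        for kmer in itertools.product("ACGT", repeat=k)
    )
    # n = specificity * N sequences are allowed below threshold (at least one).
    n = max(1, round(specificity * len(energies)))
    return energies[n - 1]

# Toy 2-position energy PWM in arbitrary units (an assumption).
toy_pwm = [{"A": 0, "C": 1, "G": 2, "T": 3}] * 2
strict = energy_threshold(toy_pwm, 1 / 16)
looser = energy_threshold(toy_pwm, 3 / 16)
```

For a 9-mer PWM the same function enumerates $4^9 = 262144$ sequences, which is still cheap. Ties in energy mean that slightly more than $n$ sequences can sit at or below the returned threshold.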
We naively build the cdf by nested iterations, which allows us to iterate over all $N$ possible 9-mers. At each iteration we determine the unique 9-mer sequence $S$'s CB energy $E(S)$ (position independent model), and increment the bin of the energy histogram that corresponds to $E(S)$, where the bin widths were 0.1 in arbitrary units. To map an energy $E$ to a bin, we map each energy to a bin number (bin identification), where the bin number is $\lceil 10\,E(S) \rceil$ for a 0.1 precision bin width, or simply $\lceil E(S) \rceil$ for a bin width of 1. For example, for a specificity of $10^{-6}$ we expect $n = 10^{-6} N$ possible sequences to be below the energy threshold; we then can rank each sequence in the set of $N$ unique 9-mers based on their energy score, where the $n$th sequence's energy is $E_c$.

For a given energy PWM each 9-mer, $S$, in a CRM is scored as $E(S) + w(S,C)$, where the shift is determined by the PWM that was trained from class specific data ($w = 0$ for CB). For example, if the spacer window is set at $[0, 30]$ bp then the DC detector always expects that there is a cooperating site proximal to it, and hence adds $w(S, \mathrm{proximal})$ to the energy $E(S)$ for a given sequence $S$. All 9-mers that satisfy the constraint $E(S) + w(S, \mathrm{proximal}) \le$

a Poisson process is $\sqrt{p_i/N}$.

$$\frac{\partial h(f_i)}{\partial f_i} = -(\log f_i + 1), \qquad (2.39)$$

where we know the derivative of $\log(x)$ with respect to $x$ is $1/x$, and we used the product rule of differentiation on the term $f_i \log f_i$. Using the above computation we can now see how a variation in magnitude equal to a standard deviation of the frequency of event $i$ will contribute to the error (the standard deviation) in the entropy (footnote 9).
Now the propagated error to the entropy is:

$$\sigma_{H(f)} = \left| \sum_i \frac{\partial h(f_i)}{\partial f_i} \right| \sigma_{f_i}, \qquad (2.40)$$

$$\sigma_{H(f)} = \left| \sum_i (\log f_i + 1) \right| \sigma_{f_i}. \qquad (2.41)$$

Now we simply must declare the type of measurement process on $f_i$. If we assume a Poisson process, then plugging into the above formula for the standard deviation $\sigma_{f_i}$ we have (footnote 10):

$$\sigma_{H(f)} = \left| \sum_i (\log f_i + 1) \right| \sqrt{\frac{p_i}{N}}. \qquad (2.42)$$

$\langle n_i \rangle = N p_i$. Also recall that the variance of a random variable scaled by a constant, for example $1/N$, is $\mathrm{var}(X/N) = \mathrm{var}(X)/N^2$. Therefore the variance in the frequencies $f_i$ is $\sigma^2_{f_i} = \frac{p_i}{N}$.

Footnote 9: What is the point of doing propagation of errors? Well, it's reasonable to assume a repeated counting experiment will arrive at a different value of $f$ for each experiment; hence, within the statistical estimate of the standard deviation, which is $\sqrt{f(1-f)}$ for a Bernoulli process, we would like to know how much this variation will cause variation in the entropy.

Footnote 10: Any information theoretic quantity can be converted to an entropy using standard identities. Hence, for example, the information content of a binding site logo is $IC = H_{max} - H(P(S))$, where $P(S)$ is the PWM estimate of the binding site sequence distribution. The error in this estimate is approximately just the additive error in the two entropy terms. The maximum entropy is known with certainty and hence has error zero. Hence, the information content error of a PWM is simply the error estimate of the binding site sequence distribution's entropy.

Figure 2.9: The estimate of the error, $dh(f_i) = \sigma_{h(f_i)}$, is plotted on the vertical axis, when the number of counts of event $i$ occurs at its expected value, where the horizontal axis is the expected number of counts observed for event $i$, $\mu = pN$.
2.11.5 Determining if differences in Information are significant

From 2.11.4 we see that we need, in the worst case scenario, about 4 decibits (about 4*.1 bits, which was the maximum value in the graph) of information content to biologically distinguish two alignments each of 100 binding sites of DNA (i.e. in order to distinguish two columns of a PWM). However, the standard error of the entropy requires dividing the estimated variance by $N$, in which case, two alignments are statistically resolvable ('significantly different') if they are different by .4 centibits, which may or may not be biologically significant. What is biologically significant? For example, can natural selection resolve binding sites that have differences in information scores of .4 centibits? That depends on the selection pressure, and how big it is. Certainly selection can resolve .4 centibits, given drift doesn't cause the variation in the population to disappear. What is physically significant? For example, can Dorsal protein resolve binding sites that are different by .4 centibits? Well, given that Dorsal can sample binding sites, then given $2^{H(P(S))}$ binding events, we would expect a Dorsal to correctly identify a binding site relative to the background of the genome (where $P(S)$ is the PWM probability). So in a statistical sense, yes it seems to be physically significant, since .4 centibits can reduce the amount of sampling required. In fact, any information that Dorsal has about its binding site sequences is physically significant due to this idea (footnote 11).

2.11.6 Entropy bias from Maximum Likelihood estimation

The estimate of the error (standard deviation (footnote 12)) in the entropy in the above section was based on estimates of the frequencies $f$. The estimates of the frequencies were based on the Maximum Likelihood Principle, which is unbiased for frequency estimates.

Footnote 11: This doesn't mean all binding site sequences with information scores greater than zero are functional (adaptations) in Eukaryotes, as Schneider found for prokaryotes [108].
Footnote 12: This is not a 'standard error'.

Functions of known unbiased estimates (such as the frequencies) are not necessarily unbiased (such as the entropy). It is known that the entropy estimate is biased if one uses ML estimates for the frequencies. To see this we can Taylor expand the above estimator $H(f)$ about the expected value of $f$, which is $p$ (there's no bias in the expected value of $f$). Using the same notation as Bialek (see page 547 [20]), we will add and subtract the expected frequencies from each factor:

$$H(f) = -\sum_{i=1}^{4} (p_i + \delta f_i)\log(p_i + \delta f_i), \qquad (2.43)$$

where $\delta f_i \approx \sigma_{f_i}$ in my notation above, and since we're expanding about the 'expected value' we have $\delta f_i = (f_i - p_i)$ (plugging that into the equation you see we've just written the definition of the entropy). Now Taylor expanding (where recall, the expansion of $\log(1+x) \approx x$) we have:

$$H(f) = -\sum_i p_i \log p_i - \sum_i \left(\log p_i + \frac{1}{\ln(2)}\right)\delta f_i - \frac{1}{2}\sum_{i=1}^{4} \frac{1}{\ln(2)\,p_i}(\delta f_i)^2 + \ldots \qquad (2.44)$$

Here the last term in the expansion is the error term (capturing higher order terms), and the expression $(\delta f_i)^2$ acts as the variance, $(\delta f_i)^2 = \sigma^2_{f_i}$ in my notation in the section on entropy propagation of errors. The variance term is the average squared fluctuation, the standard deviation squared, which is a positive number (even if any given fluctuation is negative). Now, we can take the expectation value of the Taylor expansion, and of course, the fluctuations are just as likely positive as negative, and they'll cancel (or just recall that $\langle f_i \rangle = p_i$). The first term in the expansion above is a constant (it's the true entropy). Hence we find something a bit odd: the expected entropy is not the true entropy! This is a bias. The expected entropy, to second order, is:

$$\langle H(f) \rangle = -\sum_i p_i \log p_i - \frac{1}{2}\sum_{i=1}^{4} \frac{1}{\ln(2)\,p_i}\langle(\delta f_i)^2\rangle + \ldots \qquad (2.47)$$

Maximum likelihood estimates of the entropy underestimate the true entropy. Why? Well, mathematically, the second term on the right side of the equals sign is assumed to be dominant over other higher order terms (where we're assuming this expansion converges).
The second term contains the variance (a positive quantity). The true entropy (the first term on the right side) is always positive or zero, hence subtracting the smaller variance term from the true entropy leads to an underestimate of the true entropy (always).

To estimate the bias in the entropy we must compute $\langle H(f) \rangle - H(p)$, which as seen above can be estimated by $-\frac{1}{2}\sum_{i=1}^{4} \frac{1}{\ln(2)\,p_i}\langle(\delta f_i)^2\rangle = -\frac{1}{2}\sum_{i=1}^{4} \frac{1}{\ln(2)\,p_i}\sigma^2_{f_i}$, where we have substituted in the standard variance symbol for the squared fluctuation. Now we know the variance of a Poisson process, hence plugging it into the last expression we have a magnitude of $\frac{1}{2}\sum_{i=1}^{4} \frac{1}{\ln(2)\,p_i}\frac{p_i}{N} = \frac{1}{2}\sum_{i=1}^{4} \frac{1}{\ln(2)\,N}$. Hence, we see for DNA nucleotide counts, the entropy at a given site is biased by $\frac{2}{\ln(2)\,N}$. For small sample size (small $N$) this can have a large effect, even larger than the estimate of the variance itself, $\sigma^2_{H(f)}$.

2.11.7 Entropy Bias

Changing notation, we now will simply conflate the estimates of frequencies $f$ with the true probability $P$. The Bayesian estimate of the probability of $B$ with a Dirichlet prior using symmetric hyperparameter $\alpha$ is $P(B) = \frac{n_B + \alpha}{N + 4\alpha}$ [81]. Unlike the ML estimate of entropy (which underestimated the entropy), this particular choice of hyperparameter is known to give an overestimate of the true entropy. Hence, in Bayesian estimation, it is possible for the small sample bias to be either over or under the true value. For example, the Bayesian estimate approaches the ML estimate as we let $\alpha$ approach zero, which can be seen in Figure 2.10. This is expected since if $\alpha$ is zero we then have: $P(B) = \frac{n_B + \alpha}{N + 4\alpha} = \frac{n_B}{N}$.
To minimize the bias in the entropy we selected the value of $\alpha$ that gave the smallest bias for the small sample regime. Of course, we do not know the true entropy of the binding site distribution a priori. However, for the length 9 bp CB distribution of Dorsal binding site sequences we could initially estimate a 'known' PWM that could then be used to randomly generate synthetic binding sites. Hence, using our best estimate of the CB PWM, we used the PWM to generate data sets of $N$ binding sites. For each synthetic data set we estimated a PWM with a fixed value of $\alpha$, and thereby estimated the entropy as a function of $N$. For each value of $N$, we generated $n$ replicate data sets, building a PWM for each set, thereby obtaining $n$ estimates of the entropy for each value of $N$. We estimated the entropy based on data sets of size $N = [1, 50]$, and $n = 20$. By repeating this experiment for values of $\alpha$ in the domain $[10^{-5}, 0.1, 0.2, 0.25, 0.5, 1]$ we found an empirical value of $\alpha$ that best estimated the 'known' entropy and energy. We similarly repeated this for energy estimates, and found the least biased value of $\alpha = 0.1$ for entropy and energy estimates.

Figure 2.10: The probability distribution denoted as $p$ was estimated from $N$ random deviates of a 'known' length 9 PWM that was built from $D_{CB}$ data. The entropy of $p$, $H[p] = -\sum_{i=1}^{9}\sum_{B} p(i,B)\log p(i,B)$, where $i$ runs over the nine positions of the aligned $N$ sequence deviates, and $B$ runs over the bases, was computed for twenty replicates for each value of $N$, and we plotted the average entropy over the twenty replicates as a function of $N$. We computed the functional $H$ as a function of $N$ for values of $\alpha$ in the domain $[10^{-5}, 0.1, 0.2, 0.25, 0.5, 1]$. The 'known' CB PWM had an entropy of 5.6 bits as shown by the green horizontal line, and we found the empirical value of $\alpha$ that best estimated this 'known' entropy to be $\alpha = 0.1$, as shown by the red plot of the functional $H$ as a function of $N$. We similarly repeated this for the functional energy estimates, and found the least biased value of $\alpha = 0.1$ for entropy and energy estimates.
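The synthetic-data experiment can be sketched for a single PWM column. This is a minimal illustration under stated assumptions: one column instead of nine, an arbitrary 'known' distribution `true_probs`, 2000 replicates instead of 20, and hypothetical function names; it is not the code used to produce Figure 2.10.

```python
import math
import random

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def estimate_column(counts, n_total, pseudocount):
    """Dirichlet-smoothed estimate P(B) = (n_B + a) / (N + 4a)."""
    return [(c + pseudocount) / (n_total + 4 * pseudocount) for c in counts]

def mean_estimated_entropy(true_probs, n_sites, pseudocount, replicates, rng):
    """Average entropy estimate over replicate synthetic data sets."""
    total = 0.0
    for _ in range(replicates):
        draws = rng.choices(range(4), weights=true_probs, k=n_sites)
        counts = [draws.count(base) for base in range(4)]
        total += entropy_bits(estimate_column(counts, n_sites, pseudocount))
    return total / replicates

rng = random.Random(0)                      # fixed seed for repeatability
true_probs = [0.7, 0.1, 0.1, 0.1]           # assumed 'known' column
true_entropy = entropy_bits(true_probs)
ml_mean = mean_estimated_entropy(true_probs, 10, 0.0, 2000, rng)
bayes_mean = mean_estimated_entropy(true_probs, 10, 1.0, 2000, rng)
```

Scanning the pseudocount over a grid and plotting the mean estimate against $N$ reproduces the kind of comparison described in the text: the pseudocount-free estimate sits below the known entropy and larger pseudocounts push the estimate upward.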
Chapter 3

3.1 Model Background

3.1.1 Fractional Occupancy of Morphogen Binding to DNA binding Site

The binding process is modeled using a first order rate law, where M is the morphogen, B is the binding site of DNA, and MB is the complex:

$$\frac{d[MB]}{dt} = k_{on}[M][B] - k_{off}[MB], \qquad (3.1)$$

where the brackets [] denote concentrations, $k_{on}$ is the diffusion limited on rate, and $k_{off}$ is the off rate that is determined by the electrostatic interactions and will vary between morphogens. Using chemical reaction notation, we can write the above rate law as:

$$[M] + [B] \rightleftharpoons [MB]. \qquad (3.2)$$

In equilibrium we have:

$$K_a = \frac{[MB]}{[M][B]} = \frac{k_{on}}{k_{off}}, \qquad (3.3)$$

where $K_a$ is the association constant (binding constant). The binding site is either occupied ($o$) or unoccupied ($u$), hence the total ($t$) amount of B is conserved:

$$B_t = B_u + B_o. \qquad (3.4)$$

From this equation one is able to construct a Binomial probability space (footnotes 2 and 3). The fraction of occupied binding site $B$ determines the distribution's parameter $P$:

$$P = \frac{B_o}{B_t}, \qquad (3.5)$$

which can be rearranged in terms of the concentrations (i.e. [MB] and [B]):

$$\frac{B_o}{B_t} = \frac{[MB]}{[MB] + [B]}, \qquad (3.6)$$

$$P = \frac{[MB]}{[MB] + [B]} = \frac{K_a[M]}{1 + K_a[M]}. \qquad (3.7)$$

Again, $K_a$ is the association constant of the morphogen, $M$, to the binding site $B$. Previously we have denoted this constant as $K(S)$, where $S$ is a DNA sequence that functions as a binding site for the morphogen.

Footnote 2: If we encode the bound state and unbound state in the Bernoulli random variable (i.e. 1, 0) we have: $E(X) = \mu = \sum_i X_i P_i = 1 \cdot P + 0 \cdot (1 - P)$.

Footnote 3: $\sigma^2(X) = \sum_i (X_i - \mu)^2 P_i = (1 - P)^2 P + (0 - P)^2 (1 - P) = (1 - P)P$.

3.1.2 Fractional occupancy of CRMs containing multiple binding sites

Given that most regulatory sequences have multiple binding sites for multiple different morphogens, one has a master equation governing the binding process, a set of coupled differential equations. Here, again, we will assume the binding process has equilibrated, and hence we simply must enumerate the states (configurations of bound and unbound sites) of the many-binding site system. For independent binding this is simply a multinomial process (where the 'multi' aspect accounts for the different types of transcription factors binding to the CRM).
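Equation (3.7) and the Bernoulli moments in the footnotes can be checked numerically with a minimal sketch. The function names and the units of the inputs are assumptions for illustration; any consistent units for $K_a$ and $[M]$ work since only their product enters.

```python
def fractional_occupancy(k_a, morphogen_conc):
    """P = K_a [M] / (1 + K_a [M]) for a single site at equilibrium (Eq. 3.7)."""
    x = k_a * morphogen_conc
    return x / (1.0 + x)

def bernoulli_moments(p):
    """Mean and variance of the bound (1) / unbound (0) indicator."""
    return p, p * (1.0 - p)

# At [M] = 1 / K_a the site is bound half of the time.
half = fractional_occupancy(2.0, 0.5)
mean, variance = bernoulli_moments(half)
```

The occupancy saturates toward one as $K_a[M]$ grows, and the variance $(1-P)P$ is largest at half occupancy.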
For example, for a single type of transcription factor binding to $n$ binding sites within a CRM, one has a binomial process governed by the partition function $(1+q)^n$, as discussed in the introduction of the dissertation. Hence, the probability of $k$ bound factors is simply $P(k) = \binom{n}{k} p^k (1-p)^{n-k}$, where $p$ is a Boltzmann probability for a single bound site, and $p^k (1-p)^{n-k}$ is the joint probability of a particular binding configuration (which configuration is irrelevant for the case of identical sites binding the same morphogen). However, for the case of dependencies between the sites, we cannot factorize the joint distribution over binding sites (i.e. $(1+q)^n$ is a factorization of the partition function, and hence a factorization of the joint distribution). Furthermore, for the case where the sites are not identical, such as in CRMs that have heterogeneous binding sites due to each transcription factor's DNA sequence specificity, we do not simply have one Boltzmann probability $p$ for all the sites. Hence, each site will need a distinguishing notation. Furthermore, the number of sites $n$ is not always known. Hence, in some cases, one must discover the $n$ sites by 'annotating' the CRM, which defines the coordinates of each transcription factor's sites and their corresponding energy.

In the next three sections we will describe a notation that is a mixture of notations used for Hidden Markov Models (HMM) and Lattice Gases (Ising Models), a hybridization of the notation from Segal et al. [109] (HMM notation) and the notation used by Xin He et al. [61][60]. Hereafter these notations will simply be referred to as Segal's notation or Xin's notation. The reason for the hybridization is that our model was originally based on Segal's paper, which later was merged with work from Xin He (who collaborated with others in Sinha's lab and Zhong's lab), by using Xin's GEMSTAT and STAP C++ programs as libraries for implementation of our model.

3.1.3 Segal's Hidden Markov Model

If one does not know the binding sites of a CRM a priori then one could take all possible positions within the sequence as the start of a new binding site as in Figure 3.1, where the PWM (see chapter 2) scores each position of the sequence as a potential binding site, thereby creating a list of binding energies ordered according to the position within the CRM.
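The position-by-position scoring just described can be sketched as a sliding window. This is a minimal illustration assuming an energy PWM stored as a list of per-position dictionaries and scanning the forward strand only; the function name `scan_crm` and the toy motif are hypothetical and are not taken from GEMSTAT or Segal's code.

```python
def scan_crm(crm, energy_pwm):
    """Score every window of the CRM as a potential site.

    Returns a list whose x-th entry is the energy of the site starting at x.
    """
    k = len(energy_pwm)
    return [
        sum(column[base] for column, base in zip(energy_pwm, crm[x:x + k]))
        for x in range(len(crm) - k + 1)
    ]

# Toy length-2 motif that prefers 'gg' (energies in arbitrary units).
toy_pwm = [{"a": 1, "c": 1, "g": 0, "t": 1}] * 2
energies = scan_crm("acggt", toy_pwm)

# One row per factor gives the energy matrix described in the text.
energy_matrix = {"toy_factor": energies}
```

Scanning the reverse complement as well, and repeating for each factor's PWM, would fill the remaining rows of the matrix.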
Repeating this for each morphogen's PWM, and stacking the lists on top of each other, results in a matrix of energy scores that is useful for computation of the partition function through the forward algorithm from Hidden Markov Models (HMMs).

The partition function of the many body system requires the addition of the statistical weight (see the Boltzmann factor in equation 3.9) from each possible 'path' through the matrix. Each 'path' through the matrix represents a configuration as in Figure 3.2. The partition function can be calculated by recursively moving through the matrix elements, in a way that is similar in spirit to standard algorithms of multiple sequence alignment [38], where we can think of each configuration as a possible way to 'align' each motif (PWM) to the CRM. The variant of the HMM forward algorithm approach used by Segal for computing the partition function of all paths is discussed in his supplementary material.

Figure 3.1: Widom, Segal, Nature Reviews Genetics; "motif" denotes path through the PWM [110].

3.1.4 Enumerating the configurations of a CRM sequence

Here we will introduce a notation for the configuration weights of many-binding site systems. First, let $P(c)$ be the probability of a particular configuration ($c$) occurring:

$$P(c) = \frac{W(c)}{\sum_c W(c)}, \qquad (3.8)$$

where $W(c)$ is the Boltzmann factor or weight of the binding configuration, which is a list of occupations for each binding site locus within the CRM, where the occupation is either bound or unbound. Using the notation of Xin (which was using Buchler's model as a guide [61][26], a notation used by T. Hill that I followed in the Introduction of the Dissertation), we have:

$$W(c) = \Big[\prod_i^{N} q(x)_{tf(i)}\Big]\Big[\prod_j^{N-1} \omega_{tf(i),tf(j)}(d)\Big]. \qquad (3.9)$$

Here we have $N$ bound transcription factors, where $q(x)_{tf(i)}$ is the weight of the transcription factor for the $i$th binding site ($tf(i)$) that is bound at position $x$ of the sequence,

$$q(x)_{tf(i)} = K^{tf(i)}_{s(x)}\,[tf(i)]. \qquad (3.10)$$

Here $K^{tf(i)}_{s(x)}$ is the equilibrium constant $K_a$ from equation (3.3) for the binding site sequence $s$ bound by transcription factor $tf(i)$. And $\omega_{i,j}$ is the interaction between the two bound factors $tf(i)$, $tf(j)$, where we have one less nearest neighbor interaction than the number of bound factors. And $d$ is the distance separating factor $tf(i)$ from factor $tf(j)$ in units of base pairs ($d = x(tf(i)) - x(tf(j))$, where $x(tf(i))$ is the sequence coordinate of factor $tf(i)$).

3.1.5 The configuration vector nomenclature

Here we have used '$c$' to denote the binding configuration of bound and unbound factors on the CRM, following the symbol Segal used to denote the configuration, and we have used a map $tf(i)$ to link each binding site $i$ to a transcription factor. This map is similar to notation used by Segal, where he explicitly denotes the type of transcription factor at each bound site.

Xin used a configuration vector, $\sigma$, which had indicator variables $\sigma_i$ for the $i$th component of their configuration vector, where the $i$th component was the $i$th binding 'site'. Xin's notation is elegant, in that every 'site' in their model is clearly represented in their configuration vector by the occupancy of the $i$th component of the vector.

Segal's notation does not display all the possible bound and unbound positions in the configuration (since there is a background occupancy in his model that causes unbound sequence to have a Boltzmann weight of 1). I have tried to stick to Segal's notation for a configuration by only displaying bound factors in the Boltzmann weight of a configuration, and by denoting explicitly the type of transcription factor at each bound site. However, Segal's notation is based only on the CRM sequence; he does not actually have a 'site' notation, since every possible position in the CRM is considered a binding site for each transcription factor, and he simply looks at all the possible ways that coordinately regulating transcription factors could form monolayers on the CRM, without overlapping one another. Hence, in Segal's notation, there really is no notion of a 'functional' site, as each possible position within the CRM is a place for the factor to 'plant' itself (a place for the transcription factor to bind), where possible is distinct from probable by the use of PWMs.

Seeing that binding sites are 'functional' (adaptations), or at least vestigials or exaptations, we have tried to hybridize the notation of Xin for a 'site' with Segal's notation.
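The weight in equation (3.9) can be written out for one configuration as a minimal sketch. It assumes the configuration is given as a list of (position, factor) pairs for the bound factors only, and that `q` and `omega` are caller-supplied functions; the factor labels, the toy weights and the 30 bp cutoff in the example are hypothetical values chosen for illustration.

```python
def configuration_weight(bound, q, omega):
    """W(c) = prod_i q(x_i, tf_i) * prod over nearest neighbours of omega.

    bound: list of (position, factor) pairs for the bound factors only.
    q(position, factor): single-site weight, K * [tf] in Eq. (3.10).
    omega(factor_a, factor_b, d): pairwise interaction at distance d (bp).
    """
    ordered = sorted(bound)
    weight = 1.0                      # the empty configuration has weight 1
    for position, factor in ordered:
        weight *= q(position, factor)
    # N bound factors give N - 1 nearest-neighbour interactions.
    for (x_a, tf_a), (x_b, tf_b) in zip(ordered, ordered[1:]):
        weight *= omega(tf_a, tf_b, x_b - x_a)
    return weight

# Toy inputs (assumptions): constant site weights, cooperativity within 30 bp.
toy_q = lambda position, factor: {"dl": 2.0, "twi": 3.0}[factor]
toy_omega = lambda tf_a, tf_b, d: 5.0 if d <= 30 else 1.0

near = configuration_weight([(10, "dl"), (25, "twi")], toy_q, toy_omega)
far = configuration_weight([(10, "dl"), (100, "twi")], toy_q, toy_omega)
```

Dividing such a weight by the sum of the weights of all allowed configurations gives $P(c)$ as in equation (3.8).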
Xin's notation is fundamentally based on binding 'sites' (adaptations). For example, by either setting a threshold on a PWM to discover sites, or by knowing a priori what the binding 'sites' (regulatory adaptations) are, Xin starts the problem (the configuration notation) with $N$ binding sites. The indicator variables of the configuration vector are equivalent to the occupancy of each site, hence $\sigma_i$ is either 0 or 1 for each component of the configuration vector. For example:

$$W(c) = W(\sigma) = W(\sigma_1, \ldots, \sigma_N; \sigma_{1,1}, \ldots, \sigma_{N,N}) \qquad (3.11)$$

$$= \exp\Big(\sum_i^{N} \ln(q_i)\,\sigma_i + \sum_j^{N} \ln(\omega_{ij})\,\sigma_{ij}\Big). \qquad (3.12)$$

Here, $\ln(q_i)$ is the free energy of binding to the $i$th site, $\sigma_i$ denotes the occupancy of the $i$th 'site', and $\sigma_{ij}$ denotes the pairwise energetic interaction $\ln(\omega_{ij})$ between site $i$ and site $j$ (footnote 1). But what type of transcription factor is binding to the $i$th site? In Xin's notation, this is not decipherable. Rather, in their notation, each 'site' corresponds to a particular factor. But we simply don't know which factor. Hence in Xin's notation it's possible that two distinct 'sites' occupy the exact same locus. Hence we don't know if two sites are overlapping. Overlapping sites cannot both be bound, since all thermodynamic models for occupancy of transcription factors use 'hard sphere potentials' (there's steric hindrance prohibiting two transcription factors from occupying the same space along the CRM). Hence, in Xin's notation, one is supposed to be cognizant that bound overlapping sites will set that configuration's weight to zero.

We have used a hybrid notation in Eq. 3.9, where it's not that our CRM has $N$ binding sites (like in Xin's notation) that are each bound or unbound. Rather, the weight $W(c)$ displays $N$ bound sites from a configuration $c$. Hence, in our notation $N$ is a variable in the binding configuration space, while in Xin's notation, $N$ is the number of sites.
In Xin's notation, advantageously, they can put a bound on the configuration space as $2^N$. This is an upper bound because overlapping sites will reduce the total number of configurations; furthermore it overcounts the number of unbound states, since a locus whose sequence matches multiple factors, say $m$ factors, would contribute $2^m/2 - 1$ too many configurations, since the locus is actually only unbound in one possible way (footnote 2).

Footnote 1: Recall for the canonical ensemble, where $H$ is the Hamiltonian of a configuration, we would have $W(c) = \exp(-H(c)/kT)$ being a simple Boltzmann factor, where $kT$ is the thermal energy. We are working in a grand canonical ensemble, which allows for particle exchange as well as energy exchange; hence, the total energy of the configuration is a function of the Hamiltonian and the chemical potential of the factors, which causes the concentration of the transcription factors to affect the total energy of a configuration, hence the energy is a thermodynamic free energy.

3.1.6 An example of the hybrid notation

For an example of the hybrid notation, consider all possible positions of the CRM as a binding site, as Segal would, and multiple morphogens binding to the same site (same position within the sequence, same locus); then the configuration vector ($c$) can be written in terms of Xin's indicator variables (similar to Ising models).

For a length $L$ CRM and for 3 transcription factor types (like Dorsal, Twist, and Snail) we have in Xin's notation $N = 4L$, where the factor of four is due to the four types of transcription factors at each locus, namely: bound by Dorsal, Twist, or Snail, or 'background' (background is a type of pseudoparticle that defines the unbound state) (footnote 3).

In this case, the configuration vector of indicator variables is written as a matrix of size 4 x $L$. For example, for the toy CRM sequence 'acggt', we would have a configuration vector of size 20.

If we add another 'state' to the configuration vector denoted as 'silent' to represent steric effects, where the type of transcription factor indicator variable now only indicates the 'start' of a binding site, we would then have an indicator matrix of size 25, as indicated in Figure 3.2, where the Dorsal transcription factor occupies positions 2 and 3 of the CRM (in this toy case Dorsal occupies a site just of length 2), and the rest of the CRM is occupied by the background factor. Hence the configuration vector (a value of the matrix of indicator variables) is of size $N + 5$ (the 5 due to the 'silent' states) assuming no interactions between bound factors. If there are interactions, for example nearest neighbors, then there are an additional $16N^2/2$ components to the configuration vector, which we will not display.

Footnote 2: Xin's GEMSTAT computes the correct weights and partition function; it's just the estimate on the configuration space that is a bound.

Footnote 3: As pointed out by Segal in his supplement, this problem may seem computationally infeasible, since the number of configurations for $L = 500$ is of size $2^{4 \cdot 500}$, but due to the forward algorithm it is possible. Here the number of binding sites, $N$, neglects edge effects of k-mers requiring length $k$ binding site sequences. Hence, the calculation of the number of configurations is only a bound on the cardinality of the set of configurations, where the bound is further inflated by the neglect of overlapping sites that cause many configurations to be inaccessible (against the rules).

Now that we have defined our notation, and described the weight of a given configuration, we will explain in the next section our novel form of the pairwise interaction $\omega$ between bound factors.

3.1.7 The pairwise interaction $\omega$ between bound factors

Transcription factors often interact with each other, causing certain configurations to be more likely by decreasing the total energy of the configurations where interacting factors are jointly bound, thereby increasing the weight of those configurations. This observation was the physical basis behind A. Hill's famous Hemoglobin Oxygen binding model.

In the work of Xin a number of forms of distance dependent pairwise interactions between bound factors were tested, such as sinusoidal functions over space that account for 'phasing', where one bound factor interacts with the nearest neighbor that is in phase (by using the major groove distance) with the factor of interest. Similarly, gaussian decays from the center of the planted factor, and square functions, were attempted and implemented. Hence, GEMSTAT has a small database of pairwise interaction forms. All the forms, as ours too, only allow for interactions between nearest neighbor bound proteins.
For the DV network, interactions between Dorsal and Twist have been experimentally shown to be distance dependent, hence we use an interaction that is a function of the DNA base pair distance separating the nearest neighbors. For the network that we are modeling, this distance dependent interaction has been shown to be the dominant force driving the dorsal border of neuroectoderm expressed genes. For the neuroectoderm network (footnote 2), Crocker et al. and Szymanski et al. [121], [32] have shown that the spacing between sites plays a dominant role in defining the number of cells that are turned on in this region (width or span of cells).

Figure 3.2: The toy CRM sequence acggt is annotated at each of its loci (positions) to denote the configuration vector in row major ordering (the first five bits are Dorsal's row, the second five bits are Twist's row, etc.). In the language of HMMs, the value of a configuration vector reveals the hidden state of the sequence. Here there are 5 states, where the 'silent' state indicates that a transcription factor is bound to an upstream position of the sequence, causing the locus to be covered by an internal position of the planted factor.

Figure 3.3: Changing the spacing between motifs in modules changes the span of cells that express the gene. In this figure the spacing between two sites was adjusted by natural selection in orthologs of the rhomboid gene's CRM. When the ortholog CRMs were transgenically inserted into the mel species, the width of the expression pattern changes relative to the endogenous pattern width, suggesting that cooperativity between the sites is a function of the distance between sites that can be used by evolution to 'fine tune' the expression patterns in development. When each sequence or module is expressed in its respective species (lineage), the relative widths ($w/L$, where $L$ is the length of the major axis of the species' embryo and $w$ is the width of the tissue, in nanometers, that expresses the gene) of the tissues are the same. Different species' embryos have different sizes, hence there is a scaling law: how a characteristic (such as gene expression) changes with body size. This figure is from Erives, Crocker et al. 2008, "Evolution acts on enhancer organization to fine-tune gradient threshold readouts", PLoS Biology.
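The row major configuration vector described in the caption of Figure 3.2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `indicator_matrix` is hypothetical, and the order of the last three rows (Snail, background, silent) is assumed, since the text only fixes Dorsal's row as first and Twist's row as second.

```python
STATES = ["dorsal", "twist", "snail", "background", "silent"]

def indicator_matrix(length, bound_sites):
    """Build a (number of states) x length matrix of 0/1 indicator variables.

    bound_sites : list of (state, start, site_length), start is 1-based.
    A factor's row marks only the start of its site; the remaining positions
    of the site are 'silent'; every uncovered position is 'background'.
    """
    matrix = {state: [0] * length for state in STATES}
    covered = [False] * length
    for state, start, site_length in bound_sites:
        first = start - 1
        for pos in range(first, first + site_length):
            if covered[pos]:
                raise ValueError("overlapping sites are not allowed")
            covered[pos] = True
        matrix[state][first] = 1
        for pos in range(first + 1, first + site_length):
            matrix["silent"][pos] = 1
    for pos in range(length):
        if not covered[pos]:
            matrix["background"][pos] = 1
    return [matrix[state] for state in STATES]

crm = "acggt"
# Dorsal occupies positions 2 and 3 (a toy site of length 2).
rows = indicator_matrix(len(crm), [("dorsal", 2, 2)])
# Row-major flattening: Dorsal's five bits first, then Twist's, and so on.
config_vector = [bit for row in rows for bit in row]
```

Each position of the toy CRM is in exactly one state, so the 25 component vector always contains exactly five ones.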
For both the DV and AP (Anterior Posterior) networks, short range antagonistic interactions have been shown to dominate the action of repressor transcription factors [43]. The Snail transcription factor is a short range repressor acting to regulate the DV axis, which we also model using a distance dependent interaction, commonly called 'quenching' [43].

[Footnote 2: The neuroectoderm network is the set of genes coordinately expressed in the lateral regions of the developing embryo; this developing tissue spans about 10 cells at the time point under consideration.]

Furthermore, since we do not know the exact form of the function, we bin the distance separating the proteins, and for each bin a free parameter is fit. One may look for a coarse binning of the separation distance in bins of 10 bp, or as fine as bins of 1 bp. If we choose the former option, then an example of our protein-protein interaction parameter would be as follows:

\[ \vec{\omega}(d) = [\omega_1, \omega_2, \ldots, \omega_b, \ldots, \omega_n] \tag{3.13} \]

where the subscripts of the components of the $\vec{\omega}$ vector (cooperativity or synergistic protein-protein interactions) represent the corresponding bin, $b$ (in this case there are $n$ bins); hence they define a bin vector, $\vec{B}$, whose components correspond to the intervals of base pair distances:

\[ \vec{B} = [B_1, B_2, \ldots, b, \ldots, B_n] = [(0\text{-}10), (11\text{-}20), \ldots, (41\text{-}50), \ldots, (x\text{-}L)] \tag{3.14} \]

Here bin $b$ is all interactions where two bound sites are separated by 41-50 base pairs, $x$ represents the last bin border, and $L$ represents the length of the module (sequence). This representation of the cooperativity as a vector is really only useful for programming purposes; mathematically the cooperativity is simply a piecewise function:

\[ \omega(d) = \begin{cases} \omega_1 & \text{if } d \in (0, 10) \\ \omega_2 & \text{if } d \in (11, 20) \\ \;\vdots \\ \omega_n & \text{if } d \geq x \end{cases} \]

Our aim is to find the biochemical parameters that tune the probability of configurations that occur in live embryos. However, we simply don't have such detailed biochemical experiments. Hence, we use the mRNA profiles of the target genes regulated by Dorsal, Twist and Snail as a readout of what binding configurations are likely occurring, and whether certain configurations have strong linkage (pairwise interactions) between certain bound factors. Hence, we must infer from mRNA data what is occurring at the DNA binding level: an expression to sequence model. In the next section we will describe a ubiquitous and obvious assumption: mRNA is caused by Pol II, hence Pol II binding is a proxy for gene expression.

3.1.8 Relating the number of mRNA transcripts to fractional occupancy of Pol II

The amount of transcription that occurs at a gene locus is encoded in two segments of DNA sequence: first, the basal promoter that binds the Basal Transcription Apparatus (BTA), which is a massive complex of many proteins including Pol II; second, and most important, the CRM that binds the morphogens or distal transcription factors (such as Dorsal) that modify and remodel the chromatin state and possibly have direct linkage with the BTA. Hence the number of mRNA transcripts can be modeled as simply a linear relationship between the number of mRNA and the fractional occupancy of a promoter sequence (which we assume includes the CRM sequence):

\[ \langle N_{mRNA} \rangle \propto f_{BTA}. \tag{3.15} \]

Of course, the occupancy of the BTA, $f_{BTA}$, is a fraction that is at most one, hence $\langle N_{mRNA} \rangle$ is the average number of mRNA molecules produced per nuclear cycle normalized by the maximum production rate over a cycle.

Figure 3.4: Diagram of the components of the fractional occupancy model, along with the parameters to be fit through expression data (estimates of the mRNA output of the gene bound by the BTA and estimates of input morphogen concentrations).
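The binned interaction of Eq. 3.13 and Eq. 3.14, described above as the form that is useful for programming purposes, might be coded roughly as follows. This is a minimal sketch rather than the dissertation's implementation: the function name `make_binned_omega`, the bin borders and the parameter values are hypothetical, and the last bin is assumed to collect every distance beyond the final border.

```python
import bisect

def make_binned_omega(upper_edges, values):
    """Return omega(d), a piecewise-constant function of spacer distance d in bp.

    upper_edges : increasing upper borders of every bin except the last,
                  for example [10, 20, 30, 40, 50]
    values      : one free parameter per bin, len(values) == len(upper_edges) + 1;
                  the final value covers all distances beyond the last border
    """
    if len(values) != len(upper_edges) + 1:
        raise ValueError("need exactly one value per bin")

    def omega(d):
        if d < 0:
            raise ValueError("distance must be non-negative")
        # bisect_left returns the index of the first border >= d,
        # which is the index of the bin that contains d.
        return values[bisect.bisect_left(upper_edges, d)]

    return omega

# Hypothetical parameters for bins 0-10, 11-20, 21-30, 31-40, 41-50 and > 50 bp.
omega = make_binned_omega([10, 20, 30, 40, 50], [8.0, 5.0, 3.0, 2.0, 1.5, 1.0])
```

With 10 bp bins the number of free parameters per interacting pair equals the number of bins, which is the trade-off against the finer 1 bp binning mentioned in the text.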
3.1.9 Fractional occupancy of BTA

We can gain more physical insight into the problem by thinking of mechanisms of how the BTA's occupancy is a function of the morphogen occupancy. In Figure 3.4 we see that when the morphogens are bound they have an activation or repression domain, which communicates with the BTA; for our case this communication may be thought of as the complex process of changing the epigenetic state of the chromosome, by coactivators (histone modifiers, nucleosome remodelers) binding to the morphogen. The binding energy of this domain ($w_m$) on the morphogen can be related to the binding energy of the BTA:

\[ \Delta G = w_0 + \sum_m n_m(c)\, w_m \tag{3.16} \]

Here $c$ is a particular configuration of the morphogens on the promoter, $w_0$ is the binding energy contribution from the basal promoter, $w_m$ is the binding energy contribution from each morphogen, and $n_m(c)$ is the number of bound morphogens of species $m$ in state $c$ (for example see [60]). For this configuration we could model the occupancy of the BTA as:

\[ f_{BTA} = \frac{1}{1 + e^{-(w_0 + \sum_m n_m(c)\, w_m)}} \tag{3.17} \]

The fast pace of morphogen binding relative to the transcription time (e.g. the time for Pol II to clear the promoter and for a new, or preloaded on the enhancer, Pol II to bind) would suggest that the BTA is not sampling, or cognizant of, each CRM configuration of bound proteins; rather, the BTA sees an average occupancy of the morphogens. This can be modeled as:

\[ f_{BTA} = 1 \Big/ \Big( 1 + \exp\big( -( w_0 + \sum_m \langle n_m \rangle|_c\, w_m ) \big) \Big), \tag{3.18} \]

where the occupancy of the BTA is related to the theoretical model of the average occupancy of morphogen $m$ over the CRM configurations, $\langle n_m \rangle|_c$ (footnote 4). The average morphogen occupancy over configurations is:

\[ \langle n_m \rangle|_c = \sum_c n_m(c)\, \frac{W(c)}{\sum_{c'} W(c')}, \tag{3.19} \]

where the normalized weights of each configuration are from Equation 3.8, and $n_m(c)$ is the number of morphogens of type $m$ (such as $m$ = Dorsal) bound in the configuration $c$. The expectation value is computed using the GEMSTAT software from Sinha's lab; see the following reference for further details [60]. Hereafter, we simply use $\langle n_m \rangle$ to denote $\langle n_m \rangle|_c$, with the understanding that the expectation value is taken over the binding configurations of the CRM.

[Footnote 4: This can also be shown, under certain conditions, to be an approximation of the much more computationally intensive model of Segal et al. [109], where they compute $f_{BTA}|_c = \big\langle \frac{1}{1 + e^{-(w_0 + \sum_m n_m(c)\, w_m)}} \big\rangle|_c$. However, as stated above, this is not an approximation of Segal's model; it is a different model of the interaction of the BTA with the CRM.]

3.1.10 Fractional occupancy of BTA from a binding reaction perspective

Let $P$ = Pol II concentration (this is the best known protein in the Basal Transcription Apparatus, but technically $P$ is the concentration of the BTA), $D$ = DNA (or basal promoter, i.e. TATA box, and binding sites for TF2B etc.), and $C$ = complex ($PD$). We assume the binding process is in equilibrium (on a different time scale than morphogen binding).

\[ P + D \rightleftharpoons C \tag{3.20} \]

Using fractional occupancy we have:

\[ \frac{C}{C + D} = \frac{1}{\frac{D}{C} + 1} \tag{3.21} \]

\[ C = P\, D\, K_a \tag{3.22} \]

\[ \frac{C}{C + D} = \frac{1}{\frac{D}{P D K_a} + 1} = \frac{1}{\frac{1}{P K_a} + 1} \tag{3.23} \]

Now, assuming the concentration of unbound Pol II is unaffected by our DNA $D$, we can say free Pol II is a constant, like 1000, and absorb the constant into $K_a$:

\[ \frac{1}{\frac{1}{P K_a} + 1} = \frac{1}{\frac{1}{K'_a} + 1} = \frac{1}{e^{-\Delta G / k_b T} + 1} \tag{3.24} \]

Now $\Delta G$ is the free energy that is released during morphogen binding, so we can equate $\Delta G$ to equation (3.16); however, we will not say Pol II 'sees' a particular configuration, rather we will assume Pol II sees the average configuration (i.e. the average number of bound morphogens, $\langle n_m \rangle$):

\[ \Delta G = w_0 + \sum_m w_m \langle n_m \rangle \tag{3.25} \]

\[ \langle n_m \rangle = \sum_c n_m(c)\, P(c) \tag{3.26} \]

Plugging the energy, in units of $k_b T$, from equation (3.25) into equation (3.24), we again arrive at equation (3.18):

\[ \frac{1}{\frac{1}{P K_a} + 1} = \frac{1}{e^{-(w_0 + \sum_m w_m \langle n_m \rangle)} + 1} \tag{3.27} \]

This yields a range of $N_{mRNA} \in [0, 1]$.

3.1.11 Fractional occupancy of BTA in Cooperative Binding (CB) model in Xin He's GEMSTAT

The fractional occupancy in Xin He's Cooperative Binding model (CB) is closely analogous to our form of the model (footnote 5). This can be seen by taking the simple system of one morphogen binding site with one basal promoter site for the BTA. Hence, the BTA is treated as if it were simply another morphogen. In this case the partition function of the two site system, for the case that the two sites are independent, is $\Xi = (1 + q)(1 + q_{BTA})$, where $q$ is the canonical partition function of the morphogen bound to its site and $q_{BTA}$ is the canonical partition function of the BTA bound to its promoter. Now if we assume the sites are dependent, then we cannot factorize the joint partition function (footnote 6), and we then have $\Xi = (1 + q + q_{BTA} + q_{BTA}\, q\, w_{tf}) = Z_{off} + Z_{on}$, where we have collected similar terms in the expansion: $Z_{off} = 1 + q$ is the term that does not include BTA binding, and $Z_{on}$ is the collection of the remaining terms (where BTA is bound) (footnote 7). In their CB model they have $f_{BTA} = \frac{Z_{on}}{Z_{off} + Z_{on}} = \frac{1}{Z_{off}/Z_{on} + 1}$. By equating this to our form of the occupancy (Eq. 3.27) we have:

\[ Z_{off}/Z_{on} = \exp\Big( -\big( w_o + \sum_{tf} w_{tf} \langle n_{tf} \rangle \big) \Big). \tag{3.28} \]

The above equation is the odds against BTA binding, hence the log odds is:

\[ \ln\big( Z_{off}/Z_{on} \big) = -\Big( w_o + \sum_{tf} w_{tf} \langle n_{tf} \rangle \Big). \tag{3.29} \]

Now, if we assume there is only one morphogen binding site, then the denominator of $\langle n_{tf} \rangle$ is $Z_{off}$, and its numerator is $q$. Hence we have:

\[ \ln \frac{(1 + q)}{(q_{BTA} + q_{BTA}\, q\, w'_{tf})} = -\Big( w_o + w_{tf} \frac{q}{1 + q} \Big), \tag{3.30} \]

where we have replaced the $Z$'s with their original $q$'s, and $w'_{tf}$ denotes Xin's form of the cooperativity between BTA and morphogen (in order to distinguish it from our cooperativity's symbol $w_{tf}$). Using the properties of logarithms we can isolate $(1 + q w'_{tf})$ on the left side of the equation, which leads to:

\[ \ln(1 + q) - \ln q_{BTA} - \ln(1 + q w'_{tf}) = -\Big( w_o + w_{tf} \frac{q}{1 + q} \Big), \tag{3.31} \]

where $w_o$ is $\ln q_{BTA}$, leaving the result:

\[ \ln(1 + q) - \ln(1 + q w'_{tf}) = -w_{tf} \frac{q}{1 + q}. \tag{3.32} \]

Rearranging, we have:

\[ (1 + q) \ln(1 + q) - (1 + q) \ln(1 + q w'_{tf}) = -w_{tf}\, q. \tag{3.33} \]

If $q$ is less than one (the morphogen has low concentration, for example), then the first term on the left is approximately zero, and if $w'_{tf}$ is not too large then to first order the Taylor expansion of $-\ln(1 + q w'_{tf})$ is $-q w'_{tf}$. Hence it appears that under this regime Xin's CB model, with free parameters $\alpha$, is equivalent to our $w$ factors.

[Footnote 5: This CB is not related to our CB PWM model from Chapter 2.]

[Footnote 6: As always, we can organize the states over our many-body system by systematically building the configurations, even in the case of dependencies, through a polynomial expansion over binding sites, which is presented in Xin He's Supplementary Material and T. Hill's text [62].]

[Footnote 7: In Xin's Supplement, $w_{tf}$ is denoted as $\alpha$ (which is a different parameter than our $\alpha$ that approximates the morphogen binding constant, where we choose the symbol $\alpha$ to mimic Segal's choice of parameters); $w_{tf}$ is the cooperative binding between the morphogen and the BTA.]

3.1.12 Fractional occupancy of BTA in Ay's model

The fractional occupancy model in Fakhouri et al. [43], which we'll call Ay's model, is similar to Segal's model, and hence similar to our form of the BTA occupancy (footnote 8). In Ay's model the probability of each configuration is an element in a vector of probabilities, where the vector is labeled $F$ (where $\sum_c F_c = 1$, and $c$ encodes the binding configuration as the component index to the vector). Similar to Segal's model, each configuration causes a particular occupancy of the BTA, where the occupancy of the BTA for each configuration is a component of a vector $T$. Hence $T \cdot F = \sum_c F_c T_c$ (given that one knows all the configurations), which is then very similar in form to Segal's model $\big\langle \frac{1}{1 + e^{-f(c)}} \big\rangle_\pi$, where the expectation is taken over the binding configurations and $f(c)$ is Segal's function of each configuration.

[Footnote 8: A distinguishing characteristic between Segal's and Ay's work was that Ay's model contained high quality expression data (with error bars) [], while the input expression data to Segal's model was Boolean 'on'/'off' data, which was smoothed (for example by using cubic splines). Furthermore, Segal's CRM input was just the sequences and PWMs of the morphogens regulating the CRMs (where the binding sites were to be 'discovered' using a PWM annotation model), while Ay's model knew exactly where the binding sites were within the CRMs, thereby not being hampered by false positive binding sites from PWM prediction of sites, as in Segal's approach. Once Segal's model had annotated a CRM with the morphogen binding sites, his model and Ay's model, we aim to show in this section, were identical (assuming the annotation had no false negatives and false positives).]

A departure of Ay's model from our model is the quenching function that he denotes as $q(d)$, where $d$ is the spacer distance in base pairs between the repressor and the activator. Their model, like ours, bins the spacer distances to form the quenching function ($q$ is a piecewise function defined over the different intervals of spacing between the repressor and activator) (footnote 9). In his model each interval has a free parameter to be trained, which is analogous to our form of the pairwise potential for repression, for example $\omega_{Sn,Dl}(d)$ (the repression pairwise potential between Snail and Dorsal), which is also a binned pairwise potential in the same form. However, the parameters depart in that our $\omega$'s occur in the partition function just as in Segal's model, while Ay's quenching parameters do not. His quenching parameter goes back to a model form from Reinitz [106], where quenching modulates the probability of configurations bound by the repressing transcription factors. In this model, 'modulate' means that the $p = 1$ norm of their probability vector, $\|F\| = \sum_c F_c^p = \sum_c F_c$, is a function of the amount of repression.

[Footnote 9: In the case of multiple repressor types, one would have a notation such as $q(d)_{tf}$, where each repressor type has its own quenching function, or possibly just one universal quenching function, if all repressors ended up behaving the same. What a discovery that would be!]

For example, a module responding to activators under high concentrations, but with very low concentrations of repressor (making the Boltzmann weights of repression configurations effectively zero), has the usual norm $\sum_c F_c = 1$. If instead the conditions are favorable for repression (such as both high concentrations of repressors and activators AND a very strong repression pairwise potential), and the $q(d)$ function is large (its largest value is one) for the spacer distances $d$ that occur between the activators and the repressors, then configurations with bound repressors have their probabilities modulated by factors of $(1 - q(d))$ for each bound repressor, where $d$ is the spacer between each repressor and its nearest neighbor (although other forms, or schemes, were tested for this function rather than just nearest neighbors).

For example, take a module with 5 activator binding sites and 5 repressor binding sites (with no sites overlapping). For the configuration with all binding sites bound, where each activator has a nearest neighbor bound repressor with a spacer of $d = 10$ bp, the Boltzmann probability of the all-bound configuration is modulated by a factor $(1 - q(d = 10))^5$, and the result of this is $F_{c'} (1 - q(d = 10))^5$, where $c'$ is the configuration with all 10 sites bound (footnote 10).

[Footnote 10: In Ay's model, the configuration vector $c$ takes a similar form to our configuration vector; for example, Ay's representation of the example configuration with 10 bound sites is $c' = [ARARARARAR]$, where A is for activator and R is for repressor. This is a compact representation of our configuration vector in Figure 3.2, compact in the sense that our configuration vector also displays information about the spacers and internal positions of a binding site (a 'silent' state).]

The difference between Ay's model of quenching and ours is due to the quenching parameters (function) not occurring in the partition function. Here we see if it is possible to equate Ay's model form $\sum_c F_c T_c$ to Segal's model. First let the occupancy function of the BTA, which is a function of the configuration, be denoted as:

\[ T(x) = \frac{1 - q}{1 + \exp(-x)} \tag{3.34} \]

where $q$ is the quenching from Ay's model, and $x$ is a function of the binding configuration. Ay's exact form was:

\[ T'(x) = \frac{1}{1 + \exp(5 - x)}, \tag{3.35} \]

which we wish to adorn with the factors $(1 - q)$ (the probability modulation factors that contain the quenching parameter $q$), which will then allow the BTA occupancy to be modulated, thereby keeping the Boltzmann distribution over the binding configurations unperturbed by repressor morphogen binding. Now make a change of variables: let $(1 - q) = e^{+w}$, where the symbol $w$ has no particular meaning other than to capture the change of form in the expression. Now we have:

\[ T(x) = \frac{\exp(+w)}{1 + \exp(-x)}, \tag{3.36} \]

where upon bringing the numerator's factor into the denominator we have:

\[ T(x) = \frac{1}{\exp(-w) + \exp(-x)\exp(-w)}. \tag{3.37} \]

Now we simply want this denominator to be of the form $1 + \exp(-x')$, which would allow Ay's model to be expressed in exactly the same form as Segal's. Hence we again make the change of variables:

\[ \exp(-w) + \exp(-x)\exp(-w) = 1 + \exp(-x'), \tag{3.38} \]

which leads to:

\[ -x' = \ln\big( \exp(-x) + q \big) - \ln(1 - q). \tag{3.39} \]

Hence we have:

\[ T(x) = \frac{1}{1 + \frac{\exp(-x) + q}{1 - q}}. \tag{3.40} \]

Now it appears that the odds against BTA occupancy from Ay's model, $\frac{\exp(-x) + q}{1 - q}$, must be equal to the odds against BTA occupancy from Segal's model, $e^{-f(c)}$. Setting the log odds equal to each other may allow for some deduction on what certain parameters 'mean' in terms of the two models, which would then be able to be traced to our model.
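The change of variables in Eq. 3.34 to Eq. 3.40 can be checked numerically with a short Python sketch. It assumes the logistic occupancy is written with a negative exponent, $T(x) = (1 - q)/(1 + \exp(-x))$; the function names and the test values of $x$ and $q$ are hypothetical.

```python
import math

def ay_occupancy(x, q):
    """Quenching-modulated occupancy, T(x) = (1 - q) / (1 + exp(-x))."""
    return (1.0 - q) / (1.0 + math.exp(-x))

def transformed_argument(x, q):
    """x' chosen so that T(x) = 1 / (1 + exp(-x')); requires 0 <= q < 1."""
    return -(math.log(math.exp(-x) + q) - math.log(1.0 - q))

def logistic(x):
    """Plain logistic function, the form used for the BTA occupancy."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical configuration scores x and quenching values q.
for x, q in [(-2.0, 0.0), (0.5, 0.3), (3.0, 0.9)]:
    difference = ay_occupancy(x, q) - logistic(transformed_argument(x, q))
    assert abs(difference) < 1e-12
```

With $q = 0$ the transformed argument reduces to $x$ itself, so quenching only ever shifts the effective argument of the logistic downward in this sketch.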
3.2 Dataset

We would like to know the key biochemical parameters that are utilized by Dorsal, Twist, and Snail as they regulate the genes that pattern the Dorsal Ventral axis of early development by producing expression profiles: a list of mRNA counts for each position along the DV axis. Under standard nonlinear regression, one could imagine an experiment where one measures the response (the dependent variable) to systematic variations of the inputs (independent variables); these measurements can then be used to fit a nonlinear regression model, where the parameters are our biochemical parameters (such as the binding constants). Here the response is the gene expression levels, and the inputs are the morphogen profiles and CRM information (described further in the data section). This is a massive amount of experimentation in order to fit a nonlinear model. However, following the interpretation of Segal et al. [109], which in a sense is a reinterpretation of Ed Lewis' model of homeogenes, and furthered by Zinzen et al. [129], we can treat coordinately regulated genes as if they were the same gene, just under different inputs.

What are the different inputs? The CRMs and the positional information of the embryo (in terms of morphogen concentrations).

How is this possible? During development morphogens work together to coordinately regulate a set of genes.

What about replicates, as a well designed experiment has statistical power? The number of degrees of freedom, which determines the statistical power, can be assumed sufficient, as across a population of embryos, each with thousands of nuclei that each have a genome, we know the genomes are identical (neglecting short term evolution and segregation of SNPs during sex; SNPs in inbred lab lineages are irrelevant). Each embryo is effectively a clone of the others.
So, for the regulatory regions of genes, a modeler effectively only needs one sample to have complete knowledge about the DNA sequence. But what about the trans environment, the numbers of each molecule in each embryo: doesn't that vary across embryos? All embryos are the same within the limits of diffusion processes. Each embryo across a population of embryos is producing mRNA at each coordinately regulated gene with a high degree of precision. Morphogen concentration gradients and the genes' expression responses (mRNA levels) are highly reproducible across different embryos, within the physical limits set by diffusion processes [55]. So yes, the internal molecular environment from one embryo to the next does vary due to random walks of molecules. However, our error in assuming that a population of embryos are all the same will be no bigger than $\sqrt{n}$, where $n$ is the absolute number of a molecule type in each cell of an embryo. Furthermore, $n(t)$, where $t$ is time, varies on a time scale much slower than the processes we will study. Hence the concentration profile function across the embryo, $n(z, t)$, where $z$ is space (the DV axis), we assume is frozen. For large $n$, the fractional error that we incur in our model is quite small. Hence, variation in the absolute concentration of a molecule from one embryo to the next (or even within an embryo from one cell to the next cell at the same position along the DV axis) is negligible.
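The claim that an error of order $\sqrt{n}$ becomes negligible for large $n$ can be made concrete with a few lines of Python. The molecule counts below are purely hypothetical and are not measurements from the embryo.

```python
import math

def fractional_fluctuation(n):
    """Relative size of a sqrt(n) counting fluctuation for n molecules."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(n) / n

# Hypothetical molecule counts per nucleus.
fluctuations = {n: fractional_fluctuation(n) for n in (100, 10_000, 1_000_000)}
```

The fractional error falls as $1/\sqrt{n}$, so going from a hundred to a million molecules reduces it from about ten percent to about a tenth of a percent.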
Hence, we will treat an entire network of regulatory sequences as if they are each just a different measurement, a different input to our model. This is a reasonable interpretation; however, nonlinear models are notorious for being sensitive in certain intervals of the input variables, and hence if the network has not just happened to evolve coordinately expressed genes in the input regions where our model is sensitive, the model will fail, and one must resort to tedious work such as that of Fakhouri et al. [43] to assure that all necessary data points are being collected. For example, if one wants to measure distance dependent quenching, then one, under Segal's interpretation, should hope that evolution has selected a range of different spacers that coordinate quenching within the different CRMs; otherwise the model will be insensitive to this parameter. Why? Because there simply is no input data that varies the spacing, which is a prerequisite for fitting distance dependent interactions. However, the key biochemical parameters we are interested in were motivated by previous experiments on endogenous CRMs that have pointed to our parameters of interest being the major contributing factors. The model requires three main pieces of data, denoted as $D$. First, the CRM sequences and the PWMs of the morphogens targeting the sequences. Second, the morphogen concentrations along the DV axis at a particular point in time in development. Third, the response of the CRMs to the morphogens (also in the form of 'expression' concentrations along the DV axis at a particular point in time in development).

\[ D = \{ S_{crm_t}, E_t, E_{tf}, PWM_{tf} \} \tag{3.41} \]

\[ S_{crm_t} = \begin{pmatrix} s_{rho}(1) \ldots s_{rho}(L_{rho}) \\ s_{vnd}(1) \ldots s_{vnd}(L_{vnd}) \\ \vdots \\ s_{n}(1) \ldots s_{n}(L_{n}) \end{pmatrix} = \begin{pmatrix} S_{rho} \\ S_{vnd} \\ \vdots \\ S_{n} \end{pmatrix} \]

\[ E_t = \begin{pmatrix} E_{rho}(1) \ldots E_{rho}(m) \\ E_{vnd}(1) \ldots E_{vnd}(m) \\ \vdots \\ E_{n}(1) \ldots E_{n}(m) \end{pmatrix} = \begin{pmatrix} E_{rho} \\ E_{vnd} \\ \vdots \\ E_{n} \end{pmatrix} \]

\[ E_{tf} = \begin{pmatrix} E_{Dorsal}(1) \ldots E_{Dorsal}(m) \\ E_{Twist}(1) \ldots E_{Twist}(m) \\ E_{Snail}(1) \ldots E_{Snail}(m) \end{pmatrix} = \begin{pmatrix} E_{Dorsal} \\ E_{Twist} \\ E_{Snail} \end{pmatrix} \]

\[ PWM_{tf} = \begin{pmatrix} PWM_{Dorsal}^{DC},\; PWM_{Dorsal}^{DU} \\ PWM_{Twist} \\ PWM_{Snail} \end{pmatrix} \]

3.2.1 Sequence part of the data

The most important input of the model is just the CRM sequences. The model is fit to the response of the CRM across the entire DV axis, and hence each CRM will have response values across all the positions of the DV axis. Each row in $S_{crm_t}$ is a DNA sequence of the module that drives the target expression (concentration) of the same row in $E_t$ as a function of the concentrations of the morphogens and the morphogens' PWM annotations on the CRM $S_{crm_t}$ (through binding site discovery). The CRMs are usually about 500 bp and control a given gene in the DV network, or the CRM was an engineered construct that was tested in vivo. The columns of $S_{crm_t}$ are the ordered nucleotides of the DNA of length $L$, where each base of the sequence $S_t$ is represented as $s(i)$, and the subscripts on the CRM indicate the label for the target gene it controls.

3.2.2 Positional-dependent target gene response data $E_t$

The response of the CRM to morphogen regulation is the data $E_t$, which is the expression of the CRM's target gene. The target expression is controlled in cis by a given CRM, hence each row of the matrix $E_t$, denoted $E_t$ for a given target $t$, corresponds to the same row in $S_{crm_t}$. The target expression is controlled in trans by the morphogen concentrations at a position, $z$, along the embryo. Each column of $E_t$ represents the position $z$ along the embryo.
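One way to hold the four pieces of $D$ from Eq. 3.41 in code is a small container that also checks that the rows line up. The sketch below is an assumption about a convenient layout, not the format used in this work: the class name `DVDataset`, its field names and the toy values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DVDataset:
    """Hypothetical container mirroring D = {S_crm_t, E_t, E_tf, PWM_tf}."""
    crm_sequences: Dict[str, str]              # target name -> CRM sequence
    target_expression: Dict[str, List[float]]  # target name -> profile along DV axis
    tf_expression: Dict[str, List[float]]      # factor name -> profile along DV axis
    pwms: Dict[str, list]                      # factor name -> list of energy PWMs

    def n_positions(self) -> int:
        """Check that rows line up and return the number of DV positions m."""
        if set(self.crm_sequences) != set(self.target_expression):
            raise ValueError("each CRM row needs a matching expression row")
        lengths = {len(p) for p in self.target_expression.values()}
        lengths |= {len(p) for p in self.tf_expression.values()}
        if len(lengths) != 1:
            raise ValueError("all profiles must cover the same DV positions")
        return lengths.pop()

# Toy values only; real CRMs are about 500 bp and there are about 40 positions.
toy = DVDataset(
    crm_sequences={"rho": "acggt", "vnd": "ttgca"},
    target_expression={"rho": [0.0, 0.9, 0.1], "vnd": [0.0, 0.8, 0.0]},
    tf_expression={"Dorsal": [0.1, 0.5, 1.0], "Twist": [0.0, 0.2, 1.0],
                   "Snail": [0.0, 0.0, 1.0]},
    pwms={"Dorsal": [], "Twist": [], "Snail": []},
)
m = toy.n_positions()
```

Keying the sequence and expression rows by the same target name makes the row correspondence between $S_{crm_t}$ and $E_t$ explicit instead of relying on row order.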
3.2.3 Positional-dependent morphogen data

The transcription factors Dorsal, Twist and Snail also each have an expression profile along the DV axis, which is stored in the table $E_{tf}$, where $tf$ is the particular factor, and where each factor's profile must be a vector of the same length as the target gene profile. There are about forty cells (nuclei) along the DV axis at the time point in development under study (in Foe's timetable the time of development is 'stage 4'), and each nucleus has just one locus containing the CRM of interest (technically, Drosophila is 'diploid', containing two genomes in each nucleus, but we don't model this); hence it is natural to demarcate the positions along the DV axis into about forty bins.

Hence the columns of $E_{tf}$ and $E_t$ represent the concentrations at a given position along the DV axis of the input morphogens ($tf$) and the output target response ($t$). However, the independent variable data $E_{tf}$ is not necessarily collected jointly with the dependent variable $E_t$; much of the expression data comes from different embryos, and different labs, so we actually do not have the exact known amount of input morphogen and output gene product for a given position along the axis. However, tedious experiments along the DV axis by a number of labs have already shown that the explanatory variables causing variation in $E_t$ are the morphogens we use as inputs. Furthermore, embryos are believed to be very 'reproducible' in their patterning expression profiles, hence it is reasonable to collect the input and output profiles from different embryos.
3.2.4 Collection of data from DV network of Dorsal, Twist, and Snail targets in Neuroectoderm and Mesoderm, and PWMs

The target gene profiles and corresponding regulatory sequences were collected from the following references: [72], [71], [70]. The profiles were based on the number of cells that span the mesoderm and neuroectoderm (about 40) at the time point under consideration. The positional information was extracted from the results sections of the corresponding references. For example, the results may say "twist PE enhancer border at cells 12-14", where the mesoderm proper is known to be 18-20 cells wide at the time point under study (at this point an entire cross sectional slice of the Dorsal Ventral axis is 100 cells wide (reference) at the minor axis of the ellipsoid). Furthermore, the ventral neuroectoderm is known to span 6-10 cells past the mesoderm boundary, giving a bound on the dorsal-most border of neuroectoderm genes, and the dorsal border is cited by the author with the uncertainty in its cellular position. The amplitudes were classified according to the comparative analysis in the results sections of the references; each class was then assigned a numeric range. For example, the class that was regarded as the strongest staining was arbitrarily assigned the range [.9, 1]. It should be noted that the amplitudes for Segal's 2008 paper were binary, hence his border definition, where to assign the 0,1 transition, has uncertainty.

The expression profile of Dorsal was based on Zinzen's confocal microscopy results in the Devex database (no longer available online). The Twist expression profile was also from this database. The Snail profile is known to be uniform in the mesoderm and 'off' in regions of the embryo dorsal of the mesoderm; the profile is a step function. Hence, Snail is simply 'on' in the mesoderm and 'off' in the regions dorsal of the mesoderm.
The Dorsal, Twist, and Snail profiles were coordinated with the target gene profiles by the standard assumption that the sharp Snail border demarcates the mesoderm neuroectoderm border, which I will call the Snail step. Hence, all NEE genes, or synthetic constructs of functional NEEs, were registered with the position of the step in the Snail profile. Similarly, all mesoderm target genes were scaled such that they were less than or equal to the step position of the Snail profile.

The Dorsal PWMs are the DC and DU forms from Chapter 2 of this dissertation. The Twist motif used was 5'-CAYATG, and the Snail motif used was 5'-CACCTG. The Twist and Snail motifs were transformed to energy PWMs too. Hence, the energy levels of Twist and Snail are not accurate, but, due to the threshold free algorithm for annotation, the accuracy is not as central a question as it would be for a threshold based approach.

3.3 Nonlinear regression model

3.3.1 Putting the data parts and free parameters together to form the nonlinear model of BTA occupancy

Our nonlinear regression model is simply the fractional occupancy of the BTA, which is a function over the high dimensional space of input CRMs and position along the DV axis of the embryo:

\[ f(S, z) = \frac{1}{1 + \exp\big( -( w_0 + \sum_m \langle n_m(z) \rangle\, w_m ) \big)}, \tag{3.42} \]

where morphogen $m$'s occupancy on the CRM, $\langle n_m(z) \rangle$, is now a function of the position, $z$, along the DV axis of measurement. Hence, $z$ is the index of our expression and morphogen profiles.

3.3.2 Free parameters to be fit

The nonlinear model has a set of biochemical constants that we optimize. Hence these parameters are unknown, and trained based on the data. The parameters we denote as:

\[ \theta = \{ \vec{\alpha}, \vec{\omega}_d, \vec{w} \}. \tag{3.43} \]

The alpha vector, $\vec{\alpha}$, is related to the protein DNA binding interaction. Recall $q_i = K(S)[tf(i)] = K_0 \exp(-\beta E(S))[tf(i)]$, where $K_0$ is the maximum binding constant in k-mer space, $E(S)$ is the energy score from the PWM, and $[tf(i)]$ is the concentration of the transcription factor that corresponds to the PWM. We do not know the absolute concentration of the factor $tf(i)$; rather we have normalized laser intensities of fluorescent markers for the protein in 'stained' embryos (which we assume are proportional to the absolute concentration). We have denoted this intensity data as $E_{tf}$. Hence, for example, for the $k$th bin (or $k$th cell) along the $z$ axis (DV axis) we have $q_i = K(S)[tf(i)] = \exp(-E(S))\, E_{tf}(k)\, \alpha_{tf}$.

An interaction energy, $\vec{\omega}(d)$, is defined for nearest neighbor interactions for each type of interaction between species, homotypic and heterotypic (this represents a form of 'cooperativity' and a form of 'quenching'). These constants depend on binning the separation distance between interacting factors (which factors are interacting must be prespecified). For example, if we assume detectable variations on the scale of 10 bp for protein-protein interactions, the function would be as follows:

\[ \vec{\omega}(d) = [\omega_1, \omega_2, \ldots, \omega_b, \ldots, \omega_n] \tag{3.44} \]

where the subscripts of the components of the $\vec{\omega}$ vector represent the corresponding bin, $b$ (in this case there are $n$ bins); hence they define a bin vector, $\vec{B}$, whose components correspond to the intervals of base pair distances:

\[ \vec{B} = [B_1, B_2, \ldots, b, \ldots, B_n] = [(0\text{-}10), (11\text{-}20), \ldots, (41\text{-}50), \ldots, (x\text{-}L)] \tag{3.45} \]

Here bin $b$ contains all interactions where two bound sites are separated by 41-50 base pairs, $x$ represents the last bin border, and $L$ represents the length of the enhancer (sequence). For repressors (like Snail) that have a 'quenching' interaction, $\omega$ is in the range [.01, 1], while for binding cooperativity $\omega$ is in the range [1, 100].

Lastly, $\vec{w}$ contains a post binding constant for each transcription factor. These parameters represent the interaction of the bound transcription factor with the BTA. Each transcription factor has an associated $w_{tf}$ factor; for example, see the figure. These factors, in a sense, represent the domain of the transcription factor that interacts with the BTA. However, this interpretation is controversial, as transcription factors recruit histone modifiers and remodelers, which change the epigenetic state of the chromatin, and in this way they affect transcription, and hence BTA binding, so the interaction isn't necessarily due to the protein 'touching' or binding to the BTA.

3.4 Annotation model of binding sites

3.4.1 Discovering the binding sites within the CRM

The space of all possible CRM sequences of length 500 has size $4^{500}$. Segal only had about 40 sequences available to train his model, which seems small with respect to a possibly ideal data set that would contain all $4^{500}$ CRM sequences $S$ of length 500 along with each CRM's response $E$ at each position along the DV axis. Of course, most of those $S$ sequences do not respond to the Dorsal, Twist and Snail morphogens, and hence an MSE (Mean Square Error) objective function under an ideal data set would be overwhelmed by negative data, visible in the 'reduced chi-square', which is the SE (Squared Error) divided by the number of degrees of freedom, $4^{500} m - p$ (where $m$ is the number of positions along the axis, and $p$ is the number of parameters to be fit). In reality, what I have called an ideal data set is NOT ideal at all. Nor is a random sampling of the $4^{500}$ sequences, since functional CRMs (CRMs that form a pattern usable by the organism) are very rare. Hence, an approach that concentrates on known positive CRMs (endogenous or human designed) seems to make sense.
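The nonlinear model of Eq. 3.42, together with the site weight $q_i = \exp(-E(S))\, E_{tf}(k)\, \alpha_{tf}$ from Section 3.3.2, can be illustrated for a deliberately tiny case: one factor, two non-overlapping sites, and brute-force enumeration of the four configurations. This is a sketch under those assumptions and not the GEMSTAT computation used in this work; all function names and numeric values are hypothetical.

```python
import itertools
import math

def site_weight(energy, intensity, alpha):
    """q = exp(-E(S)) * E_tf(k) * alpha, with the PWM energy in units of kT."""
    return math.exp(-energy) * intensity * alpha

def mean_occupancy(q, omega):
    """<n> for one factor on two non-overlapping sites, by enumeration.

    q     : (q_1, q_2), statistical weights of each site being bound
    omega : interaction weight applied when both sites are bound
    """
    total, weighted = 0.0, 0.0
    for s1, s2 in itertools.product((0, 1), repeat=2):
        w = (q[0] ** s1) * (q[1] ** s2) * (omega ** (s1 * s2))
        total += w
        weighted += (s1 + s2) * w
    return weighted / total

def bta_occupancy(w0, w_m, n_mean):
    """Logistic BTA occupancy for a single morphogen type."""
    return 1.0 / (1.0 + math.exp(-(w0 + w_m * n_mean)))

# Hypothetical inputs: a morphogen profile over three DV bins, two site energies.
profile = [1.0, 0.5, 0.1]
energies = (0.0, 1.0)
alpha, omega, w0, w_m = 2.0, 5.0, -3.0, 3.0

predicted = []
for intensity in profile:
    q = tuple(site_weight(e, intensity, alpha) for e in energies)
    predicted.append(bta_occupancy(w0, w_m, mean_occupancy(q, omega)))
```

In this toy case the predicted expression falls as the morphogen intensity falls, because a positive $w_m$ makes the factor an activator; a negative $w_0$ keeps the gene mostly off when the sites are empty.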
Experts in CRMs have spent decades trying to decipher what sites are functional (adaptations). This has led to additional functional sites and functional CRMs that are not naturally evolved but are arguably just as useful for fitting the biochemical parameters (human designed CRMs or binding sites that work to recapitulate the work of evolution, such as recapitulating the expression pattern etc.). The result of this tedious work is that the Segal idea of using all possible positions within a CRM as a possible site to plant the protein is simply not biological. Transcription factors have evolved to recognize a few specific sites within CRMs. Hence, a 'site' based approach is more biological. Even Segal's approach really did not use all possible positions as a site, as mathematically one can set a threshold on PWM energy scores due to high energy sites having negligible effect on the model (footnote 11).

[Footnote 11: Segal noted in his Supplement that he did not use all possible positions as a site for all possible morphogens. Furthermore, in his Supplement he pointed out that his computations of the configurations ultimately relied on MCMC (Markov Chain Monte Carlo) sampling of configurations more than the HMM recursion algorithm to compute the weight of a configuration.]

Hence, before we fit the parameters of the nonlinear model, we will explain a novel annotation model, an algorithm, to transform from the CRM sequence to a binding site configuration space (much like Xin's configuration space) of much smaller dimension than the sequence configuration space. This algorithm will actually use the model parameters; however, a full analysis of model and parameter uncertainties requires fitting the parameters, which we describe after our annotation algorithm.

Traditionally 'annotation' of binding sites is accomplished by 'scanning' PWMs over the CRM, setting a threshold, and calling a positive site any site below the threshold. Of course, this algorithm begs the question of how to set the threshold on the PWM. Cutting a true functional site from the list of sites to be used for modeling (the sites used in the configuration vector), that is, cutting a site that is a known positive (at least 'known' by evolution as a positive), would cause severe overfitting. Furthermore, PWMs are notorious for false positives.
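The traditional threshold-based annotation described above can be sketched in a few lines of Python. The energy PWM here is a crude hypothetical one built from the Snail consensus 5'-CACCTG with a flat mismatch penalty, the test sequence is invented, and only one strand is scanned; none of these choices are taken from this work.

```python
def consensus_energy_pwm(consensus, mismatch=2.0):
    """Hypothetical energy PWM: 0 for the consensus base, a flat penalty otherwise."""
    return [{b: (0.0 if b == c else mismatch) for b in "ACGT"} for c in consensus]

def scan_pwm(sequence, energy_pwm, threshold):
    """Return (start, energy) for every window whose energy is below the threshold.

    energy_pwm : list of dicts, one per motif position, mapping base -> energy,
                 where a lower energy means a stronger site.
    """
    k = len(energy_pwm)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        energy = sum(column[base] for column, base in zip(energy_pwm, window))
        if energy < threshold:
            hits.append((start, energy))
    return hits

pwm = consensus_energy_pwm("CACCTG")
sequence = "TTCACCTGAACAGCTGTT"
strict_hits = scan_pwm(sequence, pwm, threshold=1.0)   # consensus matches only
lenient_hits = scan_pwm(sequence, pwm, threshold=2.5)  # one mismatch allowed
```

Moving the threshold changes which windows count as sites, which is exactly the sensitivity that motivates a threshold free annotation: the one-mismatch window is either a weak functional site or a false positive, and the scan alone cannot tell which.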
Very weak sites are not a concern for our nonlinear model (since the Boltzmann factor of a very weak site will exclude that configuration from making a detectable contribution to the partition function). However, 'weak' sites that are at the border of being functional or non-functional can have an enormous effect. A weak site that cooperates with a strong site may have a functional interaction (functional cooperativity or quenching), but a conservative threshold may cut the weak site, which in turn will cause the cooperativity to be missed in the configuration space, which in turn will cause the fit to compensate for the cooperativity by tuning other parameters. Similar observations were made through sensitivity analysis by Dresch [37].

3.4.2 Annotation Model of Binding Sites without a PWM threshold

Standard bioinformatic annotation of the binding sites uses a threshold on the PWM's energy; here we define an alternative approach for identifying binding sites within the CRMs. Instead of using a threshold for the PWM energies of all the possible sites within a CRM, we use a threshold on the CRM's response, its gene's expression, to set a minimal constraint on the occupancy of the Dorsal transcription factor at the position along the DV axis where the gene is turned 'on'. Hence, the annotation model assumes Dorsal occupancy must reach a critical value before a gene switches 'on'. This assumption is analogous to the assumption that the mRNA counts are proportional to the BTA occupancy; here we are just pushing the assumption closer to an occupancy that we can explicitly compute, in the sense that the BTA occupancy is caused by Dorsal occupancy, but we don't use basal promoters (like TATA boxes) in our sequence analysis and hence can't compute their occupancy. Furthermore, BTA occupancy, in terms of different occupancy across the genome, is a function primarily of CRMs, as all genes have TATA boxes (TATA boxes do not differentially regulate BTA occupancy).
The occupancy of Dorsal is entangled with the parameters that we aim to fit, hence this annotation step requires setting these parameters to a specific value (all parameters must be set, since they are used in this annotation algorithm). For example, if a Dorsal site is adjacent to a Twist site, Twist is bound at this site, and the Dorsal site is within the range of the cooperativity parameter between Twist and Dorsal, $\omega_{Dl,Tw}(d)$, where $d$ is the distance separating Dorsal and Twist, then the odds of Dorsal being bound will increase. In fact, as pointed out by A. Hill, if the cooperativity is strong enough, then the only configuration vectors that will contribute to the partition function are those that contain these bound Dorsal and Twist sites.

The annotation model assumes a conserved quantity governs all the enhancers at the point along the DV axis where their target gene becomes activated. Based on previous literature it seems Dorsal alone is sufficient and necessary, while Twist is neither. Hence, the assumption is that Dorsal's occupancy within the enhancer must reach a critical value, a conserved value, after which the gene expresses.

Only two points along the DV axis are used for fitting the annotation model. The first point of activation, when viewed in the direction from dorsal to ventral, represents one point. For example, for neuroectoderm genes the expression profile $E$ is scanned starting with the dorsal-most position for expression that is greater than 0.5, which occurs at the dorsal border of the neuroectoderm genes. For genes that are only activated in the mesoderm, this search will similarly find that the gene is activated at the mesoderm-neuroectoderm border, or at least at the dorsal border of the mesoderm gene's expression. For genes only expressed in the mesoderm, this is the only point which is used to estimate Dorsal occupancy within the target CRM (a mesoderm CRM).

The second point of activation is actually a point of repression of the gene, where Snail turns off the expression of the neuroectoderm genes. Hence, the second point for neuroectoderm genes is the Snail border, which demarcates the ventral border of neuroectoderm expressed genes.
I assume Snail is not present in the neuroectoderm. This gives a classification scheme, which is used to divide the enhancers into two categories (mesoderm, neuroectoderm). Hence the first step is to scan the expression profile starting from the dorsal ectoderm and find the first position where the gene is activated. If activation is first detected in the mesoderm, then an algorithm to calculate Dorsal occupancy is used which allows for Snail binding. However, neuroectoderm genes are solely annotated with Dorsal and Twist sites.

Here annotation means that sufficient Dorsal and Twist sites are annotated such that the Dorsal occupancy reaches a value of one unit (which is like one Dorsal bound, but this could be, for example, because there are 5 Dorsal sites along with 2 Twist sites all working together such that the average occupancy of Dorsal in the promoter is 1 unit).

If the neuroectoderm expressed genes indicate repression by Snail in the mesoderm, we assume that the mesoderm-neuroectoderm border is a zero sum game, that is, enough Snail sites must be found to cancel the gene activation based on the activator sites that were annotated in the neuroectoderm. Hence the two points used for fitting the expression are usually the 'dorsal ectoderm-neuroectoderm' border, which is determined by a search of the target gene's profile, and the 'mesoderm-neuroectoderm' border, which is determined by the border of Snail's profile (i.e. the mesoderm is defined by Snail expression).
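The search for the two points described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the function names, the 40-cell toy profiles, and the indexing convention (index 0 is the ventral-most cell, the last index is the dorsal-most cell) are assumptions made for the example.

```python
def dorsal_switch_point(expression, threshold=0.5):
    """Scan from the dorsal-most cell toward ventral and return the first
    position whose expression exceeds the threshold (None if never 'on')."""
    for z in range(len(expression) - 1, -1, -1):
        if expression[z] > threshold:
            return z
    return None


def snail_border(snail, threshold=0.5):
    """Dorsal-most position where Snail is 'on' (the mesoderm border)."""
    on = [z for z, s in enumerate(snail) if s > threshold]
    return max(on) if on else None


def classify(expression, snail, threshold=0.5):
    """Label a gene 'mesoderm' if it first switches on inside the Snail
    domain, otherwise 'neuroectoderm'."""
    switch = dorsal_switch_point(expression, threshold)
    if switch is None:
        return None
    return "mesoderm" if switch <= snail_border(snail, threshold) else "neuroectoderm"


# Hypothetical Boolean-like profiles over 40 cells (ventral at index 0).
snail = [1.0] * 9 + [0.0] * 31
rho_like = [0.0] * 9 + [1.0] * 8 + [0.0] * 23    # on only in the neuroectoderm
twist_like = [1.0] * 9 + [0.0] * 31               # on only in the mesoderm

rho_switch = dorsal_switch_point(rho_like)
twist_switch = dorsal_switch_point(twist_like)
```

For the neuroectoderm-like profile the two fitting points are the switch position returned by `dorsal_switch_point` and the Snail border; for the mesoderm-like profile only the switch position is used.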
Once a gene is classified as a neuroectoderm expressor (i.e. it shows expression in the neuroectoderm above a value I have set at 0.5), then the occupancy of Dorsal for the gene's CRM must be 'one'. The first location at which the gene achieves an expression of 0.5 is determined by a search starting from the dorsal ectoderm. Once that position along the DV axis is found, we extract the vector of transcription factor concentrations at that exact same point, which are needed for the computation of the occupancy. Hence, given the network parameters' values and the concentrations of the transcription factors where the gene 'switches' on (its expression is 0.5), I then 'select' the best Dorsal site from the annotated list of Dorsal's PWM scores for the given CRM. This selected site's occupancy is calculated, and if it is below the value one, I select the next best Dorsal site. If its occupancy is above or equal to one, then I stop searching for Dorsal sites in the CRM UNLESS there happen to be Dorsal sites with the exact same energy, in which case these sites are annotated too. This is because a number of the Jiang and Szymanski constructs (CRMs in the training data) had multiple replicates of Dorsal sites in the same enhancer.
In the case that one Dorsal site is annotated and the CRM's occupancy of Dorsal is below a value of one, the best Twist site is annotated from the list of occupancy scores of Twist in the CRM (note I did not say PWM scores, which are agnostic to the free parameters, while occupancy is highly sensitive to free parameters such as the Dorsal-Twist cooperativity). With the newly annotated Twist site, Dorsal's occupancy is recalculated and checked to see if it is above a value of one. If the Dorsal occupancy is still below the critical occupancy, then the list of PWM scores is processed such that the original annotated Dorsal site is masked out along with all overlapping sites. Then the new list (i.e. the list post-masking) of Dorsal PWM scores is transformed to a list of occupancy scores, where each occupancy is calculated with the original annotated Dorsal and Twist sites along with the Dorsal site from the list of PWM scores (these scores can be thought of as a data structure that contains the energy and coordinate of the binding site in the CRM). The Dorsal site that achieves the highest occupancy is then selected as the next annotated Dorsal site. If the occupancy achieves a value of one unit then the annotation process halts. If the Dorsal occupancy does not achieve a value of one unit, then these steps are repeated until the occupancy reaches the critical value, or the CRM is filled or 'covered' with binding sites (where no overlaps are allowed); the CRM is literally jam packed at this point.
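The greedy structure of this procedure can be sketched in a few lines. The sketch below is deliberately simplified and rests on assumptions that are not the thermodynamic model of this chapter: each site is treated as independent with occupancy $Kc/(1+Kc)$ and $K = e^{-\text{energy}}$, Twist cooperativity and the equal-energy rule are left out, and the names `site_occupancy`, `annotate_dorsal`, and the `width` used for the overlap test are hypothetical.

```python
import math


def site_occupancy(energy, conc):
    """Occupancy of one isolated site: K*c / (1 + K*c), with K = exp(-energy)."""
    k = math.exp(-energy)
    return k * conc / (1.0 + k * conc)


def annotate_dorsal(sites, conc, width=10, critical=1.0):
    """Greedily annotate sites (position, energy), strongest first, skipping
    sites that overlap an already annotated site, until the summed occupancy
    reaches the critical value or no candidate sites remain."""
    chosen, total = [], 0.0
    for pos, energy in sorted(sites, key=lambda s: s[1]):
        if total >= critical:
            break
        if any(abs(pos - p) < width for p, _ in chosen):
            continue  # masked: overlaps a site that is already annotated
        chosen.append((pos, energy))
        total += site_occupancy(energy, conc)
    return chosen, total


# Hypothetical candidate sites: (coordinate in the CRM, PWM energy).
candidates = [(10, 0.0), (15, 0.1), (40, 0.5), (80, 1.0), (120, 3.0)]
chosen, occupancy = annotate_dorsal(candidates, conc=2.0)
sparse_chosen, sparse_occupancy = annotate_dorsal(candidates, conc=0.01)
```

With the higher concentration the loop halts after two non-overlapping sites; with the very low concentration it runs out of candidates before the critical occupancy is reached, which corresponds to the 'covered' CRM case in the text.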
Once the sites have been annotated, they are stored in a vector. This process is repeated for all neuroectoderm expressed genes in the network (all CRMs). Once all the neuroectoderm CRMs have been annotated at their specific 'switch' points, the vector of sites for each CRM is passed to another annotator that builds a list of Snail sites at the mesoderm-neuroectoderm border of the DV axis, where the Snail sites are sufficient to 'repress' the gene (if the gene shows repression in the mesoderm). This annotator first selects the best scored Snail site from the list of PWM scores of the Snail PWM, and then calculates the target gene's expression. If the expression falls below 0.5 then the annotator halts. If the expression is above 0.5, then the annotated Snail site is masked from the list of Snail PWM scores, and the next best Snail site is selected. This annotation process is continued until the critical expression is reached, at which point the annotator halts.

Lastly, for genes not expressed in the neuroectoderm, such as twist, the annotation process is similar to that of the neuroectoderm genes with the exception that Snail is allowed to be annotated for functional sites.

We do not know a priori what the 'true' parameter values are, hence we start with a best guess point in parameter space, and then optimize the algorithm by gradient descent using the objective function:

$$\sum_{i}^{n} \sum_{j(i)} \left[\left(\text{Dorsal occupancy}_{i,j} - 1\right)\right]^{2} \qquad (3.46)$$

Here $i$ is the target gene label and, since we are only using 2 of the $m$ points along the axis, $j$ represents those two points or positions (which are target gene dependent). Hereafter, we call this annotation method or model the Maximum Parsimony Annotator, MPA, since, in a sense, it aims to predict the minimal set of binding sites that are consistent with the nonlinear regression model (BTA occupancy and morphogen occupancy) and its parameters.

3.5 Model Fitting

Once we have annotated each CRM in our data with the Dorsal, Twist and Snail binding sites, we wish to fit (in some sense) our nonlinear model, Equation 3.27 and equivalently Equation 3.42, using standard nonlinear regression.
A simple nonlinear model is 'logistic regression', which is used frequently for classification and is of the form:

$$Y = f(X_1, X_2, X_3, \ldots, X_N) = \frac{1}{1 + \exp(-\theta X)} \qquad (3.47)$$

This model's objective function has a unique optimum in parameter space because the objective function (the cross entropy) is convex. However, logistic regression requires boolean response variables, while our response is of a continuous form, ideally suited for regression.

To fit any regression model, one has a table of data, where the first column (for example) contains the values of the responses $y$ (dependent variable $Y$) from the measurements, and the next $N$ columns contain the values of the explanatory variables $x$ (independent variables $X_1, X_2, \ldots$; taken together, we call the ordered columns of independent variables the design matrix, denoted $X$ or $A$). Given this data, one can then adjust the free parameters $\theta$ to fit the model to the data.

Here we will assume there is random error in our expression profiles (albeit small, such that the trans environment between embryos is almost the same). Hence the deviations of our model from each data point $(y_i, X_1(i), X_2(i), \ldots)$ will take the form of a random variable, called the error:

$$f(X(i) \mid \theta) - y_i = \epsilon_i \qquad (3.48)$$

which we assume is normally distributed with zero mean. Hence, the probability of the entire data set is a multivariate normal:

$$P(D \mid \theta) = \frac{1}{\sqrt{(2\pi)^{n} |\Sigma|}} \exp\left(-\frac{1}{2} (y - f(\theta))^{T} \Sigma^{-1} (y - f(\theta))\right) \qquad (3.49)$$

Here $\Sigma$ is the data's covariance matrix and $n$ is the number of data points. We assume the errors are independent with unit variance, hence the covariance matrix is simply the identity matrix. In this picture, each data point $i$ occurs with probability density $\exp(-\epsilon_i^2/2)/\sqrt{2\pi}$, and the multivariate normal can be factorized as a product of independent univariate Gaussians, which, up to the normalization constant, can be written as $P(D \mid \theta) = e^{-\frac{1}{2}\chi^2}$.
Hence maximizing the probability of the data given the parameters (the maximum likelihood principle) is equivalent to minimizing the squared errors from the multivariate distribution, which we call 'chi squared', denoted as $\chi^2$ (that's a name, not a squaring operation!). In this picture we see that the $i$th row of the data matrix contains the information needed for the $i$th element of $\chi^2$, where the model parameters are fit by minimizing the squared errors, $\chi^2$:

$$\chi^2 = \sum_i \left(y_i - f(X_1(i), X_2(i), X_3(i), \ldots, X_N(i) \mid \theta)\right)^2 \qquad (3.50)$$

In general, $\chi^2$ is not a convex function, and therefore an optimizer is susceptible to getting stuck in local minima. Hence, nonlinear regression is a bit of an art [104], unlike its linear counterpart (linear regression), which has a global optimum at $\theta = (X^T X)^{-1} X^T b$, where $X$ is the 'design matrix' and $b$ denotes the column of data $y$ for the dependent variable $Y$.¹²

¹² In linear regression we're simply solving the linear equation commonly denoted as $Ax = b$, which I'm denoting as $X\theta = b$, to follow common statistics nomenclature.

Both Segal's and He's methods of optimizing their thermodynamic models used gradient descent and simplex, where the best parameters of gradient descent were passed to the simplex method, whose best parameters were passed back to gradient descent, until a finite set of alternations between the algorithms was exhausted. This was repeated for different starting points in parameter space (by randomly selecting a point in parameter space), in an attempt to avoid getting stuck in local minima. We have implemented the Levenberg Marquardt algorithm in GSL (GNU Scientific Library), which is similar to gradient descent. Upon completion of the alternations between gradient descent and simplex, the best parameter vector is passed to the Levenberg Marquardt algorithm (this was purely for the advantage that GSL has implemented a number of routines for parameter error estimates using Levenberg Marquardt that are not found in its general 'optimization' algorithms).

3.5.1 Covariance matrix of parameters

In linear regression, with no estimate of the error in the data, a standard estimate of the error in the parameters is simply the $\chi^2$ divided by the degrees of freedom (the number of data points minus the number of parameters), or more conservatively just $\chi^2$. In linear regression with known errors (the covariance matrix of the data), one is better able to estimate the error in the parameters by $(A^T A)^{-1}$, where $A$ is the design matrix.

In nonlinear regression, with no estimate of the error in the data, an estimate of the error in the parameters is again simply the $\chi^2$ at the minimum. Another estimate is $(J^T J)^{-1}$, and an even better estimate is to invert 1/2 the Hessian matrix of the $\chi^2$ over the parameters. To better understand our model's fitting behavior we will discuss the derivation of the covariance matrix of parameters.

The estimate of the covariance matrix of the best estimated parameters is based on the following Taylor expansion of the $\chi^2$ about the best fit point $\theta_0$ in parameter space:

$$\frac{1}{2}\chi^2 = \frac{1}{2}\chi^2_0 + \frac{1}{2} (\theta - \theta_0)^T \left[\frac{1}{2}\frac{\partial^2 \chi^2}{\partial\theta\,\partial\theta}\right] (\theta - \theta_0) + \ldots \qquad (3.51)$$

where the Hessian matrix, denoted as $\frac{\partial^2 \chi^2}{\partial\theta\,\partial\theta}$, is a 'two form' (i.e. the $ij$ element of the Hessian matrix is $\frac{\partial^2 \chi^2}{\partial\theta_i\,\partial\theta_j}$). In the expansion the first derivatives of the chi square with respect to each parameter are zero (we're at a minimum of the chi square surface). The first term in the expansion is simply the $\chi^2$ value at the minimum. As we obtain more data, the higher order terms will go to zero (assuming $(\theta - \theta_0)$ is less than one, then by Taylor's theorem the higher powers of $(\theta - \theta_0)$ approach zero faster than the derivatives grow). Using a Bayesian argument, we can now show that at the minimum of the $\chi^2$, our best estimate of the free parameters is Gaussian distributed. In a Bayesian setting, we wish to estimate the distribution of parameters (not just a point in parameter space). Hence, using Bayes' theorem we have:

$$P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)} \qquad (3.52)$$

In this picture we still wish to optimize the likelihood, however we now have the prior over the parameters and the prior over the data to deal with. Assuming the prior over the parameters is uniformly distributed (uninformative), maximizing the likelihood $P(D \mid \theta)$ is equivalent to maximizing the posterior of the parameters given the data (the parameter prior and data prior simply don't play a role), as the prior probability of the data $P(D)$ is a constant (not a function of the parameters). Hence we see that the maximum likelihood estimate (the minimum of the $\chi^2$) is equivalent to inferring the maximum a posteriori (MAP) estimate. Furthermore, we have now extended the point estimate of maximum likelihood to a full range of possible values, the posterior distribution over parameters. Hence, we have:

$$P(\theta \mid D) \propto P(D \mid \theta) \propto \exp\left(-\frac{1}{2}\chi^2\right) \propto \exp\left(-\frac{1}{2} (\theta - \theta_0)^T \left[\frac{1}{2}\frac{\partial^2 \chi^2}{\partial\theta\,\partial\theta}\right] (\theta - \theta_0)\right) \qquad (3.53)$$

where in the last expression we have replaced the $\chi^2$ with its Taylor expansion. Upon plugging in the Taylor expansion we see that our posterior distribution has the form of a multivariate Gaussian distribution with mean $\theta_0$ and covariance matrix $\left[\frac{1}{2}\frac{\partial^2 \chi^2}{\partial\theta\,\partial\theta}\right]^{-1}$, which we will denote as $\Sigma_\theta$. Hence the standard errors of the fit parameters are estimated simply as the square roots of the diagonal elements of the inverse of 1/2 the Hessian matrix, where the Hessian is evaluated at the best fit parameters (for similar analysis see the 'Laplace Approximation' on page 213 of Bishop [22], Press' model fitting section [104], and the text by Beck [14]).

What if the Hessian is not full rank? Then the Hessian cannot be inverted. This can be traced either to poor experimental design (e.g. too little data, which usually can't be helped), or to poor model design (which can be improved).

Noninvertible Hessians and non full column rank Jacobians have influenced my estimation of parameters. Hence, I will discuss these issues, in order to better determine what parameters will be fit.

3.5.2 The overdetermined and underdetermined problem

In linear regression one simply needs to ensure that the design matrix is full column rank in order to fit the data. Consider a two parameter linear model $f = Y = mX + b$, where one has $(X, Y)$ paired data, for example: (5,10), (10,15), (1,5). The design matrix, with the $X$ values in the first column and ones in the second, has the following form:

$$A = \begin{pmatrix} 5 & 1 \\ 10 & 1 \\ 1 & 1 \end{pmatrix}$$

where we have rewritten the linear model as $Ax = b$, since in the form $X\theta = b$ the 3x2 design matrix $X$ clashes in notation with what I've denoted as the single independent variable $X$.
Hence in the $Ax = b$ notation, $x$ is a 2x1 vector:

$$x = \begin{pmatrix} m \\ b \end{pmatrix} \qquad (3.54)$$

and $b$ is a 3x1 vector that contains the values of the dependent variable $Y$. Here the design matrix represents two 3-dimensional vectors (its two columns). Are these two vectors linearly independent, so that they span a two dimensional column space? If they are not, then this design matrix results in an underdetermined problem (not enough independent equations to solve for the unknowns). Oddly, in the context of systems of linear equations this matrix is also a so-called overdetermined problem (more equations than unknowns) when the design matrix is not full column rank. Hence, for design matrices that are not full column rank the problem is simultaneously overdetermined and underdetermined. A quick test for this type of scenario (in statistics) is simply to check if $\det A^T A$ is zero, where det is the determinant.

Nonlinear regression follows the linear problem in almost every detail. The design matrix is generalized to the 'Jacobian' matrix $J$, where the $ij$ element of $J$ is the partial derivative $\partial f(X_i)/\partial\theta_j$, where $f$ is our nonlinear model evaluated at data point $i$ and the derivative of $f$ is taken with respect to the $j$th parameter. A quick test to see if the Jacobian is full column rank is to check if $\det J^T J$ is zero.

An example of a non full column rank Jacobian, or a singular $J^T J$ matrix, is the model $f = \exp((\theta_1 + \theta_2) X)$ [14]. Regardless of how much data one collects for $(Y, X)$, the chi square surface at the minimum is a 'trough' over the two dimensional plane of $\theta_1$ and $\theta_2$.
This trough is parametrized by the equation $\theta_1 + \theta_2 = c$, for a constant $c$, which is an infinite line through the $\theta_1, \theta_2$ plane (its direction is a one dimensional subspace, the null space of our Jacobian matrix). Hence one can at best estimate one parameter (not two). In this example the problem with a singular Hessian, or a non full column rank Jacobian, is clear: the optimal solution (minimum chi square) is not a point, it is an infinite line, surface or hypersurface in parameter space, depending on the number of dependent columns in the Jacobian matrix. This is the consequence of a Jacobian matrix that is not full column rank. It means some of the parameters are dependent, hence some of the columns of the Jacobian matrix are linear combinations of other columns of the matrix. Hence the column space of the Jacobian has a dimension smaller than the number of desired parameters, which cannot be remedied by more data in the case of poor model design.

For example, imagine one knows with certainty two independent data points $(y_1, x_1), (y_2, x_2)$ for this model. Constructing a linear model for the above equation (by taking logarithms), one arrives at the following system of equations:

$$\begin{pmatrix} x_1 & x_1 \\ x_2 & x_2 \end{pmatrix} \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} = \begin{pmatrix} \ln y_1 \\ \ln y_2 \end{pmatrix}$$

whose solution set is just a line in the $\theta_1, \theta_2$ plane (the parameter vector space), where a unit vector along this line is $\frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}$ (we introduce a free parameter $t$ to denote any point along this line, hence $t$ parametrizes the null space of the Jacobian). This can be seen analytically, since the analytic Jacobian is:

$$\frac{\partial f(x; \theta_1 + \theta_2)}{\partial\theta_1} = x \exp\left((\theta_1 + \theta_2) x\right) \qquad (3.55)$$

$$\frac{\partial f(x; \theta_1 + \theta_2)}{\partial\theta_2} = x \exp\left((\theta_1 + \theta_2) x\right) \qquad (3.56)$$

These two equations should span parameter space, but they are not independent; they are identical, hence they (or either one of them, whichever one you like) span only a one dimensional line in parameter space.
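The determinant test from this section can be checked numerically. The sketch below is an illustration under stated assumptions: the linear design matrix uses the columns $(X, 1)$ for the example data points (5,10), (10,15), (1,5), the nonlinear Jacobian is evaluated at the arbitrary point $\theta_1 = 0.1$, $\theta_2 = 0.2$, and the helper names `gram` and `det2` are hypothetical.

```python
import math


def gram(rows):
    """Return the 2x2 matrix M^T M for a matrix M given as a list of 2-column rows."""
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows)
    return [[a, b], [b, d]]


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


xs = [5.0, 10.0, 1.0]

# Linear model Y = m*X + b: design matrix with columns (X, 1).
A = [[x, 1.0] for x in xs]
det_linear = det2(gram(A))

# Nonlinear model f = exp((theta1 + theta2) * x): both Jacobian columns
# are x * exp((theta1 + theta2) * x), so J^T J is singular.
theta1, theta2 = 0.1, 0.2
J = [[x * math.exp((theta1 + theta2) * x)] * 2 for x in xs]
det_nonlinear = det2(gram(J))

print(det_linear, det_nonlinear)
```

The linear design matrix gives a nonzero determinant (the two columns are independent), while the exponential model gives a zero determinant no matter which values of $x$ are sampled, because its two Jacobian columns are identical.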
3.6 Results

3.6.1 Best fit of parameters for data from Section 3.2.4, Experiment 1

We fit the data from Section 3.2.4 using the model from Equation 3.42, where the $\chi^2$ was defined as:

$$\chi^2 = \sum_z \sum_t \left(f(S_t, z) - E_t(z)\right)^2 \qquad (3.57)$$

where $f(S_t, z)$ is Equation 3.42, $S_t$ is a CRM from our data set, $z$ is a position along the DV axis, and $E_t(z)$ is the target expression driven by the CRM's corresponding gene, denoted by $t$, at position $z$ along the DV axis.

The annotation model MPA was originally fit to find a best set of parameters, and those parameters were then used as the initial point in parameter space for the nonlinear regression model 3.42, and for minimizing the objective function 3.57. However, because the $\chi^2$ surface of 3.46 is flat at the minimum, we decided not to try to fit parameters using the MPA model. Rather, we decided to just set $w_0 = 5$ and set all other model parameters to 'one', the default values of the nonlinear regression model parameters from Equation 3.43, and allowed the annotation model MPA to make predictions of binding sites within the CRMs using these default parameters. Given the MPA annotated CRMs, we were then able to fit the nonlinear regression model 3.42, where the fits are in Figure 3.7.

In the fits in Figure 3.7 we set the model parameter $w_0 = 5$ (as in Ay's model) and we estimated the following parameters: $\omega_{Dl,Tw}(d_1) = 5 \pm 2$; $\omega_{Dl,Tw}(d_2) = 64 \pm 37$; $\omega_{Dl,Sn}(d_1) = .1 \pm 25$; $\omega_{Dl,Sn}(d_2) = .7 \pm 12$; $w_{Dl} = 7 \pm .05$; $w_{Sn} = 39 \pm 687$, where $d_1$ represents the spacer bin of $[0, 30]$ bp, and $d_2$ represents the spacer bin of $[30, 60]$ bp. The errors were estimated as the square root of the diagonal of $(J^T J)^{-1}$. The $\chi^2$ was 41, where we had 1200 data points (each position along the z axis for each gene). The Hessian had a high condition number ($10^4$): the largest eigenvalue was 1.4, and the two smallest eigenvalues were 0.0002 and 0.001, suggesting that the $\chi^2$ surface was close to flat along those eigenvector directions of parameter space.
The correlation coefficient between the observed pattern and the predicted pattern is denoted as CC for each gene (which is at most 'one'), and the squared error between the observed pattern and predicted pattern is denoted as SE for each gene (where each gene had 40 positions, $z$, along the DV axis, i.e. SE is at most 40). The Snail profile is uniform from positions 0 to 8, where it is 'on' (at a value of 'one'), and Snail is off from positions 9 to 40 along the axis. The Twist gradient was replicated as the Dorsal gradient.

Figure 3.5: The legend is in the upper right corner of the table, denoting the observed profile ($E_t(z)$) as red and the model predictions as green, along with the header above each profile denoting the CRM (gene target) in green, and the Dorsal morphogen profile ($E_{Dl}(z)$) as a dotted blue curve.

3.6.2 Redesigning the parameters to be fit

The results above suggest poor nonlinear regression model design or insufficient data, where the free parameters to be fit from 3.43 were based on Segal's thermodynamic model, which was similarly used by Xin in GEMSTAT and by Fukhouri et al. [43]. In general, adding more parameters to the nonlinear regression model will either decrease the $\chi^2$ statistic (although the reduced chi-squared, which accounts for the degrees of freedom, may increase) or maintain the $\chi^2$ statistic at a particular value (i.e. the statistic becomes insensitive to additional parameters); but the $\chi^2$ will not increase. This assumes one's fitting algorithm is allowed to find the point along the newly added parameter's line in parameter space that improves the fit, or at least does not spoil the smaller parameter set's fit. Spoiling a previous fit with a newly added parameter can be avoided, for example, by simply setting the new parameter to a value where the model is insensitive (and hence $\chi^2$ will not be disturbed). For models with extensive dependencies between parameters this may not be possible. However, by inspection, the parameters of our nonlinear model can all be set to values such that they have no contribution to the $\chi^2$. For example, all the parameters in our nonlinear model, when set numerically to one, have the effect of not affecting the form of the model; in a sense, this is the parameter free form of the model. I will call these values of the parameters the default values (i.e. the default values have the odd property that it appears we are not fitting the parameters at all; in another sense, one could imagine fitting all the parameters and all the parameters being fit to the default values). Setting the values to zero, by contrast, has the effect of 'knocking out' all binding sites of a factor (in the case of a single-factor parameter) or 'knocking out' pairs of interacting binding sites (in the case of $\omega$).

Fitting subsets of the parameters in Eq. 3.43 (such as a parameter subset without our novel pairwise potential $\omega$) leads to good fits of the data by eye, with RMSEs of about 0.05. However, upon implementation of a Hessian matrix using a two point finite difference, we found that the Hessian matrix was not full rank for the parameter set even without the pairwise interactions $\omega$, our novel form of the potential. Without pairwise interactions, one still has six free parameters to fit for the CRMs' responses to the Dorsal, Twist and Snail morphogens. Based on expert knowledge of the DV network, certain choices of two of the parameters were selected to fit the data, which occasionally led to a Hessian with a condition number of about 100 (the condition number is a numerical diagnostic to determine if a matrix is singular¹³), while many of the choices of parameters led to Hessians with condition numbers on the order of $10^4$. Furthermore, the Hessian should be positive definite for a unique solution, while we found the eigenvalues were frequently mixed in sign (indicating saddle points). Of course, the gradient (the first derivative of the $\chi^2$ with respect to each parameter, a different object from the Jacobian) was zero (on the order of $10^{-10}$ for each component) at all of our estimates of the minima of the $\chi^2$ surface.¹⁴
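The two diagnostics used above, the condition number and the sign pattern of the eigenvalues, can be illustrated on a symmetric 2x2 matrix, where the eigenvalues have a closed form. This is a toy sketch, not the six-parameter Hessian of the model: the two example matrices are hypothetical, with the first built from the largest and smallest eigenvalues quoted in Section 3.6.1, and the function names are assumptions.

```python
import math


def sym2_eigenvalues(a, b, d):
    """Eigenvalues (smallest, largest) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return mean - radius, mean + radius


def diagnose(a, b, d):
    """Return (condition number, is_positive_definite) for [[a, b], [b, d]]."""
    lo, hi = sym2_eigenvalues(a, b, d)
    small, large = sorted((abs(lo), abs(hi)))
    cond = math.inf if small == 0.0 else large / small
    return cond, lo > 0.0


# Nearly flat direction: eigenvalues 1.4 and 0.0002.
flat_cond, flat_pd = diagnose(1.4, 0.0, 0.0002)

# Mixed-sign eigenvalues (3 and -1): a saddle point, not a minimum.
saddle_cond, saddle_pd = diagnose(1.0, 2.0, 1.0)

print(flat_cond, flat_pd, saddle_cond, saddle_pd)
```

A large condition number with all eigenvalues positive signals a nearly flat direction of the $\chi^2$ surface, whereas a modest condition number with mixed signs signals a saddle point, so the two checks answer different questions.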
In order to better estimate the errors of the parameters for our model we aimed to obtain a full column rank Hessian (regardless of whether the Hessian was positive definite). We analysed the analytic derivative of our nonlinear model with respect to each parameter $\theta_i$, and checked if other parameters occur in the result (indicating dependencies between parameters). If possible, parameters that do depend on one another were grouped to form just one parameter [14]. Furthermore, certain regions of the data space (such as certain positions in the embryo, like the dorsal ectoderm, where none of our genes are active) may be very insensitive to our parameters, causing small values of the Jacobian for these data points. However, this does not appear to affect the Hessian of the $\chi^2$, since the analytic Hessian $H$ of the $\chi^2$ is exactly equal to $J^T J + G$, where $G$ is a matrix of second derivatives over the model.¹⁵

¹³ In a computer, in a numerical representation, a matrix is rarely actually singular; it just has some eigenvalues that are much smaller than the largest eigenvalue of the matrix. The number of eigenvalues of the Hessian is the number of parameters to be fit, and if some of these eigenvalues are close to zero, this suggests a zero determinant Hessian.

¹⁴ These results led to the implementation of Levenberg Marquardt, for which GSL has a number of built in functions to help diagnose pathological nonlinear models (unlike the gradient descent and simplex implementations). If Levenberg Marquardt revealed the same types of errors in our parameter estimates then we could be more confident that there was not an error in our implementation of the Hessian (i.e. a zero determinant Hessian means our parameter error bars are infinite in extent, unless one decides to use a different method to estimate the error bars).
3.6.3 Analytic Jacobian

A simple function is $f(z) = 1/(1 + \exp(-\theta z))$, where $\theta$ is a constant, and where the domain and range of $z$ and $f$ are $z \in (-\infty, \infty)$ and $f(z) \in [0, 1]$. Now if we imagine that $\theta$ is a free parameter, then we can ask how sensitive this function is to variations of $\theta$ as a function of particular positions along the domain of $z$. Hence the derivative is:

$$\frac{\partial f(\theta, z)}{\partial\theta} = \frac{z \exp(-\theta z)}{(1 + \exp(-\theta z))^2} = z\, f(z; \theta)\,(1 - f(z; \theta)) \qquad (3.58)$$

so that, up to the factor $z$, the sensitivity is $f(1 - f)$. The above equation, in a sense, is an analytic representation of the Jacobian matrix elements for the nonlinear model (e.g. BTA occupancy) as a function of data (all possible data, which is represented by the domain of $z$).

We do not need to analyze our more complicated BTA occupancy function to understand the behavior of the Jacobian matrix elements; rather, we can simply use the above equation as a phenomenological parametrization of our model of CRM response to morphogen gradients (which are encoded in the spatial position $z$). For example, imagine one wished to analyze phenomenologically the response of the rhomboid expression pattern just in the positions of the embryo in the neuroectoderm and the dorsal ectoderm (in our binning of the DV axis this would be equivalent to the $z$ interval of [9,40], where the interval [1,8] contains the mesoderm, which we are uninterested in for the moment). We can model the rhomboid expression as a function of the DV axis using the above equation, where we set $\theta$ to be negative since we must flip the function about the 'switch' point where the gene is turned 'on' (the point in space where rho turns 'on' is at the neuroectoderm-ectoderm border).¹⁶ We must also define the position where the logistic function reaches 1/2 max (which in the above equation is position $z = 0$). Hence we must add a constant to the argument of the exponential,

$$f(z) = \frac{1}{1 + \exp(-\theta z - \theta_0)},$$

where $\theta_0$ defines the 1/2 max of the logistic function, reached when $\theta z = -\theta_0$. The analytic Jacobian for these two parameters (as a function of data $z$) is:

$$\frac{\partial f}{\partial\theta} = \frac{z \exp(-\theta z - \theta_0)}{(1 + \exp(-\theta z - \theta_0))^2} = z f (1 - f) \qquad (3.59)$$

¹⁵ The analytic Hessian in exact form is $H = \frac{\partial^2 \chi^2}{\partial\theta\,\partial\theta} = J^T J + \frac{\partial^2 f}{\partial\theta\,\partial\theta} = J^T J + G$, where $G$ is a matrix with second derivative elements of the nonlinear model $f$ (the fractional occupancy of the BTA) over the model parameters (for example, see equation A.4c, page 482 of Beck [14]). In the case that $G$'s Frobenius norm is nearly zero (the elements of $G$ are nearly zero), we have $H = J^T J$, which is why with certain data sets and experiments it is possible to estimate the covariance matrix of parameters based on the Jacobian alone. $J^T J$ for the case of one free parameter is just a dot product (i.e. the Jacobian is just a vector of size $n \times 1$, where $n$ is the number of data points), which will be insensitive to data points that were uninformative for parameter estimation (they just add a zero term to the dot product). Hence, it seems reasonable that uninformative data (data where the model is insensitive, such as dorsal ectoderm regions of the embryo, where none of our CRMs are active) will not harm model fitting.

¹⁶ If we modeled the 'switch' point where the gene is turned 'off' (the point in space where rho turns 'off' is at the mesoderm-neuroectoderm border) we would not have to flip our graph about the switch.
$$\frac{\partial f}{\partial\theta_0} = \frac{\exp(-\theta z - \theta_0)}{(1 + \exp(-\theta z - \theta_0))^2} = f (1 - f) \qquad (3.60)$$ ¹⁷ ¹⁸

Now we can see some simple relations. First of all, $J^T J = 0$ if one only collected data in the dorsal ectoderm region of the embryo (where the logistic is saturated, so that $f(1 - f)$ is nearly zero), meaning we cannot invert that matrix (so $J$ is not full column rank). Similarly, if one just has Boolean data for the expression pattern of rho along the z axis, say $E_{rho}$ = 00000001111111100000000, the only possible points where $J$ is nonzero are at the borders where the gene switches from off to on or vice versa. Now imagine for an NEE gene that we have an expression pattern over space like 0000000101010100000000, where the switching behavior (10101010) is in the neuroectoderm

¹⁷ The logistic function when evaluated at a particular point $z_0$ can be thought of as a Bernoulli distribution; hence, the last expression is the variance of the Bernoulli distribution, where the function $f(x_0)$ at each point $x_0$ happens to be the 'parameter' that describes a Bernoulli distribution (normally this parameter is denoted as '$p$' for probability, where the mean of a Bernoulli random variable is also $p$, and the variance of a Bernoulli random variable is $p(1 - p)$). This is effectively how the BTA occupancy is described along the DV axis, where $x_0$ now denotes the position along the DV axis, and $\theta$ is a free parameter.
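The Jacobian entries for the two-parameter logistic can be checked numerically. The sketch below assumes the sign convention $f(z) = 1/(1 + \exp(-\theta z - \theta_0))$, with the hypothetical values $\theta = -1$ and $\theta_0 = 16$ placing the half maximum at $z = 16$; the function names are illustrative only. It compares the analytic derivative with a central finite difference and shows how little a saturated region contributes to $J^T J$.

```python
import math


def f(z, theta, theta0):
    """Logistic response with half maximum where theta * z = -theta0."""
    return 1.0 / (1.0 + math.exp(-theta * z - theta0))


def jacobian_row(z, theta, theta0):
    """Analytic derivatives (df/dtheta, df/dtheta0) at position z."""
    p = f(z, theta, theta0)
    return z * p * (1.0 - p), p * (1.0 - p)


def information(zs, theta, theta0):
    """J^T J for the single parameter theta0: a dot product over data points."""
    return sum(jacobian_row(z, theta, theta0)[1] ** 2 for z in zs)


theta, theta0 = -1.0, 16.0

# Central finite difference in theta at z = 15, compared with the analytic value.
h = 1e-6
fd = (f(15, theta + h, theta0) - f(15, theta - h, theta0)) / (2.0 * h)
analytic = jacobian_row(15, theta, theta0)[0]

info_dorsal = information(range(30, 41), theta, theta0)   # saturated region
info_switch = information(range(14, 19), theta, theta0)   # around the switch

print(fd, analytic, info_dorsal, info_switch)
```

Data taken only from the saturated region give a $J^T J$ that is numerically indistinguishable from zero, while a handful of points around the switch carry essentially all of the information about $\theta_0$.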
¹⁸ Do these two equations span parameter space? Recall that previously we analyzed $\exp((\theta_1 + \theta_2) z)$ for its behavior in the context of nonlinear regression. Similarly, for the logistic nonlinear model we would like to analyze $1/(1 + \exp(-(\theta_0 + \theta z)))$. For example, if one knows with certainty two independent data points $(y_1, x_1), (y_2, x_2)$, then we can transform to a simple linear model for the above equation, where one arrives at the following system of equations:

$$\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \end{pmatrix} \begin{pmatrix} \theta_0 \\ \theta \end{pmatrix} = \begin{pmatrix} \ln\frac{y_1}{1 - y_1} \\ \ln\frac{y_2}{1 - y_2} \end{pmatrix}$$

Does this span the $\theta, \theta_0$ plane (the parameter vector space)? The determinant of the design matrix is $x_2 - x_1$, which suggests that as long as the experimental design is such that the data points $x_1$ and $x_2$ are not too close together, we could actually fit $\theta_0$ and $\theta$. However, one must be careful here, since data collected in regions of very large or very small $x_1$ and $x_2$ will cause the original response $y$ to become close to 'one' or 'zero', causing the logarithm's value to be very large, possibly causing numerical issues. This can be seen by symbolically solving for the eigenvalues of the design matrix, where we must solve for the roots of $(1 - \lambda)(x_2 - \lambda) - x_1 = \det(X - \lambda I) = 0$, where $X$ is the design matrix and $\lambda$ denotes its eigenvalues. This yields the quadratic $\lambda^2 - (1 + x_2)\lambda + x_2 - x_1$. If $\lambda$ is zero (which means we would have a singular design matrix) we see that indeed $x_2 = x_1$, which makes sense: we can't expect repeated measurements of the same position in space to tell us anything about the global behavior of the sigmoid (as defined by the $\theta$ parameters). Solving for the roots of the polynomial using the quadratic formula we have:

$$\lambda = \frac{(1 + x_2) \pm \sqrt{(x_2 - 1)^2 + 4 x_1}}{2} \qquad (3.61)$$

Here we know constraints on the data due to the embryo size, where the DV portion of the embryo constrains $[x_1, x_2]$ to reside in the interval $[0, 40]$, since there are only 40 cells along that axis. Given the constraint on embryo size it seems we can at least say $\sqrt{4 x_1}$ ...

... $= 1/2$, hence $E_{Dl}(z) = 1$. However, the CRM 6xdl had 6 Dorsal binding sites, hence one unit of occupancy could be reached at 1/6 of this concentration (since the sites are all independent, by assumption).
Another construct that was used was the double knockout of Twist sites by Ip, which reported catastrophic loss of expression in the neuroectoderm, the construct labelled rho 2216 t1 t2 s4 a. Hence the Ip construct along with the 6xdlPLZ acted as a control group. The treatment group (in a sense) consisted of two constructs that are known to have Twist sites: the rho CRM (from the mel species) and the vn CRM (from the vir species).

The Snail protein was not used in this model fit, and the Twist gradient was replicated as the Dorsal gradient. The 6xtwPLZ was also left in the training set, but this has a negligible effect, both because of model assumptions (i.e. default parameter values of Twist) and because the expression profile reported by Szymanski for this construct was roughly zero (i.e. Twist binding sites alone are not sufficient; see the Data set section for references on the data).

Figure 3.6: The legend is in the upper right corner of the table, denoting the observed profile ($E_t(z)$) as red and the model predictions as green, along with the header above each profile denoting the CRM (gene target) in green, and the Dorsal morphogen profile ($E_{Dl}(z)$) as a dotted blue curve. The correlation coefficient between the observed pattern and the predicted pattern is denoted as CC for each gene (which is at most 'one'), and the squared error between the observed pattern and predicted pattern is denoted as SE for each gene. Each gene had 20 positions, $z$, along the DV axis, which as always (in the DV literature) is plotted such that ventral is at the zero position.

Figure 3.7: The CRMs and predicted binding sites from MPA for default parameters on all proteins. Dorsal is annotated blue, Twist green. The column d+ denotes added noise to the Dorsal concentration (which was zero in this case, hence d+ should not be there). A bug in the printing code caused one of the vn vir sites to not appear in the CRM sequence highlighting.

3.6.6 Robustness analysis, Experiment 3

We increased the Dorsal expression by 0.15 at each position along the DV axis to see if the annotated binding sites were the same when using the annotation model MPA. The data set was four NEE modules: rho mel, rho vir, vn mel, vn vir.

The Twist gene's single-factor free parameters were all set to 'one', and $\omega_{Tw,Dl}$ was set to 'one' for all bins except $B = [0, 20]$ bp, and the $w_{Tw}$ parameter was set to zero. Furthermore, Twist's morphogen concentration gradient $E_{Tw}$ was set equal to Dorsal's gradient to ensure that their differential gradients were not affecting the result. Snail's quenching was set for $w_{Sn,Dl}$ for only one bin, $B = [0, 50]$ bp (quenching means $\omega$ is in the range [0,1], while cooperativity means $\omega$ is in the range [1,100]). Hence, we only allowed the Dorsal and Snail parameters to be tuned by the fitting routine.

The results of annotation with the wild-type Dorsal expression (no perturbation of the profile) are shown in the corresponding figure. The results of the perturbed Dorsal are in Figure 3.9. The annotations are identical. This is possibly an artefact of the way the Dorsal profile was perturbed (by simply adding 0.15 to each cell). Due to the sigmoidal nature of the Dorsal profile this has little effect. A better designed experiment would shift the half max of the Dorsal profile by a number of positions along the z axis.

Figure 3.8: Here the target gene is denoted in the left column, and the cell along the DV axis is denoted in the second column. Twist sites are annotated green, Dorsal blue, Snail red, and brown denotes overlaps. The sites annotated at the mesoderm bottom border were used to annotate the sequence. For example, the first gene is rho mel, for rhomboid in the species melanogaster.

Figure 3.9: Here the target gene is denoted in the left column, the second column contains d+ to denote an increase (+) in the Dorsal (d) gradient along the DV axis, and the cell along the DV axis used for extracting concentrations for the annotation model MPA is denoted in the third column. Twist sites are annotated green, Dorsal blue, Snail red, and brown denotes overlaps.
3.7 Discussion and Background on Robustness

3.7.1 Morphogen Gradients and Developmental Robustness

Given a morphogen distribution over one spatial dimension, one can determine how sensitive the thresholds of the target genes are to perturbations of the morphogen distribution. This is discussed in Alon's text in chapter 8, "Robust Patterning in Development" [5]. Here we will reproduce Alon's derivations for producing a morphogen profile (distribution) and his analysis of whether this profile is robust, but we will do so in the context of the morphogen Bicoid (Bc) and its robustness probe Hunchback (Hb), where similar analysis was done by Gregor and Houchmandzadeh [55][66]. Our analysis starts with a transport equation for the morphogen:

\[ \frac{\partial M}{\partial t} = D \frac{\partial^2 M}{\partial x^2} - F(M) \tag{3.64} \]

Here $F(M)$ is a function of the morphogen concentration and represents the degradation of the morphogen. Notably missing from the equation is the production rate of the morphogen; as assumed by Gregor and in Alon's text, the morphogen is taken to be in steady state [5]. Hence the time derivative is zero, and one can determine, or declare, the morphogen concentration by the boundary conditions $M(x = 0) = M_0$ and $M(x = \infty) = 0$. It is important to see conceptually what these conditions mean, since thermodynamic equilibrium in diffusion processes would lead to a morphogen distribution that is uniform over space, which we're about to see is not the case for our problem, because a uniform profile occurs only if the system does not have sources and sinks. Here we're assuming some source, a constant source such as ribosomes translating Bc mRNA at position $x = 0$, and we are assuming a sink by the proteasome degradation of the Bc protein at $x = \infty$. [footnote 1]
The constant $D$ determines the absolute length scale of the problem, which creates a problem for patterning mechanisms if the embryos have large variations in their size, since an intricate scaling mechanism would be required, as diffusion is based on absolute length scales; this is discussed by Gregor and by Crocker [54]. Assuming the variation of the embryo length is negligible, we need to see how the scale of the problem is established. Alon's analysis starts with a linear degradation mechanism $F(M) = \alpha M$, whereupon solving the PDE, given the morphogen is in steady state, one arrives at:

\[ M(x) = M_0 e^{-x/\lambda} \tag{3.65} \]

where $\lambda = \sqrt{D/\alpha}$. Clearly when $x = \lambda$ the morphogen has reached a concentration of $M_0/e$, which is only about 37% of its max, and after $2\lambda$ the morphogen has reached a concentration of $M_0/e^2$. In general, for

\[ x = \lambda, 2\lambda, \ldots, n\lambda \tag{3.66} \]

then

\[ M = \frac{M_0}{e}, \frac{M_0}{e^2}, \ldots, \frac{M_0}{e^n} \tag{3.67} \]

Using 'units' of $\lambda$ is not only useful for seeing how the morphogen decays as a function of absolute length; it is in fact the unit the morphogen uses to create patterns. When one anthropomorphizes the morphogen, one realizes that the morphogen cannot reproducibly make patterns on any scale other than $\lambda$, where reproducible means comparing the patterns between embryos.

Footnote 1: The idea that Bc mRNA is localized at $x = 0$ was recently shown to be false; it was shown that Bc mRNA also diffuses and closely follows the Bc protein profile. For our discussion we'll assume that Bc is translated into protein at $x = 0$ and then diffuses, and all of this is captured by our boundary condition.

3.7.2 Fine patterns

Let's see if the morphogen can create a pattern on the scale of 1 μm when $\lambda = 10$ μm. Now the morphogen creates the pattern by binding to the Hb promoter and stimulating Hb expression, so the Hb expression as a function of space is the pattern. This is denoted by Figure 3.10. In Drosophila the x axis (the Anterior Posterior axis, the major axis of the ellipsoidal embryo) is about 500 μm and each nucleus is separated by about 10 μm. Since the pattern here refers to the expression of Hb as a function of nuclear position, we see that 1 μm doesn't even make sense, since the smallest unit of our problem is 10 μm. So let's modify our ill-posed question to the morphogen trying to make a pattern over 10 μm where $\lambda = 100$ μm.
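The exponential profile of equation (3.65) and the modified question above can be made concrete with a short numerical sketch. It is only an illustration: the function names are ours, $M_0$ is set to 1 in arbitrary units, and the values $\lambda = 100$ μm and a nuclear spacing of 10 μm are the round numbers quoted in the text rather than fitted quantities.

```python
import math

def morphogen_profile(x_um, m0=1.0, lam_um=100.0):
    """Steady-state profile M(x) = M0 * exp(-x / lambda), as in equation (3.65)."""
    return m0 * math.exp(-x_um / lam_um)

def neighbor_contrast(spacing_um=10.0, lam_um=100.0):
    """Fractional drop in M between two nuclei separated by spacing_um."""
    return 1.0 - math.exp(-spacing_um / lam_um)

# Concentration (as a fraction of M0) at x = 0, lambda, 2*lambda, 3*lambda.
fractions = [morphogen_profile(n * 100.0) for n in range(4)]

# Relative concentration difference seen by two neighboring nuclei.
contrast = neighbor_contrast()

print(fractions)
print(contrast)
```

With these assumed numbers, two neighboring nuclei see concentrations that differ by a bit less than 10%, which is the difference the Hb promoter would have to resolve.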
Now to determine if a pattern can be created, all we need to know is whether the Hb promoter can detect (distinguish) Bc concentrations that are separated by 10 μm (two neighboring nuclei). Well, the answer is simple: yes. If we use the Hill model (3.68) to represent Hb's detection mechanism, then clearly the promoter can detect any concentration of Bicoid, and clearly any desired pattern could be produced. (To determine the pattern one simply defines a Bicoid concentration as the start of the pattern, i.e. a 'threshold concentration' $M(x)$, and then inverts the Hill equation to find the Bc concentration and position, which come from equation (3.65).)

\[ \frac{[Hb]}{[Hb_{max}]} = \frac{1}{1 + e^{-(w_0 + \langle n_{Bc} \rangle w_{Bc})}} \tag{3.68} \]

But we want to explore this problem from the perspective of robustness. That is, if the Bc concentration deviates from its ideal profile for whatever reason (e.g. the mother laid less functional bc mRNA because only one of her bc alleles is functional, i.e. she is a heterozygote, or a slight variant of the mRNA is laid in the oocyte, etc.; see for example [9] for a case of microscopy analysis of heterozygotes and homozygotes), then what would happen to our Hb profile? In this sense, one may loosely say that we are interested in knowing if a nonrobust input can still be transformed into a robust output [footnote 2].

A formal measure of this robustness is defined by Alon as the 'positional shift' $\delta$, that is, by how many microns our pattern shifts. To determine the function $\delta$ we need to perturb the production rate $M$ (i.e. the boundary condition at $x = 0$, the $M_0$ of equation (3.65)) from $M$ to $M'$; then one can see $\delta$ from the graph in Figure 3.10. When analyzing these equations, keep in mind that the Hb concentration is captured by the concentration of the morphogen $[M]$, so $M(x)$ and $M(x')$ represent the same Hb concentration.
\[ dx = \frac{dx}{dM}\, dM \tag{3.69} \]

\[ dx = \delta = x - x' \tag{3.70} \]

Inverting equation (3.65) we can isolate the position, hence solving for $x$ (with $M$ denoting the boundary value at $x = 0$):

\[ x = \lambda \log\frac{M}{M(x)} \tag{3.71} \]

Now if we change the boundary condition to $M'$ and solve for $x'$, which represents the new border position of Hb expression:

\[ x' = \lambda \log\frac{M'}{M(x')} \tag{3.72} \]

Now we have the pieces to calculate $\delta$:

\[ \delta = \lambda \log\frac{M}{M(x)} - \lambda \log\frac{M'}{M(x')} \tag{3.73} \]

Since $M(x) = M(x')$, we arrive at:

\[ \delta = -\lambda \log\frac{M'}{M} \tag{3.74} \]

Now that we know $\delta$ we can analyze why $\lambda$ is the effective length scale. For example, let $M' = M/2$; then $\delta = -\lambda \log(1/2) \approx 0.7\lambda$. Recall that we wanted to find out if the morphogen can create a pattern over 10 μm, which was effectively the finest pattern, since 10 μm is the spacing between two cells or nuclei. But $0.7\lambda$ (about 70 μm for $\lambda = 100$ μm) is a shift of 7 nuclei! Clearly, for our problem, if the production rate varies by a 50% reduction (as seen in classical heterozygote experiments, where one allele is lost), a precision on the order of the spacing between nuclei is not possible.

Hence, the idea of a 'relevant length scale' for our problem can be understood from the perspective of what happens due to natural variations in the morphogen gradient.

Determining if a pattern is 'robust' to the morphogen's gradient (nearly independent of the morphogen's production rate) or if the pattern is 'fine-tuned' (extremely sensitive to any variation of the production rate) is an important problem in studying morphogen gradients; the parenthetical definitions both come from Alon's chapter 7, 'Robustness of Protein Circuits'.

Footnote 2: For thermodynamic modeling (fractional occupancy) this question is approximately answered by taking the derivative of the model $\epsilon$ with respect to the input concentration that is varying; one will see that the effect is only significant at the border, due to the form of the model.

Figure 3.10: The green gradient represents the concentration of Bicoid, the orange nuclei denote Hb expression, and the blue nuclei represent no Hb expression. This was modified from figure 1 of [125] and Alon's text [5].
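The arithmetic behind the heterozygote example can be checked with a few lines of code. This is a minimal sketch of equation (3.74) only; the function name and the default values ($\lambda = 100$ μm, nuclear spacing 10 μm) are assumptions taken from the round numbers used in the discussion.

```python
import math

def positional_shift(m_old, m_new, lam_um=100.0):
    """Positional shift delta = x - x' = -lambda * log(M'/M), equation (3.74)."""
    return -lam_um * math.log(m_new / m_old)

# Halving the boundary concentration (one functional bc allele instead of two).
delta_um = positional_shift(1.0, 0.5)

# Express the shift in units of the assumed 10 um spacing between nuclei.
shift_in_nuclei = delta_um / 10.0

print(delta_um, shift_in_nuclei)
```

Because only the ratio $M'/M$ enters, the shift is the same for any absolute production rate; under linear degradation it is set entirely by $\lambda$.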
From our analysis, we see that the two goals of attaining 'robust' outputs while maintaining a fine pattern for the output are at odds with one another; there is a tradeoff. The more robust we make our output to variations of the input, the less the module will be able to distinguish different concentrations of the input, and hence it will not be able to produce a fine pattern. However, great attention has been given to 'combinatorial' regulatory mechanisms. In this sense one can imagine that noisy inputs do allow for robust and fine tuning, but the robustness is with respect to 'redundant' inputs [footnote 1]. This can for example be achieved through cooperativity; or, rather than robustness from combinatorial inputs, the promoter may have redundant binding sites for a key activator, thereby reaching the minimal number of morphogen molecules to be bound for maximum expression.

Footnote 1: For example, if two activators are both sufficient for transcription and both bind to the module that determines the output, then random noise in one activator can clearly be compensated for by the other input.

For example, let us model the border of rhomboid expression as a function of cooperativity. Then we can ask how robust the border is with respect to variations of the concentrations of Tw and Dl [footnote 2].

If we apply the analysis of Alon to the rhomboid gene [footnote 3], which is in this case activated by two morphogens, $M_{dl}$ and $M_{tw}$, we see that if we wish to apply 'thermodynamic modeling' it doesn't matter what the actual profiles of Dl and Tw look like; all we need to do is vary these two concentrations simultaneously and analyze the effect on the model $\epsilon$.
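Using the fact that $\langle (n_m)^2 \rangle = \partial^2 [\ldots] / \partial n_m^2$, where [...] is the partition function for the CRM and morphogen system, one can check by simulation that $\langle (n_m)^2 \rangle = \sqrt{[\ldots]}$. For example, the Rhomboid expression at the dorsal border is:

\[ \epsilon = \frac{1}{1 + e^{-\langle n_{dl} \rangle w_{dl} + \langle n_{sn} \rangle w_{sn} - \langle n_{tw} \rangle w_{tw}}} \tag{3.75} \]

The $w$'s are the parameters to be fit and the $\langle n \rangle$'s represent the average occupancy of each protein on the promoter. It is known that twist alone is not sufficient for transcription, so we can set $w_{tw} = 0$. Furthermore, the dorsal border of rhomboid is well beyond snail's expression pattern, hence there will be no snail occupancy there, so $\langle n_{sn} \rangle = 0$. This leaves:

\[ \epsilon = \frac{1}{1 + e^{-\langle n_{dl} \rangle w_{dl}}} \tag{3.76} \]

Therefore the variance in $\epsilon$, which we'll label as $\sigma_{rho}$, will be roughly the range of expression:

\[ \sigma_{rho} = \frac{1}{1 + e^{-(\langle n_{dl} \rangle + n_m) w_{dl}}} - \frac{1}{1 + e^{-(\langle n_{dl} \rangle - n_m) w_{dl}}} \tag{3.77} \]

Once $\sigma_{rho}$ is known, one can then determine by how far the threshold concentration of Rhomboid ($M(x) = T$) has shifted in terms of cells, in the sense that $\epsilon(x) \pm \sigma_{rho}$ yields the two new expected values of Rhomboid, and one can simply compare these with the expected Rhomboid (i.e. $\epsilon(x) \pm \sigma_{rho} = \epsilon(x')$, and $x - x' = \delta$ is the shift).

Footnote 2: Here we should be careful, as the Tw concentration is a function of the Dl concentration, hence their noise is correlated, although less so at the dorsal border of the neuroectoderm. For simplicity we'll assume the noise is uncorrelated.

Footnote 3: Tissues can be defined by their gene expression, so our analysis is simply to find the threshold $T$ (the border concentration) of rhomboid.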
Since $\langle n_{dl} \rangle$ is related to the occupancy of twist, $\langle n_{tw} \rangle$, this implies that the variance in dorsal occupancy $n_{dl}$ captures how both dorsal and twist variations cause shifts in the dorsal border of rhomboid. Another way is to simply look at the corresponding matrix element within the Hessian:

\[ \frac{\partial^2 \epsilon}{\partial M_{tw}\, \partial M_{dl}} \tag{3.78} \]

3.7.3 Definition of Border and Production Rate

Alon defines a transcription network as a set of nodes (target genes) and a set of edges (inputs to a Hill function). The inputs to each node are the three parameters that define the Hill function ($n$, $K$, $\beta$), where $n$ is the Hill coefficient, $K$ is the equilibrium constant of the transcription factor activating or repressing the node [footnote 1], and $\beta$ is the production rate of the gene. Hence for a particular gene, which has mRNA concentration $Y$, one would have the following equation defining the gene's dynamics:

\[ \frac{dY}{dt} = \frac{\beta}{1 + X/K} - \alpha Y \tag{3.79} \]

Footnote 1: $\beta$ is the production rate ($V_{max}$) in Michaelis-Menten kinetics, that is, the maximum expression level.

Here $X$ is the concentration of an activator transcription factor and $K$ is the equilibrium constant for the binding of the factor $X$ to its binding site (for a repressor input simply invert $X/K$), and $\alpha$ is the degradation rate of the mRNA (or of the protein product, depending on what $Y$ represents).

Clearly, for any gene (node), $\beta$ is constrained by processivity (the maximum translocation rate, $V_{max}$, of POLII along the gene, which is doubled with a 10° increase in temperature (ref Davidson)), hence $\beta \in [0, V_{max}]$. Furthermore, since each gene $g$ will have its own production rate, we can say $\beta_g \in [0, \beta_{max}]$.

Now from this description all of the information about the max production rate for our thermodynamic model (i.e. for a specific gene) is encoded inside the sigmoidal function of the $w$'s, because the max production rate of the entire network is $\beta_{max}$ and because the model $\epsilon$ is divided by $\beta_{max}$ (similar to Gregor's [Hb]/[Hb max]) [footnote 1]. This means the maximum expression in our thermodynamic model is 1, and it is attained at $\beta_{max}$ (assuming steady state of mRNA expression). Now as we vary the genes in the network, if we assume $\alpha$ is fixed [footnote 2], then for different steady-state levels it follows that $\beta$ must be varying (i.e. decreasing from $\beta_{max}$). Hence there is a rule which maps our sigmoidal function of the $w$'s to $\beta$ (the operation is not one to one due to saturation of the sigmoidal function, assuming we don't have infinite precision).
In our description of the profiles we have assumed that some profiles are not producing at $\beta_{max}$; that is, we have lowered the level of gene expression because the level is lower than that of a reference gene which gives $\beta_{max}$ (that is, all genes whose expression level is at 1).

Footnote 1: Sinha's lab noted that the correlation coefficient (one possible form of the objective function used to fit the parameters of the network) is scale invariant, hence they create a model where $\beta$ is a free parameter for EACH gene. This is exactly the way Alon describes the network (where each gene gets its own $\beta$). However, I prefer not to do that, because it changes the meaning of the 'border', and hence the meaning of the $w$'s in the sigmoidal.

Footnote 2: Since our reporter for each gene (or module) is LacZ, it is safe to assume the degradation rate of LacZ mRNA and protein is roughly the same.

In assigning different expression levels ($\beta$), one is not affecting the border of the expression (i.e. the parameter $K$), as in this model the parameters ($\beta$ and $K$) are biochemically independent. Hence some profiles, which have their max expression levels below 1/2, do not have a border. However, as pointed out by others, one could imagine a stepwise (ladder) function, where there are multiple borders (i.e. the level = 1 is not the max anymore). For example we could assume an expression level of 2 as the max:

\[ f(Y_{ss}) = \beta_1 \theta(X - K_1) + \beta_2 \theta(X - K_2) \tag{3.80} \]

Here $Y_{ss}$ is the steady-state concentration and $\theta$ is the step function. Hence once $X$ is above $K_1$ the level jumps to $\beta_1$, then it stays at that level until $X$ reaches $K_2$, at which point the level jumps to $\beta_1 + \beta_2$. Actually this equation isn't properly written, as one should start from the dynamic equation (with degradation) and solve, but the idea still holds: the borders are $K_1$ and $K_2$, which allows lower expressing modules to have a well defined border ($K$).

If we assume our set of target genes $G = \{g_1, g_2, \ldots, g_n\}$ (i.e. regulatory modules or sequences $S_{crm}$ in our Dataset $D$) follows the production rate relation $\beta_g \in (0, \beta_{max})$, then using the Segal or Hill equation (equation 3.18),

\[ \epsilon = \frac{1}{1 + \exp\left(-\left(w_o + \sum_i w_i \langle n_i \rangle\right)\right)} \tag{3.81} \]

we can linearize it to solve for the $w$'s:

\[ \log\frac{\epsilon}{1 - \epsilon} = w_o + \sum_i w_i \langle n_i \rangle = \text{odds} \tag{3.82} \]

If we assume the factors $i$ govern all $n$ genes, then we have a matrix equation:

\[ \begin{pmatrix} \log\frac{\epsilon}{1-\epsilon} \\ \log\frac{\epsilon}{1-\epsilon} \\ \vdots \\ \log\frac{\epsilon}{1-\epsilon} \end{pmatrix} = \begin{pmatrix} 1 & \langle n_1 \rangle & \langle n_2 \rangle & \langle n_3 \rangle \\ 1 & \langle n_1 \rangle & \langle n_2 \rangle & \langle n_3 \rangle \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \langle n_1 \rangle & \langle n_2 \rangle & \langle n_3 \rangle \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \end{pmatrix} \]

where each row is evaluated for one gene. Here we can use all possible points (infinitely many if continuous) along the profile $\epsilon(z)$, where $z$ is cellular position [footnote 1], to solve for the $w$'s. Furthermore, we also see that the points corresponding to $\epsilon \in \{0, 1\}$ lead to singularities, which is not surprising, as there are infinitely many combinations of $w$'s that in the limit give 0 or 1 for the sigmoidal function. (Another way to think of the 1, 0 solutions is that they have lost information: there are multiple inputs that lead to 1 or 0, hence knowledge of these saturated outputs is uninformative of the exact input occupancy.)

Footnote 1: Recall that the occupancies are functions of concentration, $\langle n \rangle(c(z))$, and hence are a function of position along the tissue. If we have 40 cells in which we measure $Y, X_1, X_2, X_3$, and if we have $n$ genes, then we'll have $n \times 40$ equations, and hence $n \times 40$ rows in our matrix. Many of those equations will have the same value for $Y$, and hence can be set equal to each other; this is a consequence of the fact that there are many possible occupancies that lead to the same value of the odds (Equation 3.82). However, if we fix the odds to be, say, 4, 5, 6, 7, then we have a system of 4 equations. If we select one gene that has all these values reached at some point along its profile we'll have:

\[ \begin{pmatrix} 4 \\ 5 \\ 6 \\ 7 \end{pmatrix} = \begin{pmatrix} 1 & \langle n_1 \rangle(z) & \langle n_2 \rangle(z) & \langle n_3 \rangle(z) \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \langle n_1 \rangle(z) & \langle n_2 \rangle(z) & \langle n_3 \rangle(z) \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \end{pmatrix} \]

where $z$ is the variable position along the profile, evaluated at 4 different positions, $z_1, \ldots, z_4$. In this case it is reasonable, and in fact mandatory, that the occupancy change from one equation to another. However, what if we collected 4 positions that all had identical values of the odds?

However, if $\epsilon \in (0.1, 0.9)$, then our matrix equation is well defined. (Here there may be multiple ways to yield these solutions too, as there are many ways to generate a number from a sum of 3 integers.) In particular, if we collect all the genes that have a border (i.e. $\epsilon = 0.5$, which is analogous to $X = K$ in the above discussion, and for our case it is like saying the gene's activator occupancy $\langle n \rangle = K$), we then will have a subset of $G$ which yields a homogeneous system of equations:

\[ \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} 1 & \langle n_1 \rangle & \langle n_2 \rangle & \langle n_3 \rangle \\ 1 & \langle n_1 \rangle & \langle n_2 \rangle & \langle n_3 \rangle \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \langle n_1 \rangle & \langle n_2 \rangle & \langle n_3 \rangle \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \end{pmatrix} \]

Now if we imagine $m = 3$, we can make some important analysis (for $m > 3$, we can use least squares to solve for $w$ [footnote 1]). If $m = 3$, we have a system of 3 homogeneous equations with 3 unknowns (the $w$'s). If our system is linearly independent, then the $w$'s must be zero, as that is the unique solution for a homogeneous system that is linearly independent. (If this is true for $m = 3$, it obviously holds for $m > 3$, which means adding more genes doesn't help us find nontrivial $w$'s; rather, we must identify informative constructs (target genes) under particular conditions.)

Footnote 1: Least squares gives an analytic solution for the minimum, and this minimum for a linear equation can be shown to be the global minimum. For the situation with error in both inputs and outputs, I believe one finds that there are multiple roots and hence multiple minima, but the number of minima is small, hence one can exhaustively look at all the minima and simply find the smallest (the global one).

Furthermore, if the variables are linearly independent for each gene (so we no longer have a system of equations for the $w$'s), then this indicates a method to see if the idea of putting multiple genes in the network (the system of equations) makes sense (i.e. does each gene have its own set of $w$'s, or are the $w$'s governing the entire network). Adding more genes (which will increase $m$) will then either leave the matrix equation consistent or cause the system of equations to become inconsistent.

Clearly the idea of 'least squares' already indicates the system of equations is inconsistent, and hence one is to maximize the projection (parallel component) of the vector $Ax$ onto the dataset $y$ (a vector of odds), thereby minimizing the perpendicular component (i.e. the deviation $\sum_j A_{ij} x_j - y_i$). However, one could imagine creating a statistical test, like a t-test, where the average error component is the max of measurement and biological noise. Then one can add a new gene to the matrix and calculate the error, and one then has a statistic which can be used to calculate the p value.
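The linearization of equation (3.82) followed by a least-squares solve for the $w$'s can be sketched as follows. This is not the dissertation's code: the occupancies and the 'true' weights are invented for illustration, the helper names are ours, and the normal equations are solved with a small hand-written Gaussian elimination only to keep the example free of external libraries.

```python
import math

def logit(eps):
    """Left-hand side of equation (3.82): log(eps / (1 - eps))."""
    return math.log(eps / (1.0 - eps))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: occupancies are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_weights(occupancies, expression):
    """Least-squares w = (w0, w1, ...) from rows [1, <n_1>, <n_2>, ...]."""
    rows = [[1.0] + list(occ) for occ in occupancies]
    y = [logit(e) for e in expression]
    k = len(rows[0])
    # Normal equations: (A^T A) w = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(ata, aty)

# Hypothetical average occupancies of three factors at five positions.
occ = [(0.2, 1.0, 0.5), (1.0, 0.3, 0.1), (2.0, 1.5, 0.0),
       (0.5, 2.0, 1.0), (3.0, 0.2, 0.7)]
true_w = [-1.0, 0.8, 0.5, -0.6]  # assumed weights, for illustration only
eps = [sigmoid(true_w[0] + sum(w * n for w, n in zip(true_w[1:], o)))
       for o in occ]

w_hat = fit_weights(occ, eps)
print(w_hat)
```

All expression values in this toy data set lie strictly between 0 and 1, so the logit is finite; a saturated value of exactly 0 or 1 would make `logit` fail, which is the singularity discussed above.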
This is consistent with the idea of the network being completely governed by the occupancies. This is basically telling us that there is only one typical target gene (that's why there is only one $w$ vector), and that this typical gene always expresses when the activator occupancy goes from $j$ to $j + 1$ (this was an observation pointed out by W. Wedemeyer in a discussion on the pseudoinverse).

So if the factors' occupancies are linearly dependent [footnote 2], then least squares will not work, as the design matrix $A$ multiplied on its left by $A^T$ (i.e. $A^T A$) doesn't have an inverse if the factors' occupancies are linearly dependent (i.e. the design matrix is underdetermined). However, the method of SVD allows for a best-fit solution by using the pseudoinverse (this is also known as the Moore-Penrose inverse). Regardless, it is algebra: one inverts the matrix, which can be done by almost all linear algebra packages [footnote 20]. If we have $m > 3$ and assume our system is underdetermined, then we are left with a nontrivial solution in the null space (as we obviously don't want all the $w$'s to be zero). This means that there are infinitely many solutions for the $w$'s (well, there would be only one normalized solution in the null space).

Footnote 2: The $\langle n \rangle$'s, I believe, would be linearly dependent because the idea that the occupancy goes from $j$ to $j + 1$ for activation (when crossing the border) means there is an equation relating the occupancies, a constraint, namely $\sum_i w_i \langle n_i \rangle = j + 1$, where $j + 1$ is playing the role of $K$.
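The remark that a linearly dependent homogeneous system leaves exactly one normalized solution in the null space can be illustrated in the smallest interesting case: three unknowns and a design matrix of rank two. The sketch below is an assumption-laden toy, not the SVD routine referred to in the text. It uses a cross product, which only works for three unknowns, and the occupancy rows are made up so that the third row is the sum of the first two.

```python
import math

def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def normalized_null_vector(row_a, row_b):
    """Unit vector w with row_a . w = row_b . w = 0 (rows must be independent)."""
    w = cross(row_a, row_b)
    norm = math.sqrt(sum(c * c for c in w))
    if norm < 1e-12:
        raise ValueError("rows are parallel; the null space is not one-dimensional")
    return tuple(c / norm for c in w)

# Hypothetical occupancy rows; the third row equals the sum of the first two,
# so the three homogeneous equations are linearly dependent (rank two).
rows = [(1.0, 2.0, 0.5), (1.0, 0.5, 2.0), (2.0, 2.5, 2.5)]

w = normalized_null_vector(rows[0], rows[1])
residuals = [sum(r_i * w_i for r_i, w_i in zip(r, w)) for r in rows]

print(w)
print(residuals)
```

Any multiple of `w` also solves the system, which is the sense in which there are infinitely many solutions but a single direction once the vector is normalized (up to sign). For more unknowns one would use the SVD from a linear algebra package instead.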
3.7.4 Uniform Shifts and Scale Invariance

Scaling laws are relations between an attribute of a system and the size of the system (the size is the scale, like length, mass, or volume). Power laws (e.g. polynomial equations relating the attribute to the size) are important in biology because they indicate scale invariance. That is, the system is able to preserve the form of the relation as the scale varies. In this sense scale invariance is a bit of a misnomer; it would have been better to say the equation is invariant, but such is tradition. Work by the Erives lab has shown that patterns are scale invariant (so the pattern width gets wider as the length of the embryo gets longer according to some law, possibly a power law). This was achieved by looking at different sized eggs (different because they were different species of fly under different selection pressures). Interestingly, they noted that the grammar of the pattern's cis regulatory module is NOT invariant. That is, the module changes (evolves by genetic mutations that eventually fix for adaptive sized embryos) to compensate for scaling. Scaling laws are by nature very coarse grained; scaling laws do not tell one anything about the detailed mechanism within an object. However, Erives' work (or his student Crocker's work) in some sense linked the coarse with the details.

Footnote 20: In the case that $A^T A$ has zero determinant, we would still like to 'solve' for the $w$'s, or combinations of them, that are informative in a subspace of the parameter space (the $w$ space). This is the point of SVD (singular value decomposition), which has been implemented by me in GSL for Chip-chip data as a component of a multiobjective function for occupancies and expression data (available at: https://github.com/jacobedOccupancy, see the function compOccMat inside ExprPredictor.cpp).
3.7.5 Molecular basis for Robustness of Morphogen gradients

Alon and Eldar's work shows that a robust profile would require a degradation rate that is a nonlinear implicit function of space, $F(M) = kM^2$; a power law for the degradation is worked out in Eldar's paper that shows robustness [41]. Furthermore, Gregor showed that among different Dipteran species that have different sized embryos, the domain on the Bicoid protein that is targeted for degradation has been tuned by evolutionary mutations, thereby allowing for a scaling mechanism between the different sized embryos that leads to a robust profile for Bicoid. Gregor also analyzed the possibility that the target genes' cis regulatory modules may be tuning the Bicoid binding sites to achieve a mechanism to scale the patterns in larger embryos [53]. Crocker analyzed cis regulatory modules of Dorsal and Twist targets and found that the spacing between Twist and Dorsal binding sites had varied. By testing the effect of this variation in an in vivo study, they showed that changing the spacing (the number of base pairs between Dorsal and Twist sites) changed the target gene's anatomical positional border (the neuroectoderm-dorsal ectoderm border of the embryo), hence explaining the scaling mechanism for different sized embryos by the architecture of the enhancer. [footnote 1]

Footnote 1: Crocker's assumptions assume the same number of nuclei at nc14, $2^{14} \approx 16000$. However, there have been indications that the number of nuclei scales with the size of the embryo, indicating that nuclear cycles may not be completely synchronized, or that there are more maternal nuclei in the syncytium (the chamber that holds the egg) than the initial mother's egg (haploid nucleus) [91]. Seeing that there is variation in the number of nuclei, this indicates that for target genes making decisions at nc9, by nc14 the width $w$ of the expression pattern will not minimally be $w \approx 5$ nuclei (obtained from $2^{14}$ and $2^{9}$), as indicated by Bialek in his response to Sven Bergmann's paper on pre-steady-state decoding [21][19].
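The robustness gained from the nonlinear degradation $F(M) = kM^2$ mentioned at the start of this subsection can be checked numerically. The sketch below relies on the standard steady-state solution of $D\,M'' = kM^2$ with $M(0) = M_0$ and $M(\infty) = 0$, namely $M(x) = A/(x + x_0)^2$ with $A = 6D/k$ and $x_0 = \sqrt{A/M_0}$; this closed form is not derived in this chapter, and the values of $D$, $k$, and $M_0$ are arbitrary illustrative units.

```python
import math

def quadratic_profile(x, m0, d=1.0, k=1.0):
    """Steady state of D M'' = k M^2: M(x) = A / (x + x0)^2."""
    a = 6.0 * d / k
    x0 = math.sqrt(a / m0)
    return a / (x + x0) ** 2

def shift_quadratic(m0, m0_new, d=1.0, k=1.0):
    """Border shift x - x' at a fixed threshold when M0 changes to M0'."""
    a = 6.0 * d / k
    return math.sqrt(a / m0_new) - math.sqrt(a / m0)

def shift_linear(m0, m0_new, lam):
    """Border shift for linear degradation, equation (3.74)."""
    return lam * math.log(m0 / m0_new)

# Halving the production rate at two different absolute production levels.
low_source = shift_quadratic(100.0, 50.0)
high_source = shift_quadratic(10000.0, 5000.0)

# Linear degradation gives the same shift at any production level.
linear_low = shift_linear(100.0, 50.0, 1.0)
linear_high = shift_linear(10000.0, 5000.0, 1.0)

print(low_source, high_source)
print(linear_low, linear_high)
```

In this toy setting the shift under quadratic degradation shrinks as the source gets stronger, while the shift under linear degradation stays at $\lambda \log 2$ regardless of the production level.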
3.8 Conclusion

The MPA annotation model developed here was used in conjunction with the OR gate for Dorsal binding sites (DC and DU PWMs). The OR gate was designed to be sensitive (it is more likely to call a positive hit than its component PWMs), while disregarding the pitfalls of specificity (a false positive picked up by just one component detector will be declared a positive by the OR gate, even if the majority of component detectors declare the site a negative). The high sensitivity of the OR gate, used in conjunction with the MPA annotator (which is designed for high specificity), yields a highly effective bioinformatic tool for discovery of binding sites. Annotation of CRMs at half max occupancy of the Basal Transcription Apparatus (the transition from the boolean 0 to 1 expression) reduces the likelihood of picking up binding sites that are nonfunctional and hence classified as a 'hit' in the annotation process, since high occupancy of the BTA could result from many binding sites that are not critical to get the initial switch.

The occupancy model generalizes over binding sites, in that the input to the model is an enhancer sequence, not a list of binding sites. So a priori one does not know what binding sites will be selected; rather, the physical control parameters (concentration, binding energy, protein-protein interactions) determine what is a binding site. In this sense the MPA annotation model could be thought of as generating a specific number of binding sites as part of its output. This is a contribution to the field, which I point out Segal said specifically their model could not address at this level of detail; they could only make statements about the numerical values of cooperativity and quenching. With this being said, one could judiciously choose parameter values to reconstruct the exact engineered enhancers (which in a sense we did for our last experiment), but of course we wish to find parameters which are consistent with not just one construct but the entire GRN.
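The sensitivity argument for the OR gate made at the start of this section can be illustrated with a toy example. Everything in it is hypothetical: the two four-position matrices standing in for the DC and DU PWMs, the scores, and the thresholds are invented for illustration and are not the matrices used in this work.

```python
def uniform_column(preferred, high=2.0, low=-1.0):
    """One PWM column (log-odds style scores) favoring a single base."""
    return {base: (high if base == preferred else low) for base in "ACGT"}

def pwm_score(site, pwm):
    """Sum of per-position scores of a site under a PWM of the same length."""
    return sum(column[base] for base, column in zip(site, pwm))

def or_gate(site, detectors):
    """Positive if ANY component detector scores at or above its threshold."""
    return any(pwm_score(site, pwm) >= threshold for pwm, threshold in detectors)

# Hypothetical stand-ins for the two Dorsal PWMs (not the real DC and DU).
pwm_dc = [uniform_column(b) for b in "GGAA"]
pwm_du = [uniform_column(b) for b in "GGTT"]
detectors = [(pwm_dc, 6.0), (pwm_du, 6.0)]

calls = {site: or_gate(site, detectors) for site in ("GGAA", "GGTT", "CCCC")}
print(calls)
```

A site is called as soon as one component matrix accepts it, so the OR gate can only gain hits relative to either component alone; the accompanying loss of specificity is what the MPA annotator is then relied upon to offset.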
Much of the analysis is about experimental design, as experiments are not only necessary to test theory, but they are also necessary to tune parameters for well known theories. Binding of morphogens to CRMs is not a theory under dispute, nor is the theoretical calculation of the occupation of the CRM by the morphogens; that has been known for many years. What is not known is the strength and importance of the parameters controlling the occupation of morphogens on the promoter, and how much each of these in turn affects BTA binding. This requires model design, to infer the values of these parameters. However, this data is not always available, hence I have shown an alternative technique, where I have tried to use low quality data from the literature to tune the parameters. As was seen from Experiment 2, it is possible to mathematically arrive at certain conclusions reached by Szymanski (that cooperativity is necessary in the NEEs). However, Szymanski's 6xdl construct is not consistent with reported parameter ranges of thermodynamic models collected by Buchler et al. [26]. Hence, it was necessary to adjust the thermodynamic model parameter ranges. Possibly one could argue that one cannot adjust the parameter ranges found by Buchler, since they cover a broad spectrum of proteins. Indeed, the values of the cooperativity we found in Experiment 2 are so high that it is unlikely Dorsal and Twist would ever unbind once bound; albeit it is possible that Dorsal-Twist dimers do occur in the nucleoplasm. Furthermore, Szymanski and others have reported 'uniform' expression in certain constructs (CRMs) that are not fully 'on' (the staining of the target gene was weak). Such reports are not consistent with the thermodynamic model for Dorsal occupancy, since Dorsal occupancy is never 'uniform'; it is a gradient, like Lewis Wolpert's French Flag model of morphogens. Hence, reconstruction of the basic assumptions I have made here may be necessary to accurately represent Dorsal targeted genes in DV.
C code was written and tested for a multiobjective function with a term for expression and a term for Chip-chip data, where Chip-chip data from Zeitlinger and MacArthur was used for Dorsal, Twist, and Snail [128][85]. However, this work is still in progress as to how the different datasets (expression and occupancy data, i.e. Chip-chip) should be balanced and how the parameter error should be determined, hence no results were presented for this work. However, the necessary components of the model can be found at https://github.com/jacobedOccupancy (see the objective functions in ExprPredictor.cpp), which also contains the code for the results from chapter 3, such as the MPA model for annotating binding sites as a function of expression data.

Lastly, the idea that one should search for the ideal positive definite Hessian at the best estimates of the parameters is incorrect for the DV network. The elusive positive definite Hessian is only for independent parameters. The biochemical parameters that control the logic (gene switches) governing early development have coevolved to work together to jointly regulate the DV axis, hence it is wrong to state that the parameters must be independent. Of course, biochemical experiments could be designed to independently measure these parameters; however, such experiments would be unlikely to be representative of the ecological environment of the internals of the fly in early development.

BIBLIOGRAPHY

[1] C. Adami. The use of information theory in evolutionary biology. Ann. N.Y. Acad. Sci., 1256:49–65, May 2012.
[2] Christoph Adami. What is complexity? BioEssays, 24(12):1085–1094, 2002.
[3] A. Afek, J. L. Schipper, J. Horton, R. Gordan, and D. B. Lukatsky. Protein-DNA binding in the absence of specific base-pair recognition. Proc. Natl. Acad. Sci. U.S.A., 111(48):17140–17145, Dec 2014.
[4] Bruce Alberts. Molecular biology of the cell. Garland Science, New York, 4th edition, 2002.
[5] Uri Alon. An introduction to systems biology: Design principles of biological circuits. Chapman & Hall/CRC, 2006.
[6] N. J. Armstrong, H. Steinbeisser, C. Prothmann, R. DeLotto, and R. A. Rupp. Conserved Spätzle/Toll signaling in dorsoventral patterning of Xenopus embryos. Mech. Dev., 71(1-2):99–105, Feb 1998.
[7] M. I. Arnone and E. H. Davidson. The hardwiring of development: organization and function of genomic regulatory systems. Development, 124(10):1851–1864, May 1997.
[8] P. Atkins and J. de Paula. Physical Chemistry. W. H. Freeman and Company, 2002.
[9] A. Ay, W. D. Fakhouri, C. Chiu, and D. N. Arnosti. Image processing and analysis for quantifying gene expression from early Drosophila embryos. Tissue Eng Part A, 14:1517–1526, Sep 2008.
[10] T. L. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol, 2:28–36, 1994.
[11] T. L. Bailey and C. Elkan. The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol, 3:21–29, 1995.
[12] A. S. Bais, N. Kaminski, and P. V. Benos. Finding subtypes of transcription factor motif pairs with distinct regulatory roles. Nucleic Acids Res., 39(11):e76, Jun 2011.
[13] Y. Barash, G. Elidan, T. Kaplan, and N. Friedman. Modeling dependencies in protein-DNA binding sites. In Proc of the 7th Ann Int Conf in Comp Mol Bio (RECOMB), pages 28–37, 2003.
[14] James Beck and Kenneth Arnold. Parameter estimation in engineering and science. Wiley and Sons, 1 edition, 1977.
[15] O. G. Berg. Base-pair specificity of protein-DNA recognition: a statistical-mechanical model. Biomed. Biochim. Acta, 49(8-9):963–975, 1990.
[16] O. G. Berg and P. H. von Hippel. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol., 193:723–750, Feb 1987.
[17] O. G. Berg, R. B. Winter, and P. H. von Hippel. Diffusion-driven mechanisms of protein translocation on nucleic acids. 1. Models and theory. Biochemistry, 20(24):6929–6948, Nov 1981.
[18] O. G. Berg. The evolutionary selection of DNA base pairs in gene-regulatory binding sites. Proceedings of the National Academy of Sciences, 89(16):7501–7505, 1992.
[19] S. Bergmann, O. Sandler, H. Sberro, S. Shnider, E. Schejter, B. Z. Shilo, and N. Barkai. Pre-steady-state decoding of the Bicoid morphogen gradient. PLoS Biol., 5:e46, Feb 2007.
[20] W. Bialek. Biophysics: Searching for Principles. Princeton University Press, 2012.
[21] William Bialek, Thomas Gregor, David W. Tank, and Eric F. Wieschaus. Response: Can we fit all of the data? Cell, 132:17–18, 2007.
[22] Bishop. Pattern Recognition and Machine Learning, volume 1. Computer Science Press, Rockville, MD, 1985.
[23] R. J. Britten and E. H. Davidson. Repetitive and non-repetitive DNA sequences and a speculation on the origins of evolutionary novelty. Q Rev Biol, 46(2):111–138, Jun 1971.
[24] C. T. Brown and C. G. Callan. Evolutionary comparisons suggest many novel cAMP response protein binding sites in Escherichia coli. Proc. Natl. Acad. Sci. U.S.A., 101(8):2404–2409, Feb 2004.
[25] C Titus Brown. Computational approaches to finding and analyzing cis-regulatory elements. Methods in cell biology, 87:337–365, 2008.
[26] N. E. Buchler, U. Gerland, and T. Hwa. On schemes of combinatorial transcription logic. Proc. Natl. Acad. Sci. U.S.A., 100:5136–5141, Apr 2003.
[27] M. L. Bulyk, A. M. McGuire, N. Masuda, and G. M. Church. A motif co-occurrence approach for genome-wide prediction of transcription-factor-binding sites in Escherichia coli. Genome Res., 14(2):201–208, Feb 2004.
[28] Matthew S Busse, Christopher P Arnold, Par Towb, James Katrivesis, and Steven A Wasserman. A κB sequence code for pathway-specific innate immune responses. The EMBO Journal, 26(16):3826–3835, 2007.
[29] J. M. Carothers, S. C. Oestreich, J. H. Davis, and J. W. Szostak. Informational complexity and functional activity of RNA structures. J. American Chem. Society, 126:5130–5137, 2004.
[30] S. B. Carroll. Endless forms: the evolution of gene regulation and morphological diversity. Cell, 101(6):577–580, Jun 2000.
[31] J. Crocker, N. Potter, and A. Erives. Dynamic evolution of precise regulatory encodings creates the clustered site signature of enhancers. Nat Commun, 1:99, 2010.
[32] J. Crocker, Y. Tamori, and A. Erives. Evolution acts on enhancer organization to fine-tune gradient threshold readouts. PLoS Biol., 6:e263, Nov 2008.
[33] E. H. Davidson and M. S. Levine. Properties of developmental gene regulatory networks. Proc. Natl. Acad. Sci. U.S.A., 105(51):20063–20066, Dec 2008.
[34] Eric H. Davidson. Genomic Regulatory Systems: Development and Evolution. Academic Press, San Diego, CA, 2001.
[35] Eric H. Davidson.
TheRegulatoryGenome:GeneRegulatoryNetworksinDrvelopment andEvolution .AcademicPress,SanDiego,CA,2006. [36] ThomasA.DownandTimJ.P.Hubbard.Nestedmica:sensitiveinferenceofover- representedmotifsinnucleicacidsequence. NucleicAcidsResearch ,33(5):1445{1453, 2005. 232 [37] JacquelineDresch,XiaozhouLiu,DavidArnosti,andAhmetAy.Thermodynamic modelingoftranscription:sensitivityanalysistiatesbiologicalmechanismfrom mathematicalmodel-induced BMCSystemsBiology ,4(1):142,2010. [38] R.Durbin,S.Eddy,A.Krogh,andG.Mitchison.Bilogicalsequenceanalysis. Cam- bridgePress ,1998. [39] R.C.Edgar.MUSCLE:amultiplesequencealignmentmethodwithreducedtimeand spacecomplexity. BMCBioinformatics ,5:113,Aug2004. [40] R.C.Edgar.MUSCLE:multiplesequencealignmentwithhighaccuracyandhigh throughput. NucleicAcidsRes. ,32(5):1792{1797,2004. [41] A.Eldar,R.Dorfman,D.Weiss,H.Ashe,B.Z.Shilo,andN.Barkai.Robustnessofthe BMPmorphogengradientinDrosophilaembryonicpatterning. Nature ,419(6904):304{ 308,Sep2002. [42] A.ErivesandM.Levine.Coordinateenhancerssharecommonorganizationalfeatures intheDrosophilagenome. Proc.Natl.Acad.Sci.U.S.A. ,101:3851{3856,Mar2004. [43] W.D.Fakhouri,A.Ay,R.Sayal,J.Dresch,E.Dayringer,andD.N.Arnosti.De- cipheringatranscriptionalregulatorycode:modelingshort-rangerepressioninthe Drosophilaembryo. Mol.Syst.Biol. ,6:341,2010. [44] D.S.Fields,Y.He,A.Y.Al-Uzri,andG.D.Stormo.Quantitativespyofthe Mntrepressor. J.Mol.Biol. ,271(2):178{194,Aug1997. [45] D.S.FieldsandG.D.Stormo.QuantitativeDNAsequencingtodeterminetherel- ativeprotein-DNAbindingconstantstomultipleDNAsequences. Anal.Biochem. , 219(2):230{239,Jun1994. [46] StevenM.Gallo,DaveT.Gerrard,DavidMiner,MichaelSimich,BenjaminDesSoye, CaseyM.Bergman,andMarcS.Halfon.v3.0:towardacomprehensivedatabase oftranscriptionalregulatoryelementsindrosophila. NucleicAcidsResearch ,2010. [47] N.Galtier,M.Gouy,andC.Gautier.SEAVIEWandPHYLOWIN:twographictools forsequencealignmentandmolecularphylogeny. Comput.Appl.Biosci. ,12(6):543{ 548,Dec1996. 
[48] W.J.GehringandK.Ikeo.Pax6:masteringeyemorphogenesisandeyeevolution. TrendsGenet. ,15(9):371{377,Sep1999. 233 [49] WalterJ.Gehring. MasterControlGenesinDevelopmentandEvolution:TheHome- oboxStory .YaleUniversityPress,NewHaven,CT,1998. [50] B.GeorgiandA.Schliep.Context-spindependencemixturemodelingforposi- tionalweightmatrices. Bioinformatics ,22(14):e166{173,Jul2006. [51] ScottF.Gilbert. Developmentalbiology .SinauerAssociates,Sunderland,Mass,5th edition,1997. [52] ScottF.GilbertandDavidEpel. Ecologicaldevelopmentalbiology:integratingepige- netics,medicine,andevolution .SinauerAssociates,Sunderland,Mass,2009. [53] T.Gregor,W.Bialek,R.R.deRuytervanSteveninck,D.W.Tank,andE.F.Wi- eschaus.andscalingduringearlyembryonicpatternformation. Proc.Natl. Acad.Sci.U.S.A. ,102:18403{18407,Dec2005. [54] T.Gregor,A.P.McGregor,andE.F.Wieschaus.ShapeandfunctionoftheBi- coidmorphogengradientindipteranspecieswithntsizedembryos. Dev.Biol. , 316:350{358,Apr2008. [55] T.Gregor,D.W.Tank,E.F.Wieschaus,andW.Bialek.Probingthelimitsto positionalinformation. Cell ,130:153{164,Jul2007. [56] DebrajGuhaThakurtaandGaryD.Stormo.Identifyingtargetsitesforcooperatively bindingfactors. Bioinformatics ,17(7):608{621,2001. [57] NaomiHabib,TommyKaplan,HanahMargalit,andNirFriedman.Anovelbayesian dnamotifcomparisonmethodforclusteringandretrieval. PLoSComputBiol , 4(2):e1000010,2008. [58] S.Hannenhalli.Eukaryotictranscriptionfactorbindingsites{modelingandintegrative searchmethods. Bioinformatics ,24(11):1325{1331,Jun2008. [59] SridharHannenhalliandLi-SanWang.Enhancedpositionweightmatricesusingmix- turemodels. Bioinformatics ,21(suppl1):i204{i212,2005. [60] X.He,C.C.Chen,F.Hong,F.Fang,S.Sinha,H.H.Ng,andS.Zhong.Abiophysical modelforanalysisoftranscriptionfactorinteractionandbindingsitearrangement fromgenome-widebindingdata. PLoSONE ,4:e8155,2009. 
234 [61] X.He,M.A.Samee,C.Blatti,andS.Sinha.Thermodynamics-basedmodelsof transcriptionalregulationbyenhancers:therolesofsynergisticactivation,cooperative bindingandshort-rangerepression. PLoSComput.Biol. ,6,Sep2010. [62] TerrellLHill. CooperativityTheoryinBiochemistry:Steady-stateandEquilibrium Systems .NewYork:Springer-Verlag,1985. [63] T.L.Hill. AnIntroductiontoStatisticalThermodynamics .DoverBooksonPhysics andChemistry.NewYork:DoverPublications,1986. [64] AHobson. ConceptsinStatisticalMechanics .GordonandBreach,NewYork,1971. [65] Joung-WooHong,DavidA.Hendrix,DmitriPapatsenko,andMichaelS.Levine.How thedorsalgradientworks:Insightsfrompostgenometechnologies. Proceedingsofthe NationalAcademyofSciences ,105(51):20072{20076,2008. [66] B.Houchmandzadeh,E.Wieschaus,andS.Leibler.Establishmentofdevelopmental precisionandproportionsintheearlyDrosophilaembryo. Nature ,415:798{802,Feb 2002. [67] Y.T.Ip,R.E.Park,D.Kosman,E.Bier,andM.Levine.Thedorsalgradient morphogenregulatesstripesofrhomboidexpressioninthepresumptiveneuroectoderm oftheDrosophilaembryo. GenesDev. ,6:1728{1739,Sep1992. [68] F.JACOBandJ.MONOD.Geneticregulatorymechanismsinthesynthesisofpro- teins. J.Mol.Biol. ,3:318{356,Jun1961. [69] A.K.Jain,M.N.Murty,andP.J.Flynn.Dataclustering:Areview. ACMComput. Surv. ,31(3):264{323,1999. [70] J.Jiang,D.Kosman,Y.T.Ip,andM.Levine.Thedorsalmorphogengradientregulates themesodermdeterminanttwistinearlyDrosophilaembryos. GenesDev. ,5:1881{ 1891,Oct1991. [71] J.JiangandM.Levine.BindingandcooperativeinteractionswithbHLH activatorsdelimitthresholdresponsestothedorsalgradientmorphogen. Cell ,72:741{ 752,Mar1993. [72] JinJiangandMichaelLevine.Bindingiesandcooperativeinteractionswith bhlhactivatorsdelimitthresholdresponsestothedorsalgradientmorphogen. Cell , 72(5):741{752,1993. 235 [73] M.C.KingandA.C.Wilson.Evolutionattwolevelsinhumansandchimpanzees. Science ,188(4184):107{116,Apr1975. 
[74] D.Kosman,Y.T.Ip,M.Levine,andK.Arora.Establishmentofthemesoderm- neuroectodermboundaryintheDrosophilaembryo. Science ,254:118{122,Oct1991. [75] D.LandauandI.Lifshitz. Mechanics ,volume1.ButterworthHeinemann,1976. [76] M.Lassig.Frombiophysicstoevolutionarygenetics:statisticalaspectsofgeneregu- lation. BMCBioinformatics ,8Suppl6:S7,2007. [77] C.E.Lawrence,S.F.Altschul,M.S.Boguski,J.S.Liu,A.F.Neuwald,J.C.Wootton, etal.Detectingsubtlesequencesignals:agibbssamplingstrategyformultiplealign- ment. Science ,262:208{208,1993. [78] P.Lawrence. TheMakingofaFly .Wiley-Blackwell,1edition,1992. [79] T.H.Leung,A.n,andD.Baltimore.OnenucleotideinakappaBsitecan determinecofactorspyforNF-kappaBdimers. Cell ,118(4):453{464,Aug2004. [80] M.LevineandE.H.Davidson.Generegulatorynetworksfordevelopment. Proc.Natl. Acad.Sci.U.S.A. ,102(14):4936{4942,Apr2005. [81] RaphaelD.LevineandMyronTribus,editors. TheMaximumEntropyFormalism . MITPress,Cambridge,MA,1978. [82] E.B.Lewis.AgenecomplexcontrollingsegmentationinDrosophila. Nature , 276(5688):565{570,Dec1978. [83] L.Li.GADEM:ageneticalgorithmguidedformationofspaceddyadscoupledwith anEMalgorithmformotifdiscovery. J.Comput.Biol. ,16(2):317{329,Feb2009. [84] X.Liu,D.L.Brutlag,J.S.Liu,etal.Bioprospector:DiscoveringconservedDNA motifsinupstreamregulatoryregionsofco-expressedgenes.In PacSympBiocomput , volume6,pages127{138,2001. [85] S.MacArthur,X.Y.Li,J.Li,J.B.Brown,H.C.Chu,L.Zeng,B.P.Grondona, A.Hechmer,L.Simirenko,S.V.Keranen,D.W.Knowles,M.Stapleton,P.Bickel, M.D.Biggin,andM.B.Eisen.Developmentalrolesof21Drosophilatranscription factorsaredeterminedbyquantitativediinbindingtoanoverlappingsetof thousandsofgenomicregions. GenomeBiol. ,10(7):R80,2009. 236 [86] S.MahonyandP.V.Benos.STAMP:awebtoolforexploringDNA-bindingmotif similarities. NucleicAcidsRes. ,35(WebServerissue):W253{258,Jul2007. [87] M.Markstein,R.Zinzen,P.Markstein,K.P.Yee,A.Erives,A.Stathopoulos,and M.Levine.AregulatorycodeforneurogenicgeneexpressionintheDrosophilaembryo. 
Development ,131:2387{2394,May2004. [88] JohnMaynardSmith. Evolutionarygenetics .OxfordUniversityPress,NewYork; Oxford[Oxfordshire],1989. [89] W.McGinnis,M.S.Levine,E.Hafen,A.Kuroiwa,andW.J.Gehring.Aconserved DNAsequenceinhomoeoticgenesoftheDrosophilaAntennapediaandbithoraxcom- plexes. Nature ,308(5958):428{433,1984. [90] S.H.Meijsing,M.A.Pufall,A.Y.So,D.L.Bates,L.Chen,andK.R.Yamamoto. DNAbindingsitesequencedirectsglucocorticoidreceptorstructureandactivity. Sci- ence ,324(5925):407{410,Apr2009. [91] C.M.Miles,S.E.Lott,C.L.LudwigLuengoHendriks,M.Z.Manu.,C.L.Williams, andM.Kreitman.selectiononeggsizeperturbsearlypatternformationin drosophilamelanogaster. Evolution ,65:33642,2011. [92] A.M.MosesandM.B.Eisen.Phylogeneticmotifdetectionbyexpectation- maximizationonevolutionarymixtures. Pac.Symp.Biocomput ,324:324{335,2004. [93] B.MoussianandS.Roth.DorsoventralaxisformationintheDrosophilaembryo{ shapingandtransducingamorphogengradient. Curr.Biol. ,15(21):R887{899,Nov 2005. [94] N.Mrinal,A.Tomar,andJ.Nagaraju.RoleofsequenceencodedDNAgeometryin generegulationbyDorsal. NucleicAcidsRes. ,39(22):9574{9591,Dec2011. [95] V.MustonenandM.Lassig.Evolutionarypopulationgeneticsofpromoters:predicting bindingsitesandfunctionalphylogenies. Proc.Natl.Acad.Sci.U.S.A. ,102(44):15936{ 15941,Nov2005. [96] IlyaNemenman,FarielShafee,andWilliamBialek.Entropyandinference,revisited. In AdvancesinNeuralInformationProcessingSystems14 ,pages471{478.MITPress, 2002. [97] C.Nusslein-VolhardandE.Wieschaus.Mutationsctingsegmentnumberandpo- larityinDrosophila. Nature ,287(5785):795{801,Oct1980. 237 [98] D.J.Obbard,J.Maclennan,K.W.Kim,A.Rambaut,P.M.O'Grady,andF.M.Jig- gins.EstimatingdivergencedatesandsubstitutionratesintheDrosophilaphylogeny. Mol.Biol.Evol. ,29(11):3459{3473,Nov2012. [99] D.PanandA.J.Courey.Thesamedorsalbindingsitemediatesbothactivationand repressioninacontext-dependentmanner. EMBOJ. ,11(5):1837{1842,May1992. 
[100] D.Papatsenko,Y.Goltsev,andM.Levine.Organizationofdevelopmentalenhancers intheDrosophilaembryo. NucleicAcidsRes. ,37(17):5665{5677,Sep2009. [101] D.PapatsenkoandM.Levine.Quantitativeanalysisofbindingmotifsmediating diversespatialreadoutsoftheDorsalgradientintheDrosophilaembryo. Proc.Natl. Acad.Sci.U.S.A. ,102(14):4966{4971,Apr2005. [102] U.J.Pape,H.Klein,andM.Vingron.Statisticaldetectionofcooperativetranscription factorswithsimilarityadjustment. Bioinformatics ,25(16):2103{2109,Aug2009. [103] M.W.Perry,J.D.Cande,A.N.Boettiger,andM.Levine.Evolutionofinsect dorsoventralpatterningmechanisms. ColdSpringHarb.Symp.Quant.Biol. ,74:275{ 279,2009. [104] WilliamH.Press,BrianP.Flannery,SaulA.Teukolsky,andWilliamT.Vetterling. NumericalRecipesinC:TheArtofComputing .CambridgeUniversityPress, 2edition,1992. [105] R.Quiring,U.Walldorf,U.Kloter,andW.J.Gehring.Homologyoftheeyeless geneofDrosophilatotheSmalleyegeneinmiceandAniridiainhumans. Science , 265(5173):785{789,Aug1994. [106] J.Reinitz,S.Hou,andD.Sharp.TranscriptionalControlinDrosophila. Complexus , 1:54{64,2003. [107] T.D.SchneiderandR.M.Stephens.Sequencelogos:Anewwaytodisplayconsensus sequences. NucleicAcidsRes. ,18:6097{6100,1990. [108] T.D.Schneider,G.D.Stormo,L.Gold,andA.Ehrenfeucht.Informationcontentof bindingsitesonnucleotidesequences. J.Mol.Biol. ,188:415{431,Apr1986. [109] E.Segal,T.Raveh-Sadka,M.Schroeder,U.Unnerstall,andU.Gaul.Predicting expressionpatternsfromregulatorysequenceinDrosophilasegmentation. Nature , 451:535{540,Jan2008. 238 [110] E.SegalandJ.Widom.FromDNAsequencetotranscriptionalbehaviour:aquanti- tativeapproach. Nat.Rev.Genet. ,10:443{456,Jul2009. [111] I.SelaandD.B.Lukatsky.DNAsequencecorrelationsshapenonsptranscription factor-DNAbindingnity. Biophys.J. ,101(1):160{166,Jul2011. [112] E.Sharon,S.Lubliner,andE.Segal.Afeature-basedapproachtomodelingprotein- DNAinteractions. PLoSComput.Biol. ,4(8):e1000154,2008. 
[113] R.K.Shultzaberger,D.Y.Chiang,A.M.Moses,andM.B.Eisen.Determiningphys- icalconstraintsintranscriptionalinitiationcomplexesusingDNAsequenceanalysis. PLoSONE ,2(11):e1199,2007. [114] R.Siddharthan.Dinucleotideweightmatricesforpredictingtranscriptionfactorbind- ingsites:generalizingthepositionweightmatrix. PLoSONE ,5(3):e9722,2010. [115] S.Sinha,M.Blanchette,andM.Tompa.PhyME:aprobabilisticalgorithmfor motifsinsetsoforthologoussequences. BMCBioinformatics ,5:170,Oct2004. [116] A.StathopoulosandM.Levine.Genomicregulatorynetworksandanimaldevelop- ment. Dev.Cell ,9(4):449{462,Oct2005. [117] AlexanderJStewart,SridharHannenhalli,andJoshuaBPlotkin.Whytranscription factorbindingsitesaretennucleotideslong. Genetics ,192(3):973{85,Nov2012. [118] G.D.StormoandD.S.Fields.Spy,freeenergyandinformationcontentin protein-DNAinteractions. TrendsBiochem.Sci. ,23:109{113,Mar1998. [119] G.D.StormoandY.Zhao.Determiningthespyofprotein-DNAinteractions. Nat.Rev.Genet. ,11(11):751{760,Nov2010. [120] AlfredSturtevant. Thebehaviorofthechromosomesasstudiedthroughlinkage. PhD thesis,ColumbiaUniversity,1914. [121] P.SzymanskiandM.Levine.Multiplemodesofdorsal-bHLHtranscriptionalsynergy intheDrosophilaembryo. EMBOJ. ,14:2229{2238,May1995. [122] U.TechnauandC.B.Scholz.Originandevolutionofendodermandmesoderm. Int. J.Dev.Biol. ,47(7-8):531{539,2003. 239 [123] N.VanKampen. StochasticProcessesinPhysicsandChemistry .NorthHolland,3 edition,2007. [124] P.H.vonHippel.Onthemolecularbasesofthespyofinteractionoftranscrip- tionalproteinswithgenomedna.InR.F.Goldberger,editor, BiologicalRegulationand Development ,volume1,pages279{347.PlenumPublishing,NewYork,1979. [125] O.Wartlick,A.Kicheva,andM.Gonzalez-Gaitan.Morphogengradientformation. ColdSpringHarbPerspectBiol ,1(3):a001255,Sep2009. [126] L.Wolpert.Theevolutionaryoriginofdevelopment:cycles,patterning,privilegeand continuity. Dev.Suppl. ,pages79{84,1994. [127] G.A.Wray,M.W.Hahn,E.Abouheif,J.P.M.Pizer,M.V.Rockman,L.A. 
Romano,andG.A.Wray.Theevolutionoftranscriptionalregulationineukaryotes. Mol.Biol.Evol. ,20(9):1377{1419,Sep2003. [128] J.Zeitlinger,R.P.Zinzen,A.Stark,M.Kellis,H.Zhang,R.A.Young,andM.Levine. Whole-genomeChIP-chipanalysisofDorsal,Twist,andSnailsuggestsintegrationof diversepatterningprocessesintheDrosophilaembryo. GenesDev. ,21:385{390,Feb 2007. [129] R.P.Zinzen,K.Senger,M.Levine,andD.Papatsenko.Computationalmodelsfor neurogenicgeneexpressionintheDrosophilaembryo. Curr.Biol. ,16:1358{1365,Jul 2006. 240