MULTIMODAL LEARNING AND ITS APPLICATION TO MODELING ALZHEIMER'S DISEASE

By

Qi Wang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2020

ABSTRACT

MULTIMODAL LEARNING AND ITS APPLICATION TO MODELING ALZHEIMER'S DISEASE

By

Qi Wang

Multimodal learning has gained increasing attention in recent years as heterogeneous data modalities are being collected from diverse domains or extracted from various feature extractors and used for learning. Multimodal learning aims to integrate predictive information from different modalities to enhance the performance of the learned models. For example, when modeling Alzheimer's disease, multiple brain imaging modalities are collected from the patients, and effective fusion of these modalities has been shown to benefit predictive performance.

Multimodal learning is associated with many challenges. One outstanding challenge is the severe overfitting problem caused by the high feature dimension that results from concatenating the modalities. For example, the feature dimension of the diffusion-weighted MRI modalities, which have been used in Alzheimer's disease diagnosis, is usually much larger than the sample size available for training. To solve this problem, in the first work, I propose a sparse learning method that selects the important features and modalities to alleviate the overfitting problem. Another challenge in multimodal learning is the heterogeneity among the modalities and their potential interactions. My second work explores non-linear interactions among the modalities. The proposed model learns a modality invariant component, which serves as a compact feature representation of the modalities and has high predictive power. In addition to utilizing the modality invariant information of multiple modalities, modalities may provide supplementary information, and correlating them in the learning can be more informative. Thus, in the third work, I propose a multimodal information bottleneck to fuse supplementary information from different modalities while eliminating the irrelevant information from them. One challenge of utilizing the supplementary information of multiple modalities is that most work can only be applied to data with complete modalities. The missing-modality problem widely exists in multimodal learning tasks, and for these tasks only a small portion of the data can be used to train the model. Thus, to fully use all the precious data, in the fourth work, I propose a knowledge distillation based algorithm that utilizes all the data, including those with missing modalities, while fusing the supplementary information.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor, Dr. Jiayu Zhou, for his insight, guidance, and support during my Ph.D. study. I benefited from his advice, particularly when exploring new ideas and writing papers. This dissertation would not have been possible without his assistance. The experiences with him are my lifelong assets. Also, I am very grateful to my co-advisor, Dr. Pang-Ning Tan, for his advice, knowledge, and many insightful discussions and suggestions. I would like to thank the committee members, Dr. Jiliang Tang and Dr. Chenxi Li, for their valuable interactions and feedback.

I would like to extend my sincere thanks to Dr. Patricia Sonora, Dr. Kendra Spence Cheruvelil, and the members of the Data-Intensive Landscape Limnology Lab. I have had the pleasure to work with them for three years. I gratefully acknowledge the assistance of Dr. Liang Zhan, Dr. Paul Thompson, and Dr. Hiroko Dodge. I must also thank the members of the ILLIDAN Lab, as they inspired me a lot through discussions and seminars.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1  Introduction
1.1 Discriminative Fusion of Multiple Brain Networks
1.2 Multimodal Disease Modeling via Collective Deep Matrix Factorization
1.3 Multimodal Information Bottleneck
1.4 Multimodal Learning with Incomplete Modalities

Chapter 2  Related Works
2.1 Co-training Approach
2.2 Linear approaches
2.2.1 Canonical correlation analysis
2.2.2 Collective matrix factorization
2.3 Nonlinear approaches
2.3.1 Kernel canonical correlation analysis
2.3.2 Deep canonical correlation analysis
2.3.3 Multimodal deep Boltzmann machine

Chapter 3  Discriminative Fusion of Multiple Brain Networks
3.1 Methodology
3.1.1 Preliminary
3.1.2 Overview
3.1.3 Discriminative Fusion
3.1.4 Optimization
3.2 Experiments
3.2.1 Dataset
3.2.2 Brain Networks
3.2.3 Experiment Settings
3.2.4 Results
3.3 Discussion
3.4 Summary

Chapter 4  Multimodal Disease Modeling via Collective Deep Matrix Factorization
4.1 Methodology
4.1.1 Matrix factorization
4.1.2 Collective matrix factorization for multimodal analysis
4.1.3 Capturing complex interactions via collective deep matrix factorization
4.2 Experiments
4.2.1 Dataset and features
4.2.2 Data preprocessing
4.2.3 Prediction performance
4.2.4 Effects of knowledge fusion parameters
4.2.5 Imaging-genetics association
4.3 Summary

Chapter 5  Multimodal Information Bottleneck
5.1 Methodology
5.1.1 Information Bottleneck Method
5.1.2 Deep multimodal information bottleneck
5.1.3 Optimization
5.1.4 Generalize to multiple modalities
5.2 Experiments
5.2.1 Synthetic datasets
5.2.1.1 Setting 1
5.2.1.2 Setting 2
5.2.1.3 Setting 3
5.2.2 Case study: reservoir detection
5.2.3 Case study: Alzheimer's disease
5.2.3.1 Data Preprocessing
5.2.4 Other benchmark datasets
5.3 Summary

Chapter 6  Multimodal Learning with Incomplete Modalities
6.1 Methodology
6.1.1 Knowledge Distillation
6.1.2 Multimodal learning with missing modalities
6.2 Experiment
6.2.1 Synthetic data experiments
6.2.2 Experiments on Alzheimer's diagnosis
6.2.3 Experiments on other real-world datasets
6.3 Summary

Chapter 7  Conclusion

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Quantitative comparison of using different brain networks to predict the early MCI.
Table 3.2: Combination coefficients of the 9 networks.
Table 4.1: Demographic information of subjects.
Table 4.2: Prediction performance of different models using ADNI2's T1 MRI and dMRI in terms of AUC.
Table 4.3: Prediction performance of different models using ADNI2 and ADNI1's T1 MRI and dMRI in terms of AUC.
Table 4.4: Prediction performance of fusing genetic knowledge and imaging knowledge using ADNI1 and ADNI2 in terms of AUC.
Table 4.5: Prediction performance of DNN using ADNI1 and ADNI2 in terms of AUC.
Table 4.6: Results of applying sparse logistic regression on each single modality in terms of AUC.
Table 5.1: Average errors of all methods under different noise levels.
Table 5.2: Average errors of all methods under different sample sizes.
Table 5.3: Average errors of all methods under different extra-feature dimensions.
Table 5.4: Demographic information for the two cohorts (ADNI2 and NACC).
Table 5.5: Parameters for dMRI and T1 MRI data for ADNI2 and NACC.
Table 5.6: Top 10 dMRI feature variables.
Table 5.7: Average errors for three benchmark datasets.
Table 6.1: Classification accuracy of Setting 3.
Table 6.2: The accuracy for all the models trained on the union of the ADNI and NACC datasets.
Table 6.3: The accuracy of all the models trained on the XRMB dataset.
Table 6.4: The accuracy of all the models trained on the MNIST dataset.
Table 6.5: The classification accuracy for the models trained on Alzheimer's disease data from [132].

LIST OF FIGURES

Figure 1.1: Different tractography methods detect different sets of fibers.
Figure 2.1: Example of canonical correlation analysis (CCA) involving two data modalities.
Figure 2.2: Illustration of kernel canonical correlation analysis (kernel CCA).
Figure 2.3: Illustration of the deep canonical correlation analysis structure [9].
Figure 2.4: Illustration of deep canonically correlated autoencoders [119].
Figure 2.5: Example of a restricted Boltzmann machine and a deep Boltzmann machine.
Figure 2.6: The illustration of a multimodal Deep Boltzmann machine [103].
Figure 2.7: Different multimodal Boltzmann machines [103].
Figure 3.1: Overview of our network fusion framework.
Figure 4.1: Illustration of the proposed collective deep matrix factorization (CDMF) framework.
Figure 4.2: Manhattan plot for SNPs with adjusted p value greater than 2.
Figure 4.3: Brain maps of the significance level at each ROI for the most significantly associated SNP within that ROI.
Figure 4.4: Testing performance with varying α parameters.
Figure 5.1: Illustration of extra-features in the synthetic data experiments.
Figure 5.2: Average error for the reservoir detection task.
Figure 5.3: T-Distributed Stochastic Neighbor Embedding for the joint representations for the reservoir detection models.
Figure 5.4: The pipeline of computing the stability score.
Figure 5.5: The distribution of the stability scores for the dMRI features.
Figure 5.6: Average error for classifying MCI with AD.
Figure 5.7: T-Distributed Stochastic Neighbor Embedding for the joint representations for classifying MCI with NC.
Figure 6.1: Pattern of the data.
Figure 6.2: Overview of the proposed teacher-student model.
Figure 6.3: Total teachers needed to be trained with three modalities.
Figure 6.4: Total teachers needed to be trained with pruning (low-level teacher).
Figure 6.5: Total teachers needed to be trained with pruning (high-level teacher).
Figure 6.6: Classification accuracy for Setting 1.
Figure 6.7: Classification accuracy for Setting 2.
Figure 6.8: Accuracy with different α and β.
Figure 6.9: The top 10 important T1 MRI features for Te_1 trained on the union of the NACC and ADNI datasets.
Figure 6.10: The top 10 important T1 MRI features for M-DNN trained on the union of the NACC and ADNI datasets.
Figure 6.11: The top 10 important T1 MRI features for TS trained on the union of the NACC and ADNI datasets.
Figure 6.12: The top 10 important dMRI features for models trained on the union of the NACC and ADNI datasets.

Chapter 1

Introduction

The wide availability of data from multiple data modalities has brought increasing attention to multimodal learning. In general, modalities are defined as sets of heterogeneous features that are collected from diverse domains or extracted from various feature extractors [126]. These sets of features could provide both shared and supplementary information about the subjects. Since different modalities are extracted from different domains or feature extractors, the representations of the modalities may be very distinct from each other. Multimodal learning aims to integrate predictive information from different modalities to enhance the performance of the learned models. For example, it is common that images are accompanied by text descriptions or categorical tags. Leveraging information from tags and text descriptions usually provides a more complete description of the images than the images alone because of their inherent relatedness. Moreover, since the data collection or feature extraction processes for the modalities are performed separately, the noise induced by the collection or extraction process is specific to each data modality. Multimodal learning can reduce the effect of noise by learning the common structure across multiple modalities. Therefore, learning from multiple modalities can potentially help to improve performance. As another example, in the medical area, multiple data such as different kinds of MRI data, gene data, and blood biochemical indices are available. When doctors diagnose a complex disease such as Alzheimer's disease, they usually ask the patients to do multiple tests such as brain imaging tests, laboratory tests, mental status tests and neuropsychological tests. Different test results provide different types of evaluation of the patients. Combining all the results provides comprehensive and accurate information about the patient and can help the doctors rule out other conditions that cause similar symptoms. Therefore, when using machine learning algorithms to model Alzheimer's disease, multimodal learning has demonstrated better performance than single modal learning [117, 135, 91, 143].
Multimodal learning is associated with many challenges. (1) Since multiple modalities are used when building the model, the total feature dimension is much larger than the feature dimension for single modal learning models. If we directly concatenate the features from different modalities and build single modal learning models, the models may suffer from a severe overfitting problem, especially when the feature dimension of each modality is considerably large. Thus, the first challenge is how to build robust models for modalities that have high feature dimensions to prevent the overfitting problem. (2) The second challenge for multimodal learning is the heterogeneity among the modalities and their potential interactions. Since modalities are collected from diverse domains or feature extractors, they do not interact linearly. Using linear models for these modalities limits the power of multimodal learning. (3) One motivation to use multimodal learning is that different modalities provide supplementary information about the subjects. Modalities may contain noise or information that is irrelevant to the downstream tasks. When learning the common structure of modalities, the irrelevant information and noise are automatically eliminated. However, when combining the supplementary information from modalities and learning a joint representation, the irrelevant information and noise are not removed. So, the third challenge is how to eliminate the noise and irrelevant information from the modalities and only keep useful information when learning the joint representation of all the modalities. (4) Most existing works that address the supplementary information across the modalities can only be applied to data with complete modalities, which wastes a lot of precious data. The last challenge I would like to address is how to build multimodal models with data having missing modalities.

In this dissertation, I propose four approaches to address the aforementioned challenges respectively. In my first work, I propose a sparse model to select the important features as well as modalities to alleviate the overfitting problem for multimodal learning when the modalities' feature dimension or the number of modalities is too large. In my second work, I propose a framework to fuse multiple data modalities for predictive modeling using deep matrix factorization, which explores the non-linear interactions among the modalities and exploits such interactions to transfer knowledge and enable high-performance prediction. Specifically, the proposed collective deep matrix factorization decomposes all modalities simultaneously to capture non-linear structures of the modalities in a supervised manner, and learns a modality-specific component for each modality and a modality invariant component across all modalities. The modality invariant component serves as a compact feature representation of patients that has high predictive power. To solve the third challenge, I propose a supervised multimodal learning framework based on the information bottleneck principle to filter out irrelevant and noisy information from multiple modalities and learn an accurate joint representation. Specifically, the proposed method maximizes the mutual information between the labels and the learned joint representation while minimizing the mutual information between the learned latent representation of each modality and the original data representation. For the fourth challenge, I propose a framework based on knowledge distillation that utilizes the supplementary information from all modalities and avoids discarding data with missing modalities. Specifically, I train models on each modality independently using all the available data. Then the trained models are used as teachers to teach the student model, which is trained with the samples having complete modalities.
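To make the teacher-student idea described above concrete, the following is a minimal sketch (not the exact implementation used later in this dissertation) of a distillation-style training loss in PyTorch. The student network, the list of pre-trained per-modality teachers, and the weights alpha and temperature are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, true_labels,
                      alpha=0.5, temperature=2.0):
    """Combine the hard-label loss with soft-label losses from several teachers.

    student_logits:      (batch, n_classes) output of the student model
    teacher_logits_list: list of (batch, n_classes) outputs, one per modality teacher
    true_labels:         (batch,) integer class labels
    """
    # Standard cross-entropy with the true one-hot labels.
    hard_loss = F.cross_entropy(student_logits, true_labels)

    # KL divergence between softened teacher and student distributions.
    soft_loss = 0.0
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    for teacher_logits in teacher_logits_list:
        p_teacher = F.softmax(teacher_logits / temperature, dim=1)
        soft_loss = soft_loss + F.kl_div(log_p_student, p_teacher,
                                         reduction="batchmean") * (temperature ** 2)
    soft_loss = soft_loss / len(teacher_logits_list)

    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```

In the fourth work, each teacher is trained on a single modality with all samples available for that modality; how the hard and soft terms are balanced is a design choice explored in Chapter 6.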
The four approaches are all validated on Alzheimer's disease (AD) modeling problems. AD is a severe neurodegenerative disease causing 60% to 70% of dementia cases [124]. It starts with vanishing memory and progresses to an advanced stage of cognitive function loss, which ultimately leads to death. Currently, AD ranks as the sixth leading cause of death in the U.S., and the number of patients affected is expected to reach 13.4 million by the year 2050, which places a substantial burden on the healthcare system [8]. The transitional stage between the expected cognitive decline of normal aging and AD, mild cognitive impairment (MCI), has been considered as suitable for possible early therapeutic intervention for AD [85]. Effective diagnosis of MCI or dementia can greatly benefit public health and reduce the healthcare burden. Alzheimer's disease can only be definitively diagnosed after death by examining the brain tissue in an autopsy [19]. Occasionally, doctors determine whether a person is a possible patient or experiencing normal aging using biomarkers of the living body. One commonly used biomarker is brain medical imaging, as it shows the microscopic structure of the brain and plays the key role of a "window on the brain" [51]. However, analyzing brain medical imaging results requires considerable time and effort. In areas that lack doctors experienced in Alzheimer's disease, it is difficult to diagnose the disease even with brain medical imaging. In the past years, various machine learning models have been developed to model diseases [110, 88], and some of them even have better diagnostic accuracy than experienced doctors [79]. Therefore, developing effective machine learning models could greatly reduce the cost needed to diagnose Alzheimer's disease. In this dissertation, I show how to apply the proposed algorithms to Alzheimer's disease diagnosis. In the following four subsections, I give a brief introduction to each work.

1.1 Discriminative Fusion of Multiple Brain Networks

Recently, diffusion-weighted magnetic resonance imaging techniques have been developed that map patterns of connections in the brain, and many researchers have begun to model the brain as a network of interconnected brain regions, or connectome [102]. The properties of these networks can then be studied mathematically with network theory. Mathematically, a brain network at the macro-scale is typically expressed by a connectivity matrix, in which each element represents some property of the connection between each pair of brain regions [101]. These network-derived features provide clues about how characteristic network disruptions occur and how they may progress in Alzheimer's disease. Diffusion MRI is a variant of standard anatomical MRI that is sensitive to microscopic properties of the brain's white matter that are not detectable with standard anatomical MRI. The general process of reconstructing a structural brain network includes two main steps [134]. The first step extracts the dominant diffusion direction(s) at each voxel based on a diffusion MRI signal model. Some popular models include the diffusion tensor, the orientation distribution function (ODF), or a probabilistic mixture of tensors [70], among others. The next step is whole brain tractography based on these voxel-level diffusion direction(s). Currently, there are two main classes of tractography methods: deterministic and probabilistic approaches. Based on the whole brain tractography result, brain networks can be computed by combining the pattern of fiber tracts with some anatomical partitioning scheme, and measuring some property of the connections between each pair of brain regions, such as their density or integrity.

Theoretically, different algorithmic methods to map structural connections should ultimately provide a consistent anatomical description of the brain. Even so, this may not be true in reality. Different tractography methods recover different sets of fibers (Fig. 1.1), and the fiber bundles that best differentiate patients from controls may be extracted by some algorithms but not others [133].
Different tract tracking methods vary in their ability to perform robustly on datasets of different quality, and there is no general principle to decide which tractography method or network model is most sensitive to disease effects in clinical research studies [134]. I therefore combine all these networks and build predictive models with them. The challenge in combining these networks is that the dimensions of the modalities are too high compared with the sample size. To address this challenge, I create a sparse learning framework to optimally fuse the networks.

Figure 1.1: Different tractography methods detect different sets of fibers. Here I show the fibers generated by two tractography algorithms (T-FACT [78] and PICo [84]), passing through the same brain slice.

The benefit of sparse learning is that it selects important components. Therefore, the feature dimension and the number of modalities are reduced and the overfitting problem is alleviated.

1.2 Multimodal Disease Modeling via Collective Deep Matrix Factorization

In addition to the brain networks mentioned in the first work, there are multiple biological measures such as T1-weighted MRI and genotype available. T1-weighted MRI (T1 MRI) can capture structural information of gray matter in the brain. Combining T1 MRI and brain networks from diffusion-weighted MRI together provides a more comprehensive illustration of the brain than utilizing them separately. Moreover, prior studies strongly favor a joint analysis of multiple modalities including imaging and genetics, since it has been shown that genetic variants have played a significant role in the onset of the disease [93, 15, 130, 129]. Combining the three modalities provides complementary information on brain structure and function, and thus improves the capability of differentiating between normal aging subjects and MCI patients [116, 89].

However, few prior studies combined the two types of MRI imaging in detecting MCI, let alone a joint model that incorporates imaging modalities and genetic information. One reason is the limited sample size. It is usually very costly to construct large cohort studies that involve imaging and genetic data. For example, more than $60 million has been devoted to the first stage of the Alzheimer's Disease Neuroimaging Initiative (ADNI) to collect 819 subjects' brain imaging data, genetic data and other biological samples. Different biological data modalities have different feature dimensions. For example, imaging data contains hundreds to thousands of features, while the feature dimension of genetic data is around 1 million. Due to the high dimensionality of brain images and genetic markers, directly combining multiple modalities will increase the feature dimension drastically, which not only makes it difficult to extract valid predictive signals, but also induces overfitting problems. Also, some subjects do not have genetic data or dMRI data because they did not participate in some parts of the study. Directly combining multiple modalities means those subjects must be discarded, which reduces the sample size. Moreover, different modalities describe different aspects of the brain: T1 MRI captures areas composed of neurons while dMRI estimates the connections between those areas; the genotype impacts the disease in a way that is not directly related to brain structure and function. As such, all these data modalities interact in a complicated manner, suggesting that directly combining feature spaces may not lead to effective integration.
Analysis of high dimensional data can greatly benefit from its intrinsic low-rank structures, since exploiting the low-rank structure of the high dimensional data allows us to reduce the feature dimensionality while maintaining most of the information in the data. When the sample size is limited, it could reduce the overfitting problem. Recent studies have identified such low-rank properties in imaging and genetic data [142, 73, 121]. Matrix factorization techniques [68, 61] are powerful tools to recover the low-rank structure of a matrix and have been widely used in many data mining and machine learning applications. Because of its capability to denoise data, such an approach is especially attractive in processing noisy data such as genetics and imaging. Matrix factorization also provides an integrated approach to fuse multiple data modalities by mapping different modalities to a shared subspace. This method has been widely applied in network analysis [25] and clustering [6]. Matrix factorization techniques have a strong linear assumption that objects interact with each other linearly in a low dimensional subspace. However, the brain as well as genotype-phenotype interactions have inherently complex structures [36, 62, 41]. For example, it has been identified that human brain functional networks have a hierarchical modular organization structure [76]. Thus, the linear assumption in traditional matrix factorization may fail to capture the complexity, nonlinearities and hierarchical interactions among different modalities in AD research.

In this work, I propose a deep matrix factorization framework to fuse information from multiple modalities and transfer predictive knowledge in order to differentiate MCI patients from cognitively normal subjects. Specifically, I build a nonlinear hierarchical deep matrix factorization framework which decomposes each modality into a modality invariant component and a modality-specific component guided by supervision information. The proposed collective deep matrix factorization delivers higher predictive performance than its linear counterpart, since its deep nonlinear structure can discover the hidden complexity and nonlinearity of the original data, and map original data which are not linearly separable into a representation that makes subjects easier to separate. Moreover, the modality-specific term can be used to uncover complicated interactions among different modalities that cannot be discovered by traditional matrix factorization methods. I perform extensive empirical studies on the Alzheimer's Disease Neuroimaging Initiative dataset (http://adni.loni.usc.edu) to identify MCI patients by fusing three modalities including T1 MRI, dMRI, and genotype. I also compare the proposed method with state-of-the-art deep multimodal algorithms including the deep neural network, DCCA [9] and DCCAE [119]. The results demonstrate the effectiveness of the proposed approach.

1.3 Multimodal Information Bottleneck

In addition to learning the common structure of the modalities, another motivation to use multimodal learning is that multiple modalities provide supplementary descriptions of the same subjects, and correlating them in the learning can be more informative. When utilizing all the information from different modalities, the performance is expected to improve compared to learning with the information from only one modality. During the past years, multiple methods have been proposed to combine the supplementary information. For example, kernel-based algorithms use multiple kernel methods to combine the kernels of different modalities, ranging from linear combination methods such as linear convex combination [113] to nonlinear combination methods [114]. With the development of deep learning, multiple neural networks [92, 64] are used to extract abstract feature representations for each modality. Then, the extracted representations from all modalities are fused in different ways, such as concatenation, to combine the supplementary information.
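As a concrete illustration of the concatenation-style fusion just described, here is a minimal PyTorch sketch. The two encoders, their sizes, and the classifier head are hypothetical placeholders rather than any specific model from the literature.

```python
import torch
import torch.nn as nn

class ConcatFusionNet(nn.Module):
    """Encode each modality separately, then fuse by concatenation."""

    def __init__(self, dim_x, dim_y, hidden=64, n_classes=2):
        super().__init__()
        self.enc_x = nn.Sequential(nn.Linear(dim_x, hidden), nn.ReLU())
        self.enc_y = nn.Sequential(nn.Linear(dim_y, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x, y):
        hx = self.enc_x(x)                   # abstract representation of modality 1
        hy = self.enc_y(y)                   # abstract representation of modality 2
        joint = torch.cat([hx, hy], dim=1)   # fuse by concatenation
        return self.head(joint)

# Example usage with random data from two modalities.
model = ConcatFusionNet(dim_x=100, dim_y=30)
logits = model(torch.randn(8, 100), torch.randn(8, 30))
```

Note that this simple fusion keeps whatever noise each encoder passes through, which is exactly the limitation motivating the information bottleneck approach discussed next.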
When learning the common structure of the modalities, the noise or irrelevant information can be eliminated automatically. However, when learning the joint representation of the modalities and fusing all the supplementary information together, the noise or irrelevant information is very likely to be included in the joint representation, which increases the model complexity and causes overfitting. Therefore, how to effectively fuse the useful supplementary information from all the modalities is very challenging.

More recently, a novel supervised learning method [127] based on the information bottleneck principle [108] has gained increasing attention due to its ability to find a concise representation of the features, taking into account the trade-off between performance and complexity from an information theory perspective. However, the main drawback of this method is that it employs a linear projection to bridge the representations of the modalities. As the relationships between different modalities are often complicated, a simple linear projection would constrain the type of information that can be fused from the different modalities.

Deep learning has been successfully used to learn abstract representations from raw input data [67]. DCCA and DCCAE are two examples of successful methods using deep neural networks to extract features from each modality and learn their joint representation. These methods have demonstrated better performance compared to traditional linear CCA. However, adopting deep neural networks in an information bottleneck based multimodal learning formulation remains a challenging problem. For the information bottleneck approach, the information between different representations is measured in terms of their mutual information. Computing mutual information requires estimation of the posterior distribution, which is computationally intractable when the model is complicated.

In this work, I propose a deep multimodal information bottleneck method to fuse supplementary knowledge from multiple modalities to improve predictive performance. The proposed framework consists of two parts. The first part extracts a concise and relevant latent representation from each modality, while the second part fuses the latent representations to learn the joint representation of all modalities. The proposed deep multimodal learning framework adopts the information bottleneck principle to supervise the learning by finding the best representation that balances model complexity and accuracy. The framework also employs a variational inference approach [59, 7] to overcome the challenge of computing mutual information efficiently. The variational inference approach provides an approximate solution to the original optimization problem by maximizing the variational lower bound of the target objective function. Since the variational bound can be easily optimized by standard gradient descent methods, the problem becomes computationally tractable. I apply this algorithm to the classification of MCI versus NC for Alzheimer's disease, and the results show performance improvement compared with the baselines.
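To give a flavor of the variational information bottleneck idea discussed above, the sketch below shows a single-modality version: a stochastic encoder produces a Gaussian latent code, and the loss trades prediction accuracy against a KL term that bounds the information the code retains about the input. The architecture sizes and the weight beta are illustrative assumptions; this is not the multimodal formulation proposed in Chapter 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBEncoder(nn.Module):
    """Stochastic encoder q(z|x) parameterized as a diagonal Gaussian."""

    def __init__(self, dim_in, dim_z=16, n_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim_in, 64), nn.ReLU())
        self.mu = nn.Linear(64, dim_z)
        self.logvar = nn.Linear(64, dim_z)
        self.classifier = nn.Linear(dim_z, n_classes)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.classifier(z), mu, logvar

def vib_loss(logits, labels, mu, logvar, beta=1e-3):
    # Prediction term: keeps z informative about the label.
    ce = F.cross_entropy(logits, labels)
    # Compression term: KL(q(z|x) || N(0, I)) discourages z from copying the input.
    kl = 0.5 * torch.mean(torch.sum(mu ** 2 + logvar.exp() - logvar - 1.0, dim=1))
    return ce + beta * kl
```

The multimodal framework in Chapter 5 extends this trade-off so that each modality has its own compressed latent representation before the joint representation is formed.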
1.4 Multimodal Learning with Incomplete Modalities

One common drawback of the methods fusing the supplementary information is that they usually can only be trained on the samples that have complete modalities, and in practice there are very few samples of such kind, especially when considering a large number of modalities. For example, when studying Alzheimer's disease, only a portion of the subjects have diffusion-weighted MRI, while only another portion of the subjects has genetic data available. The existing methods may have to discard a large portion of data collected through huge efforts. One solution to deal with data with incomplete modalities is to impute the missing modalities. After imputation, standard multimodal learning methods can be used to combine the supplementary information. The incompleteness of modalities leads to block-wise missing features. Therefore, classical matrix completion methods such as matrix factorization [131] cannot be used to impute the missing modalities. Some advanced imputation methods such as the cascaded residual autoencoder [111] and adversarial training [21, 118, 105, 83], which have a similar structure to GANs, have been proposed to deal with the missing modality problem. These solutions, however, may introduce unwanted imputation noise when imputing the missing modalities [37]. Especially when the number of samples having complete modalities is small, the modalities imputed by such methods may have a negative effect on the performance of the downstream tasks [37].

In this work, I propose a new multimodal learning framework to integrate the supplementary information of multiple modalities. This method utilizes all the samples, including the ones with incomplete modalities. The proposed method is based on knowledge distillation [46]. I train models for each modality separately with all the data available. Then, I treat the trained models as teachers to teach a student model. The student model is a multimodal learning model which fuses the supplementary information from multiple modalities. It is trained with the soft labels produced by the teacher models and the true one-hot labels. Since the teacher models are trained with each modality separately, the sample size is much larger than that used to train the student model. With enough data, the well-trained teachers act as experts on each modality. The student then learns from these experts and combines the knowledge from all the experts. Compared with existing methods, our method does not discard the samples with incomplete modalities nor impute them. Instead, I use these samples to train the teacher models to make sure the teacher models are experts. To verify the effectiveness of our method, I conduct experiments on synthetic data and real-world data such as an Alzheimer's disease dataset and some benchmark datasets.

Chapter 2

Related Works

2.1 Co-training Approach

Co-training is a semi-supervised approach. It was proposed to deal with a classification setting in which limited labeled samples and a large number of unlabeled samples are available for two distinct modalities [16]. There are two assumptions in co-training:

- Each data modality provides complementary information about the samples;
- The two data modalities are conditionally independent given the class labels.

The co-training method separates the samples into two sets, a labeled set L and an unlabeled set U. It creates a smaller pool U' ⊂ U. Then two weak classifiers, h_1 and h_2, are trained for the two modalities using the limited labeled data from L. Next, h_1 and h_2 are used to label the p positive samples and n negative samples from U' that they feel most confident about. Those newly labeled samples are then added to L.
U' is replenished by drawing 2p + 2n samples from U randomly. Now, L is enlarged by the 2n + 2p samples. Those steps are repeated for a number of iterations, and finally, two decent classifiers will be obtained. The intuition of this method is to use the samples added by h_1 to train h_2 and vice versa [65]. After repeating enough times, h_1 and h_2 will agree with each other. Hence, the unlabeled data here is used to prune the hypothesis space for h_1 and h_2 such that the search spaces are compatible.

We note that one assumption of co-training is that each modality is conditionally independent given the class labels. The intuition behind this assumption is that, when two modalities are conditionally independent, each time the added samples are as informative as random samples and the learning should thus progress [81]. However, in some cases this assumption cannot be satisfied. Then, the added samples may not be informative and the learning process may fail. The co-EM algorithm is an algorithm based on the original co-training algorithm that loosens this assumption [81]. It can be shown that co-EM works even when the conditional independence assumption is violated. Denote the two modalities as s_1 and s_2. This algorithm first trains a classifier using the labeled data from L on s_1; denote this classifier as h_1. Then h_1 is used to probabilistically label all the unlabeled data in U. Next, another classifier h_2 is trained using the labeled data and the probabilistically labeled data on s_2, and h_2 is used to re-label the data in U. Repeat those steps for some iterations and the final classifiers are obtained. Compared with co-training, co-EM uses one learner to assign labels to all the unlabeled samples, and the second classifier is learned using all the probabilistically labeled samples. Hence, it does not require the added samples to be as informative as random samples. However, since co-EM needs to assign probabilistic labels, the classifiers that can be used are limited.

The co-regularization approach [95] is developed based on co-training. This method uses regularization to reach an agreement across different modalities. Denote H_1 and H_2 as two Reproducing Kernel Hilbert Spaces of functions defined on the input space. Denote the labeled set as L and the unlabeled set as U. Co-regularization learns the following prediction function [96]:

f = (1/2)(f_1(x) + f_2(x)),    (2.1)

where f_1 ∈ H_1 and f_2 ∈ H_2. f_1 and f_2 are learned by a convex optimization problem:

(f_1, f_2) = argmin_{f_1 ∈ H_1, f_2 ∈ H_2}  γ_1 ||f_1||^2_{H_1} + γ_2 ||f_2||^2_{H_2} + μ Σ_{i ∈ U} [f_1(x_i) − f_2(x_i)]^2 + Σ_{i ∈ L} V(y_i, f(x_i)),    (2.2)

where the first two terms are used to control the model complexity and γ_1, γ_2 are regularization parameters; the third term is used to enforce that the learned hypotheses agree with each other on different modalities for the unlabeled data. The last term is the empirical loss on the labeled data using f = (1/2)(f_1 + f_2), and V denotes the loss function. Compared with co-training, this method is non-greedy, convex and easy to implement [95, 96]. There are also multiple variants of co-regularization dealing with different problems, such as Co-regularized Least Squares, which enforces the agreement in a least-squares sense [95, 18], Co-regularized Laplacian SVM [95], and co-regularized clustering [65], which uses co-regularization to regularize the clustering hypothesis to obtain consistent clusters across different modalities.
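Below is a minimal sketch of the basic co-training loop described at the start of Section 2.1, using two scikit-learn logistic regressions as the weak learners h_1 and h_2. The pool size, number of iterations, and confidence heuristic are illustrative choices, not the exact settings of [16].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(X1, X2, y, labeled_idx, p=1, n=1, pool_size=75, iters=30, seed=0):
    """X1, X2: the two modality views; y: labels, trusted only on labeled_idx."""
    rng = np.random.default_rng(seed)
    labeled = set(labeled_idx)
    unlabeled = [i for i in range(len(y)) if i not in labeled]
    pseudo_y = y.copy()                       # holds pseudo-labels for added samples
    pool = list(rng.choice(unlabeled, size=min(pool_size, len(unlabeled)), replace=False))
    unlabeled = [i for i in unlabeled if i not in pool]

    h1, h2 = LogisticRegression(max_iter=1000), LogisticRegression(max_iter=1000)
    for _ in range(iters):
        idx = sorted(labeled)
        h1.fit(X1[idx], pseudo_y[idx])
        h2.fit(X2[idx], pseudo_y[idx])
        # Each classifier labels the pool samples it is most confident about.
        for h, X in ((h1, X1), (h2, X2)):
            if not pool:
                break
            proba = h.predict_proba(X[pool])[:, 1]
            ranked = np.argsort(proba)
            picks = list(ranked[-p:]) + list(ranked[:n])   # p positive, n negative
            for j in picks:
                sample = pool[j]
                pseudo_y[sample] = int(proba[j] > 0.5)
                labeled.add(sample)
            pool = [s for k, s in enumerate(pool) if k not in set(picks)]
        # Replenish the pool by drawing from the remaining unlabeled data.
        take = min(2 * p + 2 * n, len(unlabeled))
        pool += unlabeled[:take]
        unlabeled = unlabeled[take:]
    return h1, h2
```

The key behavior to notice is that samples confidently labeled by one view enter the training set of the other view, which is how the two hypothesis spaces are gradually pruned toward agreement.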
2.2 Linear approaches

2.2.1 Canonical correlation analysis

Given two sets of variables, when the number of variables is large, it is not easy to use the covariance matrix of those two sets of variables to find the dependence between them. Sometimes even if the number of variables is small, in the current coordinate system it is still hard to see the relation between them directly. Canonical correlation analysis (CCA) is a widely used method to solve this challenging problem [49]. CCA finds the relation between two sets of variables by maximizing the correlation between a weighted linear combination of one set of variables and that of the other set of variables. It can be viewed as projecting the original two sets of variables to a low-dimensional subspace, such that the correlation between the two sets of variables is maximized in the new subspace. Hence, it is much easier to analyze variable dependence in the learned subspace than in the original spaces. We will review the classical CCA and its applications to multimodal learning in this section.

Consider two random vectors x ∈ R^d with mean μ_x and y ∈ R^p with mean μ_y. We assume that d > p. A random vector is defined to be a vector of random variables. The correlation between x and y measures the linear relation between the two random vectors. Consider the following linear combinations:

a = w_x^T x,    (2.3)
b = w_y^T y,    (2.4)

where a and b are two random variables and w_x ∈ R^d, w_y ∈ R^p. The correlation between a and b is given by:

Corr(a, b) = (w_x^T Σ_(x,y) w_y) / ((w_x^T Σ_(x,x) w_x)^(1/2) (w_y^T Σ_(y,y) w_y)^(1/2)).    (2.5)

CCA seeks vectors w_x and w_y such that Corr(a, b) is maximized, i.e.,

(w_x, w_y) = argmax_{w_x, w_y} Corr(a, b).    (2.6)

Finding the top k canonical variate pairs is equivalent to solving the following maximization problem:

(W_x, W_y) = argmax_{W_x, W_y} tr(W_x^T Σ_(x,y) W_y),
s.t. W_x^T Σ_(x,x) W_x = W_y^T Σ_(y,y) W_y = I,    (2.7)
     w_xi^T Σ_(x,y) w_yj = 0 for i ≠ j.

Projecting the original random variables to a new subspace: W_x = (w_x1, w_x2, ..., w_xk) and W_y = (w_y1, w_y2, ..., w_yk) serve as two mapping matrices which project x and y to a k-dimensional subspace and form two k-dimensional vectors, i.e., (a_1, a_2, ..., a_k) and (b_1, b_2, ..., b_k). CCA finds the projection leading to a possible joint structure for the two sets of random variables [52]. Since k is usually set to be much smaller than p, (a_1, a_2, ..., a_k) and (b_1, b_2, ..., b_k) are new representations of the original random vectors which have much lower dimensionality than the original vectors but yet keep most of the joint information of x and y. Hence, CCA can be used to reduce the dimensionality of the data. Figure 2.1 illustrates how CCA projects variables into a low-dimensional subspace. Solid lines represent the first canonical correlation vectors w_x1, w_y1, and dashed lines represent the second canonical correlation vectors w_x2, w_y2. In this example, the original dimensionalities of x and y are 4 and 3, respectively. The dimensionality of the new subspace is set to be 2. The projection matrix W_x = {w_x1, w_x2} projects x to {a_1, a_2} and the matrix W_y = {w_y1, w_y2} projects y to {b_1, b_2}, where (a_1, b_1) is the first canonical variate pair and (a_2, b_2) is the second canonical variate pair.

Applying canonical correlation analysis to multimodal learning: In multimodal learning, data are collected from multiple modalities for a set of samples. Consider a two-modal problem, i.e., X ∈ R^{n×d} and Y ∈ R^{n×p}, where n is the sample size, and d and p are the feature dimensions corresponding to the two modalities. We suppose that d ≥ p.

Figure 2.1: Example of canonical correlation analysis (CCA) involving two data modalities. The data points in the first data modality are in R^4 and those in the second data modality are in R^3.
Solid lines represent the first canonical correlation vectors w_x1, w_y1, and dashed lines represent the second canonical correlation vectors w_x2, w_y2. Data points from the original data modalities are projected to a new common subspace. In this subspace, the dimensionality of the data is reduced from R^4 or R^3 to R^2.

In real world applications, we usually do not know the distribution of the data, i.e., the mean and covariance are unknown. In order to use CCA, we need to use the sample mean and sample covariance to estimate the mean and covariance of the distribution. CCA has been widely used in multimodal learning. For example, when large unlabeled data are available for two modalities and only limited labeled data are available, if the feature dimensionality is very large, learning a good model only from the labeled data is not easy. A possible way to solve this challenging problem is to utilize the unlabeled data to construct a projection by CCA to reduce the feature dimensionality. [39] provides a theoretical guarantee that such dimensionality reduction can reduce the number of labeled samples needed. In the clustering area, high-dimensional data clustering is a difficult problem. By using CCA, the dimensionality of the data can be reduced, which makes the clustering problem easier. Moreover, CCA allows information to be transferred between the two modalities. Such transfer can lead to potential improvements in cluster quality. For example, video and audio data clustering quality can be improved if CCA is applied [27]. When dealing with action data, vector CCA can also be extended to tensor CCA, which can be used to pair-wisely analyze aligned and holistic action volumes [58].

2.2.2 Collective matrix factorization

Matrix factorization has been extensively studied in many domains such as compressive sensing, recommender systems and computer vision [24, 23, 20, 55, 69, 68]. When a matrix is used to describe the relationship between two entities, matrix factorization can be used to learn latent variables associated with the entities through their interactions (i.e., the values in the matrix). For example, in the collaborative filtering problem [61], user and item profiles are learned through identifying the subspace defined by the user-item interaction matrix.

Classical matrix factorization seeks to approximate a matrix with a low-rank matrix, by explicitly learning the matrix factors. Given a data matrix X ∈ R^{m×n}, matrix factorization learns two reduced matrix factors U ∈ R^{m×r} and V ∈ R^{n×r}, such that X ≈ UV^T, where r < min(m, n) is the upper bound of the rank of the approximated matrix UV^T (the rank of UV^T can be less than r if columns of U or V are linearly dependent). The factors U and V are typically learned via an objective function:

min_{U, V} d(X, UV^T),  s.t. U ∈ S_1, V ∈ S_2,    (2.8)

where d(X, Y) is a distance metric function measuring the difference between matrices X and Y, and S_1 and S_2 are two constraints imposed on the factor matrices U and V.

Typically the distance metric d(X, Y) is chosen to be the Frobenius norm of the difference between X and Y. However, when missing values are present in X, d(X, Y) can be defined as the squared ℓ2 distance between all the observed elements in X and their corresponding elements in Y. As such, we are able to learn matrix factors even with missing values, and the learned matrix factors can then be used to estimate the missing values under the low-rank assumption. This is the setup for matrix completion [22] and is commonly used in recommender systems [61]. The constraints S_1 and S_2 specify the feasible regions of the matrix factors to induce many desired properties, such as non-negativity S = {U | U_ij ≥ 0, ∀ i, j} in non-negative matrix factorization [69] and sparsity S = {U | ||U||_1 ≤ ζ} for interpretable factors [140]. In addition, complexity control can be implemented using Frobenius constraints S = {U | ||U||_F^2 ≤ ζ}, which are equivalent to Frobenius norm regularizations [60].
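As a small illustration of the factorization objective in Eq. (2.8) with missing entries, the following sketch fits X ≈ UV^T by gradient descent on the squared error over observed entries only, with Frobenius-norm regularization standing in for the complexity constraints. The rank, step size, and regularization weight are arbitrary illustrative values.

```python
import numpy as np

def matrix_factorization(X, mask, r=5, lr=0.01, reg=0.1, iters=2000, seed=0):
    """Factorize X (m x n) as U V^T using only entries where mask == 1."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = 0.1 * rng.standard_normal((m, r))
    V = 0.1 * rng.standard_normal((n, r))
    for _ in range(iters):
        R = mask * (U @ V.T - X)          # residual on observed entries only
        grad_U = R @ V + reg * U          # gradient of 0.5*||R||_F^2 + 0.5*reg*||U||_F^2
        grad_V = R.T @ U + reg * V
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V

# Example: complete a small rating-style matrix with missing entries.
X = np.array([[5.0, 3.0, 0.0], [4.0, 0.0, 2.0], [1.0, 1.0, 5.0]])
mask = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 1]])
U, V = matrix_factorization(X, mask, r=2)
X_hat = U @ V.T    # estimates of the missing entries under the low-rank assumption
```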
The approximation in (2.8) addresses important semantics in data analysis. When the data matrix X describes the relationship between two types of entities, the factors U and V can be thought of as latent features or latent representations of the entities. For example, in recommender systems we use X_ij to describe the relationship (e.g., rating) between a user i and an item j. The row vector u_i ∈ R^r gives an r-dimensional latent feature representation for the user i and similarly, the row vector v_j ∈ R^r is a latent representation of the item j. The two types of latent profiles interact with each other linearly in the latent subspace R^r, i.e., the observed relationship in X_ij can be explained as u_i (v_j)^T.

In collective matrix factorization, the latent representation/subspace perspective of matrix factorization allows us to link multiple data modalities when the entities involved in the modalities overlap. In multimodal modeling, assume there are t data modalities X_1 ∈ R^{n×d_1}, ..., X_t ∈ R^{n×d_t} describing different sets of features of the same set of n samples, where d_1, d_2, ..., d_t are the feature dimensions of each modality. For example, X_1 is the matrix of the image data, X_2 is the matrix of the text descriptions associated with those images, and X_3 is the tag matrix for the images. Then, we can apply the matrix factorization procedure to factorize all the datasets and connect the factorizations by enforcing a shared subject latent representation:

min_{U, {V_i}_{i=1}^t} Σ_{i=1}^t d(X_i, U V_i^T),  s.t. U ∈ S_0, V_i ∈ S_i, i = 1, 2, ..., t,

where the latent representation U is thus jointly learned from multiple modalities. The U matrix is called modality invariant, as the representation now captures intrinsic properties of the objects. When performing regression and classification on the objects, we can use the latent representation U instead of using features from the raw data matrices X_i, since the latent representation U contains the common structure and the shared information across all modalities.

Collective matrix factorization has been applied in various multimodal learning problems. For example, it can be used to transfer knowledge from text to image to build more robust text-to-image transfer learning models [128]. It is also used to fuse information between user-tag and user-item matrices [56] to develop more reliable recommender systems when the users' information is limited. In network similarity learning, it is used to combine topological structure, content, and user supervision to build models better than those built on a single modality [25].

2.3 Nonlinear approaches

2.3.1 Kernel canonical correlation analysis

Kernel methods enable nonlinear learning by implicitly mapping the original feature space to a high-dimensional feature space. When applying linear learning methods in the high-dimensional feature space, we are implicitly performing non-linear learning [48]. Kernel methods are widely used in machine learning and pattern analysis algorithms such as the kernel support vector machine [31] and kernel principal component analysis [77]. Similarly, the concept of kernels can be used to enable non-linearity in CCA, called kernel CCA [4].

Figure 2.2: Illustration of kernel canonical correlation analysis (kernel CCA). Kernel CCA projects data from two modalities to a Hilbert space and finds a subspace that maximizes the canonical correlation of the projected data.

Assume that we have two data modalities; kernel CCA projects data from the two modalities to a Hilbert space, i.e., X → φ_x(X) ∈ H_x and Y → φ_y(Y) ∈ H_y. It then maximizes the correlation between the projected data points a := w_x^T φ_x(X) and b := w_y^T φ_y(Y). The concept of kernel CCA is illustrated in Figure 2.2.
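For readers who want to see the mechanics, below is a minimal numpy sketch of regularized kernel CCA in its dual form, roughly following a standard dual-form derivation: RBF kernel matrices are centered and the leading canonical directions are obtained from an eigenproblem. The kernel choice, the regularizer reg, and the scaling of beta are simplifications for illustration, not the formulation used elsewhere in this dissertation.

```python
import numpy as np

def center_kernel(K):
    """Double-center a kernel (Gram) matrix."""
    n = K.shape[0]
    one = np.ones((n, n)) / n
    return K - one @ K - K @ one + one @ K @ one

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_cca(X, Y, gamma=1.0, reg=0.1, k=2):
    """Return approximate top-k canonical correlations and dual weights."""
    Kx = center_kernel(rbf_kernel(X, X, gamma))
    Ky = center_kernel(rbf_kernel(Y, Y, gamma))
    n = Kx.shape[0]
    Rx = np.linalg.inv(Kx + reg * np.eye(n))
    Ry = np.linalg.inv(Ky + reg * np.eye(n))
    # Reduce the regularized generalized eigenproblem to an ordinary one.
    M = Rx @ Ky @ Ry @ Kx
    vals, vecs = np.linalg.eig(M)            # eigenvalues approximate squared correlations
    order = np.argsort(-vals.real)[:k]
    corrs = np.sqrt(np.clip(vals.real[order], 0.0, None))
    alpha = vecs.real[:, order]
    beta = Ry @ Kx @ alpha                   # paired dual directions for the second view (up to scale)
    return corrs, alpha, beta
```

Without the regularizer the dual problem is degenerate (perfect correlation is always achievable when n samples live in an n-dimensional kernel space), which is why a ridge term is standard in practice.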
Due to its capability to deal with nonlinearly correlated data, kernel CCA is widely used in multimodal learning. For example, it can be used for phonetic recognition when articulatory measurements and acoustic features are available [10]. When applying kernel CCA to those two modalities, the non-discriminative information is largely uncorrelated and therefore filtered out. Hence, the learned projections only incorporate the correlated information and can deliver better phonetic classification performance than the original features. Kernel CCA is also used in facial expression recognition problems [136]. Facial images can provide geometric information about the facial expression. Meanwhile, in the learning phase, there are some semantic ratings describing the basic expressions such as happiness, sadness, surprise, anger, disgust and fear. Kernel CCA is used to learn the correlation between the geometric information and the semantic information and project those two feature vectors to a subspace where they have a linear dependence. In the new subspace, it is easier to build linear regression or classification models between the two modalities than in the original space. Hence, given a test image, the associated semantic rating can be estimated. In the social media area, people share the events they attended on social media websites. Identifying unique events from these websites and grouping information for the same events is a cumbersome task due to the high dimensionality of the data collected from social media and the nonlinear dependence between different modalities. Kernel CCA can effectively learn a semantic representation of potentially correlated feature sets. It can be used to learn a joint representation from images and texts/tags/usernames. The new features can be concatenated as a new feature vector for clustering social events [3]. This method delivers better performance than those that only use data from one modality.

2.3.2 Deep canonical correlation analysis

Even though kernel CCA can be used to learn nonlinear representations, this method is not easy to scale when the size of the training data is large. Moreover, the representations learned by kernel CCA are dependent on the kernels used. If the kernels are not suitable for the data, this method may fail. Recently, deep neural networks have shown a strong ability to learn nonlinear representations [104, 28, 14, 90]. Therefore, deep CCA was proposed [9, 120] to learn flexible and data-driven nonlinear representations from two modalities. Given two data modalities, deep CCA learns two deep nonlinear mappings which map the two modalities to new representations such that the canonical correlation of the new representations is maximized [119]¹:

min_{θ_f, θ_g, W_x, W_y}  −(1/N) tr(W_x^T f(X) g(Y)^T W_y),    (2.9)
s.t. W_x^T ((1/N) f(X) f(X)^T + r_x I) W_x = I,
     W_y^T ((1/N) g(Y) g(Y)^T + r_y I) W_y = I,
     w_xi^T f(X) g(Y)^T w_yj = 0 for i ≠ j,

where X and Y are the input data of the two modalities, f and g denote two fully-connected deep neural networks which produce nonlinear mappings, and θ_f, θ_g are the parameters of the two networks. N is the sample size. W_x and W_y are the canonical correlation vectors defined in Section 2.2.1. We use the regularized covariance instead of the original covariance to prevent overfitting, and r_x, r_y are regularization parameters (we assume the data are centered). w_xi is the i-th column of W_x and w_yj is the j-th column of W_y.

¹ We use the biased covariance to make it consistent with the original formulation proposed in [119]. Since N is a constant, it does not affect the optimal solutions of the model parameters whether we use the biased or the unbiased covariance.

Figure 2.3 gives an overview of deep CCA. In CCA, the mappings for the two modalities are W_x^T and W_y^T, which produce linear projections. It may be difficult to accurately reconstruct one modality from the other due to the possible non-linear interactions between the two modalities [119]. In deep CCA, the mapping functions for the two modalities are W_x^T f(·) and W_y^T g(·). They learn the possible nonlinear interactions and project the two modalities to a subspace in which one modality is easier to reconstruct from the other.

One alternative view of deep CCA is that it learns two kernels from data for kernel CCA. Sometimes we do not know what kind of kernels are best suited for the data. Hence, the kernel we choose may not provide an appropriate nonlinear mapping for the data. In this case, a deep neural network is a better choice than a prescribed kernel, as the "non-linear transformation" is learned from the data. This is empirically demonstrated by the experiments on articulatory speech data and MNIST data.

We note that deep CCA can also be combined with other deep learning techniques. For example, it can be combined with autoencoders [119]. In addition to deep CCA, this model also contains two autoencoders to reconstruct the learned views. It optimizes an objective that maximizes the canonical correlation between the projected representations and minimizes the reconstruction error of the autoencoders simultaneously²:

min_{θ_x, θ_y, W_x, W_y}  −(1/N) tr(W_x^T f(X) g(Y)^T W_y)
  + (λ/N) Σ_{i=1}^N ( ||x_i − p(f(x_i))||^2 + ||y_i − q(g(y_i))||^2 ),    (2.10)
s.t. W_x^T ((1/N) f(X) f(X)^T + r_x I) W_x = I,
     W_y^T ((1/N) g(Y) g(Y)^T + r_y I) W_y = I,
     w_xi^T f(X) g(Y)^T w_yj = 0 for i ≠ j,

where x_i is the i-th sample from the first modality, y_i is the i-th sample from the second modality, p and q denote the reconstruction (decoder) networks of the two autoencoders, and λ > 0 is a trade-off parameter to control the reconstruction error. The other notations are the same as in deep CCA in Eq. (2.9). Compared with deep CCA's formulation in Eq. (2.9), this formulation considers the reconstruction error of the two autoencoders in the form of regularizations, in which each autoencoder maximizes a lower bound of the mutual information between the inputs and the learned features [115]. Meanwhile, CCA can be viewed as maximizing the mutual information between the canonical variate pairs, i.e., the projected features of the two modalities [17]. Hence, this method offers a trade-off between the information captured in the input-feature mapping within each modality on the one hand, and the information in the feature-feature relationship across modalities on the other hand [119]. The framework is illustrated in Figure 2.4.

² See note 1.

Figure 2.3: Illustration of the deep canonical correlation analysis structure [9]. It learns two deep non-linear mappings which map two modalities to new representations such that the canonical correlation of the new feature vectors is maximized.

Figure 2.4: Illustration of deep canonically correlated autoencoders [119]. It simultaneously maximizes the canonical correlation between the projected representations and minimizes the reconstruction error of the autoencoders.

2.3.3 Multimodal deep Boltzmann machine

In addition to the discriminative models introduced above, generative approaches are also widely used in multimodal learning. Generative approaches model the joint probability of multiple modalities.

One example is the multimodal deep Boltzmann machine (DBM) [103]. This method is based on the restricted Boltzmann machine (RBM). We first review some basic concepts of the RBM. An RBM is a network of symmetrically coupled binary random variables or units. An RBM contains two layers of units. The first layer contains visible units (input) x ∈ {0, 1}^m, and the second layer contains hidden units h ∈ {0, 1}^n, where n is the number of hidden units and m is the number of visible units. The hidden units and the visible units are connected. No visible-to-visible or hidden-to-hidden interaction is allowed. Figure 2.5(a) is an illustration of an RBM. W is the interaction matrix between the hidden units and the visible units. For all Boltzmann machines, the joint probability distribution over the units is calculated through an energy function E, as known from statistical physics:

p = (1/Z) exp(−E),

where Z is a normalization factor that ensures p integrates to 1. For the RBM illustrated in Figure 2.5(a), the energy function is E(x, h | W) = −x^T W h, if we ignore the self-energy of the two layers and only consider the interaction energy between the two layers. Hence, the joint distribution of the visible units and the hidden units is:

p(x, h | W) = (1/Z(W)) exp(x^T W h),

where Z(W) is the normalization factor parameterized by the network parameter W.

Figure 2.5: Example of a restricted Boltzmann machine and a deep Boltzmann machine. (a) An example of a restricted Boltzmann machine (RBM), where h is the hidden layer and x is the visible layer. (b) An example of a deep Boltzmann machine (DBM). It contains two hidden layers and one visible layer.

When dealing with challenging applications, we may need abstract internal representations. In these cases, the two-layer structure of RBMs may not be able to produce a satisfactory performance. This limitation can be overcome by the DBM. Similar to the RBM, a DBM is a network of symmetrically coupled stochastic binary units [103]. It contains visible units (input) and several layers of hidden units h^(i) ∈ {0, 1}^{F_i}, where i indexes the i-th hidden layer and F_i is the number of units in the i-th hidden layer. Figure 2.5(b) illustrates an example of a DBM with two hidden layers, where W^(1) and W^(2) are the weight matrices connecting consecutive layers, which measure the interactions between layers. The energy function is E(x, h^(1), h^(2) | W^(1), W^(2)) = −x^T W^(1) h^(1) − (h^(1))^T W^(2) h^(2), and hence, the joint distribution of the input units and the two hidden layers is given by:

p(x, h^(1), h^(2) | W^(1), W^(2)) = (1/Z(W^(1), W^(2))) exp(x^T W^(1) h^(1) + (h^(1))^T W^(2) h^(2)).

In multimodal learning, multiple modalities may have distinct statistical properties. For example, text features are discrete and image features are continuous. Since the DBM can extract abstract representations, in most cases it is more suitable for multimodal learning than the RBM. Figure 2.6 is an example of using a three-hidden-layer DBM to learn the joint representations from two modalities [103]. The two modalities can be texts, images, tags, videos, etc. As an example, we use texts and images as the two modalities. In Figure 2.6, v^m ∈ R^D and v^t ∈ N^K denote the image input and text input, respectively. h^(im), h^(it) with i = 1, 2, and h^(3) are the hidden layers. Each modality has two modality-specific hidden layers. h^(3) is the learned joint representation. Since image features are real-valued, the visible-hidden interaction energy should use the form of a Gaussian RBM. The energy between the image visible layer and the first two hidden layers of the image-specific part is given by [103]:

E(v^m, h^(1m), h^(2m) | θ) = − Σ_{i=1}^D Σ_{j=1}^{F_1} (v_i^m / σ_i) W_ij^(1m) h_j^(1m) + Σ_{i=1}^D (v_i^m − b_i)^2 / (2σ_i^2) − Σ_{j=1}^{F_1} Σ_{l=1}^{F_2} h_j^(1m) W_jl^(2m) h_l^(2m),

where θ = {W^(1m), W^(2m), b, σ} are the model parameters. W^(1m) is the weight between the input layer and the first hidden layer, and W^(2m) is the weight between the first hidden layer and the second hidden layer. b is the bias of the input layer. σ is the standard deviation of the Gaussian distribution. It can be the same for all the visible units or independent for each visible unit if the data is not whitened [63]. In this energy function, the first two terms are the interaction between the input (visible) units and the first hidden layer. The third term is the energy between the first hidden layer and the second hidden layer. The joint distribution of those layers can be calculated from this energy function. The joint distribution of the text part is similar to that of the image component, except that the energy between the input units and the first hidden layer needs to be changed to a Replicated Softmax model [47] to deal with the text input. This model can be easily extended to other data modalities by modifying the energy of the input layer according to the data property of the input. The energy between the second hidden layer of each modality and the third hidden layer (the joint representation layer) is:

E(h^(3), h^(2t), h^(2m) | θ) = −(h^(2m))^T W^(3m) h^(3) − (h^(2t))^T W^(3t) h^(3).    (2.11)

Figure 2.6: The illustration of a multimodal Deep Boltzmann machine [103]. It models the joint distribution of data from two modalities, and thus provides a joint representation.

The joint distribution of those units is p(h^(3), h^(2t), h^(2m) | θ) = (1/Z(θ)) exp(−E(h^(3), h^(2t), h^(2m) | θ)). Given these distributions, we can compute the joint distribution of the inputs from multiple modalities [103]:

p(v^m, v^t | θ) = Σ_{h^(2m), h^(2t), h^(3)} p(h^(2m), h^(2t), h^(3) | θ) ( Σ_{h^(1t)} p(v^t, h^(1t), h^(2t) | θ) ) ( Σ_{h^(1m)} p(v^m, h^(1m), h^(2m) | θ) ),

where h = {h^(1m), h^(2m), h^(1t), h^(2t), h^(3)} denotes the set of all hidden layers. For generative models, the model parameters can be learned by maximizing the likelihood. In this model, exact maximum likelihood learning is intractable, but the parameters can still be learned approximately by a variational approach [103].

Figure 2.7(a) shows a multimodal RBM, and Figure 2.7(b) presents a deep version of this model. The difference between those two models is that the deep model has many layers to transform features. In some cases, the statistical properties of different modalities are rather different. For example, in the previous example, text features are discrete and image features are continuous. Directly learning a joint representation from different modalities through a restricted Boltzmann machine may not be feasible then; it needs extra bridges between the joint representation and the inputs of each modality. In a DBM, each layer successively transforms the representation into a slightly more abstract level and removes part of the modality-specific correlations [103]. Hence, the middle layer can be viewed as a modal-free representation, while the inputs are modal-full representations. Compared with a simple multimodal RBM, the task of removing modality-specific components is distributed across different layers in the deep model. Therefore, it is much easier to extract joint representations for the deep model than for the shallow model.

Figure 2.7: Different multimodal Boltzmann machines [103]. (a) contains only one hidden layer. (b) contains multiple hidden layers. The task of removing modality-specific components is distributed across different layers in the deep model. It can be easier to extract joint representations for (b) than for (a).
Chapter 3

Discriminative Fusion of Multiple Brain Networks

In neuroimaging research, brain networks derived from different tractography methods may lead to different results and perform differently when used in classification tasks. As there is no ground truth to determine which brain network models are most accurate or most sensitive to group differences, we developed a new sparse learning method that combines information from multiple network models. We used it to learn a convex combination of brain connectivity matrices from 9 different tractography methods, to optimally distinguish people with early mild cognitive impairment from healthy control subjects based on their structural connectivity patterns. Our fused networks outperformed the best single network model, Probtrackx (0.89 versus 0.77 cross-validated AUC), suggesting their potential for numerous connectivity analyses.

3.1 Methodology

3.1.1 Preliminary

Since this work is based on sparse logistic regression, we give a brief introduction to sparse logistic regression here.

In linear models, sparsity means that a feature variable is determined to be irrelevant if its corresponding weight is zero. Such irrelevant feature variables are discarded and make no contribution to the model. Sparse learning algorithms such as sparse logistic regression are powerful tools for building classification models from high-dimensional data at low computational cost. Sparsity is achieved by adding a sparsity-inducing regularization term on the weight vector $w$, such as $\lambda \|w\|_1$, to the objective function, so that the learned weight vector, and hence the model, is sparse. Let $x_i \in \mathbb{R}^d$ denote one subject, where $d$ is the number of feature variables used, which will be elaborated later. The binary class label of this subject is denoted by $y_i \in \{-1, +1\}$, where an MCI subject is denoted as $-1$ and an NC subject as $+1$. Given $n$ samples $\{\{x_1, y_1\}, \{x_2, y_2\}, \ldots, \{x_n, y_n\}\}$, the loss function for sparse logistic regression is:

$$\ell = \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (w^T x_i + c)\right)\right) + \lambda \|w\|_1 \tag{3.1}$$

where $c$ is the intercept and $\lambda$ is a tunable regularization parameter that is greater than or equal to 0. Here we use the $\ell_1$ norm to regularize the weight vector, which yields sparsity in the weight vector. When $\lambda$ equals 0, there is no sparsity in the weight vector. As $\lambda$ increases, more entries of the weight vector turn to 0. When $\lambda$ is large enough, all entries of the weight vector become 0. By minimizing the loss function, we obtain the optimal weight vector $\hat{w}$ and intercept $\hat{c}$. For a new subject $\tilde{x}$, the probability that this subject belongs to class $\tilde{y}$ is:

$$P(\tilde{y} \mid \tilde{x}) = \frac{1}{1 + \exp\left(-\tilde{y}(\hat{w}^T \tilde{x} + \hat{c})\right)} \tag{3.2}$$

If the probability of this subject belonging to the NC group is greater than 0.5, the subject is labeled as NC; otherwise the subject is labeled as MCI.
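As an illustration of Eqs. (3.1) and (3.2), the following sketch fits an $\ell_1$-penalized logistic regression with scikit-learn, whose parameter C plays the role of $1/\lambda$. The data here are synthetic stand-ins for the connectivity features, and the dimensions and sparsity level are illustrative assumptions, not the ADNI setting used later in this chapter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in: n subjects, d edge features, labels +1 (NC) / -1 (MCI).
n, d = 120, 500
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:10] = 1.0                                   # only a few edges are informative
y = np.where(X @ w_true + 0.5 * rng.normal(size=n) > 0, 1, -1)

# l1-penalized logistic regression (Eq. 3.1); smaller C (larger lambda) -> sparser w.
clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)
clf.fit(X, y)

print("nonzero weights:", np.count_nonzero(clf.coef_))
# Eq. (3.2): probability of the positive (NC) class for a new subject;
# column order follows clf.classes_ = [-1, 1].
print("P(NC | x_new):", clf.predict_proba(X[:1])[0, 1])
```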
33 3.1.2Overview Fig.3.1summarizestheoverviewofourfusionapproachtobuildficonsensusnetworksflbased onfusingnetworksfrommultipletracttracingmethods.FromdiffusionMRIscansofmultiple subjects,weextractdifferentbrainnetworkswithwholebraintractography.Thoughourproposed fusionapproachisnotlimitedtostructuralnetworkscomputedfromdMRItractography,herewe usetheninetractographymethodsstudiedinourpreviouswork[134],whichincludemethods thatareastensor-baseddeterministic,orientationdistributionfunction(ODF)-basedde- terministic,andprobabilisticapproaches.Eachnetworkreconstructionmethoddescribesbrain connectivityfromadifferentperspective,andnoneisuniversallybetterthanallothersfordiagnos- tictasks.ThereforewhenitcomestobuildingmodelsfromdiffusionMRIimages, itisintuitivetofusedifferentbrainnetworksandleveragethepredictiveinformationfromallthe networks.However,thekeyquestionishowtofusethedifferentnetworksandbuildeffectivepre- dictivemodelsfromthefusedmodels.Asfarasweknow,thereisnoprincipledapproachproposed tocombinenetworksforuseinpredictivemodels.Asshownintheexperimentalsection,simple numericalaveragingofnodaledgeweightsmaynotbeabletoboostthepredictiveperformance. Instead,weproposetolearnhowtofusethenetworksfromdata,suchthatthecombinationgives theoptimalpredictiveperformance.First,westudyfusednetworkscomputedasaconvexcombi- nationofdifferentbrainnetworks.Wedescribeanewmachinelearningmodeltosimultaneously learnthecoefoftheconvexcombinationaswellastheparameters.Asaresult, thecombinationcoefentsarelearnedtomaximizethepredictiveperformanceofthe andmeanwhiletheislearnedtousethecombinednetwork. 34 Figure3.1: Overviewofournetworkfusionframework. Multipletypesofbrainnetworksare computedbyapplyingdifferenttractographymethodstotheparticipants'diffusionMRIdata[134]. Differentbrainnetworksarecombinedusingasparselearningmethodandtheoptimalconvex combinationisusedforThecombinationcoefandthearesimul- taneouslylearnedfromthetrainingdataandcross-validated. 3.1.3DiscriminativeFusion: Ourproposeddiscriminativefusion( D F USE )isadata-drivenmodelthatincludesatrainingstage andapredictionstage.Inthetrainingstagethe D F USE algorithmlearnstheoptimalcombination coefandalogisticregressionfromasetofpatientswithknownmedicalclas- Inthepredictionstage,thebrainnetworksfromapatientarecombinedaccordingto thecoefThecombinednetworkisthenusedbythecltogiveapredictionforthe medicalproblem. Formulation. GivenasetofdiffusionMRIscansfrom N patients,weapplydifferenttractog- raphymethodstoobtain M brainnetworksforeachparticipant.Let x ( m ) i denotesavectorrepre- sentationofthe m -thbrainnetworkforpatient i ( i 2 [ 1 ; N ] ; m 2 [ 1 ; M ] ),inwhicheachelementis anumericalrepresentationofaconnectionproperty(e.g.,densityorintegrity)betweentwobrain regions.Wewouldliketocombineallnetworksforeachparticipantintoasinglenetworkusinga convexcombination,i.e.,thecombinednetwork x i ( t )= å M m = 1 t m x ( m ) i ,where t =[ t 1 ::: t M ] isthe vectorofcombinationcoefandtheconvexcombinationgives å M m = 1 t m = 1; t m 0 ; 8 t m . 35 Convexcombinationisonetypeoflinearcombinationthatgivesaclearinterpretationonhow mucheachoriginalnetworkcontributestothefusednetwork.Forthe N subjectsusedfortraining, wealsohavediagnosticlabelinformationstoredin y =[ y 1 ;:::; y N ] ,where y i = 1ifthepatientis caseand 1ifcontrol. Tolearnthecombinationofthenetworks,weproposeamachinelearningformulationthat jointlylearnstheparametersandthecombinationcoefwhichsolvesthefollowing optimizationproblem: min w ; c ; t S N i = 1 ` ( w ; c ; t ; x i ; y i )+ l k w k 1 ; (3.3) s.t. 
S M m = 1 t m = 1; t m 0 ; 8 t m where w and c areparameters,theconstraintson t ensuresaconvexcombination,the logisticlossis: ` ( w ; c ; t ; x i ; y i )= log 1 + exp y i ( x i ( t ) T w + c ) : The ` 1 -norminducessparsityintheparameters w [72,141,139,138],suchthatthe learnsasubsetofpredictiveconnectionsandonlyusestheseconnectionsinthe.The sparsityparameter l controlsthesparsityofthemodel.Asmaller l allowsmoreconnections tobeinvolvedinthemodel.Theoptimizationproblemin(3.3)canbesolvedbyproximalblock coordinatedescent[12,112,125].Oncetheoptimizationprocesshasconverged,weobtainthe optimalcombinationcoef t andparameters w and c . 36 3.1.4Optimization TheobjectivefunctioninEq.(3.3)isaconvexfunction.So,ithasglobalsolution.Sincethereare non-differentiableterms,weuseproximalgradientdescenttooptimizeit.Wecomputethe gradientwithrespecttoallparameters.Wedenote L = S N i = 1 ` ( w ; c ; t ; x i ; y i )+ l k w k 1 (3.4) Denote X tobethetensorthatisformedbystackall x ( j ) with j = 1 ;::: m .Then,theshapeof X is n d m .Thegradientof L withrespectto w is ¶ L ¶ w = 1 N x ( t ) T ( y ( s ( y ( x ( t ) w + c ))) (3.5) where denotedotproduct. s isthesigmoidfunction. s ( x )= 1 1 + exp ( x ) (3.6) Thegradientwithrespectto t is ¶ L ¶t = y N ( s ( y ( x ( t ) w + c ) Xw (3.7) Thegradientwithrespectto c is ¶ L ¶ c = y N s ( y ( x ( t ) w + c ) y (3.8) 37 ProximalGradientDescent: Proximalgradientdescent[107]iswidelyusedtooptimizethe objectivefunctionwithbothdifferentiableandnon-differentiableterms.Westartfromthe generalformofproximalgradientdescentandthenapplyittoourproblem. Givenobjectivefunction f ( x )= g ( x )+ h ( x ) (3.9) where g ( x ) isaconvexdifferentiablefunctionand h ( x ) isaconvexnon-differentiablefunction.If weonlyconsiderthedifferentiablepartfor f ( x ) ,i.e. f ( x )= g ( x ) ,wecanusegradientdescentto optimizeit,i.e. x k + 1 = x k t Ñ f ( x ) (3.10) Itisequivalenttooptimizesolvethefollowingoptimizationproblem. x k + 1 = argmin z f ( x k )+ Ñ f ( x k ) T ( z x k )+ 1 2 t k z x k k 2 (3.11) However, h ( x ) isnotdifferentiable.Thestrategyistoleave h ( x ) unchanged.Sotheupdatefor x k + 1 is x k + 1 = argmin z f ( x k )+ Ñ f ( x k ) T ( z x k )+ 1 2 t k z x k k 2 + h ( z ) (3.12) Eq.(3.12)meansthatwhenweupdate z = x ,weminimizethenon-differentiableterm h ( z ) .So,in eachstep,weupdatethesmoothtermusingthegradientdescenttomakesureitismovingtothe directionthatmakesthefunctionvaluesmallerandmeanwhilethe h ( z ) isalsominimizedandis 38 to-wardingtoourgoal,i.e.,minimizetheobjectivefunction. Eq.(3.12)canbewrittenas x k + 1 = argmin z 1 2 t k z ( x k t Ñ g ( x k )) k 2 + h ( z ) (3.13) Wetheproximalmappingasfollows. Prox ( x k + 1 )= argmin z 1 2 t k x z k 2 + h ( z ) (3.14) Then,Eq.(3.13)canbewrittenas x k + 1 = Prox ( x k t k Ñ g ( x k )) (3.15) Inourproposedform,wehavetwonon-differentiableterms. h 1 ( w )= k w k 1 (3.16) h 2 ( w )= simplex ( t ) (3.17) whereweuse simplex ( x ) denotetheconstraint å x i = 1 ; x i 0. Projections: Next,wewillshowhowtooptimizeEq.(3.14)withthesetwonon-differentiable terms.Westartwithageneralsimplexprojection. min x 1 2 k x y k (3.18) s : t : x T 1 = 1(3.19) x 0(3.20) 39 TheLagrangianoftheprobleminEq.(3.18)is L ( x ; l ; b )= 1 2 k x y k 2 l ( x T 1 1 ) b T x (3.21) where l and b areLagrangianmultipliers.Attheoptimalpoint,wehavetheKKTcondition x i y i l b i = 0(3.22) x i 0(3.23) b i 0(3.24) x i b i = 0(3.25) å i x i = 1(3.26) FromEq.(3.25)andEq.(3.23)wehave(1)if x 0, b i =0and y i + l i 0,AND(2)if x = 0, b i 0and y i + l = b i .Thus,wecansortthexandyinthedescentorder. 
y 1 y 2 ;:::; y r y r + 1 ;:::; y d (3.27) x 1 x 2 ;:::; x r = x r + 1 = ;:::; = x d (3.28) wherewehavewhen i > r ,all x 1 = 0.FromEq.(3.26)wehave l = 1 r ( 1 r å i y i ) (3.29) 40 WIthShealev-ShwartzandSingerTheorem,wehavethesolutionfor r is r = f j ; max f 1 j d : y j + 1 j ( 1 j å i y j ) > 0 g (3.30) Next,weshowhowtoprojectto l 1 ball.Supposewehavetheprojection min 1 2 k x y k 2 + l k x k 1 (3.31) Since k x k 2 = å i x 2 i and k x k 1 = å i j x i j ,weoptimizeeachdimensionseparatelyforEq.(3.31),i.e., 1 2 ( x 1 y 1 ) 2 + l j x 1 j with i = 1 ;::: d (3.32) TosolveEq.(3.32),weusethesubgradientmethod.Thesubdifferentialof j x j is sign ( x ) and d ( x y 2 dx = 2 ( x y ) .Thus,wehave0 2 ¶ f ( x ) where x denotetheoptimalsolution.Therefore,we have x = sign ( x ) max ( j x j l ; 0 ) (3.33) 3.2Experiments 3.2.1Dataset Theimagingdatasetsanalyzedforinthisstudywerecollectedfrom16sitesacrosstheUnited StatesandCanadainthesecondstageoftheNorthernAmericanAlzheimer'sDiseaseNeuroimag- ingInitiative(ADNI2).Intotal,124subjects'diffusionMRIandstructuralMRIdatawereana- lyzed.Detailedsubjectinclusion,exclusioncriteriaandscanningprotocolscanbefoundinthe 41 ADNI2website.These124subjectsinclude51normalelderlycontrols(NCs),73individuals diagnosedwithearlymildcognitiveimpairment(eMCI). 3.2.2BrainNetworks Foreachsubject,wecomputed9brainnetworksusingninemethods,including4tensor-based deterministicalgorithms:FACT(T-FACT)[78],thesecond-orderRungeŒKutta(T-RK2)[11],the tensorline(T-TL)[66],andinterpolatedstreamline(T-SL)methods[29],twodeterministictrac- tographyalgorithmsbasedonfourthordersphericalharmonicderivedODFsŒFACT(O-FACT) andRK2(O-RK2),andthreeprobabilisticapproaches:fiball-and-stickmodelbasedprobabilistic trackingflProbtrackx(Probt)[13],theHoughvotingmethod[2]andtheprobabilisticindexofcon- nectivity(PICo)method[84].Eachbrainnetworkdescribesdetectedconnectionsbetween113 corticalandsubcorticalregions-of-interest(ROIs),whicharebyusingtheHarvardOxford CorticalandSubcorticalProbabilisticAtlas[33].Thereforewecanuseavectorofdimension6328 (113 112 = 2)torepresentallconnectionsofdistinctROIspairsineachnetwork.Pleasesee[134] fordetailsofcomputingtheseninebrainnetworks. 3.2.3ExperimentSettings Intheexperimentwecomparedthepredictiveperformanceofindividualnetworks,intermsof areaundertheROCcurve(AUC),sensitivityand.Thesearestandardmetricsmeasuring algorithmperformanceinproblems.Wealsoprovidetwointuitivefusionmethods forbaselinecomparisons.Themethodconcatenatesvectorsfromallnetworks(B-CON), resultinginafeaturevectorofdimension56952.Thesecondmethodcombinesthenetworksby averagingofalloftheindividualnetworks;thiscanbeconsideredasaspecialcaseofthegeneral 42 linearcombination( t i = 1 = 9 ; 8 i ).Forallthepatients,weused10-foldcrossvalidation,i.e.,each timeweusethebrainnetworksfrom90%patientstotraina,andthe10%totestthe andcomputeperformancemetrics.Forallindividualbrainnetworksaswellasthetwo baselinemethods,weusesparselogisticregressiontotrainFortheproposed D F USE , theclassistrainedusingalgorithmsinSection3.1.Asthesamplesizeistoosmalltogenerate extravalidationdataformodelselection(theselectionofhyperparameter l inthesparselogistic regression),wereportthebestperformanceforallmethods. 3.2.4Results Averagedresultsover10iterationsaregiveninTable3.1.Ourproposed D F USE algorithmoutperformedallothercompetingmethods( p -value<0.001). D F USE has anaverageAUCof0 : 89,comparedto0 : 77achievedbythebestindividualmethod,whichused onlytheProbtrackx(Probt)networks. 
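For reference, the two projection operators derived in Section 3.1.4, the sorting-based Euclidean projection onto the simplex used for the combination coefficients and the soft-thresholding operator of Eq. (3.33) used for the sparse weights, can be sketched as follows; the example vectors are arbitrary.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection onto {x : x >= 0, sum(x) = 1}, via the sorting-based
    construction of Section 3.1.4 (Shalev-Shwartz and Singer)."""
    u = np.sort(y)[::-1]                       # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(y) + 1) > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(y + lam, 0.0)

def soft_threshold(y, lam):
    """Proximal operator of lam * ||.||_1 (Eq. 3.33): elementwise shrinkage."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Schematically, one proximal step applies a gradient step on the smooth loss,
# then projects tau onto the simplex and soft-thresholds w.
tau = project_simplex(np.array([0.4, 0.9, -0.2]))
w   = soft_threshold(np.array([0.8, -0.05, 0.3]), 0.1)
print(tau, tau.sum(), w)
```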
D F USE alsohadthehighestaveragesensitivityof0 : 84and of0 : 77,comparedtothesecondhighestsensitivityof0 : 72achievedbytensor-based FACT(T-FACT)and0 : 69bytheProbtrackxnetworks.Noindividualbrainnetworkgeneration methodhadapredictivepowerthatwasevenclosetotheonefromthefusedbrainnetwork.This improvementinpredictiveperformancesupportsourhypothesisabouttheof fusionforbrainnetworks. Twootherbaselinenetworkcombinationmethodsalsodidnotperformwell:thepredictive performanceofthefeatureconcatenation(B-CON)doesnotevenperformaswellasthebest individualbrainnetwork.Thismaybebecause,fortheB-CONmethod,therearetoomanyfeatures presentedtothe(over56k),relativetothenumberofsubjects(samples)availabletotrain it.Only ˘ 110samplesareavailableheretotraintheateveryiteration(90%ofthe totalof124subjects).Ontheotherhand,theAUCofthesimpleaveragebrainnetwork(B-AVG) 43 AUCSensitivity D F USE 0 : 89 0 : 090 : 84 0 : 160 : 77 0 : 07 B-CON0 : 58 0 : 100 : 56 0 : 210 : 50 0 : 07 B-AVG0 : 55 0 : 150 : 58 0 : 200 : 49 0 : 08 B-ENS0 : 79 0 : 110 : 71 0 : 250 : 72 0 : 09 T-FACT0 : 59 0 : 110 : 72 0 : 250 : 44 0 : 14 T-RK20 : 58 0 : 110 : 56 0 : 250 : 49 0 : 10 T-SL0 : 62 0 : 140 : 48 0 : 270 : 64 0 : 26 T-TL0 : 58 0 : 140 : 60 0 : 210 : 48 0 : 07 O-FACT0 : 62 0 : 090 : 60 0 : 190 : 51 0 : 09 O-RK20 : 60 0 : 130 : 60 0 : 210 : 53 0 : 07 PICo0 : 58 0 : 100 : 56 0 : 210 : 50 0 : 07 Hough0 : 66 0 : 110 : 64 0 : 230 : 54 0 : 11 Probt0 : 77 0 : 080 : 70 0 : 220 : 69 0 : 08 Table3.1: Quantitativecomparisonofusingdifferentbrainnetworkstopredictthe earlyMCI. Wecomparetheperformanceofeachindividualbrainnetworksfromtractography, simplenetworkcombination,andournetworkfusionmethod( D F USE ).Theaverageandvariance ofareaundertheROCcurve(AUC),sensitivityandover10splittingsarereported.The proposed D F USE outperformsallothermethodsonthisproblem( p -value < 0 : 001). network t network t network t T-FACT0 : 025 T-Rk20 : 014 T-SL0 : 023 PICo0 : 058 Hough0 : 010 Probt0 : 871 T-TL0 O-FACT0 O-RK20 Table3.2: Combination t of9networks. is0 : 55,whichisevenpoorerthantheworstperformingbrainnetworkT-TL,at0 : 58.Arbitrary combinationsofbrainnetworksmaynothelpforthetaskofdistinguishingearlyMCIfromNCs. Taskfusionasproposedinthispapermaybemore 3.3Discussion Oneattractivepropertyoftheproposed D F USE approachisthatwecanobtainaninterpretable combinationcoef t ,indicatinghowmucheachoftheindividualbrainnetworkscontributes tothecombinednetwork.Theaveragecombinationcoefforallnetworksaregiven 44 inTable3.2.Weseethatinthecombination,Probtrackxhastheheaviestweightof0 : 871(all elementsof t rangefrom0to1),averagedover10iterations.Thisisconsistentwiththe thatProbtrackxisalsothebestpredictiveindividualnetworkasshowninTable3.1.Ontheother hand,theweightsofT-TL,O-FACT,O-RK2areconsistentlyzeros,i.e.,theydonotcontributeto thecombinednetwork.Assuch,thecombinationoffersaguidetowhichtractographymethods torun(clearlynotallmethodsneedtoberunforproblemswheretheyaregivenzeroweight). Moreover,thenetworkswithzeroweightsarenotthesameastheleastwhiteindividualnetworks (T-RK2,PICo,T-FACT).Theinconsistencyshowsthatnetworkswithweakpredictivepowermay stillhavevaluableconnectioninformationtocomplementotherbetterperformednetworks.It ispossibletoleverageclusteringanalysis[137]andexploredifferentsub-modalitieswithinthe networks,andwewillleavethisinterestinganalysisinourfuturework. 
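A minimal sketch of the D-FUSE prediction stage described in Section 3.1.3: a new subject's nine network vectors are combined with the learned coefficients and the combined network is passed to the sparse logistic classifier of Eq. (3.2). The coefficient values are seeded with the Table 3.2 averages in an assumed ordering, and the network vectors and classifier weights below are random placeholders.

```python
import numpy as np

def dfuse_predict(networks, tau, w, c):
    """Combine a subject's M vectorized networks (M x d array) with the convex
    weights tau, then apply the sparse logistic classifier (w, c)."""
    x_fused = tau @ networks                    # x(tau) = sum_m tau_m * x^(m)
    p_nc = 1.0 / (1.0 + np.exp(-(w @ x_fused + c)))
    return ("NC" if p_nc > 0.5 else "MCI"), p_nc

rng = np.random.default_rng(0)
networks = rng.normal(size=(9, 6328))           # 9 networks, 6328 edge features each
# Averaged coefficients from Table 3.2, in the assumed order
# (T-FACT, T-RK2, T-SL, T-TL, O-FACT, O-RK2, PICo, Hough, Probt).
tau = np.array([0.025, 0.014, 0.023, 0.0, 0.0, 0.0, 0.058, 0.010, 0.871])
tau = tau / tau.sum()
w = np.zeros(6328)
w[rng.choice(6328, 50, replace=False)] = rng.normal(size=50)   # sparse placeholder weights
print(dfuse_predict(networks, tau, w, c=0.0))
```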
Becauseofthesparsityintroducedonthemodel w ,wearealsoabletoinspectwhatarethe importantconnectionscontributingtotheByaveragingthenon-zeroweightsfor eachconnectionfromdifferentexperiments,wecangeneratearankedlistofconnections,many ofwhicharepreviouslyknowntoberelevanttotheprogressionofAlzheimer's.Hereareafew connectionsthatappearinthetopofthelist:RightTemporalPole , RightPrecentralGyrus,Left Pallidum , LeftCaudate,LeftLingualGyrus , LeftThalamus,LeftCingulateGyrusAnterior Division , LeftFrontalMedialCortex,RightPlanumPolare , RightHippocampus. 3.4Summary Inthiswork,wedevelopedanewmethodfordiscriminativefusionofmultiplebrainnetworksto detectearlymildcognitiveimpairment(MCI).Wesimultaneouslylearnedaconvexcombination ofdifferentbrainnetworkstobestdetectearlyMCI,andathatworkswiththecombined 45 brainnetwork.Asthenetworksarefusedinawaythatmaximizesthediscriminativepowerbe- tweennormalcontrolsandearlyMCIsubjects,theresultsfromthefusednetwork improveonsinglebrainnetworksaswellassimplefusionmethods. 46 Chapter4 MultimodalDiseaseModelingviaCollective DeepMatrixFactorization Alzheimer'sdisease(AD),oneofthemostcommoncausesofdementia,isasevereirreversible neurodegenerativediseasethatresultsinlossofmentalfunctions.Thetransitionalstagebetween theexpectedcognitivedeclineofnormalagingandAD,mildcognitiveimpairment(MCI),has beenwidelyregardedasasuitabletimeforpossibletherapeuticintervention.Thechallengingtask ofMCIdetectionisthereforeofgreatclinicalimportance,wherethekeyistoeffectivelyfusepre- dictiveinformationfrommultipleheterogeneousdatasourcescollectedfromthepatients.Inthis work,weproposeaframeworktofusemultipledatamodalitiesforpredictivemodelingusingdeep matrixfactorization,whichexploresthenon-linearinteractionsamongthemodalitiesandexploits suchinteractionstotransferknowledgeandenablehighperformanceprediction.,the proposedcollectivedeepmatrixfactorizationdecomposesallmodalitiessimultaneouslytocapture non-linearstructuresofthemodalitiesinasupervisedmanner,andlearnsamodalitycom- ponentforeachmodalityandamodalityinvariantcomponentacrossallmodalities.Themodality invariantcomponentservesasacompactfeaturerepresentationofpatientsthathashighpredictive power.Themodalitycomponentsprovideaneffectivemeanstoexploreimaginggenet- ics,yieldinginsightsintohowimagingandgenotypeinteractwitheachothernon-linearlyinthe ADpathology.ExtensiveempiricalstudiesusingvariousdatamodalitiesprovidedbyAlzheimer's 47 DiseaseNeuroimagingInitiative(ADNI)demonstratetheeffectivenessoftheproposedmethodfor fusingheterogeneousmodalities. 4.1Methodology 4.1.1Matrixfactorization Classicalmatrixfactorizationseekstoapproximateamatrixwithalow-rankmatrix,byexplicitly learningthematrixfactors.Givenadatamatrix X 2 R m n ,matrixfactorizationlearnstworeduced matrixfactors U 2 R m r and V 2 R n r ,suchthat X ˇ UV T ,and r < min ( m ; n ) istheupperbound oftherankoftheapproximatedmatrix UV T (therankof UV T canbelessthan r ifcolumnsof U or V arelinearlydependent).Thefactors U and V aretypicallylearnedviaanobjectivefunction: min U ; V d ( X ; UV T ) ; s.t. U 2 S 1 ; V 2 S 2 ; (4.1) where d ( X ; Y ) isadistancemetricfunctionmeasuringthedifferencebetweenmatrices X and Y , and S 1 and S 2 aretwoconstrainsimposedonthefactormatrices X and Y . 
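A minimal sketch of the factorization in Eq. (4.1), taking the distance $d$ to be the squared Frobenius norm (the typical choice discussed next) and omitting the constraints $S_1$, $S_2$; the problem size, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 100, 80, 5
X = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # a rank-r matrix to recover

# Gradient descent on ||X - U V^T||_F^2 over the factors U (m x r) and V (n x r).
U = 0.1 * rng.normal(size=(m, r))
V = 0.1 * rng.normal(size=(n, r))
lr = 1e-3
for _ in range(2000):
    R = X - U @ V.T                 # residual
    U += lr * (R @ V)               # descent direction for U (constants folded into lr)
    V += lr * (R.T @ U)             # descent direction for V
print("relative error:", np.linalg.norm(X - U @ V.T) / np.linalg.norm(X))
```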
Typicallythedistancemetric d ( X ; Y ) ischosentobetheFrobeniusnormofthedifference between X and Y .However,whenmissingvaluespresentin X , d ( X ; Y ) canbeasthe squared ` 2 distancebetweenalltheobservedelementsin X andtheircorrespondingelementsin Y .Assuch,weareabletolearnmatrixfactorsevenwithmissingvalues,andthelearnedmatrix factorscanthenbeusedtoestimatethemissingvaluesunderthelow-rankassumption.Thisis thesetupformatrixcompletion[22]andiscommonlyusedinrecommendersystems[61].The constraints S 1 and S 2 specifythefeasibleregionsofthematrixfactorstoinducemanydesired properties,suchasnon-negativity S = f U j U i ; j 0 ; 8 i ; j g innon-negativematrixfactorization[69] 48 andsparsity S = f U jk U k 1 z g forinterpretablefactors[140].Inaddition,thecomplexitycontrol canbeimplementedusingFrobeniusconstraints S = f U jk U k 2 F z g ,whichareequivalenttothe Frobeniusnormregularizations[60]. 4.1.2Collectivematrixfactorizationformultimodalanalysis Theapproximationin(4.1)addressesimportantsemanticsindataanalysis.Whenthedatamatrix X describestherelationshipbetweentwotypesofentities,thefactors U and V canbethoughtof aslatentfeaturesorlatentrepresentationsoftheentities.Forexample,inrecommendersystems weuse X i ; j todescribetherelationship(e.g.,rating)betweenauser i andanitem j .Therowvector u i 2 R r givesa r -dimensionallatentfeaturerepresentationfortheuser i andsimilarlytherow vector v j 2 R r isalatentrepresentationoftheitem j .Thetwotypesoflatentinteractwith eachotherlinearlyinthelatentsubspace R r ,i.e.,theobservedrelationshipin X i ; j canbeexplained as u i ( v j ) T . Thelatentrepresentation/subspaceperspectiveofmatrixfactorizationallowsustolinkmul- tipledatamodalities,whentheentitiesinvolvedinthemodalitiesareoverlapped.Inmultimodal modeling,assumewehavetwodatasets X 1 2 R n d 1 and X 2 2 R n d 2 describingthesamesetof objectsfromtwosetsoffeatures.Forexample,westudyasetof n patients. X 1 includes d 1 fea- turesfromT1MRImodalityand X 2 includes d 2 featuresfromdMRImodality.Thenwecanapply thematrixfactorizationproceduretofactorizebothdatasetsandconnectthetwofactorizationsby enforcingasharedpatientlatentrepresentation: min U ; V 1 ; V 2 d ( X 1 ; UV T 1 )+ d ( X 2 ; UV T 2 ) ; s.t. U 2 S 0 ; V i 2 S i ; i = 1 ; 2 ; wherethelatentrepresentation U isthusjointlylearnedfromtwomodalities.Wecallthis U 49 Figure4.1: Illustrationofproposedcollectivedeepmatrixfactorization(CDMF)framework. Inthisexample,CDMFfusesinformationfromthreemodalities:T1weightedMRI,diffusionMRI, andgenotypes(SNPs)tolearnamodalityinvariantlatentrepresentation,toperformpredictive modeling. matrixmodalityinvariant,astherepresentationnowcapturesintrinsicpropertiesofthepatients. Whenperformingregressionandonpatients,insteadofusingfeaturesfromrawdata matrices X 1 and X 2 ,wecanusethelatentrepresentation.Wecaneasilygeneralizethisapproach tohandlemoredatamodalities. 4.1.3Capturingcomplexinteractionsviacollectivedeepmatrixfactoriza- tion Oneessentialassumptionassociatedtotheclassicalmatrixfactorizationisthelineardependence inthematrix.Therefore,itimplicitlythatthelatentrepresentationslearnedfromcollec- tivematrixfactorizationhavetointeractwitheachotherlinearlyinthelearnedlatentsubspace. 
However,thisassumptionistoorestrictiveinmanyapplications,especiallyinthemodelingof 50 Alzheimer'sdisease,whereimagingmodalitiesandgeneticmodalityarelikelytolinkthrough ahighlynon-linearlymanner.Tocapturethecomplexinteractionsamongmodalities,wethus proposeanovelframeworktofusemultipledatamodalitiesthroughdeepmatrixfactorization.As- sumewehave t datamodalities X 1 2 R n d 1 ;:::; X t 2 R n d t describingdifferentviewsofthesame setof n samples.Weuseadeepneuralnetwork g q ( : ) parameterized q tofactorizeeachmodality, i.e., X i ˇ Ug q i ( V i ) ,whereinthisworkweuseastructureddeepneuralnetworkwith k layers: g q i ( V i )= f ( W ( k ; i ) f ( W ( k 1 ; i ) f ( :::; f ( W ( 1 ; i ) V i )) ; where W ( j ; i ) isthenetworkweightatthe j -thlayer, q i = f W ( k ; i ) ; W ( k 1 ; i ) ;:::; W ( 1 ; i ) g collectively denotesnetworkweights,and f isanon-linearactivationfunction.Thedeepnetworkservesasa highlynon-linearmappingbetweeninputmatrix X i and U ,andprojectsthelatentrepresentations non-linearlytothesamelatentspace.Wecallthis g q i ( V i ) modalitycomponentfor i -th modality.Wecanthusperformcollectivedeepmatrixfactorization(CDMF)toassociatemultiple datamodalities: min U ; f V i ; q i g t i = 1 å t i = 1 d ( X i ; Ug q i ( V i )) s.t. U 2 S 0 ; V i 2 S i ; 8 i : Wewouldliketohighlightonepropertyofcollectivedeepmatrixfactorizationthatmodalityin- variantcomponent/representationcanhavedifferentdimensionsfrommodalitycomponents,i.e., U and V canbedifferent,and V indifferentmodalitiescanalsobedifferent.Thisxibilityis desiredespeciallywhendifferentmodalitiescontaindifferentamountofinformation,andthusthe optimallatentrepresentationsmayhavedifferentdimensions.Wealsonotethatonewaytocontrol thecomplexityofnetworksundermultiplemodalitiesistoenforcesharednetworkstructures,i.e., 51 f g q i g havethesamearchitectureandsharethesameparametervalues,exceptforthelastlayer. Inmanyapplications,ourultimategoalistobuildpredictivemodelsfrommulti-modalanalysis. Toachievethis,wecanintegratepredictivemodelingandcollectivedeepmatrixfactorization duringlearning,suchthatpredictivemodelinguseslatentrepresentationslearnedfromcollective deepmatrixfactorizationasinputfeatures.Assumethatwearegivensupervisioninformation f y 1 ;:::; y n g forthe n subjects,andalinearmodelforthepredictiontask h ( U ; w )= U w (witha dummyvariabletoincludebias).Givenalatentrepresentation U j (i.e.the j -throwof U matrix) forthe j -thsubjectanditscorrespondinglabel y j ,weuseaproperlossfunction ` ( h ( U j ; w ) ; y j ) (e.g.,logisticlossforandleastsquaresforregression).Theproposedsupervised CDMFformulationisthusgivenby: min w ; U ; f V i ; q i g t i = 1 å n j = 1 ` ( h ( U j ; w ) ; y j )+ å t i = 1 a i d ( X i ; Ug q i ( V i )) s.t. U 2 S 0 ; V i 2 S i ; 8 i ; (4.2) where a i isatunableparametertocontrolknowledgefusionproportionofthe i -thmodality,spec- ifyinghowmuchthatthemodalitythelearningofthemodalityinvariantcomponent. When a i islarge,alessreconstructionerrorforthismodalitywillbeachievedwhenminimizing overallloss,andthereforethelearnedrepresentation U containsmoreinformationofthismodality, andviceversa.Figure4.1illustratestheproposedframeworkfusingthreemodalities:dMRI,T1 MRIandgenotypes(SNPs). Optimizationandinitialization. TheformulationcanbesolvedefentlybyTensorFlow[1]. 
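The dissertation's implementation uses TensorFlow [1]; purely for illustration, the following PyTorch sketch spells out the supervised CDMF objective of Eq. (4.2) for two complete modalities, with a single nonlinear layer $g_{\theta_i}(V_i) = f(W^{(1,i)} V_i)$, the squared activation used later in the experiments, and plain Adam updates. All sizes, initialization scales, and the label coding are assumptions, and the SVD-based initialization as well as the missing-modality indicator matrices of Eq. (4.3) are omitted.

```python
import torch

torch.manual_seed(0)
n, d1, d2, h, r = 200, 40, 60, 16, 8                      # illustrative sizes only
X1, X2 = torch.randn(n, d1), torch.randn(n, d2)
y = torch.randint(0, 2, (n,)).float()                     # 1 = MCI, 0 = NC (assumed coding)

P = lambda *s: torch.nn.Parameter(0.1 * torch.randn(*s))
U, w, b = P(n, r), P(r), P(1)                             # shared modality-invariant factor, classifier
V1, W1 = P(h, d1), P(r, h)                                # modality-specific factor and network weight
V2, W2 = P(h, d2), P(r, h)

f = lambda t: t ** 2                                      # squared activation
alpha1, alpha2 = 1.0, 1.0                                 # knowledge-fusion parameters
opt = torch.optim.Adam([U, w, b, V1, W1, V2, W2], lr=1e-2)
bce = torch.nn.BCEWithLogitsLoss()

for step in range(500):
    opt.zero_grad()
    loss = (bce(U @ w + b, y)                             # supervised term on U, Eq. (4.2)
            + alpha1 * (X1 - U @ f(W1 @ V1)).pow(2).sum() # d(X1, U g_theta1(V1))
            + alpha2 * (X2 - U @ f(W2 @ V2)).pow(2).sum())
    loss.backward()
    opt.step()
```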
However,sincetheobjectivein(4.2)ishighlynon-convexandgradientalgorithmsmayeasily trappedinlocaloptima,agoodinitializationisimportantfortrainingthenetwork.Inthiswork, weproposetoiterativelyapplylinearmatrixfactorizationsintheoriginaldatamatrix,anduselin- 52 earandhierarchicalmatrixfactorstoinitializethedeepneuralnetworks.Assuch,theinitialization issimilartoavalidlinearmatrixfactorization,andthealgorithmiterativelyexplorenon-linear effectswithinlinearlatentspacesandcapturenon-linearityinthenetworkduringlearningpro- cess.Technicallywecanchoosearbitrarylinearfactorizationmethodsin(4.1)forinitialization, however,weinourexperimentsthatsingularvectorsgivenbyiterativesingularvaluedecom- position(SVD)usuallyprovidedecentmodelsthatoutperformotherfactorizationmethods.This mayduetofactthatorthogonalbasisobtainedbySVDcharacterizetheoptimallinearsubspaceof thedatamatrix. Handlingmodalitieswithmissingsubjects. Inmanyapplicationsespeciallymedicalcases,some datamodalitiesmaynotbeavailabletoallsamples.Forexample,somesubjectsdidnotpartic- ipatethegeneticstudyandthuslackgenotypeinformation.Besides,inthestageofADNI studytherearenodiffusionMRIimagingavailable,leadingtostructuredmissingpatternsinthe dataset[132].Since f X i g involvedifferentsetsofsubjects,suchmissingmodalitieswillcause dimensionproblemsin U ,andthusthemodalitiescannotbeprojectedtothesame U .Oneway toovercomethisissueistodiscardallthesubjectswithmissingmodalitiesandmakethedimen- sionsconsistentacrossmodalities.However,thisapproachwillntlyreducethenumber ofsamplesandthuscompromisethepredictiveperformance.Wethereforeextendtheproposed formulationtodealwithit.Weanindicatormatrixforeachmodality,whereforthe i -th modalityitisdenotedby I i 2 R n n ,whose j -throwisgivenby: ( I i ) j = 8 > > > < > > > : 0 ifthethismodalityismissingfor j -thsubject e j otherwise ; where e j 2 R n is n -dimensionalstandardbasiswithonly j -thentryas1.Therevisedformulation 53 isgivenby: min w ; U ; f V i ; q i g t i = 1 å n j = 1 ` ( h ( U j ; w ) ; y j )+ å t i = 1 a i d ( ‹ X i ; I i Ug q i ( V i )) s.t. U 2 S 0 ; V i 2 S i ; 8 i ; (4.3) where ‹ X i isanaugmenteddatamatrix,whose j -throwisgivenby: ‹ X j i = 8 > > > < > > > : 0 ifthissubjectlacksof i -thmodality X j i (originalfeatures)otherwise : Bymultiplyingindicatorsandreplacing X i by ‹ X i ,thecorrespondingrowsofsubjectswithmissing modalitywillbe0forthismodality,whichhasnoeffectonloss.Thisapproachwouldensurethat weusealltheinformationavailableduringthelearning. ApplicationinDiseaseModeling. EventhoughtheproposedCDMFframeworkcanbeused invariousdataminingapplications,hereweemphasizeonitsadvantagesinourdisease modelingproblem.ThegoalofMCIdiagnosisistodifferentiatebetweenMCIsubjectsandnormal cognitive(NC)subjects,whichisaproblem.WethususeCDMFinEq.(4.3)witha logisticloss,inwhichknowledgefromdifferentmodalitiesisfusedinasupervisedmannersuch thatonlythepartthatismorerelevanttogroupdifferenceofMCIandNCwillbefusedtothe latentrepresentation U ,whichinturncanimproveprediction.Thispropertyisimportantforour multimodaldiseasemodelingsincethemodalitiesmaycontainknowledgethatisnotrelevanttothe desiredlearningtask.Withoutproperguidance,theirrelevantknowledgemaynegativelyimpact therepresentationleadingtosuboptimalpredictiveperformance.Forexample,brainimagingmay containinformationofotherinheritedbraindiseasesoragingproperties,likewiseforgeneticdata. 54 Ifthefusionprocessiscarriedoutinanunsupervisedmanner,wemaynotobtaina U thatismost informativeregardingtheprogressionofMCI. Associationstudyofmultiplemodalities. 
Theinteractionsbetweenlatentrepresentationsareof greatinterestsinthecommunity(e.g.,generatepredictionsintherecommendersystem),andcan revealimportantinsightsintohowdifferentmodalitiesareconnectedtoeachother.Althoughitis straightforwardinlinearcasethatwecanuseinnerproducts u i ( v j ) T ,wecannotdirectlycompute thiswayinCDMFsincethemodalitiesareconnectedthroughnon-linearnetworks.Instead,we canusethefollowingtransformedlatentfactors: Ÿ V i = f ( W ( k ; i ) f ( W ( k 1 ; i ) f ( :::; f ( W ( 1 ; i ) V i )) ; (4.4) whichisamappingmatrixthatcontainsthemodalityinformationofthecorresponding modality.Allthecolumnsofthismatrixformthefeaturespaceofthismodality.Hence, wecancalculatetheassociationofanyfeaturesbetweenanytwomodalitiesusingthetransformed latentfactors Ÿ V i .Let C i ; j ( m ; n ) denotethecosinesimilaritybetweenthe m -thcolumnfrom Ÿ V i and the n -thcolumnfrom Ÿ V j .When C i ; j ( m ; n ) islarge,the m -thfeatureof i -thmodalityishighlyrelated withthe n -thfeatureof j -thmodalityandasmall C i ; j ( m ; n ) indicatestheassociationbetween thosefeaturesisweak.Thisprovidesanoveltooltostudytheimaginggenetics,identifyinghow genotypesbrainstructuresundertasks(e.g.,MCIpredictioninourcase). 55 4.2Experiments 4.2.1Datasetandfeatures DatafromtwostagesofADNIareusedinthisstudy:ADNI1andADNI2.Detaildemographic characteristicsandmissingdatainformationarelistedinTable4.1.Wholegenomesequencing (WGS)SNPsareprovidedbyADNIandusedasgeneticmodalityinourstudy.ForMRI,ADNI1 participantsarescannedby1.5Tor3TMRIscannerwhileallADNI2participantsarescannedby 3TMRIscanner 1 .FreeSurferV5.3isadoptedtoextract333measuresincludethearea,thick- ness,corticalvolume,subcorticalvolumeandwhitemattervolumefromT1MRItoformT1 MRImodality.FordMRI,weparcellatethebraininto113corticalandsubcorticalregion-of- interests(ROIs)accordingtotheHarvardOxfordCorticalandsubcorticalProbabilisticAtlas[33]. Thenwereconstructthewhole-braintractographyusinganODF-basedprobabilisticapproach: PICo[32].Finally,abrainnetworkisgeneratedinwhichthenodesindicateROIsandtheedges aredeterminedbytheproportionofintersectingwitheachpairofROIs.Assuch,eachbrain networkisa113 113symmetricmatrixwith6328distinctedges.These6328edgesareusedas thefeaturevariablesfordMRImodality. 4.2.2Datapreprocessing Imagingmodalitiespreprocessing. ADNI1andADNI2usedifferentscannerprotocolwhich mayintroducebiasesforthedatasets.Hence,wedecidetoharmonizethecohortsbyremoving thiscohorteffect.WecreateanindicatorvariabletodifferentiateADNI1andADNI2with1forall subjectsfromADNI1and-1forallsubjectsfromADNI2.Inaddition,ageandsexarecommon confoundersbiasingtheanalysis.Inthisstudy,generalizedlinearregressionapproach[80]isused 1 http://adni.loni.usc.edu/data-samples/mri/ 56 ADNI1Cohort NCMCITotal Age 75.84 4 : 9574.48 7.4875.17 6.68 Sex 115M/108F247M/138F362M/246F totalsubjects 223385608 SubjectswithdMRI 000 SubjectswithT1MRI 223385608 Subjectswithgenotype 202348550 ADNI2Cohort NCMCITotal Age 69.36 15.4071.68 9.9370.96 11.89 Sex 22M/28F71M/41F93M/69F totalsubjects 50112162 SubjectswithdMRI 50112162 SubjectswithT1MRI 50112162 Subjectswithgenotype 2782109 Table4.1: Demographicinformationofsubjects. 
toremoveallconfoundersincludingage,sexandcohortindex.Itassumeseachobservedvariable islinearlydependentontheconfoundervariablesandageneralizedlinearmodelcanremove confounders'effect.Denotetheobservedvariableofvariable X as X obs andtheoriginalvariable as X ori .Thelineardependenceof X obs and X ori is: X obs = w 1 age + w 2 sex + w 3 cohort + X ori ; where w 1 ; w 2 ; w 3 arecoefofconfounders.Let ( w 1 ; w 2 ; w 3 ) be w and ( age i ; sex i ; cohort i ) be t i ,where i denotesthe i -thsubject.Coefcanbeobtainedbysolvingalinearregression: w = min w å n i = 1 ( w T t i X obs i ) 2 : (4.5) AftersolvingEq.(4.5),theoriginalfeaturevariableisgivenby: X ori = X obs ( w 1 age + w 2 sex + w 3 cohort ) : 57 Figure4.2: ManhattanplotforSNPswithadjusted p valuegreaterthan2. Colorsindicate differentchromosomes. WeapplythisonbothT1MRIdataanddMRIdataandwillonlyuse X ori inthedownstream experiments. Geneticmodalitypreprocessing. Geneticdataispreprocessedbystandardqualitycontrolusing PLINK 2 andthenimputeusingMaCH 3 .SNPswithminorallelefrequency(MAF)lessthan5%or missingvaluesgreaterthan5%arediscarded.Subjectswithmissingvaluesgreaterthan10%atall SNPsareremoved.Finally,659subjectswithreadingvalueson6 ; 566 ; 154SNPsareattained. Inordertoextractmorerelevantfeatures,weapplygenome-wideassociationstudy(GWAS)on ourdata.Indetail,weregresspatientstateNL/MCIoneachSNPusinglogisticregression,withp- valuegeneratedandadjustedto log 10 scale.Largeradjustedp-valueindicatesstrongassociation betweenresponseandthemarker.Figure4.2showsSNPswithadjustpvaluegreaterthan2on eachchromosome.SNPsonchromosome19havestrongerassociationwithMCIthanothers, suggestingcrucialeffectsofthischromosomeontheAlzheimer'sdeterioration.Finally,thetop 200SNPsforeachiterationareretainedasfeaturesforourdownstreamanalysis.Since SNPsarecategorical,i.e. f 0 ; 1 ; 2 g ,weusetheone-hotcodingtobethefeaturerepresentation. Hence,thefeaturedimensionforgeneticmodalityis600. 2 http://pngu.mgh.harvard.edu/purcell/plink/ 3 http://csg.sph.umich.edu/abecasis/MaCH/ 58 4.2.3Predictperformance Comp.# Shallowcollectivematrixfactorization linear sigmoid square 30 0 : 529 0 : 080 0 : 616 0 : 102 0 : 564 0 : 011 50 0 : 587 0 : 069 0 : 593 0 : 120 0 : 718 0 : 076 70 0 : 610 0 : 079 0 : 644 0 : 075 0 : 659 0 : 161 90 0 : 526 0 : 065 0 : 597 0 : 086 0 : 634 0 : 097 110 0 : 656 0 : 089 0 : 681 0 : 116 0 : 658 0 : 106 130 0 : 561 0 : 024 0 : 613 0 : 105 0 : 668 0 : 127 Comp.# Deepcollectivematrixfactorization linear sigmoid square 30 0 : 519 0 : 099 0 : 653 0 : 139 0 : 719 0 : 142 50 0 : 594 0 : 151 0 : 646 0 : 078 0 : 693 0 : 100 70 0 : 573 0 : 135 0 : 593 0 : 165 0 : 758 0 : 115 90 0 : 519 0 : 093 0 : 610 0 : 146 0 : 805 0 : 073 110 0 : 558 0 : 083 0 : 542 0 : 048 0 : 726 0 : 027 130 0 : 553 0 : 124 0 : 544 0 : 110 0 : 679 0 : 152 Comp.# Otherdeepmultimodalmethods DCCA DCCAE DNN 30 0 : 770 0 : 065 0 : 723 0 : 031 0 : 617 0 : 143 50 0 : 722 0 : 088 0 : 743 0 : 094 0 : 604 0 : 026 70 0 : 689 0 : 134 0 : 780 0 : 054 0 : 560 0 : 111 90 0 : 684 0 : 089 0 : 703 0 : 042 0 : 579 0 : 068 110 0 : 735 0 : 135 0 : 627 0 : 165 130 0 : 699 0 : 089 0 : 689 0 : 131 Table4.2: PredictionperformanceofdifferentmodelsusingADNI2'sT1MRIanddMRI intermsofAUC. Withanappropriateactivationfunctionandcomponents'number,ourmethod outperformsthanallothermethods. ]meansnotapplicableduetothealgorithmdesign. 
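The confounder-removal step of Section 4.2.2 amounts to residualizing each imaging feature on age, sex, and the cohort indicator via the least-squares fit of Eq. (4.5); a minimal sketch follows, with synthetic placeholder data.

```python
import numpy as np

def remove_confounders(X_obs, age, sex, cohort):
    """Fit X_obs ~ w1*age + w2*sex + w3*cohort by least squares (Eq. 4.5) and
    subtract the fitted part, returning the de-confounded features X_ori."""
    T = np.column_stack([age, sex, cohort])          # n x 3 confounder matrix
    W, *_ = np.linalg.lstsq(T, X_obs, rcond=None)    # 3 x d coefficients
    return X_obs - T @ W

# Illustrative call; in the dissertation this is applied to the T1 MRI and dMRI features.
rng = np.random.default_rng(0)
n, d = 300, 50
age    = rng.uniform(60, 90, n)
sex    = rng.integers(0, 2, n).astype(float)
cohort = rng.choice([-1.0, 1.0], n)                  # 1 for ADNI1, -1 for ADNI2
X_obs = rng.normal(size=(n, d)) + 0.05 * age[:, None]
X_ori = remove_confounders(X_obs, age, sex, cohort)
```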
Inthissection,weevaluatetheperformanceofourmethodandcomparewithothermethods usingADNIdataset.Thedistancemetric d ( X ; Y ) weusedinthefollowingexperimentsis k X Y k 2 F .Weperformexperimentsonthreedifferentsettings. Inthesetting,onlyADNI2datasetanditstwomodalities:T1MRIanddMRIarecovered. Inthissetting,nomodalityhasmissingsubjects.Werandomlyselect90%subjectsasthetraining setand10%subjectsasthetestingset.Ourmainassumptionisdeepmatrixfactorizationcanex- 59 Comp.# Shallowcollectivematrixfactorization linear sigmoid square 30 0 : 702 0 : 019 0 : 672 0 : 137 0 : 708 0 : 024 50 0 : 749 0 : 052 0 : 793 0 : 034 0 : 742 0 : 063 70 0 : 743 0 : 063 0 : 696 0 : 037 0 : 747 0 : 061 90 0 : 754 0 : 046 0 : 756 0 : 059 0 : 749 0 : 049 110 0 : 791 0 : 027 0 : 798 0 : 058 0 : 786 0 : 032 130 0 : 671 0 : 049 0 : 652 0 : 058 0 : 679 0 : 048 Comp.# Deepcollectivematrixfactorization linear sigmoid square 30 0 : 634 0 : 065 0 : 665 0 : 044 0 : 627 0 : 768 50 0 : 701 0 : 064 0 : 735 0 : 061 0 : 681 0 : 039 70 0 : 778 0 : 059 0 : 749 0 : 011 0 : 784 0 : 055 90 0 : 775 0 : 063 0 : 801 0 : 023 0 : 821 0 : 015 110 0 : 806 0 : 049 0 : 792 0 : 031 0 : 800 0 : 032 130 0 : 717 0 : 037 0 : 705 0 : 049 0 : 759 0 : 044 Comp.# Otherdeepmultimodalmethods DCCA DCCAE DNN 30 0 : 801 0 : 101 0 : 737 0 : 063 0 : 758 0 : 098 50 0 : 732 0 : 041 0 : 753 0 : 014 0 : 767 0 : 069 70 0 : 788 0 : 084 0 : 813 0 : 047 0 : 756 0 : 087 90 0 : 746 0 : 159 0 : 750 0 : 124 0 : 757 0 : 078 110 0 : 759 0 : 151 0 : 780 0 : 058 0 : 754 0 : 070 130 0 : 739 0 : 183 0 : 774 0 : 074 0 : 754 0 : 056 Table4.3: PredictionperformanceofdifferentmodelsusingADNI2andADNI1'sT1MRI anddMRIintermsofAUC. AlthoughdMRImodalitylacksofalargenumberofsubjects,per- formanceisstillimprovedalotcomparedwiththatonlyusesADNI2data. 60 Components# Shallowcollectivematrixfactorization linear sigmoid square 30 0 : 684 0 : 051 0 : 658 0 : 039 0 : 766 0 : 115 50 0 : 767 0 : 019 0 : 772 0 : 032 0 : 818 0 : 076 70 0 : 763 0 : 059 0 : 759 0 : 020 0 : 797 0 : 049 90 0 : 772 0 : 070 0 : 775 0 : 030 0 : 767 0 : 081 110 0 : 822 0 : 018 0 : 795 0 : 005 0 : 803 0 : 014 130 0 : 702 0 : 067 0 : 669 0 : 055 0 : 689 0 : 071 Components# Deepcollectivematrixfactorization linear sigmoid square 30 0 : 632 0 : 019 0 : 665 0 : 042 0 : 670 0 : 052 50 0 : 707 0 : 054 0 : 737 0 : 064 0 : 719 0 : 073 70 0 : 781 0 : 065 0 : 750 0 : 010 0 : 799 0 : 040 90 0 : 784 0 : 071 0 : 797 0 : 019 0 : 852 0 : 018 110 0 : 811 0 : 047 0 : 782 0 : 030 0 : 779 0 : 008 130 0 : 728 0 : 048 0 : 705 0 : 055 0 : 725 0 : 105 Table4.4: Predictionperformanceoffusinggeneticknowledgeandimagingknowledgeus- ingADNI1andADNI2intermsofAUC. Geneticmodalitycanbesuccessfullyintegratedwith imagingmodalities. Components# 30 50 70 DNN 0 : 674 0 : 114 0 : 666 0 : 108 0 : 669 0 : 119 Components 90 110 130 DNN 0 : 667 0 : 090 0 : 656 0 : 080 0 : 671 0 : 098 Table4.5: PredictionperformanceofDNNusingADNI1andADNI2intermsofAUC. Genetic modalitycanbesuccessfullyintegratedwithimagingmodalities. tracthigh-levelnonlinearfeaturestoimprovediagnosisperformance.Inordertoproveit,wecom- paredeepmodelswithshallowmodels,i.e.onelayermatrixfactorization,andcomparenonlinear modelswithlinearmodels.Twomainnonlinearfunctionsareusedinourexperiments: sigmoid ( x ) and x 2 .Indeepmodels,wefocusonthosewithtwohiddenlayers.Aftersomepreliminaryexper- iments,wethelayer'scomponentstobe162,i.e. V j 2 R 162 d j for j = 1 ; 2 ;:::; t andvary secondlayer'scomponentsfrom30to130,i.e. W 1 ; j 2 R r 162 where r 2f 30 ; 50 ; 70 ; 90 ; 110 ; 130 g . 
Hence, U 2 R n j r .Howthenewfeatures'dimensionaffectsperformancecanbetracedbyvarying r .WereportaverageareaunderROCcurve(AUC)overthreeiterationsinTable4.2.Weimple- 61 mentedtheproposedmodelusingTensorFlow[1].AlltheexperimentswererunonGT1080or TitanX.Ittakesapproximately3minutestotrainonemodel. Whenusing x 2 asactivationfunctionandsettingcomponentsnumbertobe90,ourmodel outperformsallothermodels.Weobservewhentheactivationfunctionisinappropriate,i.e, sigmoid ( x ) forourcase,theAUCisverylow.Hence,choosingasuitableactivationfunction isveryimportant.Onlycertainnonlinearfunctionscancorrectlythisdatasetandextractthe desiredfeatures.Also,wedthenumberofcomponentsiscrucialforalldifferentmodels.An inappropriatenumberofcomponentswillreducetheperformancedrastically.Whenthenumber ofcomponentsistoosmall,newfeaturerepresentationisnotrichenoughtocapturethecomplex hiddeninformation.Butwhenthisnumberbecomestoolarge,theycontaintoomanyredundant features.Sincesamplesizeisnotlargeenough,itcausesovandreducestestingperfor- mance.Wealsocompareourmethodwiththreestate-of-the-artmultimodallearningalgorithms: DCCA,DCCAE anddeepneuralnetwork .Sincetrainingsamplesizeis90,whenthecomponents numberofnewfeaturerepresentationislargerthan90,DCCA'scode 4 reportserror.Hence,weset ittobe f 30 ; 50 ; 70 ; 90 g forDCCA. Thedeepneuralnetworkhastwoparts.Thepartisused toremovemodalityinformation.Ithastwotwo-layersub-networkscorrespondingtotwo modalities.Thelayeristheinputlayer.Tomakethenetworkconsistent.Thesecondlayer contains162neuronsforeachsub-network.Theoutputsoftwosub-networksareconcatenateto avectorandusedastheinputofthesecondpartofthewholenetworktofuseknowledgeand implementtasks.Thesecondparthasthreelayers.Thelayeristheinputlayer wheretheoutputofthepartisfed.Thesecondlayercontains{30 ; 50 ; 70 ; 90 ; 110 ; 130}units. Thethirdlayerisalogisticregressionlayer.Tocomparewithourmodel,thetwopartsarejointly trained. Theresultsarereportedinthelast three columnsinTable4.2.Ourmethodoutperforms 4 http://ttic.uchicago.edu/wwang5/dccae.html 62 allbaselines. Inthesecondsetting,weincludeallADNI1subjects'imagingdataintothetrainingset.Com- paredwiththesetting,dMRImodalityhasalotofmissingsubjectsinthissetting.Also,this setting'strainingsamplesizeismuchlargerthanthepreviousone.Inordertocomparetheper- formanceofthesetwosettings,thetestingdatasetandalltheothermodelsettingsarethesame asinthesetting.Since DNN, DCCAandDCCAEcannotdealwithmodalitywithmissing subjects, weallthemissingvalueswiththemeanoverallavailablesamplesforeachmodality. AverageAUCisreportedinTable4.3forallmodelsandsimilartrendsareobservedintheseresults withthoseinthesetting.Moreover,weunderthesameexperimentsettings,almostall models'performanceishigherthanthatofthepreviousone.Itshowsourextendedformulation cansuccessfullydealwithmodalitywithmissingsubjectsandleveragepartialknowledgeinthis modalitytogreatlyimproveoverallperformance. Inthelastsetting,weincludegeneticmodalityasthethirdmodalityandfusegeneticknowledge andimagingknowledgetoimprovediagnosisperformance.WepreformGWASoneachiteration's trainingsettoselectSNPsinvolvedinourexperiment.Tocomparewiththesecondsetting,allthe modelsettingsarethesameasinprevioussettings.AverageAUCisreportedinTable4.4and TabRefDNN. SinceDCCAandDCCAEcannotdealwiththreemodalities,weonlyuseDNNas baseline. 
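For concreteness, a sketch of the DNN baseline described above: one two-layer sub-network per modality with 162 hidden units, concatenation of the sub-network outputs, a fusion hidden layer, and a logistic output layer, trained jointly. The activation function, the 90-unit fusion width, and the use of PyTorch rather than the authors' TensorFlow implementation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DNNBaseline(nn.Module):
    """Two modality sub-networks -> concatenation -> fusion layer -> logistic output."""
    def __init__(self, d1, d2, hidden=162, fused=90):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Linear(d1, hidden), nn.ReLU())
        self.branch2 = nn.Sequential(nn.Linear(d2, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(2 * hidden, fused), nn.ReLU(),
                                  nn.Linear(fused, 1))       # logits for the logistic layer

    def forward(self, x1, x2):
        z = torch.cat([self.branch1(x1), self.branch2(x2)], dim=1)
        return self.head(z).squeeze(1)

# T1 MRI has 333 features and dMRI has 6328 edge features in this study.
model = DNNBaseline(d1=333, d2=6328)
loss_fn = nn.BCEWithLogitsLoss()                             # both parts trained jointly
```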
Withthesametrainingsamplesize, DNN'sperformanceismuchworsethanthatof previoussetting,whichimpliesconcatenatingalltheoutputofeachsub-networkasfusionmethod doesnotworkforthiscase.Thatisbecausefeaturesfromthegeneticmodalityarediscreteand thematrixisverysparse,whilefeaturesfortwoimagingmodalitiesarecontinuesandthematrices areextremelydense.Theyhavedifferentstatisticalproperties.However,forourmethod,the performanceforthissettingismuchbetterthanthatofthesecondsetting,whichimpliesgenetic modalitycanbesuccessfullyintegratedwithimagingmodalitiesbyourmethodeventhoughthe 63 modalitiesareradicallydifferent. Figure4.3: BrainmapsoftheelevelateachROIforthemostassociatedSNP withinthatROI. Figure4.4: Testingperformancewithvarying a parameters. Atlast,wereportsparselogisticregressionresultsoneachsinglemodalityassinglemodality baselines.TheresultsareshowninTable4.6.ExperimentsonADNI2datasethavethesame trainingtestingsplittingmethodasthesettingandexperimentsonADNI1+ADNI2dataset havethesamesplittingwayasthesecondsetting.Weseesinglemodality'saverageAUCis lowerthanthehighestAUCinallthreesettings.Hence,onlybyfusingknowledgefromdifferent modalitiescanweachievedescentperformance. 64 4.2.4Effectsofknowledgefusionparameters Knowledgefusionparameterscontrolhowmuchknowledgeamodalityisfusedintomodality invariantterm.Inthissection,weshowhowtheseparametersaffectperformance.Let a 1 , a 2 , a 3 betheparameterstocontrolknowledgefusionofdMRI,T1MRIandSNPsrespectively.The trainingsetandtestingsetaresplitinthesamewayasthethirdsettinginthelastsection.Wefocus ondeepmodelwith2hiddenlayers,with x 2 asactivationfunction.Thecomponentsofthe layerandthesecondlayeris162and90respectively. We a 1 and a 2 tobe1andvary a 3 toseehow a 3 affectsperformance.Theresultsis showninFigure4.4inblueline.Weseewhenweincrease a 3 ,theperformanceincreases slightly.Butwhen a 3 islargerthan0.1,theperformancedecreasesveryfastifwecontinuein- creasingit.Thatisbecausegeneticmodalityisnoisierthanimagingmodalities.Withasmall a 3 ,i.e.0 : 1,thismodelcantolerantalargerreconstructionerrorforgeneticmodality.Hence, themodelisrobusttothenoiseingeneticmodality.When a 3 becomeslarger,thereconstruction errorofgeneticmodalitymustbesmallinordertoachievealowtotalloss.Morenoisedistorts U , whichreducestheperformance.Butwhen a 3 istoosmall,someusefulknowledgeofthismodality cannotallbefusedto U ,whichalsoreducestheperformance.Hence,onlywithasuitablefusion parametercanthemodelcorrectlyfusesalltheusefulknowledgeofgeneticmodality.Next,we a 3 , a 1 tobe0.1and1receptivelyandvary a 2 toseehow a 2 affectsperformance.Wealso a 3 , a 2 tobe0.1and1respectivelyandvary a 1 toseetheeffectsofchanging a 1 .Theresults areshowninFigure4.4ingreenlineandredline.Thesetwoareverysimilartoeachothersince theybothcontrolknowledgefusionofimagingmodalities.Weseewhen a 1 and a 2 reach1,the performancereachesthehighest.Hence,imagingmodalitiesneedtocontributemoreknowledge to U thangeneticmodalitytomakeabetterperformance. 65 ADNI2 T1MRI dMRI SNPs AUC 0 : 71 0 : 04 0 : 63 0 : 07 0 : 63 0 : 14 ADNI1+ADNI2 T1MRI dMRI SNPs AUC 0 : 72 0 : 06 - 0 : 67 0 : 27 Table4.6: Resultsofapplyingsparselogisticregressiononeachsinglemodalityintermsof AUC. ADNI1studydidnotcollectdMRI. 4.2.5Imaging-geneticsassociation Inthissection,wepresentimaging-geneticsassociationuncoveredbymodalitycompo- nents.WecomputetheassociationbetweenSNPswithcorticalthicknessandareaon68ROIs. 
ThisassociationindicateshowabrainimagingfeatureisassociatedwithaSNPun- derthetaskofpredictingMCI.InFigure4.3,weshowthemapofthelevelateach ROIforthemostassociatedSNPwithinthatROI.Thetwoarebasedoncortical areafeaturesforleftandrightbrainrespectivelyandthelasttwoareforcorticalthick- nessfeatures.Warmercolorsrepresentstrongerassociationandcoolercolorsindicatetheoppo- site.OurresultsshowthattherearesomeclusterpatternswhichindicatethoseROIsarehighly relatedtoeachotherinrespectofMCI.Top6T1MRIfeaturesare:rightcuneusthick- ness,rightparahippocampalarea,rightposteriorcingulatethickness,leftparsopercularisarea, leftcuneusthicknessandrightfrontalpolethickness.Amongthosefeatures,cuneusthickness, posteriorcingulatethickness,frontalpolethicknessandparahippocampalregionaresig- associatedwithMCI[82,26,45,34].TheSNPsmostrelatedtothese6featuresare: rs10414043,rs429358,rs429358,rs8141950,rs11178933,rs10414043respectively.AlltheSNPs exceptrs8141950arelocatedatChromosome19whichhasbeentobehighlyassociated withMCIandAD[30,71].Especially,rs429358locatesinthefourthexonoftheAPOEgene[57] inChromosome19,whichhasbeenextensivelyreportedasthegeneticriskfactorforthelate-onset 66 ofAD.rs8141950,locatedonChromosome22,hasalsobeenfoundtobecloselyrelatedtoAD[5]. Thisshowsthatourmethodcancorrectlyuncoverimaging-geneticassociationinrespectofMCI. Thisassociationcanbeusedtoanalyzehowthegenotypebrainstructuresandprovide apotentialwaytoexplorethemechanismbehindMCIandAD. 4.3Summary Inthiswork,weproposedcollectivedeepmatrixfactorizationtofuseknowledgefromdiffer- entmodalities.,webuilduniformnonlinearhierarchicaldeepmatrixfactorization frameworkacrossdifferentmodalitieswhichdecomposeseachmodalityintoamodality componentandamodalityinvariantcomponentthatservesasalearnedfeaturerepresentation. Wealsoaddsupervisiononthemodalityinvariantcomponenttoguidethelearningprocess.The proposedmethodcanexploitcomplicatednon-linearinteractionsamongdifferentmodalitiesand learnafeaturerepresentationwhichiscompactandmorerelevanttoourpredictiveproblem.Also, themodalitytermcanbeusedtouncovercomplicatedimaging-geneticassociations.We performextensiveexperimentsonADNIdatasetandshowtheproposedmethodim- provespredictiveperformance. 67 Chapter5 MultimodalInformationBottleneck Inmanyproblems,thepredictionscanbeenhancedbyfusinginformationfromdif- ferentdatamodalities.Inparticular,whentheinformationfromdifferentmodalitiescomplement eachother,itisexpectedthatmultimodallearningwillleadtoimprovedpredictiveperformance. Inthispaper,weproposedasupervisedmultimodallearningframeworkbasedontheinformation bottleneckprincipletooutirrelevantandnoisyinformationfrommultiplemodalitiesand learnanaccuratejointrepresentation.lly,ourproposedmethodmaximizesthemutual informationbetweenthelabelsandthelearnedjointrepresentationwhileminimizingthemutual informationbetweenthelearnedlatentrepresentationofeachmodalityandtheoriginaldatarep- resentation.Astherelationshipsbetweendifferentmodalityareoftencomplicatedandnonlinear, weemployeddeepneuralnetworkstolearnthelatentrepresentationandtodisentangletheircom- plexdependencies.However,sincethecomputationofmutualinformationcanbeintractable, weemployedthevariationalinferencemethodtoefsolvetheoptimizationproblem.We performedextensiveexperimentsonvarioussyntheticandreal-worlddatasetstodemonstratethe effectivenessoftheframework. 68 5.1Methodology 5.1.1InformationBottleneckMethod Informationbottleneck[108]isanapproachbasedoninformationtheory.Itformalizestheintuitive ideasaboutinformationtoprovideaquantitativemeasureoffimeaningful"andfirelevant"[108]. 
Itprovidesatradeoffbetweenaccuracyandcomplexity.Thismethodhasbeenwidelyusedin clustering[98,109,42],ranking[50]and[97].Exactsolutiondoesnotexistifthe latentrepresentationislearnedbydeepneuralnetworks.In[7],theauthorsappliedinformation bottlenecktosingle-modallearningandproposedtousethevariationalmethodtooptimizeit. Insteadofdirectlysolvingtheoptimizationproblemofinformationbottleneck,theauthors calculatedalowerboundoftheoriginaltarget.Thenthelowerboundwasmaximizedtopushthe resultsclosertotheoptimalsolutiontotheoriginaloptimizationproblem.Thedistributionsofthe posteriorswerelearnedbytheneuralnetworks.Themethodalsoutilizedthereparameterization trickforeftraining.Informationbottleneckisalsousedinmultimodallearning.In[127], theauthorsproposedtouseinformationbottlenecktolearnajointlatentrepresentation.Thejoint latentrepresentationwasacombinationofthelinearprojectionofallthemodalities.Theprojection matriceswerelearnedbytheinformationbottleneckapproach.Althoughtheapproachachieves decentresults,itislimitedtolinearprojection.Therefore,weproposeanonlineardeepversionof multimodalinformationbottlenecktoovercomethislimitation. Informationbottleneckisaninformation-basedapproachtothebesttradeoffbetweenthe accuracyandcomplexity.Givendata X withlabels Y ,informationbottleneckaimstoacon- ciseandaccuratelatentrepresentationof X .Denotethelatentrepresentationas Z .Information 69 bottlenecksolvesthefollowingoptimizationproblem: max Z I ( Y ; Z ) s.t. I ( X ; Z ) g ; (5.1) where I ( Y ; Z ) isthemutualinformationbetween Y and Z whereas I ( X ; Z ) isthemutualinformation between X and Z .Themutualinformationbetweenanytworandomvariables X and Y is as: I ( X ; Y )= Z Y Z X p ( x ; y ) log ( p ( x ; y ) p ( x ) p ( y ) ) dxdy ; where p ( x ; y ) isthejointprobabilitydensityfunctionof X and Y while p ( x ) and p ( y ) arethe marginalprobabilitydensityfunctionsof X and Y . Eq.(5.1)maximizesthemutualinformationbetween Y and Z tomakesurethelearned Z containsinformationabout Y asmuchaspossible.Ifthereisnoconstrainton Z ,thesolutionwould be Z = X .Butinmostcases, X containsnoiseorotherirrelevantinformationto Y .Therefore,a constraintmustbeappliedto Z toensurethatthelearned Z providesaconciserepresentationthat containslessnoiseandirrelevantinformationcomparedwith X .Thisconstraintreducesthemodel complexityandimprovesthemodel'sgeneralizationability.Eq.(5.1)canalsoberelaxedtothe followingformulation: max Z I ( Y ; Z ) a I ( X ; Z ) ; where a isaregularizationparametertocontrolthetradeoffbetween I ( Y ; Z ) and I ( X ; Z ) . 70 5.1.2Deepmultimodalinformationbottleneck. Inmultimodallearning,informationbottleneckcanbeusedtolearnthejointdiscriminativerep- resentationasitcanremovetheirrelevantinformationandnoiseofeachmodality.Sincefor real-worlddata,therelationbetweenmultiplemodalityarelikelytobenonlinearandcomplex, inthispaper,weproposeadeepmultimodalinformationbottleneckmethodtomaptheoriginal representationtoanonlinearrepresentationthatcanmakesubjectseasiertobeseparated. 
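Sections 5.1.2 through 5.1.4 below formalize the proposed model and derive its variational training loss; as a preview of where that derivation lands, the following PyTorch sketch implements a two-modality instance with Gaussian encoders $p(z_i \mid x_i)$ trained via the reparameterization trick, standard-normal variational priors $r_i(z_i)$, and a cross-entropy term for the labels. The fusion step is simplified here to a deterministic network on the concatenated codes, whereas the derivation below also treats $p(z \mid z_1, z_2)$ as a learned Gaussian; all layer sizes and the weights $\alpha$, $\beta$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Gaussian encoder p(z_i | x_i) = N(mu(x_i), diag(exp(log_var(x_i))))."""
    def __init__(self, d_in, d_z, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, d_z)
        self.log_var = nn.Linear(hidden, d_z)
    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.log_var(h)

def reparameterize(mu, log_var):
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # z = mu + sigma * eps

def kl_std_normal(mu, log_var):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), used when r_i(z_i) = N(0, I).
    return 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=1).mean()

enc1, enc2 = Encoder(100, 16), Encoder(80, 16)                    # input sizes are assumptions
fuse = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))  # fusion f_theta + classifier
alpha, beta = 1e-2, 1e-2

def loss_fn(x1, x2, y):
    """y: LongTensor of class indices. Returns the negated variational IB objective."""
    mu1, lv1 = enc1(x1); z1 = reparameterize(mu1, lv1)
    mu2, lv2 = enc2(x2); z2 = reparameterize(mu2, lv2)
    logits = fuse(torch.cat([z1, z2], dim=1))                     # joint representation Z
    return (nn.functional.cross_entropy(logits, y)                # bound on -I(Y; Z)
            + alpha * kl_std_normal(mu1, lv1)                     # bound on I(X1; Z1)
            + beta * kl_std_normal(mu2, lv2))                     # bound on I(X2; Z2)
```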
Giventwomodalities X 1 ; X 2 andtheclasslabels Y ,theproposedmethodaimstolearnajoint representation Z tofusetheinformationfromallmodalities.Themodelcontainstwoparts.The partistolearnthehiddenrepresentationsfromallthemodalities.Eachmodalityhasone hiddenrepresentation.Thispartistoremovethenoiseandirrelevantinformationfrom X 1 and X 2 asmuchaspossibletomakesurethelearnedrepresentationsareveryconcise.Weuse Z 1 and Z 2 to denotethehiddenrepresentationsfor X 1 and X 2 ,respectively.Thesecondpartistofusethehidden representationsusinganeuralnetworkas Z = f q ( Z 1 ; Z 2 ) ; (5.2) where f denotethenetworkand q thenetworkparameter.Thispartistotransferknowledge fromallmodalitiesandlearnajointrepresentation Z .Thesetwopartsarelearnedjointlybythe informationbottleneckas max Z ; Z 1 ; Z 2 I ( Y ; Z ) a I ( X 1 ; Z 1 ) b I ( X 2 ; Z 2 ) ; (5.3) s.t. Z = f q ( Z 1 ; Z 2 ) ; where a and b areregularizationparameters.Thettermistomaximizethemutualinformation 71 betweenthejointrepresentationandthelabel Y tomakesurethelearnedjointrepresentation arediscriminativeaccordingtotheclasslabels.Thelasttwotermsaretominimizethemutual informationbetweenthelatentrepresentationofeachmodalityanditsoriginaldatarepresentation. Thesetwotermsreducethemodelcomplexitytomakethemodelmoregeneralizable,sincethey canouttheirrelevantandnoisyinformation. 5.1.3Optimization ThemajorchallengeofsolvingEq.(5.3)isthatthemutualinformationtermsarecomputationally intractable.Recently,variationalmethods[59,7,38]arewidelyusedtodealwithsuchproblems. Variationmethodsmaximizethevariationallowerboundsoftheobjectivefunctionsinsteadof directlymaximizingthem.Thesemethodsusesomeknowndistributionstoapproximatethein- tractabledistributions,andprovidelowerboundsoftheoriginalobjectivefunctions.Byincreasing thelowerbounds,wecanobtainapproximatesolutionstotheoriginalobjectivefunctions.To obtainthevariationallowerboundofEq.(5.3),weneedtothejointprobabilitydensity functionofallthevariablesincludingthelatentvariables.UsingBayes'rule,thejointprobability densityfunctionof X 1 ; X 2 ; Z 1 ; Z 2 ; Y ; Z canbeexpressedas p ( x 1 ; x 2 ; z 1 ; z 2 ; y ; z )= p ( z j z 1 ; z 2 ; x 1 ; x 2 ; y ) p ( z 1 j z 2 ; x 1 ; x 2 ; y ) p ( z 2 j x 1 ; x 2 ; y ) p ( x 1 ; x 2 ; y ) : (5.4) Since Z 1 islearntfrom X 1 ,wethusassumegiven X 1 , Z 1 isindependentof Z 2 ; X 2 ; Y .Similarly, weassumegiven X 2 , Z 2 isindependentof X 1 ; Y ,andgiven Z 1 ; Z 2 , Z isindependentof X 1 ; X 2 ; Y . 
72 Therefore,wehavethefollowingequalities: p ( z 1 j z 2 ; x 1 ; x 2 ; y )= p ( z 1 j x 1 ) ; p ( z 2 j x 1 ; x 2 ; y )= p ( z 2 j x 2 ) ; p ( z j z 1 ; z 2 ; x 1 ; x 2 ; y )= p ( z j z 1 ; z 2 ) : Usingtheseassumptions,thejointprobabilitydensityfunctioncanbeas p ( x 1 ; x 2 ; z 1 ; z 2 ; y ; z )= p ( z j z 1 ; z 2 ) p ( z 1 j x 1 ) p ( z 2 j x 2 ) p ( x 1 ; x 2 ; y ) : (5.5) First,letusstartwith I ( Y ; Z ) .Since p ( y j z ) isintractable,weuseadistribution q ( y j z ) ,whichwillbe learnedfromthenetwork,toapproximate p ( y j z ) .TheKL-divergencebetween p ( y j z ) and q ( y j z ) is alwaysnon-negative.Therefore,wehave KL [ p ( y j z ) ; q ( y j z )] 0 ) Z dydzp ( y ; z ) log ( p ( y j z )) Z dydzp ( y ; z ) log ( q ( y j z )) : (5.6) Themutualinformationbetween Y and Z is I ( Z ; Y )= Z dydzp ( y ; z ) log p ( y ; z ) p ( y ) p ( z ) = Z dydzp ( y ; z ) log p ( y j z ) p ( y ) : 73 UsingEq.(5.6),wehave I ( Y ; Z ) Z dydzp ( y ; z ) log q ( y j z ) p ( y ) = Z dydzp ( y ; z ) log q ( y j z ) Z dyp ( y ) log p ( y ) : Since R dyp ( y ) log p ( y ) istheentropyofthelabels,andthistermhavenoeffectontheoptimiza- tion,wecandirectlydropit.Therefore,thevariationlowerboundof I ( Y ; Z ) is I ( Y ; Z ) Z dydzp ( y ; z ) log q ( y j z ) = Z dydzdx 1 dx 2 dz 1 dz 2 p ( x 1 ; x 2 ; z 1 ; z 2 ; y ; z ) log q ( y j z ) : ByusingthejointprobabilitydensityfunctioninEq.(5.5),thevariationallowerboundofthe mutualinformationbetween Z and Y canbewrittenas I ( Y ; Z ) Z dx 1 dx 2 dyp ( x 1 ; x 2 ; y ) Z dzdz 1 dz 2 p ( z j z 1 ; z 2 ) p ( z 1 j x 1 ) p ( z 2 j x 2 ) log q ( y j z ) : (5.7) Next,weneedtotheupperboundof I ( X 1 ; Z 1 ) .Since p ( z 1 ) isintractable,weuse r 1 ( z 1 ) to approximate p ( z 1 ) .Similarly,weusethepropertyoftheKL-divergencebetween p ( z 1 ) and r 1 ( z 1 ) . 
$$KL[p(z_1), r_1(z_1)] \ge 0 \;\Rightarrow\; \int dz_1\, p(z_1) \log p(z_1) \ge \int dz_1\, p(z_1) \log r_1(z_1).$$
Therefore, the mutual information between $Z_1$ and $X_1$ satisfies
$$I(Z_1; X_1) = \int dz_1\, dx_1\, p(x_1, z_1) \log \frac{p(z_1 \mid x_1)}{p(z_1)} \le \int dz_1\, dx_1\, p(x_1, z_1) \log \frac{p(z_1 \mid x_1)}{r_1(z_1)} = \int dx_1\, dx_2\, dy\, dz_1\, p(x_1, x_2, z_1, y) \log \frac{p(z_1 \mid x_1)}{r_1(z_1)}.$$
Using the assumption that, given $x_1$, $z_1$ is independent of all other variables, we have
$$I(Z_1; X_1) \le \int dx_1\, dx_2\, dy\, p(x_1, x_2, y) \int dz_1\, p(z_1 \mid x_1) \log \frac{p(z_1 \mid x_1)}{r_1(z_1)}. \qquad (5.8)$$
Similarly, for $I(Z_2; X_2)$, we have
$$I(Z_2; X_2) \le \int dx_1\, dx_2\, dy\, p(x_1, x_2, y) \int dz_2\, p(z_2 \mid x_2) \log \frac{p(z_2 \mid x_2)}{r_2(z_2)}. \qquad (5.9)$$
With Eq. (5.7), Eq. (5.8) and Eq. (5.9), the variational lower bound is:
$$I(Y; Z) - \alpha I(X_1; Z_1) - \beta I(X_2; Z_2) \ge \int dx_1\, dx_2\, dy\, p(x_1, x_2, y) \Big\{ \int dz\, dz_1\, dz_2\, p(z \mid z_1, z_2)\, p(z_1 \mid x_1)\, p(z_2 \mid x_2) \log q(y \mid z) - \alpha \int dz_1\, p(z_1 \mid x_1) \log \frac{p(z_1 \mid x_1)}{r_1(z_1)} - \beta \int dz_2\, p(z_2 \mid x_2) \log \frac{p(z_2 \mid x_2)}{r_2(z_2)} \Big\}.$$
The integral over $x_1, x_2$ and $y$ can be approximated by Monte Carlo sampling [94]. Therefore,
$$I(Y; Z) - \alpha I(X_1; Z_1) - \beta I(X_2; Z_2) \ge \frac{1}{N} \sum_i^N \Big\{ \int dz\, dz_1\, dz_2\, p(z \mid z_1, z_2)\, p(z_1 \mid x_1)\, p(z_2 \mid x_2) \log q(y \mid z) - \alpha \int dz_1\, p(z_1 \mid x_1) \log \frac{p(z_1 \mid x_1)}{r_1(z_1)} - \beta \int dz_2\, p(z_2 \mid x_2) \log \frac{p(z_2 \mid x_2)}{r_2(z_2)} \Big\},$$
where $N$ is the sample size of the total sampled data. Next, we assume $p(z_1 \mid x_1)$, $p(z_2 \mid x_2)$ and $p(z \mid z_1, z_2)$ are Gaussian. The means and variances of the Gaussian distributions are all learned by deep neural networks, i.e.,
$$p(z_1 \mid x_1) = \mathcal{N}(\mu_1(x_1; \phi_1), \Sigma_1(x_1; \phi_1)), \quad p(z_2 \mid x_2) = \mathcal{N}(\mu_2(x_2; \phi_2), \Sigma_2(x_2; \phi_2)), \quad p(z \mid z_1, z_2) = \mathcal{N}(\mu(z_1, z_2; \theta), \Sigma(z_1, z_2; \theta)),$$
where $\mu_1, \mu_2, \mu$ and $\Sigma_1, \Sigma_2, \Sigma$ are the networks that learn the means and variances for $p(z_1 \mid x_1)$, $p(z_2 \mid x_2)$ and $p(z \mid z_1, z_2)$, and $\phi_1, \phi_2$ and $\theta$ are the parameters of the networks that learn $p(z_1 \mid x_1)$, $p(z_2 \mid x_2)$ and $p(z \mid z_1, z_2)$, respectively. Since $z_1$, $z_2$ and $z$ are all random variables, backpropagating through those random variables may cause problems. Therefore, we use the reparameterization trick here, i.e.,
$$z_1 = \mu_1(x_1; \phi_1) + \Sigma_1(x_1; \phi_1)\, \epsilon_1, \quad z_2 = \mu_2(x_2; \phi_2) + \Sigma_2(x_2; \phi_2)\, \epsilon_2, \quad z = \mu(z_1, z_2; \theta) + \Sigma(z_1, z_2; \theta)\, \epsilon,$$
where $\epsilon, \epsilon_1, \epsilon_2 \sim \mathcal{N}(0, I)$. By using this reparameterization trick, the randomness is transferred to $\epsilon, \epsilon_1, \epsilon_2$, which do not affect the backpropagation. Therefore, the loss is
$$\max \frac{1}{N} \sum^N \Big\{ \mathbb{E}_{\epsilon}\mathbb{E}_{\epsilon_1}\mathbb{E}_{\epsilon_2} \log q(y \mid z) - \alpha\, \mathbb{E}_{\epsilon_1} \log \frac{p(z_1 \mid x_1)}{r_1(z_1)} - \beta\, \mathbb{E}_{\epsilon_2} \log \frac{p(z_2 \mid x_2)}{r_2(z_2)} \Big\}. \qquad (5.10)$$
Three Monte Carlo sampling procedures are used here to approximate the integrals. $p(z_1 \mid x_1)$ and $p(z_2 \mid x_2)$ are both learned by neural networks. Note that the first term in Eq. (5.10) is the cross-entropy between $y$ and the class probabilities predicted from $z$. Thus, we can use a deep neural network with a softmax layer as output to calculate the class probabilities and the cross-entropy loss.
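The sketch below shows how Eq. (5.10) can be turned into a trainable loss under these Gaussian assumptions, assuming $r_1, r_2$ are standard normal priors. It is a minimal PyTorch illustration, not the exact training code of the dissertation; for compactness the KL terms use the closed-form Gaussian KL, which equals the Monte Carlo terms $\mathbb{E}_{\epsilon_i} \log [p(z_i \mid x_i)/r_i(z_i)]$ in expectation.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps the sampling step differentiable
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def deep_ib_loss(logits, y, mu1, logvar1, mu2, logvar2, alpha, beta):
    """Negative of the objective in Eq. (5.10), to be minimized.

    logits are the classifier outputs for q(y|z) computed on the fused,
    reparameterized z; (mu_i, logvar_i) parameterize p(z_i | x_i).
    """
    ce = F.cross_entropy(logits, y)   # = -E log q(y|z) under one Monte Carlo sample
    kl1 = 0.5 * (mu1.pow(2) + logvar1.exp() - logvar1 - 1).sum(dim=1).mean()
    kl2 = 0.5 * (mu2.pow(2) + logvar2.exp() - logvar2 - 1).sum(dim=1).mean()
    return ce + alpha * kl1 + beta * kl2
```

In a training step, each modality's encoder would produce (mu, logvar), reparameterize would draw $z_1, z_2$, the fusion network would produce $z$ and the class logits, and deep_ib_loss would be backpropagated.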
5.1.4 Generalize to multiple modalities

The proposed deep multimodal information bottleneck framework can be easily generalized to settings with more than 2 modalities by adding the corresponding information constraint terms. Given $v$ modalities $\{X_1, X_2, \ldots, X_v\}$, the formulation of the proposed method is
$$\max_{Z, Z_1, Z_2, \ldots, Z_v} \; I(Y; Z) - \sum_i^v \alpha_i I(X_i; Z_i), \qquad (5.11)$$
where $Z_i$ is the latent representation of $X_i$ and $\alpha_i$ is the regularization parameter that regularizes the mutual information between $X_i$ and $Z_i$. Following the procedures in Section 5.1.3, the loss for Eq. (5.11) is
$$\max \frac{1}{N} \sum^N \Big\{ \mathbb{E}_{\epsilon}\mathbb{E}_{\epsilon_1}\mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{\epsilon_v} \log q(y \mid z) - \sum_i^v \alpha_i\, \mathbb{E}_{\epsilon_i} \log \frac{p(z_i \mid x_i)}{r_i(z_i)} \Big\}, \qquad (5.12)$$
where $\epsilon, \epsilon_1, \epsilon_2, \ldots, \epsilon_v \sim \mathcal{N}(0, I)$. The $r_i(z_i)$ are assumed to be $r_i(z_i) \sim \mathcal{N}(0, I)$, and each $p(z_i \mid x_i)$ is Gaussian with $\mu$ and $\Sigma$ learned by a deep neural network.

5.2 Experiments

In this section, we present the experimental results on synthetic and real-world datasets. The baseline algorithms used for comparison include linear CCA [27], DCCA [9], DCCAE [119], and the fully-connected deep neural network (DNN), which uses two fully-connected neural networks to directly extract latent representations $Z_1, Z_2$ and then uses a deep neural network to fuse $Z_1$ and $Z_2$ to make the prediction. One intuitive baseline for multimodal learning is to concatenate the features from all the modalities and treat the concatenated features as one modality. In our experiments, we use the single-modal information bottleneck [7] as the model for this baseline and denote this baseline as singlemodal12. We also provide the results of single-modal learning using the information bottleneck [7] for each modality, and use singlemodal1 and singlemodal2 to denote the baselines using the first modality and the second modality. We denote the proposed method as deepIB.

5.2.1 Synthetic datasets

The data are synthesized in the following way. First, we sample $2n$ points from two Gaussian distributions, i.e., $\mathcal{N}(0.5e, I)$ and $\mathcal{N}(-0.5e, I)$, to form $Z$. Samples from each distribution form one class. Each class has $n$ data points. Then, we directly use $Z$ to generate $X_1$ and $X_2$ by setting $X_i = f(D) + \text{noise}$, where $D = [Z, \text{extra-features}]$ with $i \in \{1, 2\}$ and $f$ is a nonlinear function. Extra-features here are used to distort the classification and are sampled from another two Gaussian distributions, i.e., $\mathcal{N}(e, I)$ and $\mathcal{N}(-e, I)$. We sample $m$ data points from the first Gaussian distribution and $2n - m$ samples from the second Gaussian distribution. We concatenate the extra-features to the useful features to distort the classification. Extra-features are illustrated in Figure 5.1. In Figure 5.1, the rows represent the samples and the columns represent the features. Extra-features have a different class property compared with the useful features. In all the synthetic data experiments, we set $m = 2n/3$ for the first modality and $m = 2n/6$ for the second modality. Extra-features widely exist in multimodal learning scenarios. For example, when we collect genetic data from people with a gene-related disease and healthy people, the genetic data contain not only information to classify the disease, but also gender information. The features that describe the gender information are extra-features. The effect of those features needs to be eliminated in the classification process. The noise is sampled from $\mathcal{N}(0, \tau I)$, where $\tau$ denotes the noise level and is changed in the experiments to test the algorithms' ability to eliminate the effect of noise. $f$ is $\tanh(\tanh(D)) + 0.1$ for the first modality and $\text{sigmoid}(D) - 0.5$ for the second modality.

5.2.1.1 Setting 1

In the first setting, we change the noise level $\tau$ and compare our model with the other baselines. $\tau$ is the relative noise level, which is calculated as $\tau = a \cdot \max(\text{abs}(X_i))$ for each modality, where abs denotes the absolute value.

Figure 5.1: Illustration of extra-features in the synthetic data experiments. Extra-features have a different class property from the useful features.
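For concreteness, the following NumPy sketch generates one draw of this synthetic setup under the description above (shared class signal $Z$, class-mismatched extra-features, nonlinear maps, and relative noise). The signs of the Gaussian means, the interpretation of $\tau$ as a standard deviation, and the helper names are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_modality(Z, extra_dim, m, f, noise_a, rng):
    """Build one modality from the shared Z plus class-mismatched extra-features."""
    n2 = Z.shape[0]                                   # 2n samples in total
    extra = np.vstack([rng.normal(1.0, 1.0, size=(m, extra_dim)),
                       rng.normal(-1.0, 1.0, size=(n2 - m, extra_dim))])
    D = np.hstack([Z, extra])                         # D = [Z, extra-features]
    X = f(D)
    tau = noise_a * np.max(np.abs(X))                 # relative noise level
    return X + rng.normal(0.0, tau, size=X.shape)

rng = np.random.default_rng(0)
n, useful_dim, extra_dim, a = 500, 20, 5, 0.4
Z = np.vstack([rng.normal(0.5, 1.0, size=(n, useful_dim)),
               rng.normal(-0.5, 1.0, size=(n, useful_dim))])
y = np.repeat([0, 1], n)
X1 = make_modality(Z, extra_dim, m=2 * n // 3,
                   f=lambda D: np.tanh(np.tanh(D)) + 0.1, noise_a=a, rng=rng)
X2 = make_modality(Z, extra_dim, m=2 * n // 6,
                   f=lambda D: sigmoid(D) - 0.5, noise_a=a, rng=rng)
```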
Noise level (a)   0.2              0.4              0.6
singlemodal1      0.070 ± 0.016    0.111 ± 0.020    0.135 ± 0.025
singlemodal2      0.132 ± 0.025    0.205 ± 0.040    0.253 ± 0.025
singlemodal12     0.064 ± 0.012    0.083 ± 0.012    0.125 ± 0.011
linearCCA         0.064 ± 0.027    0.109 ± 0.023    0.143 ± 0.035
DCCA              0.065 ± 0.024    0.096 ± 0.023    0.132 ± 0.029
DCCAE             0.075 ± 0.017    0.098 ± 0.008    0.139 ± 0.044
DNN               0.061 ± 0.004    0.094 ± 0.012    0.128 ± 0.025
deepIB            0.059 ± 0.016    0.073 ± 0.011    0.122 ± 0.019

Noise level (a)   0.8              1.0              1.2
singlemodal1      0.166 ± 0.025    0.192 ± 0.030    0.206 ± 0.040
singlemodal2      0.283 ± 0.031    0.313 ± 0.032    0.338 ± 0.035
singlemodal12     0.154 ± 0.031    0.164 ± 0.023    0.181 ± 0.037
linearCCA         0.165 ± 0.038    0.194 ± 0.041    0.209 ± 0.043
DCCA              0.154 ± 0.033    0.173 ± 0.034    0.198 ± 0.043
DCCAE             0.165 ± 0.026    0.182 ± 0.028    0.203 ± 0.041
DNN               0.156 ± 0.024    0.164 ± 0.037    0.191 ± 0.031
deepIB            0.139 ± 0.017    0.158 ± 0.017    0.171 ± 0.026

Table 5.1: Average classification errors of all methods under different noise levels.

We set $a$ to be $\{0.2, 0.4, \ldots, 1.2\}$. The sample size per class is set to be 500. The useful feature dimension is 20, and the extra-feature dimension is 5. $\alpha$ and $\beta$ are tuned in $[1e{-}5, 5e{-}5, 1e{-}4, 5e{-}4, 1e{-}3, 5e{-}3, 1e{-}2]$. For the subnetworks that extract features from $X_1$ and $X_2$ for all the deep models, including DCCA and DCCAE, we tune the number of layers in $[3, 4, 5]$ and the node number for each layer is tuned in $[256, 512, 1024]$. The activation function is ReLU. For the subnetworks that fuse the extracted features from all modalities, we tune the number of layers in $[1, 2, 3]$ and the node number is tuned in $[128, 256, 512]$. The activation function is ReLU. For all the experiments, we use 80% of the data as training and the rest as testing, and repeat the experiments 5 times. We report the average classification errors for all methods in Table 5.1. From Table 5.1, we see that when the noise increases, the performance becomes worse for all the methods. Single-modal methods are all worse than the supervised multimodal methods. Simple concatenation of the two modalities is not as good as deepIB. Compared with the CCA-based methods, we see that supervision information improves the performance a lot. DNN is a challenging baseline, as shown in the results. DNN has a similar network structure to deepIB. The difference between DNN and deepIB is that DNN tries to extract latent features by directly maximizing the cross-entropy between the outputs of the network and the labels, while deepIB not only maximizes the cross-entropy between the outputs of the network and the labels, but also constrains the model complexity by reducing the information between $Z_1$ and $X_1$, and between $Z_2$ and $X_2$. Therefore, the generalization performance of deepIB is better than that of DNN.
5.2.1.2 Setting 2

In the second setting, we vary the sample size per class to see how the performance changes. The noise level $a$ is set to be 1.0. The useful feature dimension is 20, and the extra-feature dimension is set to be 5. The models are tuned in the same way as in the first setting. We report the classification errors for all methods in Table 5.2. From the table, we see that increasing the sample size improves the performance for all methods. We observe some patterns similar to those of Setting 1. For example, the deepIB results are better than all the single-modal methods' results. CCA-based methods are not as good as the supervised methods. One observation is that when the sample size is large enough, i.e., greater than 1100, DNN's performance is better than deepIB. That is because deepIB has the assumption to reduce the model complexity. When the sample size is large enough, deepIB underfits the data, while DNN has no such assumption. Therefore, when the sample size is large enough, directly using DNN delivers the highest accuracy.

Samples per class   300              500              700
singlemodal1        0.178 ± 0.035    0.192 ± 0.030    0.175 ± 0.022
singlemodal2        0.333 ± 0.043    0.313 ± 0.032    0.319 ± 0.032
singlemodal12       0.230 ± 0.080    0.173 ± 0.023    0.164 ± 0.023
linearCCA           0.163 ± 0.030    0.194 ± 0.041    0.166 ± 0.038
DCCA                0.165 ± 0.013    0.173 ± 0.033    0.158 ± 0.026
DCCAE               0.169 ± 0.008    0.182 ± 0.028    0.154 ± 0.033
DNN                 0.173 ± 0.024    0.164 ± 0.032    0.161 ± 0.032
deepIB              0.162 ± 0.015    0.158 ± 0.017    0.143 ± 0.021

Samples per class   900              1100             1300
singlemodal1        0.179 ± 0.015    0.192 ± 0.014    0.183 ± 0.005
singlemodal2        0.326 ± 0.021    0.317 ± 0.010    0.316 ± 0.024
singlemodal12       0.174 ± 0.021    0.180 ± 0.008    0.185 ± 0.011
linearCCA           0.159 ± 0.011    0.173 ± 0.011    0.170 ± 0.023
DCCA                0.151 ± 0.018    0.165 ± 0.013    0.155 ± 0.013
DCCAE               0.154 ± 0.003    0.178 ± 0.017    0.160 ± 0.008
DNN                 0.146 ± 0.016    0.143 ± 0.004    0.139 ± 0.008
deepIB              0.139 ± 0.009    0.143 ± 0.007    0.143 ± 0.006

Table 5.2: Average classification errors of all methods under different sample sizes.

5.2.1.3 Setting 3

In the third setting, we change the extra-feature dimension to see how the extra-feature dimension affects the results. In this setting, the sample size per class is set to be 500. The noise level $a$ is 1.0. The useful feature dimension is fixed as 20. The classification errors are shown in Table 5.3. From Table 5.3, we see that deepIB outperforms all the other methods for every extra-feature dimension. When the extra-feature dimension increases, the data contain more irrelevant information, which makes the classification more distorted. We see that when the extra-feature dimension increases, the errors of DNN and the CCA-based methods increase a lot. However, for the IB-based methods, including the single-modal baselines, the errors are stable when the extra-feature dimension is larger than or equal to 25.
Extra-feature dim   5                15               25
singlemodal1        0.192 ± 0.030    0.194 ± 0.036    0.199 ± 0.020
singlemodal2        0.313 ± 0.032    0.333 ± 0.024    0.342 ± 0.019
singlemodal12       0.164 ± 0.023    0.181 ± 0.027    0.194 ± 0.024
linearCCA           0.194 ± 0.041    0.192 ± 0.038    0.225 ± 0.030
DCCA                0.173 ± 0.033    0.179 ± 0.040    0.187 ± 0.016
DCCAE               0.182 ± 0.028    0.185 ± 0.037    0.195 ± 0.020
DNN                 0.164 ± 0.037    0.197 ± 0.020    0.201 ± 0.022
deepIB              0.158 ± 0.017    0.174 ± 0.053    0.175 ± 0.013

Extra-feature dim   35               45               55
singlemodal1        0.198 ± 0.042    0.194 ± 0.019    0.193 ± 0.016
singlemodal2        0.327 ± 0.020    0.334 ± 0.026    0.332 ± 0.021
singlemodal12       0.189 ± 0.041    0.191 ± 0.026    0.189 ± 0.017
linearCCA           0.205 ± 0.047    0.255 ± 0.027    0.286 ± 0.012
DCCA                0.181 ± 0.018    0.183 ± 0.026    0.201 ± 0.040
DCCAE               0.193 ± 0.023    0.215 ± 0.026    0.221 ± 0.032
DNN                 0.216 ± 0.025    0.219 ± 0.015    0.225 ± 0.032
deepIB              0.174 ± 0.035    0.173 ± 0.019    0.176 ± 0.023

Table 5.3: Average classification errors of all methods under different extra-feature dimensions.

Figure 5.2: Average classification error for the reservoir detection task.

From the results, we conclude that IB-based methods are more robust to extra-features.

Figure 5.3: T-Distributed Stochastic Neighbor Embedding of the joint representations for the reservoir detection models. Blue dots are natural lakes and red dots are reservoirs.

5.2.2 Case study: reservoir detection

In this case study, we compare all the models on a reservoir detection dataset. Reservoirs for this dataset are sampled with ArcMap 10.3.1 by joining dam features from the US Army Corps' National Inventory of Dams with lake polygons over 4 hectares from the LAGOS database [100]. For comparison, we also select a proportional number of natural lakes from the major river watershed that each reservoir is located in. The sample size for this dataset is 1327, with 660 natural lakes and 667 reservoirs. There are two modalities available in this dataset. The first one is the boundary of the lakes. Boundary features of each lake and reservoir are exported using ArcMap. Each boundary is a 224 x 224 image. To deal with the boundary data, we use VGG16 to extract features. We use the last fully-connected layer's output as the features. The dimension is 4096. Since the sample size is not large, we use PCA to reduce the feature dimension by keeping the top 1% singular values. The reduced feature dimension is 75. The second modality is the set of features extracted from Google Earth. The features include the area of the lakes, shape length, classes of the general types of parent material of soil on the surface, classes of landforms, the NED-derived mTPI ranging from negative (valleys) to positive (ridges) values, and the NED-derived CHILI index ranging from 0 (very cool) to 225 (very warm). In total, there are 21 features. We split the data into training and testing as in the synthetic data experiments and report the average classification error in Figure 5.2. From the figure, we see that deepIB outperforms all other methods. In Figure 5.3, we also qualitatively show the joint representations learned by all methods with t-Distributed Stochastic Neighbor Embedding [74]. The joint representation is the output of the layer that is connected with the linear classifier. For example, for deepIB, DNN and the single-modal methods, the joint representations are the outputs of the layer before the last layer. For the CCA-based methods, the learned representations are the projected representations from the first modality. In Figure 5.3, blue dots are natural lakes and red dots are reservoirs. We see that the separation qualities are consistent with the classification performance in Figure 5.2.
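The boundary-image features described above can be obtained along the following lines. This is a hedged sketch: it assumes the 4096-dimensional output is taken from the last fully-connected block of torchvision's pretrained VGG16 and that the boundary images are available as RGB PIL images; the exact layer tap and preprocessing used in the dissertation are not specified, and n_components=75 simply matches the reduced dimension reported in the text.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA

# Truncate VGG16 so the forward pass stops before the final 1000-way layer,
# leaving a 4096-dimensional fully-connected feature vector per image.
vgg = models.vgg16(pretrained=True)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def boundary_features(pil_images):
    """pil_images: list of 224x224 lake-boundary images (PIL.Image, RGB)."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    with torch.no_grad():
        feats = vgg(batch)            # shape: (n_samples, 4096)
    return feats.numpy()

def reduce_features(features, n_components=75):
    # PCA keeps only the leading components of the 4096-d VGG features.
    return PCA(n_components=n_components).fit_transform(features)
```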
ADNI2      NC               MCI              Total            p-value
Number     50               112              163              -
Age        69.36 ± 15.40    71.68 ± 9.93     70.96 ± 11.89    0.0016
Sex        22M/28F          71M/41F          93M/69F          0.0040

NACC       HC               MCI              Total            p-value
Number     329              57               386              -
Age        60.96 ± 8.96     73.60 ± 7.93     63.82 ± 9.73     0.0100
Sex        107M/222F        38M/19F          145M/241F        0.0046

Table 5.4: Demographic information for the two cohorts (ADNI2 and NACC). The p-values for the difference between ADNI2 and NACC are 0.023 for sex and 3.88e-23 for age. The last column is the p-value for the difference between MCI and NC.

                                        ADNI2                          NACC
b value                                 1000 s/mm^2                    1300 s/mm^2
Number of b0 images                     5                              8
Number of diffusion weighted images     42                             40
T1 MRI voxel size                       1.0156 x 1.0156 x 1.2 mm^3     1.0 x 1.9 x 1.2 mm^3
T1 MRI TR                               6.98 ms                        8.16 ms
T1 MRI TE                               2.85 ms                        3.18 ms
T1 MRI image dimension                  256 x 256 x 196                256 x 256 x 156
dMRI voxel size                         2.7 x 2.7 x 2.7 mm^3           0.94 x 0.94 x 2.9 mm^3
dMRI TR                                 9050 ms                        8000 ms
dMRI TE                                 Minimum                        81.8 ms
dMRI image dimension                    128 x 128 x 59                 256 x 256 x 52

Table 5.5: Parameters for dMRI and T1 MRI data for ADNI2 and NACC.

5.2.3 Case study: Alzheimer's disease classification

5.2.3.1 Data Preprocessing

The data we used are the union of the ADNI2 and NACC datasets. There are two classes, i.e., normal control (NC) and mild cognitive impairment (MCI). NACC has 329 NC and 57 MCI. ADNI2 has 50 NC and 112 MCI. Demographic characteristics of the two datasets are summarized in Table 5.4. dMRI and T1w MRI data for each subject were analyzed. Table 5.5 summarizes the key data collection parameters for the two cohorts.

Two types of feature variables were extracted in this study. The first type is from the gray matter using T1w MRI. FreeSurfer was used to extract 136 measurements, including cortical volume and thickness for 68 brain ROIs based on the Desikan-Killiany atlas [33]. The second type is from the dMRI-derived structural connectome or network. The brain structural connectome was constructed using PICo [84], a whole-brain probabilistic tractography algorithm, and 113 ROIs on the Harvard Oxford Cortical and Subcortical Probabilistic Atlas [33, 40]. The details of computing the brain network can be referred to [134]. Each subject's network has a dimension of 113 x 113, with 6,328 distinct edges connecting the 113 brain ROIs (the edges are not directional and thus the network is symmetric).

For both the ADNI2 and NACC cohorts, the number of subjects is limited, especially when we need subjects to have both valid T1 MRI and dMRI available. When performing classification modeling, the dimension of feature variables will be much larger than the sample size for both T1 MRI and dMRI.

Figure 5.4: The pipeline of computing the stability score. The warmer color indicates a higher probability of selection. (a) Calculating the selection probability using different regularization parameters. (b) Illustration of using the selection probability to calculate the stability score.

Figure 5.5: The distribution of the stability scores for the dMRI features.
    ROI1                                                    ROI2
1   Right Parahippocampal Gyrus, Posterior Division         Right Heschl's Gyrus
2   Right Amygdala                                          Left Cerebellum
3   Right Inferior Temporal Gyrus, Temporooccipital Part    Right Supramarginal Gyrus, Anterior Division
4   Brainstem                                               Left Insular Cortex
5   Left Insular Cortex                                     Left Frontal Opercular Cortex
6   Right Superior Temporal Gyrus, Posterior Division       Left Supramarginal Gyrus, Posterior Division
7   Left Caudate                                            Left Pallidum
8   Left Frontal Pole                                       Left Frontal Opercular Cortex
9   Left Inferior Temporal Gyrus, Temporooccipital Part     Right Supramarginal Gyrus, Posterior Division
10  Right Superior Parietal Lobule                          Left Planum Temporale

Table 5.6: Top 10 dMRI feature variables.

This would lead to the "curse of dimensionality" problem, where our models overfit the training data and deliver poor generalization power. Since not all feature variables are related to the AD progression, we perform a feature variable selection procedure that ranks all the variables according to their relevance to the classification problem, and include only the top-ranked feature variables in our models. We use the powerful stability selection method and use the stability score as our criterion for relevance. We select the sparse logistic regression in Chapter 3 as the sparse model to select the features. Given a set of regularization parameters $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$, for each regularization parameter $\lambda$ we obtain a set of feature variables $S_\lambda$ that contribute to the classification in the corresponding sparse model. Stability selection is a variable selection method based on subsampling in combination with high-dimensional sparse learning algorithms. Instead of selecting one model, stability selection perturbs the data (e.g., by subsampling) many times, and we identify the feature variables that are consistently included in the fitted models, under different values of the parameter $\lambda$, across bootstrap datasets [75]. Intuitively, feature variables selected in this way are more consistently relevant to the target problem than feature variables selected only by sparse algorithms.

Figure 5.6: Average classification error for classifying MCI versus NC.

Figure 5.7: T-Distributed Stochastic Neighbor Embedding of the joint representations for classifying MCI versus NC. Blue dots are NCs and red dots are MCIs.

Dataset         XRMB             MNIST            Wiki
singlemodal1    0.185 ± 0.003    0.075 ± 0.006    0.449 ± 0.024
singlemodal2    0.271 ± 0.003    0.160 ± 0.012    0.337 ± 0.018
singlemodal12   0.179 ± 0.006    0.057 ± 0.009    0.336 ± 0.006
linearCCA       0.358 ± 0.004    0.235 ± 0.006    0.741 ± 0.017
DCCA            0.231 ± 0.006    0.187 ± 0.012    0.478 ± 0.049
DCCAE           0.226 ± 0.005    0.170 ± 0.020    0.499 ± 0.037
DNN             0.168 ± 0.006    0.060 ± 0.056    0.311 ± 0.017
deepIB          0.161 ± 0.005    0.056 ± 0.002    0.298 ± 0.005

Table 5.7: Average classification errors for the three benchmark datasets.

Stability selection works as follows: we randomly select 50% of the training samples and apply sparse logistic regression to the selected training samples with regularization parameter $\lambda_i$ to build a sparse model. Let $F$ denote the whole feature variable set and $f \in F$ denote the index of a particular feature variable in the set. The set of feature variables selected by this model is denoted by:
$$U^{\lambda_i} = \{ f : w_{\lambda_i, f} \ne 0 \}. \qquad (5.13)$$
We repeat this procedure $t = 1000$ times. The selection probability for each feature variable is calculated as follows:
$$\Pr_{f, \lambda_i} = \sum I(f \in U^{\lambda_i}) \, / \, t, \qquad (5.14)$$
where $I(\cdot)$ is the indicator function: $I(c) = 1$ when $c$ is true and $I(c) = 0$ when $c$ is false. The procedure of calculating the selection probability is illustrated in the upper portion of Figure 5.4.
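A compact sketch of this selection-probability computation is given below, using scikit-learn's L1-penalized logistic regression as the sparse model. The 50% subsampling and t = 1000 follow the description above; the mapping of λ to the inverse-strength argument C = 1/λ, the function name, and the final max over the λ grid (which anticipates the stability score defined next) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def selection_probabilities(X, y, lambdas, t=1000, seed=0):
    """Pr_{f, lambda_i}: fraction of subsampled fits in which feature f gets a
    non-zero weight under an L1-penalized (sparse) logistic regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    probs = np.zeros((len(lambdas), d))
    for i, lam in enumerate(lambdas):
        for _ in range(t):
            idx = rng.choice(n, size=n // 2, replace=False)   # random 50% of training samples
            clf = LogisticRegression(penalty="l1", C=1.0 / lam,
                                     solver="liblinear").fit(X[idx], y[idx])
            probs[i] += (np.abs(clf.coef_).sum(axis=0) > 0)   # indicator of selection
        probs[i] /= t
    return probs

# The stability score (Eq. (5.15) below) is the maximum over the lambda grid:
# score = selection_probabilities(X, y, lambdas).max(axis=0)
```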
Then we vary the regularization parameter many times and calculate the selection probability under each of these regularization parameters. From these selection probabilities, the stability score for feature variable $f$ is calculated as follows:
$$Sc(f) = \max_{\lambda_i} (\Pr_{f, \lambda_i}). \qquad (5.15)$$
With the stability score, we can rank the variables and choose only the top $k$ stable variables, or the variables whose stability score is larger than a pre-set threshold. The computation of the stability score is shown in the lower portion of Figure 5.4. After selecting feature variables by stability score, the feature dimension is reduced drastically. We will use the new feature variable set to build our model. Figure 5.5 shows the distribution of the stability scores for the dMRI features. We select the top 172 features, which have the top 30% stability scores, as the features for the dMRI modality.

We split the data into training and testing datasets with the ratio 9:1 and repeat the experiments 5 times. The classification error is shown in Figure 5.6 and the TSNE of the hidden representations is shown in Figure 5.7. We see that our method outperforms all the other methods.

5.2.4 Other benchmark datasets

In this section, we report the performance on three benchmark datasets. The datasets we used are:

Wisconsin X-Ray Micro-Beam (XRMB) [119, 122]: the first modality is 273D acoustic inputs, the second modality is 112D articulatory inputs.¹

MNIST [119]: two modalities are generated from the MNIST dataset. The first modality is a random rotation of the original images. The second modality is generated by adding noise to the original images. Both modalities have 784 features.²

Wiki [35]: the dataset contains 2,866 image-text pairs. Each image is represented by 128D inputs and each text is represented by 10D inputs. There are 10 classes in total.

The average classification errors are shown in Table 5.7. From the table, we see that for all the benchmark datasets, the proposed method performs the best among all the methods, which verifies the effectiveness of the proposed method.

¹ We did not use the whole dataset since some baselines are quite slow. We randomly sampled 50,000 data points for training and sampled 6,000 points for testing from the 10 classes.
² We did not use the whole dataset since some baselines are very time-consuming. We sampled 5,000 data points for training and 1,000 for testing.

5.3 Summary

In this work, we proposed a novel multimodal learning model based on the information bottleneck. The model encourages the latent representation to keep as much information about the target as possible while containing as little information about the original features as possible, in order to reduce the model complexity. To learn the complicated relationships between modalities and within modalities, we used a deep neural network to learn the latent representation. Since the mutual information terms were intractable, we maximized the lower bound of the formulation instead of directly maximizing it. We demonstrated experiments on various synthetic and real-world datasets to show the effectiveness of the proposed method.

Chapter 6

Multimodal Learning with Incomplete Modalities

6.1 Methodology

In this section, we first give a brief introduction to knowledge distillation [46]. Then, we introduce our method, which leverages knowledge distillation to conduct multimodal learning with supplementary information.
6.1.1 Knowledge Distillation

Knowledge distillation is used to transfer "dark knowledge" from a teacher to a student. To transfer knowledge, the teacher is first trained on a dataset. Denote the trained teacher model as $Te(\phi)$, with $\phi$ denoting the parameters of the teacher model. Then, the student is trained to mimic the output of the teacher on the training dataset. Given a dataset $D = \{\{X_1, y_1\}, \{X_2, y_2\}, \ldots, \{X_N, y_N\}\}$ used to train the student, the teacher is first applied on the data to label the data with logits. We assume there are in total $C$ classes, and the labels are thus given by:
$$z_i = Te(X_i; \phi), \qquad (6.1)$$
where $z_i \in \mathbb{R}^{C \times 1}$ is the logit vector labeled by the teacher model for sample $X_i$. The student model is then trained with both the true one-hot labels $\{y_1, y_2, \ldots, y_N\}$ and the logits $\{z_1, z_2, \ldots, z_N\}$. Suppose the student model is a deep neural network $f(\theta)$ parameterized by $\theta$. It takes $X_i$ as input and outputs a $C \times 1$ vector, which is the logit vector. Then, a SoftMax function is applied to the logit vector to output the probability of $X_i$ being classified into each of the $C$ classes. The loss function for training the student network is:
$$\min_\theta \; l = \sum_i^N l_c(X_i, y_i; \theta) + l_d(X_i, z_i; \theta), \qquad (6.2)$$
where $l_c$ is a classification loss with the true one-hot label of the form:
$$l_c(X_i, y_i; \theta) = H(\sigma(f(X_i; \theta)), y_i), \qquad (6.3)$$
where $H$ is the negative cross-entropy loss, and $\sigma(x) : \mathbb{R}^C \to \mathbb{R}^C$ is the SoftMax function:
$$\sigma(x)_j = \frac{e^{x_j}}{\sum_{k=1}^C e^{x_k}} \quad \text{for } j = 1, 2, \ldots, C. \qquad (6.4)$$
$l_d(X_i, z_i; \theta)$ is the distillation loss. Examples of the distillation loss include the negative cross-entropy loss or the KL-divergence. Without loss of generality, we adopt the KL-divergence as the distillation loss:
$$l_d(X_i, z_i; \theta) = D_{KL}(\sigma_T(f(X_i; \theta), T), \sigma_T(z_i, T)), \qquad (6.5)$$
where $\sigma_T(x, T)$ denotes the SoftMax with temperature $T$:
$$\sigma_T(x, T)_j = \frac{e^{x_j / T}}{\sum_{k=1}^C e^{x_k / T}}. \qquad (6.6)$$
With temperature $T$, the output probability is rescaled and smoothed. If temperature $T$ is large, the probability will be smoother compared with a small temperature $T$. The output of $\sigma_T(z_i, T)$ is called the "soft label", which is produced by the teacher model on the sample $X_i$. It is believed that the "soft labels" contain more information than the one-hot labels [46].
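A minimal PyTorch rendering of Eqs. (6.2)-(6.6) for a single teacher is sketched below. The temperature value, the weighting of the distillation term, and the helper names are illustrative assumptions; the KL term follows the argument order written in Eq. (6.5), with the teacher's logits treated as constants.

```python
import torch
import torch.nn.functional as F

def soft_log_probs(logits, T):
    # Temperature-scaled SoftMax of Eq. (6.6), returned as log-probabilities.
    return F.log_softmax(logits / T, dim=1)

def kl_divergence(p_log, q_log):
    # KL(P || Q) for row-wise categorical distributions given as log-probabilities.
    p = p_log.exp()
    return (p * (p_log - q_log)).sum(dim=1).mean()

def kd_loss(student_logits, teacher_logits, y, T=5.0, lam=1.0):
    """Hard-label cross-entropy plus a temperature-scaled distillation term."""
    hard = F.cross_entropy(student_logits, y)
    soft = kl_divergence(soft_log_probs(student_logits, T),
                         soft_log_probs(teacher_logits.detach(), T))
    return hard + lam * soft
```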
Figure 6.1: Pattern of the data. (a) shows the structure of a dataset with two modalities. Samples in the blue dashed-line box have complete modalities and samples in the yellow dashed-line box only have one modality available. (b) Illustration of the samples used to train the teacher models for two-modal learning. Samples in the green dashed-line box are used to train the first teacher and samples in the orange dashed-line box are used to train the second teacher.

6.1.2 Multimodal learning with missing modalities

For multimodal learning, it is rather common that some samples do not have complete modalities. Below we start our discussion with two modalities and then generalize our method to multiple modalities.

Given two modalities $\{X_1 \in \mathbb{R}^{n_1 \times d_1}, X_2 \in \mathbb{R}^{n_2 \times d_2}\}$ with labels, we denote the samples that have complete modalities as $\{X_{1c} \in \mathbb{R}^{n_c \times d_1}, X_{2c} \in \mathbb{R}^{n_c \times d_2}, y_c \in \mathbb{R}^{n_c}\}$. Samples that only have the first modality are denoted as $\{X_{1u} \in \mathbb{R}^{n_{1u} \times d_1}, y_{1u} \in \mathbb{R}^{n_{1u}}\}$ and samples that only have the second modality are denoted as $\{X_{2u} \in \mathbb{R}^{n_{2u} \times d_2}, y_{2u} \in \mathbb{R}^{n_{2u}}\}$, with $n_1 = n_c + n_{1u}$ and $n_2 = n_c + n_{2u}$. In Figure 6.1, (a) shows the structure of the data. Samples in the blue dashed-line box are those with complete modalities and samples in the yellow dashed-line box only have one modality available. To utilize all the samples, we first train two single-modal models with all the available data, including the samples with missing modalities. These two models then act as teacher models in our framework. We assume that the two teachers are two neural networks $g_1(\phi_1)$ and $g_2(\phi_2)$ with parameters $\phi_1$ and $\phi_2$. $g_1(\phi_1)$ takes the samples from $[X_{1c}; X_{1u}]$ as input and outputs the logits, and $g_2(\phi_2)$ takes the samples from $[X_{2c}; X_{2u}]$ as input and outputs the logits. The two teachers are trained by minimizing the following loss functions:
$$Te_1(\phi_1) = \min_{\phi_1} \sum_i^{n_1} H(\sigma(g_1(X_{1i}; \phi_1)), y_i), \quad Te_2(\phi_2) = \min_{\phi_2} \sum_i^{n_2} H(\sigma(g_2(X_{2i}; \phi_2)), y_i). \qquad (6.7)$$
Then, we use the two teachers to label the samples in $\{X_{1c}, X_{2c}\}$. The logits for the $i$-th sample are:
$$z_{1i} = Te_1(X_{1ci}; \phi_1), \quad z_{2i} = Te_2(X_{2ci}; \phi_2), \qquad (6.8)$$
where $z_{ji}$ denotes the logits labeled by teacher $j$ for the $i$-th sample.

In order to fuse the supplementary information from the different modalities, we train a student model with a multimodal DNN (M-DNN) [87]. The M-DNN for two modalities contains two branches. Each branch takes one modality as input and is followed by several nonlinear fully-connected layers. The outputs of all the branches are concatenated to form a joint representation. Then, the joint representation is connected to a linear layer to output the logits $z$. The reason we use such a model as the student model is that the joint representation learned with this model contains the supplementary information of the two modalities. If we train the M-DNN as in the methods in [123], i.e., only use the samples with complete modalities $\{X_{1c}, X_{2c}, y_c\}$ to train the model, the sample size is limited to be $n_c$. If $n_c$ is very small compared with $n_1$ and $n_2$, a large amount of useful information is discarded and the samples for training the model are not enough. Thus, we propose to train the M-DNN with the information from the two teachers $Te_1(\phi_1)$ and $Te_2(\phi_2)$ to improve the performance, as the two teachers are trained on much larger datasets. The classification performance of each teacher might not be good enough since each teacher only has access to one modality. But the teachers can do their best to learn with these modalities, provide the expertise for these modalities and teach the student with this knowledge. Denote the student network as $f(\theta)$, with $\theta$ representing the parameters. The loss function for the proposed method is:
$$\min_\theta l = \min_\theta \sum_i^{n_c} l_c(X_{1i}, X_{2i}, y_i; \theta) + \alpha\, l_{d1}(X_{1i}, X_{2i}, y_i; \theta, Te_1(\phi_1)) + \beta\, l_{d2}(X_{1i}, X_{2i}, y_i; \theta, Te_2(\phi_2)), \qquad (6.9)$$
where $l_c(X_{1i}, X_{2i}, y_i; \theta)$ is the classification loss defined as
$$l_c(X_{1i}, X_{2i}, y_i; \theta) = H(\sigma(f(X_{1i}, X_{2i}; \theta)), y_i).$$
$l_{d1}$ and $l_{d2}$ are distillation losses, and $\alpha$ and $\beta$ are two tunable parameters that control how much knowledge the student model needs from the teacher models. If the parameter is large, it means the student model needs more knowledge from this teacher than with a small regularization parameter. The formulations of $l_{d1}$ and $l_{d2}$ are:
$$l_{d1}(X_{1i}, X_{2i}, y_i; \theta, Te_1(\phi_1)) = D_{KL}(\sigma_T(f(X_{1i}, X_{2i}; \theta)), \sigma_T(z_{1i})), \qquad (6.10)$$
$$l_{d2}(X_{1i}, X_{2i}, y_i; \theta, Te_2(\phi_2)) = D_{KL}(\sigma_T(f(X_{1i}, X_{2i}; \theta)), \sigma_T(z_{2i})). \qquad (6.11)$$
Figure 6.2 gives an overview of the proposed framework.

Figure 6.2: Overview of the proposed teacher-student model. We first train teacher models with all the available data, including the samples that have missing modalities. Then, we use the soft labels produced by the teacher models along with the one-hot labels to train the student model.
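The following PyTorch sketch outlines a two-branch M-DNN student and the loss of Eq. (6.9). Layer sizes, branch depth, and the temperature are illustrative assumptions; z1 and z2 are the logits produced by the two pretrained teachers on the complete samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNNStudent(nn.Module):
    """Two-branch multimodal DNN: one branch per modality, concatenation, linear head."""
    def __init__(self, d1, d2, hidden=256, num_classes=3):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Linear(d1, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.branch2 = nn.Sequential(nn.Linear(d2, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, num_classes)   # logits from the joint representation

    def forward(self, x1, x2):
        joint = torch.cat([self.branch1(x1), self.branch2(x2)], dim=1)
        return self.head(joint)

def soft_log_probs(logits, T):
    return F.log_softmax(logits / T, dim=1)

def kl(p_log, q_log):
    p = p_log.exp()
    return (p * (p_log - q_log)).sum(dim=1).mean()

def student_loss(student_logits, z1, z2, y, alpha, beta, T=5.0):
    """Eq. (6.9): hard-label loss plus one distillation term per teacher."""
    l_c = F.cross_entropy(student_logits, y)
    l_d1 = kl(soft_log_probs(student_logits, T), soft_log_probs(z1.detach(), T))
    l_d2 = kl(soft_log_probs(student_logits, T), soft_log_probs(z2.detach(), T))
    return l_c + alpha * l_d1 + beta * l_d2
```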
We would like to highlight the difference between the proposed method and two similar and intuitive methods. The first one is late fusion, i.e., fusion at the decision level, which directly combines the labels/logits produced by the teacher models as the prediction. Since the teachers only have partial knowledge of the data, the data labeled by the teachers may not be perfect. Research has shown that in most cases late fusion performs worse than early fusion, i.e., feature-level fusion [99, 43]. In our proposed method, we not only utilize the labels from the teachers, but also perform early fusion with the M-DNN. So, the performance is expected to be better than late fusion. Another method is to use the teachers as feature extractors to extract abstract features and then use these abstract features as new sets of features to replace the original inputs to train a multimodal model. This method may perform well when the different modalities only have common or shared information and noise. However, when the different modalities contain supplementary information, the abstract features extracted by each teacher model may have already lost some useful information, as the teacher models are trained on only one modality and are biased. Therefore, its performance is likely to be worse than the proposed method. We will show the performance of these methods in the experiment section.

Mechanism of the proposed method: The underlying mechanism of the proposed approach can be illustrated using gradient analysis. The gradient of the classification loss with respect to the output probability of the $k$-th class is:
$$\frac{\partial l_c}{\partial p_k} = \sum_i^N (p_{ik} - y_{ik}),$$
where $y_{ik}$ denotes the one-hot label of sample $i$ for class $k$, and $p_{ik}$ denotes the output probability of sample $i$ for class $k$. Let $L_d$ denote all the distillation losses; the gradient of the distillation losses with respect to the output probability $p_k$ is:
$$\frac{\partial L_d}{\partial p_k} = \frac{\partial}{\partial p_k} \Big( \alpha \sum_i^N D_{KL}(\sigma_T(z_i), \sigma_T(z_{1i})) + \beta \sum_i^N D_{KL}(\sigma_T(z_i), \sigma_T(z_{2i})) \Big)$$
$$= \alpha \sum_i^N (\log p_{ik} - \log q_{1ik}) + \beta \sum_i^N (\log p_{ik} - \log q_{2ik}) \qquad (6.12)$$
$$\approx \alpha \sum_i^N (p_{ik} - q_{1ik}) + \beta \sum_i^N (p_{ik} - q_{2ik}), \qquad (6.13)$$
where $q_{mik}$ is the soft label produced by teacher $m$ for sample $i$ at class $k$, with $m = 1, 2$. We use $\log(1 + x) \approx x$ to get (6.13) from (6.12). The gradient of the total loss with respect to $p_k$ is:
$$\frac{\partial l}{\partial p_k} = \sum_i^N \big( (p_{ik} - y_{ik}) + \alpha (p_{ik} - q_{1ik}) + \beta (p_{ik} - q_{2ik}) \big) = \sum_i^N \Big( 1 + \alpha \frac{p_{ik} - q_{1ik}}{p_{ik} - y_{ik}} + \beta \frac{p_{ik} - q_{2ik}}{p_{ik} - y_{ik}} \Big) (p_{ik} - y_{ik}) \qquad (6.14)$$
$$= \sum_i^N w_{ik} (p_{ik} - y_{ik}), \qquad (6.15)$$
where $w_{ik} = 1 + \alpha (p_{ik} - q_{1ik}) / (p_{ik} - y_{ik}) + \beta (p_{ik} - q_{2ik}) / (p_{ik} - y_{ik})$. Eq. (6.15) indicates that the samples are reweighted by $w_{ik}$. $w_{ik}$ is determined by the soft labels and the confidence of the soft labels. If both teachers labeled the sample correctly and the confidence for the correct label is high, the weight $w_{ik}$ is around $(1 + \alpha + \beta)$ for this sample. If only one teacher labeled the sample correctly and its confidence is high, the weight is around $(1 + \alpha)$ or $(1 + \beta)$, which is smaller than for the samples that are correctly labeled by both teachers with high confidence. If the teachers both make mistakes, or if they labeled correctly but with very low confidence, the weight is lower than in the aforementioned two cases. So, the proposed method reweights the samples with the teachers' labels and the confidence of the teachers, and assigns higher weights to the samples that are correctly labeled by the teachers with high confidence.
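A toy numerical check of this reweighting interpretation is given below; α = β = 0.5, and all probabilities are made-up values chosen only to illustrate the three cases discussed above.

```python
def w_ik(p, y, q1, q2, alpha=0.5, beta=0.5):
    """Sample weight of Eq. (6.15) for one sample and one class k."""
    return 1 + alpha * (p - q1) / (p - y) + beta * (p - q2) / (p - y)

# The student currently predicts p = 0.7 for the true class (y = 1).
print(w_ik(0.7, 1.0, q1=0.98, q2=0.98))  # both teachers correct and confident -> ~1.93, close to 1 + alpha + beta
print(w_ik(0.7, 1.0, q1=0.98, q2=0.70))  # only teacher 1 adds signal          -> ~1.47, close to 1 + alpha
print(w_ik(0.7, 1.0, q1=0.40, q2=0.45))  # both teachers doubt the true class  -> ~0.08, sample is down-weighted
```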
Generalize to multiple modalities: Given $m$ modalities $X_1 \in \mathbb{R}^{n_1 \times d_1}$, $X_2 \in \mathbb{R}^{n_2 \times d_2}$, ..., $X_m \in \mathbb{R}^{n_m \times d_m}$, the dataset can be divided into parts as follows: (1) samples with complete modalities $X_{ic} \in \mathbb{R}^{n_c \times d_i}$ with $i = \{1, 2, \ldots, m\}$; (2) samples with only one modality available $X_{iu} \in \mathbb{R}^{n_{ui} \times d_i}$ with $i = \{1, 2, \ldots, m\}$; (3) samples with two modalities available $X_{ku\{ij\}} \in \mathbb{R}^{n_{u\{ij\}} \times d_k}$ with $i, j = \{1, 2, \ldots, m\}$ and $k = \{i, j\}$, where $X_{ku\{ij\}}$ is the $k$-th modality for the subset of samples that contain the $i$-th and $j$-th modalities; ...; and samples with $m - 1$ modalities available $X_{ku\{M \setminus i\}} \in \mathbb{R}^{n_{\{M \setminus i\}} \times d_k}$ with $i = \{1, 2, \ldots, m\}$. We use $\{M\}$ to denote the set of indices for all $m$ modalities, i.e., $\{M\} = \{1, 2, \ldots, m\}$. $\{M \setminus i\}$ denotes the set without index $i$, and $k$ is an index taken from the set $\{M \setminus i\}$. $X_{ku\{M \setminus i\}}$ is the $k$-th modality for the subset in which samples contain the $\{M \setminus i\}$ modalities.

We train the teacher models in a hierarchical manner. First, we train teacher models on each modality separately and obtain $Te_i$ with $i = \{1, 2, \ldots, m\}$. Then, we use these models to teach the teacher models trained with two modalities and obtain teacher models $Te_{ij}$ with $i, j = \{1, 2, \ldots, m\}$. Next, we use all the $Te_{ij}$ to teach the teacher models trained with three modalities, and so forth. Finally, we obtain all the teachers hierarchically. Denote the teachers trained with $h$ modalities as the $h$-level teachers. $\{C_h\}$ is the set composed of all the combinations of $h$ indices sampled from the set $M$. The size of $\{C_h\}$ is $\binom{m}{h}$. For example, if $\{M\} = \{1, 2, 3, 4\}$, then $\{C_2\} = \{\{1, 2\}, \{1, 3\}, \{1, 4\}, \{2, 3\}, \{2, 4\}, \{3, 4\}\}$ and $\{C_3\} = \{\{1, 2, 3\}, \{1, 2, 4\}, \{1, 3, 4\}, \{2, 3, 4\}\}$. The $h$-level teacher models are trained on the modalities indexed by the elements in $\{C_h\}$. For the above example, there are four 3-level teachers, i.e., a teacher trained with modalities 1, 2, 3, a teacher trained with modalities 1, 2, 4, a teacher trained with modalities 1, 3, 4 and a teacher trained with modalities 2, 3, 4. Denote the model of the $t$-th teacher from the $h$-level teachers by $Te_{C_{ht}}(\phi_{ht})$, with $\phi_{ht}$ denoting the network parameters and $C_{ht}$ denoting the $t$-th element of the set $\{C_h\}$. For the above example, $C_{23} = \{1, 4\}$. $Te_{C_{ht}}(\phi_{ht})$ is trained by minimizing the following loss function:
$$\min_{\phi_{ht}} l_{C_{ht}} = \min_{\phi_{ht}} \sum_i^{N_{C_{ht}}} l_c(\{X_{ku C_{ht} i}\}_{k \in C_{ht}}, y_{u C_{ht} i}; \phi_{ht}) + \sum_i^{N_{C_{ht}}} \sum_j^{|C_{h-1}|} \alpha_j\, l_d(\{X_{ku C_{ht} i}\}_{k \in C_{ht}}, Te_{C_{(h-1)j}}), \qquad (6.16)$$
where $|C_{h-1}|$ is the size of the set $C_{h-1}$ and $N_{C_{ht}}$ is the number of samples having the modalities indexed by $C_{ht}$. After obtaining all the teachers, we train the student model with all the teachers.

One potential issue is that if we have a lot of modalities, the number of teacher models can be very large. For $m$ modalities, the complete number of teacher models is $2^m - 2$. As such, we cannot build all the teachers to train the student model due to the computational cost. As a solution, we propose to prune the teachers to improve the scalability of the proposed framework. A simple pruning strategy is to select a subset of teachers to train the student model. Basically, after the first-level teachers are trained, i.e., the single-modal teachers, we only select the teachers that have high classification performance to train the second-level teachers. The modalities that have bad performance are also discarded when building the second-level teachers. We build the teachers at all other levels in the same way. Finally, all the remaining teachers are used to teach a student model built with $m$ modalities. This pruning method drastically reduces the number of teachers and makes the proposed method scalable. For example, for a dataset with five modalities, if in the first level we eliminate two teachers and in the second level we eliminate one teacher, the total teacher number is reduced to five. We demonstrate an experiment on synthetic data to show the process of pruning and verify its effectiveness. Here, we use figures to illustrate the pruning procedure. Suppose we have three modalities. If we use all the teachers to train the student, the total number of teachers that need to be trained is 6 (see Figure 6.3). Denote the teachers as $Te_1$, $Te_2$, $Te_3$, $Te_{12}$, $Te_{13}$, $Te_{23}$. If the performance of $Te_3$ is relatively low compared with $Te_1$ and $Te_2$, we remove $Te_3$. In the meanwhile, we remove $Te_{13}$ and $Te_{23}$, since both $Te_{13}$ and $Te_{23}$ need the teaching of $Te_3$. This pruning procedure is shown in Figure 6.4. If all the first-level teachers' performances are good but $Te_{13}$'s performance is relatively low compared with $Te_{12}$ and $Te_{23}$, we remove $Te_{13}$. We do not need to remove other teachers since there are no higher-level teachers.

Figure 6.3: Total teachers that need to be trained with three modalities.

Figure 6.4: Total teachers that need to be trained with pruning (low-level teacher). If a low-level teacher's performance is not good, we can remove this teacher and the upper-level teachers which need the teaching from the low-level teacher.

Figure 6.5: Total teachers that need to be trained with pruning (high-level teacher). If a high-level teacher's performance is not good, we can remove this teacher. We do not need to prune other teachers since there is no upper-level teacher.
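The bookkeeping part of this pruning strategy can be sketched as follows. The function only enumerates which teacher index sets remain after dropping the weakest single-modal teachers up to a given level; in the full procedure, accuracy-based pruning at the higher levels (e.g., dropping Te_13 above) would interleave training with this enumeration. The function name and the example accuracies (loosely mirroring the single-modal accuracies later reported in Table 6.1) are illustrative assumptions.

```python
from itertools import combinations

def candidate_teachers(single_accs, keep_top, max_level):
    """Keep the strongest single-modal teachers, then enumerate the higher-level
    teacher index sets that would be built from the surviving modalities only."""
    ranked = sorted(single_accs, key=single_accs.get, reverse=True)
    survivors = sorted(ranked[:keep_top])
    plan = [(m,) for m in ranked]                      # all 1-level teachers are trained first
    for level in range(2, max_level + 1):
        plan += list(combinations(survivors, level))   # only survivors feed higher levels
    return plan

# Five modalities as in Setting 3 of the experiments: modalities 4 and 5 are dropped
# after level 1, leaving Te_12, Te_13, Te_23 as the only two-modal teachers to train.
accs = {1: 0.478, 2: 0.470, 3: 0.450, 4: 0.345, 5: 0.228}
print(candidate_teachers(accs, keep_top=3, max_level=2))
```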
6.2 Experiment

In this section, we validate the proposed method and the baselines on both synthetic and real datasets. The baselines included are (1) Te_i: the i-th teacher model (we use a DNN as the teacher model in all the experiments¹), (2) M-DNN: a multimodal DNN trained only with the complete samples, (3) T-DNN: using the teacher models to extract abstract features and then training a DNN with the concatenation of these abstract features as input, (4) CAS-AE [111]: using a cascaded residual autoencoder to impute the missing modalities and then training a multimodal DNN with the original data and the imputed data, (5) ADV [21]: using adversarial learning to generate the missing modalities and then training a multimodal DNN with the original data and the imputed data, (6) Subspace: multimodal subspace learning [106], (7) CCA [54]: canonical correlation analysis, (8) DCCA [9]: deep canonical correlation analysis, (9) T-LATE: a weighted addition of the teachers' logits, (10) MCTN² [86]: the multimodal cyclic translation network. Our method is denoted as TS.

¹ Other models could also be used as teacher models. The reason we use a DNN as the teacher model in our work is that the DNN model's performance is relatively high compared with other commonly used classifiers. Ensemble models also have high performance, but the DNN model can generate soft labels more easily than ensemble models.
² We use fully-connected neural networks instead of RNNs for the encoder, decoder and the prediction subnetwork since our data are not time series data.

6.2.1 Synthetic data experiments

Setting 1: We synthesize data with two modalities in the following steps: (1) We draw $n$ samples from $\mathcal{N}(1, I)$ and $\mathcal{N}(-1, I)$ separately. Samples from each normal distribution form one modality. Denote these samples as $X_1$ and $X_2$. The feature dimension is fixed to 32. (2) We then generate random weight matrices $W_1^1 \in \mathbb{R}^{32 \times 64}$, $W_1^2 \in \mathbb{R}^{64 \times 64}$, $W_2^1 \in \mathbb{R}^{32 \times 64}$, $W_2^2 \in \mathbb{R}^{64 \times 64}$ and use these weight matrices with the ReLU function to transform $X_1$ and $X_2$ into abstract features, i.e., $\text{ReLU}(\text{ReLU}(X_1 W_1^1) W_1^2)$ and $\text{ReLU}(\text{ReLU}(X_2 W_2^1) W_2^2)$. (3) After obtaining the transformed features for the two modalities, we concatenate those features to form the joint features and use a linear layer to transform the joint features to logits $z$. The class label is determined by $\sigma(z)$. When synthesizing the data, we make sure the number of samples for each class is the same by generating more than $n$ samples and downsampling. (4) We randomly select $a\%$ of the samples to be $X_{1c}$ and $X_{2c}$. The remaining samples are divided into two equal parts. We remove one modality from each part to form $X_{1u}$ and $X_{2u}$. So, $X_{1u}$ and $X_{2u}$ each have $n(1 - a\%)/2$ samples. For each class, we randomly choose 80% of the data as the training set, 10% as the validation set, and 10% as the testing set. We repeat the experiments 5 times.

In this setting, we fix the number of samples per class to be 400 and change the class number in $\{2, 5, 7, 10, 12\}$.
The rate of samples with complete modalities is fixed to be 40%. The missing rate for each modality is 30%. The teacher model is a DNN model with 3 hidden layers and the hidden node number is tuned in $\{32, 64, 128, 256\}$. For TS and M-DNN, we fix the network structure to be identical to the one used to generate the data but with unknown weight matrices. Since the two modalities have equal contributions to the output when we synthesize the data, we set $\alpha$ to be equal to $\beta$, and it is tuned in $\{0.1, 0.2, \ldots, 0.9\}$. The temperature $T$ is tuned in $\{1, 5, 10, 15, 20\}$. For T-DNN, we use the layer before the output layer of the teacher models as the abstract features. These abstract features are concatenated to form new features. Then, we train a DNN model with the new features. The DNN model has 3 hidden layers with the node number tuned in $\{64, 128, 256\}$. For each block of the autoencoder in the CAS-AE model, the encoder has 3 layers and the decoder has 3 layers. The encoded feature dimension is fixed to be 64 since the original data have 32 features for each modality. The node number for the hidden layers of the encoder and decoder is tuned in $\{128, 256, 512\}$. We follow the steps in [111] to tune the number of autoencoder blocks, i.e., the joint optimization of the entire network is performed when adding one autoencoder block. During the training phase, we randomly choose half of the complete samples to remove one modality and the other half to remove the other modality. Then, we train the CAS-AE to reconstruct the removed modalities. After the training, the CAS-AE is used to impute the missing modalities for the incomplete samples. Finally, we train a multimodal DNN using all the imputed samples and the complete samples together. The structure of the multimodal DNN used here is the same as the student model of TS and the M-DNN model. For ADV, the encoder part is a 3-layer DNN, and the hidden node number is tuned in $\{128, 256, 512\}$. The structure of the discriminator is a 3-layer DNN with the hidden node number tuned in $\{128, 256, 512\}$. Since ADV can only impute one modality at a time, we first use the first modality to impute the second modality with the complete samples as the training data. Then, we use the imputed samples and the complete samples as training data to train a second model to impute the first modality. After we impute all the missing parts, we train a multimodal DNN to perform the classification. The structure of the multimodal DNN is the same as the student model of TS and the M-DNN model. The formulation of the Subspace baseline is identical to Eq. (2) in [106]. We initialize the latent factors by the SVD of the concatenation of the two modalities to improve the performance of this model. The latent factor rank is tuned in $\{16, 32, 64\}$. For CCA and DCCA, the projected feature dimension is tuned in $\{16, 32\}$. The hidden node number of DCCA is tuned in $\{64, 128, 256, 512\}$ and the hidden layer number is fixed to be 3. For T-LATE, we use the training samples to learn the optimal weights for each teacher. Then, we use the learned weights and the teacher models to label the testing samples. For MCTN, the hidden node number of the encoder and decoder is tuned in $\{64, 128, 256, 512\}$ and the hidden layer number is fixed to be 3. The prediction subnetwork has one hidden layer and the hidden node number is fixed to be 128.

The results are shown in Figure 6.6. We see that TS outperforms all the other models. The performances of $Te_1$ and $Te_2$ are much worse than M-DNN since each teacher only has access to the information of one modality. Although they are well trained with all the available data, the information loss still makes their performance worse than M-DNN. The performance of ADV and CAS-AE is lower than M-DNN because the imputed samples have low quality when only limited samples have complete modalities. Although these two methods enlarge the sample size, they still cannot outperform M-DNN. Especially for ADV, the performance is much lower than M-DNN and CAS-AE since adversarial training is much more difficult than training an autoencoder.
The difference between T-LATE and the TS model increases as the class number increases, which implies that late fusion does not work well when the class number is large. The key difference between our model and T-DNN is that our model uses teachers to teach the student by labeling the samples, while T-DNN directly uses the features extracted by the teachers as the input features. The samples and model structures used to train the teachers and the student models are all the same for the two methods. However, the performance of T-DNN is worse than the proposed method. One reason is that the features extracted by the teachers have lost some useful information.

Figure 6.6: Classification accuracy for Setting 1. The proposed method (TS) outperforms all the other baselines.

Setting 2: In the second setting, the data are synthesized the same way as in Setting 1. We change the rate of samples with complete modalities (complete rate) to be $\{60\%, 50\%, 40\%, 30\%, 20\%\}$. All the model structures and parameter settings are identical to Setting 1. The results are shown in Figure 6.7. We see similar patterns as in Setting 1. When the complete rate is large, the performance of TS and M-DNN or CAS-AE is almost the same. But when the complete rate is small enough, TS is much better than M-DNN and CAS-AE, since M-DNN and CAS-AE are only trained well with a large complete rate. When the complete rate is small, there is not enough data to train them. T-DNN and T-LATE show the opposite pattern to M-DNN and CAS-AE, i.e., the difference between TS and these two models is smaller with a small complete rate than with a large complete rate. T-DNN and T-LATE rely less on the complete samples. When the complete rate is small, the benefit of using large data to train the teachers makes them perform much better than the models only using complete samples. For our proposed model, we utilize this benefit to make sure the performance is good when complete samples are scarce.

Figure 6.7: Classification accuracy for Setting 2. The proposed method (TS) outperforms all the other baselines.

Setting 3: In this setting, we show the results of the 5-modality synthetic data experiments. The challenge of 5-modality learning comes from scalability, since there are too many teachers available. We test the proposed pruning strategy in this section. The dataset is synthesized in the following way. (1) We draw $n$ samples from $\mathcal{N}(1, I)$ and $\mathcal{N}(-1, I)$ separately. Samples from each normal distribution form one modality. Denote these samples as $X_1$ and $X_2$. The feature dimension is fixed to 32. (2) We use a random matrix $T \in \mathbb{R}^{32 \times 32}$ to linearly transform $X_1$ to form the third modality, i.e., $X_3 = X_1 T$. (3) We take half of the features from $X_2$ and then multiply by a random matrix $M \in \mathbb{R}^{16 \times 32}$ to form modality 4. (4) We then draw $n$ samples from $\mathcal{N}(0, I)$. The feature dimension is set to 32. These samples form the fifth modality. But when forming the joint representation, we only use half of the features of the fifth modality, denoted by $X_{5,1/2}$. (5) We then generate random weight matrices $W_1^1, W_1^2, W_2^1, W_2^2$ and $W_5^1, W_5^2$. The size is $32 \times 64$ for $W_1^1, W_2^1$, $64 \times 64$ for $W_1^2, W_2^2$, $16 \times 32$ for $W_5^1$ and $32 \times 32$ for $W_5^2$. (6) We use ReLU as the nonlinear activation function. The joint representation is the concatenation of $\text{ReLU}(\text{ReLU}(X_1 W_1^1) W_1^2)$, $\text{ReLU}(\text{ReLU}(X_2 W_2^1) W_2^2)$ and $\text{ReLU}(\text{ReLU}(X_{5,1/2} W_5^1) W_5^2)$. We only use $X_1$, $X_2$ and $X_5$ to form the joint representation because $X_3$ and $X_4$ are generated from $X_1$ and $X_2$. (7) A linear layer is added to the joint representation to generate the logits $z$. The class label is determined by $\sigma(z)$. (8) We randomly select 40% of the samples to be $X_{1c}$, $X_{2c}$, $X_{3c}$, $X_{4c}$, $X_{5c}$. We divide the remaining samples into three equal parts. We remove one modality from each part to form $X_{1u}$, $X_{2u}$ and $X_{5u}$. $X_{3u}$ has the same missing pattern as $X_{1u}$ and $X_{4u}$ has the same missing pattern as $X_{2u}$. For each class, we choose 80% of the data as training, 10% as validation, and 10% as testing. Experiments are repeated 5 times.
We set the number of samples per class to be 1000 and the class number to be 5. We train the teachers with every single modality. Then, we compare the performance of these teachers. The results are shown in Table 6.1. From Table 6.1, we see that the performance of the 4-th teacher and the 5-th teacher is relatively low compared with the other teachers. Thus, we only use the first 3 teachers and modalities to form the two-modal teachers, which are $Te_{12}$, $Te_{23}$ and $Te_{13}$. Then, we find that the performance of $Te_{13}$ is much worse than the performance of $Te_{12}$ and $Te_{23}$. So, we do not need to train a 3-modality model with modalities 1, 2, 3 as the teacher, since it contains both modality 1 and modality 3. The final teachers we used are $Te_1$, $Te_2$, $Te_3$, $Te_{12}$, and $Te_{23}$. If we do not select teachers, the teacher number will be $2^5 - 1 = 31$. But now, we only need 5 teachers. As a comparison, we train models with modalities 5 and 4 and then use them as teachers, along with all the 5 selected teachers, to teach the student model. The performance drops to 70.76 ± 0.01. So, when the performance of one teacher is too bad, we do not use this teacher to teach the student. We note that although modality 5 alone has bad performance, it still contributes to the joint representation, as shown in the steps when we synthesize the data. We thus only use this method to select teachers but not the modalities used to train the student model.

Model    Te_1            Te_2            Te_3            Te_4            Te_5
ACC      47.80 ± 0.09    47.04 ± 0.17    44.98 ± 0.26    34.48 ± 0.05    22.80 ± 0.04
Model    Te_12           Te_23           Te_13           M-DNN           TS
ACC      71.32 ± 0.03    68.84 ± 0.10    47.40 ± 0.3     71.44 ± 0.07    72.28 ± 0.03

Table 6.1: Classification accuracy of Setting 3. We use the selected teachers to train the student model. As a comparison, the accuracy drops to 70.76 ± 0.01 when adding the non-selected teachers Te_4 and Te_5.

6.2.2 Experiments on Alzheimer's diagnosis

In this subsection, we report the experimental performance on the union of two stages of the ADNI datasets³, i.e., ADNI1 and ADNI2, and the NACC dataset⁴. These datasets contain brain imaging data of subjects with different stages of Alzheimer's disease. Two modalities are used in these experiments. The first one is T1 MRI: 136 cortical volume and thickness features are extracted for 68 brain regions of interest (ROIs) based on the Desikan-Killiany atlas [33]. The second modality is the dMRI-derived structural network. We use PICo [84] to construct brain networks for 113 ROIs based on the Harvard Oxford Cortical and Subcortical Probabilistic Atlas [33, 40]. Since the network is undirected, we extract the upper triangle of the weighted adjacency matrix to form 6328 features. Finally, we use stability selection [116, 75] to select the top 172 features, which have the top 30% stability scores, as the features for this modality. Our task is to classify whether the subject is normal control (NC), mild cognitive impairment (MCI) or dementia (AD). The ADNI1 data have 223 NC, 385 MCI and 186 AD. The ADNI2 data have 50 NC, 112 MCI and 39 AD. The NACC data have 329 NC, 57 MCI and 53 AD. ADNI2 and NACC have both dMRI and T1 MRI modalities, while ADNI1 only has T1 MRI.

³ http://adni.loni.usc.edu
⁴ https://www.alz.washington.edu

Model    TS              M-DNN           Te_1
Acc      75.48 ± 0.07    73.26 ± 0.08    69.67 ± 0.06
Model    Te_2            Subspace        MCTN
Acc      62.98 ± 0.01    67.66 ± 0.04    69.05 ± 0.11
Model    CCA             DCCA            CAS-AE
Acc      61.03 ± 0.47    72.70 ± 0.46    71.11 ± 0.01
Model    ADV             T-DNN           T-LATE
Acc      72.70 ± 0.05    72.27 ± 0.10    74.21 ± 0.01

Table 6.2: The classification accuracy for all the models trained on the union of the ADNI and NACC datasets.

We train the teacher networks, the student network and the M-DNN before the fusion layer with 4 hidden layers, and the hidden node number is tuned in {256, 512, 1024}. After the fusion layer, a linear layer with SoftMax is added to complete the classification. $\alpha$ and $\beta$ are tuned in {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} separately. For CAS-AE, we use 4 layers for the encoder and 4 layers for the decoder.
The encoded feature dimension is tuned in {128, 256}. For ADV, the hidden layer number for the encoder and the discriminator is set to be 4. The node number is tuned in {256, 512, 1024}. For Subspace, we tune the rank in {32, 64, 128}. The projected feature dimension of CCA and DCCA is tuned in {32, 64, 128}. For MCTN, the hidden layer number is fixed to be 4 for the encoder and decoder, and the hidden node number is tuned in {256, 512, 1024}. The prediction subnetwork hidden node number is fixed to be 256. We randomly select 90% of the samples as the training set and the rest as the testing set. We repeat the experiment 5 times.

The average classification accuracy is reported in Table 6.2. We see that our proposed method outperforms all the other baselines. $Te_1$ is the teacher model trained on T1 MRI and $Te_2$ is the teacher model trained on the dMRI modality. For this dataset, all the samples have the first modality and only part of the samples have the second modality. So, the performance of $Te_1$ is much higher than the performance of $Te_2$. This is also reflected in the regularization parameters $\alpha$ and $\beta$. The best performance for our proposed model is reached when $\alpha$ is 0.7 and $\beta$ is 0.0. Since the dMRI modality is missing for some samples and the T1 MRI modality is complete for all samples, the single teacher trained on dMRI turns out to be of no use here. Thus, when $\beta$ is 0.0, the performance is the highest. Figure 6.8 shows how the accuracy changes with the parameters $\alpha$ and $\beta$. In this figure, we change $\alpha$ while fixing $\beta$ to be 0.0 and change $\beta$ while fixing $\alpha$ to be 0.7. The performance decreases with the increase of $\beta$. Meanwhile, the teacher trained with T1 MRI improves the performance a lot with a large $\alpha$.

Figure 6.8: Accuracy with different α and β. α is fixed to be 0.7 while changing β, and β is fixed to be 0.0 while changing α.

We also show the top important T1 MRI features for the Te_1, M-DNN and TS models in Figure 6.9, Figure 6.10 and Figure 6.11, and the top important dMRI features for the Te_2, M-DNN and TS models in Figure 6.12 (the top important dMRI features for the M-DNN and TS models are the same since the dMRI teacher does not contribute to the training of the student model). The features are ranked by the absolute weight values between the input layer and the first hidden layer. We sum all the absolute values of the weights that are connected with an input node as the relative importance of the associated input feature. We see there is some overlap between the top important features of the three models, but some top features are still very different between Te_1 and TS/M-DNN. For example, right isthmus cingulate thickness is ranked the third most important feature for the teacher model and the most important feature for the student models. Left entorhinal volume is the second most important feature for M-DNN/TS but does not appear in the top 10 important features for Te_1. Both features have been proved to be related to Alzheimer's disease [53, 44]. This difference between the importance of the features causes T-DNN to be worse than the TS model, as T-DNN uses the features extracted by Te_1. Training with two modalities simultaneously leads to different feature ranks since the two modalities are coupled and influence each other. Some features in one modality alone do not show to be important, but these features could be very important in the presence of some features from the other modality.

Figure 6.9: The top 10 important T1 MRI features for Te_1 trained on the union of the NACC and ADNI datasets.

Figure 6.10: The top 10 important T1 MRI features for M-DNN trained on the union of the NACC and ADNI datasets.

6.2.3 Experiments on other real-world datasets

In this section, we report the performance on three additional real-world datasets. The first one is the Alzheimer's disease data from [132], which has 3 modalities, i.e., MRI, PET and Proteomics, and 3 classes available. The feature dimensions for these 3 modalities are 305, 116 and 147, respectively.
In this dataset, 648 subjects have MRI data, 372 subjects have PET data, and 496 subjects have Proteomics data. Only 215 subjects have all three modalities. We randomly split the data into the training set and testing set with the ratio 0.9 : 0.1. The parameters are tuned in the same way as in Section 6.2.2. We repeat the experiments for 5 iterations. The average classification accuracy is shown in Table 6.5. From the table, we see that the performance of M-DNN is even worse than Te_13, since when training the M-DNN with all three modalities, the sample size is much smaller than that used to train Te_13. But with the teaching step, the performance improves a lot and outperforms the performance of Te_13.

Figure 6.11: The top 10 important T1 MRI features for TS trained on the union of the NACC and ADNI datasets.

The other two real-world datasets we used are MNIST and XRMB [119]. For the MNIST data, we subsample 10,000 samples as training data, 1,000 samples as validation data and 1,000 samples as testing data. The class number is 10. MNIST has two modalities with 784 features for each modality. For the XRMB data, we subsample 19,500 samples for training, 1,950 for validation and 1,950 for testing. The class number for XRMB is 39. Two modalities are available for the XRMB data, with 273 and 112 features. Since these data do not have missing modalities, we randomly choose $a\%$ of the samples to be the samples with complete modalities. For the rest of the data, we split them into two parts and remove one modality from each part. We change the rate of complete modalities in {40%, 30%, 20%, 10%}. The parameters are tuned in the same way as in Section 6.2.2 except for the node number. The hidden layer node number is tuned in {512, 1024, 2048}. The encoded feature dimension for CAS-AE and ADV is tuned in {128, 256, 512}. The projected feature numbers for CCA, DCCA and Subspace are tuned in {128, 256, 512} for MNIST and {32, 64, 100} for XRMB. The experiments are repeated 5 times and the results are shown in Table 6.3 and Table 6.4. We see that our method outperforms all the other baselines under the different missing rates.

Rate       40%             30%             20%             10%
TS         66.13 ± 0.03    64.77 ± 0.01    63.19 ± 0.01    58.36 ± 0.01
M-DNN      62.66 ± 0.01    60.59 ± 0.01    57.18 ± 0.01    50.33 ± 0.03
Te_1       56.05 ± 0.01    53.13 ± 0.01    51.08 ± 0.01    44.57 ± 0.01
Te_2       45.73 ± 0.01    42.59 ± 0.01    41.63 ± 0.01    37.93 ± 0.01
CAS-AE     59.75 ± 0.02    57.96 ± 0.01    56.58 ± 0.01    53.84 ± 0.01
ADV        59.37 ± 0.01    57.83 ± 0.02    56.37 ± 0.01    53.60 ± 0.01
Subspace   45.25 ± 0.02    41.63 ± 0.01    38.08 ± 0.01    34.15 ± 0.01
DCCA       41.94 ± 0.41    41.64 ± 0.46    33.53 ± 0.13    32.86 ± 0.34
T-DNN      65.14 ± 0.02    63.11 ± 0.01    61.61 ± 0.01    56.59 ± 0.01
T-ENS      63.69 ± 0.01    61.91 ± 0.02    59.70 ± 0.02    56.13 ± 0.01
MCTN       53.58 ± 0.01    51.18 ± 0.02    47.78 ± 0.03    40.38 ± 0.02

Table 6.3: The classification accuracy of all the models trained on the XRMB dataset.

Rate       40%             30%             20%             10%
TS         96.46 ± 0.01    96.00 ± 0.01    95.42 ± 0.01    92.34 ± 0.01
M-DNN      93.70 ± 0.01    92.04 ± 0.03    89.04 ± 0.02    86.46 ± 0.01
Te_1       93.04 ± 0.01    91.78 ± 0.01    90.72 ± 0.02    87.12 ± 0.01
Te_2       78.82 ± 0.09    74.52 ± 0.02    69.66 ± 0.06    57.08 ± 0.08
CAS-AE     94.54 ± 0.01    94.26 ± 0.01    93.72 ± 0.01    91.48 ± 0.01
ADV        94.98 ± 0.01    94.42 ± 0.01    94.32 ± 0.01    91.74 ± 0.01
Subspace   86.70 ± 0.01    84.34 ± 0.02    79.76 ± 0.04    72.28 ± 0.06
DCCA       87.38 ± 0.09    84.70 ± 0.15    81.60 ± 0.16    76.72 ± 0.31
T-DNN      95.18 ± 0.01    94.92 ± 0.01    92.25 ± 0.01    92.28 ± 0.04
T-ENS      95.90 ± 0.01    94.74 ± 0.01    94.44 ± 0.01    90.50 ± 0.01
MCTN       92.24 ± 0.01    90.22 ± 0.01    88.86 ± 0.01    85.02 ± 0.02

Table 6.4: The classification accuracy of all the models trained on the MNIST dataset.
Model       Accuracy
TS          55.57 ± 0.02
M-DNN       47.43 ± 0.05
T-DNN       45.57 ± 0.02
Te12        48.57 ± 0.02
Te13        54.43 ± 0.06
Te23        52.29 ± 0.06
Te1         48.14 ± 0.01
Te2         45.14 ± 0.31
Te3         47.43 ± 0.24
CAS-AE      53.27 ± 0.02
ADV         53.04 ± 0.06
T-ENS       53.86 ± 0.02

Table 6.5: The accuracy of the models trained on the Alzheimer's disease data from [132].

6.3 Summary

In this work, we proposed a novel framework to fuse the supplementary information of multiple modalities for datasets with missing modalities. We trained a model on each modality with all the available data to obtain teacher models. Then, we used these teacher models to teach a multimodal DNN by knowledge distillation. Since the teacher models were trained on relatively larger datasets than the dataset used to train the student model, the teachers were experts on their respective modalities, and this expertise helped the student improve its performance. The experiment results on both synthetic and real-world data verified the effectiveness of the proposed method.

Figure 6.12: The top 10 important dMRI features for models trained on the union of the NACC and ADNI datasets. (a) Te2; (b) TS/M-DNN.

Chapter 7

Conclusion

In this dissertation, I propose four algorithms for multimodal learning and demonstrate how the proposed algorithms help model Alzheimer's disease. The four algorithms have different assumptions and fit different problems and data types.

The first algorithm adopts a convex combination of the modalities. It requires the feature dimensions of the modalities to be the same. One assumption of the algorithm is that the modalities interact linearly; therefore, when using this algorithm, the interactions among the modalities are expected to be linear.

The second algorithm can be applied to modalities with different dimensions. It does not require the modalities to interact linearly. The assumption of the second algorithm is that each modality has enough information on the subject and may contain some useless information. For example, brain imaging data contain not only Alzheimer's disease information but also information on brain function. Moreover, when collecting the data, instruments may be inaccurate, which makes the data noisy. Since the second algorithm learns the common part of the modalities, the noise and the irrelevant information of the modalities are not included in the modality-invariant component.

The third algorithm fuses the supplementary information of the modalities. It assumes each modality has only partial information on the subjects, so combining the information from all the modalities provides a more comprehensive description of the subjects. Since modalities may carry irrelevant information and noise, this algorithm filters out irrelevant information and noise when learning the joint representation. Therefore, this algorithm can be applied to modalities that have incomplete information on the subject, and its performance is expected to be better than that of existing algorithms when the modalities have irrelevant information and noise.

The fourth algorithm is proposed to deal with data having missing modalities. The second algorithm can also be applied to data having missing modalities; the difference is that the second algorithm assumes each modality has complete information on the subjects, whereas the fourth algorithm does not make this assumption. It is worth mentioning that the student model could be replaced by a variant of multimodal algorithms, although the student model used in this dissertation is a multimodal DNN that fuses the supplementary information of the modalities. Therefore, this algorithm can also be applied to modalities having complete information on the subject and learn the common structure of the modalities.
BIBLIOGRAPHY

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Iman Aganj, Christophe Lenglet, Neda Jahanshad, Essa Yacoub, Noam Harel, Paul M Thompson, and Guillermo Sapiro. A Hough transform global probabilistic approach to multiple-subject diffusion MRI tractography. Medical Image Analysis, 15(4):414–425, 2011.

[3] Unaiza Ahsan and Irfan Essa. Clustering social event images using kernel canonical correlation analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 800–805, 2014.

[4] Shotaro Akaho. A kernel method for canonical correlation analysis. arXiv preprint cs/0609071, 2006.

[5] H Yahya Karaman, H Demirtaş, N Imamoğlu, Yusuf Özkul, et al. Evaluation of the nucleolar organizer regions in Alzheimer's disease. Gerontology, 51(5):297–301, 2005.

[6] Zeynep Akata, Christian Thurau, and Christian Bauckhage. Non-negative matrix factorization in multimodality data for segmentation and label prediction. In 16th Computer Vision Winter Workshop, 2011.

[7] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

[8] Alzheimer's Association. 2013 Alzheimer's disease facts and figures. 2013.

[9] Galen Andrew, Raman Arora, Jeff A Bilmes, and Karen Livescu. Deep canonical correlation analysis. In ICML, pages 1247–1255, 2013.

[10] Raman Arora and Karen Livescu. Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 7135–7139. IEEE, 2013.

[11] Peter J Basser, Sinisa Pajevic, Carlo Pierpaoli, Jeffrey Duda, and Akram Aldroubi. In vivo fiber tractography using DT-MRI data. Magnetic Resonance in Medicine, 44(4):625–632, 2000.

[12] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[13] T E J Behrens, H Johansen Berg, Saad Jbabdi, M F S Rushworth, and M W Woolrich. Probabilistic diffusion tractography with multiple fibre orientations: What can we gain? NeuroImage, 34(1):144–155, 2007.

[14] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[15] Lars Bertram and Rudolph E Tanzi. Thirty years of Alzheimer's disease genetics: the implications of systematic meta-analyses. Nature Reviews Neuroscience, 9(10):768–778, 2008.

[16] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.

[17] Magnus Borga. Canonical correlation: a tutorial. Online tutorial http://people.imt.liu.se/magnus/cca, 4:5, 2001.

[18] Ulf Brefeld, Thomas Gärtner, Tobias Scheffer, and Stefan Wrobel. Efficient co-regularised least squares regression. In Proceedings of the 23rd International Conference on Machine Learning, pages 137–144. ACM, 2006.

[19] Alistair Burns, Margaret Reith, Robin Jacoby, and Raymond Levy. 'How to do it' – obtaining consent for autopsy in Alzheimer's disease. International Journal of Geriatric Psychiatry, 5(5):283–286, 1990.

[20] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[21] Lei Cai, Zhengyang Wang, Hongyang Gao, Dinggang Shen, and Shuiwang Ji. Deep adversarial learning for multi-modality missing data completion. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1158–1166. ACM, 2018.

[22] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.

[23] Emmanuel J Candes and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.

[24] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[25] Shiyu Chang, Guo-Jun Qi, Charu C Aggarwal, Jiayu Zhou, Meng Wang, and Thomas S Huang. Factorized similarity learning in networks. In ICDM, pages 60–69. IEEE, 2014.

[26] Yu-Ling Chang, Mark W Jacobson, Christine Fennema-Notestine, Donald J Hagler, Robin G Jennings, Anders M Dale, Linda K McEvoy, Alzheimer's Disease Neuroimaging Initiative, et al. Level of executive function influences verbal memory in amnestic mild cognitive impairment and predicts prefrontal and posterior cingulate thickness. Cerebral Cortex, 20(6):1305–1313, 2010.

[27] Kamalika Chaudhuri, Sham M Kakade, Karen Livescu, and Karthik Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, pages 129–136. ACM, 2009.

[28] Minmin Chen, Kilian Q Weinberger, Fei Sha, and Yoshua Bengio. Marginalized denoising auto-encoders for nonlinear representations. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1476–1484, 2014.

[29] Thomas E Conturo, Nicolas F Lori, Thomas S Cull, Erbil Akbudak, Abraham Z Snyder, Joshua S Shimony, Robert C McKinstry, Harold Burton, and Marcus E Raichle. Tracking neuronal fiber pathways in the living human brain. Proceedings of the National Academy of Sciences, 96(18):10422–10427, 1999.

[30] E H Corder, A M Saunders, W J Strittmatter, D E Schmechel, P C Gaskell, G W Small, A D Roses, J L Haines, and Margaret A Pericak-Vance. Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science, 261(5123):921–923, 1993.

[31] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[32] Maxime Descoteaux, Rachid Deriche, Thomas R Knosche, and Alfred Anwander. Deterministic and probabilistic tractography based on complex fibre orientation distributions. IEEE Transactions on Medical Imaging, 28(2):269–286, 2009.

[33] Rahul S Desikan, Florent Ségonne, Bruce Fischl, Brian T Quinn, Bradford C Dickerson, Deborah Blacker, Randy L Buckner, Anders M Dale, R Paul Maguire, Bradley T Hyman, et al. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, 31(3):968–980, 2006.

[34] D P Devanand, Ravi Bansal, Jun Liu, Xuejun Hao, Gnanavalli Pradhaban, and Bradley S Peterson. MRI hippocampal and entorhinal cortex mapping in predicting conversion to Alzheimer's disease. NeuroImage, 60(3):1622–1629, 2012.

[35] Guiguang Ding, Yuchen Guo, and Jile Zhou. Collective matrix factorization hashing for multimodal data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2075–2082, 2014.

[36] Robin D Dowell, Owen Ryan, An Jansen, Doris Cheung, Sudeep Agarwala, Timothy Danford, Douglas A Bernstein, P Alexander Rolfe, Lawrence E Heisler, Brian Chin, et al. Genotype to phenotype: a complex problem. Science, 328(5977):469–469, 2010.

[37] Craig K Enders. Applied Missing Data Analysis. Guilford Press, 2010.

[38] Otto Fabius and Joost R van Amersfoort. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581, 2014.

[39] Dean P Foster, Sham M Kakade, and Tong Zhang. Multi-view dimensionality reduction via canonical correlation analysis. Technical Report TR-2008-4, 2008.
[40] Jean A Frazier, Sufen Chiu, Janis L Breeze, Nikos Makris, Nicholas Lange, David N Kennedy, Martha R Herbert, Eileen K Bent, Vamsi K Koneru, Megan E Dieterich, et al. Structural brain magnetic resonance imaging of limbic and thalamic volumes in pediatric bipolar disorder. American Journal of Psychiatry, 162(7):1256–1265, 2005.

[41] David C Glahn, Paul M Thompson, and John Blangero. Neuroimaging endophenotypes: strategies for finding genes influencing brain structure and function. Human Brain Mapping, 28(6):488–501, 2007.

[42] Shiri Gordon, Hayit Greenspan, and Jacob Goldberger. Applying the information bottleneck principle to unsupervised clustering of discrete and continuous image representations. In null, page 370. IEEE, 2003.

[43] Hatice Gunes and Massimo Piccardi. Affect recognition from face and body: early fusion vs. late fusion. In 2005 IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages 3437–3443. IEEE, 2005.

[44] Leticia Gutiérrez-Galve, Manja Lehmann, Nicola Z Hobbs, Matthew J Clarkson, Gerard R Ridgway, Sebastian Crutch, Sebastien Ourselin, Jonathan M Schott, Nick C Fox, and Josephine Barnes. Patterns of cortical thickness according to APOE genotype in Alzheimer's disease. Dementia and Geriatric Cognitive Disorders, 28(5):461–470, 2009.

[45] Päivi Hartikainen, Janne Räsänen, Valtteri Julkunen, Eini Niskanen, Merja Hallikainen, Miia Kivipelto, Ritva Vanninen, Anne M Remes, and Hilkka Soininen. Cortical thickness in frontotemporal dementia, mild cognitive impairment, and Alzheimer's disease. Journal of Alzheimer's Disease, 30(4):857–874, 2012.

[46] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[47] Geoffrey E Hinton and Ruslan R Salakhutdinov. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, pages 1607–1614, 2009.

[48] Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.

[49] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.

[50] Winston H Hsu, Lyndon S Kennedy, and Shih-Fu Chang. Video search reranking via information bottleneck principle. In Proceedings of the 14th ACM International Conference on Multimedia, pages 35–44. ACM, 2006.

[51] Keith A Johnson, Nick C Fox, Reisa A Sperling, and William E Klunk. Brain imaging in Alzheimer disease. Cold Spring Harbor Perspectives in Medicine, 2(4):a006213, 2012.

[52] Richard Arnold Johnson, Dean W Wichern, et al. Applied Multivariate Statistical Analysis, volume 4. Prentice-Hall New Jersey, 2014.

[53] K Juottonen, M P Laakso, R Insausti, M Lehtovirta, A Pitkänen, K Partanen, and H Soininen. Volumes of the entorhinal and perirhinal cortices in Alzheimer's disease. Neurobiology of Aging, 19(1):15–22, 1998.

[54] Sham M Kakade and Dean P Foster. Multi-view regression via canonical correlation analysis. In International Conference on Computational Learning Theory, pages 82–96. Springer, 2007.

[55] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.

[56] Bu Sung Kim, Heera Kim, Jaedong Lee, and Jee-Hyong Lee. Improving a recommender system by collective matrix factorization with tag information. In Soft Computing and Intelligent Systems (SCIS), 2014 Joint 7th International Conference on and Advanced Intelligent Systems (ISIS), 15th International Symposium on, pages 980–984. IEEE, 2014.

[57] Jungsu Kim, Jacob M Basak, and David M Holtzman. The role of apolipoprotein E in Alzheimer's disease. Neuron, 63(3):287–303, 2009.

[58] Tae-Kyun Kim, Shu-Fai Wong, and Roberto Cipolla. Tensor canonical correlation analysis for action classification. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–8. IEEE, 2007.

[59] Diederik P Kingma and Max Welling. Auto-encoding variational bayes.
arXiv preprint arXiv:1312.6114, 2013.

[60] Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert Müller, Alexander Zien, and Sören Sonnenburg. Efficient and accurate lp-norm multiple kernel learning. In NIPS, pages 997–1005, 2009.

[61] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.

[62] Michael Krawczak, Susanna Nikolaus, Huberta von Eberstein, Peter J P Croucher, Nour Eddine El Mokhtari, and Stefan Schreiber. PopGen: population-based recruitment of patients and controls for the analysis of complex genotype-phenotype relationships. Public Health Genomics, 9(1):55–61, 2006.

[63] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[64] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[65] Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 393–400, 2011.

[66] Mariana Lazar, David M Weinstein, Jay S Tsuruda, Khader M Hasan, Konstantinos Arfanakis, M Elizabeth Meyerand, Benham Badie, Howard A Rowley, Victor Haughton, Aaron Field, et al. White matter tractography using diffusion tensor deflection. Human Brain Mapping, 18(4):306–321, 2003.

[67] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[68] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

[69] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2001.

[70] Alex D Leow, Siwei Zhu, Liang Zhan, Katie McMahon, Greig I de Zubicaray, Matthew Meredith, M J Wright, A W Toga, and P M Thompson. The tensor distribution function. Magnetic Resonance in Medicine, 61(1):205–214, 2009.

[71] Chia-Chan Liu, Takahisa Kanekiyo, Huaxi Xu, and Guojun Bu. Apolipoprotein E and Alzheimer disease: risk, mechanisms and therapy. Nature Reviews Neurology, 9(2):106–118, 2013.

[72] Jun Liu, Jianhui Chen, and Jieping Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–556. ACM, 2009.

[73] Xiaoxiao Liu, Marc Niethammer, Roland Kwitt, Nikhil Singh, Matt McCormick, and Stephen Aylward. Low-rank atlas image analyses in the presence of pathologies. IEEE Transactions on Medical Imaging, 34(12):2583–2591, 2015.

[74] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[75] Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.

[76] David Meunier, Renaud Lambiotte, Alex Fornito, Karen D Ersche, and Edward T Bullmore. Hierarchical modularity in human brain functional networks. Hierarchy and Dynamics in Neural Networks, 1(2), 2010.

[77] Sebastian Mika, Bernhard Schölkopf, Alex J Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems, pages 536–542, 1999.

[78] Susumu Mori, Barbara J Crain, V P Chacko, and Peter Van Zijl. Three-dimensional tracking of axonal projections in the brain by magnetic resonance imaging. Annals of Neurology, 45(2):265–269, 1999.

[79] Abdullah-Al Nahid and Yinan Kong. Involvement of machine learning for breast cancer image classification: a survey. Computational and Mathematical Methods in Medicine, 2017, 2017.

[80] John A Nelder and R Jacob Baker. Generalized linear models. Encyclopedia of Statistical Sciences, 1972.
[81] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management, pages 86–93. ACM, 2000.

[82] Eini Niskanen, Mervi Könönen, Sara Määttä, Merja Hallikainen, Miia Kivipelto, Silvia Casarotto, Marcello Massimini, Ritva Vanninen, Esa Mervaala, Jari Karhu, et al. New insights into Alzheimer's disease progression: a combined TMS and structural MRI study. PLoS One, 6(10):e26113, 2011.

[83] Yongsheng Pan, Mingxia Liu, Chunfeng Lian, Tao Zhou, Yong Xia, and Dinggang Shen. Synthesizing missing PET from MRI with cycle-consistent generative adversarial networks for Alzheimer's disease diagnosis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 455–463. Springer, 2018.

[84] Geoffrey J M Parker, Hamied A Haroon, and Claudia A M Wheeler-Kingshott. A framework for a streamline-based probabilistic index of connectivity (PICo) using a structural interpretation of MRI diffusion measurements. Journal of Magnetic Resonance Imaging, 18(2):242–254, 2003.

[85] Ronald C Petersen, Rachelle Doody, Alexander Kurz, Richard C Mohs, John C Morris, Peter V Rabins, Karen Ritchie, Martin Rossor, Leon Thal, and Bengt Winblad. Current concepts in mild cognitive impairment. Archives of Neurology, 58(12):1985–1992, 2001.

[86] Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6892–6899, 2019.

[87] Snehashis Roy, John A Butman, Daniel S Reich, Peter A Calabresi, and Dzung L Pham. Multiple sclerosis lesion segmentation from brain MRI via fully convolutional neural networks. arXiv preprint arXiv:1803.09172, 2018.

[88] Javad Salimi Sartakhti, Mohammad Hossein Zangooei, and Kourosh Mozafari. Hepatitis disease diagnosis using a novel hybrid method based on support vector machine and simulated annealing (SVM-SA). Computer Methods and Programs in Biomedicine, 108(2):570–579, 2012.

[89] Marzia A Scelsi, Raiyan R Khan, Marco Lorenzi, Leigh Christopher, Michael D Greicius, Jonathan M Schott, Sebastien Ourselin, and Andre Altmann. Genetic study of multimodal imaging Alzheimer's disease progression score implicates novel loci. Brain, 141(7):2167–2180, 2018.

[90] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[91] Tijn M Schouten, Marisa Koini, Frank de Vos, Stephan Seiler, Jeroen van der Grond, Anita Lechner, Anne Hafkemeijer, Christiane Möller, Reinhold Schmidt, Mark de Rooij, et al. Combining anatomical, diffusion, and resting state functional magnetic resonance imaging for individual classification of mild and moderate Alzheimer's disease. NeuroImage: Clinical, 11:46–51, 2016.

[92] Alexander G Schwing and Raquel Urtasun. Fully connected deep structured networks. arXiv preprint arXiv:1503.02351, 2015.

[93] Dennis J Selkoe. Amyloid β-protein and the genetics of Alzheimer's disease. Journal of Biological Chemistry, 271(31):18295–18298, 1996.

[94] Alexander Shapiro. Monte Carlo sampling methods. Handbooks in Operations Research and Management Science, 10:353–425, 2003.

[95] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of ICML Workshop on Learning with Multiple Views, pages 74–79, 2005.

[96] Vikas Sindhwani and David S Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In Proceedings of the 25th International Conference on Machine Learning, pages 976–983. ACM, 2008.

[97] Noam Slonim, Rachel Somerville, Naftali Tishby, and Ofer Lahav. Objective classification of galaxy spectra using the information bottleneck method. Monthly Notices of the Royal Astronomical Society, 323(2):270–284, 2001.
[98] Noam Slonim and Naftali Tishby. Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 208–215. ACM, 2000.

[99] Cees G M Snoek, Marcel Worring, and Arnold W M Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 399–402, 2005.

[100] Patricia A Soranno, Linda C Bacon, Michael Beauchene, Karen E Bednar, Edward G Bissell, Claire K Boudreau, Marvin G Boyer, Mary T Bremigan, Stephen R Carpenter, Jamie W Carr, et al. LAGOS-NE: a multi-scaled geospatial and temporal database of lake ecological context and water quality for thousands of US lakes. GigaScience, 6(12):1–22, 2017.

[101] Olaf Sporns. Networks of the Brain. MIT Press, 2011.

[102] Olaf Sporns, Giulio Tononi, and Rolf Kötter. The human connectome: a structural description of the human brain. PLoS Comput Biol, 1(4):e42, 2005.

[103] Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222–2230, 2012.

[104] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996, 2014.

[105] Qiuling Suo, Weida Zhong, Fenglong Ma, Ye Yuan, Jing Gao, and Aidong Zhang. Metric learning on healthcare data with incomplete modalities. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 3534–3540. AAAI Press, 2019.

[106] Qiaoyu Tan, Guoxian Yu, Carlotta Domeniconi, Jun Wang, and Zili Zhang. Incomplete multi-view weak-label learning. In IJCAI, pages 2703–2709, 2018.

[107] Ryan Tibshirani. Proximal gradient descent and acceleration. Lecture Notes, 2010.

[108] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

[109] Naftali Tishby and Noam Slonim. Data clustering by markovian relaxation and the information bottleneck method. In Advances in Neural Information Processing Systems, pages 640–646, 2001.

[110] Lucas R Trambaiolli, Ana C Lorena, Francisco J Fraga, Paulo A M Kanda, Renato Anghinah, and Ricardo Nitrini. Improving Alzheimer's disease diagnosis with machine learning techniques. Clinical EEG and Neuroscience, 42(3):160–165, 2011.

[111] Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin. Missing modalities imputation via cascaded residual autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1405–1414, 2017.

[112] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.

[113] Grigorios Tzortzis and Aristidis Likas. Kernel-based weighted multi-view clustering. In 2012 IEEE 12th International Conference on Data Mining, pages 675–684. IEEE, 2012.

[114] Manik Varma and Bodla Rakesh Babu. More generality in efficient multiple kernel learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1065–1072. ACM, 2009.

[115] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.

[116] Qi Wang, Lei Guo, Paul M Thompson, Clifford R Jack Jr, Hiroko Dodge, Liang Zhan, Jiayu Zhou, Alzheimer's Disease Neuroimaging Initiative, et al. The added value of diffusion-weighted MRI-derived structural connectome in evaluating mild cognitive impairment: A multi-cohort validation. Journal of Alzheimer's Disease, 64(1):149–169, 2018.
[117] Qi Wang, Liang Zhan, Paul M Thompson, Hiroko H Dodge, and Jiayu Zhou. Discriminative fusion of multiple brain networks for early mild cognitive impairment detection. In Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on, pages 568–572. IEEE, 2016.

[118] Qianqian Wang, Zhengming Ding, Zhiqiang Tao, Quanxue Gao, and Yun Fu. Partial multi-view clustering via consistent GAN. In 2018 IEEE International Conference on Data Mining (ICDM), pages 1290–1295. IEEE, 2018.

[119] Weiran Wang, Raman Arora, Karen Livescu, and Jeff A Bilmes. On deep multi-view representation learning. In ICML, pages 1083–1092, 2015.

[120] Weiran Wang, Raman Arora, Karen Livescu, and Jeff A Bilmes. Unsupervised learning of acoustic features via deep canonical correlation analysis. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 4590–4594. IEEE, 2015.

[121] Yishu Wang, Dejie Yang, and Minghua Deng. Low-rank and sparse matrix decomposition for genetic interaction data. BioMed Research International, 2015, 2015.

[122] John Westbury, Paul Milenkovic, Gary Weismer, and Raymond Kent. X-ray microbeam speech production database. The Journal of the Acoustical Society of America, 88(S1):S56–S56, 1990.

[123] Jennifer Williams, Steven Kleinegesse, Ramona Comanescu, and Oana Radu. Recognizing emotions in video using multimodal DNN feature fusion. In Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pages 11–19, 2018.

[124] World Health Organization. Dementia fact sheet no. 362. Retrieved at https://web.archive.org/web/20150318030901/http://www.who.int/mediacentre/factsheets/fs362/en, 2016. Retrieved 13 January 2016.

[125] Stephen J Wright, Robert D Nowak, and Mário A T Figueiredo. Sparse reconstruction by separable approximation. Signal Processing, IEEE Transactions on, 57(7):2479–2493, 2009.

[126] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.

[127] Chang Xu, Dacheng Tao, and Chao Xu. Large-margin multi-view information bottleneck. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1559–1572, 2014.

[128] Liu Yang, Liping Jing, and Michael K Ng. Robust and non-negative collective matrix factorization for text-to-image transfer learning. IEEE Transactions on Image Processing, 24(12):4701–4714, 2015.

[129] Tao Yang, Jun Liu, Pinghua Gong, Ruiwen Zhang, Xiaotong Shen, and Jieping Ye. Absolute fused lasso & its application to genome-wide association studies. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

[130] Tao Yang, Jie Wang, Qian Sun, Derrek P Hibar, Neda Jahanshad, Li Liu, Yalin Wang, Liang Zhan, Paul M Thompson, and Jieping Ye. Detecting genetic risk factors for Alzheimer's disease in whole genome sequence data via Lasso screening. In Biomedical Imaging (ISBI), 2015 IEEE 12th International Symposium on, pages 985–989. IEEE, 2015.

[131] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 765–774. IEEE, 2012.

[132] Lei Yuan, Yalin Wang, Paul M Thompson, Vaibhav A Narayan, Jieping Ye, Alzheimer's Disease Neuroimaging Initiative, et al. Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data. NeuroImage, 61(3):622–632, 2012.

[133] Liang Zhan, Neda Jahanshad, Yan Jin, Arthur W Toga, Katie L McMahon, Greig de Zubicaray, Nicholas G Martin, Margaret J Wright, Paul M Thompson, et al. Brain network efficiency and topology depend on the fiber tracking method: 11 tractography algorithms compared in 536 subjects. In 10th International Symposium on Biomedical Imaging (ISBI), pages 1134–1137. IEEE, 2013.
[134] Liang Zhan, Jiayu Zhou, Yalin Wang, Yan Jin, Neda Jahanshad, Gautam Prasad, Talia M Nir, Cassandra D Leonardo, Jieping Ye, Paul M Thompson, et al. Comparison of nine tractography algorithms for detecting abnormal structural brain networks in Alzheimer's disease. Frontiers in Aging Neuroscience, 7, 2015.

[135] Daoqiang Zhang, Yaping Wang, Luping Zhou, Hong Yuan, Dinggang Shen, Alzheimer's Disease Neuroimaging Initiative, et al. Multimodal classification of Alzheimer's disease and mild cognitive impairment. NeuroImage, 55(3):856–867, 2011.

[136] Wenming Zheng, Xiaoyan Zhou, Cairong Zou, and Li Zhao. Facial expression recognition using kernel canonical correlation analysis (KCCA). IEEE Transactions on Neural Networks, 17(1):233–238, 2006.

[137] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. In Advances in Neural Information Processing Systems, pages 702–710, 2011.

[138] Jiayu Zhou, Jun Liu, Vaibhav A Narayan, Jieping Ye, Alzheimer's Disease Neuroimaging Initiative, et al. Modeling disease progression via multi-task learning. NeuroImage, 78:233–248, 2013.

[139] Jiayu Zhou, Zhaosong Lu, Jimeng Sun, Lei Yuan, Fei Wang, and Jieping Ye. FeaFiner: biomarker identification from medical data through feature generalization and selection. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1034–1042. ACM, 2013.

[140] Jiayu Zhou, Fei Wang, Jianying Hu, and Jieping Ye. From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records. In SIGKDD, pages 135–144. ACM, 2014.

[141] Jiayu Zhou, Lei Yuan, Jun Liu, and Jieping Ye. A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 814–822. ACM, 2011.

[142] Hongtu Zhu, Zakaria Khondker, Zhaohua Lu, and Joseph G. Ibrahim. Bayesian generalized low rank regression models for neuroimaging phenotypes and genetic markers. Journal of the American Statistical Association, 109(507):977–990, 2014.

[143] Chen Zu, Biao Jie, Mingxia Liu, Songcan Chen, Dinggang Shen, Daoqiang Zhang, Alzheimer's Disease Neuroimaging Initiative, et al. Label-aligned multi-task feature learning for multimodal classification of Alzheimer's disease and mild cognitive impairment. Brain Imaging and Behavior, 10(4):1148–1159, 2016.