STATISTICALMACHINELEARNINGTHEORYANDMETHODSFOR HIGH-DIMENSIONALLOWSAMPLESIZE(HDLSS)PROBLEMS By KaixuYang ADISSERTATION Submittedto MichiganStateUniversity inpartialful˝llmentoftherequirements forthedegreeof ctorofPhilosophy 2020 ABSTRACT STATISTICALMACHINELEARNINGTHEORYANDMETHODSFOR HIGH-DIMENSIONALLOWSAMPLESIZE(HDLSS)PROBLEMS By KaixuYang High-dimensionallowsamplesize(HDLSS)dataanalysishavebeenpopularnowadays instatisticalmachinelearning˝elds.Suchapplicationsinvolvesahugenumberoffeatures orvariables,butsamplesizeislimitedduetoreasonssuchascost,ethnicityandetc.It isimportantto˝ndapproachestolearntheunderlyingrelationshipsviaasmallfraction ofdata.Inthisdissertation,westudythestatisticalpropertiesforsomenon-parametric machinelearningmodelsthatdealwiththeseproblemsandapplythesemodelstovarious ˝eldsforvalidation. InChapter2,westudythegeneralizedadditivemodelinthehigh-dimensionalsetup withagenerallinkfunctionthatbelongtotheexponentialfamily.Weapplyatwo-step approachtodovariableselectionandestimation:agrouplassostepasaninitialestima- tor,thenfollowedbyaadaptivegrouplassosteptoobtain˝nalvariablesandestimations. Weshowthatundercertainconditions,thetwo-stepapproachconsistentlyselectsthetruly nonzerovariablesandderivedtheestimationrateofconvergence.Moreover,weshowthat thetuningparameterthatminimizesthegeneralizedinformationcriterion(GIC)hasasymp- toticallyminimumrisk.Simulationsinvariableselectionandestimationaregiven.Realdata examplesincludingspamemailandprostatecancergeneticdataarealsousedtosupport thetheory.Moreover,wediscussedthepossibilityofusinga l 0 normpenalty. InChapter3,westudyashallowneuralnetworkmodelinthehigh-dimensionalclassi˝- cationsetup.Thesparsegrouplasso,alsoknownasthe l p; 1 + l 1 normpenalty,isapplied toobtainfeaturesparsityandasparsenetworkstructure.Neuralnetworkscanbeusedto approximateanycontinuousfunctionwithanarbitrarysmallapproximationerrorgiventhat thenumberofhiddennodesislargeenough,whichisknownastheuniversalapproximation theorem.Therefore,neuralnetworksareusedtomodelcomplicatedrelationshipsbetween theresponseandpredictorswithhugeinteractions.Weprovedthatundercertainconditions, thesparsegrouplassopenalizedshallowneuralnetworkhasclassi˝cationrisktendtothe Bayesrisk,whichistheoptimalamongallpossiblemodels.Realdataexamplesincluding prostatecancergeneticdata,Alzheimer'sdisease(AD)magneticresonanceimaging(MRI) dataandautonomousdrivingdataareusedtosupportthetheory.Moreover,weproposeda l 0 + l 1 penaltyandshowedthatthesolutioncanbeformulatedasanmixedintegersecond orderconeoptimization(MISOCO)problem. InChapter4,weproposeastage-wisevariableselectiontechniquewithdeepneural networksinthehigh-dimensionalsetup,namedensembleneuralnetworkselection(ENNS). Weapplyanensembleonthestage-wiseneuralnetworkvariableselectionmethodtofurther thefalselyselectedvariables,whichisshowntobeabletoconsistently˝lteroutunwanted variablesandselectedthetrulynonzerovariablesundercertainconditions.Moreover,we proposedasecondapproachtofurthersimplifytheneuralnetworkstructurebyspecifyingthe desiredpercentageofnonzeroparametersineachhiddenlayer.Atypeofcoordinatedescent algorithmisproposedtoobtainthesolutionfromthesecondstep.Wealsoshowthatthetwo stepapproachachievesuniversalconsistencyforbothregressionandclassi˝cationproblems. Simulationsarestudiedtosupportvariousarguments.Realdataexamplesincludingthe ribo˛avinproductiondata,theprostatecancergeneticdataandaregionofinterest(ROI) inMRIdataareusedtovalidatethemethod. 
Idedicatethisdissertationtomyparents,XiangyangXuandZhidongYang,mywife, JunLiu,andmanyfriends,whoallhavebeensupportingmeduringthishardtime. iv ACKNOWLEDGMENTS Here,IwouldliketoexpressmydeepestgratitudetomyadvisorDr.TapabrataMaitiforhis guidancetowardsmystudiesandresearches.Dr.Maitiisextremelykindandknowledgeable. Healwaysprovideconstructiveinsightsandsuggestionsthathelpmemakegreatprogress. WithoutDr.Maiti'sguidance,Iwouldn'tbeabletohavesuchaprofoundunderstandingin thisgorgeous˝eld. Iwouldalsoliketoextendmysincereappreciationtomydissertationcommitteemem- bers,Dr.LyudmilaSakhanenko,Dr.Ping-shouZhongandDr.DavidZhu.Theircomments andsuggestionsareextremelybene˝cialformyresearch. IamalsogratefultothehelpIobtainedfromalltheprofessorsintheDepartmentof StatisticsandProbability,bothacademicallyandinotheraspects.DuringmyPh.D.life, Ilearnalotfromthecourseso˙eredbytheseexcellentprofessors,andI'mabletoapply theknowledgeIlearntomyresearch.TheprofessorswhoIhaveworkedwithasateaching assistantalsoimpressmealotbythewaytheydealwithdi˙erentkindsofdi˚culties. DuringmysevenyearsatMichiganStateUniversity,Imadelotsoffriendsandbecause ofthem,Ineverfeellonelyintheseyears.AlotofthankstomyfriendsatMSU.Theyeither havegraduatedandbecomesuccessfulfacultymembersorstatisticiansinbigcompanies, orarestillworkinghardtopursuetheirPh.D.degree.Isincerelywishallofyouhavea wonderfulfuture. Lastbutnotleast,Iwouldliketoexpressmysincerethankstomyparentsfortheir supportandconcerns,aswellasmywifeforheraccompanyandworkinghardtogether towardsagloriousfuture. v TABLEOFCONTENTS LISTOFTABLES .................................... viii LISTOFFIGURES .................................... xi LISTOFALGORITHMS ................................ xii Chapter1Introduction ............................... 1 1.1Overview......................................1 1.2SparsityinHigh-dimensionalModeling.....................3 1.2.1ChoosingtheTuningParameters....................6 1.2.2AlgorithmsforTrainingSparseModels.................7 1.2.3Stage-wiseSelection...........................8 1.3TheProjectionApproach.............................9 1.4Non-parametricModeling............................12 1.4.1BasisExpansion..............................12 1.4.2NeuralNetworks.............................13 1.4.3DeepNeuralNetworks..........................15 Chapter2High-dimensionalGeneralizedAdditiveModeling ........ 18 2.1Introduction....................................18 2.2Model.......................................22 2.3Methodology&TheoreticalProperties.....................29 2.3.1FirstStep:ModelScreening.......................29 2.3.2SecondStep:PostSelection.......................36 2.4TuningParameterSelection...........................43 2.5OtherPossiblePenalty..............................46 2.5.1TheL0NormPenalty..........................46 2.5.2TheL0andL1NormPenalty......................48 2.6NumericalProperties...............................49 2.6.1SimulatedExamples...........................50 2.6.1.1LogisticRegression.......................51 2.6.1.2Otherlinkfunctions......................57 2.6.2RealDataexamples............................58 2.7Discussion.....................................65 Chapter3SparseNeuralnetwork ......................... 
67 3.1Introduction....................................67 3.2TheBinaryClassi˝cationProblem.......................70 3.3TheConsistencyofNeuralNetworkClassi˝cationRisk............77 3.4Simulation.....................................82 3.4.1DNPSimulation:Revisit.........................82 vi 3.4.2SmallerSampleSizeCase........................83 3.5RealDataexamples................................85 3.5.1example1:ProstateCancerData....................85 3.5.2example2:MRIDataforAlzheimer'sDisease.............86 3.5.3example:KITTIAutonomousDrivingData..............89 3.6Discussion.....................................92 3.6.1The l 1 + l 0 PenaltyandAlgorithm...................92 Chapter4EnsembleNeuralNetworkSelection(ENNS) ........... 98 4.1Introduction....................................99 4.2TheTwo-stepVariableSelectionandEstimationApproach..........102 4.2.1TheEnsembleNeuralNetworkSelection(ENNS)Algorithm.....103 4.2.2EstimationWithRegularization.....................110 4.2.2.1DroppingOutandBagging..................110 4.2.2.2Stage-wiseTraining.......................111 4.2.2.3L1NormRegularization....................112 4.3TheoreticalGuarantee..............................113 4.4SimulationStudy.................................121 4.4.1Stage-wiseCorrectSelectionProbabilityDecreasingStudy......121 4.4.2FalsePositiveRateStudy........................122 4.4.3VariableSelectionSimulationStudy...................123 4.4.4EstimationSimulationStudy......................126 4.4.5VariableSelectionandEstimation....................127 4.4.6CorrelatedPredictors...........................127 4.5RealDataexamples................................132 4.5.1VariableSelection:MRIData......................132 4.5.2Regression:Ribo˛avinProductionData................135 4.5.3Classi˝cation:ProstateCancerData..................136 4.6Conclusion.....................................140 Chapter5Epilogue .................................. 141 APPENDICES ...................................... 143 APPENDIXATechnicalDetailsandSupplementaryMaterialsforChapter2...144 APPENDIXBTechnicalDetailsandSupplementaryMaterialsforChapter3...198 APPENDIXCTechnicalDetailsandSupplementaryMaterialsforChapter4...206 BIBLIOGRAPHY .................................... 
233 vii LISTOFTABLES Table2.1:Simulationresultsforthetwo-stepapproachcomparedwiththeLasso, GAMSELandGAMBoostinthethreecasesofexample2.1.NV,average numberofthevariablesbeingselected;TPR,thetruepositiverate;FPR, thefalsepositiverate;andPE,predictionerror(hereisthemisclassi˝cation rate).Resultsareaveragedover100repetitions.Enclosedinparentheses arethecorrespondingstandarderrors.....................54 Table2.2:Simulationresultsforthetwo-stepapproachcomparedwiththeLasso, GAMSELandGAMBoostinexample2.2withcorrelation0.3and0.7for n =100 , p =200 and s =3 .NV,averagenumberofthevariablesbeing selected;TPR,thetruepositiverate;FPR,thefalsepositiverate;and PE,predictionerror(hereisthemisclassi˝cationrate).Resultsareaver- agedover100repetitions.Enclosedinparenthesesarethecorresponding standarderrors.................................56 Table2.3:Simulationresultsforthetwo-stepapproachcomparedwiththeLasso, GAMSELandGAMBoostinexample2.3,with n =100 , p =200 , s =3 andsignalstrengthreduced.NV,averagenumberofthevariablesbeing selected;TPR,thetruepositiverate;FPR,thefalsepositiverate;and PE,predictionerror(hereisthemisclassi˝cationrate).Resultsareaver- agedover100repetitions.Enclosedinparenthesesarethecorresponding standarderrors.................................57 Table2.4:Simulationresultsforthetwo-stepapproachcomparedwiththeLasso, GAMSELandGAMBoostinexample2.4forPoissonregressionandGamma regressionwith n =100 , p =200 and s =3 .NV,averagenumberof thevariablesbeingselected;TPR,thetruepositiverate;FPR,thefalse positiverate;andPE,predictionerror(hereisthemisclassi˝cationrate). Resultsareaveragedover100repetitions.Enclosedinparenthesesarethe correspondingstandarderrors.TheGAMBoostmethoddoesnotsupport Gammaregressionwithnon-canonicallinkfunction,whilethecanonical linkfallsoutsideofrange,thereforeitdoesnotsupportGammaregression.59 Table3.1:TheAUCandF1scoreofthecomparedmodelsinthesimulationstudy. 
Standarderrorsaregivenintheparentheses.................84 Table3.2:TheAUCandF1scoreofthecomparedmodelsinasmallersamplesize scenariowith m =5 .Standarderrorsaregivenintheparentheses.....84 Table3.3:Testaccuracywithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentclassi˝ersintheProstategenedataexample.....87 viii Table3.4:Testaccuracywithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentclassi˝ersintheMRIAlzheimer'sdiseaseexample..89 Table3.5:Testaccuracywithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentclassi˝ersintheKITTIautonomousdrivingexample.91 Table3.6:TheMISOCOformulationforthe l 1 + l 0 penaltyindealingwiththe l 0 normstep....................................96 Table4.1:Theproportionofcorrectvariableselectionafter0-4correctvariablesin themodel,fordi˙erentcasesover1000repetitions.Theresultsshowthe mean.Theresultsshowthreedi˙erentdatagenerationstructures:linear, additivenon-linearandneuralnetworkforbothregressionandclassi˝cation.122 Table4.2:SelectionfalsepositiverateaverageoftheENNSandDNPunderdi˙erent numberoftruevariablesin101repetitions.Standarddeviationsaregiven inparenthesis..................................123 Table4.3:VariableselectioncapacityofENNSandothermethodswithlowsignal strengthintheregression(top)andclassi˝cation(bottom)setup.The numbersreportedaretheaveragenumberofselectedvariableswhichare trulynonzero.Thestandarderrorsaregiveninparenthesis.........124 Table4.4:VariableselectioncapacityofENNSandothermethodswithnormalsignal strength.Thenumbersreportedaretheaveragenumberofselectedvari- ableswhicharetrulynonzero.Thestandarderrorsaregiveninparenthesis.125 Table4.5:VariableselectioncapacityofENNSandothermethodswithhighsignal strength.Thenumbersreportedaretheaveragenumberofselectedvari- ableswhicharetrulynonzero.Thestandarderrorsaregiveninparenthesis.126 Table4.6:Predictionresultsonthetestingsetusingneuralnetworkswithandwithout l 1 normregularizationfor s =2 ; 5 ; 10 .RMSEisrootedmeansquarederror, MAEismeanabsoluteerror,andMAPEismeanabsolutepercenterror. 
Accuracyisthepercentageofcorrectprediction,aucisareaundertheROC curve,andf1scoreistheinverseofinverseprecisionplustheinverserecall.129 Table4.7:ModelperformanceofthecombinationofENNSalgorithmand l 1 threshold- ingestimation,comparedwithDNP,LassoandHSIC-Lassofor s =2 ; 5 ; 10 casesinbothregressionandclassi˝cation.Theaverageperformanceof101 repetitionswiththeirstandarderrorsinparenthesisarepresented.....130 ix Table4.8:Selectionandestimationcomparisonforpredictorswithcorrelation0,0.3 and0.7.Thenumberofnonzeropredictorsissetto5.Forselection, theaveragenumberofcorrectselectedvariableswithitsstandarderror isgiven.ForestimationtheaverageRMSEorAUCwiththeirstandard errorsisgiven.Theresultsareaveragedover101repetitions........131 Table4.9:VariableselectionresultfortheADdata.Thetableincludesallbiolog- icallyimportantvariableswiththreelevels:red(veryimportant),yellow (secondlyimportant)andgreen(thirdlyimportant).Thenon-important variablesarenotincludedinthemodel.Checkmarksindicatewhetherthe correspondingalgorithmselectedthevariableornot.............134 Table4.10:VariableselectionresultfortheADdata.Thereportednumbersinclude IS,theweightedaverageofselectedimportantvariableswiththeweights being3,2and1forred(mostimportant),yellow(secondlyimportant)and green(thirdlyimportant),respectively;NI,numberofimportantvariables selected;andNU,numberofunimportantvariablesselected.........135 Table4.11:TestMSEwithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentmodelsintheribo˛avingenedataexample......136 Table4.12:Testaccuracywithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentclassi˝ersintheProstategenedataexample.....137 x LISTOFFIGURES Figure1.1:Adiagramforthesingle-layerneuralnetworkmodel...........14 Figure2.1:Theclassi˝cationaccuracyagainstthenumberofnonzerovariablesmea- suredonatestingsetforexample2.5over50repetitions.Thetwo-step approach,thelogisticregressionwithLasso,the l 1 normpenalizedSVM andthesparsegrouplassoneuralnetworkareincludedincomparison..61 Figure2.2:Theestimatedfunctionsforthemostfrequentlyselectedfunctionsfor example2.5...................................62 Figure2.3:Theclassi˝cationaccuracyagainstthenumberofnonzerovariablesmea- suredonatestingsetforexample2.6over500repetitions.Thetwo-step approach,thelogisticregressionwithLasso,the l 1 normpenalizedSVM andthesparsegrouplassoneuralnetworkareincludedincomparison..63 Figure2.4:Theestimatedfunctionsforthemostfrequentlyselectedfunctionsordered bydescendinginfrequencyforexample2.6.................64 Figure2.5:ThetestingMSEagainstthenumberofnonzerovariablesmeasuredon atestingsetforexample2.7over50repetitions.Thetwo-stepapproach andlogarithmtransformationwiththeLassoareincludedincomparison.65 Figure3.1:exampleimagefromtheKITTI2Dobjectdetectiondataset.......89 Figure3.2:exampleimagesofpedestriansandcarsafterpre-processing.......90 Figure3.3:Testaccuracyscorevssparsitylevelinthethreeexamples.........91 Figure4.1:Testingmeansquarederror(MSE)fordi˙erentmodelsontheribo˛avin data.......................................138 Figure4.2:Testingaccuracyfordi˙erentmodelsontheprostatecancerdata.....139 xi LISTOFALGORITHMS Algorithm1:Trainingclassi˝cationneuralnetworkwith l 1 + l 0 penalty........97 Algorithm2:FeatureselectioninENNS.........................109 Algorithm3: l 1 normestimationusingcoordinatedescent...............113 xii Chapter1 Introduction InthisChapter,wewillstatetheproblemsthatwestudyanddiscusssomeexistingwork thathavesigni˝cantimpacttothis˝eld. 
1.1Overview Duringthepastdecades,high-dimensionaldataanalysisbecameincreasingmorepopular. [11]de˝nesthatgh-dimensionalstatisticsreferstostatisticalinferencewhenthenumber ofunknownparametersisofmuchlargerorderthansamplesize."Thiswillbethede˝nition ofhigh-dimensionalstatisticsthroughthewholethesis.Sometimes,by peoplemayrefertoafeaturespacewhosedimensionisgreaterthan2whenconstructing datavisualization,seeforexample[3],thisisnotthewerefertointhis thesis.Let n bethesizeofsamplewehave,andlet p bethenumberofunknownparameters involved,thuswehave p>>n (1.1) inhigh-dimensionalstatistics.Consideralinearregressionmodel y i = + p X j =1 j x ij + i ;i =1 ;:::;n; (1.2) 1 with p>>n .It'sobviousthattheestimationoftheunknownparameters 1 ;:::; p can notbeestimatedwithoutproperassumptions,suchasthesparsityassumption,whichwe willdiscussinthenextsection.Forexample,theleastsquaredestimator ^ ^ 1 ;:::; ^ p =argmin 1 p 1 n n X i =1 ( y i j x ij ) 2 isunder-determinedthushasin˝nitelymanysolution. Variousmethodshavebeenproposedtosolvethisissuesince[39]broughttheproblem topeople'ssight,wheretheauthorsconsideredanorthogonaldesigninthecase n = p .In therestofthisChapter,wewillreviewtheworkthathasbeendoneinhigh-dimensional statistics˝eld. Ontheotherhand,non-parametricmodelinghasbeenpopularforseveraldecades.Non- parametricresearchhappenedasearlyasin1947,when[133]studiedalocalaveraging approach.Someliteratureregardingnon-parametricresearcheventracebacktothe1930's aboutpartitioningestimation,butthiswasnotfullystudiedbythen.As[61]states,in non-parametricmodeling,onedoesnotrestricttheclassofpossiblerelationshipbetweenthe response y andthepredictors x 1 ;:::;x p ,butassumethat y dependson x 1 ;:::;x p througha generalfunction f ( x 1 ;:::;x p ) .Di˙erentapproacheshavebeenusedtoapproximate f ( ) in aworkableway,includingaveraging,partitioning,basisexpansion,neuralnetworkapproxi- mation,andetc.Furtherassumptionscanalsobemadeon f ( ) ,forexample,assumingan additivestructurewillresultinthegeneralizedadditivemodel(GAM),seeforexample[63], de˝nedas y i = + p X j =1 f j ( x ij )+ i ;i =1 ;:::;n: (1.3) Therelationshipisnotasgeneralasthe f ( x 1 ;:::;x p ) wejustde˝ned,butismuchmore 2 handy.Wewilldiscussthesenon-parametricworkintherestofthisChapter. 1.2SparsityinHigh-dimensionalModeling Assumingsparsityisane˙ectivewayofdealingwiththehigh-dimensionality,i.e.,weassume thatonly s ofthe p variablesareinvolvedinpredictingtheresponse y .Richliteraturehave assumedsparsity,forexamplesee[126,46,155,148,85,47,65,91].Takingthelinear regressionmodelasanexample,let 2 R p betheparameters.Adirectmethodtoobtain sparsityistorestrictthenumberofnonzeroparameters,i.e., ^ =argmin 2 R p 1 n k y k 2 2 ;s:t: k k 0 K; (1.4) forsomepositiveinteger K ,where y 2 R n istheresponsevectorand x 2 R n p isthedesign matrix.The l 0 normisde˝nedas k k 0 = f # of j : j 6 =0 ;j =1 ;:::;p g : (1.5) ThisoptimizationproblemcanbewrittenintotheLagrangianformas ^ =argmin 2 R p 1 n k y k 2 2 + k k 0 (1.6) forsomecorresponding of K .However,optimizingalossfunctionwithzeronormhas beenprovedtobeanon-deterministicpolynomial-timehardness(NP-hard)problem,see [99],whichrequiresexponentialtimetosolve. Fortunately,ithasbeenprovedthatinsteadofdirectlypenalizingthenumberofnonzero 3 coe˚cientsby l 0 norm, l 1 normpenaltyisabletoshrinksomecoe˚cientstozero,andthus thefeaturescorrespondingtothesecoe˚cientsarenotincludedinthemodel.Asafamous pieceofwork,[126]proposedthemethodleastabsoluteshrinkageandselectionoperator (Lasso)inthelinearregressionsetup,whichyieldssparseestimationsfortheparameters. 
Themodelconsidersa l 1 normpenalizedleastsquaremodel,i.e., ^ =argmin 2 R p 1 n k y k 2 2 + p X j =1 j j j ; (1.7) forsomehyper-parameter .TheLassowasnotintendedforhigh-dimensionalmodeling,but yieldssparsityintheestimators,i.e.,withproperchoiceof ,theestimatedparametersare shrunktowardszeroandsomeofthemwillbeexactlyzero.Anumberofstatisticianshave studiedthepropertiesoftheLasso,forexample,[73]derivedtheasymptoticdistributionsof theLassoestimatorandshoweditsestimationconsistency.Laterpeoplefoundthatthis l 1 normpenaltyworkswellinthehigh-dimensionalsetup,bothpracticallyandtheoretically. [60]derivedthe l 1 normboundforapersistentsequenceofestimatorintheLassomodel.[38] considersthehigh-dimensionalcaseandconsideredminimizingthe l 1 normoftheparameters restrictedtozerotrainingerror.Heshowsthatthismethodconsistentlyidentifythetrue subsetofnonzerovariablesgiventhatthetruemodelissparse.[95]proposedaneighborhood selectionschemethatconsistentlyidenti˝esthenonzeroentriesinthecovariancematrixofa multivariatenormaldistribution.[152]showedtheselectionconsistencyoftheLassounder theirrepresentablecondition.[149]studiedtwodi˙erentsparsityassumptionsinthelinear model,thestrongsparsity max j = s +1 ;:::;p j j j =0 (1.8) 4 andtheweaksparsity p X j = s +1 j j j 1 (1.9) forsome 1 .TheauthorsshowedthatLassoconsistentlyselectsthetruesubsetofvariables withtheweaksparsityassumptionandtheirrepresentablecondition. Lateron,Lassobecamemuchmorematureandwidelyappliedtoobtainasparsesolution inthehigh-dimensionalsetup.VariationsoftheLassoalsoemergesrapidly.[46]mentioned thatthebesttuningparameter maynotbeobtainedbyminimizingthepredictionerror, thusLassocannotreachselectionconsistencyandpredictionconsistencyatthesametime. Therefore,Lassodoesnothavetheso-calledoracleproperty.Theauthorsproposedthe smoothlyclippedabsolutedeviationpenalty(SCAD)andshoweditsoracleproperty.[41] arguedthatLassodoesnotworkwellincorrelatedpredictors,thus[156]proposedtheelastic netpenaltytoovercomethisissue.Inthenextyear,[155]proposedadata-drivenapproach namedtheadaptivelasso,whereaweighted l 1 normoftheparametersisusedtoreplacethe l 1 norm,i.e., ^ =argmin 2 R p 1 n k y k 2 2 + p X j =1 w j j j j ; (1.10) forsome w j ;j =1 ;:::;p .Asimpleconventionistochoose w j =1 = j ^ j j forsomeinitial estimator ^ j ,andset w j = 1 iftheinitialestimatoriszero.Theauthorshowedthatthe adaptiveLassoenjoystheoracleproperty.[20]proposedtheDantzigselectorandshowed itsoracleproperty. [148]discussedvariousgroupedvariableselectiontechniques,includingthewell-known grouplassopenalty,seealso[135,65].Considerthat =( 1 ; 1 ;:::; 1 ;g 1 ;:::; p; 1 ;:::; p;g p ) , where involves p groupswitheachgrouphaving g j parameters j =1 ;:::;p .Thiscaseis usefulinalotofrealworldapplications.Forexample,ifacategoricalvariableinclude3 5 levels,afterone-hotencoding,itdoesnotmakesensetoonlyincludeoneofthethreelevels inthemodel.Thegrouplassominimizesthelossfunctionplusagroupedpenalty ^ =argmin 2 R p 1 n k y k 2 2 + p X j =1 p g j k j k 2 ; (1.11) where j denotesparametervectorinthe j th group.Thesumof l 2 normsencouragesgroup- wisesparsitybutin-groupnon-sparsity.Adaptivityinthegrouplassoisalsodiscussedby theauthors.Tobemoregeneral,[5]namedthistypeofpenaltyas l p; 1 normpenalty,where the l 2 norminequation1.11isreplacedby l p normwith 1 p< 1 . 
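The adaptive lasso in (1.10) reduces to a plain lasso after rescaling the columns of the design matrix by the weights, so it can be fit with standard software. The sketch below is a minimal illustration in Python with scikit-learn, where the simulated design, the ridge-based initial estimator and the fixed penalty level are illustrative assumptions rather than the settings used later in the dissertation.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                   # sparse truth
y = X @ beta + 0.5 * rng.standard_normal(n)

# Initial estimator for the weights w_j = 1 / |beta_j^init|.
init = Ridge(alpha=1.0).fit(X, y)
w = 1.0 / np.maximum(np.abs(init.coef_), 1e-8)

# Adaptive lasso: scale column j by 1 / w_j, run a plain lasso on the scaled
# design, then scale the coefficients back (a zero stays zero either way).
fit = Lasso(alpha=0.1).fit(X / w, y)
beta_hat = fit.coef_ / w

print("indices with nonzero estimates:", np.nonzero(beta_hat)[0])
```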
Amoreintuitivepenaltyemergedlateras[114]proposedthesparsegrouplasso,which isacombinationofthegrouplassoandthelasso ^ =argmin 2 R p 1 n k y k 2 2 + 1 p X j =1 p g j k j k 2 + 2 p X j =1 g j X g =1 j j;g j ; (1.12) Thepenaltyencouragesbothgroup-wisesparsityandin-groupsparsity,andisgreatlybene- ˝cialinneuralnetworkmodels.Ahugenumberofdi˙erenttypesarealsowidelyusedinthe statisticsandmachinelearningapplications,forexample,thefusedLasso[127],thegraph fusedLasso[122],thetreeLasso[71]andetc. 1.2.1ChoosingtheTuningParameters Thepowerofregularizationisdecidedthroughthehyperparameter,whichisalsocalled thetuningparameter.Intuitively,choosinglargertuningparameter resultsinamore sparsemodel.However,makethemodeltoosparsemayleadtolosingthetrueunderlying relationshipbetweentheresponseandthepredictors.Severalcriteriaareusefulinhelping 6 chooseaappropriatetuningparameter,forexample,BIC[111],EBIC[24],andGIC[151, 52].Speci˝cally,[52]showedtheriskconsistencyoftuningparameterselectionusingGIC. Anotherpracticaltoolisthecross-validations,wherethetuningparameterischosensuch thatthecross-validatedmetricisoptimized.Conventionaltheoryforcross-validationisgiven inchapter8of[61]. 1.2.2AlgorithmsforTrainingSparseModels Theregularizationmethodsusuallyinvolve l 1 normpenaltyterm,whichisnoteasytosolve usingregulargradientdescentalgorithms,seeforexample[142].Thisissueisgeneralfor allmodelswith l 1 penalty.Notethatthegrouplassopenaltydoesnothaveexplicit l 1 normpenaltybutinvolvessumof l 2 norms,whichisequivalentto l 1 normpenaltyinterms ofoptimizationtheory.Coordinatedescentalgorithms[143,54],alsoknownasproximity gradientdescentalgorithmsareusefulindealingwiththe l 1 normpenalty.Forthe l 1 norm penalty,notrestrictedtothelinearregressioncase,asoft-thresholdingfunction S ( ; ): R p R ! R p ,where ( S ( z ; )) j = sign ( z j )( j z j j ) + ;j =1 ;:::;p (1.13) canbeappliedafterthegradientdescentforthesmoothparttozerosomeparameters.While forthegrouplassopenalty,thesoft-thresholdingfunctionbecomes S ( ; ): R P p j =1 g j R ! R p ,where ( S ( z ; )) j = 1 p g j k ( x ; y ; ) k 2 + j ;j =1 ;:::;p; (1.14) 7 where ( x ; y ; ) isaquantitydependingon x , y , andthelossfunction.Thecoordinate descentalgorithmmightbeslowinupdatingthegroups,thusablockco-ordinategradient descentmethodby[132]canbeapplied,whereasecondorderTaylorapproximationisused tosimplifythesmoothpartoftheoriginallossfunction. Asthedevelopmentofclassicalalgorithms,themodelscanalsobeformulatedasclassical optimizationproblemsandsolvedwithexistingsoftware.TheLassointhelinearregression casecanbeformulatedasalinearprogrammingproblem,seeforexample[30].Thegroup Lassointhelinearregressionsetupcanbeformulatedasasecondorderconequadratic programmingproblem,seeforexample[1].Maturepackagesareavailableforsolvingthese classicaloptimizationproblems.Aswementionedthe l 0 penaltybefore,thefastdevelopment ofoptimizationsoftwarealsomakethistypeofpenaltyeasiertoapply.The l 0 penaltyinthe linearregressionsetupcanbeformulatedasamixedintegersecondorderconeoptimization (MISOCO)problem,seeforexample[92]. 1.2.3Stage-wiseSelection Variousstage-wisealgorithmsareusedtoobtainapathselection.Theleastangleregression [41]providesaforwardalgorithmtoaddnewfeaturesbylookingatthecorrelation.The LARSalgorithmwithsimplemodi˝cationcanbeusedtoobtainthelassosolutionpath. 
[128]providesastage-wisealgorithm,whichprovidesveryclosesolutionpathtothelasso solutionpath.[102]studiedastage-wisealgorithmtoincorporatethe l 2 , l 1 and l 0 norm penaltywiththegradientswithrespecttotheinputweights.Thegradienthasimplicit connectionswiththecorrelationstudiedin[41]. Itworthnotingthat[125]hasshownthatthereisanequivalencebetweenusingthe stage-wisealgorithmandthegrouplassopenalty.Thisbuiltaconnectionbetweenthe 8 regularizationmethodsandthestage-wisevariableselectionalgorithms.[82]hasappliedthe resultondeeparti˝cialneuralnetworkstodofeatureselection.Comparedwithoptimizing penalizedlossfunctions,stage-wisealgorithmsstartsfromthenullmodelandaddsvariables gradually.Thisisbene˝cialtocomplicatedmodelssuchasneuralnetworks,sincelearning acomplicatedmodelonallvariablesinvolvesunpredictableuncertainty,andthusmightbe sensitivetoinitialization. 1.3TheProjectionApproach Anotherpopularapproachtodealwiththehigh-dimensionality,thoughnotstudiedtoomuch inthisthesis,isworthmentioning,theprojection-basedmethods,alsoknownasdimension- alityreduction.Projectionmethods˝ndalowerdimensionalrepresentationfortheoriginal featurespace.Classicalworkinclude[141,129,31,16,68,157,21].Projectionmethodshave more˛avorofunsupervisedlearning,sincethelowerdimensionalrepresentationshouldnot dependontheresponsebutispurelyapropertyofthedesignmatrix.Amaindrawbackof projectionbasedapproachesisthatonelosesinterpretability,becausetheprojectedfeatures arenolongertheoriginalfeatures,andthusnotinterpretable. Generaldimensionalreductionmethodsdonotworkinthehigh-dimensionalsetup.For example,theprincipalcomponentanalysis(PCA)usesalinearprojectionon x .Itusesa matrix A 2 R d p toproject x 2 R p to Ax 2 R d ,wherethematrix A ischosentomaintain thegreatestvariancebasedonthetrainingdata.Duringthecomputation,thesample covariancematrix ^ n = X T X =n isusedtoestimatethepopulationcovariancematrix = Cov ( x ) ,andtheprincipalcomponentscanbeobtainedfromtheeigenvectorsofthe samplecovariancematrix.However,inthehigh-dimensionalsetup,thesamplecovariance 9 matrixisnolongeraconsistentestimatorofthepopulationcovariancematrix,i.e. ^ n 6! asn !1 ; ifwehave p>n .[69]gavetheconditionswhenPCAworksinthehighdimensionalcase: thelargesteigenvalueshavetobelargeenough. VariationsofthePCAincludetheSimpli˝edComponentTechniqueLASSO(SCoTLASS) by[68]andthesparsePCAby[157].Theformerconsidersa L 1 penaltyontheprojection matrix A thatyieldssomesparsesolutionin A .However,thisisnotfeasibleinpractice, sincethereisnoguidelineforchoosingthetuningparameterandahighcomputational costduetothenon-convexity.ThelatterobtainedtheconnectionbetweenPCAandridge regression,andusetheridgeorelasticnettoobtainasparseapproximationtotheoriginal PCA.Inbothmethods,theirprojectionmatrix A issparseintheoriginalfeatures,i.e.,some columnsof A areexactlyzeros.Thesebuildabridgeconnectingwiththesparseassumption tosomeextent. 
Aftertheclassicworkofmanifoldlearning(non-lineardimensionalityreduction)such as[130,107,124],currentmanifoldlearningfocusesontwoaspects:imageprocessingand datavisualization.Inimageprocessing,manifoldlearningcanbeusedtoreducethedi- mensionalitytoitsintrinsicdimension,seeforexample[89,154,84].Theseapplicationsare notgeneralizedtoabroadersetup.Manifoldlearningappliedtodatavisualizationusually reducesthedimensionalitytotwoorthree,seeforexample[139].Theseapplicationsarenot usefulinbuildingmodels.Anotherconcernistheout-of-sampleperformanceofmanifold learning,whichisrecentlystudiedby[123].Accordingtothepaper,mostcurrentmanifold learningalgorithmsfailtomapnewdatatothepreviouslylearntembedding. 10 Autoencoder[75]usesneuralnetworkasanon-linearprojectiontoencodetheoriginal featurespacetoalowerdimensionalspace,whichischosentobesuchthattheoriginal featurespacecanberecoveredwithanotherneuralnetwork.Trainingahigh-dimension neuralnetworkisstillchallenging,sparseautoencoder[100,90]eliminatestheinsigni˝cant connectionsandthuskillsthoseparameters. Anotherusefultoolistherandomprojection,whichreliesontheJohnson-Lindenstrauss lemma Lemma1.1 ([67]) . Given 0 << 1 ,aset X of m pointsin R N ,andanumber n> 8log( m ) 2 ,thereisalinearmap f : R N ! R n suchthat (1 ) k u v k 2 k f ( u ) f ( v ) k 2 (1+ ) k u v k 2 forall u;v 2 X . Thelemmastatesthatthereexistalowerdimensionalprojectionthatwellmaintains thedistanceintheoriginalfeaturespace,however,thereisn'taguidelineofhowto˝nd suchalowerdimensionalspace.[21]usedrandomprojectionoverclassicalclassi˝ersasan ensembleclassi˝er,andprovidedane˚cientalgorithmduetothefactthattherandom projectionmatricesaredrawnfromdistributionsratherthancomputedfromdata.They alsoshowedthatthebestrandomprojectioncanbefoundinlineartimewhentherandom projectionmatricesaresampledaccordingtoHaarmeasure.Onepossibledrawbackof randomprojectionisthatitdoesnotprovideuswithanylower-dimensionalrepresentation information. 11 1.4Non-parametricModeling Non-parametricmodelingassumesanarbitraryrelationshipbetweentheresponseandthe variables.Wewouldliketo˝ndafunction f ( ): R p ! R suchthat f ( x ) isagoodapproxi- mationof y orarepresentationof y ,see[61].It'scommontoassumethat f ( ) iscontinuous, butthisisnotnecessary.Complicatedmachinelearningmodelssuchastreemodels,see [108,81,56],neuralnetworks,see[116]andetc.canbeconsideredasnon-parametricmod- elsthatapproximateacomplicatedrelationship.Anarbitraryfunctionwithgreat˛exibility canbewiggly,i.e.,itmayperfectly˝tthetrainingdata.Therefore,peopleusuallyneedto restrictthesmoothnessofafunctionbyrestricting Z f 00 ( x ) 2 dx; (1.15) seeforexample[11].Obviously,alinearfunctionhasthequantityequaltozeroandthusis thesmoothest,buttheapproximationpowerofalinearfunctionissub-optimal.Therefore, abalancebetweentheapproximationpowerandthesmoothnessneedstobeconsidered. 1.4.1BasisExpansion Directlyworkingwiththearbitraryfunction f isnotrealistic,sincethecandidatespace isin˝nitelydimensional.Therefore,parametricapproximationsareneed.Considerthe univariatecase,agreatnumberofapproachescanbeusedtoapproximateacontinuous function f ( ): R ! 
R .Themostpopularapproachisbasisexpansion f ( x )= 1 X k =1 k ˚ k ( x ) ; (1.16) 12 where f ˚ k ( ) ;k 2 N + g isasetofbasisfunctions.A˝nitenumberofbasesisusuallyused toapproximatethein˝nitesum.Asmentionedabove,asetofbasiswithmaximaldi˙eren- tiabilityisagoodproperty,thereforeB-splineisamongthemostpopularbasisfunctions, see[110].Accordingto[120],asplinewithdegree l consistsofpiece-wisepolynomialsupto degree l on K pre-speci˝edpartitions,withconnectionknots l 0 timescontinuouslydi˙eren- tiable,where 0 l 0 l 2 .Accordingto[65],thereexistsanormalizedB-splinebasis f ˚ k g suchthat f n ( x )= K + l X k =1 k ˚ k ( x ) : (1.17) [65]showedthatusingtheaboveB-splinebasis,wehavethefollowingapproximationerror k f f n k 2 2 = Z b a ( f ( x ) f n ( x )) 2 dx = O (( K + l ) 2 d ) ; (1.18) wheretheparameter d = k + issuchthatthe k th derivativeoffunction f ( ) satis˝esthe Lipschitzconditionoforder j f ( k ) ( s ) f ( k ) ( t ) j C j s t j ; forsomeconstant C .Thisresultshowsthatasweincreasethenumberofpartitionsina basisexpansion,theapproximationerrorcanbearbitrarilysmallundercertainconditions. 1.4.2NeuralNetworks Toapproximateanarbitrarymulti-variatefunction f ( ): R p ! R ,neuralnetworkisa powerfultool.Neuralnetworkisusedtomimichuman'sbrain:theinputdoesnotleadto theoutputdirectly,butafewintermediatenodesareneeded.Originatedfromamulti-layer 13 Input layer Hidden layer Output layer ... ... Input1 Input2 Input p 1 Input p Ouput Figure1.1:Adiagramforthesingle-layerneuralnetworkmodel perceptronin[106],whichisamulti-layerversionoftheperceptronin[105],neuralnetwork hasbeengettingdeeperandmorepowerfulinapproximatingacontinuousfunction.˝gure 1.1showsadiagramofasinglehiddenlayerneuralnetwork,whichisalsoknownasashallow neuralnetwork.Thehiddenlayerconsistsoflinearcombinationsofthe˝rstlayer'sinputs plusanon-polynomialactivationfunction. Mathematically,let x 2 R p betheinputvector,theoutputofashallowneuralnetwork is ( x )= K X k =1 k ˙ ( x + t )+ b 2 R ; (1.19) where k 2 R ;k =1 ;:::;K aretheoutputlayercoe˚cients, 2 R K p isthehiddenlayer coe˚cientmatrix, t isthehiddenlayerinterceptvector, b istheoutputlayerintercept,and ˙ ( ) isanactivationfunction.Anotabletheoremontheapproximationpowerofashallow neuralnetworkisgivenby[33]asfollow Theorem1.1 (Universalapproximationtheorem,Cybenko1989) . Let ˙ ( ) besuchthat ˙ ( t ) ! 0 as t ! and ˙ ( t ) ! 1 as t !1 .Foracontinuousfunction f on [0 ; 1] n andan 14 arbitrary > 0 ,thereexista K andparameters ; ; t ;b suchthat j f ( x ) ( x ) j < 8 x 2 [0 ; 1] n Thistheoremguaranteestheuniversalapproximationabilityofshallowneuralnetworks andisthereasonthatneuralnetworksareusefulinmanycases.Typicalactivationfunctions includes ‹ HyperbolicTangent: ˙ ( x )= tanh ( x ) . ‹ Sigmoid: ˙ ( x )= e x 1+ e x . ‹ Recti˝edLinearUnit(ReLU): ˙ ( x )= x + . ‹ LeakyReLU: ˙ ( x )=max x forsomesmall > 0 . Actually,aslongasthattheactivationfunctionisnotpolynomial,ashallowneuralnetwork hastheuniversalapproximationability,see[113].Recti˝edLinearUnit(ReLU)isoneof themostpopularactivationfunctionsnowadays,thoughitmaysu˙erfromthedeadReLU problems,see[87].It'scomputationallye˚cientandconvergesfast. 
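As a concrete reading of the shallow network in (1.19), the following numpy sketch evaluates a single-hidden-layer network with K hidden nodes and a ReLU activation. The variable names (gamma for the output weights, theta for the K by p hidden weight matrix, t and b for the intercepts) simply mirror the notation above; the random parameter values are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)               # sigma(z) = z_+

def shallow_net(x, gamma, theta, t, b, activation=relu):
    """Evaluate mu(x) = sum_k gamma_k * sigma(theta_k . x + t_k) + b, cf. (1.19).

    x:      input vector of length p
    gamma:  output-layer weights, length K
    theta:  hidden-layer weight matrix, shape (K, p)
    t:      hidden-layer intercepts, length K
    b:      output-layer intercept (scalar)
    """
    hidden = activation(theta @ x + t)       # values of the K hidden nodes
    return gamma @ hidden + b

# A small example with p = 4 inputs and K = 3 hidden nodes.
rng = np.random.default_rng(1)
p, K = 4, 3
x = rng.standard_normal(p)
print(shallow_net(x, rng.standard_normal(K),
                  rng.standard_normal((K, p)),
                  rng.standard_normal(K), 0.0))
```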
1.4.3DeepNeuralNetworks Deepneuralnetworksrefertoneuralnetworkswithmorethanonehiddenlayers.Asthe developmentofdeepneuralnetworks,peoplefoundthelimitationsofshallowneuralnet- worksuchthatthenumberofneuronsneededtoachieveadesireerrorincreasesasfastas exponentially,see[29,28].Afterthatpeoplefoundthattwohiddenlayermodelmaybe signi˝cantlymorepromisingthanthesinglehiddenlayermosee[103].Sumneuralnet- works,orequivalently,polynomialneuralnetworkshavebeenstudied[36,86],anduniversal 15 approximationpropertyhasbeenestablishedrecentlyby[43]thatacontinuousfunctionin F n d canbeapproximatedwitherror byaquadraticnetworkwhohavedepth O log(log( 1 ))+log( 1 ) andnumberofweights O log(log( 1 ))( 1 ) d=n +log( 1 )( 1 ) d=n where d isthedimensionofdomain.Theapproximationtheoryforregulardeepneural networkshavealsobeenestablishedrecently.[104]showedthatadeepnetworkneed O ( n 1) L 2 modelcomplexitytoapproximatea L -Lipshitzcontinuousfunctionof n variablesinsteadof O L n inashallowneuralnetwork.[112,113]providesmoredetailedresultsforthedeepneural networkapproximationpower. Therefore,it'spromisingthatadeeperneuralnetworkworksbetterinlearningcompli- catedrelationships.Throughthisthesis,wewillstartfromthegeneralizedadditivemodel (GAM),studyitspropertiesandtuningparameterselectioninChapter2,thenmovetoa morecomplicatedshallowneuralnetworkmodelandtalkaboutitsasymptoticpropertiesin chapter3,and˝nallydiveintodeepneuralnetworksinchapter4,whereweproposedanen- 16 semblevariableselectionalgorithmandestimationmethod.Inchapter5,wewillsummarize theworkwehavedoneandthefutureworktobedone. 17 Chapter2 High-dimensionalGeneralizedAdditive Modeling Inthischapter,wewillstudythegeneralizedadditivemodel(GAM),see[63].Weconsider anon-linearlinkfunctionthatbelongtotheexponentialfamily.Atwo-stepapproachis usedforvariableselectionandestimation,withagrouplassofollowedbyanadaptivegroup lassoappliedtothebasisexpansionofthenon-parametricfunctions.Wewillshowthatthe variableselectionresultinthesecondstepisconsistent,andwealsoderivetheconvergence rateofthesecondstepestimator.Moreover,wediscussthetuningparameterselectionand showedthatwereachriskconsistencybyusingthegeneralizedinformationcriterion(GIC). Simulationsandreaddataexamplesonvariousaspectsaregiventosupportthearguments. 
2.1Introduction Themainobjectiveofthisworkistoestablishtheorydrivenhighdimensionalgeneralizedad- ditivemodelingmethodwithnonlinearlinks.Themethodologyincludesconsistentlyselect- ingthetuningparameter.Additivemodelsplayimportantrolesinnon-parametricstatistical modelingandinmachinelearning.Althoughthisimportantstatisticallearningtoolhasbeen usedinmanyimportantapplicationsandtherearefreesoftwareavailableforimplementing thesemodelsalongwiththeirvariations,tooursurprise,thereisnoprocedurewithnonlinear 18 linkavailablewhichhasbeenstudiedsystematicallywiththeoreticalfoundation.generalized additivemodelingallowsthenonlinearrelationshipbetweenaresponsevariableandasetof covariates.Thisincludesthespecialcase,namely,thegeneralizedlinearmodels,byletting eachadditivecomponentbealinearfunction.Ingeneral,let ( y i ; X i ) ;i =1 ;:::;n beinde- pendentobservations,where y i 'sareresponsevariableswhosecorresponding p -dimensional covariatevectorsare X i 's.Ageneralizedadditivemodel[62]isde˝nedas i = E ( y i j X i )= g 1 0 @ p n X j =1 f j ( X ij ) 1 A ; (2.1) where g ( ) isalinkfunction, f j 'sareunspeci˝edsmoothfunctionsand X ij isthe j thcom- ponentofvector X i .Oneofthefunctionscouldbeaconstant,whichistheinterceptterm, butthisisnotnecessary.Thenumberofadditivecomponentsiswrittenas p n ,sinceit sometimes(usuallyinhighdimensionalsetup)increasesas n increases.Asimplecasethat manypeoplestudiedis p n = p ,wherethenumberofadditivecomponentsis˝xedandless thanthesamplesize n .Thechoiceoflinkfunctionsisassimpleasingeneralizedlinear models,wherepeopleprefertochooselinkfunctionsthatmakethedistributionofthere- sponsevariablesbelongtopopularexponentialfamily.Awidelyusedgeneralizedadditive modelhastheidentitylinkfunction g ( )= ,whichgivestheclassicaladditivemodel y i = p n X j =1 f j ( X ij )+ i ; (2.2) where i 'sarei.i.drandomvariableswithmean 0 and˝nitevariance ˙ 2 . Ontheotherhand,highdimensionaldatahasincreasinglybecomeapartofmanymodern daysscienti˝capplications.Oftenthenumberofcovariates p n ismuchlargerthanthe 19 numberofobservations n ,whichisusuallywrittenas p n ˛ n .Amostinterestingscaleis p n increasesexponentiallyas n increases,i.e. log p n = O ( n ˆ ) forsomeconstant ˆ> 0 .[48] calledthisasnon-polynomialdimensionalityorultrahigh-dimensionality. Inthischapter,weconsiderthegeneralizedadditivemodelinahigh-dimensionalsetup. Toavoididenti˝cationproblems,thefunctionsareassumedtobesparse,i.e.onlyasmall proportionofthefunctionsarenon-zeroandallothersareexactlyzero.Amoregeneralized setupisthatthenumberofnonzerofunctions,denoted s n ,alsodivergesas n increases.This caseisalsoconsideredinthispaper. 
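To make the setup in (2.1) concrete, the following sketch simulates from a sparse high-dimensional generalized additive model with a logit link: p is much larger than n and only three additive components are nonzero. The particular component functions and sizes are illustrative assumptions, not the designs used in the simulation section.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 1000                              # p >> n (high-dimensional regime)
X = rng.uniform(0.0, 1.0, size=(n, p))

# Only three additive components are nonzero (sparsity assumption).
f1 = lambda x: np.sin(2 * np.pi * x)
f2 = lambda x: (2 * x - 1) ** 2
f3 = lambda x: np.exp(x) - 1.5

eta = f1(X[:, 0]) + f2(X[:, 1]) + f3(X[:, 2]) # additive predictor sum_j f_j(x_ij)
prob = 1.0 / (1.0 + np.exp(-eta))             # inverse logit link g^{-1}
y = rng.binomial(1, prob)                     # binary response from model (2.1)
```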
Manyothershaveworkedongeneralizedadditivemodels.Commonapproachesuseba- sisexpansiontodealwiththenonparametricfunctions,andperformvariableselectionand estimationmethodsonthebases.[94]consideredasimplercase,asin2.2,withanew sparsity-smoothnesspenaltyandprovedit'soracleproperty.Theyalsoperformedasimu- lationstudyunderlogitlinkwiththeirnewpenalty,however,notheoreticalsupportwas provided.[45]proposedthenonparametricindependencescreening(NIS)methodinscreen- ingthemodelasin2.2.However,theselectionconsistencyandthegeneralizedlinkfunctions werenotdiscussed.[91]discussedthepracticalvariableselectioninadditivemodels,but notheorywasgiven.[83]consideredatwo-steporacallye˚cientapproachingeneralized additivemodelsinthelowdimensionalsetup,butnovariableselectioninthehighdimen- sionalsetupwasdone.[65]focusedonthevariableselectionof2.2with˝xednumberof nonzerofunctionsusingatwostepapproach:Firstgrouplasso[7,148]onthebasestoselect thenonzerocovariatesandthenuseadaptivegrouplassotoestimatethebasescoe˚cients. Theythenestablishedtheselectionconsistencyandprovidedtherateofconvergenceofthe estimation.[2]reviewedseveralexistingalgorithmshighlightingtheconnectionsbetween them,includingthenon-negativegarrote,COSSOandadaptiveshrinkage,andpresented 20 somecomputationallye˚cientalgorithmsfor˝ttingtheadditivemodels.[98]extendedthe consistencyandrateofconvergenceof[65]tospatialadditivemodels.[51]studiedthe GAMwithidentitylinkundertheendogeneitysetting.Itworthmentioningthatalternative methodstopenalizationhavealsobeenstudied,forexample,[134]studied˝ttingGAMand performvariableselecitonimplicitlythroughlikelihoodbasedboosting. However,thoughwidelyused,nosystematictheoryaboutselectionandestimationcon- sistencyandrateofconvergencehasbeenestablishedforgeneralizedadditivemodelswith non-identitylinkfunctions. Inthischapter,weestablishthetheorypartforgeneralizedadditivemodelswithnon- identitylinkfunctionsinhighdimensionalsetup.Wedevelopatwo-stepselectionapproach, whereinthe˝rststepweusegrouplassotoperformascreening,which,undermildassump- tions,isabletoselectallnonzerofunctionsandnotover-selecttoomuch.Inthesecond step,theadaptivegrouplassoprocedureisusedandisprovedtoselectthetruepredictors consistently. Anotherimportantpracticalissueinvariableselectionandpenalizedoptimizationprob- lemsistuningparameterselection.Variouscrossvalidation(CV)techniqueshavebeenused inpracticeforalongtime.InformationcriteriasuchasAkaikeinformationcriterion(AIC), AICc,Bayesianinformationcriterion(BIC),Mallow's C p andetc.havebeenusedtoselect `thebest'modelaswell.Manyequivalencesamongthetuningparameterselectionmethods havebeenshownintheGaussianlinearregressioncase.However,theconsistencyofthese selectionmethodswerenotestablished.Latersomevariationsoftheinformationcriteria suchasmodi˝edBIC[150,137]extendedBIC[24]andgeneralizedinformationcriterion (GIC)[52]wereproposedandshowntohavegoodasymptoticpropertiesinpenalizedlinear modelsandpenalizedlikelihoods.However,theresultsarenotusefulforgroupedvariablesin 21 additivemodels,forwhichbasisexpansiontechniqueisusuallyusedandthusbringsgrouped selection. Inthischapter,wegeneralizetheresultofgeneralizedinformationcriterion(GIC)by [52]togroup-penalizedlikelihoodproblemsandshowthatundersomecommonconditions andwithagoodchoiceoftheparameterinGIC,weareabletoselectthetuningparameter thatcorrespondstothetruemodel. 
Insection2.2,themodelisspeci˝edandbasicapproachisdiscussed.Notationsand basicassumptionsarealsointroducedinthissection.Section2.3givesthemainresultsof thetwostepsselectionandestimationprocedure.Section2.4developsthetuningparameter selection.Variationofpenaltyfunctionsisdiscussedinsection2.5.Extensivesimulation studyandrealdataexamplearepresentedinsection2.6followedbyashortdiscussionin section2.7.TheproofsofalltheoremsaredeferredtoAppendixA. 2.2Model Weconsiderthegeneralizedadditivemodel(2.1)wherethelinkfunctioncorrespondstoan exponentialfamilydistributionoftheresponse.Foreachofthe n independentobservations, thedensityfunctionisgivenas f y i ( y )= c ( y )exp y i b ( i ) ˚ ; 1 i n; i 2 R : (2.3) Withoutlossofgenerality,weassumethatthedispersionparameter 0 <˚< 1 isassumed tobeaknownconstant.Speci˝callyweassume ˚ =1 .Weconsidera˝xed-designthrough thispaper,i.e.,thedesignmatrix X isassumedtobe˝xed.However,wehaveshownin 22 appendixAthatthesametheoryworksforarandomdesignundersimpleassumptionson thedistributionof X .Theadditiverelationshipassumesthatthedensitiesof y i 'sdepend on X i 'sthroughtheadditivestructure i = P p n j =1 f j ( X ij ) .Thisisthecanonicallink.Ifwe useotherlinkfunctions,forexample, A ( ) ,thetheoryalsoworksaslongasthefunctions A ( ) satis˝estheLipschitzconditionsforsomeorder.Let b ( k ) ( ) bethe k -thderivativeof b ( ) ,thenbypropertyoftheexponentialfamily,theexpectationandvariancematrixof y =( y 1 ;:::;y n ) T ,undermildassumptionsof b ( ) ,isgivenby ( ) and ˚ ) ,where ( )=( b (1) ( 1 ) ;:::;b (1) ( n )) T and )= diag f b (2) ( 1 ) ;:::;b (2) ( n ) g : (2.4) Thelog-likelihood(ignoringtheterm c ( y ) whichisnotinterestingtousinparameter estimation)canbewrittenas l = n X i =1 2 4 y i 0 @ p n X j =1 f j ( X ij ) 1 A b 0 @ p n X j =1 f j ( X ij ) 1 A 3 5 : (2.5) AssumethattheadditivecomponentsbelongtotheSobolevspace W d 2 ([ a;b ]) .According to[110],seepages268-270,thereexistsB-splineapproximation f nj ( x )= m n X k =1 jk ˚ k ( x ) ; 1 j p: (2.6) with m n = K n + l ,where K n isthenumberofinternalknotsand l d isthedegreeofthe splines.Generally,itischosedthat d =2 and l =4 ,i.e.,cubicsplines. Usingtheapproximationabove,[65]provedthat f nj wellapproximates f j inthesense 23 ofrateofconvergencethat k f j f nj k 2 2 = Z b a ( f j ( x ) f nj ( x )) 2 dx = O ( m 2 d n ) : (2.7) Therefore,usingthebasisapproximation,thelog-likelihood(ignoringtheterm c ( y ) which isnotrelatedtotheparameters)canbewrittenas l = n X i =1 2 4 y i 0 @ p n X j =1 m n X k =1 0 jk k ( x ij ) 1 A b 0 @ p n X j =1 m n X k =1 0 jk k ( x ij ) 1 A 3 5 = n X i =1 h y i 0 T i b 0 T i ; (2.8) where 0 and i arethevectorbasiscoe˚cientsandbasesde˝nedbelow. It'salsoworthnotingthatthenumberofbases m n increasesas n increases.This isnecessarysince[110]mentionedthatoneneedtohavesu˚cientpartitionstowellap- proximate f j by f nj .Ifwe˝x m n ,i.e.let m n = m 0 ,thoughinthelaterpartwe willshowtheapproachtoestimatethebasiscoe˚cientscanhavebetterrateofconver- gence,theapproximationerrorbetweentheadditivecomponentsandthesplinefunctions k f j ( x ) f nj ( x ) k 2 =[ R b a ( f j ( x ) f nj ( x )) 2 dx ] 1 = 2 = O (1) willincreaseandleadtoinconsistent estimations.Therefore, m n ,ormoreprecisely, K n ,needtoincreasewith n . Ourselectionandestimationapproachwillbebasedonthebasesapproximatedloglike- lihood2.8.Beforestartingthemethodology,welistthenotationsandstatetheassumptions weneedinthispaper. Notations Thedesignmatrixis X ( n p n ) =( x 1 ;:::; x n ) T . 
24 Thebasismatrixis ( n m n p n ) = 1 ;:::; n ) T ,where i =( ˚ 1 ( x i 1 ) ;:::;˚ m n ( x i 1 ) ;:::;˚ 1 ( x ip n ) ;:::;˚ m n ( x ip n )) T : Thetruebasisparameters 0 =( 0 11 ;:::; 0 1 m n ;:::; 0 p n 1 ;:::; 0 p n m n ) T 2 R m n p n Weassumethefunctions f 1 ;:::;f p n aresparse,then 0 isblock-wisesparse,i.e.the blocks 0 1 =( 0 11 ;:::; 0 1 m n ) T ;:::; 0 p n =( 0 p n 1 ;:::; 0 p n m n ) T aresparse. Let y betheexpectationof y basedonthetruebasisparametersand " = y y . De˝netherelationship a n b n asthereexistsa˝niteconstant c suchthat a n cb n . Foranyfunction f de˝ne k f k 2 =[ R b a f 2 ( x ) dx ] 1 = 2 ,whenevertheintegralexists. Foranytwocollectionsofindices S; ~ S f 1 ;:::;p n g ,thedi˙erencesetisdenoted S ~ S . Thecardinalityof S isdenotedcard ( S ) .Forany 2 R m n p n ,de˝ne 1 ;:::; p n asitssub- blocks,where i 2 R m n ,andde˝netheblock-wisesupport supp B ( )= f j 2f 1 ;:::;p n g ; j 6 = 0 g : De˝netheblock-wisecardinality card B ( )= card ( supp B ( )) : 25 For S = f s 1 ;:::;s q gf 1 ;:::;p n g ,de˝nesub-blockvector S =( T s 1 ;:::; T s q ) T Thenumberofadditivecomponentsisdenoted p n ,whichispossibletogrowfasterthanthe samplesize n .Let T = supp B ( 0 ) and T c bethecomplimentset.Letcard ( T )= s n ,where s n isallowedtodivergeslowerthan n . Foreach U f 1 ;:::;p n g withcard ( U T ) m forsome m ,de˝ne B ( U )= f 2 R m n p n ; supp B ( ) U g ; B ( m )= fB ( U ); forany U f 1 ;:::;p n g ; Card ( U T ) m g : Let q beanintegersuchthat q>s n and q = o ( n ) .De˝ne B 1 = f 2B : card B ( ) q g ; where B isasu˚cientlylarge,convexandcompactsetin R d . Assumptions Assumption2.1 (Ondesignmatrix) . UsingthenormalizedB-splinebases,thebasisma- trix haseachcovariatevector j ;j =1 ;:::;p n bounded,i.e., 9 c suchthat k j k 2 p nc ; 8 j =1 ;:::;m n p n . Assumption2.2 (RestrictedEigenvalues RE ) . Foragivensequence N n ,thereexist 0 and 26 1 suchthat 0 2 s n 2 m 1 n T T n k k 2 2 1 m 1 n ; (2.9) where 2 isapositiveconstantsuchthat 0 < 2 < 0 : 5 ,forall 2C ,where T =( T 1 ;:::; T p n ) and C = f 2 R p n m n : k k 2 6 =0 ; k k 2 N n andcard B ( )= o ( s n ) g : (2.10) Assumption2.3 (Ontheexponentialfamilydistribution) . Thefunction b ( ) isthreetimes di˙erentiablewith c 1 b 00 ( ) c 1 1 and j b 000 ( ) j c 1 1 initsdomainforsomeconstant c 1 > 0 .Forunboundedandnon-Gaussiandistributed Y i ,thereexistsadivergingsequence M n = o ( p n ) suchthat sup 2B 1 max 1 i n b 0 T i M n : (2.11) Additionallytheerrorterm i = y i y i 'sfollowtheuniformsub-Gaussiandistribution,i.e., thereexistconstants c 2 > 0 suchthatuniformlyforall i =1 ;::;n ,wehave P ( j i j t ) 2exp( c 2 t 2 ) forany t> 0 : (2.12) Assumption2.4 (Onnonzerofunctioncoe˚cients) . Thereexistasequence c f;n thatmay tendtozeroas n !1 suchthatforall j 2 T ,thetruenonzerofunctions f j satisfy min j 2 T k f j k 2 c f;n : Wenotethatassumption2.1isastandardassumptioninhighdimensionalmodels,where thedesignmatrixneedstobeboundedfromabove.assumption2.2isawell-knowncondition inhigh-dimensionsetupontheempiricalGrammatrix[15].Itisdi˙erentthantheregular 27 eigenvaluecondition,sincewhen n
< p_n m_n, the smallest eigenvalue of Φ^T Φ / n is 0. For the group lasso step, the regularization parameter is chosen as λ_{n1} = C √( (m_n r_n + log(p_n m_n)) / n ) for a bounded response (i.e., |y_i| bounded).
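To connect the basis expansion in (2.6) with the grouped penalties used in the two-step approach, the sketch below builds a cubic B-spline basis for every covariate (knots at empirical quantiles, as in the examples that follow), stacks the bases into the matrix Φ so that the m_n columns of one covariate form one group, and runs a plain proximal gradient loop for the group lasso screening step with the identity link for brevity. It assumes scikit-learn 1.0 or later for SplineTransformer; the penalty level is illustrative and would in practice be tuned, e.g., by the GIC discussed in this chapter, and the adaptive group lasso refit of the second step is only indicated in a comment.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
n, p = 100, 200                               # p >> n; three components truly nonzero
X = rng.uniform(0.0, 1.0, size=(n, p))
y = np.sin(2 * np.pi * X[:, 0]) + (2 * X[:, 1] - 1) ** 2 + np.exp(X[:, 2]) \
    + 0.5 * rng.standard_normal(n)
y = y - y.mean()                              # center (no intercept in this sketch)

# Cubic B-spline basis per covariate, knots at empirical quantiles; the m_n
# columns belonging to one covariate form one group (cf. (2.6)).
spline = SplineTransformer(n_knots=6, degree=3, knots="quantile",
                           include_bias=False)
Phi = spline.fit_transform(X)                 # shape (n, p * m_n)
m_n = Phi.shape[1] // p

def group_soft_threshold(z, lam):
    """Proximal map of lam * ||.||_2: shrink the whole block towards zero."""
    norm = np.linalg.norm(z)
    return np.zeros_like(z) if norm <= lam else (1.0 - lam / norm) * z

# Step 1 (screening): proximal gradient descent for the group lasso,
# least squares / identity link shown for brevity.
lam = 0.1                                     # illustrative; tuned by GIC in practice
step = n / np.linalg.norm(Phi, 2) ** 2        # 1 / Lipschitz constant of the gradient
beta = np.zeros(Phi.shape[1])
for _ in range(1000):
    beta -= step * (Phi.T @ (Phi @ beta - y) / n)
    for j in range(p):                        # blockwise proximal update per group
        block = slice(j * m_n, (j + 1) * m_n)
        beta[block] = group_soft_threshold(beta[block],
                                           step * lam * np.sqrt(m_n))

kept = [j for j in range(p) if np.linalg.norm(beta[j * m_n:(j + 1) * m_n]) > 0]
print("groups with nonzero blocks after the group lasso step:", kept[:10])
# Step 2 would refit an adaptive group lasso on the kept groups, with weights
# proportional to 1 / ||beta_j||_2 from this initial fit.
```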
The data set is available at https://web.stanford.edu/~hastie/ElemStatLearn/data.html. This data set has been studied in many different contexts with the objective being to predict whether an email is a spam or not based on a few features of the emails. There are n = 4601 observations, among which 1813 (39.4%) are spams. There are p = 57 predictors, including 48 continuous real [0, 100] attributes of the relative frequency of 48 `spam' words out of the total number of words in the email, 6 continuous real [0, 100] attributes of the relative frequency of 6 `spam' characters out of the total number of characters in the email, 1 continuous real attribute of the average length of uninterrupted sequences of capital letters, 1 continuous integer attribute of the length of the longest uninterrupted sequence of capital letters, and 1 continuous integer attribute of the total number of capital letters in the e-mail. The data was first log transformed, since most of the predictors have long-tailed distributions, as mentioned in [55]. They were then centered and standardised.
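A minimal sketch of the preprocessing just described (log transform, then centering and scaling, then the 3067/1534 split), assuming the file has been downloaded from the URL above as a whitespace-separated `spam.data` with the 57 features followed by the 0/1 label in the last column; the file name, column layout and the small offset inside the logarithm are assumptions about the raw file, not statements from the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed layout of the raw file: 57 feature columns, then the 0/1 spam label.
spam = pd.read_csv("spam.data", sep=r"\s+", header=None)
X_raw = spam.iloc[:, :-1].to_numpy()
y = spam.iloc[:, -1].to_numpy()

X = np.log(X_raw + 0.1)                       # log transform; the 0.1 offset is an
                                              # assumption to handle zero frequencies
X = (X - X.mean(axis=0)) / X.std(axis=0)      # center and standardise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=3067, random_state=0)    # 3067 training / 1534 testing
```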
The data was split into a training data set with 3067 observations and a testing data set with 1534 observations. We choose order l = 4, which implies a cubic B-spline. We choose m_n = 15, which implies there are 11 inner knots, evenly placed over the empirical percentiles of the data. We compare the result with the logistic regression with Lasso penalty, the support vector machine (SVM) with Lasso penalty, and the sparse group lasso neural network (SGLNN, [53], see also [146]). The Lasso and SVM are implemented with the scikit-learn module in Python, and the SGLNN is implemented in Python with the algorithm in the paper. By changing the tuning parameter or stopping criterion, we get estimations with different sparsity levels. All results are averaged over 50 repetitions. The classification accuracy at different levels of sparsity is shown in Figure 2.1 on page 61. The two-step approach and the neural network perform better than the linear methods, which indicates a non-linear relationship. The two-step approach has best accuracy 0.944, while the neural network has best accuracy 0.946. The neural network performs a little better than the two-step approach due to its ability to model the interactions among predictors, but this difference is not significant. However, the neural network has no interpretation and takes longer to train. All four methods improve as more predictors are included, which indicates that all predictors contribute some effect to the prediction. However, we are able to reach more than 0.9 accuracy with only 15 predictors included. With the GIC criterion, the two-step approach selects 14.6 ± 1.52 predictors, with an average accuracy of 0.914 ± 0.015. The most frequently selected functions are shown in Figure 2.2 on page 62, which also shows that these functions are truly non-linear. The plots are of the original functions, i.e., before the logarithm transformation. The estimated functions are close to the results in [55], Chapter 9, with slight scale differences due to different penalization. The results show that the additive model by the adaptive group lasso is more suitable for this data than linear models.
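For reference, the two linear baselines mentioned above can be set up in scikit-learn as an l_1 penalized logistic regression and an l_1 penalized linear SVM, with the sparsity level controlled through C. The sketch below reuses the train/test split from the preprocessing sketch above and is only an illustration of the comparison pipeline, not the exact code behind the reported numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Logistic regression with Lasso (l1) penalty.
logit_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
logit_l1.fit(X_train, y_train)

# l1 norm penalized linear support vector machine.
svm_l1 = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1)
svm_l1.fit(X_train, y_train)

for name, model in [("logistic + l1", logit_l1), ("linear SVM + l1", svm_l1)]:
    print(name,
          "test accuracy:", round(model.score(X_test, y_test), 3),
          "nonzero coefficients:", int(np.count_nonzero(model.coef_)))
```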
Figure 2.1: The classification accuracy against the number of nonzero variables measured on a testing set for Example 2.5 over 50 repetitions. The two-step approach, the logistic regression with Lasso, the l_1 norm penalized SVM and the sparse group lasso neural network are included in comparison.
Example 2.6. For a high-dimensional classification example, we use the prostate cancer gene expression data described at http://featureselection.asu.edu/datasets.php. The data set has a binary response. 102 observations were studied on 5966 predictor variables, which indicates that the data set is really a high dimensional data set. The responses have values 1 (50 sample points) and 2 (52 sample points), where 1 indicates normal and 2 indicates tumor. All predictors are continuous predictors, with positive values.

To see the performance of our procedure, we ran 100 replications. In each replication, we randomly choose 76 of the observations as the training data set and the remaining 26 observations as the testing data set. We choose order l = 4, which implies a cubic B-spline. We choose m_n = 9, which implies there are 5 inner knots, evenly placed over the empirical percentiles of the data.
Figure 2.2: The estimated functions for the most frequently selected functions for Example 2.5.
Similar to the last example, we compare the result with the logistic regression with Lasso penalty, the SVM with Lasso penalty, and the SGLNN. The classification accuracy at different levels of sparsity is shown in Figure 2.3 on page 63. From the figure we see that, compared with linear methods such as the logistic regression or the support vector machine, the non-parametric approaches converge faster. The two-step approach reaches a testing accuracy of 0.945 when around 15 variables are included in the model, while the linear methods need over 30 variables to reach competitive results. Compared with the neural network, the two-step approach is easier to implement and performs more stably. A drawback of the non-parametric methods is that they easily overfit a small sample, and that is the reason the performance drops as too many variables enter the model. With the GIC criterion, the two-step approach selects 3.25 ± 1.67 predictors, with an average accuracy of 0.914 ± 0.016. To show the non-linear relationship, Figure 2.4 on page 64 shows the estimated functions for the 6 most frequently selected variables.
Figure 2.3: The classification accuracy against the number of nonzero variables measured on a testing set for Example 2.6 over 500 repetitions. The two-step approach, the logistic regression with Lasso, the l_1 norm penalized SVM and the sparse group lasso neural network are included in comparison.
Figure 2.4: The estimated functions for the most frequently selected functions, ordered by descending frequency, for Example 2.6.

Example 2.7. In this example, we investigate the performance of the two-step approach on Gamma regression. The data set is from the National Oceanic and Atmospheric Administration (NOAA). We use the storm data, which includes the occurrence of storms in the United States with the time, location, property damage, a narrative description, etc. Here we only take the data in Michigan from 2010 to 2018 and keep the narrative description as our predictor variable and the property damage as our response variable. The description is in text, therefore we applied the word embedding algorithm Word2vec [96] to transform each description into a numeric representation vector of length p = 701; similar word embedding preprocessing can be found in [77]. The response variable property damage has a long tail distribution, thus we use a Gamma regression here. After removing outliers, the data set contains 3085 observations. In order to study the high-dimensional case, we randomly sample 10% of the observations as our training data (n = 309) and the rest are used for validation. Moreover, the response is normalized with the location and scale parameters of the Gamma distribution.

To see the performance of our procedure, we ran 50 replications. We choose order l = 4, which implies a cubic B-spline. We choose m_n = 9, which implies there are 5 inner knots, evenly placed over the empirical percentiles of the data. Since there are limited libraries available for variable selection under the high-dimensional setup, we compare the two-step approach with the linear regression with Lasso on a logarithm transformation of the response variable. The prediction error at different levels of sparsity is shown in Figure 2.5 on page 65. With the GIC criterion, the two-step approach selects 34.45 ± 3.52 predictors, with an average MSE of 0.004334 ± 0.000115.
Figure2.5:ThetestingMSEagainstthenumberofnonzerovariablesmeasuredonatesting
setforexample2.7over50repetitions.Thetwo-stepapproachandlogarithmtransformation
withtheLassoareincludedincomparison.
2.7Discussion
Inthischapter,weconsideredultrahigh-dimensional(
log
p
n
=
O
(
n
ˆ
)
)generalizedadditive
modelwithadivergingnumberofnonzerofunctions(
s
n
!
0
asn
!1
).Afterusingbasis
expansiononthenonparametricfunctions,weusedtwostepprolassoand
adaptivegrouplassotoselectthetruemodel.Wehaveprovedthescreeningconsistencyof
thegrouplassoestimatorandtheselectionconsistencyoftheadaptivegrouplassoestimator.
Theratesofconvergenceofbothestimatorswerealsoderived,whichprovedthattheadaptive
grouplassodoeshaveanimprovementontheestimator.Thewholepaperprovidesasolid
foundationfortheexistingmethods.Finallyweprovedthatunderthisnonparametricset
up,thegeneralizedinformationcriterion(GIC)isagoodwaytoselectthetuningparameter
65
thatconsistentlyselectsthetruemodel.
Inthispaper,weuseda˝xeddesignonthedatamatrix
X
.Arandomdesignon
X
couldbeconsidered,i.e.,
X
hasacontinuousdistributionfunction
f
X
(
X
)
onitsinterval
[
a;b
]
,
however,extraassumptionssuchastheboundednessofthedensityfunctionareneededto
reachthesameresult.AlsoweprovedtheselectionconsistencyoftheGICprocedureon
theadaptivegrouplassoestimator,conditioningthattheinitialestimatorsatis˝es2.14,
whichispossessedbythegrouplassoprocedurewithprobabilitytendingto1.However,the
theoryofscreeningconsistencyforthegrouplassoestimatorisstilltobeestablished.This
isachallengingproblem,sincetheredoesn'thavetoexistatuningparameterthatgives
selectionconsistencyinthegrouplassoprocedure,butthisisaninterestingproblemthat
deservesfurtherinvestigation.Wealsodiscussedthesubsetselectionandsubsetselection
withshrinkageunderoursetup.Thetheoreticalinvestigationsuggeststheotherpenalty
functionsmaynothaveclearadvantagesovertheproposedprocedure.
Moreover,theheteroskedasticerrorcaseisalsoattractinginhigh-dimensionalGAM.
ThesquarerootLasso[10]hasbeenprovedtoovercomethisissue,however,ithasn'tbeen
extendedtothenon-parametricsetup.ItcouldbeinterestingtoapplysquarerootLasso
ontheGAMtoincorporatethiscase.Thisisandemandingtopicthatdeservesfurther
investigationaswell.
66
Chapter3
SparseNeuralnetwork
Aswementionedintheintroductionchapter,thegeneralizedadditivemodelisappropriate
intheadditivecase.Ifthevariablesinteractwitheachother,it'sbettertouseamulti-variate
approximation,theneuralnetwork.Inthischapter,wewillstudythesparsegrouplasso
penalizedshallowneuralnetwork.Wewillshowthatundercertainassumptions,thismodel
haveclassi˝cationrisktendingtotheoptimalrisk,theBayesrisk.Simulationsandrealdata
examplesareusedtosupportthearguments.
3.1Introduction
High-dimensionalmodelswithcorrelatedpredictorsarecommonlyseeninpractice.Most
statisticalmodelsworkwelleitherinlow-dimensionalcorrelatedcase,orinhigh-dimensional
independentcase.Therearefewmethodsthatdealwithhigh-dimensionalcorrelatedpre-
dictors,whichusuallyhavelimitedtheoreticalandpracticalcapacity.Neuralnetworkshave
beenappliedinpracticeforyears,whichhaveagoodperformanceundercorrelatedpredic-
tors.Thereasonisthatthenon-linearityandinteractionsarebroughtinbytheactivation
functionsandnodesinthehiddenlayers.Auniversalapproximationtheoremguarantees
thatasingle-layerarti˝cialneuralnetworkisabletoapproximateanycontinuousfunction
withanarbitrarilysmallapproximationerror,providedthatthereisalargeenoughnumber
ofhiddennodesinthearchitecture.ThustheANNhandlesthecorrelationandinteractions
67
automaticallyandimplicitly.Apopularmachinelearningapplicationthatdealswiththis
typeofdependencyisthespatio-temporaldata,wherethetraditionalstatisticalmethods
modelthespatialcovariancematrixofthepredictors.However,byarti˝cialneuralnetworks,
workingwiththisbigcovariancematrixcanbeavoided.Moreover,arti˝cialneuralnetworks
alsohavegoodperformanceincomputervisiontasksinpractice.
Amaindrawbackofneuralnetworksisthatitrequiresahugenumberoftrainingsample
duetolargenumberofinherentparameters.Insomeapplication˝elds,suchasclinical
trials,brainimagingdataanalysisandsomecomputervisionapplications,it'susuallyhard
toobtainsuchalargenumberofobservationsinthetrainingsample.Thusthereisaneed
fordevelopinghigh-dimensionalneuralnetworkswithregularizationordimensionreduction
techniques.Itisknownthat
l
1
normregularization[126]shrinksinsigni˝cantparameters
tozero.Commonlyusedregularizationincludes
l
p
normregularization,forexample,see
thekeraspackage[26].
l
p
normregularizationwith
p
2
controlsthemodelsensitivity
[74].Ontheotherhand
l
p
normregularizationwith
p<
2
,wherepeopleusuallytake
p
=1
forcomputatione˚ciency,doesnotencouragegroupinformation.Thegrouplasso
regularization[148]yieldsgroup-wisesparsenesswhilekeepingparametersdensewithinthe
groups.Acommonregularizationusedinhigh-dimensionalarti˝cialneuralnetworkisthe
sparsegrouplassoby[114],seeforexample[53],whichisaweightedcombinationofthe
lassoregularizationandthegrouplassoregularization.Thegrouplassoregularizationpart
penalizestheinputfeatures'weightsgroup-wisely:afeatureiseitherselectedordropped,
anditisconnectedtoallnodesinthehiddenlayerifselected.Thelassopartfurthershrinks
someweightsoftheselectedinputsfeaturestozero:afeaturedoesnotneedtobeconnected
toallnodesinthehiddenlayerwhenselected.Thispenalizationencouragesasmanyzero
weightsaspossible.Anothercommonwaytoovercomethehigh-dimensionalityistoadd
68
dropoutlayers[117].Randomlysettingparametersinthelaterlayerstozerokeepslessnon-
zeroestimationsandreducesthevariance.Dropoutlayerisprovedtoworkwellinpractice,
butnotheoreticalguaranteeisavailable.[82]considersadeepnetworkwiththecombination
ofregularizationinthe˝rstlayeranddropoutinotherlayers.Withadeeprepresentation,
neuralnetworkshavemoreapproximationpowerwhichworkswellinpractice.Theypropose
afastandstablealgorithmtotrainthedeepnetwork.However,notheoreticalguaranteeis
givenfortheproposedmethodotherthanpracticalexamples.
Ontheotherhand,thoughwidelyused,high-dimensionalarti˝cialneuralnetworksstill
donothaveasolidtheoreticalfoundationforstatisticalvalidation,especiallyinthecaseof
classi˝cation.Typicaltheoryforlow-dimensionalANNstracesbacktothe1990s,including
[33,8,119,4].Theexistingresultsincludetheuniversalapproximationcapabilitiesofsingle
layerneuralnetworks,theestimationandclassi˝cationconsistencyundertheGaussianas-
sumptionand0-1lossinthelowdimensionalcase.Thesetheoryassumesthe0-1losswhich
isnotusednowadaysandarenotsu˚cientforhigh-dimensionalcaseasconsideredhere.
Currentworksfocusmoreondevelopingnewcomputingalgorithmsratherbuildingtheoreti-
calfoundationsoronlyhavelimitedtheoreticalfoundations.[53]derivedaconvergencerate
ofthelog-likelihoodfunctionintheneuralnetworkmodel,butthisdoesnotguaranteethe
universalclassi˝cationconsistencyortheconvergenceoftheclassi˝cationrisk.Theconver-
genceofthelog-likelihoodfunctionisnecessarybutnotsu˚cientfortheclassi˝cationrisk
toconverge.Inthispaper,weobtainedconsistencyresultsofclassi˝cationriskforhigh-
dimensionalarti˝cialneuralnetworks.Wederivedtheconvergenceratefortheprediction
error,andprovedthatundermildconditions,theclassi˝cationriskofhigh-dimensionalarti-
˝cialneuralnetworkclassi˝eractuallytendstotheoptimalBayesclassi˝er'srisk.Thistype
ofpropertyhasbeenestablishedonotherclassi˝erssuchasKNN[23],whichderivedthe
69
resultthattheclassi˝cationriskofKNNtendstotheBayesrisk,LDA[79],whichderivesthe
classi˝cationerrorrateunderGaussianassumptionsandetc.Populartaskslikeanalyzing
MRIdataandcomputervisionmodelswerealsoincludedintheseresearchpapers,andwe
appliedthehigh-dimensionalneuralnetworktothesedemandingtasksaswell.
Insection3.2,wewillformulatetheproblemandthehigh-dimensionalneuralnetwork
formally.Insection3.3,westatetheassumptionsandthemainconsistencyresult.Insection
3.4,wecomparedthehigh-dimemsionalneuralnetworkwithotherneuralnetworkvariable
selectionapproaches.Insection3.5,weapplythehigh-dimensionalneuralnetworkinthree
di˙erentaspectsofexamples:thegenedata,theMRIdataandthecomputervisiondata.
Insection3.6,furtherideaswillbediscussed.
3.2TheBinaryClassi˝cationProblem
Considerthebinaryclassi˝cationproblem
P
(
Y
=1
j
X
=
x
)=
f
(
x
)
;P
(
Y
=0
j
X
=
x
)=1
f
(
x
)
;
where
x
2
R
p
isthefeaturevectordrawnfromthefeaturespaceaccordingtosomedis-
tribution
P
X
,and
f
(
):
R
p
!
R
issomecontinuousfunction.Noteherethat,inthe
function
f
(
x
)
,therecanbeanyinteractionsamongthepredictorsin
x
,whichensuresthe
possibilitytohandlecorrelatedpredictors.Let
P
X
;Y
bethejointdistributionon
(
X
;Y
)
,
where
X
2
R
p
and
Y
2f
0
;
1
g
.Here
p
couldbelarge,andmaybeevenmuchlargerthan
thetrainingsamplesize
n
.Tostudythetheory,weassume
p
hassomerelationshipwith
n
,forexample,
p
=
O
(exp(
n
))
.Therefore,
p
shouldbewrittenas
p
n
,whichindicatesthe
70
dependency.However,forsimplicity,wesuppressthenotation
p
n
anddenoteitwith
p
.
Foranewobservation
x
0
2
R
p
,theBayesclassi˝er,denoted
C
(
X
)
,predicts
1
if
f
(
x
)
p
s
and
0
otherwise,where
p
s
2
(0
;
1)
isaprobabilitythreshold,whichisusuallychosenas
1
=
2
inpractice.TheBayesclassi˝erisprovedtominimizetherisk
R
(
C
)=
Z
R
p
0
;
1
g
1
f
C
(
X
)
6
=
Y
g
dP
X
;Y
:
(3.1)
However,theBayesclassi˝erisnotusefulinpractice,since
f
(
x
)
isunknown.Thusa
classi˝eristobefoundbasedontheobservations
f
(
x
1
;y
1
)
;:::;
(
x
n
;y
n
)
g
,whicharedrawn
from
P
X
;Y
.Agoodclassi˝erbasedonthesampleshouldhaveitsrisktendtotheBayesrisk
asthenumberofobservationstendstoin˝nity,withoutanyrequirementforitsprobability
distribution.Thisistheso-calleduniversalconsistency.
Multiplemethodshavebeenadoptedtoestimate
f
(
x
)
,includingthelogisticregression
(alinearapproximation),generalizedadditivemodels(GAM,anon-parametricnonlinear
approximationwhichdonottakeinteractionsintoaccount),neuralnetworks(acomplicated
structurewhichisdenseincontinuousfunctionsspace)andetc.The˝rsttwomethods
usuallyworkinpracticewithgoodtheoreticalfoundation,however,theysometimesfailto
catchthecomplicateddependencyamongthefeaturevector
x
inawiderangeofapplications
(brainimages,computervisions,spatialdataanalysis).Theneuralnetworkstructureis
provedtobeabletocapturethisdependencyimplicitlywithoutexplicitlyspecifyingthe
dependencyhyper-parameters.Considerasinglelayerneuralnetworkmodelwith
p
predictor
variables.Thehiddenlayerhas
m
n
nodes,where
m
n
maybeadivergingsequencedepending
on
n
.Similarto
p
n
,wesuppress
m
n
as
m
.
Foraninputvector
x
2
R
p
,itsweightmatrix
2
R
p
m
anditshiddenlayerintercept
71
vector
t
2
R
m
,letthevector
˘
2
R
m
bethecorrespondingvaluesinthehiddennodes,
whichisde˝nedas
˘
k
=
t
k
+
T
k
x
;k
=1
;:::;m:
Let
(
)
beanactivationfunction,thentheoutputforagivensetofweight
iscalculated
by
b
+
T
(
˘
)
;
wherethefunction
(
)
isthefunction
(
)
beingappliedelement-wisely.Wehaveawide
rangeofchoicesfortheactivationfunction.[78]provedthataslongastheactivationisnot
algebraicpolynomials,thesinglelayerneuralnetworkisdenseinthecontinuousfunction
space,thuscanbeusedtoapproximateanycontinuousfunction.Thisstructurecanbe
consideredasamodelwhich,foragivenactivationfunction
(
)
,mapsa
p
1
inputvector
toanreal-valuedoutput
(
;
t
;
;b
)
(
x
)=
T
(
t
+
T
x
)+
b
=
m
X
k
=1
k
(
t
k
+
T
k
x
)+
b;
where
(
;
t
;
;b
)
(
x
)
2
R
istheoutputofthesinglehiddenlayerneuralnetworkwithparameter
(
;
t
;
;b
)
.Applyingthelogisticfunction
˙
(
)
,
˙
(
(
;
t
;
;b
)
(
x
))
2
(0
;
1)
asanapproximation
of
f
(
x
)
withparameters
(
;
t
;
;b
)
P
(
Y
=1
j
X
=
x
)
ˇ
˙
(
(
;
t
;
;b
)
(
x
))
;P
(
Y
=0
j
X
=
x
)
ˇ
1
˙
(
(
;
t
;
;b
)
(
x
))
;
where
˙
(
)=exp(
)
=
[1+exp(
)]
.Accordingtotheuniversalapproximationtheorem,see[33],
withabigenough
m
,thesinglelayerneuralnetworkisabletoapproximateanycontinuous
72
functionwithaquitesmallapproximationerror.
By[8],assumingthatthereisaFourierrepresentationof
f
(
x
)
oftheform
f
(
x
)=
R
R
p
e
i
!
T
x
~
F
(
d
!
)
,let
B;C
=
f
f
(
):
R
B
k
!
k
2
j
~
f
(
d
!
)
0
thatmaydependon
m
,
s
,
f
andthedistribution
P
X
;Y
,butdoesnotdependon
p
suchthatforall
˚
2
EQ
0
,we
have
"
r
2
˚
Z
R
p
0
;
1
g
l
˚
(
y;
x
)
dP
X
;Y
!#
˚
=
˚
0
h
min
2
6
4
I
0
00
3
7
5
where
A
B
meansthat
A
B
isapositivesemi-de˝nitematrix.
Thenextassumptionismadetoboundtheexcesslossfrombelowfortheparameters
outside
EQ
0
,i.e.,thetruemodelisidenti˝able.Let
d
0
(
˚
)
betheminimumdistancefrom
anelementin
EQ
0
to
˚
,thenweassume
79
Assumption3.4
(Identi˝ability)
.
Forall
>
0
,thereisan
thatmaydependon
m
,
s
,
f
andthedistribution
P
X
;Y
,butdoesnotdependon
p
suchthat
inf
˚
8
<
:
"
(
˚
):
d
0
(
˚
)
and
k
S
C
k
1
3
X
j
2
S
(
(
j
)
0
;
(
˚
)
(
j
)
)+
k
(
t
;
;b
)
(
t
0
;
(
˚
)
;
0
;
(
˚
)
;b
0
;
(
˚
)
)
k
2
o
Assumption3.3statesthatthoughneuralnetworkisanon-convexoptimizationproblem
globally,theparametersofthebestneuralnetworkapproximationofthetruefunction
f
(
x
)
hasalocallyconvexneighborhood.Theassumptioncanbeassuredinthisway.Bythe
continuityoftherepresentationofneuralnetworkandthelossfunction,theintegrationin
assumption3.3isin˝nitelycontinuouslydi˙erentiablewithrespecttothenonzeroparame-
ters,thereforethesecondderivativeisacontinuousfunctionofthenonzeroparameters.By
thede˝nitionoftheparametersofthebestneuralnetworkapproximation,
˚
0
minimizes
theintegrationinassumption3.3.Ifthereisn'tapositive
h
min
thatsatis˝esassumption,it
eithercontradictswiththefactthatthesecondderivativeiscontinuousorthede˝nitionof
˚
0
.
Assumption3.4statesthatthenon-optimalneuralnetworkscanbedistinguishedfrom
thebestneuralnetworkapproximationintermsoftheexcessloss,iftheparametersofthe
non-optimalneuralnetworkisnotinthe
-neighborhoodofanyoftheparametersofthebest
neuralnetworkclass
EQ
0
.Similartothecompatibilityconditionin[18],theconditiondoes
nothavetoorevenmaynotholdforthewholespace,butisonlyneededinthesubspace
f
˚
:
k
S
C
k
1
3
P
j
2
S
(
(
j
)
0
;
(
˚
)
(
j
)
)+
k
(
t
;
;b
)
(
t
0
;
(
˚
)
;
0
;
(
˚
)
;b
0
;
(
˚
)
)
k
2
g
,thusthis
isaweakerconditionthanimposingthelowerboundontheexcessloss.Thesubspaceis
80
derivedfromthethebasicinequalityofthede˝nitionof
^
˚
withrearrangingtermsandnorm
inequalities,seeforexample[18].Similarsubspacecanalsobeenfoundinthecompatible
conditionin[94].Since
s
isunknown,itcannotbecheckedinpractice,butitissu˚cient
tochecktheinequalityforallsets
S
2f
1
;:::;p
g
withcardinality
s
0
if
s
0
isknown,whichis
astrongerversionofassumption3.4.
Nowwearereadytostateourmainresult.Wehavetoadmitthatourtheorybased
ontheestimatorfrom3.3istheglobaloptima,whichsu˙ersfromthebiggestproblemin
optimizationresearch:thegapbetweentheglobaloptimaintheoryandalocaloptimain
practice.Wewillleavethiscomputationalissuetofutureresearch.
Theorem3.1.
Underassumptions1-4,letourestimationbefrom3.3,choosingtuningpa-
rameter
2
T
~
forsomeconstant
T
1
and
~
=
c
p
m
log
n=n
(
p
log
Q
+
p
m
log
p
log(
nm
)
=
(1
+
=
p
m
))
,if
log(
n
)
=
(
n
~
2
)
!
0
,
s
2
2
=
(
n
~
2
)
!
0
and
n
1
m
9
=
2
s
5
=
2
p
log(
p
)
!
0
as
n
!1
,assumethatourpredictioniswithinaconstantdistancefromthebestapproxima-
tion
0
(
x
)
,thenwehave
R
(
^
C
)
R
(
C
)
!
0
asn
!1
Aproofofthistheoremisgiveninappendix.Thistheoremstatesthatwithproper
choiceoftuningparametersandundersomemildassumptionsandcontrolsof
n
,
p
and
s
,
thehigh-dimensionalneuralnetworkwithsparsegrouplassoregularizationtendstohave
theoptimalclassi˝cationrisk.Thisisasigni˝cantimprovementinthetheoreticalneural
networkstudy,sinceitgivesthetheoreticalguaranteethathighdimensionalneuralnetwork
willde˝nitelyworkinsuchsituations.
81
3.4Simulation
Inthissection,wewillshowtwoexamples.The˝rstexampleisarevisitofthesimulation
studyin[82],whereweshownumericalresultsthattheSGLNN'sperformanceiscloseto
theDNP'sperformanceintheirsetup.Thesecondexampleconsidersascenariowherethe
samplesizeismuchsmaller,whereweshownumericalresultsthattheSGLNNout-performs
theDNP.
3.4.1DNPSimulation:Revisit
Inthissubsection,werevisittheexperimentin[82]andcomparethosemodelswiththe
neuralnetworkwithsparsegrouplassoregularization.Weuseexactlythesamesetupas
[82].Theinputvariable
X
isdrawnfrom
U
(
1
;
1)
,wherethefeaturedimension
p
is˝xed
tobe
10000
.Thecorrespondinglabelsareobtainedbypassing
X
intothefeedforward
neuralnetworkwithhiddenlayersizes
f
50
;
30
;
15
;
10
g
andReLUactivationfunctions.Input
weightsconnectingthe˝rst
m
inputsarerandomlysampledfrom
N
(0
;
0
:
5)
.Theremaining
inputweightsarekeptzero.Furthermore,5%ofthelabelsarerandomlychosenand˛ipped
toaddnoises.Foreach
m
=2
;
5
;
10
;
25
,wegenerate800trainingsamples,200validation
samplesand7500testsamples.WereporttheAUCandF1scoresofthemodelsintable3.1
on5repetitionsofthedatageneration,whichisexactlythesameasin[82].TheDNPmodel
iscodedaccordingtotheiralgorithmoutlineinpythonwithpyTorch.TheHSICLassois
implementedwiththepackagebytheauthorsfollowedbytheSVMpackageinscikit-learn.
TheLogR-
l
1
modelisimplementedwiththeLogisticRegressionCVpackageinscikit-learn.
Inthissimulationstudy,theassumptionsinourtheoryareallsatis˝ed.First,wehavea
sparselevelof2,5,10and25,whichareverysmallportionsofthetotalnumberoffeatures,
82
10000,thusassumption2.1issatis˝ed.Second,we'vecontrolledtheseedsuchthatthe
labelsarebalancedbetween0and1,andthetrueprobabilitiesgeneratedfromtheneural
networkareboundedawayfrom0and1byanon-negligibleconstant,thusassumption2.2
issatis˝ed.Then,wegeneratexfromuniformdistributionandyfromaneuralnetwork
structure,whichisacontinuousdistributionplusacontinuousmapfromtheoriginalspace
totheprobabilities
[0
;
1]
.Therefore,thelocalconvexityassumptionisjusti˝edbycontinuity.
Finally,assumption2.4isapropertyofneuralnetworksthathasbeenarguedinsection3.3.
Therefore,thisexamplenotonlyservesasarevisitoftheDNPpaper,butalsoservesasa
supportforourtheory.
Resultsofthisexampleisshownintable3.1.Fromtheresults,weseeboththeSGLNN
modelandtheDNPmodeloutperformtheothertwobaselinemodels.TheSGLNN'sper-
formanceisveryclosetotheperformanceofDNPwithasmallgap,whichisinaccordance
toourexpectations.Deeperneuralnetworkshavemuchhigherrepresentationpowersof
complicatedfunctions.Thestabilityofthetwomodelsareclose,withtheSGLNNhaving
slightlysmallerSEinthe
m
=2
and
m
=5
cases.TheDNPmodeldoesnotusedropout
inthepredictionprocess,whiletheSGLNNuses
l
1
normpenaltyalongwiththegroup
lassopenalty.TheSGLNNisexpectedtoshowstablerresults.Thereasonthatweseeno
signi˝cantdi˙erenceinSEisthatthesamplesize,800,islargeenoughtotraintheDNP
withafullnetworkontheselectedvariableswithoutover-˝tting.Tofurtherinvestigatethe
performance,westudythesmallersamplescenariointhenextsubsection.
3.4.2SmallerSampleSizeCase
Inthissubsection,weconsiderasmallersamplesize,whichhappensinmanyapplications
suchasclinictrials,geneticexpressiondataanalysisandMRIdataanalysis.Withthesame
83
TrueDim
2
5
10
25
LogR-
l
1
AUC(std)
0.897(0.015)
0.745(0.035)
0.661(0.029)
0.629(0.015)
HSIC-Lasso
AUC(std)
0.920(0.001)
0.844(0.015)
0.732(0.025)
0.638(0.021)
DNP
AUC(std)
0.925(0.020)
0.879(0.035)
0.784(0.020)
0.669(0.016)
SGLNN
AUC(std)
0.911(0.015)
0.862(0.021)
0.770(0.021)
0.658(0.016)
LogR-
l
1
F1score(std)
0.889(0.022)
0.748(0.032)
0.668(0.037)
0.638(0.027)
HSIC-Lasso
F1score(std)
0.914(0.000)
0.791(0.010)
0.680(0.012)
0.368(0.028)
DNP
F1score(std)
0.959(0.009)
0.849(0.033)
0.769(0.017)
0.811(0.015)
SGLNN
F1score(std)
0.940(0.005)
0.839(0.012)
0.747(0.019)
0.754(0.015)
Table3.1:TheAUCandF1scoreofthecomparedmodelsinthesimulationstudy.Standard
errorsaregivenintheparentheses.
Trainingsize
Metric
100
200
300
500
DNP
AUC(std)
0.606(0.118)
0.701(0.091)
0.727(0.074)
0.828(0.043))
SGLNN
AUC(std)
0.624(0.099)
0.703(0.089)
0.762(0.053)
0.820(0.026)
DNP
F1score(std)
0.567(0.095)
0.634(0.073))
0.679(0.043)
0.750(0.035)
SGLNN
F1score(std)
0.602(0.073)
0.641(0.069)
0.690(0.029)
0.746(0.029)
Table3.2:TheAUCandF1scoreofthecomparedmodelsinasmallersamplesizescenario
with
m
=5
.Standarderrorsaregivenintheparentheses.
setupasthemodelgenerationinthelastsubsection,wegenerateatrainingsampleofsize
n
=100
;
200
;
300
and
500
.Thenumberofactivefeaturesis˝xedtobe
m
=5
.Thetotal
samplesizeissetto
10000
,thusthecorrespondingtestingsamplesizesare
9900
;
9800
;
9700
and
9500
.WecomparetheperformancebetweentheSGLNNandDNPinthissetup.The
resultsareshownintable3.2.
Fromtheresults,weseetheSGLNNmodeloutperformstheDNPinthesmallersample
sizescenariowhen
n
=100
;
200
;
300
.ThereasonisthatDNPover˝tsthetrainingdatadue
tosmallsamplesize,whileSGLNNhaslowerriskofover-˝ttingcomparedwithDNP.In
allthefoursamplesizes,theSGLNNhassmallerSEthantheDNPmodel'sSE.SGLNN
achievedthisbyasimplerrepresentationandtheextra
l
1
normregularization.Theresults
suggestthatweneedasimplermodeltopreventover-˝ttingwhenthetrainingsamplesize
84
isverysmall,whichisthecaseinmostbiologydata.Moreover,anextra
l
1
regularization
makesthemodelmorestableandreliable.
3.5RealDataexamples
Inthissection,wegaverealdataapplicationsindi˙erentresearchareas:geneexpression
data,MRIdataandcomputervisiondata.Theseexamplesindicatethatthesparseneural
networkhasgoodperformanceandisusefulincorrelatedpredictorsituations.Throughall
threeexamples,theregularizedneuralnetworkwasimplementedwithfastspeedusingthe
algorithmby[53]throughtheirlibraryinpython3onadesktopcomputerwithUbuntu18
systemonai7processorandGTX1660TIgraphiccards.
Toevaluatethemodelperformance,allaccuracyresultsaremeasuredonthetestingsets
whicharenotusedfortrainingandaveragedondi˙erenttraintestsplits.Thenumberof
featuresintheresultsaremediansamongallnumbersoffeaturesthatcorrespondtothe
bestmodelsevaluatedfromcross-validation.
3.5.1example1:ProstateCancerData
Inthisexample,weconsideredaprostatecancergeneexpressiondata,whichispublicly
availablein
http://featureselection.asu.edu/datasets.php
.Thedatasetcontainsa
binaryresponsewith102observationson5966predictorvariables.Clearly,thedataset
isreallyahighdimensionaldataset.Theresponseshavevalues1(50samplepoints)and
2(52samplepoints),where1indicatesnormaland2indicatestumor.Allpredictorsare
continuouspredictors,withpositivevalues.
40observationsfromnoexpressiongroupand40observationsfromexpressiongroupwere
85
randomlyselectedtoformthetraininggroup.Therest22observationsformthetesting
group.WerunthesparseANNmodelonareplicationof100di˙erenttrain-testsplits.On
average,thesparseneuralnetworkselectsonly18predictorsanduses4hiddennodes.Using
across-validationtechnique,thehyper-parametersthusthenumberoffeaturesweredecided.
Ithasaaveragetrainingerrorrateof
0
:
04
andatestingerrorrateof
0
:
045
.Theresults
comparedwithothermethodsarelistedintable3.3.ThesparseANNand
l
1
penalizedlinear
SVMperformthebestwith95.5%and95%accuracy,respectively.Thegradientboosting
treeclassi˝er[25]isapowerfulensembleclassi˝cationmethodbutperformsworsethan
regularizedmethodsinthehigh-dimensionalsettingwithanaccuracyof92.2%using83.5
features(onanaverage).Thelogisticregressionwith
l
1
regularizationuses36featuresand
achievesanaccuracyof93.3%.TheGAMperformstheworstwithanaccuracyof91.8%,
mainlyduetotheinfeasibilityofbasisexpansioninthisdataset,wherethedatadistribution
ishighlyskewed.
Insummary,weshowednumericallythatthesparsegrouplassopenalizedneuralnetwork
isabletoachieveatleastasgoodastheexistingmethodsalongwithprovidingstrongtheo-
reticalsupport.Thegreaterstandarderrormainlycomesfromadditionaltuningparameters.
Intermsofthenumberofpredictorvariables,itisnotthebest.However,aswefoundinthe
investigation,sinceANNisanon-convexoptimizationproblem,aspeoplecontinuetotrain
themodelandtunethehyper-parameters,theygetbetteraccuracyrateswithlessnumber
ofpredictorvariables.
3.5.2example2:MRIDataforAlzheimer'sDisease
DatausedinthisexampleisfromtheAlzheimer'sdiseaseNeuroimagingInitiative(ADNI)
database(
http://www.loni.ucla.edu/ADNI
).WeusedT1-weightedMRIimagesfrom
86
Classi˝erTestaccuracyNumberoffeatures
Regularizedneuralnetwork0.955(0.066)18
Gradientboostingtree0.922(0.058)83.5
LogisticRegressionwithLasso0.933(0.058)36
l1penalizedLinearSVM0.950(0.052)16
Generalizedadditivemodelwithgrouplasso0.918(0.061)5
Table3.3:Testaccuracywithstandarderrorinparenthesesandmedianofnumberoffeatures
fordi˙erentclassi˝ersintheProstategenedataexample.
thecollectionofstandardizeddatasets.ThedescriptionofthestandardizedMRIimaging
fromADNIcanbefoundin
http://adni.loni.usc.edu/methods/mri-analysis/adni-\
standardized-data/
.Theimageswereobtainedusingmagnetizationpreparedrapidgra-
dientecho(MARAGE)orequivalentprotocolswithvaryingresolutions(typically
1
:
0
1
:
0
mminplanespatialresolutionand1.2mmthicksagittalsliceswith
256
256
256
voxels).
Theimageswerethenpre-processedaccordingtoanumberofofstepsdetailedat
http://
adni.loni.usc.edu/methods/mri-analysis/mri-pre-processing/
,whichcorrectedgra-
dientnon-linearity,intensityinhomogeneityandphantom-baseddistortion.Inaddition,the
pre-processedimagingwereprocessedbyFreeSurferforcorticalreconstructionandvolumet-
ricsegmentationbyCenterforImagingofNeurodegnerativeDiseases,UCSF.Theskull-
strippedvolume(brainmask)obtainedbyFreeSurfercross-sectionalprocessingwereusedin
thisstudy.
Inourexample,weusedimagesfromADNI-1subjectsobtainedusing1.5Tscannersat
screeningvisits,andweusedthe˝rsttimepointiftherearemultipleimagesofthesame
subjectacquiredatdi˙erenttimes.Therearetotally414subjects,amongwhich187are
diagnosedasAlzheimer'sdiseaseand227healthysubjectsatthescreeningvisit.AnR
packageANTsRwereappliedforimagingregistration.Then3dresamplecommandinAFNI
wereusedtoadjusttheresolutionandreducethetotalnumberofvoxelsoftheimagingto
87
18
22
18
.The
x
axisand
y
axisforhorizontalplane,
x
axisand
z
axisforcoronalplane
and
y
axisand
z
axisforsagittalplane.Onlythe1100voxelslocatedinthecenterofthe
brainwereusedasfeaturesforclassi˝cation.Afterremovingthevoxelswithzerosignalfor
mostofthesubjects,wehave1971voxelsleftinuse.
Werandomlysampled100fromthe187ADsubjectsand100fromthe227healthy
subjectsasourtrainingset,andtherest214subjectsastestingset.ThesparseANNmodel
wasrunon100replicationsofdi˙erenttrain-testsplit.Onaverage,theneuralnetwork
˝nallyselects7predictorsandused12hiddennodes.Ithasatrainingerrorrateof
0
:
21
andatestingerrorrateof
0
:
224
.Theresultscomparedwithothermethodsarelistedin
table3.4.ThesparseANNhasacompetitiveperformancewhichisslightlybetterthan
thePMLE-LDAwithsimilarstandarderror.Thegoalofthisexperimentisnottoshow
thatthesparsegrouplassoneuralnetworkoutperformstheothermethodssigni˝cantly,
butjustdemonstratesthatthemodelcanbetunedtoworkasgoodastheothermethods
alongwithstatisticaltheory.With˝nertuningofthehyper-parameters,theresultscanbe
furtherimproved.WecomparedtheresultswithafewmethodsincludingtheMLE-LDA,the
PMLE-LDA(penalizedMLE-LDA;[79]),thePREG-LDA(penalizedregularLDA;[79]),the
FAIR(FeatureAnnealedIndependenceRule;[44])andtheNB(NaiveBayes;[14]).Methods
withoutregularizationarenotpresented,sincethehigh-dimensionalitypreventsitfrom
givingstablesolutions.ThepenalizedMLE-LDAproducesanaccuracyrateof77.2%using
only5predictorvariables.ThePREG-LDAmethodhastheleastnumberofpredictors,3,
buthasaloweraccuracyrate0.750.Therestofthemethodsperformworse.Besidesthe
application,thisexampleindicatesthattheANNcouldbeanalternativetoolforspatial
datamodeling,atleastforclassi˝cation.
88
Classi˝erTestaccuracyNumberoffeatures
Regularizedneuralnetwork0.776(0.031)7
MLE-LDA0.692(0.030)1971
PREG0.750(0.029)3
PMLE-LDA0.772(0.025)5
FAIR0.632(0.033)7
NB0.674(0.024)1971
Table3.4:Testaccuracywithstandarderrorinparenthesesandmedianofnumberoffeatures
fordi˙erentclassi˝ersintheMRIAlzheimer'sdiseaseexample.
Figure3.1:exampleimagefromtheKITTI2Dobjectdetectiondataset.
3.5.3example:KITTIAutonomousDrivingData
Forthegeneralization,wealsogiveanexampleonthecomputervisiontasks.Thedata
usedinthisexampleisfrom[57],whichhasdi˙erentaspectofimagesorstereosfromhigh-
resolutioncolorandgray-scalevideocamerasorladar(laserradar).Theimageandstereo
dataincludesdailytra˚cdatathathelpdevelopingautonomousdrivingindi˙erentaspects.
Inourexample,weusedthe2Dobjectdetectiondata,forexample,see˝gure3.1.
The2Dobjectdatasetincludes7481trainingimagesand7518testingimages,comprising
atotalof80256labeledobjects.Allimagesarecoloredandsavedaspng.Sincethiswasan
opencompetition,thelabelsfortestingdataarenotavailable.Weonlyusedthetraining
datasetasourexampledata.Theoriginaldataincludes13di˙erentlabelclasses,however,
toemphasizethebinaryclassi˝cationabilityoftheregularizedANNmodel,wetookonly2
89
Figure3.2:exampleimagesofpedestriansandcarsafterpre-processing
classes:pedestrianandcar.Thesub-imagesofpedestriansandcarswereextractedfromthe
originalimagesusingthelocationinformationprovidedbythedataset.Thisgivesus18000
carimagesand7000pedestrianimagesofresolutionrangingfrom40-by-40to180-by-150.
SincetheregularizedANNmethodsisespeciallyusefulwhenthesamplesizeissmall,we
randomlysample2000carsand2000pedestriansasourdata.Weshu˜edthe4000images,
anddividedtheminto800trainingimagesand3200testingimages.˝gure3.2showthe
imagesafterpre-processing.Apre-processingstepsaredonewithpythonlibrariesincluding
matplotlib,PILandpandas.
SinceANNdoesnothandlelocalfeatureinformationasgoodasconvolutionalneural
networks(CNN),wedidafeatureextractionusingthepre-trainedVGG19CNNmodel
[115],whoseweightsweretrainedon120millionimagesofover1000classes.Weadopted
theconvolutionallayersfromthemodelasafeatureextractor.Tofeedtheimagesintothe
model,are-sizing(fromtheoriginalsizeto224-by-224-by-3)wereperformedusingbi-linear
interpolationmethods.TheVGG19featureextractorgeneratesa4096-by-1vectorfromeach
image.OurANNmodeltakesthis4096-dimensionalinputandtrainsonthe800images.We
compareourresultswiththeregularANN,thelogisticregressionwith
l
1
regularization,the
90
Classi˝erTestaccuracyNumberoffeatures
SGLNN0.994(0.001)148
Neuralnetwork0.514(0.102)4096
Log-
l
1
0.991(0.002)176
SVM-
l
1
0.993(0.001)144
GAM0.989(0.002)152
Table3.5:Testaccuracywithstandarderrorinparenthesesandmedianofnumberoffeatures
fordi˙erentclassi˝ersintheKITTIautonomousdrivingexample.
Figure3.3:Testaccuracyscorevssparsitylevelinthethreeexamples.
SVMwith
l
1
regularizationandtheGAMwithgrouplassoregularization.Theresultsare
shownintable3.5.Theregularneuralnetworktotallyfailedduetothehigh-dimensionality
andhenceestimabilityissue.ThelogisticregressionandtheGAMhavemorenonzero
featuresandslightlygreaterstandarderror.TheSVMhassimilarresulttotheSGLNN,but
theSGLNNgainsaslightlybetterperformance.
Moreover,wehaveplottedtheaccuracyscoreforallthreeexamplesalongwiththe
sparsitylevel.
l
1
logisticregression,
l
1
SVMandgrouplassopenalizedgeneralizedadditive
model(GAM)areusedalongwiththesparsegrouplassoneuralnetwork(SGLNN).These
plotsgiveusahinthowsparsitylevelin˛uencesthepredictionresults.Theplotsaregiven
in˝gure3.3.Alltheseexamplesillustratetheusefulnessandthenumericalpropertiesofour
proposedhighdimensionalneuralnetworkmodel.
91
3.6Discussion
Inthispaper,weconsideredthesparsegrouplassoregularizationonhigh-dimensionalneural
networksandprovedthatundermildassumptions,theclassi˝cationriskconvergestothe
optimalBayesclassi˝er'srisk.Tothebestofourknowledge,thisisthe˝rstresultthat
theclassi˝cationriskofhigh-dimensionalsparseneuralnetworkconvergestotheoptimal
Bayesrisk.Neuralnetworksareverygoodapproximationstocorrelatedfeaturefunctions,
includingcomputervisiontasks,MRIdataanalysisandspatialdataanalysis.Weexpect
furtherinvestigationwarrantedincomingdays.Thecomputationalissueof˝ndingtheglobal
optimalinnon-convexoptimizationproblemsisstillwaitingtobesolvedtoeliminatethe
gapbetweenthetheoryandinpractice.Thiswillpavewayforfurthertheoreticalresearch.
3.6.1The
l
1
+
l
0
PenaltyandAlgorithm
Aninnovativeideathatdeservesfurtherinvestigationistospecifyalargernumberofhidden
nodesandusea
l
0
+
l
1
normpenaltyonthehiddennodesparameters
.Thismethods
searchesagranderofsolution˝eldandwillgiveamodelatleastasgoodasthe
l
2
norm
penalty.Moreover,
l
0
+
l
1
normpenaltyisprovedtoworkwellinlowsignalcases[92].In
detail,theformulationistominimize3.3plusanextraregularizationon
:
3
k
k
0
+
4
k
k
1
.
Thisformulationdoesnotbringinextratuningparameters,sincewereleasethepenalization
on
l
2
normof
andthenumberofhiddennodes
m
.With
l
0
+
l
1
normpenalty,the
parameterscanbetrainedusingcoordinatedescentalgorithmwiththe
l
0
+
l
1
normpenalty
beinghandledbymixedintegersecondorderconeoptimization(MISOCO)algorithmsvia
optimizationsoftwarelikeGurobi.Thisalgorithmaddsanextrasteptothealgorithmin
[53]tohandlethe
l
0
+
l
1
normpenalty.
92
Speci˝cally,weareoptimizing
^
sgl
;
^
t
sgl
;
^
sgl
;
^
b
sgl
=argmin
2
R
p
m
;
t
2
R
m
;
2
R
m
;b
2
R
1
n
l
(
;
t
;
;b
)
+
1
p
X
j
=1
k
(
j
)
k
2
+(1
)
1
k
k
1
+
2
k
k
1
suchthat
k
k
0
k
Let'sstartbycomputingthepartialderivativeof
l
withrespectiveto
(
j
)
.Let
˙
(
)
be
thesigmoidfunction
˙
(
x
)=
exp(
x
)
1+exp(
x
)
whichisourchoiceoftheactivationfunction
(
)
.Notingthatthesigmoidfunctionhas
derivative
˙
0
(
x
)=
˙
(
x
)[1
˙
(
x
)]
.Thenwehave
@l
@
jk
=
n
X
i
=1
[
y
i
˙
(
i
)]
k
˙
0
(
˘
ik
)
x
ij
Notethat
@
jk
@
k
=(0
;:::;
0
;
1
;
0
;:::;
0)
T
Thenweget,
@l
@
k
=
n
X
i
=1
[
y
i
˙
(
i
)]
˙
0
(
˘
i
)
x
i
k
=
k
x
T
diag
[
y
˙
(
)]
˙
0
(
˘
k
)
(3.5)
93
Thepartialderivativeof
l
w.r.t
t
isgivenby
@l
@
t
=
diag
[
]
˙
0
(
˘
)(
y
˙
(
))
where
˙
0
(
˘
)
ki
=
˙
0
(
˘
ki
)
.Thepartialderivativeof
l
w.r.t
isgivenby
@l
@
=
˙
(
˘
)(
y
˙
(
))
Finally,thepartialderivativeof
l
w.r.t
b
isgivenby
@l
@b
=
y
˙
(
)
Sincethesparsegrouplassopenaltyonlyappliesto
,thussimilarto[53],ageneralized
gradientdescentcanbeappliedto˝ndacriticalpoint.Nowwe˝x
anditerate
,
t
and
b
.Thisupdateincludesthreesteps.The˝rststepistoupdatetheparameters
,
t
and
b
inthelikelihoodpartusinggradientdescent.Thenlet
S
(
;
):
R
m
n
R
7!
R
m
n
bethe
coordinate-wisesoft-thresholdingoperator
(
S
(
x;c
))
ij
=
sign
(
x
ij
)(
j
x
ij
j
c
)
+
Thesecondstepperformsasoft-thresholdingoperationforeach
ij
duetothelassopenalty.
Finallythethirdstepperformsasoft-scalingoperationoneachgroup
(
j
)
duetothegroup
lassopenalty.Thesoft-scalingoperation,see[114],isgivenby
new
(
j
)
=
0
@
1
k
old
(
j
)
k
2
1
A
+
old
(
j
)
94
where
old
(
j
)
isafterthesoft-thresholdingoperationand
isthestepsize.Thestepsizecan
befoundusinglinesearch.Inspeci˝c,let
"
(
k
)
bethe2-normofthechangeof
,
t
and
b
in
the
(
k
+1)
th
iteration.Thelinesearchcriterionissatis˝edif
1
n
l
(
(
k
+1)
;
t
(
k
+1)
;b
(
k
+1)
)
1
n
l
(
(
k
)
;
t
(
k
)
;b
(
k
)
)
(
k
+1)
(
"
(
k
)
)
2
forsome˝xed
t
2
(0
;
1)
.
Afterupdating
,
t
and
b
,we˝xthemandupdate
.Forsome
L
,wehave
^
=argmin
2
R
m
1
n
n
X
i
=1
h
y
i
(
˙
(
˘
i
)
T
+
b
)
log(1+exp(
˙
(
˘
i
)
T
+
b
))
i
+
k
k
1
s:t:
k
k
0
k
:=
f
(
)+
k
k
1
s:t:
k
k
0
k
UsingTaylorexpansionof
aroundsome
0
,wehave
f
(
)=
f
(
0
)+
r
f
(
0
)
T
(
0
)+
1
2
(
0
)
T
f
(
)(
0
)
f
(
0
)+
r
f
(
0
)
T
(
0
)+
L
2
k
0
k
2
2
:=
M
(
)
forsome
and
L
.Bythepropertiesofsigmoidfunction,
L
1
ispreferred.Observethat
minimize
M
(
)+
k
k
1
s:t:
k
k
0
k
,
minimize
~
M
(
)+
k
k
1
s:t:
k
k
0
k
where
~
M
(
)=
L
2
0
1
L
r
f
(
0
)
2
2
95
minimize
u
+
2
v
s:t:
L
2
(
m
)
1
L
r
f
(
(
m
)
)
2
2
u
Mz
j
j
Mz
j
;j
=1
;:::;p
z
j
2f
0
;
1
g
;j
=1
;:::;p
X
j
z
j
=
k
j
j
j
;j
=1
;:::;p
j
0
;j
=1
;:::;p;
X
j
j
v
Table3.6:TheMISOCOformulationforthe
l
1
+
l
0
penaltyindealingwiththe
l
0
normstep.
Thissuggestasimplecyclingfor
,i.e.
(
m
+1)
=argmin
2
R
m
L
2
(
m
)
1
L
r
f
(
(
m
)
)
2
2
+
2
k
k
1
s:t:
k
k
0
k
Thisconvertoptimizing
toamixedintegersecondorderconeoptimization(MISOCO)
problem,see[92].Theformulationisgivenintable3.6.Theoptimizationproblemisa
secondorderconewithonequadraticconstraintandsomelinearconstraints.Itcanbesolved
usingoptimizationsoftwarelikeGurobi.Therefore,theresultssuggestacyclingiterationon
theparametersalgorithmto˝ndtheestimatedparametersofthehigh-dimensionalneural
networkmodel.Thedetailalgorithmisgiveninalgorithm1.
96
Algorithm1:
Algorithmfortrainingclassi˝cationneuralnetworkwith
l
1
+
l
0
penalty
Initialneuralnetworkparameters
(0)
,
t
(0)
,
(0)
and
b
(0)
.Chooseinitialstepsize
(0)
andstepsizescalingparameter
s
2
(0
;
1)
.Selecttolerance
˝
;
while
Errorofallparametersisgreaterthan
˝
do
while
Errorof
,
t
and
b
isgreaterthan
˝
do
(
k
+1)
;
t
(
k
+1)
;b
(
k
+1)
=
(
k
)
;
t
(
k
)
;b
(
k
)
(
m
)
r
1
n
l
(
(
k
)
;
t
(
k
)
;b
(
k
)
)
(
k
+1)
=
S
(
k
+1)
;
(
m
)
1
(1
)
for
j=1,...,p
do
(
k
+1)
(
j
)
=
0
@
1
(
m
)
k
(
k
+1)
(
j
)
k
2
1
A
+
(
k
+1)
(
j
)
;
end
(
m
+1)
=
(
m
)
;
Untilsomelinesearchcriterionsatis˝es;
end
while
Errorof
isgreaterthan
˝
do
UsetheMISOCOformulationabovetooptimize
asbelow
(
k
+1)
=argmin
2
R
m
L
2
(
k
)
1
L
r
f
(
(
k
)
)
2
2
+
2
k
k
1
s:t:
k
k
0
k
end
end
97
Chapter4
EnsembleNeuralNetworkSelection
(ENNS)
Wehavementionedthatdirectlyoptimizingthepenalizedneuralnetworklossfunctionis
riskywhenthesamplesizeissmall,sinceit'seasytogettrappedinalocalminimal.As
thereisaconnectionbetweenthe
l
p;
1
normpenaltyandthestage-wiseselectionalgorithm,
it'snaturaltoconsideraddingvariablesonebyone.However,thedeepneuralpursuit
(DNP)algorithmin[82]doesnotimplementmethodstostabilizetheselectionprocess,and
theestimationisbasedonafullneuralnetwork,whichstillbringintheestimabilityissue.
Inthischapter,weconsideranensemblestage-wisevariableselectionalgorithmwithdeep
neuralnetworks,whichisnamedasensembleneuralnetworkselection(ENNS).Moreover,
weproposeaheuristicmethodstofurthersparsifytheneuralnetworkbyimposingsoft-
thresholdingfunctionstotheparameters.Wewillprovidemethodologyforthetwo-step
approach,andwewillshowthatthetwo-stepapproachenjoysselectionconsistencyand
riskconsistencyundercertainconditions.Simulationsandrealdataexamplesareusedto
supportthearguments.
98
4.1Introduction
High-dimensionalstatisticsmodeling[18]hasbeenpopularfordecades.Considerahigh-
dimensionalregressionorbinaryclassi˝cationproblem.Let
x
2
R
p
bethefeaturevector,
andlet
y
2
R
forregressionproblemand
y
2f
0
;
1
g
forclassi˝cationproblembetheresponse.
Ourgoalistobuildamodelbasedonthetrainingsample
f
(
x
1
;y
1
)
;:::;
(
x
n
;y
n
)
g
.Wehave
morefeaturesthanthesamplesize,i.e.,
p>n
.Moreover,manydatahavecomplicated
relationshipamongdi˙erentvariables,whichishardtocapturethroughalinearmodel.
Neuralnetworkistheoneofthebestmodelsatcapturingcomplicatedrelationships.It's
interestingtoconsideraneuralnetworkstructurebetween
x
and
y
.
Ingeneral,high-dimensionalmodeldoesnothaveconsistentestimationssincethesystems
havelessconstraintsthannumberofvariables.Twomajorapproachescanbeusedtodeal
withthehigh-dimensionality.The˝rstmajorapproachistoassumethatthefeaturespace
issparse,i.e.,onlyasmallfractionofthevariablesareincludedinthemodelingwith
y
.
Amodelwithonlyafractionoftheoriginalfeaturesenjoyssimplicityandinterpretability.
Sparsesolutionscanbeobtainedusingregularization[126,65]orstage-wisealgorithms[41].
Regularizationobtainssparsesolutionbyshrinkingtheunimportantfeatures'coe˚cients
tozero.Theestimatedcoe˚cientsareshrinkageestimatorsandthushavesmallervariance
[32].However,regularizationwithmultipletuningparameterstakeslongertorunandmay
besensitiveinthetuningparametersinpractice.Stage-wisealgorithmsaddsvariablesone
byoneandstopsatapreferredtime.
Thesecondmajorapproachisprojection-based.One˝ndsalowerdimensionalrepresen-
tationoftheoriginalfeaturespace.LinearprojectionmethodsincludethePCA[64]inthe
lowdimensionalcaseandsomeofitsvariants[68,157]inthehigh-dimensionalcase.Kernel
99
PCA[109]performsPCAonareproducingkernelHilbertspacetoachievenon-linearity.
Manifoldlearning[76]embedstheoriginalfeaturespacetoalow-dimensionalmanifold.Ex-
ceptforthemanifoldlearningalgorithmsthatreducetheoriginaldimensionalityto2or
3dimensionforvisualization,afewmanifoldinglearningincludingthemultidimensional
sacling(MDS)by[130],thelocallinearembedding(LLE)by[107],andtheIsomapby[124].
Applicationsofthemanifoldlearningalgorithmsinthehigh-dimensionalsetupisstudied
forspeci˝c˝elds,butageneralframeworkisnotavailable.Anotherpopulardimensionre-
ductiontechniqueistheauto-encoder[75],whichusesaneuralnetworktoencodethefeature
spaceanddecodetherepresentationtobeasclosetotheoriginalfeaturespaceaspossible.
Alloftheabovemethodsareunsupervised,andthelower-dimensionalrepresentationisno
longeranyoftheoriginalfeatures.Therefore,weloseinterpretabilitybydoingso.Current
manifoldlearningalgorithmsfocusmoreondatavisualization,whichreducesthedimension-
alitytotwoorthree,seeforexample[139].Theseapplicationsarenotusefulinbuilding
models.
Ontheotherhand,neuralnetworkshavebeenutilizedtomodelcomplicatedrelation-
shipsincethe1940s[72],andgainedmuchmoreattentionsincethecomputerhardware's
greatimprovementinthiscentury.Speci˝cally,[101]showedthatthecomputationofneu-
ralnetworkscanbegreatlyimprovedbyGPUaccelerationthanpurelyrunningonCPU,
thusmakesiteasytotraindeepernetworkstructures.Nowadays,variantneuralnetworks
arebeingappliedworld-wide,includingtheconvolutionalneuralnetwork(CNN),recurrent
neuralnetwork(RNN),residualnetwork(ResNet),andetc.Intheory,neuralnetworkworks
inrepresentingcomplicatedrelationshipsmainlyliesontheuniversalapproximationtheo-
rem[33,8,4,113].Thetheoremstatesthatashallowneuralnetwork(neuralnetworkwith
onehiddenlayer)isabletoapproximateanycontinuousfunctionwithanarbitrarilysmall
100
errorgivenahugenumberofhiddennodes,undermildassumptions.Inpractice,toreach
thisgoodapproximation,usuallyahugesetoftrainingdataisneeded,sincethenumber
ofparametersinaneuralnetworkismuchmorethanthatinothermodels.Moreover,the
non-convexityofaneuralnetworkstructuremakesitimpossibletoobtainaglobaloptimum.
Luckily,alocaloptimumoftheneuralnetworkprovidesagoodapproximation.
DeepNeuralNetwork(DNN)isaneuralnetworkwithadeepstructureofhiddenlay-
ers,whichhasbetterperformancethantheshallowneuralnetwork(neuralnetworkwith
only1hiddenlayer)inmanyaspectsincludingabroad˝eldofsubjectsincludingpattern
recognition,speechrecognitionandetc.,seeforexample[97,80,66].Thedeepstructure
hasagreaterapproximationpowerthanashallowneuralnetwork.Therehavebeenafew
literatureregardingtheapproximationpowerofdeepneuralnetworks[12,104,112,43].The
resultssuggestthatusingadeepneuralnetworkhelpsreducetheapproximationerror,which
isusefulinthecaseswheretheapproximationerrordominatesthetotalerror.Therefore,it's
necessarytoconsideradeepneuralnetworkmodeloverashallowmodel.However,˝nding
awaytomakethedeepneuralnetworkwelltrainedonasmallsamplesizeisnecessary.
Inthispaper,wewilldiscussthestage-wisevariableselectionalgorithmwithneural
networks.Wewillshowthattheexistingstage-wisealgorithmperformswellatthebeginning
andselectsthecorrectvariables.However,atthelatersteps,theprobabilitythatitwillselect
acorrectvariabledecreases.Thuswewillproposeanensemblealgorithmonthestage-wise
variableselectionalgorithm,namedensembleneuralnetworkselection(ENNS)algorithm.
Wewillshowthatthenewalgorithmselectallcorrectvariableswithhighprobabilityandits
falsepositiverateconvergestozero.Moreover,insteadofaregularneuralnetworktrained
ontheselectvariables,weproposedafewmethodstofurtherreducethevarianceofthe
˝nalmodel.Wewillgivenumericalcomparisonoftheseproposedalgorithms,andpropose
101
analgorithmforthe
l
1
penalizedneuralnetworkwithsoft-thresholdingoperators.
Insection4.2,wewillpresenttheideasandalgorithmsbehindtheENNSalgorithmand
themethodsofincreasingstabilityduringtheestimationstep.Insection4.3,wewillprovide
thetheoryfortheargumentsinthispaper.Insection4.4,wewillpresenttheresultsofsome
numericstudytosupportourtheoremsandarguments.Insection4.5,wewillevaluatethe
newmethodonrealworlddata.Section4.6willdiscusssomeconclusionandfutureworks.
4.2TheTwo-stepVariableSelectionandEstimationAp-
proach
Considerafeaturevector
x
2
R
p
andaresponse
y
2
R
fortheregressionsetupand
y
2f
0
;
1
g
intheclassi˝cationsetup.Wehavedata
f
(
x
1
;y
1
)
;:::;
(
x
n
;y
n
)
g
consistingofindependent
observations.Denotethedesignmatrix
X
=(
x
1
;:::;
x
n
)
T
2
R
n
p
andtheresponsevector
y
=(
y
1
;:::;y
n
)
T
.Aswementionedbefore,wehavemorevariablesthanobservations,i.e.,
p>n
.Accordingtothepreviousdiscussion,variableselectionisanimportantstepinhigh-
dimensionalmodeling.Ifoneincludesallvariablesinthemodel,therewillbeatleast
p
parameterstoestimate,whichcannotbedonestablywiththe
n
observations.Ifamore
complicatedmodelisneeded,thenumberofparameterswillbetremendous,whichwillcause
severeover-˝ttingandhighvariancewithasmalltrainingsamplesize.
Therefore,wehopeafeatureselectionstepat˝rstcanhelppicktheimportvariables,and
anotherestimationstepcouldbuildamoreaccuratemodelbasedontheselectedvariables.
Moreover,wewillusedeepneuralnetworksasthestructure,sinceitwillbeabletocapture
thecomplicatedrelationships.Wewillconsiderastage-wisealgorithminthevariableselec-
tionstep,whichperformsasimilarfunctionastheDNPmodelin[82].However,wewill
102
showthatthestage-wisealgorithminDNPsu˙ersfromsomedisadvantageandproposean
ensemblealgorithmtorelievethissituation.Inthesecondstep,wewilldiscussthemethods
toreducevarianceandpreventover-˝tting,sinceadeepneuralnetworkwithonlyafew
inputvariablescanstillhaveahugenumberofparameters.
4.2.1TheEnsembleNeuralNetworkSelection(ENNS)Algorithm
Considerthefeatureselectionapproachin[82].Let
D
:
R
p
!
R
beadeepneuralnetwork
functionthatmapstheoriginalfeaturespacetotheoutputspace.Wedon'tspeci˝callymark
thenumberofhiddenlayersandhiddennodesizesinthenotation,butsimplyassumethe
deepneuralnetworkhas
m
hiddenlayerswithsizes
h
1
;:::;h
m
.Denotetheweightmatricesin
eachlayertobe
W
0
;:::;
W
m
,where
W
0
2
R
p
h
1
,
W
i
2
R
h
i
h
i
+1
for
i
=1
;:::;m
1
and
W
m
2
R
h
m
1
.Denote
t
i
theinterceptforthe
i
th
hiddenlayerand
b
theinterceptofthe
outputlayer.Let
=(
W
0
;:::;
W
m
;
t
1
;:::;
t
m
;b
)
betheparametersintheneuralnetwork
model.Foraninput
x
2
R
p
,denotetheoutput
;
x
=
D
(
x
)
(4.1)
whereintheregressionsetup,theoutputisfromalinearlayerand
2
R
,whileinthe
classi˝cation,anextrasigmoidlayerisaddedand
2
(0
;
1)
.Moreover,weassumesparse
feature,i.e.,onlyasmallfractionofthevariablesaresigni˝cantlyrelatedtotheresponse.
Withoutlossofgenerality,weassume
S
0
=
f
1
;:::;s
g
103
ofthevariablesaretrulynonzerovariables.
De˝nethelossfunctionforregressiontobethesquarederrorloss
l
(
)=
E
h
(
y
)
2
i
(4.2)
Inpractice,weworkwiththeempiricalloss
l
(
;
X
;
y
)=
1
n
k
y
k
2
2
(4.3)
where
2
R
n
with
i
=
;
x
i
,
i
=1
;:::;n
.De˝nethelossfunctionforclassi˝cationtobe
thenegativelog-likelihood,whichisknownasthecross-entropyloss
l
(
)=
E
[
y
log
+(1
y
)log(1
)]
(4.4)
Inpractice,weworkwiththeempiricalloss
l
(
;
X
;
y
)=
1
n
n
X
i
=1
[
y
i
log
i
+(1
y
i
)log(1
i
)]
(4.5)
Let
G
i
bethegradientofthelossfunctionwithrespectto
W
i
inthebackpropagation
processfor
i
=0
;:::;m
,i.e.,
G
i
=
@
@
W
i
l
(
;
X
;
y
)
;i
=0
;:::;m
(4.6)
TheDNPalgorithmstartswiththenullmodelandaddsonevariableatatime.Let
S
be
104
theselectedsetand
C
bethecandidateset.Atthebeginning,wehave
S
=
f
intercept
g
and
C
=
f
1
;:::;p
g
(4.7)
Themodelistrainedon
S
onlyandthesubmatrixof
W
0
correspondingtothefeaturesin
C
iskeptzero.Afterthetrainingdone,onechoosesa
l
q
norm(usuallywith
q
=2
)and
computethegradient'normforeach
j
2C
of
W
0
.
G
0
j
=
@
@
W
0
j
l
(
;
X
;
y
)
;j
2C
(4.8)
Thenextvariablethatentersthemodel,
j
+
is
j
+
=argmax
j
2C
k
G
0
j
k
q
(4.9)
Then
S
=
S
=j
+
and
C
=
C[f
j
+
g
Toincreasethestability,insteadofcomputing
G
0
j
directly,theDNPalgorithmcomputes
G
0
j
throughtheaverageovermultipledropouts.Let
B
1
bethenumberofdropouts,thenextvariableis
j
+
=argmax
j
2C
1
B
1
B
1
X
b
=1
k
G
b
0
j
k
q
(4.10)
where
G
b
0
j
denotesthegradientofthelossfunctionwithrespecttothe˝rstlayer's
j
th
weightvectorafterthe
b
th
randomdropout.
Thealgorithmworksbecause
k
G
0
j
k
q
describeshowmuchthelossfunctionwillchange
whenthenextupdateonthecorrespondingvariable'sweightisperformed[102].[125]also
indicatesthatselectingvariablebycomparing
k
G
0
j
k
q
hasanequivalencetoapplyingthe
105
grouplassopenalization,seealso[85].
Thealgorithmworkswellattheverybeginning,whichisdescribedbyproposition4.1
andproposition4.2insection4.3.However,itsu˙ersfromafewdisadvantages.First,
asweincludemorecorrectvariablesinthemodel,theprobabilitythatweselectanother
correctvariabledecreases.Asimulationstudyinsection4.4providesnumericsupportfor
thisargument.Second,oneneedstopre-specifyhowmanyvariablesneedtobeselected
beforestopping,denoted
s
0
.Ifthisnumberischosentobemorethanthenumberoftrue
variables,denoted
s
,therewillbe
s
0
s
variablesthatshouldnotbeincludedbutwas
included,i.e.,thefalsepositiveratecouldbehigh.Finally,themodeldoesnotusedropout
orregularizationduringprediction,whichhaspotentialover-˝ttingrisk.Herewepropose
theensembleneuralnetworkselection(ENNS)algorithmtoremedytheseissues,andwewill
discusspossiblesolutionsinpreventingover-˝ttinginthepredictionstep.
Onecouldobservethatwhenafractionof
S
0
arealreadyinvolvedinthemodel,i.e.,
in
S
,themodelistrainedsuchthatthesevariablesareusedtoexplainthevariationsby
allvariablesin
S
0
.Thisweakensthee˙ectofthosetrulynonzerovariablesin
C
.These
variablesbecomelessimportantthanwhenthere'snovariablein
S
.Moreover,thereareless
trulynonzerozerovariablesin
C
thanattheverybeginning,theprobabilitythatweselect
acorrectvariableinthenextstepis
P
(
j
next
2S
0
)=
X
j
2S
0
\C
P
(
j
next
=
j
)
(4.11)
whichwillbeevenloweras
jS
0
\Cj
decreases.Therefore,therewillbeanonzeroprobability
thatatonestagetheselectedvariabledoesnotbelongto
S
0
.Weconsideranensemble
methodtoremedythisissue.
106
Theideabehindthisensemblemethodissimilartobagging[17,19].Assumethatwe
wanttoadd
s
j
variablesinonestep.Considerabootstrapsampleofsize
n
b
fromthe
originalsample.TheDNPwithrandominitializationistrainedonthissample,whichyields
aselectionset
S
1
=
f
j
1
;:::;j
s
j
g
.Insteadofjustdoingonepass,weproposethatfor
b
2
in
1
;:::;B
2
andabootstrapsamplesize
n
r
,weperformthefeatureselectiononarandom
selectionof
n
r
observations.Denotethefeaturesbeingselectedinall
B
2
roundsas
J
1
=
f
j
11
;:::;j
1
s
j
g
;:::;
J
B
2
=
f
j
B
2
1
;:::;j
B
2
s
j
g
Wewillonlyallowavariabletoenterthemodelifitappearsatleast
[
B
2
p
s
]
timesinthe
B
2
rounds,fora˝xedproportion
p
s
.Mathematically,
J
=
f
jinatleast
[
B
2
p
s
]
of
J
1
;:::;
J
B
2
g
isthesetofvariablesthatwillactuallyenterthemodelinthisstep.
Thereasonthatthisensemblewillimprovetheselectionliesonthreepoints.First,the
algorithmisanaveragingofdi˙erentbootstrappingresults,thusthee˙ectofsomeextreme
observationscouldbeaveragedout.The˝nalselectionresultrepresentsthecommonpart
ofthewholesample.Secondly,neuralnetworkusesrandominitialization.Intwodi˙erent
training,thoughthepredictionsseemsimilar,theestimatedparametersareactuallyfrom
di˙erentlocalminimumsofthelossfunction.Therefore,thesedi˙erenttrainingresults
representdi˙erentaspectsofthemodel.Combiningthetworeasons,ifweselectasmaller
n
b
comparedto
n
,theselectionresultsareclosertoindependent.However,
n
b
cannotbe
toosmalltoavoidmisleadingtheneuralnetwork.Finally,ifavariableisselectedbymistake
107
insomeround,thisispossiblyduetothespeci˝cbootstrapsamplemakingtherelationship
betweenthevariableandtheresponsestronger,whichisnotgeneralinallbootstrapsamples.
Inpractice,onewillobservethatthoughfalseselectionhappens,thosefalsevariablesare
di˙erentindi˙erentrounds.Therefore,thisensemblewillactuallymaketheprobabilityof
falseselectiontendtozero.Thisresultisdescribedbytheorem4.1insection4.3.Moreover,
iftwovariables'interactione˙ectisimportantinthemodel,theyarelikelytobeincluded
inthemodelinthesamestep.
It'spossiblethattheproposedmethodselectslessormorevariablesthanthenumberof
variableswespeci˝ed,
s
j
.Ifthesampleisnotenoughtorepresentthetruerelationshipbe-
tweenthevariablesandtheresponse,it'sverypossiblethatthenumberofselectedvariables,
denoted
^
s
j
,islessthan
s
j
.Inthiscase,weexcludethevariablesthatarealreadyin
C
from
theneuralnetworkandperformanotherroundofvariableselectionwith
S
=
f
1
;:::;p
g
=
C
andthen
C
=
f
intercept
g
.Thenumberofvariablestobeselectedinthisroundwillbe
s
j
^
s
j
.Ontheotherhand,
^
s
j
beingmorethan
s
j
happenswhentheselectionpropor-
tion
p
s
isspeci˝edtoosmall.Inthiscase,onewouldsortthevariablesbytheirappearing
proportionsandonlyselectthe˝rst
s
j
variablesinthelist.
Insummary,wespecifyanumber
s
0
attheverybeginning,whichmeanthe˝nalmodel
willinclude
s
0
variables.Inthe
j
th
iteration,let
s
j
bethenumberofvariablestobeselected.
Rightnowthereare
jS
j
j
variablesinthemodel,denoted
S
j
.Let
X
n
bethesub-matrixof
X
wherethecolumnswithindicesin
S
j
removed.Traintheensembleon
X
n
andobtain
selectionresult
^
S
j
.Let
s
j
+1
=
s
j
j
^
S
j
j
.Thealgorithmisrepeateduntilthemodelhas
selected
s
0
variables.AnalgorithmisgiveninAlgorithm2.Undermildassumptions,the
algorithmwill˝nallyreachselectionconsistency.Thisargumentisdescribedintheorem4.2
insection4.3.Moreover,acomparisonofthevariableselectionperformancesofdi˙erent
108
modelingispresentedinsection4.4.
Algorithm2:
AlgorithmforfeatureselectionENNS
Initializenumberofselectedvariables
S
=
;
,
s
=0
andtarget
s
0
;
while
j
S
j
a;X
2
>b
)
(4.18)
isthebivariateorthantprobabilitywithcorrelation
ˆ
and
)
isthestandardnormaldis-
tributionCDF.
Proposition4.2.
Underassumptions4.1,4.3and4.4,theprobabilitythatweselecta
nonzeropredictorattheverybeginningusingthestage-wiseneuralnetworkselectionis
P
(
Anonzeropredictorentersthemodel˝rst
)=
s
X
k
=1
Z
1
0
f
k
(
x
)
p
Y
j
6
=
k
F
j
(
x
)
dx
(4.19)
where
F
k
(
x
)=
1
2
erf
x
+
j
k
j
p
2
˙
2
+
erf
x
j
k
j
p
2
˙
2
and
f
k
(
x
)=
@
@x
F
k
(
x
)=
r
2
ˇ˙
2
e
x
2
+
2
k
2
˙
2
cosh
k
x
˙
2
(4.20)
and
erf
(
)
istheerrorfunction.Moreover,if
max
=max
j
=1
;:::;s
isbounded,as
s
!1
,
116
theprobabilityisboundedfromabove
P
1
where
isanontrivialquantity.
ProofsofthetwopropositionsaregivenintheAppendix.Proposition4.1and4.2describe
thebehaviorofneuralnetworkstage-wisealgorithmattheverybeginning.Theprobability
thatweselectonepredictoroveranotherdependsonthesumoftheirsignalstrengthand
thedi˙erenceoftheirsignalstrength.Thegreaterthedi˙erence,thehigherprobabilitythat
wewillselectthepredictorwithhighersignalstrength.Theprobabilitythatwewillselect
acorrectpredictorattheverybeginningisdescribedbytheerrorfunctionandstandard
normaldensityfunctions.Thoughtheformoftheprobabilitylookscomplicated,sinceerror
functioncanbeapproximatedbyanexponentialfunctionwithproperconstants,wewillbe
abletoshowthatinsomecases,itisnotguaranteedthatavariableenteringthemodel˝rst
isanonzerovariable.Speci˝cally,thishappenswhenwehavealowsignalstrengthora
hugenumberofcandidatevariables.
Sothereisaconcernthatawrongvariablewillmistakenlyenterthemodelduetoaspecial
trainingcaseoftheneuralnetworkmodel,asshowninthepreviousproposition.Withthe
baggingalgorithm,weareabletoeliminatethefalsepositiveselectionswithprobability
tendingto1.Theintuitionisthatfalsepositiveselectionofacertainpredictorhappens
duetoaspeci˝cobservationofthedesignmatrix,whichappearstobemorecorrelatedto
theresponseorresidual.However,withdi˙erentsub-samplings,it'sveryunlikelythatthey
yieldthesamewrongselection.Thispropertyiscapturedbythefollowingtheorem.
Theorem4.1.
Ifoneofthetwofollowingcaseshappen,thenineachselectionstepofthe
117
ENNSalgorithm,theprobabilityoffalsepositiveconvergestozero,i.e.
P
(
j
2
^
S
andj=
2S
)
!
0
asn
!1
andB
2
!1
(4.21)
(a)
Underassumptions4.1,4.3,4.4and4.5,alsoassumethat
s
0
sand
p
s
0
s
e
˝
n
n
!
0
asn
!1
where
˝
n
and
n
arede˝nedinassumption4.1.
(b)
Underassumptions4.24.3,4.4and4.5,alsoassumethatthenumberofvariablestobe
selectedsatis˝es
s
0
C
s
=
o
(
p
)
forsomeconstant
C
.
Aproofisgivenintheappendix.Invariableselectionalgorithms,themostimportant
propertyistobeabletoselectionthecorrectpredictorsconsistently.Hereweshowthat
ENNSenjoysthispropertyinthefollowingtheorem.
Theorem4.2.
Underassumptions4.2,4.3,4.4and4.5,let
K
n
betheupperboundofthe
normofthebestparametersintheneuralnetworkmodelwhen
S
isincluded,and
K
bethe
sizeofthe˝rsthiddenlayer,withtheensemble,if
n
satis˝es
K
log(
p
s
)
n
log
1
1
2
e
2
n
!
0
asn
!1
118
forsomeconstant
c
,and
K
2
n
r
log(
nK
n
)
n
!1
asn
!1
theprobabilitythatallnonzerovariablesareselectedconvergesto1,i.e.,
P
(
^
S
=
S
)
!
1
asn
!1
andB
2
!1
(4.22)
Aproofisgivenintheappendix.Intheorem4.2,weshowedthatwithstrongenough
signalstrengthinthetruenonzerovariables,thealgorithmwillbeabletoselectallnonzero
variableswithprobabilitytendingto
1
.Theconditionsarenotveri˝ableinpractice,how-
ever,extensiveexamplesinsection4.4showsthattheENNSalgorithmreachesselection
consistencyeasierthantheotheralgorithms.
Fortheestimationstep,therehasbeensomediscussionabouttheasymptoticproperties
suchas[53,146],wheretheresultsofusingsparsegrouplassopenaltyaregiven.The
l
1
normpenaltyisactuallyaspecialcaseofthesparsegrouplassowiththelassoparameter
being
1
andthegrouplassoparameterbeing
0
.Therefore,theresultsofthesepapershold
aslongaswehave
^
S
=
S
0
,whichhasprobabilitytendingto
1
bytheorem4.2.Herewewill
adaptthetheoryin[61]andwillprovidethefollowingresult.
Theorem 4.3. Assume Assumptions 4.2, 4.3, 4.4 and 4.5, and consider the variables selected by the ENNS algorithm and the estimation with the $l_1$ regularization method. Denote the $l_1$ regularization tuning parameter by $\lambda_n$ and the corresponding Lagrangian parameter by $K_n$. Denote the hidden layer size by $k_n$. In the regression setup, assume $E(Y^2) < \infty$; if $K_n \to \infty$, $k_n \to \infty$ and $k_n s^2 K_n^4 \log(k_n s K_n^2)/n \to 0$, we have
\[
\lim_{n \to \infty,\, B_2 \to \infty} P\left( E \int \big| f_n(x) - f(x) \big|^2 \,\mu(dx) \to 0 \right) = 1,
\]
where $f_n$ is the estimated neural network and $f$ is the true function. In the classification setup, assume that the probability of the response being 1 is bounded away from 0 and 1 by $\tilde{\epsilon}$, and denote by $Q$ the maximum number of equivalent neural network classes. Choosing the tuning parameter
\[
\lambda_n \geq c \sqrt{k_n \log n / n}\left( \sqrt{\log Q} + \sqrt{k_n \log s}\, \log(n k_n) \right),
\]
if $\log(n)/(n \tilde{\epsilon}^2) \to 0$, $s^2 k_n \lambda_n^2/(n \tilde{\epsilon}^2) \to 0$ and $n^{-1} k_n^{9/2} s^{5/2} \sqrt{\log(sn)} \to 0$ as $n \to \infty$, we have
\[
\lim_{n \to \infty,\, B_2 \to \infty} P\big( R(f_n) - R(f) \to 0 \big) = 1,
\]
where $R(f_n)$ is the risk of the neural network classifier and $R(f)$ is the risk of the Bayes classifier.
Theorem 4.3 states that, under the previously discussed conditions, the regression reaches the weak estimation consistency of the non-parametric function defined in [61]. For classification, the neural network classifier's risk tends to the optimal risk, the Bayes risk; see for example [37]. The theorem is a direct consequence of existing results for the low-dimensional neural network regression model and classifiers. Conditioning on the fact that we can select all correct variables with probability tending to 1 and applying the law of total probability, the consistency of the two-step approach can be derived from the low-dimensional consistency plus the probability of non-selection-consistency.

The consistency error comes from two sources: the variable selection error and the estimation error. The intuition is that with a wrong selection result the estimation error may be large, but this happens with probability tending to zero; with a correct selection result, the estimation error behaves the same as in the low-dimensional case, which converges to zero.
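This argument can be written as a single display; with $\mathcal{E}_n$ denoting the estimation error of interest (a hedged restatement of the reasoning above, not a line quoted from the proof),
\[
P(\mathcal{E}_n > \epsilon)
\;\le\; P\big(\mathcal{E}_n > \epsilon \mid \hat{S} = S\big) + P\big(\hat{S} \neq S\big),
\]
where the first term is the low-dimensional estimation error, which converges to zero, and the second term tends to zero by Theorem 4.2.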
4.4 Simulation Study

In this section, we use a few examples as numerical support for the arguments in the previous sections. The code for DNP is written according to the algorithm outline in [82], and the code for ENNS is written according to the algorithm in this paper. Both can be found at https://github.com/KaixuYang/ENNS.
4.4.1 Stage-wise Correct Selection Probability Decreasing Study

In this subsection, we use an example to demonstrate that the chance of selecting a correct variable in a stage-wise neural network decreases as we have more correct variables in the model. Consider a design matrix $X$ drawn from a uniform distribution between -1 and 1. The sample size is set to $n = 1000$ and the number of predictors is set to $p = 10000$; $s = 5$ of the predictors are related to the response. We consider three different true structures for the relationship between the predictors and the response: linear, additive non-linear and neural network. For the response, we consider two different cases: regression and classification. In the linear case, the coefficients are drawn from a standard normal distribution. In the additive non-linear case, the function is set to
\[
\eta = \sin(x_1) + x_2 + \exp(x_3) + x_4^2 + \log(x_5 + 2)^2, \tag{4.23}
\]
where $y = \eta + \epsilon$ in the regression case and $\mathrm{prob} = \sigma(\eta)$ in the classification case. In the neural network case, we set the hidden layers to $[50, 30, 15, 10]$ with weights from a standard normal distribution.
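A minimal sketch of this data-generating scheme for the linear and additive non-linear (equation (4.23)) cases is given below; the unit noise level and the variable names are illustrative assumptions rather than the exact simulation code.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, s = 1000, 10000, 5
    X = rng.uniform(-1, 1, size=(n, p))

    # linear signal: nonzero coefficients only on the first s predictors
    beta = np.zeros(p)
    beta[:s] = rng.standard_normal(s)
    eta_lin = X @ beta

    # additive non-linear signal, equation (4.23)
    eta_add = (np.sin(X[:, 0]) + X[:, 1] + np.exp(X[:, 2])
               + X[:, 3] ** 2 + np.log(X[:, 4] + 2) ** 2)

    # responses: Gaussian noise for regression, sigmoid link for classification
    y_reg = eta_add + rng.standard_normal(n)
    y_cls = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta_add)))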
For each of the cases, we test starting from 0 to 4 correct predictors. In order to eliminate the effect of different signal strengths from different predictors, we randomly sample $i$ indices from $1, \ldots, 5$ as the selected variables, for $i = 0, \ldots, 4$, and include these $i$ predictors as the initially selected variables. We run 1000 repetitions and report the proportion of times that the next variable entering the model is a correct predictor. The results are summarized in Table 4.1.
y     structure   0 variables     1 variable      2 variables     3 variables     4 variables
Reg   Linear      0.995 (0.002)   0.952 (0.006)   0.863 (0.010)   0.774 (0.013)   0.430 (0.016)
Reg   Additive    0.993 (0.003)   0.847 (0.011)   0.905 (0.009)   0.794 (0.012)   0.531 (0.016)
Reg   NN          0.998 (0.001)   0.971 (0.005)   0.932 (0.007)   0.788 (0.013)   0.574 (0.016)
Cls   Linear      0.989 (0.003)   0.918 (0.009)   0.873 (0.009)   0.813 (0.011)   0.552 (0.016)
Cls   Additive    0.992 (0.003)   0.957 (0.006)   0.911 (0.009)   0.706 (0.014)   0.633 (0.015)
Cls   NN          0.994 (0.002)   0.968 (0.006)   0.947 (0.004)   0.895 (0.009)   0.762 (0.013)

Table 4.1: The proportion of correct variable selection after 0-4 correct variables are in the model, for different cases over 1000 repetitions. The results show the mean. Three different data generation structures are shown: linear, additive non-linear and neural network, for both regression and classification.
In the table, we see that the probability of selecting a correct predictor decreases as we have more correct predictors in the model, in all cases. The only exception is in the regression setup with the additive non-linear structure, from 1 variable to 2 variables, which may be due to random error.
4.4.2 False Positive Rate Study

In this subsection, we use an example to demonstrate that the false positive rate of ENNS (the probability of selecting a wrong variable) is lower than that of the pure stage-wise algorithm.
Note that if one sets the number of variables to be selected to $s$, the stage-wise algorithm always selects exactly $s$ variables, while ENNS will stop if there isn't any new variable that satisfies the condition to be added. Therefore, it is possible that ENNS selects fewer variables and avoids wrong selections. We used the same setup as [82] to generate responses. Two different types of responses, regression and classification, are considered here. The input variable $X$ was drawn from $U(-1, 1)$, where the feature dimension $p$ was fixed to be $10{,}000$. The corresponding labels were obtained by passing $X$ into a feedforward neural network with hidden layer sizes $\{50, 30, 15, 10\}$ and ReLU activation functions. Input weights connecting the first $s$ inputs were randomly sampled from $N(1, 1)$ for regression and $N(0, 1)$ for classification. The remaining input weights were kept zero. For each $s = 2, 5, 10$, we generated 1000 training samples. In Table 4.2, we report the false positive rates of the ENNS algorithm and the neural network stage-wise algorithm.
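The generating mechanism just described can be sketched as follows, assuming the response is taken as the network output plus Gaussian noise for regression and as a Bernoulli draw from a sigmoid of the output for classification (details that [82] leaves to the implementation):

    import numpy as np

    def nn_response(X, s, task="regression", rng=None):
        """Pass X through a random ReLU network with layers [50, 30, 15, 10];
        only the first s inputs carry nonzero input weights."""
        rng = np.random.default_rng(rng)
        n, p = X.shape
        W_in = np.zeros((p, 50))
        loc = 1.0 if task == "regression" else 0.0
        W_in[:s, :] = rng.normal(loc, 1.0, size=(s, 50))
        h = np.maximum(X @ W_in, 0.0)
        for width in [30, 15, 10]:
            h = np.maximum(h @ rng.standard_normal((h.shape[1], width)), 0.0)
        eta = h @ rng.standard_normal(h.shape[1])
        if task == "regression":
            return eta + rng.standard_normal(n)
        return rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))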
Response         Method   s=2             s=5             s=10
Regression       ENNS     10.4% (21.5%)   11.5% (22.1%)   12.8% (23.6%)
Regression       DNP      22.5% (29.5%)   30.2% (28.7%)   41.4% (33.2%)
Classification   ENNS     4.7% (17.9%)    7.4% (18.6%)    9.8% (20.3%)
Classification   DNP      16.5% (24.4%)   24.8% (29.7%)   40.5% (32.8%)

Table 4.2: Average selection false positive rate of ENNS and DNP under different numbers of true variables in 101 repetitions. Standard deviations are given in parentheses.
It can be tested that the ENNS's false positive rate is significantly lower than the false positive rate of DNP at significance level $\alpha = 0.05$. This provides strong evidence that the ENNS is useful in reducing the probability of selecting an incorrect variable.
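Such a comparison can be carried out with a paired t-test on the per-repetition false positive rates; a sketch, with random placeholders standing in for the recorded rates:

    import numpy as np
    from scipy.stats import ttest_rel

    def fpr_significantly_lower(fpr_a, fpr_b, alpha=0.05):
        """One-sided paired t-test that method A's false positive rate is lower."""
        t_stat, p_two_sided = ttest_rel(fpr_a, fpr_b)
        return (t_stat < 0) and (p_two_sided / 2 < alpha)

    rng = np.random.default_rng(1)
    print(fpr_significantly_lower(rng.normal(0.10, 0.2, 101),   # placeholder ENNS rates
                                  rng.normal(0.30, 0.3, 101)))  # placeholder DNP rates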
4.4.3 Variable Selection Simulation Study

In this subsection, we study the variable selection capability of the ensemble neural network selection (ENNS) algorithm in the complicated setup. We used a similar setup as in the last subsection to generate responses. Two different types of responses, regression and classification, are considered here. The input variable $X$ was drawn from $U(-1, 1)$, where the feature dimension $p$ was fixed to be $10{,}000$. The corresponding labels were obtained by passing $X$ into a feedforward neural network with hidden layer sizes $\{50, 30, 15, 10\}$ and ReLU activation functions. Input weights connecting the first $s$ inputs were randomly sampled from $N(0, 2)$ for regression and $N(0, 1)$ for classification. The remaining input weights were kept zero. The DNP model was coded according to their algorithm outline in Python with PyTorch. The ENNS algorithm is based on the DNP algorithm with an ensemble wrapper. The LASSO [126] is implemented with the scikit-learn library, and the HSIC lasso [144] is implemented using the HSICLasso library. In all four algorithms, the number of selected variables is strictly restricted to be the same as the true number of variables. In the ENNS, we run a bagging of 10 rounds with selection proportion 0.3. We report the average number of correct variables selected over 101 repetitions of the data generation in Table 4.3.
Response         Method       s=2           s=5           s=10
Regression       ENNS         1.73 (0.52)   4.21 (0.56)   9.25 (1.11)
Regression       DNP          1.61 (0.50)   3.92 (0.56)   8.77 (1.13)
Regression       LASSO        1.65 (0.57)   3.87 (0.62)   9.62 (1.38)
Regression       HSIC-LASSO   1.67 (0.47)   3.80 (0.66)   3.61 (1.17)
Classification   ENNS         1.81 (0.49)   4.24 (0.87)   8.04 (1.25)
Classification   DNP          1.67 (0.76)   3.76 (1.06)   5.95 (1.29)
Classification   LASSO        1.67 (0.56)   3.76 (0.75)   5.76 (1.38)
Classification   HSIC-LASSO   1.67 (0.47)   2.80 (0.91)   3.61 (1.17)

Table 4.3: Variable selection capacity of ENNS and other methods with low signal strength in the regression (top) and classification (bottom) setups. The numbers reported are the average number of selected variables which are truly nonzero. The standard errors are given in parentheses.
We observe that the ENNS outperforms the other variable selection algorithms in all three cases, and the difference is significant when $s = 10$ under a t-test. The ENNS performs better when there are more nonzero variables. None of the algorithms were able to reconstruct the original feature indices, for a few reasons: the sample size is relatively small compared to the number of variables; the data generation through neural network structures is complicated; and the signal strength is relatively low.

To fully study the variable selection power of the ENNS algorithm, we implemented another simulation case in classification with a higher signal strength while keeping all other conditions the same. In this simulation study, we increase the mean of the weights of the nonzero variables to 3.5. With the same implementations, we summarize the results in Table 4.4. Moreover, Table 4.5 summarizes the results for signal strength 10.
Method       s=2           s=5           s=10
ENNS         2.00 (0.00)   4.71 (0.55)   8.38 (2.06)
DNP          1.86 (0.35)   4.38 (0.84)   7.43 (2.36)
LASSO        1.81 (0.39)   4.19 (1.01)   7.47 (2.40)
HSIC-LASSO   1.71 (0.45)   3.71 (1.12)   4.95 (2.13)

Table 4.4: Variable selection capacity of ENNS and other methods with normal signal strength. The numbers reported are the average number of selected variables which are truly nonzero. The standard errors are given in parentheses.
The ENNS reaches selection consistency when $s = 2$, while the other compared algorithms still do not have selection consistency. However, all algorithms show obvious improvements in all cases. We have to admit that selecting the correct subset of variables in all 101 repetitions is extremely challenging, since the data have great variability across repetitions. Moreover, when $s$ gets larger, the importance of any individual variable is less likely to be observed from the data.
Moreover, the bagging algorithm can be parallelized, since different runs are independent of each other. Therefore, the computational efficiency of this variable selection algorithm is almost the same as the computational efficiency of a single run.
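A sketch of this parallelization with joblib, where one_round is assumed to be a callable that runs a single bagging round and returns the selected indices:

    import numpy as np
    from joblib import Parallel, delayed

    def parallel_votes(X, y, one_round, B2=10, n_jobs=-1, seed=0):
        """Run the independent bagging rounds concurrently and sum their votes."""
        seeds = np.random.SeedSequence(seed).spawn(B2)
        selections = Parallel(n_jobs=n_jobs)(
            delayed(one_round)(X, y, np.random.default_rng(s)) for s in seeds
        )
        votes = np.zeros(X.shape[1])
        for sel in selections:
            votes[np.asarray(sel, dtype=int)] += 1
        return votes

Because the rounds never communicate, the wall-clock time is roughly that of a single round once enough workers are available.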
Method       s=2           s=5           s=10
ENNS         2.00 (0.00)   5.00 (0.00)   9.90 (0.29)
DNP          2.00 (0.00)   5.00 (0.00)   9.47 (1.10)
LASSO        2.00 (0.00)   4.90 (0.29)   9.23 (1.74)
HSIC-LASSO   2.00 (0.00)   4.62 (0.84)   7.76 (2.76)

Table 4.5: Variable selection capacity of ENNS and other methods with high signal strength. The numbers reported are the average number of selected variables which are truly nonzero. The standard errors are given in parentheses.
4.4.4 Estimation Simulation Study

In this subsection, we compare the estimation methods in Section 4.2. To fully study the differences between these methods without the effects of other factors, in this subsection we assume correct selection and perform the estimation on the correct subset of variables. The data are generated according to the same scheme as in the last subsection. We compare the performance of the different estimation methods for $s = 2, 5, 10$, assuming that we know the correct subset of variables. The simulation is run on 101 repetitions of data generation using different seeds. In the results, we report the RMSE, the MAE and the MAPE for regression, and the accuracy, the AUC score and the F1 score for classification. These are in Table 4.6.
On average, we see that $l_1$ norm regularization gives the best performance, except for the MAPE at $s = 10$ in regression. Moreover, we observe that both the built-in $l_1$ and soft-thresholding approaches give smaller standard errors, which coincides with the shrinkage estimator's properties. However, soft-thresholding provides better average performance than the built-in penalty. The reason is that exact sparsity is not well supported by most libraries, so a manual operation is needed to obtain sparsity.
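The manual operation is the usual soft-thresholding of the trained weights; a minimal sketch, where the threshold value is a tuning choice rather than one prescribed here:

    import numpy as np

    def soft_threshold(W, lam):
        """Shrink entries toward zero and set the small ones exactly to zero,
        giving the sparsity that built-in l1 training usually leaves only
        approximately (weights near, but not equal to, zero)."""
        return np.sign(W) * np.maximum(np.abs(W) - lam, 0.0)

    W = np.random.default_rng(0).standard_normal((5, 10))
    W_sparse = soft_threshold(W, lam=0.5)
    print((W_sparse == 0).mean())   # fraction of exactly-zero weights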
4.4.5 Variable Selection and Estimation

In this subsection, we study the prediction capability of the two-stage approach, the ENNS algorithm with the $l_1$ neural network, and compare it with the DNP model, the logistic regression and the HSIC-lasso with SVM. We use the same neural network structure to generate data as earlier in this section. Over 101 repetitions, we report the average RMSE (root mean squared error), MAE (mean absolute error) and MAPE (mean absolute percentage error) for regression, and the average accuracy, AUC and F1 score for classification, as well as their standard errors. The results are summarized in Table 4.7.

We observe that our proposed algorithm obtains a slight performance boost via the ensemble method. Moreover, the standard errors of these results are slightly greater than the standard errors in the last subsection, where the estimation was done assuming correct selection. The increase in standard errors is mainly due to the selection variation.
4.4.6 Correlated Predictors

In this subsection, we use an example to study the numerical performance of the proposed algorithm in the correlated predictor situation. We consider two different correlations: $\rho = 0.3$ and $\rho = 0.7$. As a comparison, the results for $\rho = 0$ are also included. Let $u_1, \ldots, u_n$ be i.i.d. standard normal random variables and let $x_{ij}$ be i.i.d. standard normal random variables, independent of $u_1, \ldots, u_n$, for $i = 1, \ldots, n$ and $j = 1, \ldots, p$. Applying the transformation $x_{ij} \leftarrow (x_{ij} + t u_i)/\sqrt{1 + t^2}$ for some $t$, we obtain standard normal correlated variables with
\[
\mathrm{cor}(x_{ij}, x_{ik}) = \frac{t^2}{1 + t^2}, \qquad i = 1, \ldots, n, \quad j, k = 1, \ldots, p.
\]
Taking $t = \sqrt{3/7}$ gives correlation 0.3 and taking $t = \sqrt{7/3}$ gives correlation 0.7. We then truncate the random variables to the interval $[-1, 1]$. The structure used to generate the response is kept the same as in the last subsection. The results of variable selection and estimation are given in Table 4.8.
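A sketch of this construction; the truncation to $[-1, 1]$ is done by clipping here, which is one possible reading of the text:

    import numpy as np

    def correlated_design(n, p, rho, rng=None):
        """Equicorrelated standard normals via x_ij = (z_ij + t*u_i)/sqrt(1+t^2),
        so that cor = t^2/(1+t^2) = rho, then truncated to [-1, 1]."""
        rng = np.random.default_rng(rng)
        t = np.sqrt(rho / (1.0 - rho)) if rho > 0 else 0.0
        u = rng.standard_normal(n)
        z = rng.standard_normal((n, p))
        x = (z + t * u[:, None]) / np.sqrt(1.0 + t ** 2)
        return np.clip(x, -1.0, 1.0)

    X3 = correlated_design(1000, 100, rho=0.3)   # t = sqrt(3/7)
    X7 = correlated_design(1000, 100, rho=0.7)   # t = sqrt(7/3)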
y     Metric     Method           s=2             s=5             s=10
Reg   RMSE       Neural Network   31.24 (13.32)   69.46 (37.40)   136.64 (60.54)
Reg   RMSE       Xavier           18.64 (11.07)   58.89 (27.73)   136.58 (65.57)
Reg   RMSE       l1 built-in      20.47 (9.62)    59.37 (23.61)   129.55 (50.48)
Reg   RMSE       l1 soft          5.97 (4.18)     45.83 (33.06)   109.31 (47.24)
Reg   RMSE       Stage-wise       10.59 (11.20)   47.64 (22.69)   117.65 (43.96)
Reg   RMSE       Bagging          25.48 (10.89)   59.49 (26.53)   133.45 (59.72)
Reg   MAE        Neural Network   16.45 (10.91)   52.85 (28.47)   103.76 (45.99)
Reg   MAE        Xavier           13.65 (8.06)    45.34 (22.18)   105.17 (53.66)
Reg   MAE        l1 built-in      15.56 (7.76)    45.32 (18.34)   98.54 (38.36)
Reg   MAE        l1 soft          4.37 (3.02)     35.49 (26.21)   83.85 (36.45)
Reg   MAE        Stage-wise       7.91 (8.23)     38.86 (20.00)   89.82 (33.82)
Reg   MAE        Bagging          14.77 (7.92)    43.16 (20.51)   99.38 (41.66)
Reg   MAPE       Neural Network   0.012 (0.015)   0.030 (0.026)   0.033 (0.026)
Reg   MAPE       Xavier           0.009 (0.009)   0.027 (0.022)   0.029 (0.017)
Reg   MAPE       l1 built-in      0.011 (0.012)   0.029 (0.023)   0.032 (0.021)
Reg   MAPE       l1 soft          0.005 (0.007)   0.017 (0.010)   0.029 (0.023)
Reg   MAPE       Stage-wise       0.007 (0.007)   0.019 (0.015)   0.027 (0.016)
Reg   MAPE       Bagging          0.010 (0.010)   0.026 (0.024)   0.030 (0.022)
Cls   Accuracy   Neural Network   0.944 (0.026)   0.886 (0.037)   0.841 (0.041)
Cls   Accuracy   Xavier           0.952 (0.026)   0.891 (0.034)   0.831 (0.036)
Cls   Accuracy   l1 built-in      0.927 (0.031)   0.844 (0.085)   0.752 (0.110)
Cls   Accuracy   l1 soft          0.964 (0.028)   0.908 (0.029)   0.855 (0.031)
Cls   Accuracy   Stage-wise       0.945 (0.030)   0.886 (0.038)   0.804 (0.042)
Cls   Accuracy   Bagging          0.877 (0.069)   0.806 (0.068)   0.753 (0.087)
Cls   AUC        Neural Network   0.942 (0.027)   0.882 (0.038)   0.837 (0.042)
Cls   AUC        Xavier           0.951 (0.027)   0.891 (0.034)   0.825 (0.037)
Cls   AUC        l1 built-in      0.924 (0.031)   0.833 (0.100)   0.734 (0.123)
Cls   AUC        l1 soft          0.964 (0.029)   0.905 (0.029)   0.851 (0.032)
Cls   AUC        Stage-wise       0.943 (0.031)   0.884 (0.038)   0.800 (0.041)
Cls   AUC        Bagging          0.877 (0.065)   0.803 (0.063)   0.751 (0.084)
Cls   F1 score   Neural Network   0.943 (0.027)   0.887 (0.045)   0.841 (0.049)
Cls   F1 score   Xavier           0.952 (0.026)   0.892 (0.041)   0.832 (0.048)
Cls   F1 score   l1 built-in      0.927 (0.031)   0.824 (0.192)   0.732 (0.200)
Cls   F1 score   l1 soft          0.965 (0.026)   0.908 (0.036)   0.857 (0.033)
Cls   F1 score   Stage-wise       0.944 (0.031)   0.883 (0.042)   0.806 (0.049)
Cls   F1 score   Bagging          0.870 (0.077)   0.792 (0.060)   0.748 (0.088)

Table 4.6: Prediction results on the testing set using neural networks with and without $l_1$ norm regularization for $s = 2, 5, 10$. RMSE is root mean squared error, MAE is mean absolute error, and MAPE is mean absolute percentage error. Accuracy is the percentage of correct predictions, AUC is the area under the ROC curve, and the F1 score is the harmonic mean of precision and recall.
y     Metric     Method       s=2             s=5              s=10
Reg   RMSE       ENNS+l1      15.67 (30.35)   48.14 (21.16)    174.08 (65.38)
Reg   RMSE       DNP          25.42 (33.16)   62.63 (29.02)    178.91 (60.15)
Reg   RMSE       Lasso        79.44 (67.31)   104.19 (49.38)   192.04 (77.34)
Reg   RMSE       HSIC-Lasso   56.32 (59.41)   86.77 (47.51)    188.35 (56.48)
Reg   MAE        ENNS+l1      12.03 (23.68)   40.12 (19.95)    132.07 (44.99)
Reg   MAE        DNP          20.15 (27.15)   47.85 (22.31)    136.06 (45.95)
Reg   MAE        Lasso        64.11 (54.63)   81.97 (39.76)    147.86 (60.21)
Reg   MAE        HSIC-Lasso   42.89 (34.66)   70.04 (41.23)    144.37 (48.15)
Reg   MAPE       ENNS+l1      0.012 (0.025)   0.028 (0.036)    0.041 (0.037)
Reg   MAPE       DNP          0.017 (0.028)   0.032 (0.032)    0.042 (0.041)
Reg   MAPE       Lasso        0.042 (0.029)   0.046 (0.035)    0.046 (0.025)
Reg   MAPE       HSIC-Lasso   0.033 (0.021)   0.036 (0.025)    0.048 (0.024)
Cls   Accuracy   ENNS+l1      0.967 (0.029)   0.848 (0.025)    0.756 (0.067)
Cls   Accuracy   DNP          0.933 (0.076)   0.822 (0.068)    0.736 (0.064)
Cls   Accuracy   Lasso        0.732 (0.103)   0.726 (0.071)    0.692 (0.075)
Cls   Accuracy   HSIC-Lasso   0.805 (0.094)   0.798 (0.094)    0.706 (0.081)
Cls   AUC        ENNS+l1      0.959 (0.036)   0.834 (0.024)    0.709 (0.058)
Cls   AUC        DNP          0.898 (0.148)   0.780 (0.100)    0.699 (0.052)
Cls   AUC        Lasso        0.652 (0.121)   0.640 (0.102)    0.625 (0.068)
Cls   AUC        HSIC-Lasso   0.774 (0.125)   0.748 (0.121)    0.677 (0.061)
Cls   F1 score   ENNS+l1      0.962 (0.037)   0.859 (0.036)    0.708 (0.089)
Cls   F1 score   DNP          0.903 (0.208)   0.761 (0.199)    0.705 (0.100)
Cls   F1 score   Lasso        0.590 (0.299)   0.604 (0.250)    0.634 (0.192)
Cls   F1 score   HSIC-Lasso   0.744 (0.206)   0.731 (0.242)    0.666 (0.208)

Table 4.7: Model performance of the combination of the ENNS algorithm and $l_1$ thresholding estimation, compared with DNP, Lasso and HSIC-Lasso for the $s = 2, 5, 10$ cases in both regression and classification. The average performance over 101 repetitions with standard errors in parentheses is presented.
                       Selection                                    Estimation
Res   Model       rho=0.0       rho=0.3       rho=0.7        rho=0.0          rho=0.3           rho=0.7
Reg   ENNS+l1     3.81 (0.79)   3.27 (0.75)   2.29 (0.70)    40.82 (19.46)    37.17 (27.29)     43.18 (44.47)
Reg   DNP         3.48 (0.96)   2.95 (0.79)   2.14 (0.56)    81.43 (46.00)    92.91 (65.25)     101.15 (90.63)
Reg   LASSO       3.38 (0.90)   2.85 (0.79)   2.11 (1.12)    131.37 (74.22)   151.16 (108.88)   113.30 (97.54)
Cls   ENNS+l1     3.66 (1.05)   3.25 (0.76)   2.38 (0.72)    0.856 (0.040)    0.875 (0.061)     0.907 (0.030)
Cls   DNP         3.62 (1.09)   3.43 (0.91)   2.71 (1.03)    0.774 (0.100)    0.766 (0.106)     0.793 (0.092)
Cls   LASSO       3.55 (0.79)   2.90 (1.31)   1.95 (0.72)    0.598 (0.083)    0.634 (0.117)     0.683 (0.116)

Table 4.8: Selection and estimation comparison for predictors with correlation 0, 0.3 and 0.7. The number of nonzero predictors is set to 5. For selection, the average number of correctly selected variables with its standard error is given. For estimation, the average RMSE (regression) or AUC (classification) with their standard errors is given. The results are averaged over 101 repetitions.
From the table we see that in the correlated cases the model works almost as well as when there is no correlation. All models select fewer variables when the correlation is higher, a well-known symptom of variable selection with correlated predictors. However, this does not affect the estimation step, and in some cases it even makes the estimation results better. The reason could be that with fewer variables the model is simpler, and since the predictors are correlated, we do not lose too much information by not selecting some of them. Moreover, some results not shown in the table include the false positive rate, where the average for ENNS is $0.05 \pm 0.03$, while that of the DNP is $0.29 \pm 0.14$. Therefore, ENNS includes fewer redundant variables in the estimation step and achieves better performance.
4.5 Real Data Examples

In this section, we evaluate the performance of the two-step model on some real world data sets.
4.5.1 Variable Selection: MRI Data

In this example, we evaluate the variable selection capability against other variable selection models, and compare the results with the biological ground truth. The data used in this example come from the Alzheimer's Disease Neuroimaging Initiative (ADNI), see http://adni.loni.usc.edu/. The data include $n = 265$ patients' neuroimaging results, comprising 164 Alzheimer's disease (AD) patients and 101 cognitively normal (CN) individuals. 145 regions of interest (ROIs) spanning the entire brain were calculated using Multi-Atlas ROI segmentation, and 114 ROIs were derived by combining single ROIs within a tree hierarchy to obtain volumetric measurements from larger structures. Therefore, $p = 259$ ROIs were used in this example. Details of the pre-processing method can be found at https://adni.bitbucket.io/reference/docs/UPENN_ROI_MARS/Multi-atlas_ROI_ADNI_Methods_mod_April2016.pdf. Among those ROIs, biologically important features are picked, see Table 4.9, where red indicates most important, yellow means secondly important, and green means thirdly important. The combinations and all other ROIs are not listed.
The full data set is used for variable selection, and the selection result is chosen such that a 3-fold cross-validation performs best. We run the ENNS algorithm along with the LASSO and the DNP. The selection results are presented in Table 4.9. Note that if a model selects a simple combination of some features, those features are also marked as selected. Moreover, Table 4.10 shows the number of combined features selected by the models and the number of false positive selections. We observe that LASSO misses a lot of important features and selects only about 1/4 as many combined features as the neural networks. This indicates that the features may have a complicated relationship with the response. ENNS performs better than the shallow DNP in terms of the metrics in Table 4.10, where IS is a weighted average score with the weights for red, yellow and green being 3, 2 and 1, respectively; NI is the number of selected important variables; and NU is the number of selected unimportant variables. As a property of the ENNS, it selects fewer false positive variables. It is hard to track the combined features, since a lot are involved; however, the combinations represent biological intuitions. Neural networks select more combined features and perform better in this sense.
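The reported metrics can be computed directly from the selection indicators; in the sketch below the ROI names and importance labels are placeholders, and averaging the weights over the selected important ROIs is our reading of the IS definition:

    def selection_scores(selected, importance, n_important=86, n_unimportant=59):
        """IS: weighted average over selected important ROIs (weights 3/2/1 for
        red/yellow/green); NI: important ROIs selected; NU: unimportant ROIs selected."""
        weights = {"red": 3, "yellow": 2, "green": 1}
        hits = [roi for roi in selected if roi in importance]
        ni, nu = len(hits), len(selected) - len(hits)
        is_score = (sum(weights[importance[roi]] for roi in hits) / ni) if ni else 0.0
        return round(is_score, 3), f"{ni}/{n_important}", f"{nu}/{n_unimportant}"

    # illustrative call with made-up labels
    print(selection_scores({"R31", "R50", "R201"},
                           {"R31": "red", "R50": "yellow", "R201": "green"}))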
Table 4.9 lists the biologically important ROIs (R31-R32, R36-R41, R47-R52, R55-R60, R81-R82, R85-R88, R100-R103, R106-R107, R116-R125, R132-R133, R136-R143, R146-R147, R152-R155, R162-R171, R178-R179, R186-R187, R190-R191, R194-R195 and R198-R207) and marks, for each of Lasso, DNP and ENNS, whether the ROI was selected.

Table 4.9: Variable selection result for the AD data. The table includes all biologically important variables with three levels: red (very important), yellow (secondly important) and green (thirdly important). The non-important variables are not included in the table. Check marks indicate whether the corresponding algorithm selected the variable or not.
Variable selection method   IS      NI      NU
LASSO                       1.094   32/86   25/59
DNP                         1.428   40/86   15/59
ENNS                        1.624   49/86   6/59

Table 4.10: Variable selection result for the AD data. The reported numbers include IS, the weighted average of selected important variables with the weights being 3, 2 and 1 for red (most important), yellow (secondly important) and green (thirdly important), respectively; NI, the number of important variables selected; and NU, the number of unimportant variables selected.
4.5.2 Regression: Riboflavin Production Data

In this example, we consider the riboflavin production with Bacillus subtilis data, which is publicly available in the `hdi' package in R. The data set contains a continuous response, the logarithm of the riboflavin production rate, and $p = 4088$ predictors which are the logarithms of the expression levels of 4088 genes. There are $n = 71$ observations available. All predictors are continuous with positive values.
We perform 50 repetitions of the following actions. The data is split into training (56) and testing (15) observations. The training data is further split into training (49) and validation (7) sets. The training data is normalized to mean zero and standard deviation one. We train the ENNS algorithm to select variables and fit the $l_1$ neural network to make predictions. Along with our proposed algorithms, we compare the performance with the lasso penalized linear regression, which is implemented with the scikit-learn library in Python; the group lasso penalized generalized additive model in [145], where the code can be found at https://github.com/KaixuYang/PenalizedGAM; and the sparse group lasso penalized neural network in [53]. Figure 4.1 shows the average testing mean squared error (MSE) along with the number of selected features for the different models. Our proposed algorithm converges fast and performs competitively. Table 4.11 shows the average prediction performance with standard error in parentheses and the median number of variables selected.
Model                                          Test MSE        Number of features
ENNS + l1 neural network                       0.268 (0.115)   42
Regularized neural network                     0.273 (0.135)   44
Linear model with LASSO                        0.286 (0.124)   38
Generalized additive model with group lasso    0.282 (0.136)   46

Table 4.11: Test MSE with standard errors in parentheses and median number of features for different models in the riboflavin gene data example.
Our proposed method has competitive mean performance but a lower standard error.

The final model for this small sample utilizes only 2 hidden layers, with over 90% sparsity to prevent over-fitting, which is necessary for this small training sample size of 49. Training with a large batch size, a small learning rate, a huge number of epochs and early stopping helps the model learn better and prevents over-fitting. We admit that tuning the network structure and learning parameters is hard, but we obtain better and more stable results once we have the right numbers.
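A minimal sketch of such a training loop in PyTorch, where full-batch updates stand in for the large batch size; the widths, learning rate, epoch budget and patience are illustrative rather than the tuned values, and the inputs are assumed to be float tensors:

    import copy
    import torch
    import torch.nn as nn

    def train_small_net(X_tr, y_tr, X_val, y_val, widths=(16, 8), lr=1e-3,
                        max_epochs=5000, patience=200):
        """Two hidden layers, full-batch updates, small learning rate, many
        epochs and early stopping on validation MSE."""
        d = X_tr.shape[1]
        model = nn.Sequential(nn.Linear(d, widths[0]), nn.ReLU(),
                              nn.Linear(widths[0], widths[1]), nn.ReLU(),
                              nn.Linear(widths[1], 1))
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        best_val, best_state, wait = float("inf"), None, 0
        for _ in range(max_epochs):
            opt.zero_grad()
            loss_fn(model(X_tr).squeeze(-1), y_tr).backward()
            opt.step()
            with torch.no_grad():
                val = loss_fn(model(X_val).squeeze(-1), y_val).item()
            if val < best_val:
                best_val, wait = val, 0
                best_state = copy.deepcopy(model.state_dict())
            else:
                wait += 1
                if wait >= patience:
                    break
        if best_state is not None:
            model.load_state_dict(best_state)
        return model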
4.5.3 Classification: Prostate Cancer Data

In this example, we consider a prostate cancer gene expression data set, which is publicly available at http://featureselection.asu.edu/datasets.php. The data set contains a binary response with 102 observations on 5966 predictor variables; clearly, it is truly a high-dimensional data set. The responses take values 1 (50 sample points) and 2 (52 sample points), where 1 indicates normal and 2 indicates tumor. All predictors are continuous with positive values.

We perform 50 repetitions of the following actions. The data is split into training (81) and testing (21) observations. The training data is further split into training (70) and validation (11) data. In each split, the numbers of observations from the two classes are kept roughly the same.
Classifier                                     Test accuracy   Number of features
ENNS + l1 neural network                       0.956 (0.053)   15
Regularized neural network                     0.955 (0.066)   18
Logistic regression with Lasso                 0.933 (0.058)   36
l1 penalized linear SVM                        0.950 (0.052)   16
Generalized additive model with group lasso    0.918 (0.061)   5

Table 4.12: Test accuracy with standard errors in parentheses and median number of features for different classifiers in the prostate gene data example.
We train the ENNS algorithm to select variables and fit the $l_1$ neural network to make predictions. Along with our proposed algorithms, we compare the performance with the $l_1$ norm penalized logistic regression and the $l_1$ support vector machine (SVM), both of which are implemented with the scikit-learn library in Python; the group lasso penalized generalized additive model in [145], where the code can be found at https://github.com/KaixuYang/PenalizedGAM; and the sparse group lasso penalized neural network in [53]. Figure 4.2 shows the average testing accuracy over the repetitions along with the number of selected features for the different models. Our proposed algorithm converges fast and performs competitively. Table 4.12 shows the average prediction accuracy with standard error in parentheses, and the median number of variables selected. Our proposed method has competitive mean performance but a lower standard error. One needs to notice that the mean performance is hard to improve further, since the results are already good and reach the bottleneck of the current explanatory variables. The reason that the GAM performs worse than the other models is that the range of the predictor variables is relatively small and skewed, so the basis expansion in the GAM does not work well.
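The two penalized linear baselines are available directly in scikit-learn; a minimal sketch, with the regularization strengths left as placeholders to be tuned on the validation split:

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    # l1-penalized logistic regression
    logit_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

    # l1-penalized linear SVM (dual must be False for the l1 penalty)
    svm_l1 = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=1.0)

    # logit_l1.fit(X_train, y_train); svm_l1.fit(X_train, y_train)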
Figure 4.1: Testing mean squared error (MSE) for different models on the riboflavin data.

Figure 4.2: Testing accuracy for different models on the prostate cancer data.
4.6 Conclusion

In this paper, we discussed the existing methods for dealing with high-dimensional data and how to apply the stage-wise algorithm with neural networks. We discussed the shortcomings of the current stage-wise neural network variable selection algorithms and proposed the ENNS to overcome them. Moreover, we also compared different methods to further reduce the over-fitting issue after the variable selection step. Methodology was given to support the arguments and the new algorithm, and simulation studies were given as extra evidence that the algorithm works.

Though there are a few simulation and methodology studies of neural network variable selection, the theory for neural networks still deserves much more investigation. We hope this paper can work as a pioneer and attract more people's attention to the neural network theory field.
Chapter 5

Epilogue

The main goal of this dissertation is to establish theory and propose new methodologies for non-parametric statistical machine learning modeling for high-dimensional low sample size (HDLSS) problems. We have studied three different non-parametric approaches: the generalized additive model (GAM) in Chapter 2, the sparse group lasso penalized shallow neural network in Chapter 3 and the ensemble neural network selection (ENNS) algorithm in Chapter 4. These three models are appropriate in different situations. The GAM is useful when there is not too much interaction among the variables, or only weak interaction, which covers many applications; actually, in most fields, the additive structure is enough to obtain a fairly good result. The shallow neural network is used when there are interactions among the variables but the relationship between the response and the variables is not too complicated. If we have huge interactions and very complicated relationships, a deep neural network gives a better approximation to this relationship. On the other hand, the level of difficulty in training the three models also increases gradually; in particular, it is easy to get stuck in a local minimum when training a neural network. From this aspect, the GAM is a convex optimization problem and is guaranteed to reach a global minimum, thus it has smaller training difficulty.

The theory we proved also makes a great contribution towards this field. These theoretical results guarantee that the methods will work under certain conditions, which is better support than numerical results only. Moreover, we have applied these methods to examples from different fields, including genetic data, computer vision data, autonomous driving data and MRI data. These non-parametric models are shown to work in different aspects.

In the future, neural networks still leave huge room for investigation. How do we find a more stable way to train a neural network? How do we prevent a neural network from over-fitting a small sample? How do we eliminate the gap between the global minimum used in neural network theory and the local minimum obtained in practice? These questions are left for future research.
APPENDICES

APPENDIX A

Technical Details and Supplementary Materials for Chapter 2
Derivation of Assumption 2.2

Though Assumption 2.2 is imposed on the fixed design matrix, it holds if the design matrix $X$ is drawn from a continuous density and the density $g_j$ of $X_j$ is bounded away from 0 and infinity by $b$ and $B$, respectively, on the interval $[a, b]$. Let $\beta_A$ be the sub-vector of $\beta$ which includes all nonzero entries. Without loss of generality, let $A = \{1, \ldots, k\}$, where $\beta_j \in \mathbb{R}^{m_n}$ for each $j \in A$ and $k = O(s_n)$. Let $\Phi_A$ be the corresponding sub-matrix of $\Phi$.
By Lemma 3 in [120], if the design matrix $X$ is drawn from a continuous density and the density $g_j$ of $X_j$ is bounded away from 0 and infinity by $b$ and $B$, respectively, on the interval $[a, b]$, and $\mathrm{card}\,B(\beta) = O(s_n)$, we have
\[
\|\Phi_1\beta_1 + \cdots + \Phi_k\beta_k\|_2 \;\geq\; \kappa_2^{1/2}\left( \|\Phi_1\beta_1\|_2 + \cdots + \|\Phi_k\beta_k\|_2 \right)
\]
for some positive constant $\kappa_2$ such that $0 < 1 - 2\kappa_2^2 < 1$, where $\kappa_0 = (1 - bB^{-1})/2$.
Together with the triangle inequality, we have
\[
\kappa_2^{1/2}\left( \|\Phi_1\beta_1\|_2 + \cdots + \|\Phi_k\beta_k\|_2 \right)
\;\leq\; \|\Phi_A\beta_A\|_2
\;\leq\; \|\Phi_1\beta_1\|_2 + \cdots + \|\Phi_k\beta_k\|_2 .
\]
By simple algebra, we have
\[
\kappa_2^2\left( \|\Phi_1\beta_1\|_2^2 + \cdots + \|\Phi_k\beta_k\|_2^2 \right)
\;\leq\; \|\Phi_A\beta_A\|_2^2
\;\leq\; 2\left( \|\Phi_1\beta_1\|_2^2 + \cdots + \|\Phi_k\beta_k\|_2^2 \right).
\]
For any $j = 1, \ldots, k$, by Lemma 6.2 in [153], we have
\[
c_1 m_n^{-1}
\;\leq\; \lambda_{\min}\!\left( n^{-1}\Phi_j^T\Phi_j \right)
\;\leq\; \lambda_{\max}\!\left( n^{-1}\Phi_j^T\Phi_j \right)
\;\leq\; c_2 m_n^{-1}
\]
for some constants $c_1$ and $c_2$.
Then we have
\[
\frac{\beta^T\Phi^T\Phi\beta}{\|\beta\|_2^2}
= \frac{\|\Phi_A\beta_A\|_2^2}{\|\beta_A\|_2^2}
\;\geq\; \frac{\kappa_2^2\left( \|\Phi_1\beta_1\|_2^2 + \cdots + \|\Phi_k\beta_k\|_2^2 \right)}{\|\beta_A\|_2^2}
= \kappa_2^2\left( \frac{\|\Phi_1\beta_1\|_2^2}{\|\beta_1\|_2^2}\,\frac{\|\beta_1\|_2^2}{\|\beta_A\|_2^2} + \cdots + \frac{\|\Phi_k\beta_k\|_2^2}{\|\beta_k\|_2^2}\,\frac{\|\beta_k\|_2^2}{\|\beta_A\|_2^2} \right)
\;\geq\; \kappa_2^2\, c_1 n m_n^{-1}\left( \frac{\|\beta_1\|_2^2}{\|\beta_A\|_2^2} + \cdots + \frac{\|\beta_k\|_2^2}{\|\beta_A\|_2^2} \right)
= \kappa_2^2\, c_1 n m_n^{-1}.
\]
Let $\rho_0 = \kappa_2^2 c_1$ and observe that $k = O(s_n)$; we have
\[
\frac{\beta^T\Phi^T\Phi\beta}{n\|\beta\|_2^2} \;\geq\; \frac{\rho_0}{2 s_n^2}\, m_n^{-1}.
\]
Similarly, we have
\[
\frac{\beta^T\Phi^T\Phi\beta}{\|\beta\|_2^2}
= \frac{\|\Phi_A\beta_A\|_2^2}{\|\beta_A\|_2^2}
\;\leq\; \frac{2\left( \|\Phi_1\beta_1\|_2^2 + \cdots + \|\Phi_k\beta_k\|_2^2 \right)}{\|\beta_A\|_2^2}
= 2\left( \frac{\|\Phi_1\beta_1\|_2^2}{\|\beta_1\|_2^2}\,\frac{\|\beta_1\|_2^2}{\|\beta_A\|_2^2} + \cdots + \frac{\|\Phi_k\beta_k\|_2^2}{\|\beta_k\|_2^2}\,\frac{\|\beta_k\|_2^2}{\|\beta_A\|_2^2} \right)
\;\leq\; 2 c_2 n m_n^{-1}\left( \frac{\|\beta_1\|_2^2}{\|\beta_A\|_2^2} + \cdots + \frac{\|\beta_k\|_2^2}{\|\beta_A\|_2^2} \right)
= 2 c_2 n m_n^{-1}.
\]
Let $\rho_1 = 2c_2$; we have
\[
\frac{\beta^T\Phi^T\Phi\beta}{n\|\beta\|_2^2} \;\leq\; \rho_1 m_n^{-1}.
\]
Proofs of Lemma and Theorems

The following lemmas are needed in proving the theorems.

Lemma A.1. For any sequence $r_n > 0$, under Assumptions 2.1 and 2.3, we have for a bounded response such that $|y_i| \le$ ... in the classification setup. In the second example, we consider the high-dimensional setup
\[
\begin{cases}
1, & \text{if } \hat{p}_i > p_c \\
0, & \text{if } \hat{p}_i
\end{cases}
\]