STATISTICAL MACHINE LEARNING THEORY AND METHODS FOR HIGH-DIMENSIONAL LOW SAMPLE SIZE (HDLSS) PROBLEMS

By

Kaixu Yang

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Doctor of Philosophy

2020

ABSTRACT

STATISTICAL MACHINE LEARNING THEORY AND METHODS FOR HIGH-DIMENSIONAL LOW SAMPLE SIZE (HDLSS) PROBLEMS

By

Kaixu Yang

High-dimensional low sample size (HDLSS) data analysis has become increasingly popular in statistical machine learning. Such applications involve a huge number of features or variables, while the sample size is limited for reasons such as cost or ethical constraints. It is therefore important to find approaches that learn the underlying relationships from a small amount of data. In this dissertation, we study the statistical properties of several non-parametric machine learning models designed for these problems and apply them to various fields for validation.

In Chapter 2, we study the generalized additive model in the high-dimensional setup with a general link function that belongs to the exponential family. We apply a two-step approach for variable selection and estimation: a group lasso step serves as an initial estimator, followed by an adaptive group lasso step that produces the final set of variables and estimates. We show that, under certain conditions, the two-step approach consistently selects the truly nonzero variables, and we derive the estimation rate of convergence. Moreover, we show that the tuning parameter that minimizes the generalized information criterion (GIC) has asymptotically minimum risk. Simulation studies of variable selection and estimation are given, and real data examples, including spam email data and prostate cancer genetic data, are used to support the theory. We also discuss the possibility of using an l0 norm penalty.

In Chapter 3, we study a shallow neural network model in the high-dimensional classification setup. The sparse group lasso, also known as the l_{p,1} + l_1 norm penalty, is applied to obtain feature sparsity and a sparse network structure. Neural networks can approximate any continuous function with arbitrarily small approximation error provided the number of hidden nodes is large enough, which is known as the universal approximation theorem; they are therefore used to model complicated relationships between the response and the predictors with strong interactions. We prove that, under certain conditions, the classification risk of the sparse group lasso penalized shallow neural network tends to the Bayes risk, which is optimal among all possible models. Real data examples, including prostate cancer genetic data, Alzheimer's disease (AD) magnetic resonance imaging (MRI) data and autonomous driving data, are used to support the theory. Moreover, we propose an l0 + l1 penalty and show that the solution can be formulated as a mixed integer second order cone optimization (MISOCO) problem.

In Chapter 4, we propose a stage-wise variable selection technique with deep neural networks in the high-dimensional setup, named ensemble neural network selection (ENNS). We apply an ensemble on the stage-wise neural network variable selection method to further remove falsely selected variables; the procedure is shown to consistently filter out unwanted variables and select the truly nonzero variables under certain conditions. Moreover, we propose a second step that further simplifies the neural network structure by specifying the desired percentage of nonzero parameters in each hidden layer, and a coordinate descent type algorithm is proposed to compute its solution. We also show that the two-step approach achieves universal consistency for both regression and classification problems. Simulations are presented to support the various arguments, and real data examples, including the riboflavin production data, the prostate cancer genetic data and a region of interest (ROI) MRI data set, are used to validate the method.
Idedicatethisdissertationtomyparents,XiangyangXuandZhidongYang,mywife, JunLiu,andmanyfriends,whoallhavebeensupportingmeduringthishardtime. iv ACKNOWLEDGMENTS Here,IwouldliketoexpressmydeepestgratitudetomyadvisorDr.TapabrataMaitiforhis guidancetowardsmystudiesandresearches.Dr.Maitiisextremelykindandknowledgeable. Healwaysprovideconstructiveinsightsandsuggestionsthathelpmemakegreatprogress. WithoutDr.Maiti'sguidance,Iwouldn'tbeabletohavesuchaprofoundunderstandingin thisgorgeous˝eld. Iwouldalsoliketoextendmysincereappreciationtomydissertationcommitteemem- bers,Dr.LyudmilaSakhanenko,Dr.Ping-shouZhongandDr.DavidZhu.Theircomments andsuggestionsareextremelybene˝cialformyresearch. IamalsogratefultothehelpIobtainedfromalltheprofessorsintheDepartmentof StatisticsandProbability,bothacademicallyandinotheraspects.DuringmyPh.D.life, Ilearnalotfromthecourseso˙eredbytheseexcellentprofessors,andI'mabletoapply theknowledgeIlearntomyresearch.TheprofessorswhoIhaveworkedwithasateaching assistantalsoimpressmealotbythewaytheydealwithdi˙erentkindsofdi˚culties. DuringmysevenyearsatMichiganStateUniversity,Imadelotsoffriendsandbecause ofthem,Ineverfeellonelyintheseyears.AlotofthankstomyfriendsatMSU.Theyeither havegraduatedandbecomesuccessfulfacultymembersorstatisticiansinbigcompanies, orarestillworkinghardtopursuetheirPh.D.degree.Isincerelywishallofyouhavea wonderfulfuture. Lastbutnotleast,Iwouldliketoexpressmysincerethankstomyparentsfortheir supportandconcerns,aswellasmywifeforheraccompanyandworkinghardtogether towardsagloriousfuture. v TABLEOFCONTENTS LISTOFTABLES .................................... viii LISTOFFIGURES .................................... xi LISTOFALGORITHMS ................................ xii Chapter1Introduction ............................... 1 1.1Overview......................................1 1.2SparsityinHigh-dimensionalModeling.....................3 1.2.1ChoosingtheTuningParameters....................6 1.2.2AlgorithmsforTrainingSparseModels.................7 1.2.3Stage-wiseSelection...........................8 1.3TheProjectionApproach.............................9 1.4Non-parametricModeling............................12 1.4.1BasisExpansion..............................12 1.4.2NeuralNetworks.............................13 1.4.3DeepNeuralNetworks..........................15 Chapter2High-dimensionalGeneralizedAdditiveModeling ........ 18 2.1Introduction....................................18 2.2Model.......................................22 2.3Methodology&TheoreticalProperties.....................29 2.3.1FirstStep:ModelScreening.......................29 2.3.2SecondStep:PostSelection.......................36 2.4TuningParameterSelection...........................43 2.5OtherPossiblePenalty..............................46 2.5.1TheL0NormPenalty..........................46 2.5.2TheL0andL1NormPenalty......................48 2.6NumericalProperties...............................49 2.6.1SimulatedExamples...........................50 2.6.1.1LogisticRegression.......................51 2.6.1.2Otherlinkfunctions......................57 2.6.2RealDataexamples............................58 2.7Discussion.....................................65 Chapter3SparseNeuralnetwork ......................... 
67 3.1Introduction....................................67 3.2TheBinaryClassi˝cationProblem.......................70 3.3TheConsistencyofNeuralNetworkClassi˝cationRisk............77 3.4Simulation.....................................82 3.4.1DNPSimulation:Revisit.........................82 vi 3.4.2SmallerSampleSizeCase........................83 3.5RealDataexamples................................85 3.5.1example1:ProstateCancerData....................85 3.5.2example2:MRIDataforAlzheimer'sDisease.............86 3.5.3example:KITTIAutonomousDrivingData..............89 3.6Discussion.....................................92 3.6.1The l 1 + l 0 PenaltyandAlgorithm...................92 Chapter4EnsembleNeuralNetworkSelection(ENNS) ........... 98 4.1Introduction....................................99 4.2TheTwo-stepVariableSelectionandEstimationApproach..........102 4.2.1TheEnsembleNeuralNetworkSelection(ENNS)Algorithm.....103 4.2.2EstimationWithRegularization.....................110 4.2.2.1DroppingOutandBagging..................110 4.2.2.2Stage-wiseTraining.......................111 4.2.2.3L1NormRegularization....................112 4.3TheoreticalGuarantee..............................113 4.4SimulationStudy.................................121 4.4.1Stage-wiseCorrectSelectionProbabilityDecreasingStudy......121 4.4.2FalsePositiveRateStudy........................122 4.4.3VariableSelectionSimulationStudy...................123 4.4.4EstimationSimulationStudy......................126 4.4.5VariableSelectionandEstimation....................127 4.4.6CorrelatedPredictors...........................127 4.5RealDataexamples................................132 4.5.1VariableSelection:MRIData......................132 4.5.2Regression:Ribo˛avinProductionData................135 4.5.3Classi˝cation:ProstateCancerData..................136 4.6Conclusion.....................................140 Chapter5Epilogue .................................. 141 APPENDICES ...................................... 143 APPENDIXATechnicalDetailsandSupplementaryMaterialsforChapter2...144 APPENDIXBTechnicalDetailsandSupplementaryMaterialsforChapter3...198 APPENDIXCTechnicalDetailsandSupplementaryMaterialsforChapter4...206 BIBLIOGRAPHY .................................... 
233 vii LISTOFTABLES Table2.1:Simulationresultsforthetwo-stepapproachcomparedwiththeLasso, GAMSELandGAMBoostinthethreecasesofexample2.1.NV,average numberofthevariablesbeingselected;TPR,thetruepositiverate;FPR, thefalsepositiverate;andPE,predictionerror(hereisthemisclassi˝cation rate).Resultsareaveragedover100repetitions.Enclosedinparentheses arethecorrespondingstandarderrors.....................54 Table2.2:Simulationresultsforthetwo-stepapproachcomparedwiththeLasso, GAMSELandGAMBoostinexample2.2withcorrelation0.3and0.7for n =100 , p =200 and s =3 .NV,averagenumberofthevariablesbeing selected;TPR,thetruepositiverate;FPR,thefalsepositiverate;and PE,predictionerror(hereisthemisclassi˝cationrate).Resultsareaver- agedover100repetitions.Enclosedinparenthesesarethecorresponding standarderrors.................................56 Table2.3:Simulationresultsforthetwo-stepapproachcomparedwiththeLasso, GAMSELandGAMBoostinexample2.3,with n =100 , p =200 , s =3 andsignalstrengthreduced.NV,averagenumberofthevariablesbeing selected;TPR,thetruepositiverate;FPR,thefalsepositiverate;and PE,predictionerror(hereisthemisclassi˝cationrate).Resultsareaver- agedover100repetitions.Enclosedinparenthesesarethecorresponding standarderrors.................................57 Table2.4:Simulationresultsforthetwo-stepapproachcomparedwiththeLasso, GAMSELandGAMBoostinexample2.4forPoissonregressionandGamma regressionwith n =100 , p =200 and s =3 .NV,averagenumberof thevariablesbeingselected;TPR,thetruepositiverate;FPR,thefalse positiverate;andPE,predictionerror(hereisthemisclassi˝cationrate). Resultsareaveragedover100repetitions.Enclosedinparenthesesarethe correspondingstandarderrors.TheGAMBoostmethoddoesnotsupport Gammaregressionwithnon-canonicallinkfunction,whilethecanonical linkfallsoutsideofrange,thereforeitdoesnotsupportGammaregression.59 Table3.1:TheAUCandF1scoreofthecomparedmodelsinthesimulationstudy. 
Standarderrorsaregivenintheparentheses.................84 Table3.2:TheAUCandF1scoreofthecomparedmodelsinasmallersamplesize scenariowith m =5 .Standarderrorsaregivenintheparentheses.....84 Table3.3:Testaccuracywithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentclassi˝ersintheProstategenedataexample.....87 viii Table3.4:Testaccuracywithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentclassi˝ersintheMRIAlzheimer'sdiseaseexample..89 Table3.5:Testaccuracywithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentclassi˝ersintheKITTIautonomousdrivingexample.91 Table3.6:TheMISOCOformulationforthe l 1 + l 0 penaltyindealingwiththe l 0 normstep....................................96 Table4.1:Theproportionofcorrectvariableselectionafter0-4correctvariablesin themodel,fordi˙erentcasesover1000repetitions.Theresultsshowthe mean.Theresultsshowthreedi˙erentdatagenerationstructures:linear, additivenon-linearandneuralnetworkforbothregressionandclassi˝cation.122 Table4.2:SelectionfalsepositiverateaverageoftheENNSandDNPunderdi˙erent numberoftruevariablesin101repetitions.Standarddeviationsaregiven inparenthesis..................................123 Table4.3:VariableselectioncapacityofENNSandothermethodswithlowsignal strengthintheregression(top)andclassi˝cation(bottom)setup.The numbersreportedaretheaveragenumberofselectedvariableswhichare trulynonzero.Thestandarderrorsaregiveninparenthesis.........124 Table4.4:VariableselectioncapacityofENNSandothermethodswithnormalsignal strength.Thenumbersreportedaretheaveragenumberofselectedvari- ableswhicharetrulynonzero.Thestandarderrorsaregiveninparenthesis.125 Table4.5:VariableselectioncapacityofENNSandothermethodswithhighsignal strength.Thenumbersreportedaretheaveragenumberofselectedvari- ableswhicharetrulynonzero.Thestandarderrorsaregiveninparenthesis.126 Table4.6:Predictionresultsonthetestingsetusingneuralnetworkswithandwithout l 1 normregularizationfor s =2 ; 5 ; 10 .RMSEisrootedmeansquarederror, MAEismeanabsoluteerror,andMAPEismeanabsolutepercenterror. 
Accuracyisthepercentageofcorrectprediction,aucisareaundertheROC curve,andf1scoreistheinverseofinverseprecisionplustheinverserecall.129 Table4.7:ModelperformanceofthecombinationofENNSalgorithmand l 1 threshold- ingestimation,comparedwithDNP,LassoandHSIC-Lassofor s =2 ; 5 ; 10 casesinbothregressionandclassi˝cation.Theaverageperformanceof101 repetitionswiththeirstandarderrorsinparenthesisarepresented.....130 ix Table4.8:Selectionandestimationcomparisonforpredictorswithcorrelation0,0.3 and0.7.Thenumberofnonzeropredictorsissetto5.Forselection, theaveragenumberofcorrectselectedvariableswithitsstandarderror isgiven.ForestimationtheaverageRMSEorAUCwiththeirstandard errorsisgiven.Theresultsareaveragedover101repetitions........131 Table4.9:VariableselectionresultfortheADdata.Thetableincludesallbiolog- icallyimportantvariableswiththreelevels:red(veryimportant),yellow (secondlyimportant)andgreen(thirdlyimportant).Thenon-important variablesarenotincludedinthemodel.Checkmarksindicatewhetherthe correspondingalgorithmselectedthevariableornot.............134 Table4.10:VariableselectionresultfortheADdata.Thereportednumbersinclude IS,theweightedaverageofselectedimportantvariableswiththeweights being3,2and1forred(mostimportant),yellow(secondlyimportant)and green(thirdlyimportant),respectively;NI,numberofimportantvariables selected;andNU,numberofunimportantvariablesselected.........135 Table4.11:TestMSEwithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentmodelsintheribo˛avingenedataexample......136 Table4.12:Testaccuracywithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentclassi˝ersintheProstategenedataexample.....137 x LISTOFFIGURES Figure1.1:Adiagramforthesingle-layerneuralnetworkmodel...........14 Figure2.1:Theclassi˝cationaccuracyagainstthenumberofnonzerovariablesmea- suredonatestingsetforexample2.5over50repetitions.Thetwo-step approach,thelogisticregressionwithLasso,the l 1 normpenalizedSVM andthesparsegrouplassoneuralnetworkareincludedincomparison..61 Figure2.2:Theestimatedfunctionsforthemostfrequentlyselectedfunctionsfor example2.5...................................62 Figure2.3:Theclassi˝cationaccuracyagainstthenumberofnonzerovariablesmea- suredonatestingsetforexample2.6over500repetitions.Thetwo-step approach,thelogisticregressionwithLasso,the l 1 normpenalizedSVM andthesparsegrouplassoneuralnetworkareincludedincomparison..63 Figure2.4:Theestimatedfunctionsforthemostfrequentlyselectedfunctionsordered bydescendinginfrequencyforexample2.6.................64 Figure2.5:ThetestingMSEagainstthenumberofnonzerovariablesmeasuredon atestingsetforexample2.7over50repetitions.Thetwo-stepapproach andlogarithmtransformationwiththeLassoareincludedincomparison.65 Figure3.1:exampleimagefromtheKITTI2Dobjectdetectiondataset.......89 Figure3.2:exampleimagesofpedestriansandcarsafterpre-processing.......90 Figure3.3:Testaccuracyscorevssparsitylevelinthethreeexamples.........91 Figure4.1:Testingmeansquarederror(MSE)fordi˙erentmodelsontheribo˛avin data.......................................138 Figure4.2:Testingaccuracyfordi˙erentmodelsontheprostatecancerdata.....139 xi LISTOFALGORITHMS Algorithm1:Trainingclassi˝cationneuralnetworkwith l 1 + l 0 penalty........97 Algorithm2:FeatureselectioninENNS.........................109 Algorithm3: l 1 normestimationusingcoordinatedescent...............113 xii Chapter1 Introduction InthisChapter,wewillstatetheproblemsthatwestudyanddiscusssomeexistingwork thathavesigni˝cantimpacttothis˝eld. 
1.1Overview Duringthepastdecades,high-dimensionaldataanalysisbecameincreasingmorepopular. [11]de˝nesthatgh-dimensionalstatisticsreferstostatisticalinferencewhenthenumber ofunknownparametersisofmuchlargerorderthansamplesize."Thiswillbethede˝nition ofhigh-dimensionalstatisticsthroughthewholethesis.Sometimes,by peoplemayrefertoafeaturespacewhosedimensionisgreaterthan2whenconstructing datavisualization,seeforexample[3],thisisnotthewerefertointhis thesis.Let n bethesizeofsamplewehave,andlet p bethenumberofunknownparameters involved,thuswehave p>>n (1.1) inhigh-dimensionalstatistics.Consideralinearregressionmodel y i = + p X j =1 j x ij + i ;i =1 ;:::;n; (1.2) 1 with p>>n .It'sobviousthattheestimationoftheunknownparameters 1 ;:::; p can notbeestimatedwithoutproperassumptions,suchasthesparsityassumption,whichwe willdiscussinthenextsection.Forexample,theleastsquaredestimator ^ ^ 1 ;:::; ^ p =argmin 1 p 1 n n X i =1 ( y i j x ij ) 2 isunder-determinedthushasin˝nitelymanysolution. Variousmethodshavebeenproposedtosolvethisissuesince[39]broughttheproblem topeople'ssight,wheretheauthorsconsideredanorthogonaldesigninthecase n = p .In therestofthisChapter,wewillreviewtheworkthathasbeendoneinhigh-dimensional statistics˝eld. Ontheotherhand,non-parametricmodelinghasbeenpopularforseveraldecades.Non- parametricresearchhappenedasearlyasin1947,when[133]studiedalocalaveraging approach.Someliteratureregardingnon-parametricresearcheventracebacktothe1930's aboutpartitioningestimation,butthiswasnotfullystudiedbythen.As[61]states,in non-parametricmodeling,onedoesnotrestricttheclassofpossiblerelationshipbetweenthe response y andthepredictors x 1 ;:::;x p ,butassumethat y dependson x 1 ;:::;x p througha generalfunction f ( x 1 ;:::;x p ) .Di˙erentapproacheshavebeenusedtoapproximate f ( ) in aworkableway,includingaveraging,partitioning,basisexpansion,neuralnetworkapproxi- mation,andetc.Furtherassumptionscanalsobemadeon f ( ) ,forexample,assumingan additivestructurewillresultinthegeneralizedadditivemodel(GAM),seeforexample[63], de˝nedas y i = + p X j =1 f j ( x ij )+ i ;i =1 ;:::;n: (1.3) Therelationshipisnotasgeneralasthe f ( x 1 ;:::;x p ) wejustde˝ned,butismuchmore 2 handy.Wewilldiscussthesenon-parametricworkintherestofthisChapter. 1.2SparsityinHigh-dimensionalModeling Assumingsparsityisane˙ectivewayofdealingwiththehigh-dimensionality,i.e.,weassume thatonly s ofthe p variablesareinvolvedinpredictingtheresponse y .Richliteraturehave assumedsparsity,forexamplesee[126,46,155,148,85,47,65,91].Takingthelinear regressionmodelasanexample,let 2 R p betheparameters.Adirectmethodtoobtain sparsityistorestrictthenumberofnonzeroparameters,i.e., ^ =argmin 2 R p 1 n k y k 2 2 ;s:t: k k 0 K; (1.4) forsomepositiveinteger K ,where y 2 R n istheresponsevectorand x 2 R n p isthedesign matrix.The l 0 normisde˝nedas k k 0 = f # of j : j 6 =0 ;j =1 ;:::;p g : (1.5) ThisoptimizationproblemcanbewrittenintotheLagrangianformas ^ =argmin 2 R p 1 n k y k 2 2 + k k 0 (1.6) forsomecorresponding of K .However,optimizingalossfunctionwithzeronormhas beenprovedtobeanon-deterministicpolynomial-timehardness(NP-hard)problem,see [99],whichrequiresexponentialtimetosolve. Fortunately,ithasbeenprovedthatinsteadofdirectlypenalizingthenumberofnonzero 3 coe˚cientsby l 0 norm, l 1 normpenaltyisabletoshrinksomecoe˚cientstozero,andthus thefeaturescorrespondingtothesecoe˚cientsarenotincludedinthemodel.Asafamous pieceofwork,[126]proposedthemethodleastabsoluteshrinkageandselectionoperator (Lasso)inthelinearregressionsetup,whichyieldssparseestimationsfortheparameters. 
The model considers an $l_1$ norm penalized least squares problem, i.e.,
\[
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n}\|y - x\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \tag{1.7}
\]
for some hyper-parameter $\lambda$. The Lasso was not originally intended for high-dimensional modeling, but it yields sparse estimators: with a proper choice of $\lambda$, the estimated parameters are shrunk towards zero and some of them are exactly zero. A number of statisticians have studied the properties of the Lasso; for example, [73] derived the asymptotic distributions of the Lasso estimator and showed its estimation consistency. Later it was found that this $l_1$ norm penalty works well in the high-dimensional setup, both practically and theoretically. [60] derived the $l_1$ norm bound for a persistent sequence of estimators in the Lasso model. [38] considered the high-dimensional case and minimized the $l_1$ norm of the parameters subject to zero training error, showing that this method consistently identifies the true subset of nonzero variables when the true model is sparse. [95] proposed a neighborhood selection scheme that consistently identifies the nonzero entries in the covariance matrix of a multivariate normal distribution. [152] showed the selection consistency of the Lasso under the irrepresentable condition. [149] studied two different sparsity assumptions in the linear model, the strong sparsity
\[
\max_{j=s+1,\ldots,p} |\beta_j| = 0 \tag{1.8}
\]
and the weak sparsity
\[
\sum_{j=s+1}^{p} |\beta_j| \le \eta_1 \tag{1.9}
\]
for some $\eta_1$. The authors showed that the Lasso consistently selects the true subset of variables under the weak sparsity assumption and the irrepresentable condition.

Later on, the Lasso matured and became widely applied to obtain sparse solutions in the high-dimensional setup, and variations of the Lasso emerged rapidly. [46] noted that the tuning parameter $\lambda$ minimizing the prediction error may not be best for selection, so the Lasso cannot reach selection consistency and prediction consistency at the same time; in other words, the Lasso does not have the so-called oracle property. The authors proposed the smoothly clipped absolute deviation (SCAD) penalty and showed its oracle property. [41] argued that the Lasso does not work well with correlated predictors, and [156] proposed the elastic net penalty to overcome this issue. In the following year, [155] proposed a data-driven approach named the adaptive lasso, where a weighted $l_1$ norm of the parameters replaces the $l_1$ norm, i.e.,
\[
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n}\|y - x\beta\|_2^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j|, \tag{1.10}
\]
for some weights $w_j$, $j = 1, \ldots, p$. A simple convention is to choose $w_j = 1/|\hat{\beta}_j|$ for some initial estimator $\hat{\beta}_j$, and to set $w_j = \infty$ if the initial estimator is zero. The author showed that the adaptive Lasso enjoys the oracle property. [20] proposed the Dantzig selector and showed its oracle property.

[148] discussed various grouped variable selection techniques, including the well-known group lasso penalty, see also [135, 65]. Suppose $\beta = (\beta_{1,1}, \ldots, \beta_{1,g_1}, \ldots, \beta_{p,1}, \ldots, \beta_{p,g_p})$, i.e., $\beta$ consists of $p$ groups with the $j$th group containing $g_j$ parameters, $j = 1, \ldots, p$. This situation arises in many real world applications. For example, if a categorical variable has 3 levels, then after one-hot encoding it does not make sense to include only one of the three levels in the model. The group lasso minimizes the loss function plus a grouped penalty,
\[
\hat{\beta} = \arg\min_{\beta} \frac{1}{n}\|y - x\beta\|_2^2 + \lambda \sum_{j=1}^{p} \sqrt{g_j}\, \|\beta_j\|_2, \tag{1.11}
\]
where $\beta_j$ denotes the parameter vector of the $j$th group. The sum of $l_2$ norms encourages group-wise sparsity but not within-group sparsity. Adaptivity in the group lasso is also discussed by the authors. More generally, [5] named this type of penalty the $l_{p,1}$ norm penalty, where the $l_2$ norm in equation 1.11 is replaced by the $l_p$ norm with $1 \le p < \infty$.
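To make (1.7) and (1.10) concrete, the short Python sketch below fits a plain Lasso and then an adaptive Lasso via the usual column-rescaling trick. The simulated data, the regularization levels, and the use of scikit-learn's Lasso (whose loss differs from (1.7) by a constant scaling) are illustrative assumptions on our part rather than part of the exposition above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, s = 100, 200, 3                       # high-dimensional: p > n, sparse truth
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = [3.0, -2.0, 1.5]
y = X @ beta + rng.standard_normal(n)

# Plain lasso, cf. equation (1.7); the alpha value is an arbitrary illustration
init = Lasso(alpha=0.1).fit(X, y)

# Adaptive lasso, cf. equation (1.10) with w_j = 1/|beta_hat_j|: rescale the columns
# so that a plain lasso on X_tilde solves the weighted problem
w_inv = np.abs(init.coef_)                  # w_inv = 0 corresponds to w_j = infinity
X_tilde = X * w_inv                         # columns with w_inv = 0 become identically zero
gamma = Lasso(alpha=0.1).fit(X_tilde, y).coef_
beta_adaptive = gamma * w_inv               # transform back to the original scale

print("lasso nonzeros:   ", np.flatnonzero(init.coef_))
print("adaptive nonzeros:", np.flatnonzero(beta_adaptive))
```

The rescaling works because substituting $\gamma_j = w_j \beta_j$ turns the weighted penalty in (1.10) into an ordinary $l_1$ penalty on $\gamma$, so any Lasso solver can be reused for the adaptive step.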
Amoreintuitivepenaltyemergedlateras[114]proposedthesparsegrouplasso,which isacombinationofthegrouplassoandthelasso ^ =argmin 2 R p 1 n k y k 2 2 + 1 p X j =1 p g j k j k 2 + 2 p X j =1 g j X g =1 j j;g j ; (1.12) Thepenaltyencouragesbothgroup-wisesparsityandin-groupsparsity,andisgreatlybene- ˝cialinneuralnetworkmodels.Ahugenumberofdi˙erenttypesarealsowidelyusedinthe statisticsandmachinelearningapplications,forexample,thefusedLasso[127],thegraph fusedLasso[122],thetreeLasso[71]andetc. 1.2.1ChoosingtheTuningParameters Thepowerofregularizationisdecidedthroughthehyperparameter,whichisalsocalled thetuningparameter.Intuitively,choosinglargertuningparameter resultsinamore sparsemodel.However,makethemodeltoosparsemayleadtolosingthetrueunderlying relationshipbetweentheresponseandthepredictors.Severalcriteriaareusefulinhelping 6 chooseaappropriatetuningparameter,forexample,BIC[111],EBIC[24],andGIC[151, 52].Speci˝cally,[52]showedtheriskconsistencyoftuningparameterselectionusingGIC. Anotherpracticaltoolisthecross-validations,wherethetuningparameterischosensuch thatthecross-validatedmetricisoptimized.Conventionaltheoryforcross-validationisgiven inchapter8of[61]. 1.2.2AlgorithmsforTrainingSparseModels Theregularizationmethodsusuallyinvolve l 1 normpenaltyterm,whichisnoteasytosolve usingregulargradientdescentalgorithms,seeforexample[142].Thisissueisgeneralfor allmodelswith l 1 penalty.Notethatthegrouplassopenaltydoesnothaveexplicit l 1 normpenaltybutinvolvessumof l 2 norms,whichisequivalentto l 1 normpenaltyinterms ofoptimizationtheory.Coordinatedescentalgorithms[143,54],alsoknownasproximity gradientdescentalgorithmsareusefulindealingwiththe l 1 normpenalty.Forthe l 1 norm penalty,notrestrictedtothelinearregressioncase,asoft-thresholdingfunction S ( ; ): R p R ! R p ,where ( S ( z ; )) j = sign ( z j )( j z j j ) + ;j =1 ;:::;p (1.13) canbeappliedafterthegradientdescentforthesmoothparttozerosomeparameters.While forthegrouplassopenalty,thesoft-thresholdingfunctionbecomes S ( ; ): R P p j =1 g j R ! R p ,where ( S ( z ; )) j = 1 p g j k ( x ; y ; ) k 2 + j ;j =1 ;:::;p; (1.14) 7 where ( x ; y ; ) isaquantitydependingon x , y , andthelossfunction.Thecoordinate descentalgorithmmightbeslowinupdatingthegroups,thusablockco-ordinategradient descentmethodby[132]canbeapplied,whereasecondorderTaylorapproximationisused tosimplifythesmoothpartoftheoriginallossfunction. Asthedevelopmentofclassicalalgorithms,themodelscanalsobeformulatedasclassical optimizationproblemsandsolvedwithexistingsoftware.TheLassointhelinearregression casecanbeformulatedasalinearprogrammingproblem,seeforexample[30].Thegroup Lassointhelinearregressionsetupcanbeformulatedasasecondorderconequadratic programmingproblem,seeforexample[1].Maturepackagesareavailableforsolvingthese classicaloptimizationproblems.Aswementionedthe l 0 penaltybefore,thefastdevelopment ofoptimizationsoftwarealsomakethistypeofpenaltyeasiertoapply.The l 0 penaltyinthe linearregressionsetupcanbeformulatedasamixedintegersecondorderconeoptimization (MISOCO)problem,seeforexample[92]. 1.2.3Stage-wiseSelection Variousstage-wisealgorithmsareusedtoobtainapathselection.Theleastangleregression [41]providesaforwardalgorithmtoaddnewfeaturesbylookingatthecorrelation.The LARSalgorithmwithsimplemodi˝cationcanbeusedtoobtainthelassosolutionpath. 
[128]providesastage-wisealgorithm,whichprovidesveryclosesolutionpathtothelasso solutionpath.[102]studiedastage-wisealgorithmtoincorporatethe l 2 , l 1 and l 0 norm penaltywiththegradientswithrespecttotheinputweights.Thegradienthasimplicit connectionswiththecorrelationstudiedin[41]. Itworthnotingthat[125]hasshownthatthereisanequivalencebetweenusingthe stage-wisealgorithmandthegrouplassopenalty.Thisbuiltaconnectionbetweenthe 8 regularizationmethodsandthestage-wisevariableselectionalgorithms.[82]hasappliedthe resultondeeparti˝cialneuralnetworkstodofeatureselection.Comparedwithoptimizing penalizedlossfunctions,stage-wisealgorithmsstartsfromthenullmodelandaddsvariables gradually.Thisisbene˝cialtocomplicatedmodelssuchasneuralnetworks,sincelearning acomplicatedmodelonallvariablesinvolvesunpredictableuncertainty,andthusmightbe sensitivetoinitialization. 1.3TheProjectionApproach Anotherpopularapproachtodealwiththehigh-dimensionality,thoughnotstudiedtoomuch inthisthesis,isworthmentioning,theprojection-basedmethods,alsoknownasdimension- alityreduction.Projectionmethods˝ndalowerdimensionalrepresentationfortheoriginal featurespace.Classicalworkinclude[141,129,31,16,68,157,21].Projectionmethodshave more˛avorofunsupervisedlearning,sincethelowerdimensionalrepresentationshouldnot dependontheresponsebutispurelyapropertyofthedesignmatrix.Amaindrawbackof projectionbasedapproachesisthatonelosesinterpretability,becausetheprojectedfeatures arenolongertheoriginalfeatures,andthusnotinterpretable. Generaldimensionalreductionmethodsdonotworkinthehigh-dimensionalsetup.For example,theprincipalcomponentanalysis(PCA)usesalinearprojectionon x .Itusesa matrix A 2 R d p toproject x 2 R p to Ax 2 R d ,wherethematrix A ischosentomaintain thegreatestvariancebasedonthetrainingdata.Duringthecomputation,thesample covariancematrix ^ n = X T X =n isusedtoestimatethepopulationcovariancematrix = Cov ( x ) ,andtheprincipalcomponentscanbeobtainedfromtheeigenvectorsofthe samplecovariancematrix.However,inthehigh-dimensionalsetup,thesamplecovariance 9 matrixisnolongeraconsistentestimatorofthepopulationcovariancematrix,i.e. ^ n 6! asn !1 ; ifwehave p>n .[69]gavetheconditionswhenPCAworksinthehighdimensionalcase: thelargesteigenvalueshavetobelargeenough. VariationsofthePCAincludetheSimpli˝edComponentTechniqueLASSO(SCoTLASS) by[68]andthesparsePCAby[157].Theformerconsidersa L 1 penaltyontheprojection matrix A thatyieldssomesparsesolutionin A .However,thisisnotfeasibleinpractice, sincethereisnoguidelineforchoosingthetuningparameterandahighcomputational costduetothenon-convexity.ThelatterobtainedtheconnectionbetweenPCAandridge regression,andusetheridgeorelasticnettoobtainasparseapproximationtotheoriginal PCA.Inbothmethods,theirprojectionmatrix A issparseintheoriginalfeatures,i.e.,some columnsof A areexactlyzeros.Thesebuildabridgeconnectingwiththesparseassumption tosomeextent. 
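The failure of the sample covariance matrix mentioned above is easy to see numerically. The sketch below uses simulated Gaussian data; the reference value $(1+\sqrt{p/n})^2$ is a standard random-matrix approximation to the top sample eigenvalue quoted here only for context, not a result of this thesis.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 500                       # p > n: the regime discussed above
X = rng.standard_normal((n, p))      # population covariance is the identity, Sigma = I_p

S = X.T @ X / n                      # sample covariance Sigma_hat_n
top_eig = np.linalg.eigvalsh(S)[-1]  # largest sample eigenvalue

# Every population eigenvalue equals 1, yet the top sample eigenvalue sits near
# (1 + sqrt(p/n))^2, far from 1 when p/n is large, and S is rank-deficient.
print(f"top sample eigenvalue : {top_eig:.2f}")
print(f"(1 + sqrt(p/n))^2     : {(1 + np.sqrt(p / n)) ** 2:.2f}")
print(f"rank of Sigma_hat     : {np.linalg.matrix_rank(S)} (at most n = {n})")
```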
Aftertheclassicworkofmanifoldlearning(non-lineardimensionalityreduction)such as[130,107,124],currentmanifoldlearningfocusesontwoaspects:imageprocessingand datavisualization.Inimageprocessing,manifoldlearningcanbeusedtoreducethedi- mensionalitytoitsintrinsicdimension,seeforexample[89,154,84].Theseapplicationsare notgeneralizedtoabroadersetup.Manifoldlearningappliedtodatavisualizationusually reducesthedimensionalitytotwoorthree,seeforexample[139].Theseapplicationsarenot usefulinbuildingmodels.Anotherconcernistheout-of-sampleperformanceofmanifold learning,whichisrecentlystudiedby[123].Accordingtothepaper,mostcurrentmanifold learningalgorithmsfailtomapnewdatatothepreviouslylearntembedding. 10 Autoencoder[75]usesneuralnetworkasanon-linearprojectiontoencodetheoriginal featurespacetoalowerdimensionalspace,whichischosentobesuchthattheoriginal featurespacecanberecoveredwithanotherneuralnetwork.Trainingahigh-dimension neuralnetworkisstillchallenging,sparseautoencoder[100,90]eliminatestheinsigni˝cant connectionsandthuskillsthoseparameters. Anotherusefultoolistherandomprojection,whichreliesontheJohnson-Lindenstrauss lemma Lemma1.1 ([67]) . Given 0 << 1 ,aset X of m pointsin R N ,andanumber n> 8log( m ) 2 ,thereisalinearmap f : R N ! R n suchthat (1 ) k u v k 2 k f ( u ) f ( v ) k 2 (1+ ) k u v k 2 forall u;v 2 X . Thelemmastatesthatthereexistalowerdimensionalprojectionthatwellmaintains thedistanceintheoriginalfeaturespace,however,thereisn'taguidelineofhowto˝nd suchalowerdimensionalspace.[21]usedrandomprojectionoverclassicalclassi˝ersasan ensembleclassi˝er,andprovidedane˚cientalgorithmduetothefactthattherandom projectionmatricesaredrawnfromdistributionsratherthancomputedfromdata.They alsoshowedthatthebestrandomprojectioncanbefoundinlineartimewhentherandom projectionmatricesaresampledaccordingtoHaarmeasure.Onepossibledrawbackof randomprojectionisthatitdoesnotprovideuswithanylower-dimensionalrepresentation information. 11 1.4Non-parametricModeling Non-parametricmodelingassumesanarbitraryrelationshipbetweentheresponseandthe variables.Wewouldliketo˝ndafunction f ( ): R p ! R suchthat f ( x ) isagoodapproxi- mationof y orarepresentationof y ,see[61].It'scommontoassumethat f ( ) iscontinuous, butthisisnotnecessary.Complicatedmachinelearningmodelssuchastreemodels,see [108,81,56],neuralnetworks,see[116]andetc.canbeconsideredasnon-parametricmod- elsthatapproximateacomplicatedrelationship.Anarbitraryfunctionwithgreat˛exibility canbewiggly,i.e.,itmayperfectly˝tthetrainingdata.Therefore,peopleusuallyneedto restrictthesmoothnessofafunctionbyrestricting Z f 00 ( x ) 2 dx; (1.15) seeforexample[11].Obviously,alinearfunctionhasthequantityequaltozeroandthusis thesmoothest,buttheapproximationpowerofalinearfunctionissub-optimal.Therefore, abalancebetweentheapproximationpowerandthesmoothnessneedstobeconsidered. 1.4.1BasisExpansion Directlyworkingwiththearbitraryfunction f isnotrealistic,sincethecandidatespace isin˝nitelydimensional.Therefore,parametricapproximationsareneed.Considerthe univariatecase,agreatnumberofapproachescanbeusedtoapproximateacontinuous function f ( ): R ! 
$\mathbb{R}$. The most popular approach is basis expansion,
\[
f(x) = \sum_{k=1}^{\infty} \beta_k \phi_k(x), \tag{1.16}
\]
where $\{\phi_k(\cdot),\ k \in \mathbb{N}^+\}$ is a set of basis functions. A finite number of basis functions is usually used to approximate the infinite sum. As mentioned above, a basis with maximal differentiability is desirable, and therefore the B-spline basis is among the most popular choices, see [110]. According to [120], a spline of degree $l$ consists of piece-wise polynomials up to degree $l$ on $K$ pre-specified partitions, with the pieces $l'$ times continuously differentiable at the connecting knots, where $0 \le l' \le l-2$. According to [65], there exists a normalized B-spline basis $\{\phi_k\}$ such that
\[
f_n(x) = \sum_{k=1}^{K+l} \beta_k \phi_k(x). \tag{1.17}
\]
[65] showed that, using the above B-spline basis, we have the following approximation error:
\[
\|f - f_n\|_2^2 = \int_a^b (f(x) - f_n(x))^2\, dx = O\big((K+l)^{-2d}\big), \tag{1.18}
\]
where the parameter $d = k + \alpha$ is such that the $k$th derivative of the function $f(\cdot)$ satisfies the Lipschitz condition of order $\alpha$,
\[
|f^{(k)}(s) - f^{(k)}(t)| \le C |s - t|^{\alpha},
\]
for some constant $C$. This result shows that, as we increase the number of partitions in the basis expansion, the approximation error can be made arbitrarily small under certain conditions.

1.4.2 Neural Networks

To approximate an arbitrary multivariate function $f(\cdot): \mathbb{R}^p \to \mathbb{R}$, the neural network is a powerful tool. A neural network mimics the human brain: the input does not lead to the output directly; instead, a number of intermediate nodes are used. Originating from the multi-layer perceptron of [106], which is a multi-layer version of the perceptron of [105], neural networks have been getting deeper and more powerful in approximating continuous functions. Figure 1.1 shows a diagram of a single hidden layer neural network, also known as a shallow neural network. The hidden layer consists of linear combinations of the input layer's values passed through a non-polynomial activation function.

[Figure 1.1: A diagram for the single-layer neural network model, showing the input layer (Input 1 through Input p), the hidden layer and the output layer.]

Mathematically, let $\mathbf{x} \in \mathbb{R}^p$ be the input vector; the output of a shallow neural network is
\[
\mu(\mathbf{x}) = \sum_{k=1}^{K} \beta_k\, \sigma(\boldsymbol{\theta}_k^T \mathbf{x} + t_k) + b \in \mathbb{R}, \tag{1.19}
\]
where $\beta_k \in \mathbb{R}$, $k = 1, \ldots, K$, are the output layer coefficients, $\boldsymbol{\theta} \in \mathbb{R}^{K \times p}$ is the hidden layer coefficient matrix (with rows $\boldsymbol{\theta}_k$), $\mathbf{t}$ is the hidden layer intercept vector, $b$ is the output layer intercept, and $\sigma(\cdot)$ is an activation function. A notable theorem on the approximation power of a shallow neural network is given by [33] as follows.

Theorem 1.1 (Universal approximation theorem, Cybenko 1989). Let $\sigma(\cdot)$ be such that $\sigma(t) \to 0$ as $t \to -\infty$ and $\sigma(t) \to 1$ as $t \to \infty$. For a continuous function $f$ on $[0,1]^n$ and an arbitrary $\epsilon > 0$, there exist a $K$ and parameters $\beta, \boldsymbol{\theta}, \mathbf{t}, b$ such that
\[
|f(\mathbf{x}) - \mu(\mathbf{x})| < \epsilon \quad \forall\, \mathbf{x} \in [0,1]^n.
\]

This theorem guarantees the universal approximation ability of shallow neural networks and is the reason that neural networks are useful in so many settings. Typical activation functions include
• Hyperbolic tangent: $\sigma(x) = \tanh(x)$.
• Sigmoid: $\sigma(x) = e^x / (1 + e^x)$.
• Rectified Linear Unit (ReLU): $\sigma(x) = x_+$.
• Leaky ReLU: $\sigma(x) = \max\{x, \alpha x\}$ for some small $\alpha > 0$.
In fact, as long as the activation function is not a polynomial, a shallow neural network has the universal approximation ability, see [113]. The ReLU is one of the most popular activation functions nowadays, though it may suffer from the dead ReLU problem, see [87]; it is computationally efficient and converges fast.
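As a minimal illustration of (1.19) and of the activation functions listed above, the following numpy sketch evaluates a shallow network with randomly drawn parameters. The random parameters and dimensions are purely illustrative assumptions; in later chapters such networks are fitted by penalized optimization rather than fixed at random values.

```python
import numpy as np

# The activation functions listed above
activations = {
    "tanh":       np.tanh,
    "sigmoid":    lambda x: 1.0 / (1.0 + np.exp(-x)),
    "relu":       lambda x: np.maximum(x, 0.0),
    "leaky_relu": lambda x: np.maximum(x, 0.01 * x),   # alpha = 0.01
}

def shallow_net(x, beta, theta, t, b, sigma):
    """mu(x) = sum_k beta_k * sigma(theta_k^T x + t_k) + b, as in equation (1.19)."""
    hidden = sigma(theta @ x + t)        # K hidden node outputs
    return beta @ hidden + b

rng = np.random.default_rng(4)
p, K = 10, 32
x = rng.standard_normal(p)               # one input vector
beta = rng.standard_normal(K)            # output-layer coefficients
theta = rng.standard_normal((K, p))      # hidden-layer coefficient matrix
t = rng.standard_normal(K)               # hidden-layer intercepts
b = 0.0                                  # output-layer intercept

for name, sigma in activations.items():
    print(name, shallow_net(x, beta, theta, t, b, sigma))
```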
1.4.3DeepNeuralNetworks Deepneuralnetworksrefertoneuralnetworkswithmorethanonehiddenlayers.Asthe developmentofdeepneuralnetworks,peoplefoundthelimitationsofshallowneuralnet- worksuchthatthenumberofneuronsneededtoachieveadesireerrorincreasesasfastas exponentially,see[29,28].Afterthatpeoplefoundthattwohiddenlayermodelmaybe signi˝cantlymorepromisingthanthesinglehiddenlayermosee[103].Sumneuralnet- works,orequivalently,polynomialneuralnetworkshavebeenstudied[36,86],anduniversal 15 approximationpropertyhasbeenestablishedrecentlyby[43]thatacontinuousfunctionin F n d canbeapproximatedwitherror byaquadraticnetworkwhohavedepth O log(log( 1 ))+log( 1 ) andnumberofweights O log(log( 1 ))( 1 ) d=n +log( 1 )( 1 ) d=n where d isthedimensionofdomain.Theapproximationtheoryforregulardeepneural networkshavealsobeenestablishedrecently.[104]showedthatadeepnetworkneed O ( n 1) L 2 modelcomplexitytoapproximatea L -Lipshitzcontinuousfunctionof n variablesinsteadof O L n inashallowneuralnetwork.[112,113]providesmoredetailedresultsforthedeepneural networkapproximationpower. Therefore,it'spromisingthatadeeperneuralnetworkworksbetterinlearningcompli- catedrelationships.Throughthisthesis,wewillstartfromthegeneralizedadditivemodel (GAM),studyitspropertiesandtuningparameterselectioninChapter2,thenmovetoa morecomplicatedshallowneuralnetworkmodelandtalkaboutitsasymptoticpropertiesin chapter3,and˝nallydiveintodeepneuralnetworksinchapter4,whereweproposedanen- 16 semblevariableselectionalgorithmandestimationmethod.Inchapter5,wewillsummarize theworkwehavedoneandthefutureworktobedone. 17 Chapter2 High-dimensionalGeneralizedAdditive Modeling Inthischapter,wewillstudythegeneralizedadditivemodel(GAM),see[63].Weconsider anon-linearlinkfunctionthatbelongtotheexponentialfamily.Atwo-stepapproachis usedforvariableselectionandestimation,withagrouplassofollowedbyanadaptivegroup lassoappliedtothebasisexpansionofthenon-parametricfunctions.Wewillshowthatthe variableselectionresultinthesecondstepisconsistent,andwealsoderivetheconvergence rateofthesecondstepestimator.Moreover,wediscussthetuningparameterselectionand showedthatwereachriskconsistencybyusingthegeneralizedinformationcriterion(GIC). Simulationsandreaddataexamplesonvariousaspectsaregiventosupportthearguments. 
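As a preview of the two-step procedure just described, the sketch below implements a proximal gradient (block soft-thresholding) solver for a group lasso penalized logistic log-likelihood and then reuses it with weights $1/\|\hat{\beta}_j\|_2$ for the adaptive step. Everything here is a simplified illustration under assumed settings, namely a simulated basis-expanded design, arbitrary penalty levels and a plain proximal-gradient solver; the formal estimators, assumptions and tuning via GIC are developed in the sections below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def group_lasso_logistic(Phi, y, groups, lam, weights=None, n_iter=2000):
    """Proximal gradient descent for an (adaptive) group lasso penalized logistic
    log-likelihood; `groups` is a list of column-index arrays, one per predictor."""
    n, q = Phi.shape
    w = np.ones(len(groups)) if weights is None else weights
    step = 4.0 * n / (np.linalg.norm(Phi, 2) ** 2)   # 1 / Lipschitz bound of the gradient
    beta = np.zeros(q)
    for _ in range(n_iter):
        grad = Phi.T @ (sigmoid(Phi @ beta) - y) / n  # gradient of the smooth part
        z = beta - step * grad
        for j, g in enumerate(groups):                # block soft-thresholding
            norm = np.linalg.norm(z[g])
            shrink = step * lam * w[j] * np.sqrt(len(g))
            z[g] = 0.0 if norm <= shrink else (1.0 - shrink / norm) * z[g]
        beta = z
    return beta

# Minimal illustration on a simulated basis-expanded design: only group 0 is active
rng = np.random.default_rng(6)
n, p_n, m_n = 100, 40, 5
Phi = rng.standard_normal((n, p_n * m_n))
groups = [np.arange(j * m_n, (j + 1) * m_n) for j in range(p_n)]
y = rng.binomial(1, sigmoid(Phi[:, :m_n] @ np.ones(m_n)))

beta_gl = group_lasso_logistic(Phi, y, groups, lam=0.05)                      # screening step
w = np.array([1.0 / max(np.linalg.norm(beta_gl[g]), 1e-12) for g in groups])  # adaptive weights
beta_agl = group_lasso_logistic(Phi, y, groups, lam=0.05, weights=w)          # adaptive step
print([j for j, g in enumerate(groups) if np.linalg.norm(beta_agl[g]) > 0])
```

In practice a block coordinate gradient descent method such as [132], rather than the plain proximal gradient loop used here, is typically applied to the penalized likelihood.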
2.1 Introduction

The main objective of this work is to establish a theory-driven high-dimensional generalized additive modeling method with nonlinear links; the methodology includes consistent selection of the tuning parameter. Additive models play important roles in non-parametric statistical modeling and in machine learning. Although this important statistical learning tool has been used in many applications and free software is available for implementing these models and their variations, to our surprise there is no procedure with a nonlinear link that has been studied systematically with theoretical foundations. Generalized additive modeling allows a nonlinear relationship between a response variable and a set of covariates. It includes as a special case the generalized linear model, obtained by letting each additive component be a linear function. In general, let $(y_i, \mathbf{X}_i)$, $i = 1, \ldots, n$, be independent observations, where the $y_i$'s are response variables whose corresponding $p$-dimensional covariate vectors are the $\mathbf{X}_i$'s. A generalized additive model [62] is defined as
\[
\mu_i = E(y_i \mid \mathbf{X}_i) = g^{-1}\left( \sum_{j=1}^{p_n} f_j(X_{ij}) \right), \tag{2.1}
\]
where $g(\cdot)$ is a link function, the $f_j$'s are unspecified smooth functions and $X_{ij}$ is the $j$th component of the vector $\mathbf{X}_i$. One of the functions could be a constant, which gives the intercept term, but this is not necessary. The number of additive components is written as $p_n$, since it sometimes (usually in the high-dimensional setup) increases as $n$ increases. A simple case that has been widely studied is $p_n = p$, where the number of additive components is fixed and less than the sample size $n$. The choice of link function is as simple as in generalized linear models: one prefers link functions under which the distribution of the response belongs to a popular exponential family. A widely used generalized additive model has the identity link function $g(\mu) = \mu$, which gives the classical additive model
\[
y_i = \sum_{j=1}^{p_n} f_j(X_{ij}) + \epsilon_i, \tag{2.2}
\]
where the $\epsilon_i$'s are i.i.d. random variables with mean $0$ and finite variance $\sigma^2$.

On the other hand, high-dimensional data have increasingly become a part of many modern scientific applications. Often the number of covariates $p_n$ is much larger than the number of observations $n$, which is usually written as $p_n \gg n$. A most interesting scale is when $p_n$ increases exponentially as $n$ increases, i.e., $\log p_n = O(n^{\rho})$ for some constant $\rho > 0$; [48] called this non-polynomial dimensionality or ultrahigh-dimensionality.

In this chapter, we consider the generalized additive model in a high-dimensional setup. To avoid identification problems, the functions are assumed to be sparse, i.e., only a small proportion of the functions are non-zero and all the others are exactly zero. A more general setup allows the number of nonzero functions, denoted $s_n$, to diverge as $n$ increases. This case is also considered in this chapter.
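To fix ideas, the sketch below simulates data from a sparse model of the form (2.1) with a logit link, in which only $s_n$ of the $p_n$ additive components are nonzero. The particular component functions and dimensions are arbitrary choices made for illustration and are not the simulation settings used later in Section 2.6.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p_n, s_n = 100, 200, 3              # p_n >> n, with s_n truly nonzero components

# Sparse additive components: only f_1, f_2, f_3 are nonzero, all others are zero
f = [np.sin, lambda x: x ** 2 - 0.5, np.tanh]

X = rng.uniform(-1.0, 1.0, size=(n, p_n))
eta = sum(f[j](X[:, j]) for j in range(s_n))    # additive predictor sum_j f_j(X_ij)

# Logit link: mu_i = g^{-1}(eta_i) = exp(eta_i) / (1 + exp(eta_i)), a special case of (2.1)
mu = 1.0 / (1.0 + np.exp(-eta))
y = rng.binomial(1, mu)                          # binary responses

print(X.shape, round(y.mean(), 3))
```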
Manyothershaveworkedongeneralizedadditivemodels.Commonapproachesuseba- sisexpansiontodealwiththenonparametricfunctions,andperformvariableselectionand estimationmethodsonthebases.[94]consideredasimplercase,asin2.2,withanew sparsity-smoothnesspenaltyandprovedit'soracleproperty.Theyalsoperformedasimu- lationstudyunderlogitlinkwiththeirnewpenalty,however,notheoreticalsupportwas provided.[45]proposedthenonparametricindependencescreening(NIS)methodinscreen- ingthemodelasin2.2.However,theselectionconsistencyandthegeneralizedlinkfunctions werenotdiscussed.[91]discussedthepracticalvariableselectioninadditivemodels,but notheorywasgiven.[83]consideredatwo-steporacallye˚cientapproachingeneralized additivemodelsinthelowdimensionalsetup,butnovariableselectioninthehighdimen- sionalsetupwasdone.[65]focusedonthevariableselectionof2.2with˝xednumberof nonzerofunctionsusingatwostepapproach:Firstgrouplasso[7,148]onthebasestoselect thenonzerocovariatesandthenuseadaptivegrouplassotoestimatethebasescoe˚cients. Theythenestablishedtheselectionconsistencyandprovidedtherateofconvergenceofthe estimation.[2]reviewedseveralexistingalgorithmshighlightingtheconnectionsbetween them,includingthenon-negativegarrote,COSSOandadaptiveshrinkage,andpresented 20 somecomputationallye˚cientalgorithmsfor˝ttingtheadditivemodels.[98]extendedthe consistencyandrateofconvergenceof[65]tospatialadditivemodels.[51]studiedthe GAMwithidentitylinkundertheendogeneitysetting.Itworthmentioningthatalternative methodstopenalizationhavealsobeenstudied,forexample,[134]studied˝ttingGAMand performvariableselecitonimplicitlythroughlikelihoodbasedboosting. However,thoughwidelyused,nosystematictheoryaboutselectionandestimationcon- sistencyandrateofconvergencehasbeenestablishedforgeneralizedadditivemodelswith non-identitylinkfunctions. Inthischapter,weestablishthetheorypartforgeneralizedadditivemodelswithnon- identitylinkfunctionsinhighdimensionalsetup.Wedevelopatwo-stepselectionapproach, whereinthe˝rststepweusegrouplassotoperformascreening,which,undermildassump- tions,isabletoselectallnonzerofunctionsandnotover-selecttoomuch.Inthesecond step,theadaptivegrouplassoprocedureisusedandisprovedtoselectthetruepredictors consistently. Anotherimportantpracticalissueinvariableselectionandpenalizedoptimizationprob- lemsistuningparameterselection.Variouscrossvalidation(CV)techniqueshavebeenused inpracticeforalongtime.InformationcriteriasuchasAkaikeinformationcriterion(AIC), AICc,Bayesianinformationcriterion(BIC),Mallow's C p andetc.havebeenusedtoselect `thebest'modelaswell.Manyequivalencesamongthetuningparameterselectionmethods havebeenshownintheGaussianlinearregressioncase.However,theconsistencyofthese selectionmethodswerenotestablished.Latersomevariationsoftheinformationcriteria suchasmodi˝edBIC[150,137]extendedBIC[24]andgeneralizedinformationcriterion (GIC)[52]wereproposedandshowntohavegoodasymptoticpropertiesinpenalizedlinear modelsandpenalizedlikelihoods.However,theresultsarenotusefulforgroupedvariablesin 21 additivemodels,forwhichbasisexpansiontechniqueisusuallyusedandthusbringsgrouped selection. Inthischapter,wegeneralizetheresultofgeneralizedinformationcriterion(GIC)by [52]togroup-penalizedlikelihoodproblemsandshowthatundersomecommonconditions andwithagoodchoiceoftheparameterinGIC,weareabletoselectthetuningparameter thatcorrespondstothetruemodel. 
Insection2.2,themodelisspeci˝edandbasicapproachisdiscussed.Notationsand basicassumptionsarealsointroducedinthissection.Section2.3givesthemainresultsof thetwostepsselectionandestimationprocedure.Section2.4developsthetuningparameter selection.Variationofpenaltyfunctionsisdiscussedinsection2.5.Extensivesimulation studyandrealdataexamplearepresentedinsection2.6followedbyashortdiscussionin section2.7.TheproofsofalltheoremsaredeferredtoAppendixA. 2.2Model Weconsiderthegeneralizedadditivemodel(2.1)wherethelinkfunctioncorrespondstoan exponentialfamilydistributionoftheresponse.Foreachofthe n independentobservations, thedensityfunctionisgivenas f y i ( y )= c ( y )exp y i b ( i ) ˚ ; 1 i n; i 2 R : (2.3) Withoutlossofgenerality,weassumethatthedispersionparameter 0 <˚< 1 isassumed tobeaknownconstant.Speci˝callyweassume ˚ =1 .Weconsidera˝xed-designthrough thispaper,i.e.,thedesignmatrix X isassumedtobe˝xed.However,wehaveshownin 22 appendixAthatthesametheoryworksforarandomdesignundersimpleassumptionson thedistributionof X .Theadditiverelationshipassumesthatthedensitiesof y i 'sdepend on X i 'sthroughtheadditivestructure i = P p n j =1 f j ( X ij ) .Thisisthecanonicallink.Ifwe useotherlinkfunctions,forexample, A ( ) ,thetheoryalsoworksaslongasthefunctions A ( ) satis˝estheLipschitzconditionsforsomeorder.Let b ( k ) ( ) bethe k -thderivativeof b ( ) ,thenbypropertyoftheexponentialfamily,theexpectationandvariancematrixof y =( y 1 ;:::;y n ) T ,undermildassumptionsof b ( ) ,isgivenby ( ) and ˚ ) ,where ( )=( b (1) ( 1 ) ;:::;b (1) ( n )) T and )= diag f b (2) ( 1 ) ;:::;b (2) ( n ) g : (2.4) Thelog-likelihood(ignoringtheterm c ( y ) whichisnotinterestingtousinparameter estimation)canbewrittenas l = n X i =1 2 4 y i 0 @ p n X j =1 f j ( X ij ) 1 A b 0 @ p n X j =1 f j ( X ij ) 1 A 3 5 : (2.5) AssumethattheadditivecomponentsbelongtotheSobolevspace W d 2 ([ a;b ]) .According to[110],seepages268-270,thereexistsB-splineapproximation f nj ( x )= m n X k =1 jk ˚ k ( x ) ; 1 j p: (2.6) with m n = K n + l ,where K n isthenumberofinternalknotsand l d isthedegreeofthe splines.Generally,itischosedthat d =2 and l =4 ,i.e.,cubicsplines. Usingtheapproximationabove,[65]provedthat f nj wellapproximates f j inthesense 23 ofrateofconvergencethat k f j f nj k 2 2 = Z b a ( f j ( x ) f nj ( x )) 2 dx = O ( m 2 d n ) : (2.7) Therefore,usingthebasisapproximation,thelog-likelihood(ignoringtheterm c ( y ) which isnotrelatedtotheparameters)canbewrittenas l = n X i =1 2 4 y i 0 @ p n X j =1 m n X k =1 0 jk k ( x ij ) 1 A b 0 @ p n X j =1 m n X k =1 0 jk k ( x ij ) 1 A 3 5 = n X i =1 h y i 0 T i b 0 T i ; (2.8) where 0 and i arethevectorbasiscoe˚cientsandbasesde˝nedbelow. It'salsoworthnotingthatthenumberofbases m n increasesas n increases.This isnecessarysince[110]mentionedthatoneneedtohavesu˚cientpartitionstowellap- proximate f j by f nj .Ifwe˝x m n ,i.e.let m n = m 0 ,thoughinthelaterpartwe willshowtheapproachtoestimatethebasiscoe˚cientscanhavebetterrateofconver- gence,theapproximationerrorbetweentheadditivecomponentsandthesplinefunctions k f j ( x ) f nj ( x ) k 2 =[ R b a ( f j ( x ) f nj ( x )) 2 dx ] 1 = 2 = O (1) willincreaseandleadtoinconsistent estimations.Therefore, m n ,ormoreprecisely, K n ,needtoincreasewith n . Ourselectionandestimationapproachwillbebasedonthebasesapproximatedloglike- lihood2.8.Beforestartingthemethodology,welistthenotationsandstatetheassumptions weneedinthispaper. Notations Thedesignmatrixis X ( n p n ) =( x 1 ;:::; x n ) T . 
24 Thebasismatrixis ( n m n p n ) = 1 ;:::; n ) T ,where i =( ˚ 1 ( x i 1 ) ;:::;˚ m n ( x i 1 ) ;:::;˚ 1 ( x ip n ) ;:::;˚ m n ( x ip n )) T : Thetruebasisparameters 0 =( 0 11 ;:::; 0 1 m n ;:::; 0 p n 1 ;:::; 0 p n m n ) T 2 R m n p n Weassumethefunctions f 1 ;:::;f p n aresparse,then 0 isblock-wisesparse,i.e.the blocks 0 1 =( 0 11 ;:::; 0 1 m n ) T ;:::; 0 p n =( 0 p n 1 ;:::; 0 p n m n ) T aresparse. Let y betheexpectationof y basedonthetruebasisparametersand " = y y . De˝netherelationship a n b n asthereexistsa˝niteconstant c suchthat a n cb n . Foranyfunction f de˝ne k f k 2 =[ R b a f 2 ( x ) dx ] 1 = 2 ,whenevertheintegralexists. Foranytwocollectionsofindices S; ~ S f 1 ;:::;p n g ,thedi˙erencesetisdenoted S ~ S . Thecardinalityof S isdenotedcard ( S ) .Forany 2 R m n p n ,de˝ne 1 ;:::; p n asitssub- blocks,where i 2 R m n ,andde˝netheblock-wisesupport supp B ( )= f j 2f 1 ;:::;p n g ; j 6 = 0 g : De˝netheblock-wisecardinality card B ( )= card ( supp B ( )) : 25 For S = f s 1 ;:::;s q gf 1 ;:::;p n g ,de˝nesub-blockvector S =( T s 1 ;:::; T s q ) T Thenumberofadditivecomponentsisdenoted p n ,whichispossibletogrowfasterthanthe samplesize n .Let T = supp B ( 0 ) and T c bethecomplimentset.Letcard ( T )= s n ,where s n isallowedtodivergeslowerthan n . Foreach U f 1 ;:::;p n g withcard ( U T ) m forsome m ,de˝ne B ( U )= f 2 R m n p n ; supp B ( ) U g ; B ( m )= fB ( U ); forany U f 1 ;:::;p n g ; Card ( U T ) m g : Let q beanintegersuchthat q>s n and q = o ( n ) .De˝ne B 1 = f 2B : card B ( ) q g ; where B isasu˚cientlylarge,convexandcompactsetin R d . Assumptions Assumption2.1 (Ondesignmatrix) . UsingthenormalizedB-splinebases,thebasisma- trix haseachcovariatevector j ;j =1 ;:::;p n bounded,i.e., 9 c suchthat k j k 2 p nc ; 8 j =1 ;:::;m n p n . Assumption2.2 (RestrictedEigenvalues RE ) . Foragivensequence N n ,thereexist 0 and 26 1 suchthat 0 2 s n 2 m 1 n T T n k k 2 2 1 m 1 n ; (2.9) where 2 isapositiveconstantsuchthat 0 < 2 < 0 : 5 ,forall 2C ,where T =( T 1 ;:::; T p n ) and C = f 2 R p n m n : k k 2 6 =0 ; k k 2 N n andcard B ( )= o ( s n ) g : (2.10) Assumption2.3 (Ontheexponentialfamilydistribution) . Thefunction b ( ) isthreetimes di˙erentiablewith c 1 b 00 ( ) c 1 1 and j b 000 ( ) j c 1 1 initsdomainforsomeconstant c 1 > 0 .Forunboundedandnon-Gaussiandistributed Y i ,thereexistsadivergingsequence M n = o ( p n ) suchthat sup 2B 1 max 1 i n b 0 T i M n : (2.11) Additionallytheerrorterm i = y i y i 'sfollowtheuniformsub-Gaussiandistribution,i.e., thereexistconstants c 2 > 0 suchthatuniformlyforall i =1 ;::;n ,wehave P ( j i j t ) 2exp( c 2 t 2 ) forany t> 0 : (2.12) Assumption2.4 (Onnonzerofunctioncoe˚cients) . Thereexistasequence c f;n thatmay tendtozeroas n !1 suchthatforall j 2 T ,thetruenonzerofunctions f j satisfy min j 2 T k f j k 2 c f;n : Wenotethatassumption2.1isastandardassumptioninhighdimensionalmodels,where thedesignmatrixneedstobeboundedfromabove.assumption2.2isawell-knowncondition inhigh-dimensionsetupontheempiricalGrammatrix[15].Itisdi˙erentthantheregular 27 eigenvaluecondition,sincewhen n

0 ,choosetheregularizationparameter b n 1 = C p m n r n +log( p n m n ) n forboundedresponse(i.e., j y i j 0 ,choosetheregularizationparameter b n 1 = C r n +log( p n ) n forboundedresponse(i.e., j y i j 0 ,choosetheregularizationparameter b n 1 = C p m n r n +log( p n m n ) n forboundedresponse(i.e., j y i j > > < > > > : k ^ j k 1 2 ; if k ^ j k 2 > 0 1 ; if k ^ j k 2 =0 : (2.17) Let ^ AGL betheoptimizerfor2.16,i.e. ^ AGL =argmin 2 R m n p L a ( ; n 2 ) : Forthechoiceofweights,the˝rststageestimatorsneednottobenecessarilythesolution ofgrouplasso,rathermoregeneralthatsatisfyfollowingassumptions. Assumption2.5. Theinitialestimator ^ is r n consistentatzero,i.e., r n max j 2 T c k ^ j 0 j k 2 = O P (1) ; (2.18) 36 andthereexistsaconstant c 3 suchthat P min j 2 T k ^ k 2 c 3 b n 1 ! 1 ; (2.19) where b n 1 =min j 2 T k 0 j k 2 . Assumption2.6. Let s n = p n s n bethenumberofnonzerocomponents.Thetuning parameter n 2 satis˝es p log( s n m n ) n 1 = 2 n 2 r n + s n n 2 r n m d +1 = 2 n + n 2 r n n p s n =n = o (1) (2.20) foranydivergingsequence n . assumption2.5givestherestrictionsontheinitialestimator.Wedon'trequireourinitial estimatortobethegrouplassoestimator,butanyinitialestimatorsatisfyingassumption2.5 willbeabletomaketheadaptivegrouplassoestimatorconsistentlyselectsandestimates thetruenonzerocomponents.However,therateofconvergenceoftheadaptivegrouplasso estimatordependsontherateofconvergenceoftheinitialestimator,whichisassumedto be r n inassumption2.5.Moreover,theinitialmustn'thavea0estimationforthenonzero components,otherwiseitwillmisleadtheresultsintheproceedingstep.assumption2.6 putrestrictionsonthetuningparameter n 2 intheadaptivegrouplassostep.The˝rsttwo termsgivestheupperboundfor n 2 andthethirdtermgivesthelowerbound.Onlywith choiceof n 2 wecanhavetheselectionconsistencyandestimationconsistency. Itworthnotingthatifwetakethegrouplassoestimatorasourinitialestimator,as- sumptions5and6areautomaticallysatis˝ed.Speci˝cally,atrivialchoiceof r n would 37 be r n = O 1 P p s n s n 2 m n p log( p n m n ) p n ! + O 1 ( b n 1 m n p s n s n 2 )+ O 1 ( s n m 0 : 5 d n s n 2 ) fortheboundedresponseand r n = O 1 P p s n s n 2 p n m n p log( p n m n ) p n ! + O 1 ( ub n 1 m n p s n s n 2 )+ O 1 ( s n m 0 : 5 d n s n 2 ) fortheunboundedcaseandanydivergingsequence n ,sinceweobservethatfor j 2 T c , ^ j iseitherestimatedaszero,orhasarateofconvergenceto j boundedbytherateof convergenceintheorem(2.1).Forequation(2.19),observethattherateofconvergenceofthe grouplassoestimatorishigherorderin˝nitesimaloftheminimalsignalstrengthofnonzero coe˚cients,thustaking c 3 =0 : 5 issu˚cient.Inassumption6,withourtrivialchoiceof r n , weareableto˝ndarangeoftuningparametersthatsatisfyequation(2.20).Therefore,it's reasonabletotakethegrouplassoestimatorasaninitialestimatorfortheadaptivegroup lasso. Letthenotation ^ n 0 = 0 denotethatthesignofeach ^ j and 0 j areeitherbothzero orbothnonzero.Thenwehavethefollowingasymptoticpropertiesfortheadaptivegroup lassoestimator. Theorem2.3. Assumeassumptions1-6hold,considertheestimator ^ AGL byminimizing 2.16,wehave (i) If f c;n >> p s n =n ,theadaptivegrouplassoconsistentlyselectsthetrueactivepredic- 38 torswithprobabilityconvergingto1,i.e., P ^ AGL 0 = 0 ! 
1 : (2.21) (ii) Therateofconvergenceoftheadaptivegrouplassoestimatorisgivenby X j 2 T k ^ AGLj 0 j k 2 2 = O p s n 2 s n 2 m 2 n log( s n m n ) n + O ( s 2 n 2 s n 2 m 1 2 d n ) + O ( 2 n 2 m 2 n s n 2 s n 2 ) fortheboundedresponsecaseand X j 2 T k ^ AGLj 0 j k 2 2 = O p n s n 2 s n 2 m 2 n log( s n m n ) n + O ( s 2 n 2 s n 2 m 1 2 d n ) + O ( 2 n 2 m 2 n s n 2 s n 2 ) fortheunboundedresponsecase,where n isanydivergingsequence. TheproofofthistheoremisgiveninAppendixA.It'sinterestingtocomparetheadaptive grouplassoresultswith[138],whostudiedtheasymptoticpropertiesoftheadaptivegroup lassoforgeneralizedlinearmodels.Itworthnotingthatweconsideredamoregeneralcase byallowingthegroupsizetodivergewith n ,andtheeigenvaluetobeboundedbysequences thatdependingon n onabroaderdomain.Inthespecialcasethatcorrespondstotheir assumptions,ourresults(Theorem4.2)coincideswiththeirresults.Theseresultsarealso trueinthespecialcase:generalizedlinearmodels.Thespecialcaseisgiveninthefollowing corollary. 39 Corollary2.2. Considerthegeneralizedlinearmodelswiththeassumptionssameasin theorem2.3,butwith n 2 satis˝es p log( s n ) n 1 = 2 n 2 r n + n 2 r n n p s n =n = o (1) ; wehave (i) If f c;n >> p s n =n ,theadaptivelassoconsistentlyselectsthetrueactivepredictors withprobabilityconvergingto1,i.e., P ^ AL 0 = 0 ! 1 ; (2.22) where ^ AL istheadaptivelassoestimatorcorrespondingtotheadaptivegrouplasso estimatorinthenonparametricsetup. (ii) Therateofconvergenceoftheadaptivelassoestimatorisgivenby k ^ AL 0 k 2 2 = O P s n log( s n ) n + O ( 2 n 2 s n ) (2.23) fortheboundedresponsecaseand k ^ AL 0 k 2 2 = O P n s n log( s n ) n + O ( 2 n 2 s n ) (2.24) fortheunboundedresponsecase,where n isanydivergingsequence. Similartothegrouplassoestimator,wealsoderiveresultsforthenon-parametricfunction estimation. 40 Let ^ f AGLj ( x )= j ( x ) ^ AGLj ,thenwehavethefollowingresults. Theorem2.4. Assumeassumptions1-6hold,considertheestimator ^ AGL byminimizing 2.16,wehave (i) Theadaptivegrouplassoconsistentlyselectsthetrueactivepredictorsinprobability, i.e., P k ^ f AGLj ( x ) k 2 > 0 ;j 2 Tand k ^ f AGLj ( x ) k 2 =0 ;j 2 T c ! 1 : (2.25) (ii) Therateofconvergenceoftheadaptivegrouplassoestimatorisgivenby X j 2 T k ^ f AGLj f j k 2 2 = O p s n 2 s n 2 m n log( s n m n ) n + O ( s 2 n 2 s n 2 m 2 d n ) + O ( 2 n 2 m n s n 2 s n 2 ) fortheboundedresponsecaseand X j 2 T k ^ f AGLj f j k 2 2 = O p n s n 2 s n 2 m n log( s n m n ) n + O ( s 2 n 2 s n 2 m 2 d n ) + O ( 2 n 2 m n s n 2 s n 2 ) fortheunboundedresponsecase,where n isanydivergingsequence. TheproofofthistheoremisgiveninAppendixA.Thetheoremsinthissectionensures thatundermildassumptions,weareabletorecoverthetruemodelwithprobabilitytend- ingto1andachievesarateofconvergencebetterthantheinitialestimator.Particularly, iftherestrictionsof n;p n ;m n and s n intheprevioussectionsatisfy,thegrouplassoesti- 41 matorisactuallyagoodinitialestimator.Therefore,thistwostepprocedureactuallyisa completeprocedurethatgivesusawaytodothismodelselectionandestimationonanyhigh- dimensionalgeneralizedadditivemodel.However,theprocedureisnotpracticallycomplete withoutproperselectionofthetuningparameter .Therefor,weproposeatheoretically validatedtuningparameterselectioninthenextsection. 
Theconvergencerateforthegrouplassoestimatoris p n X j =1 f j ^ f nj 2 2 = O P s n 2 s n 2 m n log( p n m n ) n + O ( b n 1 2 m n s n 2 s n 2 )+ O ( s 2 n m 2 d n 2 s n 2 ) ; whilefortheadaptivegrouplassoestimatoris X j 2 T k ^ f AGLj f j k 2 2 = O p s n 2 s n 2 m n log( s n m n ) n + O ( 2 n 2 m n s n 2 s n 2 )+ O ( s 2 n 2 s n 2 m 2 d n ) Theregressiontermdi˙ersbythesizeofcandidateset.Thepricewepaybynotknowing thetruesetis log( pm n ) inthegrouplassostep,andbecomes log( s n m n ) intheadaptive grouplassostep,sincetheinitialestimatorhaverecoveredasupersetofthetruesetwith cardinality O ( s n ) .Thepenaltyterm'sdi˙erenceappearsonthetuningparameter,where n 2 isofasmallerorderthan n 1 withamultiplierof r 1 n .Accordingtoourchoiceof n 2 , ithasatrivialupperboundwhichisoforder O ( 2 n 1 ) .Therefore,thetuningparameterpart inthepenaltyconvergenceratetermbecomesquadratic.Theapproximationerrortermis nota˙ectedbytheadaptivegrouplassostep. Theadaptivegrouplassoisimportantintworeasons:˝rst,withprobabilitytendingto 1,thisisenabletoselectthetruenonzerocomponentsaccurately,whichisnotalwaysthe caseingrouplasso;second,therateofconvergenceoftheadaptivegrouplassoestimatoris 42 fasterthantherateofconvergenceofthegrouplassoestimator.Thedi˙erenceintheleading termsareintheorderof r 1 n .Thismakestheadaptivegrouplassoestimatortoachievea bettererrorwiththesamesamplesize,orthesameerrorwithasmallersamplesize. 2.4TuningParameterSelection Oneimportantissueinpenalizedmethodsischoosingapropertuningparameter.Itisknown thattheselectionresultsaresensitivetothechoiceoftuningparameter.Thetheoretical resultsonlyprovidetheorderofthetuningparameter,whichisnotveryusefulinpractice. Thereasonisthattheorderofasequencedescribesthelimitpropertieswhen n goesto in˝nity.Inreality,our n isa˝xednumber,sowemusthaveapracticalinstructionof selectingthetuningparameter. Despiteit'simportance,thereisn'tmuchdevelopmentfortuningparameterselectionin thehighdimensionalliterature.Theconventionaltuningparameterselectioncriteriatend toselecttoomanypredictors,thusishardtoreachselectionconsistency.Anotherreason, especiallyingrouplassoproblems,isthatthesolutionpathofgrouplassoispiece-wise nonlinear,whichmakesthetestingprocedureevenharder.Here,weproposethegeneralized informationcriterion(GIC)[151,52]thatsupportsconsistentmodelselection. Let ^ betheadaptivegrouplassosolutionwithtuningparameter .Thegeneralized informationcriterionisde˝nedas GIC ( )= 1 n f D (^ ; Y )+ a n j ^ T jg ; (2.26) where D (^ ; Y )=2 f l ( Y ; Y ) l (^ ; Y ) g .Herethe l ( ; Y ) isthelog-likelihoodfunction 43 inequation(2.3)expressedasafunctionoftheexpectation and Y . l ( Y ; Y ) represents thesaturatedmodelwith = Y ,and ^ = b 0 ( P p n i =1 ^ f j ( x ij ))= b 0 ( ˚ ^ ) isourestimated expectationwhenthetuningparameteris .Thehyper-parameter a n hereischosento penalizethesizeofthemodel.UsingGIC,underproperchoiceof a n ,weareabletoselect allactivepredictorsconsistently.Theresultisgiveninthefollowingtheorem. 
Theimportanceofthefollowingconsistencytheoremisthattheresultintheprevious sectionguaranteesthatwithprobabilityconvergingto1,thereexistsa n 0 thatwillbeable toidentifythetruemodel.Therefore,agoodchoiceof a n willbeabletoidentifythetrue modelwithprobabilityconvergingto1.Forasupport A ˆf 1 ;:::;p g suchthat j A j q , where q s n and q = o ( n ) ,let I ( ( A ))= E [log( f =g A )]= n X i =1 h b 0 i 0 T i ( 0 ( A )) b T i 0 )+ b T i ( A )) i (2.27) betheKullback-Leibler(KL)divergencebetweenthetruemodelandtheselectedmodel, where f isthedensityofthetruemodel,and g A isthedensityofthemodelwithpopulation parameter ( A ) .Let ( A ) bethemodelwiththesmallestKLdivergenceoverallmodels withsupport A ,andlet n =inf A 6˙ T j A q 1 n I ( ( A )) Herewenotethatif T ˆ A ,theminimizerisautomatically 0 andthustheKL-divergence iszero.Foranunder˝ttedmodels T 6ˆ A , n describeshoweasilyonecandistinguishthe modelsfromthetruemodelbymeasuringtheminimumdistancefromthetruemodeland theestestimatedmoLaterinthetheoremswewillneedtoassumelowerboundson n sothatwewillbeabletoreachourconsistencyresults.Thefollowingtheoremproves 44 thatGICworksundermildconditions. Theorem2.5. Underassumptions1-6,supposethat n K 1 R 1 n !1 , n s 1 n a 1 n !1 and a n 1 !1 ,where R n and n arede˝nedinlemmaA.3andlemmaA.4,wehave,as n !1 , P f inf 2 [ + GIC a n ( ) >GIC a n ( n 0 ) g! 1 ; (2.28) where = f 2 [ min ; max ]: T 6˙ T g ; + = f 2 [ min ; max ]: T ˙ T and T 6 = T g ; where T isthesetofpredictorsselectedbytuningparameter . min canbechosenasthe smallest suchthattheselectedmodelhassize q thatsatis˝esthetheoremassumption,and max simplycorrespondstoamodelwithnovariables. TheproofofthistheoremisgiveninAppendixA.Inpractice,achoiceof a n isproposed tobe m n log(log( n ))log( p n ) .Wehave Corollary2.3. Underassumptions1-6,withchoiceof a n = m n log(log( n ))log( p n ) ,wehave P f inf 2 [ + GIC a n ( ) >GIC a n ( n 0 ) g! 1 : Inourtwostepprocedure,therearetwotuningparameterstobeselected: n 1 inthe grouplassostepand n 2 intheadaptivegrouplassostep.Thechoiceof n 2 isofmore importance,since n 1 onlyserveastheparameterinscreening.Aslongaswehavea screeningstepthatsatis˝es2.14,wearereadyfortheadaptivegrouplassostep.Tobe simple,weproposetouseGICforselectingboth n 1 and n 2 .Asaresultoftheprevious 45 theorem,weareabletoreachselectionconsistency. 2.5OtherPossiblePenalty Inthissectionwebrie˛ydiscussedotherpossiblepenaltyfunctionsinthecontextofhigh dimensionalgeneralizedadditivemodels.Inparticularweconsiderthebestsubsetselection, whichcorrespondstoa L 0 penalty,andthebestsubsetwithshrinkagewhichcorresponds toacombinationof L 0 and L q , 0 p intheclassi˝cationsetup,inthesecondexample, weconsiderthehigh-dimensionalsetup n

p .Thedatasetisavailableat https://web.stanford. edu/~hastie/ElemStatLearn/data.html .Thisdatasethasbeenstudiedinmanydi˙erent contextswiththeobjectivebeingtopredictwhetheranemailisaspamornotbasedon afewfeaturesoftheemails.Thereare n =4601 observations,amongwhich1813(39.4%) arespams.Thereare p =57 predictors,including48continuousreal [0 ; 100] attributesof therelativefrequencyof48`spam'wordsoutofthetotalnumberofwordsintheemail, 6continuousreal [0 ; 100] attributesoftherelativefrequencyof6`spam'charactersoutof thetotalnumberofcharactersintheemail,1continuousrealattributeofaveragelength ofuninterruptedsequencesofcapitalletters,1continuousintegerattributeoflengthof longestuninterruptedsequenceofcapitalletters,and1continuousintegerattributeoftotal numberofcapitallettersinthee-mail.Thedatawas˝rstlogtransformed,sincemostofthe predictorshavelong-taileddistribution,asmentionedin[55].Theywerethencenteredand 59 standardised. Thedatawassplitintoatrainingdatasetwith3067observationsandatestingdata setwith1534observations.Wechooseorder l =4 whichimpliesacubisB-spline.We choose m n =15 ,whichimpliesthereare 11 innerknots,evenlyplacedovertheempirical percentilesofthedata.WecomparetheresultwiththelogisticregressionwithLassopenalty, thesupportvectormachine(SVM)withLassopenalty,andthesparsegrouplassoneural network(SGLNN,[53],seealso[146]).TheLassoandSMVareimplementedwiththeskikit- learnmoduleinpython,andtheSGLNNisimplementedwiththealgorithminthepaperin python.Bychangingthetuningparameterorstoppingcriterion,wegetestimationswith di˙erentsparsitylevels.Allresultsareaveragedover50repetitions.Theclassi˝cationerror withdi˙erentlevelofsparsityisshownin˝gure2.1onpage61.Thetwo-stepapproach andtheneuralnetworkperformbetterthanthelinearmethods,whichindicatesanon-linear relationship.Thetwo-stepapproachhasbestaccuracy0.944,whiletheneuralnetwork hasbestaccuracy0.946.Theneuralnetworkperformsalittlebetterthanthetwo-step approachduetoitsabilitytomodeltheinteractionsamongpredictors,butthisdi˙erence isnotsigni˝cant.However,neuralnetworkhasnointerpretationandtakeslongertotrain. Allfourmethodshaveperformanceincreaseasmorepredictorsareincluded,whichindicates thatallpredictorscontributestosomee˙ecttotheprediction.However,weareabletoreach morethan0.9accuracywithonly15predictorsincluded.WiththeGICcriterion,thetwo- stepapproachselects 14 : 6 1 : 52 predictors,withanaverageaccuracyof 0 : 914 0 : 015 .The mostfrequentlyselectedfunctionsareshownin˝gure2.2onpage62,whichalsoshowsthat thesefunctionsaretrulynon-linear.Theplotsareoftheoriginalfunctions,i.e.,beforethe logarithmtransformation.Theestimatedfunctionsareclosetotheresultsin[55],Chapter9, withslightscaledi˙erenceduetodi˙erentpenalization.Theresultsshowthattheadditive 60 modelbytheadaptivegrouplassoismoresuitableforthisdatathanlinearmodels. Figure2.1:Theclassi˝cationaccuracyagainstthenumberofnonzerovariablesmeasured onatestingsetforexample2.5over50repetitions.Thetwo-stepapproach,thelogistic regressionwithLasso,the l 1 normpenalizedSVMandthesparsegrouplassoneuralnetwork areincludedincomparison. Example2.6. Forhigh-dimensionalclassi˝cationexample,weusetheprostatecancergene expressiondatadescribedinhttp://featureselection.asu.edu/datasets.php.Thedatasethas abinaryresponse.102observationswerestudiedon5966predictorvariables,whichindicates thatthedatasetisreallyahighdimensionaldataset.Theresponseshavevalues1(50 samplepoints)and2(52samplepoints),where1indicatesnormaland2indicatestumor. Allpredictorsarecontinuouspredictors,withpositivevalues. 
Toseetheperformanceofourprocedure,weran100replications.Ineachreplication,we randomlychoose76oftheobservationsastrainingdatasetandtherest26observationsas testingdataset.Wechooseorder l =4 whichimpliesacubisB-spline.Wechoose m n =9 , whichimpliesthereare 5 innerknots,evenlyplacedovertheempiricalpercentilesofthedata. 61 Figure2.2:Theestimatedfunctionsforthemostfrequentlyselectedfunctionsforexample 2.5. Similartothelastexample,wecomparetheresultwiththelogisticregressionwithLasso penalty,theSVMwithLassopenalty,andSGLNN.Theclassi˝cationerrorwithdi˙erent levelofsparsityisshownin˝gure2.3onpage63.Fromthe˝gureweseethatcompared withlinearmethodssuchasthelogisticregressionorsupportvectormachine,thenon- parametricapproachesconvergesfaster.Thetwo-stepapproachreachesatestingaccuracy of0.945whenaround15variablesareincludedinthemodel,whilethelinearmethodsneed over30variablestoreachcompetitiveresults.Comparedwithneuralnetwork,thetwo-step approachiseasiertoimplementandperformsstabler.Adrawbackofthenon-parametric methodsistoeasilyover˝tsmallsample,andthat'sthereasontheperformancedropsastoo manyvariablesenteredthemodel.WiththeGICcriterion,thetwo-stepapproachselects 3 : 25 1 : 67 predictors,withanaverageaccuracyof 0 : 914 0 : 016 .Toshowthenon-linear relationship,˝gure2.4onpage64showstheestimatedfunctionsforthe6mostfrequently 62 selectedvariables. Figure2.3:Theclassi˝cationaccuracyagainstthenumberofnonzerovariablesmeasured onatestingsetforexample2.6over500repetitions.Thetwo-stepapproach,thelogistic regressionwithLasso,the l 1 normpenalizedSVMandthesparsegrouplassoneuralnetwork areincludedincomparison. Example2.7. Inthisexample,weinvestigatetheperformanceofthetwo-stepapproachon Gammaregression.ThedatasetisfromNationalOceanicandAtmosphericAdministration (NOAA).Weusethestormdata,whichincludestheoccurrenceofstormsintheUnited Stateswiththetime,location,propertydamage,anarrativedescriptionandetc.Herewe onlytakethedatainMichiganfrom2010to2018andkeepthenarrativedescriptionasour predictorvariableandthepropertydamageasourresponsevariable.Thedescriptionisin text,thereforeweappliedwordingembeddingalgorithmWord2vec[96]totransformeach descriptionintoanumericrepresentationvectoroflength p =701 ,similarwordembedding preprocessingcanbefoundin[77].Theresponsevariablepropertydamagehasalong taildistribution,thusweuseaGammaregressionhere.Afterremovingoutliers,thedata 63 Figure2.4:Theestimatedfunctionsforthemostfrequentlyselectedfunctionsorderedby descendinginfrequencyforexample2.6. setcontains3085observations.Inordertostudythehigh-dimensionalcase,werandomly sample 10% oftheobservationsasourtrainingdata( n =309 )andtherestareusedfor validation.Moreover,theresponseisnormalizedwiththelocationandscaleparametersof gammadistribution. Toseetheperformanceofourprocedure,weran50replications.Wechooseorder l =4 whichimpliesacubisB-spline.Wechoose m n =9 ,whichimpliesthereare 5 innerknots, evenlyplacedovertheempiricalpercentilesofthedata.Sincethere'slimitedlibrariesavail- ableinvariableselectionunderhigh-dimensionalsetup,wecomparethetwo-stepapproach withthelinearregressionwithLassoonalogarithmtransformationontheresponsevariable. Thepredictionerrorwithdi˙erentlevelofsparsityisshownin˝gure2.5onpage65.With theGICcriterion,thetwo-stepapproachselects 34 : 45 3 : 52 predictors,withanaverage 64 MSEof 0 : 004334 0 : 000115 . Figure2.5:ThetestingMSEagainstthenumberofnonzerovariablesmeasuredonatesting setforexample2.7over50repetitions.Thetwo-stepapproachandlogarithmtransformation withtheLassoareincludedincomparison. 
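All three examples above use the same B-spline expansion: order l = 4 (cubic) splines with m_n basis functions per predictor, which corresponds to m_n - 4 inner knots placed evenly over the empirical percentiles of that predictor. The sketch below reproduces this construction with scipy as an illustrative reimplementation under those assumptions; it is not the code used for the reported results, and the boundary handling (clearing nan values at the right endpoint) is a convenience choice.

    import numpy as np
    from scipy.interpolate import BSpline

    def bspline_design(x, m_n=15, order=4):
        # n x m_n design matrix for one predictor: order-4 (cubic) B-splines with
        # m_n - order inner knots at evenly spaced empirical percentiles of x.
        k = order - 1                                  # spline degree
        n_inner = m_n - order                          # e.g. m_n = 15 -> 11 inner knots
        probs = np.linspace(0, 100, n_inner + 2)[1:-1]
        inner = np.percentile(x, probs)
        lo, hi = x.min(), x.max()
        knots = np.concatenate([[lo] * order, inner, [hi] * order])
        basis = np.empty((x.shape[0], m_n))
        for j in range(m_n):
            coef = np.zeros(m_n)
            coef[j] = 1.0
            basis[:, j] = BSpline(knots, coef, k, extrapolate=False)(x)
        return np.nan_to_num(basis)                    # guard for the right boundary point

    def expand_design(X, m_n=15, order=4):
        # Expand every column of an n x p matrix into a group of m_n columns.
        blocks = [bspline_design(X[:, j], m_n, order) for j in range(X.shape[1])]
        groups = [np.arange(j * m_n, (j + 1) * m_n) for j in range(X.shape[1])]
        return np.hstack(blocks), groups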
2.7 Discussion

In this chapter, we considered the ultra high-dimensional (log p_n = O(n^a) for some constant a) generalized additive model with a diverging number of nonzero functions (s_n -> infinity as n -> infinity). After a basis expansion of the nonparametric functions, we used a two-step procedure, a group lasso step followed by an adaptive group lasso step, to select the true model. We proved the screening consistency of the group lasso estimator and the selection consistency of the adaptive group lasso estimator. The rates of convergence of both estimators were also derived, which showed that the adaptive group lasso step does improve upon the initial estimator. This chapter therefore provides a solid theoretical foundation for the existing methods. Finally, we proved that under this nonparametric setup the generalized information criterion (GIC) is a sound way to select a tuning parameter that consistently selects the true model.

In this chapter we used a fixed design for the data matrix X. A random design for X could be considered, i.e., X drawn from a continuous distribution f_X on its interval [a, b]; however, extra assumptions, such as boundedness of the density function, are needed to reach the same results. We also proved the selection consistency of the GIC procedure for the adaptive group lasso estimator, conditional on the initial estimator satisfying (2.14), a property the group lasso procedure possesses with probability tending to 1. However, the theory of selection consistency for the group lasso estimator itself remains to be established. This is a challenging problem, since a tuning parameter that gives selection consistency in the group lasso procedure need not exist, but it is an interesting question that deserves further investigation. We also discussed subset selection and subset selection with shrinkage under our setup. The theoretical investigation suggests that these alternative penalty functions may not have clear advantages over the proposed procedure.

Moreover, the heteroskedastic error case is also attractive in high-dimensional GAMs. The square-root lasso [10] has been proved to overcome this issue, but it has not been extended to the nonparametric setup. It would be interesting to apply the square-root lasso to the GAM to cover this case; this is a demanding topic that deserves further investigation as well.

Chapter 3

Sparse Neural Network

As we mentioned in the introduction chapter, the generalized additive model is appropriate in the additive case. If the variables interact with each other, it is better to use a multivariate approximation, the neural network. In this chapter, we study the sparse group lasso penalized shallow neural network. We show that under certain assumptions, this model has classification risk tending to the optimal risk, the Bayes risk. Simulations and real data examples are used to support the arguments.
3.1Introduction High-dimensionalmodelswithcorrelatedpredictorsarecommonlyseeninpractice.Most statisticalmodelsworkwelleitherinlow-dimensionalcorrelatedcase,orinhigh-dimensional independentcase.Therearefewmethodsthatdealwithhigh-dimensionalcorrelatedpre- dictors,whichusuallyhavelimitedtheoreticalandpracticalcapacity.Neuralnetworkshave beenappliedinpracticeforyears,whichhaveagoodperformanceundercorrelatedpredic- tors.Thereasonisthatthenon-linearityandinteractionsarebroughtinbytheactivation functionsandnodesinthehiddenlayers.Auniversalapproximationtheoremguarantees thatasingle-layerarti˝cialneuralnetworkisabletoapproximateanycontinuousfunction withanarbitrarilysmallapproximationerror,providedthatthereisalargeenoughnumber ofhiddennodesinthearchitecture.ThustheANNhandlesthecorrelationandinteractions 67 automaticallyandimplicitly.Apopularmachinelearningapplicationthatdealswiththis typeofdependencyisthespatio-temporaldata,wherethetraditionalstatisticalmethods modelthespatialcovariancematrixofthepredictors.However,byarti˝cialneuralnetworks, workingwiththisbigcovariancematrixcanbeavoided.Moreover,arti˝cialneuralnetworks alsohavegoodperformanceincomputervisiontasksinpractice. Amaindrawbackofneuralnetworksisthatitrequiresahugenumberoftrainingsample duetolargenumberofinherentparameters.Insomeapplication˝elds,suchasclinical trials,brainimagingdataanalysisandsomecomputervisionapplications,it'susuallyhard toobtainsuchalargenumberofobservationsinthetrainingsample.Thusthereisaneed fordevelopinghigh-dimensionalneuralnetworkswithregularizationordimensionreduction techniques.Itisknownthat l 1 normregularization[126]shrinksinsigni˝cantparameters tozero.Commonlyusedregularizationincludes l p normregularization,forexample,see thekeraspackage[26]. l p normregularizationwith p 2 controlsthemodelsensitivity [74].Ontheotherhand l p normregularizationwith p< 2 ,wherepeopleusuallytake p =1 forcomputatione˚ciency,doesnotencouragegroupinformation.Thegrouplasso regularization[148]yieldsgroup-wisesparsenesswhilekeepingparametersdensewithinthe groups.Acommonregularizationusedinhigh-dimensionalarti˝cialneuralnetworkisthe sparsegrouplassoby[114],seeforexample[53],whichisaweightedcombinationofthe lassoregularizationandthegrouplassoregularization.Thegrouplassoregularizationpart penalizestheinputfeatures'weightsgroup-wisely:afeatureiseitherselectedordropped, anditisconnectedtoallnodesinthehiddenlayerifselected.Thelassopartfurthershrinks someweightsoftheselectedinputsfeaturestozero:afeaturedoesnotneedtobeconnected toallnodesinthehiddenlayerwhenselected.Thispenalizationencouragesasmanyzero weightsaspossible.Anothercommonwaytoovercomethehigh-dimensionalityistoadd 68 dropoutlayers[117].Randomlysettingparametersinthelaterlayerstozerokeepslessnon- zeroestimationsandreducesthevariance.Dropoutlayerisprovedtoworkwellinpractice, butnotheoreticalguaranteeisavailable.[82]considersadeepnetworkwiththecombination ofregularizationinthe˝rstlayeranddropoutinotherlayers.Withadeeprepresentation, neuralnetworkshavemoreapproximationpowerwhichworkswellinpractice.Theypropose afastandstablealgorithmtotrainthedeepnetwork.However,notheoreticalguaranteeis givenfortheproposedmethodotherthanpracticalexamples. 
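As a concrete illustration of the penalty just described, the following PyTorch sketch attaches a sparse group lasso term to the input-layer weights of a shallow network: a group lasso part treating all weights leaving one input feature as a group, plus a lasso part on the individual weights. The layer names, the mixing weight alpha, and the plain subgradient training step are placeholders for illustration and are not the tuned settings or algorithm used later in this chapter.

    import torch
    import torch.nn as nn

    class ShallowNet(nn.Module):
        def __init__(self, p, m):
            super().__init__()
            self.hidden = nn.Linear(p, m)     # first-layer weight matrix: shape (m, p)
            self.out = nn.Linear(m, 1)

        def forward(self, x):
            return self.out(torch.relu(self.hidden(x))).squeeze(-1)

    def sparse_group_lasso(model, lam, alpha=0.5):
        # lam * (alpha * ||W||_1 + (1 - alpha) * sum_j ||W[:, j]||_2), where
        # column j of W collects all weights attached to input feature j.
        W = model.hidden.weight
        return lam * (alpha * W.abs().sum() + (1.0 - alpha) * W.norm(dim=0).sum())

    def train_step(model, x, y, optimizer, lam=1e-3, alpha=0.5):
        # One (sub)gradient step on the penalized binomial negative log-likelihood;
        # y is a float tensor of 0/1 labels.
        optimizer.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
        loss = loss + sparse_group_lasso(model, lam, alpha)
        loss.backward()
        optimizer.step()
        return loss.item()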
Ontheotherhand,thoughwidelyused,high-dimensionalarti˝cialneuralnetworksstill donothaveasolidtheoreticalfoundationforstatisticalvalidation,especiallyinthecaseof classi˝cation.Typicaltheoryforlow-dimensionalANNstracesbacktothe1990s,including [33,8,119,4].Theexistingresultsincludetheuniversalapproximationcapabilitiesofsingle layerneuralnetworks,theestimationandclassi˝cationconsistencyundertheGaussianas- sumptionand0-1lossinthelowdimensionalcase.Thesetheoryassumesthe0-1losswhich isnotusednowadaysandarenotsu˚cientforhigh-dimensionalcaseasconsideredhere. Currentworksfocusmoreondevelopingnewcomputingalgorithmsratherbuildingtheoreti- calfoundationsoronlyhavelimitedtheoreticalfoundations.[53]derivedaconvergencerate ofthelog-likelihoodfunctionintheneuralnetworkmodel,butthisdoesnotguaranteethe universalclassi˝cationconsistencyortheconvergenceoftheclassi˝cationrisk.Theconver- genceofthelog-likelihoodfunctionisnecessarybutnotsu˚cientfortheclassi˝cationrisk toconverge.Inthispaper,weobtainedconsistencyresultsofclassi˝cationriskforhigh- dimensionalarti˝cialneuralnetworks.Wederivedtheconvergenceratefortheprediction error,andprovedthatundermildconditions,theclassi˝cationriskofhigh-dimensionalarti- ˝cialneuralnetworkclassi˝eractuallytendstotheoptimalBayesclassi˝er'srisk.Thistype ofpropertyhasbeenestablishedonotherclassi˝erssuchasKNN[23],whichderivedthe 69 resultthattheclassi˝cationriskofKNNtendstotheBayesrisk,LDA[79],whichderivesthe classi˝cationerrorrateunderGaussianassumptionsandetc.Populartaskslikeanalyzing MRIdataandcomputervisionmodelswerealsoincludedintheseresearchpapers,andwe appliedthehigh-dimensionalneuralnetworktothesedemandingtasksaswell. Insection3.2,wewillformulatetheproblemandthehigh-dimensionalneuralnetwork formally.Insection3.3,westatetheassumptionsandthemainconsistencyresult.Insection 3.4,wecomparedthehigh-dimemsionalneuralnetworkwithotherneuralnetworkvariable selectionapproaches.Insection3.5,weapplythehigh-dimensionalneuralnetworkinthree di˙erentaspectsofexamples:thegenedata,theMRIdataandthecomputervisiondata. Insection3.6,furtherideaswillbediscussed. 3.2TheBinaryClassi˝cationProblem Considerthebinaryclassi˝cationproblem P ( Y =1 j X = x )= f ( x ) ;P ( Y =0 j X = x )=1 f ( x ) ; where x 2 R p isthefeaturevectordrawnfromthefeaturespaceaccordingtosomedis- tribution P X ,and f ( ): R p ! R issomecontinuousfunction.Noteherethat,inthe function f ( x ) ,therecanbeanyinteractionsamongthepredictorsin x ,whichensuresthe possibilitytohandlecorrelatedpredictors.Let P X ;Y bethejointdistributionon ( X ;Y ) , where X 2 R p and Y 2f 0 ; 1 g .Here p couldbelarge,andmaybeevenmuchlargerthan thetrainingsamplesize n .Tostudythetheory,weassume p hassomerelationshipwith n ,forexample, p = O (exp( n )) .Therefore, p shouldbewrittenas p n ,whichindicatesthe 70 dependency.However,forsimplicity,wesuppressthenotation p n anddenoteitwith p . Foranewobservation x 0 2 R p ,theBayesclassi˝er,denoted C ( X ) ,predicts 1 if f ( x ) p s and 0 otherwise,where p s 2 (0 ; 1) isaprobabilitythreshold,whichisusuallychosenas 1 = 2 inpractice.TheBayesclassi˝erisprovedtominimizetherisk R ( C )= Z R p 0 ; 1 g 1 f C ( X ) 6 = Y g dP X ;Y : (3.1) However,theBayesclassi˝erisnotusefulinpractice,since f ( x ) isunknown.Thusa classi˝eristobefoundbasedontheobservations f ( x 1 ;y 1 ) ;:::; ( x n ;y n ) g ,whicharedrawn from P X ;Y .Agoodclassi˝erbasedonthesampleshouldhaveitsrisktendtotheBayesrisk asthenumberofobservationstendstoin˝nity,withoutanyrequirementforitsprobability distribution.Thisistheso-calleduniversalconsistency. 
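For intuition about the target of universal consistency, note that with threshold p_s = 1/2 the Bayes risk in (3.1) equals E[min(f(X), 1 - f(X))], since at each x the Bayes rule errs with the smaller of the two conditional class probabilities. The short Monte Carlo sketch below checks this identity for an assumed toy choice of f; the function, dimension and sample size are illustrative only.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 10

    def f(x):
        # An assumed toy conditional probability P(Y = 1 | X = x).
        return 1.0 / (1.0 + np.exp(-(x[:, 0] + 0.5 * x[:, 1] * x[:, 2])))

    X = rng.uniform(-1.0, 1.0, size=(200000, p))
    prob = f(X)
    Y = rng.binomial(1, prob)

    bayes_pred = (prob >= 0.5).astype(int)            # the Bayes classifier C(x)
    empirical_risk = np.mean(bayes_pred != Y)         # Monte Carlo estimate of R(C)
    analytic_risk = np.mean(np.minimum(prob, 1.0 - prob))

    print(empirical_risk, analytic_risk)              # the two estimates agree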
Multiplemethodshavebeenadoptedtoestimate f ( x ) ,includingthelogisticregression (alinearapproximation),generalizedadditivemodels(GAM,anon-parametricnonlinear approximationwhichdonottakeinteractionsintoaccount),neuralnetworks(acomplicated structurewhichisdenseincontinuousfunctionsspace)andetc.The˝rsttwomethods usuallyworkinpracticewithgoodtheoreticalfoundation,however,theysometimesfailto catchthecomplicateddependencyamongthefeaturevector x inawiderangeofapplications (brainimages,computervisions,spatialdataanalysis).Theneuralnetworkstructureis provedtobeabletocapturethisdependencyimplicitlywithoutexplicitlyspecifyingthe dependencyhyper-parameters.Considerasinglelayerneuralnetworkmodelwith p predictor variables.Thehiddenlayerhas m n nodes,where m n maybeadivergingsequencedepending on n .Similarto p n ,wesuppress m n as m . Foraninputvector x 2 R p ,itsweightmatrix 2 R p m anditshiddenlayerintercept 71 vector t 2 R m ,letthevector ˘ 2 R m bethecorrespondingvaluesinthehiddennodes, whichisde˝nedas ˘ k = t k + T k x ;k =1 ;:::;m: Let ( ) beanactivationfunction,thentheoutputforagivensetofweight iscalculated by b + T ( ˘ ) ; wherethefunction ( ) isthefunction ( ) beingappliedelement-wisely.Wehaveawide rangeofchoicesfortheactivationfunction.[78]provedthataslongastheactivationisnot algebraicpolynomials,thesinglelayerneuralnetworkisdenseinthecontinuousfunction space,thuscanbeusedtoapproximateanycontinuousfunction.Thisstructurecanbe consideredasamodelwhich,foragivenactivationfunction ( ) ,mapsa p 1 inputvector toanreal-valuedoutput ( ; t ; ;b ) ( x )= T ( t + T x )+ b = m X k =1 k ( t k + T k x )+ b; where ( ; t ; ;b ) ( x ) 2 R istheoutputofthesinglehiddenlayerneuralnetworkwithparameter ( ; t ; ;b ) .Applyingthelogisticfunction ˙ ( ) , ˙ ( ( ; t ; ;b ) ( x )) 2 (0 ; 1) asanapproximation of f ( x ) withparameters ( ; t ; ;b ) P ( Y =1 j X = x ) ˇ ˙ ( ( ; t ; ;b ) ( x )) ;P ( Y =0 j X = x ) ˇ 1 ˙ ( ( ; t ; ;b ) ( x )) ; where ˙ ( )=exp( ) = [1+exp( )] .Accordingtotheuniversalapproximationtheorem,see[33], withabigenough m ,thesinglelayerneuralnetworkisabletoapproximateanycontinuous 72 functionwithaquitesmallapproximationerror. By[8],assumingthatthereisaFourierrepresentationof f ( x ) oftheform f ( x )= R R p e i ! T x ~ F ( d ! ) ,let B;C = f f ( ): R B k ! k 2 j ~ f ( d ! ) 0 .Thenforallfunctions f 2 B;C ,there existsasinglelayerneuralnetworkoutput ( x ) suchthat k ( x ) f ( x ) k 2 = O (1 = p m ) on B .Later[113]generalizestheresultbyrelaxingtheassumptionsontheactivation functionandimprovedtherateofapproximationbyalogarithmicfactor.Theyshowed thatonaboundeddomain ˆ R p withLipschitzboundary,assuming f 2 H r satis˝es ( f )= R R p (1+ j ! j ) m +1 j ^ f e ( ! ) j d!< 1 forsomeextension f e 2 H r ( R p ) with f e j ,ifthe activationfunction ˙ 2 W r; 1 ( R ) isnon-zeroandsatis˝esthepolynomialdecaycondition j ( D k ˙ ( t )) j C r (1+ j t j ) s forsome 0 k r andsome s> 1 ,wehave inf f m 2 NeuNet m k f f m k H r C ( s;r; ;˙ ) ( f ) m 1 = 2 ; wherethenormisinSobolevspaceoforder r ,and C ( s;r; ;˙ ) isafunctionof s , r , and ˙ only.Bothresultsensurethegoodapproximationpropertyofsinglelayerneuralnetwork, andtheconvergencerateisindependentofthedimensionof x , p ,aslongas f hasaFourier transformwhichdecayssu˚cientlyfast. Towardsbuildingthehigh-dimensionalANN,westartbyformalizingthemodel.Let X bea n p designorinputmatrix, X = 2 6 6 6 6 6 4 x 11 x 1 p x n 1 x np 3 7 7 7 7 7 5 = 2 6 6 6 6 6 4 x T 1 . . . 
x T n 3 7 7 7 7 7 5 = x (1) x ( p ) ; 73 let y bea n 1 responseoroutcomevector, y = y 1 y n T ; let bea p m parameterorinputweightmatrix, = 2 6 6 6 6 6 4 11 1 m p 1 pm 3 7 7 7 7 7 5 = 2 6 6 6 6 6 4 T (1) . . . T ( p ) 3 7 7 7 7 7 5 = 1 m ; let t bea p 1 parametervector, t = t 1 t m T ; let bea m 1 parametervectorrepresentingnodeweights, = 1 m T ; andlet b beascalarparameter. Whenonetriestobringneuralnetworkintothehigh-dimensionsetup,orequivalently, thesmallsamplesizescenario,itusuallydoesnotworkwell.Theestimabilityissue[70]arise fromthefactthatevenasinglelayerneuralnetworkmayhavetoomanyparameters.This issuemightalreadyexistinthelowdimensionalcase( n

a;X 2 >b ) (4.18) isthebivariateorthantprobabilitywithcorrelation ˆ and ) isthestandardnormaldis- tributionCDF. Proposition4.2. Underassumptions4.1,4.3and4.4,theprobabilitythatweselecta nonzeropredictorattheverybeginningusingthestage-wiseneuralnetworkselectionis P ( Anonzeropredictorentersthemodel˝rst )= s X k =1 Z 1 0 f k ( x ) p Y j 6 = k F j ( x ) dx (4.19) where F k ( x )= 1 2 erf x + j k j p 2 ˙ 2 + erf x j k j p 2 ˙ 2 and f k ( x )= @ @x F k ( x )= r 2 ˇ˙ 2 e x 2 + 2 k 2 ˙ 2 cosh k x ˙ 2 (4.20) and erf ( ) istheerrorfunction.Moreover,if max =max j =1 ;:::;s isbounded,as s !1 , 116 theprobabilityisboundedfromabove P 1 where isanontrivialquantity. ProofsofthetwopropositionsaregivenintheAppendix.Proposition4.1and4.2describe thebehaviorofneuralnetworkstage-wisealgorithmattheverybeginning.Theprobability thatweselectonepredictoroveranotherdependsonthesumoftheirsignalstrengthand thedi˙erenceoftheirsignalstrength.Thegreaterthedi˙erence,thehigherprobabilitythat wewillselectthepredictorwithhighersignalstrength.Theprobabilitythatwewillselect acorrectpredictorattheverybeginningisdescribedbytheerrorfunctionandstandard normaldensityfunctions.Thoughtheformoftheprobabilitylookscomplicated,sinceerror functioncanbeapproximatedbyanexponentialfunctionwithproperconstants,wewillbe abletoshowthatinsomecases,itisnotguaranteedthatavariableenteringthemodel˝rst isanonzerovariable.Speci˝cally,thishappenswhenwehavealowsignalstrengthora hugenumberofcandidatevariables. Sothereisaconcernthatawrongvariablewillmistakenlyenterthemodelduetoaspecial trainingcaseoftheneuralnetworkmodel,asshowninthepreviousproposition.Withthe baggingalgorithm,weareabletoeliminatethefalsepositiveselectionswithprobability tendingto1.Theintuitionisthatfalsepositiveselectionofacertainpredictorhappens duetoaspeci˝cobservationofthedesignmatrix,whichappearstobemorecorrelatedto theresponseorresidual.However,withdi˙erentsub-samplings,it'sveryunlikelythatthey yieldthesamewrongselection.Thispropertyiscapturedbythefollowingtheorem. Theorem4.1. Ifoneofthetwofollowingcaseshappen,thenineachselectionstepofthe 117 ENNSalgorithm,theprobabilityoffalsepositiveconvergestozero,i.e. P ( j 2 ^ S andj= 2S ) ! 0 asn !1 andB 2 !1 (4.21) (a) Underassumptions4.1,4.3,4.4and4.5,alsoassumethat s 0 sand p s 0 s e ˝ n n ! 0 asn !1 where ˝ n and n arede˝nedinassumption4.1. (b) Underassumptions4.24.3,4.4and4.5,alsoassumethatthenumberofvariablestobe selectedsatis˝es s 0 C s = o ( p ) forsomeconstant C . Aproofisgivenintheappendix.Invariableselectionalgorithms,themostimportant propertyistobeabletoselectionthecorrectpredictorsconsistently.Hereweshowthat ENNSenjoysthispropertyinthefollowingtheorem. Theorem4.2. Underassumptions4.2,4.3,4.4and4.5,let K n betheupperboundofthe normofthebestparametersintheneuralnetworkmodelwhen S isincluded,and K bethe sizeofthe˝rsthiddenlayer,withtheensemble,if n satis˝es K log( p s ) n log 1 1 2 e 2 n ! 0 asn !1 118 forsomeconstant c ,and K 2 n r log( nK n ) n !1 asn !1 theprobabilitythatallnonzerovariablesareselectedconvergesto1,i.e., P ( ^ S = S ) ! 1 asn !1 andB 2 !1 (4.22) Aproofisgivenintheappendix.Intheorem4.2,weshowedthatwithstrongenough signalstrengthinthetruenonzerovariables,thealgorithmwillbeabletoselectallnonzero variableswithprobabilitytendingto 1 .Theconditionsarenotveri˝ableinpractice,how- ever,extensiveexamplesinsection4.4showsthattheENNSalgorithmreachesselection consistencyeasierthantheotheralgorithms. 
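To get a feel for what Proposition 4.2 says numerically, the sketch below evaluates (4.19) with the distribution function in (4.20), writing the density as an equivalent two-term Gaussian mixture for numerical stability, and cross-checks the integral against a Monte Carlo draw of the absolute statistics it models. The signal strengths, noise scale and dimension used here are assumed illustrative values, not the settings of the simulation studies in section 4.4.

    import numpy as np
    from scipy.special import erf
    from scipy.integrate import quad

    sigma, s, p = 1.0, 5, 200          # assumed illustrative values
    mu = np.zeros(p)
    mu[:s] = 1.5                       # signal strength of the nonzero predictors

    def F(x, m):
        # P(|N(m, sigma^2)| <= x), the distribution function in (4.20).
        a = np.sqrt(2.0) * sigma
        return 0.5 * (erf((x + abs(m)) / a) + erf((x - abs(m)) / a))

    def f(x, m):
        # Density of |N(m, sigma^2)|; algebraically equal to the cosh form in
        # (4.20), written as a two-term Gaussian mixture for stability.
        c = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
        return c * (np.exp(-(x - m) ** 2 / (2 * sigma ** 2))
                    + np.exp(-(x + m) ** 2 / (2 * sigma ** 2)))

    def prob_signal_enters_first():
        # Equation (4.19): sum over the s signals of the chance that signal k
        # has the largest absolute statistic among all p candidates.
        upper = np.abs(mu).max() + 10 * sigma    # integrand negligible beyond this
        total = 0.0
        for k in range(s):
            integrand = lambda x: f(x, mu[k]) * np.prod(
                [F(x, mu[j]) for j in range(p) if j != k])
            total += quad(integrand, 0.0, upper, limit=200)[0]
        return total

    # Monte Carlo cross-check of the same probability.
    draws = np.abs(np.random.default_rng(1).normal(mu, sigma, size=(50000, p)))
    mc = (np.argmax(draws, axis=1) < s).mean()

    print(prob_signal_enters_first(), mc)        # the two numbers should agree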
Fortheestimationstep,therehasbeensomediscussionabouttheasymptoticproperties suchas[53,146],wheretheresultsofusingsparsegrouplassopenaltyaregiven.The l 1 normpenaltyisactuallyaspecialcaseofthesparsegrouplassowiththelassoparameter being 1 andthegrouplassoparameterbeing 0 .Therefore,theresultsofthesepapershold aslongaswehave ^ S = S 0 ,whichhasprobabilitytendingto 1 bytheorem4.2.Herewewill adaptthetheoryin[61]andwillprovidethefollowingresult. Theorem4.3. Assumeassumptions4.2,4.3,4.4and4.5,considerthevariablesselected fromtheENNSalgorithmandtheestimationwiththe l 1 regularizationmethod.Denote the l 1 regularizationtuningparameterwith n andthecorrespondingLagrangianparameter K n .Denotethehiddenlayersizewith k n .Intheregressionsetup,assume E ( Y 2 ) < 1 ,if 119 K n !1 , k n !1 and k n s 2 K 4 n log( k n sK 2 n ) =n ! 0 ,wehave lim n !1 ;B 2 !1 P E Z j f n ( x ) f ( x ) j 2 ( dx ) ! 0 =1 where f n istheestimatedneuralnetworkand f isthetruefunction.Intheclassi˝cation setup,assumingthattheprobabilityofresponsebeing 1 isboundedawayfrom 0 and 1 by ~ ,denotewith Q themaximumnumberofequivalentneuralnetworkclasses,choosing tuningparameter n c p k n log n=n ( p log Q + p k n log s log( nk n ) ,if log( n ) = ( n ~ 2 ) ! 0 , s 2 k n 2 n = ( n ~ 2 ) ! 0 and n 1 k 9 = 2 n s 5 = 2 p log( s n ) ! 0 as n !1 ,wehave lim n !1 ;B 2 !1 P ( R ( f n ) R ( f ) ! 0)=1 where R ( f n ) istheriskofneuralnetworkclassi˝erand R ( f ) istheriskofBayesclassi˝er. Theorem4.3statesthatunderthepreviouslydiscussedconditions,theregressionreaches weakestimationconsistencyofthenon-parametricfunctionde˝nedin[61].Fortheclas- si˝cation,theneuralnetworkclassi˝er'srisktendstotheoptimalrisk,Bayesrisk,seefor example[37].Thetheoremisadirectresultfromtheexistingresultsofthelowdimension neuralnetworkregressionmodelandclassi˝ers.Conditioningonthefactthatwecanselect allcorrectvariableswithprobabilitytendingto 1 ,applyingthefullprobabilityformula,the consistencyofthetwo-stepapproachcanbederivedwiththelowdimensionalconsistency plustheprobabilityofnon-selection-consistency. Theconsistencyerrorcomesfromtwoaspects:thevariableselectionerrorandtheesti- mationerror.Theintuitionbehindthisisthatwithawrongselectionresult,theestimation errormaybebig,however,thishappenswithprobabilitytendingtozero.Withacorrect 120 selectionresult,theestimationerrorbehavesthesameasinthelowdimensionalcase,which convergestozero. 4.4SimulationStudy Inthissection,weuseafewexamplesasnumericalsupportstoourargumentsintheprevious sections.ThecodeforDNPiscomposedaccordingtothealgorithmoutlinein[82],andthe codeofENNSiscomposedaccordingtothealgorithminthispaper.Bothcodescanbe foundat https://github.com/KaixuYang/ENNS . 4.4.1Stage-wiseCorrectSelectionProbabilityDecreasingStudy Inthissubsection,weuseanexampletodemonstratethatthechanceofselectingacorrect variableinastage-wiseneuralnetworkdecreasesaswehavemorecorrectvariablesinthe model.Consideradesignmatrix X thatisdrawnfromauniformdistributionbetween-1 and1.Thesamplesizeissetto n =1000 andthenumberofpredictorsissetto p =10000 . 
s =5 ofthepredictorsarerelatedwiththeresponse.Weconsiderthreedi˙erenttrue structuresoftherelationshipbetweenthepredictorsandtheresponse:linear,additivenon- linearandneuralnetwork.Fortheresponse,weconsidertwodi˙erentcases:regression andclassi˝cation.Inthelinearcase,thecoe˚cientsaredrawnfromastandardnormal distribution.Intheadditivenon-linearcase,thefunctionsaresetto =sin( x 1 )+ x 2 +exp( x 3 )+ x 2 4 +log( x 5 +2) 2 (4.23) 121 where y = + inthelinearcaseand prob = ˙ ( ) intheclassi˝cationcase.Inthe neuralnetworkcase,wesethiddenlayersas [50 ; 30 ; 15 ; 10] andweightsfromstandardnormal distribution. Foreachofthecases,wetestthecaseswhenwestartfrom0to4correctpredictors.In ordertoeliminatethee˙ectofdi˙erentsignalstrengthfromdi˙erentpredictors,werandom sample i indicesfrom 1 ;:::; 5 astheselectedvariables,for i =0 ;:::; 4 ,andincludethese i indicespredictorsastheinitiallyselectedvariables.Werunarepetitionof1000timesand reporttheproportionthatthenextvariablethatentersthemodelisacorrectpredictor. Theresultsaresummarizedintable 4 : 1 . ystructure0variable1variable2variables3variables4variables Reg Linear0.995(0.002)0.952(0.006)0.863(0.010)0.774(0.013)0.430(0.016) Additive0.993(0.003)0.847(0.011)0.905(0.009)0.794(0.012)0.531(0.016) NN0.998(0.001)0.971(0.005)0.932(0.007)0.788(0.013)0.574(0.016) Cls Linear0.989(0.003)0.918(0.009)0.873(0.009)0.813(0.011)0.552(0.016) Additive0.992(0.003)0.957(0.006)0.911(0.009)0.706(0.014)0.633(0.015) NN0.994(0.002)0.968(0.006)0.947(0.004)0.895(0.009)0.762(0.013) Table4.1:Theproportionofcorrectvariableselectionafter0-4correctvariablesinthemodel, fordi˙erentcasesover1000repetitions.Theresultsshowthemean.Theresultsshowthree di˙erentdatagenerationstructures:linear,additivenon-linearandneuralnetworkforboth regressionandclassi˝cation. Inthetable,weseethattheprobabilityofselectingacorrectpredictordecreasesaswe havemorecorrectpredictorsinthemodel,inallcases.Theonlyexceptionisintheregression setupwithadditivenon-linearstructurefrom1variableto2variables,whichmaydueto randomerror. 4.4.2FalsePositiveRateStudy Inthissubsection,weuseanexampletodemonstratethatthefalsepositiverateofENNS (theprobabilityofselectingawrongvariable)issuperiorthanthepurestage-wisealgorithm. 122 Notethatifonesetthenumberofvariablestobe s ,stagewisealgorithmalwaysselect5 variables,whileENNSwillstopifthereisn'tanynewvariablethatsatisfythecondition tobeadded.Therefore,it'spossiblethatENNSselectslessnumberofvariablesandavoid wrongselection.Weusedthesamesetupas[82]togenerateresponses.Twodi˙erenttypesof responsesincludingregressionandclassi˝cationwillbeconsideredhere.Theinputvariable X wasdrawnfrom U ( 1 ; 1) ,wherethefeaturedimension p was˝xedtobe 10 ; 000 .The correspondinglabelswereobtainedbypassing X intothefeedforwardneuralnetworkwith hiddenlayersizes f 50 ; 30 ; 15 ; 10 g andReLUactivationfunctions.Inputweightsconnecting the˝rst s inputswererandomlysampledfrom N (1 ; 1) forregressionand N (0 ; 1) forclassi- ˝cation.Theremaininginputweightswerekeptzero.Foreach s =2 ; 5 ; 10 ,wegenerated 1000trainingsamples.Intable4.2,wereportthefalsepositiveratebetweentheENNS algorithmandtheneuralnetworkstage-wisealgorithm. ResponseMethods=2s=5s=10 Regression ENNS10.4%(21.5%)11.5%(22.1%)12.8%(23.6%) DNP22.5%(29.5%)30.2%(28.7%)41.4%(33.2%) Classi˝cation ENNS4.7%(17.9%)7.4%(18.6%)9.8%(20.3%) DNP16.5%(24.4%)24.8%(29.7%)40.5%(32.8%) Table4.2:SelectionfalsepositiverateaverageoftheENNSandDNPunderdi˙erentnumber oftruevariablesin101repetitions.Standarddeviationsaregiveninparenthesis. 
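For reference, the data-generating mechanism used in this and the previous subsection can be written out as a short sketch: X is drawn uniformly on (-1, 1), pushed through a feedforward network with hidden sizes {50, 30, 15, 10} and ReLU activations, and only the first s input columns carry nonzero weights. The deeper-layer weight distribution, the additive noise for the regression response and the Bernoulli sampling of classification labels from the network output are assumptions of this sketch, since the text does not pin them down.

    import numpy as np

    def generate_data(n=1000, p=10000, s=5, task="regression", seed=0, noise_sd=1.0):
        # X ~ U(-1, 1); response produced by a ReLU network with hidden sizes
        # (50, 30, 15, 10) whose input weights are nonzero only for the first
        # s features: N(1, 1) entries for regression, N(0, 1) for classification.
        rng = np.random.default_rng(seed)
        X = rng.uniform(-1.0, 1.0, size=(n, p))

        sizes = [p, 50, 30, 15, 10, 1]
        input_mean = 1.0 if task == "regression" else 0.0
        weights = []
        for l_in, l_out in zip(sizes[:-1], sizes[1:]):
            mean = input_mean if l_in == p else 0.0    # deeper layers: N(0, 1), assumed
            W = rng.normal(mean, 1.0, size=(l_in, l_out))
            if l_in == p:
                W[s:, :] = 0.0                         # only the first s inputs are active
            weights.append(W)

        h = X
        for W in weights[:-1]:
            h = np.maximum(h @ W, 0.0)                 # ReLU hidden layers
        score = (h @ weights[-1]).ravel()

        if task == "regression":
            return X, score + rng.normal(0.0, noise_sd, size=n)   # additive noise: assumed
        prob = 1.0 / (1.0 + np.exp(-score))
        return X, rng.binomial(1, prob)                           # Bernoulli labels: assumed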
It can be tested that the ENNS's false positive rate is significantly lower than the false positive rate of DNP at significance level 0.05. This provides strong evidence that the ENNS is useful in reducing the probability of selecting an incorrect variable.

4.4.3 Variable Selection Simulation Study

In this subsection, we study the variable selection capability of the ensemble neural network selection (ENNS) algorithm in the complicated setup. We used a similar setup as in the last subsection to generate responses. Two different types of responses, regression and classification, are considered. The input variable X was drawn from U(-1, 1), where the feature dimension p was fixed to be 10,000. The corresponding labels were obtained by passing X into a feedforward neural network with hidden layer sizes {50, 30, 15, 10} and ReLU activation functions. Input weights connecting the first s inputs were randomly sampled from N(0, 2) for regression and N(0, 1) for classification. The remaining input weights were kept zero. The DNP model was coded according to its authors' algorithm outline in python with pyTorch. The ENNS algorithm is based on the DNP algorithm with an ensemble wrapper. The LASSO [126] is implemented with the scikit-learn library, and the HSIC lasso [144] is implemented using the HSICLasso library. In all four algorithms, the number of selected variables is strictly restricted to the true number of variables. In the ENNS, we run a bagging of 10 rounds with selection proportion 0.3; a sketch of this ensemble wrapper is given after table 4.4 below. We report the average number of correct variables selected over 101 repetitions of the data generation in table 4.3.

Response         Method       s=2          s=5          s=10
Regression       ENNS         1.73 (0.52)  4.21 (0.56)  9.25 (1.11)
                 DNP          1.61 (0.50)  3.92 (0.56)  8.77 (1.13)
                 LASSO        1.65 (0.57)  3.87 (0.62)  9.62 (1.38)
                 HSIC-LASSO   1.67 (0.47)  3.80 (0.66)  3.61 (1.17)
Classification   ENNS         1.81 (0.49)  4.24 (0.87)  8.04 (1.25)
                 DNP          1.67 (0.76)  3.76 (1.06)  5.95 (1.29)
                 LASSO        1.67 (0.56)  3.76 (0.75)  5.76 (1.38)
                 HSIC-LASSO   1.67 (0.47)  2.80 (0.91)  3.61 (1.17)

Table 4.3: Variable selection capacity of ENNS and other methods with low signal strength in the regression (top) and classification (bottom) setup. The numbers reported are the average number of selected variables which are truly nonzero. The standard errors are given in parentheses.

We observe that the ENNS outperforms the other variable selection algorithms in nearly all cases (the exception being the lasso at s = 10 in the regression setup), and the difference is significant when s = 10 under a t-test. The ENNS performs better when there are more nonzero variables. None of the algorithms were able to reconstruct the original set of feature indices exactly, for a few reasons: the sample size is relatively small compared to the number of variables; the data generation through a neural network structure is complicated; and the signal strength is relatively low.

To study the variable selection power of the ENNS algorithm more fully, we implemented another simulation case in classification with a higher signal strength while keeping all other conditions the same. In this simulation study, we increase the mean of the weights of the nonzero variables to 3.5. With the same implementation, we summarize the results in table 4.4. Moreover, table 4.5 summarizes the results for signal strength 10.

Method       s=2          s=5          s=10
ENNS         2.00 (0.00)  4.71 (0.55)  8.38 (2.06)
DNP          1.86 (0.35)  4.38 (0.84)  7.43 (2.36)
LASSO        1.81 (0.39)  4.19 (1.01)  7.47 (2.40)
HSIC-LASSO   1.71 (0.45)  3.71 (1.12)  4.95 (2.13)

Table 4.4: Variable selection capacity of ENNS and other methods with normal signal strength. The numbers reported are the average number of selected variables which are truly nonzero. The standard errors are given in parentheses.
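The ensemble wrapper itself can be summarized as follows: run the base stage-wise selector on B random subsamples and keep the features selected in at least a given proportion of rounds (10 rounds with proportion 0.3 in the experiments above). In this minimal sketch the base selector is an arbitrary placeholder callable, and the subsample fraction and interface are assumptions; the actual ENNS uses the stage-wise neural network selection described earlier in this chapter as the base learner.

    import numpy as np

    def ensemble_selection(X, y, base_selector, n_rounds=10, proportion=0.3,
                           subsample=0.8, seed=0):
        # Run the base selector on random subsamples and keep the features
        # chosen in at least `proportion` of the `n_rounds` runs.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        votes = np.zeros(p)
        for _ in range(n_rounds):
            idx = rng.choice(n, size=int(subsample * n), replace=False)
            for j in base_selector(X[idx], y[idx]):
                votes[j] += 1
        return np.flatnonzero(votes >= proportion * n_rounds)

    # Placeholder base selector (marginal screening) purely to make the sketch
    # runnable; the actual ENNS plugs in stage-wise neural network selection.
    def toy_selector(X, y, k=5):
        return np.argsort(np.abs(X.T @ (y - y.mean())))[-k:]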
TheENNSreachesselectionconsistencywhen s =2 ,whiletheothercomparedalgorithms stilldonothaveselectionconsistency.However,allalgorithmshaveobviousimprovementsin allcases.Wehavetoadmitthatselectingthecorrectsubsetofvariablesinall101repetitions isextremelychallenging,sincethedatahavegreatvariableindi˙erentrepetitions.Moreover, when s getsgreater,theimportanceofafewvariablesarelesslikelytobeobservedfromthe data. Moreover,asweknow,thebaggingalgorithmcanbeparalyzedsincedi˙erentrunsare independentofeachother.Therefore,thecomputationale˚ciencyofthisvariableselection algorithmisalmostthesameasthecomputatione˚ciencyofasinglerun. 125 Methods=2s=5s=10 ENNS2.00(0.00)5.00(0.00)9.90(0.29) DNP2.00(0.00)5.00(0.00)9.47(1.10) LASSO2.00(0.00)4.90(0.29)9.23(1.74) HSIC-LASSO2.00(0.00)4.62(0.84)7.76(2.76) Table4.5:VariableselectioncapacityofENNSandothermethodswithhighsignalstrength. Thenumbersreportedaretheaveragenumberofselectedvariableswhicharetrulynonzero. Thestandarderrorsaregiveninparenthesis. 4.4.4EstimationSimulationStudy Inthissubsection,wecomparetheestimationmethodsinsection4.2.Tofullystudythe di˙erencebetweenthesemethodswithoutthee˙ectsofotherfactors,inthissubsectionwe assumecorrectselectionandperformtheestimationonthecorrectsubsetofvariables.The dataaregeneratedaccordingtothesameschemeasinthelastsubsection.Wewillcompare theperformanceofthesedi˙erentestimationmethodsfor s =2 ; 5 ; 10 assumingthatweknow thecorrectsubsetofvariables.Thesimulationisrunon101repetitionsofdatageneration usingdi˙erentseeds.Intheresults,wereporttheRMSE,theMAEandtheMAPEfor regression,andtheaccuracy,theaucscoreandthef1scoreforclassi˝cation.Thesearein table4.6. Onaverage,wesee l 1 normregularizationgivesbestperformance,exceptfortheMAPE of s =10 inregression.Moreover,weobservethatbothbuilt-in l 1 andsoft-thresholdinggives smallerstandarderrors,whichcoincideswiththeshrinkageestimator'sproperties.However, soft-thresholdingprovidesbetterperformanceinaveragethanbuilt-in.Thereasonisthat sparsityisnotwellsupportedwithmostlibraries,thusamanualoperationisneededto obtainsparsity. 126 4.4.5VariableSelectionandEstimation Inthissubsection,westudythepredictioncapabilityofthetwo-stageapproachENNS algorithmwith l 1 neuralnetwork,andcompareitwiththeDNPmodel,thelogisticregression andtheHSIC-lassowithSVM.Weusethesameneuralnetworkstructuretogeneratedata asinthissection.Over101repetitions,wereporttheaverageRMSE(rootedmeansquared error),MAE(meanabsoluteerror)andMAPE(meanabsolutepercenterror)forregression andaverageaccuracy,AUCandF1Scoreforclassi˝cation,aswellastheirstandarderrors. Theresultsaresummarizedintable4.7. Weobservethatourproposedalgorithmobtainedaslightperformanceboostviathe ensemblemethod.Moreover,thestandarderrorsoftheseresultsareslightgreaterthan thestandarderrorsinthelastsubsection,wheretheestimationwasdoneassumingcorrect selection.Theincreaseofstandarderrorsismainlyduetotheselectionvariations. 
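The manual soft-thresholding operation mentioned in the estimation comparison above (subsection 4.4.4) can be made explicit. The sketch below applies elementwise soft-thresholding to the input-layer weights of a trained PyTorch network, so that weights driven merely close to zero by l1-penalized gradient training become exactly zero; the threshold value is a tuning choice, and the attribute name `hidden` for the first layer is an assumption carried over from the earlier sketch rather than the dissertation's implementation.

    import torch

    @torch.no_grad()
    def soft_threshold_layer(layer, tau):
        # Elementwise soft-thresholding S(w) = sign(w) * max(|w| - tau, 0),
        # applied in place to a linear layer's weight matrix.
        w = layer.weight
        w.copy_(torch.sign(w) * torch.clamp(w.abs() - tau, min=0.0))
        return (w == 0).float().mean().item()      # achieved sparsity level

    # Usage after l1-penalized training of a network with first layer `hidden`:
    #   sparsity = soft_threshold_layer(model.hidden, tau=1e-3)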
4.4.6CorrelatedPredictors Inthissubsection,weuseanexampletostudythenumericalperformanceoftheproposed algorithmincorrelatedpredictorsituation.Wewillconsidertwodi˙erentcorrelations: ˆ = 0 : 3 and ˆ =0 : 7 .Asacomparison,theresultsfor ˆ =0 willalsobeincluded.Let u 1 ;:::;u n bei.i.d.standardnormalrandomvariables, x ij bei.i.d.standardnormalrandomvariables, whichareindependentof u 1 ;:::;u n ,for i =1 ;:::;n and j =1 ;:::;p .Dothetransformation x ij =( x ij + tu i ) = p 1+ t 2 forsome t ,thenweobtainedstandardnormalcorrelatedvariables cor ( x ij ;x ik )= t 2 1+ t 2 ;i =1 ;:::;n ; j =1 ;:::;p 127 Taking t = p 3 = 7 givescorrelation0.3andtaking t = p 7 = 3 givescorrelation0.7.Thenwe truncatetherandomvariablestointerval [ 1 ; 1] .Thestructuretogenerateresponseiskept thesameasinthelastsubsection.Theresultsofvariableselectionandestimationisgiven intable4.8. 128 yMetricMethods=2s=5s=10 Reg RMSE NeuralNetwork31.24(13.32)69.46(37.40)136.64(60.54) Xavier18.64(11.07)58.89(27.73)136.58(65.57) l 1 built-in20.47(9.62)59.37(23.61)129.55(50.48) l 1 soft5.97(4.18)45.83(33.06)109.31(47.24) Stage-wise10.59(11.20)47.64(22.69)117.65(43.96) Bagging25.48(10.89)59.49(26.53)133.45(59.72) MAE NeuralNetwork16.45(10.91)52.85(28.47)103.76(45.99) Xavier13.65(8.06)45.34(22.18)105.17(53.66) l 1 built-in15.56(7.76)45.32(18.34)98.54(38.36) l 1 soft4.37(3.02)35.49(26.21)83.85(36.45) Stage-wise7.91(8.23)38.86(20.00)89.82(33.82) Bagging14.77(7.92)43.16(20.51)99.38(41.66) MAPE NeuralNetwork0.012(0.015)0.030(0.026)0.033(0.026) Xavier0.009(0.009)0.027(0.022)0.029(0.017) l 1 built-in0.011(0.012)0.029(0.023)0.032(0.021) l 1 soft0.005(0.007)0.017(0.010) 0.029(0.023) Stage-wise 0.007(0.007)0.019(0.015) 0.027(0.016) Bagging0.010(0.010)0.026(0.024)0.030(0.022) Cls Accuracy NeuralNetwork0.944(0.026)0.886(0.037)0.841(0.041) Xavier0.952(0.026)0.891(0.034)0.831(0.036) l 1 built-in0.927(0.031)0.844(0.085)0.752(0.110) l 1 soft0.964(0.028)0.908(0.029)0.855(0.031) Stage-wise0.945(0.030)0.886(0.038)0.804(0.042) Bagging0.877(0.069)0.806(0.068)0.753(0.087) AUC NeuralNetwork0.942(0.027)0.882(0.038)0.837(0.042) Xavier0.951(0.027)0.891(0.034)0.825(0.037) l 1 built-in0.924(0.031)0.833(0.100)0.734(0.123) l 1 soft0.964(0.029)0.905(0.029)0.851(0.032) Stage-wise0.943(0.031)0.884(0.038)0.800(0.041) Bagging0.877(0.065)0.803(0.063)0.751(0.084) F1Score NeuralNetwork0.943(0.027)0.887(0.045)0.841(0.049) Xavier0.952(0.026)0.892(0.041)0.832(0.048) l 1 built-in0.927(0.031)0.824(0.192)0.732(0.200) l 1 soft0.965(0.026)0.908(0.036)0.857(0.033) Stage-wise0.944(0.031)0.883(0.042)0.806(0.049) Bagging0.870(0.077)0.792(0.060)0.748(0.088) Table4.6:Predictionresultsonthetestingsetusingneuralnetworkswithandwithout l 1 normregularizationfor s =2 ; 5 ; 10 .RMSEisrootedmeansquarederror,MAEismean absoluteerror,andMAPEismeanabsolutepercenterror.Accuracyisthepercentageof correctprediction,aucisareaundertheROCcurve,andf1scoreistheinverseofinverse precisionplustheinverserecall. 
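The correlated design constructed in subsection 4.4.6 above can be generated directly, as in the sketch below; the empirical pairwise correlation is printed as a check. Reading "truncate to [-1, 1]" as clipping is an assumption of this sketch, and the clipping pulls the realized correlation slightly toward zero.

    import numpy as np

    def correlated_design(n, p, rho, seed=0):
        # x*_ij = (x_ij + t * u_i) / sqrt(1 + t^2) with x_ij, u_i iid N(0, 1)
        # gives cor(x*_ij, x*_ik) = t^2 / (1 + t^2) = rho; entries are then
        # truncated to [-1, 1] (here by clipping, an assumption of the sketch).
        rng = np.random.default_rng(seed)
        t = np.sqrt(rho / (1.0 - rho))     # rho = 0.3 -> sqrt(3/7), 0.7 -> sqrt(7/3)
        u = rng.normal(size=(n, 1))
        Z = rng.normal(size=(n, p))
        X = (Z + t * u) / np.sqrt(1.0 + t ** 2)
        return np.clip(X, -1.0, 1.0)

    X = correlated_design(2000, 20, rho=0.3)
    emp = np.corrcoef(X, rowvar=False)
    print(emp[np.triu_indices_from(emp, k=1)].mean())   # roughly 0.3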
129 yMetricMethods=2s=5s=10 Reg RMSE ENNS+l115.67(30.35)48.14(21.16)174.08(65.38) DNP25.42(33.16)62.63(29.02) 178.91(60.15) Lasso79.44(67.31)104.19(49.38)192.04(77.34) HSIC-Lasso56.32(59.41)86.77(47.51)188.35(56.48) MAE ENNS+l112.03(23.68)40.12(19.95)132.07(44.99) DNP20.15(27.15)47.85(22.31) 136.06(45.95) Lasso64.11(54.63)81.97(39.76)147.86(60.21) HSIC-Lasso42.89(34.66)70.04(41.23)144.37(48.15) MAPE ENNS+l10.012(0.025)0.028(0.036)0.041(0.037) DNP0.017(0.028)0.032(0.032) 0.042(0.041) Lasso0.042(0.029)0.046(0.035)0.046(0.025) HSIC-Lasso0.033(0.021)0.036(0.025)0.048(0.024) Cls Accuracy ENNS+l10.967(0.029)0.848(0.025)0.756(0.067) DNP0.933(0.076)0.822(0.068)0.736(0.064) Lasso0.732(0.103)0.726(0.071)0.692(0.075) HSIC-Lasso0.805(0.094)0.798(0.094)0.706(0.081) AUC ENNS+l10.959(0.036)0.834(0.024)0.709(0.058) DNP0.898(0.148)0.780(0.100)0.699(0.052) Lasso0.652(0.121)0.640(0.102)0.625(0.068) HSIC-Lasso0.774(0.125)0.748(0.121)0.677(0.061) F1-Score ENNS+l10.962(0.037)0.859(0.036)0.708(0.089) DNP0.903(0.208)0.761(0.199)0.705(0.100) Lasso0.590(0.299)0.604(0.250)0.634(0.192) HSIC-Lasso0.744(0.206)0.731(0.242)0.666(0.208) Table4.7:ModelperformanceofthecombinationofENNSalgorithmand l 1 thresholding estimation,comparedwithDNP,LassoandHSIC-Lassofor s =2 ; 5 ; 10 casesinbothre- gressionandclassi˝cation.Theaverageperformanceof101repetitionswiththeirstandard errorsinparenthesisarepresented. 130 131 ResModel selectionestimation ˆ =0 : 0 ˆ =0 : 3 ˆ =0 : 7 ˆ =0 : 0 ˆ =0 : 3 ˆ =0 : 7 Reg ENNS+ l 1 3.81(0.79)3.27(0.75)2.29(0.70)40.82(19.46)37.17(27.29)43.18(44.47) DNP3.48(0.96)2.95(0.79)2.14(0.56)81.43(46.00)92.91(65.25)101.15(90.63) LASSO3.38(0.90)2.85(0.79)2.11(1.12)131.37(74.22)151.16(108.88)113.30(97.54) Cls ENNS+ l 1 3.66(1.05)3.25(0.76)2.38(0.72)0.856(0.040)0.875(0.061)0.907(0.030) DNP3.62(1.09)3.43(0.91)2.71(1.03)0.774(0.100)0.766(0.106)0.793(0.092) LASSO3.55(0.79)2.90(1.31)1.95(0.72)0.598(0.083)0.634(0.117)0.683(0.116) Table4.8:Selectionandestimationcomparisonforpredictorswithcorrelation0,0.3and0.7.Thenumberofnonzeropredictors issetto5.Forselection,theaveragenumberofcorrectselectedvariableswithitsstandarderrorisgiven.Forestimationthe averageRMSEorAUCwiththeirstandarderrorsisgiven.Theresultsareaveragedover101repetitions. Fromthetableweseethatinthecorrelatedcasesthemodelworksalmostaswellas whenthere'snocorrelation.Allmodelsselectlessvariableswhenthecorrelationishigher, andthisisawell-knownsymptomofvariableselectionwithcorrelatedpredictors.However, thisdoesnota˙ecttheestimationstep,andinsomecasesevenmakestheestimationresults better.Thereasoncouldbethatwehavelessvariablesthusthemodelissimpler.Since thepredictorsarecorrelated,wedonotlosetoomuchinformationbynotselectingsome ofthem.Moreover,someresultsnotinthetableincludesthefalsepositiverate,wherethe averageforENNSis 0 : 05 0 : 03 ,whilethatoftheDNPis 0 : 29 0 : 14 .Therefore,ENNS includeslessredundantvariablesintheestimationstepandachievesbetterperformance. 4.5RealDataexamples Inthissection,weevaluatetheperformanceofthetwo-stepmodelonsomerealworlddata sets. 
4.5.1VariableSelection:MRIData Inthisexample,weevaluatethevariableselectioncapabilitywithothervariableselec- tionmodels,andcomparetheresultswiththebiologicalgroundtruth.Thedataused inthisexamplecomefromAlzheimer'sdiseaseneuroimaginginitiatives(ADNI),see http: //adni.loni.usc.edu/ .Thedataincludes n =265 patients'neuroimagingresults,includ- ing164Alzheimer's(AD)patientsand101cognitivelynormal(CN)individuals.145regions ofinterested(ROIs)spanningtheentirebrainwerecalculatedusingMulti-AtlasROIsegmen- tation,and114ROIswerederivedbycombiningsingleROIswithinatreehierarchytoobtain volumetricmeasurementsfromlargerstructures.Therefore, p =259 ROIswereusedinthis 132 example.Detailsofthepre-processingmethodcanbefoundat https://adni.bitbucket. \io/reference/docs/UPENN_ROI_MARS/Multi-atlas_ROI_ADNI_Methods_mod_April2016. pdf .AmongthoseROIs,biologicallyimportantfeaturesarepicked,seetable4.9,wherered indicatedmostimportant,yellowmeanssecondlyimportant,andgreenmeansthirdlyim- portant.ThecombinationsandallotherROIsarenotlisted. Thefulldatasetisusedforvariableselection,andtheselectionresultischosensuchthat a3-foldcross-validationperformsbest.WeruntheENNSalgorithmalongwiththeLASSO andtheDNP.Theselectionresultarepresentedintable4.9.Notehereifamodelselectsa simplecombinationofsomefeatures,thesefeaturesarealsomarkedasselected.Moreover, table4.10showsthenumberofcombinedfeaturesselectedforthemodelsandnumberof falsepositiveselections.WeobservethatLASSOmissesalotofimportantfeaturesand selectedaboutonly1/4ofthecombinedfeaturesasneuralnetworks.Thisindicatesthatthe featuresmayhavecomplicatedrelationshipwiththeresponse.ENNSperformsbetterthan theshallowDNPintermsofthemetricsintable4.10,whereISisaweightedaveragescore withtheweightsforred,yellowandgreenbeing3,2and1,respectively;NIisthenumberof selectedimportantvariables;andNUisthenumberofselectedunimportantvariables.Asa propertyoftheENNS,itselectslessfalsepositivevariables.It'shardtotrackthecombined features,sincealotareinvolved,however,thecombinationsrepresentbiologicalintuitions. Neuralnetworksselectsmorecombinedfeaturesandperformbetterinthissense. 133 134 ROI R31 R32 R36 R37 R38 R39 R40 R41 R47 R48 R49 R50 R51 Lasso X X X X X DNP X X X X X X X ENNS X X X X X X X X X X X X ROI R52 R55 R56 R57 R58 R59 R60 R81 R82 R85 R86 R87 R88 Lasso X X DNP X X X X X X ENNS X X X X X X X ROI R100 R101 R102 R103 R106 R107 R116 R117 R118 R119 R120 R121 R122 Lasso X X X X DNP X X X X X X ENNS X X X X X ROI R123 R124 R125 R132 R133 R136 R137 R138 R139 R140 R141 R142 R143 Lasso X X X X X X X X DNP X X X X X X X X ENNS X X X X X X X X X ROI R146 R147 R152 R153 R154 R155 R162 R163 R164 R165 R166 R167 R168 Lasso X X X X X X DNP X X X X ENNS X X X X X X X X ROI R169 R170 R171 R178 R179 R186 R187 R190 R191 R194 R195 R198 R199 Lasso X X X X X X DNP X X X X X ENNS X X X X X X ROI R200 R201 R202 R203 R204 R205 R206 R207 Lasso X X X DNP X X X X ENNS X X X X Table4.9:VariableselectionresultfortheADdata.Thetableincludesallbiologicallyimportantvariableswiththreelevels: red(veryimportant),yellow(secondlyimportant)andgreen(thirdlyimportant).Thenon-importantvariablesarenotincluded inthemodel.Checkmarksindicatewhetherthecorrespondingalgorithmselectedthevariableornot. 
VariableselectionmethodISNINU LASSO1.09432/8625/59 DNP1.42840/8615/59 ENNS1.62449/866/59 Table4.10:VariableselectionresultfortheADdata.ThereportednumbersincludeIS, theweightedaverageofselectedimportantvariableswiththeweightsbeing3,2and1for red(mostimportant),yellow(secondlyimportant)andgreen(thirdlyimportant),respec- tively;NI,numberofimportantvariablesselected;andNU,numberofunimportantvariables selected. 4.5.2Regression:Ribo˛avinProductionData Inthisexample,weconsidertheribo˛avinproductionwithbacillussubtilisdata,whichis publiclyavailableinthe`hdi'packageinR.Thedatasetcontainsacontinuousresponse, whichisthelogarithmoftheribo˛avinproductionrate,and p =4088 predictorswhichare thelogarithmoftheexpressionlevelof4088genes.Thereare n =71 observationsavailable. Allpredictorsarecontinuouswithpositivevalues. Weperform50repetitionsofthefollowingactions.Thedataissplitintotraining(56)and testing(15)observations.Thetrainingdataisfurthersplitintotraining(49)andvalidation (7).Thetrainingdataisnormalizedwithmeanzeroandstandarddeviationone.Wetrain theENNSalgorithmtoselectvariablesandperformthe l 1 neuralnetworktomakeprediction. Alongwithourproposedalgorithms,wecomparetheperformancewiththelassopenalized linearregression,whichisimplementedbythescikit-learnlibraryinpython;thegrouplasso penalizedgeneralizedadditivemodelin[145],wherethecodecanbefoundat https:// github.com/KaixuYang/PenalizedGAM ;andthesparsegrouplassopenalizedneuralnetwork in[53].˝gure(4.1)showstheaveragetestingmeansquarederror(MSE)alongwiththe numberofselectedfeaturesfordi˙erentmodels.Ourproposedalgorithmconvergesfast andperformscompetitive.table4.11showstheaveragepredictionaccuracywithstandard errorinparenthesisandthemediannumberofvariablesselected.Ourproposedmethodhas 135 ModelTestMSENumberoffeatures ENNS + l 1 neuralnetwork0.268(0.115)42 Regularizedneuralnetwork0.273(0.135)44 LinearmodelwithLASSO0.286(0.124)38 Generalizedadditivemodelwithgrouplasso0.282(0.136)46 Table4.11:TestMSEwithstandarderrorinparenthesesandmedianofnumberoffeatures fordi˙erentmodelsintheribo˛avingenedataexample. competitivemeanperformancebutlowerstandarderror. The˝nalmodelofsmallsampleutilizesonly2hiddenlayers,withover90%sparsityto preventover-˝tting,whichisnecessaryforthissmalltrainingsamplesize,49.Trainingwith largebatchsize,smalllearningrate,hugenumberofepochsandearlystoppinghelpthe modellearnbetterandpreventover-˝tting.Weadmitthattuningthenetworkstructure andlearningparametersarehard,butweobtainbetterandstablerresultsoncewehavethe rightnumbers. 4.5.3Classi˝cation:ProstateCancerData Inthisexample,weconsideredaprostatecancergeneexpressiondata,whichispublicly availablein http://featureselection.asu.edu/datasets.php .Thedatasetcontainsa binaryresponsewith102observationson5966predictorvariables.Clearly,thedataset isreallyahighdimensionaldataset.Theresponseshavevalues1(50samplepoints)and 2(52samplepoints),where1indicatesnormaland2indicatestumor.Allpredictorsare continuouspredictors,withpositivevalues. 
Weperform50repetitionsofthefollowingactions.Thedataissplitintotraining(81)and testing(21)observations.Thetrainingdataisfurthersplitintotraining(70)andvalidation data(11).Ineachsplit,thenumberofclass0observationsandnumberofclass1observations 136 Classi˝erTestaccuracyNumberoffeatures ENNS + l 1 neuralnetwork0.956(0.053)15 Regularizedneuralnetwork0.955(0.066)18 LogisticRegressionwithLasso0.933(0.058)36 l1penalizedLinearSVM0.950(0.052)16 Generalizedadditivemodelwithgrouplasso0.918(0.061)5 Table4.12:Testaccuracywithstandarderrorinparenthesesandmedianofnumberof featuresfordi˙erentclassi˝ersintheProstategenedataexample. arekeptroughlythesame.WetraintheENNSalgorithmtoselectvariablesandperformthe l 1 neuralnetworktomakepredictions.Alongwithourproposedalgorithms,wecomparethe performancewiththe l 1 normpenalizedlogisticregression;the l 1 supportvectormachine (SVM),bothofwhichareimplementedwiththescikit-learnlibraryinpython;thegroup lassopenalizedgeneralizedadditivemodelin[145],wherethecodecanbefoundat https:// github.com/KaixuYang/PenalizedGAM ;andthesparsegrouplassopenalizedneuralnetwork in[53].˝gure(4.2)showstheaveragetestingaccuracyoverthe20repetitionsalongwith thenumberofselectedfeaturesfordi˙erentmodels.Ourproposedalgorithmconvergesfast andperformscompetitive.table4.12showstheaveragepredictionaccuracywithstandard errorinparenthesis,andthemediannumberofvariablesselected.Ourproposedmethods hascompetitivemeanperformancebutlowerstandarderror.Oneneedstonoticethatthe meanperformanceishardtoimprovefurther,sincetheresultsarealreadygoodandreach thebottleneckofthecurrentexplanatoryvariables.ThereasonthatGAMperformsworse thantheothermodelsisthattherangeofpredictorvariablesarerelativelysmallandskewed, thusthebasisexpansiononGAMdoesnotworkwell. 137 Figure4.1:Testingmeansquarederror(MSE)fordi˙erentmodelsontheribo˛avindata. 138 Figure4.2:Testingaccuracyfordi˙erentmodelsontheprostatecancerdata. 139 4.6Conclusion Inthispaper,wediscussedtheexistingmethodstodealwithhigh-dimensionaldataandhow toapplythestage-wisealgorithmwithneuralnetworks.Wediscussedtheshortageofthe currentstage-wiseneuralnetworkvariableselectionalgorithmsandproposedtheENNSto overcometheseshortage.Moreover,wealsocompareddi˙erentmethodstofurtherreduce theover-˝ttingissueafterthevariableselectionstep.Methodologywasgiventosupportthe argumentandnewalgorithm,andsimulationstudiesweregivenasextraevidenceforthe algorithmtowork. Thoughthere'safewsimulationandmethodologystudyinneuralnetworkvariablese- lection,thetheoryforneuralnetworkstilldeservesmuchmoreinvestigation.Wehopethis papercouldworkasapioneerandattractsmorepeople'sattentiontotheneuralnetwork theory˝eld. 
140 Chapter5 Epilogue Themaingoalofthisdissertationistoestablishtheoryandproposenewmethodologiesfor non-parametricstatisticalmachinelearningmodelingforthehigh-dimensionallowsample size(HDLSS)problems.Wehavestudiedthreedi˙erentnon-parametricapproaches:the generalizedadditivemodel(GAM)inChapter2,thesparsegrouplassopenalizedshallow neuralnetworkinChapter3andtheensembleneuralnetworkselection(ENNS)algorithm inChapter4.Thesethreemodelsareappropriateindi˙erentsituations.TheGAMis usefulwhenthereisnottoomuchinteractionsamongthevariablesorthereisonlyweak interactionsamongthevariables,whichhasmanyapplications.Actually,inmost˝elds, theadditivestructureisenoughtoobtainafairlygoodresult.Theshallowneuralnetwork isusedwhenthereareinteractionsamongthevariables,buttherelationshipbetweenthe responseandthevariablesisnottoocomplicated.Ifwehavehugeinteractionsandvery complicatedrelationships,adeepneuralnetworkmakesbetterapproximationtothisrela- tionship.However,ontheotherhand,thelevelofdi˚cultyintrainingthethreemodelsalso increasegradually,especiallyit'seasytogetstuckinalocalminimumwhentrainingthe neuralnetwork.Fromthisaspect,GAMisaconvexoptimizationproblemandisguaranteed toreachaglobalminimum,thushassmallertrainingdi˚culty. Thetheoryweprovedalsomakegreatcontributiontowardsthis˝eld.Thesetheoretical resultsguaranteethatthesemethodswillworkundercertainconditions,whichisbettersup- 141 portthanthenumericalresultsonly.Moreover,wehaveappliedthesemethodstodi˙erent ˝eldsofexamples,includinggeneticdata,computervisiondata,autonomousdrivingdata andMRIdata.Thesenon-parametricmodelsareprovedtoworkindi˙erentaspects. Inthefuture,theneuralnetworkstillhaveahugeroomtoinvestigate.Howdowe˝nd amorestablewaytotrainaneuralnetwork?Howdowepreventneuralnetworkfrom over˝ttingsmallsample?Howdoweeliminatethegapbetweentheglobalminimumusedin neuralnetworktheoryandthelocalminimumobtaininpractice?Theseareleftforfuture research. 142 APPENDICES 143 APPENDIXA TechnicalDetailsandSupplementary MaterialsforChapter2 DerivationofAssumption2.2 Thoughassumption2.2isimposedonthe˝xeddesignmatrix,however,itholdsifthedesign matrix X isdrawnfromacontinuousdensityandthedensity g j of X j isboundedawayfrom 0andin˝nityby b and B ,respectively,ontheinterval [ a;b ] .Let A bethesub-vectorof whichincludeallnonzeroentries.Withoutlossofgenerality,let A = f 1 ;:::; k g ,where k 2 R m n and k = O ( s n ) .Let A bethecorrespondingsub-matrixof . Bylemma3in[120],ifthedesignmatrix X isdrawnfromacontinuousdensityand thedensity g j of X j isboundedawayfrom0andin˝nityby b and B ,respectively,onthe interval [ a;b ] ,andcard B ( )= O ( s n ) ,wehave k 1 1 + ::: + k k k 2 k 1 2 ( k 1 1 k 2 + ::: + k k k k 2 ) forsomepositiveconstant 2 suchthat 0 < 1 2 2 2 < 1 ,where 0 =((1 bB 1 ) = 2) . Togetherwiththetriangleinequality,wehave k 1 2 ( k 1 1 k 2 + ::: + k k k k 2 ) k A A k 2 k 1 1 k 2 + ::: + k k k k 2 144 Bysimplealgebra,wehave 2 k 2 2 ( k 1 1 k 2 2 + ::: + k k k k 2 2 ) k A A k 2 2 2( k 1 1 k 2 2 + ::: + k k k k 2 2 ) Forany j =1 ;:::;k ,bylemma6.2in[153],wehave c 1 m 1 n min ( n 1 T j j ) max ( n 1 T j j ) c 2 m 1 n forsome c 1 and c 2 .Thenwehave T T k k 2 2 = k A A k 2 2 k A k 2 2 2 k 2 2 ( k 1 1 k 2 2 + ::: + k k k k 2 2 ) k A k 2 2 = 2 k 2 2 k 1 1 k 2 2 k 1 k 2 2 k 1 k 2 2 k A k 2 2 + ::: + k k k k 2 2 k k k 2 2 k k k 2 2 k A k 2 2 ! 2 k 2 2 c 1 nm 1 n k 1 k 2 2 k A k 2 2 + ::: + k k k 2 2 k A k 2 2 ! 
= 2 k 2 2 c 1 nm 1 n Let 0 = 2 2 c 1 andobservethat k = O ( s n ) ,wehave T T n k k 2 2 0 2 s n 2 m 1 n 145 Similarly,wehave T T k k 2 2 = k A A k 2 2 k A k 2 2 2( k 1 1 k 2 2 + ::: + k k k k 2 2 ) k A k 2 2 =2 k 1 1 k 2 2 k 1 k 2 2 k 1 k 2 2 k A k 2 2 + ::: + k k k k 2 2 k k k 2 2 k k k 2 2 k A k 2 2 ! 2 c 2 nm 1 n k 1 k 2 2 k A k 2 2 + ::: + k k k 2 2 k A k 2 2 ! =2 c 2 nm 1 n Let 1 = c 2 ,wehave T T n k k 2 2 1 m 1 n ProofsofLemmaandTheorems Thefollowinglemmasareneededinprovingtheorems. LemmaA.1. Foranysequence r n > 0 ,underassumption2.1and3,wehaveforbounded responsesuchthat j y i j > > < > > > : 1 n T k ( y ^ y )= n 1 ^ k k ^ k k 2 ; 8 ks:t: k ^ k k 2 > 0 1 n T k ( y ^ y ) 2 [ n 1 ; n 1 ] ; 8 ks:t: k ^ k k 2 =0 (A.8) where ^ y isthemeanofresponseapproximatedbysplinesandevaluatedatthesolution ^ andthesecondbelongingrelationshipiselement-wise.Let s k = T k ( y ^ y ) n 1 Thenwehave 8 > > < > > : k s k k 2 =1 ; 8 ks:t: k ^ k k 2 > 0 k s k k 2 1 ; 8 ks:t: k ^ k k 2 =0 (A.9) Weconsiderthefollowingsubsetsof f 1 ;:::;p g .Let A 1 besuchthat n k : k ^ k k 2 > 0 o ˆ A 1 ˆ ( k : 1 n T k ( y ^ y )= n 1 ^ k k ^ k k 2 ) [f 1 ;:::;s n g (A.10) Let A 2 = f 1 ;:::;p gn A 1 , A 3 = A 1 n T , A 4 = A 1 \ T c , A 5 = A 2 n T c and A 6 = A 2 \ T c . Therefore,therelationshipsare j 2 Tj 2 T c A 1 :selected j andsome j 2 TA 3 A 4 A 2 : j notin A 1 (includesunselectedonly) A 5 A 6 151 Thenwehave T A 1 ( y ^ A 1 )= S A 1 (A.11) where S A 1 =( S T K 1 ;:::;S T K q 1 ) T , S K i = n 1 s k i and ^ A 1 = b 0 A 1 ^ A 1 ) .Alsofromthe inequalityinKKT,wehave C A 2 T A 2 ( y ^ A 1 ) C A 2 (A.12) where C A 2 =( C T K 1 ;:::;C T k q 2 ) T and C K i = n 1 1 fk ^ K i k 2 =0 g e m n 1 ,wherealltheelements of e are1.Let " = y y ,thenfrom(A.11)wehave T A 1 ( y + " ^ A 1 )= S A 1 useTaylorexpansionon y around ^ A 1 ,wehave T A 1 1 A 1 ( A 1 ^ A 1 )+ T A 1 1 A 2 A 2 + T A 1 " = S A 1 where 1 = ( 1 ) , 1 liesonthelinesegmentjoining and A 1 ^ A 1 ,and ( )= diag ( b 00 ( 1 ) ;:::;b 00 ( n )) isthediagonalvariancematrixevaluatedat .From(A.12),wehave C A 2 T A 2 1 A 1 ( A 1 ^ A 1 )+ T A 2 1 A 2 A 2 + T A 2 " C A 2 Let ij = T A i ( 1 A j =n ,wehave 11 ( A 1 ^ A 1 )+ 12 A 2 = S A 1 152 and C A 2 21 ( A 1 ^ A 1 )+ 22 A 2 + T A 2 " C A 2 Withourchoiceof n 1 ,theconstantsaresu˚cientlarge,bylemma1in[140],theeigenvalues of 11 areboundedfrombelow.Thuswithoutlossofgenerality,weassume 11 isinvertible. Thenwehave 1 11 S A 1 n = A 1 ^ A 1 + 1 11 12 A 2 + 1 11 n T A 1 " (A.13) and k 1 = 2 ( y y ) k 2 n + n 22 A 2 n 21 11 1 12 A 2 C A 2 T A 2 " 21 1 11 S A 1 + 21 1 11 T A 1 " (A.14) De˝ne V 1 j = 1 p n 1 = 2 11 Q T A j 1 S A j ;j =1 ; 3 ; 4 and w k = 1 = 2 1 ( I P 1 ) 1 = 2 1 A k A k ;k =2 ;:::; 6 where P 1 = 1 = 2 A 1 T A 1 A 1 ) 1 T A 1 1 = 2 and Q A j k isthematrixrepresentingtheselectionofvariablesin A k from A j . 
Consider j =4 .Forany k 2 A 4 ,wehave k ^ k k 2 > 0 ,then k s k k 2 2 =1 .Thenwehave 153 k S A 4 k 2 2 = P k 2 A 4 N ( A 4 ) ,where N ( A 4 ) isthenumberofpredictorsin A 4 .Thus k V 14 k 2 2 = 1 n k 1 = 2 11 Q T A 4 1 S A 4 k 2 2 1 n c 1 k Q T A 4 1 S A 4 k 2 2 = c 1 n X k 2 A 4 k n 1 s k k 2 2 c 1 2 n 1 ( q 1 s n ) Thatis ( q 1 s n ) + k V 14 k 2 2 c 1 2 n 1 (A.15) Then,weneedto˝ndaboundfor k V 14 k 2 2 and q 1 ( q 1 s n ) + + s n willbebounded.Using (A.13)andconsider V T 14 ( V 14 + V 13 )= S T A 4 Q A 4 1 1 11 n S A 1 = S T A 4 Q A 4 1 ( A 1 ^ A 1 + 1 11 12 A 2 + 1 11 n T A 1 " ) = S T A 4 Q A 4 1 1 11 12 A 2 + S T A 4 Q A 4 1 1 11 n T A 1 " + S T A 4 ( A 4 ^ A 4 ) Observe A 4 =0 ,and S T A 4 ^ A 4 = X k 2 A 4 n 1 ^ T k ^ k k ^ k k 2 = X k 2 A 4 n 1 k ^ k k 2 > 0 wehave V T 14 ( V 14 + V 13 ) S T A 4 Q A 4 1 1 11 12 A 2 + S T A 4 Q A 4 1 1 11 n T A 1 " 154 Ontheotherhand,by(A.14), k w 2 k 2 2 = T A 2 T A 2 1 = 2 1 ( I P 1 ) 1 ( I P 1 ) 1 = 2 1 A 2 A 2 c 1 1 T A 2 T A 2 1 = 2 1 ( I P 1 ) 1 = 2 1 A 2 A 2 = c 1 1 T A 2 T A 2 1 A 2 A 2 + c 1 1 T A 2 T A 2 1 A 1 T A 1 1 A 1 ) 1 T A 1 1 A 2 A 2 = c 1 1 T A 2 ( n 22 A 2 n 21 1 11 12 A 2 ) c 1 1 T A 2 ( C A 2 T A 2 " 21 1 11 S A 1 + 21 1 11 T A 1 " ) = c 1 1 T A 2 C A 2 c 1 1 T A 2 T A 2 21 1 11 T A 1 ) " c 1 1 T A 2 21 1 11 S A 1 = c 1 1 T A 2 C A 2 c 1 1 T A 2 21 1 11 S A 1 c 1 1 w T 2 1 1 " Thenwehave V T 14 ( V 14 + V 13 )+ c 1 k w 2 k 2 2 0 @ S T A 4 Q A 4 1 1 11 n T A 1 w T 2 1 1 1 A " S T A 3 Q A 3 1 1 11 12 A 2 + T A 2 C A 2 De˝ne u = A 1 1 11 Q T A 4 1 S A 4 =n 1 1 w 2 k A 1 1 11 Q T A 4 1 S A 4 =n 1 1 w 2 k 2 155 Observe k T A 1 1 11 Q T A 4 1 S A 4 =n 1 1 w 2 k 2 2( k T A 1 1 11 Q T A 4 1 S A 4 =n k 2 2 + k 1 1 w 2 k 2 2 ) 2 k T A 1 1 11 Q T A 4 1 S A 4 =n k 2 2 +2 c 2 1 k w 2 k 2 2 =2 k V 14 k 2 2 +2 c 2 1 k w 2 k 2 2 Observe c 1 0 or k= 2 T c o ˆ A 1 ˆ ( k : 1 n T k ( y ^ y )= n 1 ^ k k ^ k k 2 or k= 2 T c ) (A.18) Thenwehave A 5 = ; , N ( A 3 )= s n q 1 and A 2 = 0 .Thenwehave k V 14 k 2 2 B 2 1 +4 q c 1 1 2 n 1 s n k V 14 k 2 = B 2 1 +4 B 2 k V 14 k 2 Usethetruththat x 2 c +2 bx implies x 2 2 c +4 b 2 ,wehave k V 14 k 2 2 2 B 2 1 +16 B 2 2 Thenwehavefrom(A.15)that ( q 1 s n ) + k V 14 k 2 2 c 1 2 n 1 2 B 2 1 +16 B 2 2 c 1 2 n 1 = c 5 s n 158 where c 5 =(2 c 2 1 +16) =c 2 1 ,i.e. 
( q 1 s n ) + + s n ( c 5 +1) s n (A.19) Wenotethattheconstant c 5 onlydependson c 1 and(A.18)simplyrequireslarger A 1 ,(A.19) holdsforall A 1 satisfying(A.10).Notethat(A.19)holdsif q 1 N ( A 1 [ A 5 ) n m n and j u T " j 2 ( j A 1 j_ m n ) c 2 1 2 n 1 4 m n (A.20) Soitremainstoshowthat(A.20)holdswithprobabilitytendingto1.De˝ne x m =max j A j = m max k U A k k 2 =1 ;k =1 ;:::;m " T A T A A A ) 1 S A 1 = 2 A ( I 1 = 2 A A T A A A ) 1 T A 1 = 2 A ) 1 = 2 A k A T A A A ) 1 S A 1 = 2 A ( I 1 = 2 A A T A A A ) 1 T A 1 = 2 A ) 1 = 2 A k 2 (A.21) for j A j = q 1 = m 0 , S A =( S T A 1 ;:::; S T A m ) T where S A k = n 1 U A k , k U A k k 2 =1 and A isthevariancematrixevaluatedatsome correspondingtotheremainderof theTaylorexpansionwhenthesubset A isconsidered.Tosimplifythenotations,let Q A = n 1 A T A A A ) 1 and P A = 1 = 2 A A T A A A ) 1 T A 1 = 2 A ,thenwehave x m =max j A j = m max k U A k k 2 =1 ;k =1 ;:::;m " T Q A U A 1 = 2 A ( I P A ) 1 = 2 A k Q A U A 1 = 2 A ( I P A ) 1 = 2 A k 2 (A.22) 159 De˝ne m 0 = f ( U; " ): x m C p ( j A j_ 1) m n log( p n m n ) ; 8 m = j A j m 0 g and m 0 = f ( U; " ): x m C p ( j A j_ 1) m n log( p n m n ) ; 8 m = j A j m 0 g foralargeenoughgenericconstant C ,where x m =max j A j = m max k U A k k 2 =1 ;k =1 ;:::;m " T Q A U A 1 = 2 A ( I P A ) 1 = 2 A k Q A U A 1 = 2 A ( I P A ) 1 = 2 A k 2 BytriangleinequalityandCauchy-Schwarzinequality,wehave " T Q A U A 1 = 2 A ( I P A ) 1 = 2 A k Q A U A 1 = 2 A ( I P A ) 1 = 2 A k 2 " T Q A U A 1 = 2 A ( I P A ) 1 = 2 A k Q A U A 1 = 2 A ( I P A ) 1 = 2 A k 2 + k n k 2 " T Q A U A 1 = 2 A ( I P A ) 1 = 2 A k Q A U A 1 = 2 A ( I P A ) 1 = 2 A k 2 + Cn 1 = 2 s n m d n " T Q A U A 1 = 2 A ( I P A ) 1 = 2 A k Q A U A 1 = 2 A ( I P A ) 1 = 2 A k 2 + C p ( j A j_ 1) m n log( p n m n ) Thenwehave ( U; " ) 2 m 0 ) ( U; " ) 2 m 0 )j u T " j 2 j x m j 2 ( j A 1 j_ m n ) c 2 1 2 n 1 4 m n for q 1 m 0 0 160 Since i 'saresub-Gaussianrandomvariablesbyassumption2.2,wehave 1 P ( U; " ) 2 q = P x m >C p ( m _ 1) m n log( p n m n ) ; 8 m = j A j m 0 1 X m =0 P x m >C p ( m _ 1) m n log( p n m n ) 1 X m =0 p n m P 0 @ " T Q A U A 1 = 2 A ( I P A ) 1 = 2 A k Q A U A 1 = 2 A ( I P A ) 1 = 2 A k 2 >C p ( m _ 1) m n log( p n m n ) 1 A 2 1 X m =0 p n m exp( C ( m _ 1) m n log( p n m n )) =2( p n m n ) Cm n +2 1 X m =1 p n m ( p n m n ) Cmm n 2( p n m n ) Cm n +2 1 X m =1 1 m ! p n ( p n m n ) Cm n m =2( p n m n ) Cm n +2exp p n ( p n m n ) Cm n 2 ! 0 as n !1 Therefore,theproofofpart(i)iscomplete. 
Thenweprovepart(ii).Considertheboundedresponsecase.Forasequence N n such that k ^ 0 k 2 N n ,de˝ne t = N n = ( N n + k ^ 0 k 2 ) ,thenconsidertheconvexcombination = t ^ +(1 t ) 0 .Wehave 0 = t ( ^ 0 ) ,whichimplies k 0 k 2 = t k ^ 0 k 2 = N n k ^ 0 k 2 N n + k ^ 0 k 2 N n (A.23) Recalltheloglikelihoodfunction l n ( )= 1 n n X i =1 h y i + T i b + T i 161 l n ( )= l n ( 0 )+ 1 n n X i =1 h y i i y i i i T ( 0 ) 1 2 n n X i =1 ( 0 ) T T i b 00 ( + T i i ( 0 ) = l n ( 0 )+ ( 0 ) T T ( y y ) n 1 2 n ( 0 ) T T ( 0 ) (A.24) where linesonthelinejoining and 0 ,and ( )= diag b 00 ( + T 1 ) ;:::b 00 ( + T n ) isthevariancematrixofresponsewhenthecoe˚cientstakevalueon .Ontheother hand,byconvexityoftheloglikelihoodfunction, l n ( )= l n ( t ^ +(1 t ) 0 ) tl n ( ^ )+(1 t ) l n ( 0 ) bynorminequality,wehave p n X j =1 k j k 2 = p n X j =1 k t ^ j +(1 t ) 0 j k 2 p n X j =1 ( t k ^ j k 2 +(1 t ) k 0 j k 2 ) joiningthetwoinequalitiesaboveandbythede˝nitionof ^ gives l n ( ) n 1 p n X j =1 k j k 2 tl n ( ^ )+(1 t ) l n ( 0 j ) n 1 p n X j =1 ( t k ^ j k 2 +(1 t ) k 0 j k 2 ) l n ( 0 ) n 1 p n X j =1 k 0 j k 2 162 whichimplies l n ( ) l n ( 0 ) n 1 p n X j =1 k j k 2 n 1 p n X j =1 k 0 j k 2 (A.25) By(A.24)and(A.25)togetherwehave n 1 p n X j =1 k j k 2 k 0 j k 2 ( 0 ) T T ( y y ) n 1 2 n ( 0 ) T T ( 0 ) andmoveonetermtothelefthandside,wehave 1 2 n ( 0 ) T T ( 0 ) 0 T T y y n + n 1 p n X j =1 k 0 j k 2 k j k 2 = 0 T T y y n + 0 T T y y n + n 1 p n X j =1 k 0 j k 2 k j k 2 (A.26) Wehaveforthesecondterm 0 T T y y n = 0 T T ( ) 1 = 2 ( ) 1 = 2 y y n k ( ) 1 = 2 0 k 2 k ( ) 1 = 2 y y k 2 n ( 0 ) T T ( 0 ) 4 n + k ( ) 1 = 2 y y k 2 2 n ( 0 ) T T ( 0 ) 4 n + c 1 d n (A.27) where d n = O ( s 2 n m 2 d n ) ,the˝rstinequalityfollowsfromCauchy-Schwarzinequality,the 163 secondinequalityfollowsfromtheidentity uv u 2 = 4+ v 2 ,andthethirdinequalityfollow fromassumption2.3and(A.7).Thenjoining(A.26)and(A.27),wehave ( 0 ) T T ( 0 ) 4 n 0 T T y y n + n 1 p n X j =1 k 0 j k 2 k j k 2 + c 1 d n (A.28) Forthe˝rsttermontherighthandsideof(A.28),wehave 0 T T y y n = 0 T T ( ) 1 = 2 ( ) 1 = 2 y y n ( 0 ) T T ( 0 ) 8 n + 2 k ( ) 1 = 2 y y k 2 2 n (A.29) wheretheinequalityisbytheidentity a T b k a k 2 2 = 8+2 k b k 2 2 .Joining(A.31)and(A.29), wehave ( 0 ) T T ( 0 ) 8 n 2 k ( ) 1 = 2 y y k 2 2 n + n 1 p n X j =1 k 0 j k 2 k j k 2 + c 1 d n (A.30) Byremark2.1,wehave 0 c 1 2 s n 2 m 1 n 8 k 0 k 2 2 2 k ( ) 1 = 2 y y k 2 2 n + n 1 p n X j =1 k 0 j k 2 k j k 2 + c 1 d n (A.31) 164 Observethat k ( ) 1 = 2 y y k 2 2 c 1 1 k y y k 2 2 c 1 1 m n 0 2 s n 2 k T ( y y ) k 2 2 ThenbylemmaA.1,wehave 0 c 1 2 s n 2 m 1 n 8 k f T [ ^ T g 0 f T [ ^ T g k 2 2 O P s n m n log( p n m n ) 2 s n 2 ! 
+ n 1 p n X j =1 k 0 j k 2 k j k 2 + O ( s 2 n m 2 d n ) (A.32) Observethat n 1 p n X j =1 k 0 j k 2 k j k 2 n 1 X j 2 T [ ^ T 0 j j 2 n 1 p s n f T [ ^ T g 0 f T [ ^ T g 2 0 c 1 2 s n 2 m 1 n 16 f T [ ^ T g 0 f T [ ^ T g 2 2 + 4 2 n 1 s n 0 c 1 2 s n 2 m 1 n (A.33) wherethe˝rsttwoinequalitiesarebynorminequality,andthethirdinequalityisbythe identity a T b k a k 2 2 + k b k 2 2 = 4 .Joining(A.32)and(A.33),wehave 0 2 2 = O P s n 2 s n 2 m 2 n log( p n m n ) n + O ( 2 n 1 m 2 n s n 2 s n 2 )+ O ( s 2 n m 1 2 d n 2 s n 2 ) (A.34) 165 Forsome N n suchthat k 0 k 2 N n = 2 Byde˝nitionof ,wehave k 0 k 2 = N n N n + k ^ 0 k 2 k ^ 0 k 2 N n 2 Theinequalityaboveimplies k ^ 0 k 2 N n Therefore, k ^ 0 k 2 2 = O P s n 2 s n 2 m 2 n log( p n m n ) n + O ( 2 n 1 m 2 n s n 2 s n 2 )+ O ( s 2 n m 1 2 d n 2 s n 2 ) Intheunboundedresponsecase,theonlydi˙erencethatwehavetomakeisin(A.29),we have 0 T T y y n 0 c 1 8 k f T [ ^ T g 0 f T [ ^ T g k 2 2 + O P s n m n a n log( p n m n ) n (A.35) wheretheconvergencerateisbylemmaA.2.Thenwiththechoiceof n 1 forthiscase,we have k ^ 0 k 2 2 = O P s n 2 s n 2 m 2 n log( p n m n ) n + O ( 2 n 1 m 2 n s n 2 s n 2 )+ O ( s 2 n m 1 2 d n 2 s n 2 ) Part(iii)isadirectresultofpart(ii).Byassumption2.4,wehave k f j k 2 c f > 0 ,andwe 166 have k f nj k 2 k f j k 2 k f j f nj k 2 c f O ( m d n ) 1 2 c f forlarge n .Bythepropertiesofsplinein[35],seeforexample[121]and[65],thereexist positiveconstants c 1 and c 2 suchthat c 1 m 1 n k 0 j k 2 2 k f nj k 2 c 2 m 1 n k 0 j k 2 Thenwehave k 0 j k 2 2 c 1 2 m n k f nj k 2 2 0 : 25 c 1 2 m n c 2 f .Supposethereisa j 2 T suchthat k ^ j k 2 =0 ,thenwehave k 0 j k 2 0 : 25 c 1 2 m n c 2 f whichisacontradictiontotheresultin(ii)andthetheoremassumption.Therefore,part (iii)follows. ProofofTheorem2.2 Proof. Bythede˝nitionof ^ f nj ;j =1 ;:::;p ,part(i)andpart(iii)followsdirectlyfrompart (i)andpart(iii)intheorem3.1.Itremainstoshowpart(ii).Bythepropertiesofsplinein [35],seeforexample[121]and[65],thereexistpositiveconstants c 1 and c 2 suchthat c 1 m 1 n k ^ nj nj k 2 2 k ^ f nj f nj k 2 2 c 2 m 1 n k ^ nj nj k 2 2 (A.36) 167 Therefore,wehave k ^ f nj f nj k 2 2 = O P s n 2 s n 2 m n log( p n m n ) n + O ( b n 1 2 m n s n 2 s n 2 )+ O ( s 2 n m 2 d n 2 s n 2 ) (A.37) fortheboundedresponsecaseand k ^ f nj f nj k 2 2 = O P s n a n 2 s n 2 m n log( p n m n ) n + O ( b n 1 2 m n s n 2 s n 2 )+ O ( s 2 n m 2 d n 2 s n 2 ) (A.38) fortheunboundedresponsecase.This,togetherwith(2.7)andtriangleinequalityimplies part(iii). ProofofTheorem2.3 Proof. Westartwithpart(i).Toprovepart(i),it'sequivalenttoprovethattheselection isdoneasitisperformedrightontheactiveset,andnoneofthenonzerocomponentsare droppedwithprobabilitytendingto1.Let ^ NZ =argmin 2 R p n m n : T c =0 L a ( ; n 2 ) betheadaptivegrouplassoestimatorrestrictedtothetruenonzerocomponents.Firstwe showthatwithprobabilityconvergingto1, ^ NZ isthesolutiontominimizing(2.16),i.e., withprobabilityconvergingto1,theminimizerof2.16is ^ NZ .Notethattheadaptivegroup lassoisaconvexoptimizationproblemwitha˚neconstraints,thereforetheKKTconditions arenecessaryandsu˚cient.TheKKTconditionsforavector 2 R p n m n tobethesolution 168 of(2.16)is 8 > > > < > > > : 1 n T j ( y )= n 2 w nj j k j k 2 ; if k j k 2 > 0 k 1 n T j ( y ) k 2 n 2 w nj ; if k j k 2 =0 (A.39) where = b 0 ) .Itissu˚cienttoshowthat P ^ NZ satis˝es ( A: 39) ! 
1 Notethatforany j 2 T ,wehavetheKKTconditionsfor ^ NZ that 8 > > > < > > > : 1 n T j ( y ^ NZ )= n 2 w nj ^ NZj k ^ NZj k 2 ; if k j k 2 > 0 ;j 2 T k 1 n T j ( y ^ NZ ) k 2 n 2 w nj ; if k j k 2 =0 ;j 2 T (A.40) whicharetheequalityconditionin(A.39)andpartoftheinequalityconditionin(A.39). Therefore,itsu˚cestoshowthat P k 1 n T j ( y ^ NZ ) k 2 n 2 w nj ; 8 j= 2 T ! 1 (A.41) Thisisequivalenttoshowthat P k 1 n T j ( y ^ NZ ) k 2 > n 2 w nj ; 9 j= 2 T ! 0 (A.42) UseTaylorexpansionon 1 n T j ( y ^ NZ ) ,wehave 1 n T j ( y ^ NZ )= 1 n j ( y y )+ 1 n T j ( y b 0 0 ))+ 1 n T j ^ NZ 0 ) 169 where isthevariancematrixevaluatedatsome locatedonthelinesegmentjoining 0 and ^ NZ .Thenwehave P k 1 n T j ( y ^ NZ ) k 2 > n 2 w nj ; 9 j= 2 T P k 1 n j ( y y ) k 2 > n 2 w nj 3 ; 9 j= 2 T + P k 1 n T j ( y b 0 0 )) k 2 > n 2 w nj 3 ; 9 j= 2 T + P k 1 n T j ^ NZ 0 ) k 2 > n 2 w nj 3 ; 9 j= 2 T P 1 + P 2 + P 3 Nowlet'sconsider P 1 .Byassumption2.3,theerrors y i y i 'saresub-Gaussian.For boundedresponses,wehavebylemmaA.1andassumption2.6that P 1 = P k 1 n T j ( y ^ NZ ) k 2 > n 2 w nj ; 9 j= 2 T P k 1 n T j ( y ^ NZ ) k 2 >C n 2 r n ; 9 j= 2 T + o (1) = P max j= 2 T k 1 n T j ( y ^ NZ ) k 2 >C n 2 r n + o (1) P max j= 2 T k 1 n T j ( y ^ NZ ) k 2 >C n 2 r n j max j= 2 T k 1 n T j ( y ^ NZ ) k 2 Cn 1 = 2 p log( s n m n ) + o (1) ! 0 asn !1 BylemmaA.2,wehave E max j= 2 T;k =1 ;:::;m n 1 n T jk ( y y ) 2 c 6 n 1 = 2 p log( s n m n ) (A.43) 170 forsomeconstant c 6 .Observethatbyassumption2.5,wehave w nj = O P ( r n ) Cr n for somegeneralconstant C .ThenwehavebyMarkov'sinequalityandassumption2.6that P 1 = P k 1 n T j ( y ^ NZ ) k 2 > n 2 w nj ; 9 j= 2 T P k 1 n T j ( y ^ NZ ) k 2 >C n 2 r n ; 9 j= 2 T + o (1) = P max j= 2 T k 1 n T j ( y ^ NZ ) k 2 >C n 2 r n + o (1) E max j= 2 T;k =1 ;:::;m n 1 n T jk ( y y ) 2 C n 2 r n + o (1) c 6 p log( s n m n ) Cn 1 = 2 n 2 r n + o (1) ! 0 asn !1 Thenweconsider P 2 .Wehaveshownthat 1 n k y y k 2 2 = O ( s 2 n m 2 d n ) Thisimpliesthat 1 p n k y y k 2 = O ( s n m d n ) Thenbyassumption2.1, max j= 2 T 1 n j ( y y ) 2 Cm 1 = 2 n 1 p n y y 2 = O ( s n m d 1 = 2 n ) 171 Byassumption2.6,wehave P 2 ! 0 as n !1 .Next,welookat P 3 .Bythede˝nitionof ^ NZ ,wehavebynorminequality 1 n T j ^ NZ 0 )= 1 n T j T ( ^ NZT 0 T ) TheMLEonthetruenonzerosethasarateofconvergence p s n m n =n .Thepenalized solutionhasbeenprovedtobeclosetotheMLEasymptotically([149];[46];[88]).Knowing thetruenonzeroset,therateofconvergenceof ^ NZ is p s n m n =n .Thenwehave P 3 = P 1 n T j T ( ^ NZT 0 T ) 2 > n 2 w nj 3 ; 9 j= 2 T P 1 n T j T ( ^ NZT 0 T ) 2 >C n 2 r n ; 9 j= 2 T + o (1) P max j= 2 T 1 n T j T 2 > C n 2 r n a n p s n m n =n ! + P ^ NZT 0 T 2 >a n r s n m n n + o (1) ! 0 as n !1 foranydivergingsequence a n ,wherethe˝rstprobabilityinthelaststepgoesto0by assumption2.1thatthelefthandsideisoforder m 1 = 2 n andassumption2.6.Thesecond probabilitygoesto0bytherateofconvergenceof ^ NZT . Therefore,wehavethat ^ NZ isouradaptivegrouplassosolutionwithprobabilitycon- vergingto1.Thecomponentsselectedbyadaptivegrouplassoisasymptoticallyatmost thosewhichareactuallynonzero.Thenwewanttoprovethatthetruenonzerocomponents 172 areallselectedwithprobabilityconvergingto1.Byourassumptions,wehave min j 2 T k ^ NZj k 2 min j 2 T k 0 j k 2 k ^ NZj 0 j k 2 c 1 = 2 2 m 1 = 2 n c f o P (1) > 0 Therefore,noneofthetruenonzerocomponentsareestimatedaszero.Combiningthetwo resultsabove,wehavethatwithprobabilityconvergingto1,thecomponentsselectedbythe adaptivegrouplassoareexactlythetruenonzerocomponents,i.e., P ^ AGL 0 = 0 ! 
1 as n !1 Part(i)isproved.Thenwelookatpart(ii),wherebasedontheresultinpart(i),weonly considerthehighprobabilityeventthattheselectionoftheadaptivegrouplassoestimator isperfect.Similartopart(ii)oftheorem2.1,weconsideraconvexcombinationof 0 and ^ AGL = t ^ AGL +(1 t ) 0 where t = N n = ( N n + k ^ AGL 0 k 2 ) forsomesequence N n .Similarto(A.26),wehave 1 2 n ( T 0 T ) T T T T ( T 0 T ) ( T 0 T ) T T T ( y y ) n + ( T 0 T ) T T T ( y y ) n + n 2 s n X j =1 w nj ( k 0 k 2 k k 2 ) (A.44) 173 Thenbythefactthat j a T b jk a k 2 2 + k b k 2 2 = 4 ,wehave 1 2 n ( T 0 T ) T T T T ( T 0 T ) ( T 0 T ) T T T ( y y ) n + 1 4 n ( T 0 T ) T T T T ( T 0 T ) + k 1 = 2 ( y y ) k 2 2 n + n 2 s n X j =1 w nj ( k 0 k 2 k k 2 ) Thenby(A.7), 1 4 n ( T 0 T ) T T T T ( T 0 T ) ( T 0 T ) T T T ( y y ) n + O ( s 2 n m 2 d n )+ n 2 s n X j =1 w nj ( k 0 k 2 k k 2 ) By(2.13),thefactthat j a T b jk a k 2 2 + k b k 2 2 = 4 andnorminequality,wehave 0 c 1 2 s n 2 m 1 n 4 k 0 k 2 2 ( T 0 T ) T T T ( y y ) n + O ( s 2 n m 2 d n ) + 2(max j 2 T w nj ) 2 0 c 1 2 n 2 s n + 0 c 1 2 s n 2 m 1 n 8 k 0 k 2 2 Thenbyassumption2.6, 0 c 1 2 s n 2 m 1 n 8 k T 0 T k 2 2 ( T 0 T ) T T T ( y y ) n + O ( s 2 n m 2 d n )+ O ( 2 n 2 s n ) 174 Usethefactthat j a T b jk a k 2 2 + k b k 2 2 = 4 onthe˝rsttermoftherighthandside,wehave 0 c 1 2 s n 2 m 1 n 8 k T 0 T k 2 2 0 c 1 2 s n 2 m 1 n 16 k T 0 T k 2 2 + 4 0 c 1 2 s n 2 m 1 n n 2 k T T ( y y ) k 2 2 + O ( s 2 n m 2 d n )+ O ( 2 n 2 s n ) BynorminequalityandlemmaA.1,wehave 4 0 c 1 2 s n 2 m 1 n n 2 k T T ( y y ) k 2 2 4 0 c 1 2 s n 2 m 1 n n 2 s n m n k T T ( y y ) k 1 = O P s n 2 s n 2 m n log( s n m n ) n Combinethelasttworesults,wehavewithprobabilityconvergingto1, k T 0 T k 2 2 = O p s n 2 s n 2 m 2 n log( s n m n ) n + O ( s 2 n 2 s n 2 m 1 2 d n )+ O ( 2 n 2 m 2 n s n 2 s n 2 ) Thensimilartotheargumentintheproofofpart(ii)oftheorem2.1,wehave X j 2 T k ^ AGLj 0 j k 2 2 = O p s n 2 s n 2 m 2 n log( s n m n ) n + O ( s 2 n 2 s n 2 m 1 2 d n )+ O ( 2 n 2 m 2 n s n 2 s n 2 ) 175 Intheunboundedresponsecase,wereplacelemmaA.1withlemmaA.2andget X j 2 T k ^ AGLj 0 j k 2 2 = O p s n 2 s n 2 m 2 n a n log( s n m n ) n + O ( s 2 n 2 s n 2 m 1 2 d n )+ O ( 2 n 2 m 2 n s n 2 s n 2 ) foranydivergingsequence a n .Part(ii)isproved. ProofofTheorem2.4 Proof. Theproofissimilartotheproofoftheorem2.2. ProofofTheorem2.5 Proof. 
Theideaoftheproofissimilartotheproofsin[52],butduetothegrouppenalization structure,somechangeshavetobemade.First,theGICcriterionhasthesolutionof adaptivegrouplasso,whichisnoteasytostudy.Soweuseaproxy,theMLEonthenonzero componentsselectedbytheadaptivegrouplassoestimator.Let ^ ( A )=argmax f 2 R p n m n : supp B ( )= A g 1 n n X i =1 h y i T i b T i (A.45) foragiven A ˆf 1 ;:::;p g ,andtheproxyofGICisde˝nedas GIC a n ( A )= 1 n f D (^ A ; Y )+ a n j A jg (A.46) 176 where ^ A = b 0 ^ ( A )) .The˝rstresultisthattheproxy GIC a n ( T ) wellapproximates GIC a n ( 0 ) .Toprovethis,observebythede˝nitionof ^ 0 = ^ ( T ) ,wehavethe˝rstorder necessarycondition @ @ l n ( ^ 0 )= 0 (A.47) UseTaylorexpansionandbyassumptions1and2,wehave 0 GIC a n ( T ) GIC a n ( 0 ) = 1 n l n ( ^ ( n 0 )) l n ( ^ 0 ) = 1 n ^ ( n 0 ) ^ 0 T T ( ^ ( n 0 ) ^ 0 c 1 0 ^ ( n 0 ) ^ 0 2 2 (A.48) where liesonthelinesegmentjoining ^ ( n 0 ) and ^ 0 .Thenweneedtobound ^ ( n 0 ) ^ 0 2 2 .Bythede˝nitionof ^ ( n 0 ) ,wehave T T y b 0 T ^ T ( n 0 )) + n 0 T = 0 (A.49) wheretheelementsof T are w nj ^ j ( n 0 ) = k ^ j ( n 0 ) k 2 for j 2 T .Ontheotherhand,bythe de˝nitionof ^ 0 ,wehave T T y b 0 T ^ 0 T ) = 0 (A.50) Togetherwehave T T b 0 T ^ 0 T ) b 0 T ^ T ( n 0 )) + n 0 T = 0 (A.51) 177 UseTaylorexpansiononthelefthandsideoftheequation,wehave T T ( T ^ T ( n 0 ) ^ 0 T = n 0 T (A.52) where liesonthelinesegmentjoining ^ T ( n 0 ) and ^ 0 T .Taking2normandtogether withassumptions1and2andtheresultsintheorem2.1,wehave ^ T ( n 0 ) ^ 0 T 2 C n 0 k w T k 2 C n 0 p s n k w T k 1 (A.53) where w T =( w nj ;j 2 T ) 0 .Thenwehave k ^ ( n 0 ) ^ 0 k 2 = O ( n 0 p s n ) (A.54) Choose a n tobeanydivergingsequence,thenwehave k ^ ( n 0 ) ^ 0 k 2 = o ( n 0 p s n a n ) (A.55) Thenby(A.48),wehave GIC a n ( 0 ) GIC a n ( T )= o ( n 0 p s n a n ) (A.56) Asadirectresult, GIC a n ( ) GIC a n ( n 0 ) ( GIC a n ( ) GIC a n ( T ))+( GIC a n ( T ) GIC a n ( n 0 )) =( GIC a n ( ) GIC a n ( T ))+ o p ( n 0 p s n a n ) (A.57) 178 Theusingthisproxy,nextweprovethattheproxy GIC isabletodetectthedistance betweenaselectedmodelandthetruemodel.Sincethe GIC dependsonlyontheMLE andhasnothingtodowiththepenalization,thisisthesameasthegeneralizedlinearmodel, butwiththesplinelineapproximationerrorbeingconsidered. Duetotheestimationproblem,weareonlyinterestedinthemodels A suchthat j A j K where Km n = o ( n ) .Astheproofin[52],weconsidertheunder˝ttedmodelandover˝tted model(de˝nedintheirpaper).Brie˛y,theunder˝ttedmodelsare A suchthat A 6˙ T and theover˝ttedmodelsare A suchthat A ) T .Alsointheresultoftheorem2.1,themodel size j A j = O ( s n )= o ( n ) andthustheKLdivergencehasauniqueminimiserforeverysuch model A ,asdiscussedin[52]. LemmaA.3impliesthatforallunder˝ttedmodels GIC A n ( A ) GIC a n ( T )=2 j A j I ( ( A ))+( j A jj T j ) a n n 1 + j A j O P ( R n ) n s n a n n 1 O P ( KR n ) n 2 if n K 1 R 1 n !1 and a n = o ( n s 1 n n ) .Thisresultstatesthatthereisanegligible incrementonthe GIC ifoneofthenonzerocomponentismissed,whentheparameters satisfytheconditions.LemmaA.4impliesthatforallover˝ttedmodels GIC a n ( A ) GIC a n ( T )= j A jj T j n [ a n O P ( n )] > a n 2 n if a n n !1 .Thisresultstatesthatthereisanegligibleincrementonthe GIC ifoneof thezerocomponentisselectedalongwiththetruemodel,whentheparameterssatisfythe 179 conditions.Therefore, P inf A 6˙ T GIC a n ( A ) GIC a n ( T ) > n 2 and inf A ) T GIC a n ( A ) GIC a n ( T ) > a n 2 n ! 1 (A.58) Combinethisresultwith(A.57)andtheoremassumptions,wehave P f inf 2 [ + GIC a n ( ) >GIC a n ( n 0 ) g! 1 LemmaA.3. 
Underassumptions2and3,as n !1 ,wehave sup j A K A ˆf 1 ;:::;p n g 1 n j A j D ( ^ A ; Y ) D ( ^ 0 ; Y ) 2 I ( ( A )) = O P ( R n ) whereeithera)theresponsesareboundedorGaussiandistributed, R n = p n m n log( p n ) =n ,and m n log( p n )= o ( n ) ;orb)theresponsesareunbounded andnon-Gaussiandistributed, R n = p n m n log( p n ) =n + 2 n m n M 2 n log( p n ) =n and log( p )= o (min f n (log n ) 1 K 2 m 1 n 1 n ;nM 2 n g ) . Proof. lemmaA.3isadirectresultfromlemmaA.7andlemmaA.8. LemmaA.4. Underassumption2.1,2and3,andsuppose log p = O ( n ) forsome 0 << 1 , as n !1 ,wehave 1 j A jj T j D ( ^ A ; Y ) D ( ^ 0 ; Y ) = O P ( n ) uniformlyforall A ) T with j A j 0 , N = L n p j A j m n =n (1+ u ) , u = n p log p n and observethat p n k ( pe=k ) k ,wehave P sup j A K 1 j A j ~ Z A ( N ) 4 L 2 n m n n (1+ u ) 2 j n ! X j A K P ~ Z A ( N ) 4 j A j L 2 n m n n (1+ u ) 2 j n X k K pe k k exp( CKm n u 2 ) X k K pe k k exp( CKm n n log p n ) ! 0 Thenwehave P sup j A K 1 j A j ~ Z A ( N ) 2 n L 2 n m n n log p n ! = o (1)+ P c ) ! 0 LemmaA.6. Underassumptions1-3,wehave sup j A K 1 p j A j k ^ ( A ) ( A ) k 2 = O P n L n r m n log p n n ! Proof. De˝netheconvexcombinationof ^ ( A ) and ( A ) tobethesamewayaswedidin 184 provingtheorem2.1as ^ u ( A ) .Thenisremainstoshow sup j A K 1 p j A j k ^ u ( A ) ( A ) k 2 = O P n L n r m n log p n n ! Bythede˝nitionof ^ ( A ) andtheconcavityofthelikelihoodfunction,wehave l n ( ^ u ( A )) l n ( ( A )) 0 Bythede˝nitionof ( A ) ,wehave E [ l n ( ( A ) l n ( ^ u ( A )] 0 Combinethetwoinequalitiesabove,wehave 0 E [ l n ( ( A ) l n ( ^ u ( A )] l n ( ^ u ( A )) l n ( ( A )) E [ l n ( ^ u ( A ) l n ( ( A )] nZ A ( N ) (A.60) Ontheotherhand,forany A 2 B A ( N ) ,wehave E [ l n ( A ) l n ( ( A ))]= E [ y T A 1 T b A ) y T ( A )+ 1 T b ( A ))] = b 0 ( p n X j =1 f j ) T A ( A )] 1 T [ b A ) b ( A ))] Observethatbythede˝nitionof ( A ) ,wehave b 0 ( p n X j =1 f j ) b 0 ( A ))]= 0 185 useTaylorexpansion,wehave E [ l n ( A ) l n ( ( A ))]= b 0 ( A )) T A ( A )] 1 T [ b A ) b ( A ))] = 1 2 ( A ( A )) T T A ~ A ( A ( A )) Cn k A ( A ) k 2 2 wherethelastinequalityisbyassumptions1and2.Thenwehave k A ( A ) k 2 2 CZ A ( N ) Take N = n L n p j A j m n log p n =n andbylemmaA.5,wehave sup j A K 1 p j A j k ^ u ( A ) ( A ) k 2 = O P n L n r m n log p n n ! ThenlemmaA.6follows. LemmaA.7. Underassumptions1-3,wehave sup j A K 1 n j A j l n ( ^ ( A )) l n ( ( A )) 2 n L 2 n m n log p n n Proof. De˝netheevent E = ( sup j A K 1 p j A j k ^ ( A ) ( A ) k 2 = O P n L n r m n log p n n !) BylemmaA.6,wehave P ( E ) ! 1 .Usingthesameargumentasin(A.60)inprovinglemma 186 A.6,wehave 0 l n ( ^ ( A )) l n ( ( A )) l n ( ^ u ( A )) l n ( ( A )) E [ l n ( ^ u ( A ) l n ( ( A )] nZ A ( N ) BylemmaA.5,conditioningon E ,wehave l n ( ^ ( A )) l n ( ( A )) nO P 2 n L 2 n j A j m n log p n n Thenthelemmafollowfrom P ( A ) P ( A jE )+ P ( E c ) . LemmaA.8. Underassumption2.1,2.2and2.3,wehave sup j A K 1 n j A j j l n ( ( A )) E [ l n ( ( A ))] j = O P r n m n log p n n ! where log p n = o ( n ) forboundedresponseand n m n K 2 log p n = o ( n ) forunboundedre- sponse. Proof. Bythede˝nition,wehave l n ( ( A )) E [ l n ( ( A ))]= " T ( A ) .Forbounded response,byHoe˙ding'sinequality,wehave P ( j " T ( A ) j t ) C exp Ct 2 P n i =1 T i ( A )) 2 ! C exp Ct 2 n j A j m n Take t = j A j p n m n log p n ,wehave P ( j " T ( A ) jj A j p n m n log p n ) C exp( C j A j n log p n ) 187 Thenwehave sup j A K 1 n j A j j l n ( ( A )) E [ l n ( ( A ))] j = O P r n m n log p n n ! 
Iftheresponsesareunbounded,weuseBernstein'sinequality.Firstcheckthecondition E [ j i ( A ) i j m ]= m Z 1 0 x m 1 P ( j i ( A ) i x ) dx = m j T i ( A ) j m Z 1 0 x j T i ( A ) j ! m 1 P j i j x j T i ( A ) j ! d x j T i ( A ) j m j T i ( A ) j m Z 1 0 t m 1 C exp( Ct 2 ) dt ) m j T i ( A ) j m ( k ( A ) k 1 C ) m 2 m ! 2 ThenbyBernstein'sinequality,wehave P ( j " T ( A ) j p nt ) 2exp 1 2 nt 2 C k A ( A ) k 2 2 + C p n k A ( A ) k 1 t ! Taking t = j A j p n m n log p n ,wehave P ( j " T ( A ) j p n j A j p n m n log p n ) 2exp 1 2 n j A j 2 n m n log p n C k A ( A ) k 2 2 + C p n k A ( A ) k 1 j A j p n m n log p n ! ! 0 if K 2 n m n log p n =n ! 0 . 188 LemmaA.9. Underassumptions1-3,wehave sup A ˙ T j A K 1 j A jj T j ( y 0 ) T 1 = 2 0 B A 1 = 2 0 ( y 0 )= O P ( m n ( n log p n ) ˘ ) where B A = 1 = 2 0 A T A 0 A ) 1 T A 1 = 2 0 and ˘ =1 = 2 forboundedresponseand ˘ =1 forunboundedresponse. Proof. Let k = j A jj T j and P A = B A B T .It'seasytoverifythat P A isaprojection matrix,thuswehave tr ( P )= km n , P n i =1 P ii = km n and P i;j P ij = km n .Let ~ y = 1 = 2 0 ( y 0 ) Wehavethedecomposition 1 m n k ~ y T P A ~ y = 1 m n k n X i =1 P ii ~ y 2 i + 1 m n k X i 6 = j P ij ~ Y i ~ Y j I 1 ( A )+ I 2 ( A ) Let ~ y i beindependentcopiesof ~ y i ,thenbythedecouplinginequality,thereexistsaconstant C> 0 suchthat P 0 @ 1 m n k j X i 6 = j P ij ~ y i ~ y j j t 1 A C P 0 @ 1 m n k j X i 6 = j P ij ~ Y i ~ Y j j C 1 t 1 A Forboundedresponse,applyHoe˙ding'sinequality,wehave P ( I 1 ( A ) 1+ x ) 2exp 2 Cx 2 P n i =1 ( m n k ) 2 P 2 ii ! 2exp( Cm n kx 2 ) 189 Taking x = p log p n ,usetheinequality p k ( pe=k ) k andusethesametechniqueaswe usedinprovinglemmaA.5,wehave P sup j A K I 1 ( A ) 1+ p n log p n ! 2 C K X k =1 ( p n s n ) e k k exp( Cm n k n log p n ) ! 0 Thenobserve P i 6 = j P 2 ij = P i ( P ii P 2 ii ) m n k ,wehavefollowingthedecouplinginequality that P ( j I 2 ( A ) j t ) C P 0 @ 1 m n k j X i 6 = j P ij ~ Y i ~ Y j j C 1 t 1 A C exp C 2 ( m n k ) 2 t 2 P i 6 = j P 2 ij ! C exp( Cm n kt 2 ) Taking t = p n log p n andusethesametechniqueasinthepreviousstep,wehave P sup j A K I 2 ( A ) 1+ p n log p n ! ! 0 Intheunboundedcase,weapplytheBernstein'sinequality.Inthesamewayaswedidin provinglemmaA.8,wecheckthecondition E j P ii ~ Y 2 i j m m ! C m 2 P 2 ii 2 ByBernstein'sinequality,wehave P ( I 1 ( A ) x 2 ) 2exp( Cm n kx 2 ) 190 Taking x = p n log p n ,wehave sup j A K I 1 ( A )= O P ( n log p n ) For I 2 ( A ) ,wehave X i 6 = j j P ij j m E [ j ~ y i ~ y j j m ] m ! C m 2 P 2 ij 2 ThenbyBerstein'sinequalityandtaking x = p n log p n ,wehave P ( j I 2 ( A ) j n log p n ) ! 0 LemmaA.10. Underassumptions1-3,forall A ˙ T and j A j K ,wehave l n ( ^ ( A )) l n ( ( A ))= 1 2 ( y 0 ) T 1 = 2 0 B A 1 = 2 0 ( y 0 ) + j A j 5 = 2 O P m 5 = 2 n 5 = 2 n L 2 n (log p n ) 1+ ˘= 2 p n ! + j A j 4 O P m 4 n 4 n L 4 n (log p n ) 2 n + j A j 3 O P m 3 n 3 n L 3 n (log p n ) 3 = 2 p n ! Proof. 
UseTaylor'sexpansion,wehave l n ( ^ ( A )) l n ( ( A )) =( ^ ( A ) ( A )) T T ( y b 0 ( A )) 1 2 ( ^ ( A ) ( A )) T T 0 ^ ( A ) ( A )) + Remainder I 1 ( A )+ I 2 ( A )+ I 3 ( A ) 191 First,bythede˝nitionof ^ ( A ) ,wehave T A [ y b 0 ^ ( A ))]=0 ThenbyTaylorexpansion,wehave T A y = T A b 0 ^ ( A )) = T A b 0 ( A ))+ T A 0 ^ ( A ) ( A ))+ T A A where Ai = b 000 T i ~ ( A T i ( ^ ( A ) ( A ))) 2 = 2 and ~ ( A ) liesonthelinesegment joining ^ ( A ) and ( A ) .Bythede˝nitionof ( A ) ,wehave T A [ b 0 ( p n X j =1 f j ) b 0 A ( A ))]=0 wehave ^ ( A ) ( A )= T A 0 A ) 1 T A ( y b 0 ( p n X j =1 A )) Therefore,wehave I 1 ( A )=( y y ) T 1 = 2 0 B A 1 = 2 0 ( y y )+ R 1 ;A where R 1 ;A = T A 1 = 2 0 B A 1 = 2 0 " .ByCauchy-Schwartzinequality,wehave j R 1 ;A jk B A 1 = 2 0 " k 2 k 1 = 2 0 A k 2 ( k B T 1 = 2 0 " k 2 + k ~ R 1 ;A k 2 ) k 1 = 2 0 A k 2 192 where ~ R 1 ;A =( B A B 0 ) 1 = 2 0 " .Observethat 0 = E [ T ] and tr ( B T B T )= m n s n ,take n !1 ,byMarkov'sinequality,wehave P k B T 1 = 2 0 " k 2 p m n s n n 1 m n s n n E [ k B T 1 = 2 0 " k 2 2 ] = 1 m n s n n tr f B T 1 = 2 0 E [ "" T ] 1 = 2 0 B T g = 1 n ! 0 Thenwehave k B T 1 = 2 0 " k 2 = O P ( p m n s n n ) (A.61) BylemmaA.9,wehave ( j A jj T j ) 1 = 2 k ~ R 1 ;A k 2 = O P ( m 1 = 2 n ( n log p n ) ˘ ) (A.62) Finally,wehave k 1 = 2 0 A k 2 C k A k 2 C n X i =1 j T i ( ^ ( A ) ) ( A )) j 4 ! 1 = 2 C n X i =1 k iA k 4 2 k ^ ( A ) ) ( A ) k 4 2 ! 1 = 2 Cm n j A j n 1 = 2 k ^ ( A ) ) ( A ) k 2 2 = m 2 n j A j 2 O P 2 n L 2 n log p n p n (A.63) 193 Combining(A.61),(A.62)and(A.63),wehave I 1 ( A )=( y y ) T 1 = 2 0 B A 1 = 2 0 ( y y )+ O P j A j 5 = 2 m 5 = 2 n 5 = 2 n L 2 n (log p n ) 1+ ˘= 2 p n ! Thenwelookat I 2 ( A ) .Wehave I 2 ( A )= 1 2 ( ^ ( A ) ( A )) T T 0 ^ ( A ) ( A )) = 1 2 ( y y ) T 1 = 2 0 B A 1 = 2 0 ( y y )+ 1 2 R 2 ;A R 1 ;A where R 2 ;A = A 1 = 2 0 B A 1 = 2 0 A C k A k 2 2 Cm 2 n j A j 2 n k ^ ( A ) ( A ) k 4 2 = O m 2 n j A j 4 4 n L 4 n m 2 n (log p n ) 2 n 2 n = O j A j 4 m 4 n 4 n L 4 n (log p n ) 2 n Therefore, I 2 ( A )= 1 2 ( y y ) T 1 = 2 0 B A 1 = 2 0 ( y y )+ O P j A j 5 = 2 m 5 = 2 n 5 = 2 n L 2 n (log p n ) 1+ ˘= 2 p n ! + O j A j 4 m 4 n 4 n L 4 n (log p n ) 2 n 194 Finally,wehavefor I 3 ( A ) that j I 3 ( A ) j Cn j A j 3 = 2 m 3 = 2 n k ^ ( A ) ( A ) k 3 2 = O P j A j 3 m 3 n 3 n L 3 n (log p n ) 3 = 2 p n ! Combiningthethreeresultsfor I 1 ( A ) , I 2 ( A ) and I 3 ( A ) ,wegetthedesiredresult. ProofofTheorem2.6 Proof. Part(i).Because ^ istheminimizer,wehave k y ˚ ^ k 2 2 + p X j =1 1 k ^ j k 2 6 =0 k y 0 k 2 2 + p X j =1 1 k 0 j k 2 6 =0 (A.64) i.e. 
k y ˚ ^ k 2 2 + ^ k k y 0 k 2 2 + n (A.65) Observethat k y ˚ ^ k 2 2 = k y 0 + 0 ˚ ^ k 2 2 = k y 0 k 2 2 + " T ˚ ( 0 ^ )+ k ˚ ( 0 ^ ) k 2 2 where " = y 0 = y f + f 0 := " + Wehave k ˚ ( 0 ^ ) k 2 2 j " T ˚ ( ^ 0 ) j + ( s n ^ k ) j " T ˚ ( ^ 0 ) j (A.66) 195 Byassumptionsoneigenvaluesof ˚ ,wehave k 0 ^ k 2 2 2 s n 2 m 1 n k 0 ^ k 2 2 sinceboth 0 and ^ aresparse.Ontheotherhand,wehave j " T ˚ ( ^ 0 ) j ( y f ) T ˚ ( ^ 0 ) j + j ( f 0 ) T ˚ ( ^ 0 ) j 2 s n 2 m 1 n 2 k ^ 0 k 2 2 + 1 2 s n 2 m 1 n k ˚ ( y f ) k 2 2 + 2 s n 2 m 1 n 4 k ^ 0 k 2 2 + 2 2 s n 2 m 1 n k f 0 k 2 2 Combinethiswiththepreviousargument,wehave k ^ 0 k 2 2 4 c 2 n 2 2 s n 2 m 1 n k ˚ ( y f ) k 2 2 + 8 c 2 2 s n 2 m 1 n k f 0 k 2 2 (A.67) Bythesameargumentsintheproofoftheorem2.1,wehave k ^ 0 k 2 2 = O p s n 2 s n 2 m 2 n log( s n m n ) n + O ( s 2 n 2 s n 2 m 1 2 d n ) (A.68) Part(ii).From(A.66),wehave ( ^ k s n ) j " T ˚ ( ^ 0 ) j 1 n k ˚ ( y f ) k 2 2 + k f 0 k 2 2 + n k ^ 0 k 2 2 (A.69) Since ^ k s n ,wehave j ^ k s n j O ( s n m n log( p n m n ) )+ O ( ns 2 n m 2 d n ) ! 0 asn !1 (A.70) 196 if s n m n log( p n m n ) ! 0 and ns 2 n m 2 d n ! 0 . ProofofTheorem2.7 Proof. Part(i).Because ^ istheminimiser,wehave k y ˚ ^ k 2 2 + 1 p X j =1 k ^ j k 2 + 2 p X j =1 1 k ^ j k 2 6 =0 k y 0 k 2 2 + 1 p X j =1 k 0 j k 2 + 2 p X j =1 1 k 0 j k 2 6 =0 (A.71) Rearrangingthetermsandusethetheoremconditions,wehave 1 p X j =1 ( k ^ j k 2 k 0 j k 2 ) j " T ˚ ( ^ 0 ) j + k ˚ ( 0 ^ k 2 2 Followingthesameargumentsasin(A.26),wehavethedesiredresults. Part(ii).Frompart(i),wehave ( ^ k s n ) j " T ˚ ( ^ 0 ) j + 1 p X j =1 ( k 0 j k 2 k ^ j k 2 ) 1 n k ˚ ( y f ) k 2 2 + k f 0 k 2 2 + n k ^ 0 k 2 2 + 1 p s n k ^ 0 k 2 Thentheresultsfollowsfromthetheoremconditions. 197 APPENDIXB TechnicalDetailsandSupplementary MaterialsforChapter3. ProofofTheorem3.1 Inthissection,wegivetheprooffortheorem3.1.Toprovethetheorem,weneedthe followinglemmas. LemmaB.1. Forageneralclassi˝er ^ C ( x ) of y ,denote C ( x ) theBayesclassi˝er.Thenwe have R ( ^ C ) R ( C ) 2 E X [ j ˙ ( ( X )) ˙ (^ ( X )) j ] 2 E X [ j ( X ) ^ ( X ) j ] 198 Proof. Bythede˝nitionof R ( C ) and R ( C ) ,wehave R ( ^ C ) R ( C ) = E X ;Y h 1 f ^ C ( X ) 6 = Y g i E X ;Y h 1 f C ( X ) 6 = Y g i = E X E Y j X h 1 f ^ C ( X ) 6 = Y g 1 f C ( X ) 6 = Y g i = E X h 1 f ^ C ( X )=0 g ( X )+ 1 f ^ C ( X )=1 g (1 ( X )) 1 f C ( X )=0 g ( X ) 1 f C ( X )=1 g (1 ( X )) i = E X h 1 f ^ C ( X ) 6 = C ( X ) g j 2 ( X ) 1 j i = E X h 1 f ^ C ( X )=1 ;C ( X )=0 or ^ C ( X )=0 ;C ( X )=1 g j 2 ( X ) 1 j i =2 E X h 1 f ˙ (^ ( X )) 1 = 2 ;˙ ( ( X )) < 1 = 2 or f ˙ (^ ( X )) < 1 = 2 ;˙ ( ( X )) 1 = 2 g j ( X ) 1 = 2 j ] 2 E X [ j ˙ ( ( X )) ˙ (^ ( X )) j ] Forthesecondinequality,considertheTaylorexpansionof ˙ ( ( X )) at ^ ( X ) ,wehave E X [ j ˙ ( ( X )) ˙ (^ ( X )) j ] = E X j ˙ 0 ( ( X ))( ( X ) ^ ( X ) j E X [ j ( X ) ^ ( X ) j ] where ( X ) liesonthelinejointing ( X ) and ^ ( X ) ,andthesecondinequalityfollowsfrom thefactthat ˙ 0 ( x )=exp( x ) = (1+exp( x )) 2 1 . 199 LemmaB.2. Underassumptions,wehave 1 n k ^ 0 k 2 2 = O P log( n ) n ~ 2 + O 1 n ~ 2 m + O s 2 2 n ~ 4 + O P n 1 m 9 = 2 s 5 = 2 p log p where ^ and 0 arethevectorsofpredictionsforthesample x 1 ;:::; x n usingtheestimated parametersandthetrueparameters,respectively.Herethefourtermscomefromtheesti- mation,theneuralnetworkapproximation,theregularizationandtheexcesslosserrorby thesparsegrouplassoregularizationon ,respectively. Proof. 
Bythede˝nitionof ^ ,wehave 1 n n X i =1 [ y i ^ ( x i ) log(1+exp(^ ( x i )))] + p X j =1 k ^ ( j ) k 2 + (1 ) k ^ k 1 1 n n X i =1 h y i 0 ( x i ) log 1+exp( 0 ( x i )) + p X j =1 k 0 ( j ) k 2 + (1 ) k 0 k 1 200 RearrangethetermsandbyTaylorexpansion,wehave 1 n ( ^ 0 ) T ( y 0 )+ 1 2 n ( ^ 0 ) T 0 ( ^ 0 ) 1 n n X i =1 @ 3 l @ ( x i ) (^ i 0 i ) 3 s n X j =1 h ( k 0 ( j ) k 2 k ^ ( j ) k 2 )+(1 )( k ^ ( j ) k 1 0 ( j ) k 1 ) i p X j = s n +1 h k ^ ( j ) k 2 +(1 )( k ^ 0 ( j ) k 1 i s n X j =1 h ( k 0 ( j ) k 2 k ^ ( j ) k 2 )+(1 )( k ^ ( j ) k 1 k 0 ( j ) k 1 ) i (B.1) where 0 isdiagonalmatrixwith 0 ii =exp( 0 ( x i )) = [1+exp( 0 ( x i ))] 2 , 0 istheconditional expectationof y given X intheneuralnetworkapproximationspace, liesontheline joining ^ and 0 .Considerthesecondordertermin(B.1),byassumption3.2,wehave 1 2 n ( ^ 0 ) T 0 ( ^ 0 ) ~ 2 2 n k ^ 0 k 2 2 Thenthe˝rsttermontheLHSof(B.1),by u T v k u k 2 2 = 4+ k v k 2 2 ,norminequality,maximal inequality(seeproofin[65])andresultin[113], 1 n j ( ^ 0 ) T ( y 0 ) j 1 n j ( ^ 0 ) T ( y ) j + 1 n j ( ^ 0 ) T ( 0 ) j ~ 2 4 n k ^ 0 k 2 2 + 1 n ~ 2 k y k 2 2 + ~ 2 8 n k ^ 0 k 2 2 + 2 n ~ 2 k 0 k 2 2 = 3~ 2 8 n k ^ 0 k 2 2 + O P log( n ) n ~ 2 + O 1 n ~ 2 m 201 Combinethisresultand(B.1),wehave ~ 2 8 n k ^ 0 k 2 2 1 n n X i =1 @ 3 l @ ( x i ) 3 (^ i 0 i ) 3 s n X j =1 h ( k 0 ( j ) k 2 k ^ ( j ) k 2 ) +(1 )( k ^ ( j ) k 1 k 0 ( j ) k 1 ) i (B.2) Notethat @ 3 l @ ( x i ) 3 = exp( ( x i ))[1 exp( ( x i ))] [1+exp( ( x i ))] 3 1 Thenapplynorminequality,(B.2)becomes ~ 2 8 n k ^ 0 k 2 2 1 n k ^ 0 k 3 2 s n X j =1 h ( k 0 ( j ) k 2 k ^ ( j ) k 2 )+(1 )( k ^ ( j ) k 1 0 ( j ) k 1 ) i + O P log( n ) n ~ 2 + O 1 n ~ 2 m (B.3) Applytheauxiliarylemmain[118],seealso[53],wehave k ^ 0 k 2 2 =nC 2 0 s n X j =1 h ( k 0 ( j ) k 2 k ^ ( j ) k 2 )+(1 )( k ^ ( j ) k 1 0 ( j ) k 1 ) i + O P log( n ) n ~ 2 + O 1 n ~ 2 m (B.4) 202 where C 0 =max 1 0 p n ; R 2 0 p n ! 0 = ~ 2 16 forsomeconstant R ,andsome a 0 thatdependson 0 , s n and K .Thenbynorminequalities, the˝rsttermofRHSof(B.4)becomes s n X j =1 h ( k 0 ( j ) k 2 k ^ ( j ) k 2 )+(1 )( k ^ ( j ) k 1 0 ( j ) k 1 ) i s n X j =1 h k 0 ( j ) ^ ( j ) k 2 +(1 ) k 0 ( j ) ^ ( j ) k 1 i s n X j =1 [ + p m (1 )] k 0 ( j ) ^ ( j ) k 2 1 2 2 s [ + p m (1 )] 2 C 2 0 + 1 2 C 2 0 k 0 S ^ S k 2 2 (B.5) Accordingto[53],whenwechoose 2 T ~ forsomeconstant T 1 and ~ = c p m log n=n ( p log Q + p m log p log( nm ) = (1 + = p m )) ,wehave 1 2 C 2 0 k 0 S ^ S k 2 2 ~ _ " ( ^ ; ^ ; ^ t ; ^ b ) = O n 1 m 9 = 2 s 5 = 2 p log p (B.6) 203 withprobabilityatleast 1 p m n 2 c log n exp( T 2 ( p log Q + p m log p log( nm ) = (1 + = p m )) 2 c 1 :=1 P 1 Combiningtheresultsin(B.4),(B.5)and(B.6),wehave 1 n k ^ 0 k 2 2 = O P log( n ) n ~ 2 + O 1 n ~ 2 m + O s 2 2 n ~ 4 + O P n 1 m 9 = 2 s 5 = 2 p log p Proofoftheorem3.1 Proof. BylemmaB.1,wehave R ( ^ C ) R ( C ) 2 E X [ j ( X ) ^ ( X ) j ] Thusitsu˚cestobound E X [ j ( X ) ^ ( X ) j ] ,orequivalently, j ( X ) ^ ( X ) j inprobability. Let W betherandomvariable j ( X ) ^ ( X ) j accordingto P X ,then j ( X ) ^ ( X ) j isa vectorof n i.i.d.copiesof W ,denoted W 1 ;:::;W n .BylemmaB.2,wehave 1 n n X i =1 W 2 i = O log( n ) n ~ 2 + O 1 n ~ 2 m + O s 2 2 n ~ 4 + O n 1 m 9 = 2 s 5 = 2 p log p 204 withprobabilityatleast 1 P 2 forsome P 2 ! 0 as n !1 .Withproperchoiceof n , p and otherhyper-parameters,wehave 1 n n X i =1 W 2 i P ! 0 asn !1 (B.7) Since X 2X forsomeboundedspace X ,bytheweaklawoflargenumbers,wehave 1 n n X i =1 W 2 i P ! 
E X [ W 2 ] asn !1 Byde˝nition,wehaveforany > 0 , P 1 n n X i =1 W 2 i E X [ W 2 ] > ! ! 0 asn !1 Combinethiswith(B.7),wehave E X [ W 2 ] ! 0 asn !1 ThenbyJensen'sinequality,wehave E X [ W ] E X [ W 2 ] 1 = 2 ! 0 asn !1 Therefore,wehave R ( ^ C ) R ( C ) 2 E X [ W ] ! 0 asn !1 205 APPENDIXC TechnicalDetailsandSupplementary MaterialsforChapter4 . ProofofResultsinChapter4 Inthissection,wewillprovidetheproofofthetheoremsinsection4.3. ProofofProposition4.1 Proof. Considerindependentobservations f ( x 1 ;y 1 ) ;:::; ( x n ;y n ) g .Assume x ( j ) ˘N ( 0 ; I ) ;j =1 ;:::;p Intheregressionsetupwhere y iscentered,wehave y j x 1 ;:::;x p ˘N ( 1 x 1 + ::: + p x p ;˙ 2 ) : 206 Withoutlossofgenerality,weassumethat j 1 jj 2 j ::: j s j otherwise,wemayre-arrangetheorderofcolumnsofthedesignmatrix.Furthermore, withoutlossofgenerality,wemayassumeallcoe˚cientsarepositive,otherwise,wemay multiplythecorrespondingcolumnofthedesignmatrixby 1 .Since sc 2 )= P j x T (1) y j > j x T (2) y j (C.2) Let W 1 = x T (1) y andW 2 = x T (2) y (C.3) whicharebothnormallydistributed.Therefore, c 1 and c 2 followfoldednormaldistribution c 1 ˘ FN ( 1 ;˙ 2 ) andc 2 ˘ FN ( 2 ;˙ 2 ) (C.4) Wecancalculate Cov ( W 1 ;W 2 )= Cov ( 1 + x T (1) y ; 2 + x T (2) y )= ˙ 2 x T (1) x (2) =0 (C.5) Becauseboth W 1 and W 2 arenormallydistributed, W 1 and W 2 areindependent.Therefore, c 1 and c 2 areindependent.Sinceboth c 1 and c 2 arepositive,theprobabilityisequivalent to P ( c 1 >c 2 )= P c 1 c 2 > 1 (C.6) Let c 12 = c 1 c 2 Thenwehave c 12 ˘ RN ( 1 ; 2 ;˙ 2 ;˙ 2 ) (C.7) 209 whereRNstandsfortheratiooffoldednormaldistributions.Bytheorem3.1in[71],we havetheCDFof c 12 F 12 ( x )=2 L 1 2 x ˙ p 1+ x 2 ; 2 ˙ ; x p 1+ x 2 +2 L 1 + 2 x ˙ p 1+ x 2 ; 2 ˙ ; x p 1+ x 2 + 1 2 x ˙ p 1+ x 2 + 1 + 2 x ˙ p 1+ x 2 2 (C.8) where L ( a;b;ˆ )= P ( X 1 >a;X 2 >b ) (C.9) with 2 6 4 X 1 X 2 3 7 5 ˘N 0 B @ 0 ; 2 6 4 1 ˆ ˆ 1 3 7 5 1 C A Thenwehave P ( c 1 0 bymultiply 1 tothosewhicharenegative,wehave theabsolutevaluesbackon j j j .Thisisalsotruefordi˙erent i and j ,sincewedidnotuse thedi˙erencebetweenthenonzeropredictorsandzeropredictors.Bytheexchangeability 210 ofpredictors,theresultholdsforall i and j .Therefore,wehave P ( c j max i 6 = k c i g ;k =1 ;:::;s (C.12) It'seasytoobservethat E k 'saremutuallyexclusive.Therefore,wehave Pr ( Atleastoneofc 1 ;:::;c s isgreaterthanallofc s +1 ;:::;c p ) = Pr s [ k =1 E k ! 
= s X k =1 Pr ( E k ) (C.13) Wemaycalculate Pr ( E k )= Pr ( c k >c ( k;p 1) ) where c ( k;p 1) isthelargestorderstatisticof c 1 ;:::c ( k 1) ;c ( k +1) ;c ( p ) ,whichisindependent of c k .Let F ( k;p 1) and f ( k;p 1) betheCDFandPDFof c ( k;p 1) ,respectively,wehave F ( k;p 1) ( x )= p Y j 6 = k F j ( x ) and f ( k;p 1) ( x )= @ @x p Y j 6 = k F j ( x ) 212 wherefromthepropertiesoffoldednormaldistributionwehave F k ( x )= 1 2 erf x + j k j p 2 ˙ 2 + erf x j k j p 2 ˙ 2 and f k ( x )= @ @x F k ( x )= r 2 ˇ˙ 2 e x 2 + 2 k 2 ˙ 2 cosh k x ˙ 2 Thenwehave Pr ( E k ) = Pr ( c k >c ( k;p 1) ) = Z 1 0 Pr ( c k >x ) f ( k;p 1) ( x ) dx = Z 1 0 [1 F k ( x )] @ @x p Y j 6 = k F j ( x ) dx = 2 4 [1 F k ( x )] p Y j 6 = k F j ( x ) 3 5 1 0 + Z 1 0 f k ( x ) p Y j 6 = k F j ( x ) dx = Z 1 0 f k ( x ) p Y j 6 = k F j ( x ) dx (C.14) wherethesecondequalityisbytheconvolutionformula,thefourthequalityisbyintegration byparts.Therefore, Pr ( Atleastoneofc 1 ;:::;c m isgreaterthanallofc s +1 ;:::;c p ) = s X k =1 Z 1 0 f k ( x ) p Y j 6 = k F j ( x ) dx (C.15) 213 Nextwewillshowthatthisprobabilityisactuallyaveryhighprobability.Let p k = Z 1 0 f k ( x ) p Y j 6 = k F j ( x ) dx = E k 2 4 p Y j 6 = k F j ( X ) 3 5 Bytheformulasfor F k and f k ,wehave p k = Z 1 0 r 2 ˇ˙ 2 e x 2 + 2 k 2 ˙ 2 cosh k x ˙ 2 Y j 6 = k 1 2 erf x + j p 2 ˙ 2 + erf x j p 2 ˙ 2 dx = Z 1 0 r 1 2 ˇ˙ 2 2 4 e ( x + k ) 2 2 ˙ 2 + e ( x k ) 2 2 ˙ 2 3 5 Y j 6 = k 1 2 erf x + j p 2 ˙ 2 + erf x j p 2 ˙ 2 dx (C.16) Dochangeofvariable z = x=˙ ,wehave p k = Z 1 0 1 p 2 ˇ 2 4 e ( z + k ˙ ) 2 2 + e ( z k ˙ ) 2 2 3 5 Y j 6 = k 1 2 2 4 erf 0 @ z + j ˙ p 2 1 A + erf 0 @ z j ˙ p 2 1 A 3 5 dz (C.17) 214 Let ~ k = k =˙ ,withoutlossofgenerality,assumethat 1 = 0 1 ::: p p +1 =0 , wehave p k = Z 1 0 1 p 2 ˇ " e ( z + ~ k ) 2 2 + e ( z ~ k ) 2 2 # Y j 6 = k 1 2 " erf z + ~ j p 2 ! + erf z ~ j p 2 !# dz = p X i =0 Z i i +1 1 p 2 ˇ " e ( z + ~ k ) 2 2 + e ( z ~ k ) 2 2 # Y j 6 = k 1 2 " erf z + ~ j p 2 ! + 1 f j i +1 g erf z ~ j p 2 ! 1 f j i g erf ~ j z p 2 !# dz (C.18) Bytheexponentialapproximationoftheerrorfunction,seeforexample[131],thereexist c 1 and c 2 suchthat sup x> 0 j erf ( x ) (1 exp[ c 1 x c 2 x 2 ]) j canbearbitrarilysmall,whereapproximately c 1 ˇ 1 : 095 and c 2 ˇ 0 : 7565 .Considerthis approximation,wehave p k = p X i =0 Z i i +1 1 p 2 ˇ " e ( z + ~ k ) 2 2 + e ( z ~ k ) 2 2 # Y j 6 = k 1 2 8 > < > : 1+ 1 f j i +1 g 1 f j i g e c 2 1 4 c 2 2 6 4 e c 2 2 z + ~ j + c 1 p 2 c 2 2 + 1 f j i +1 g e c 2 2 z + ~ j + c 1 p 2 c 2 2 1 f j i g e c 2 2 z ~ j + c 1 p 2 c 2 2 3 7 5 9 > = > ; (C.19) Here e c 2 1 4 c 2 ˇ 1 : 48 >> 1 215 Observethataswhen i = s ,alsoobservethat ˝ n ! 0 indicates max j = s +1 ;:::;p j ! 0 ,we have s Y j =1 ;j 6 = k 1 2 e c 2 1 4 c 2 2 6 4 e c 2 2 z j + c 1 p 2 c 2 2 e c 2 2 z + j + c 1 p 2 c 2 2 3 7 5 ! 
0 ass !1 Therefore,theformulaof p k canbesimpli˝edto p k = o 0 @ 1 2 s e sc 2 1 4 c 2 1 A + s X i =0 Z i i +1 1 p 2 ˇ " e ( z + ~ k ) 2 2 + e ( z ~ k ) 2 2 # Y j 6 = k 1 2 8 > < > : 1+ 1 f j i +1 g 1 f j i g e c 2 1 4 c 2 2 6 4 e c 2 2 z + ~ j + c 1 p 2 c 2 2 + 1 f j i +1 g e c 2 2 z + ~ j + c 1 p 2 c 2 2 1 f j i g e c 2 2 z ~ j + c 1 p 2 c 2 2 3 7 5 9 > = > ; o 0 @ 1 2 s e sc 2 1 4 c 2 1 A + s X i =0 0 @ 1 2 e c 2 1 4 c 2 1 A s i 1 2 s h ~ i ~ k ) ~ i +1 ~ k )+ ~ i + ~ k ) ~ i +1 + ~ k ) i (C.20) where isthenormalCDFandtheinequalityisbyobserving e x 2 1 andtheterminthebracketislessthan 2 when j i +1 .Thensummingup p 0 k s andobserving thedoublesumisnotconvergingtozerosinceitconsistsofageometriccomponent,when 216 max isnotbigenoughandlet s !1 ,wehave 1 s X k =1 p k 1 o 0 @ s 1 2 s e sc 2 1 4 c 2 1 A s X k =1 s X i =0 0 @ 1 2 e c 2 1 4 c 2 1 A s i 1 2 s h ~ i ~ k ) ~ i +1 ~ k )+ ~ i + ~ k ) ~ i +1 + ~ k ) i s X i =1 0 @ 1 2 e c 2 1 4 c 2 1 A s i s X k =1 1 2 s (1 max )) o 0 @ s 1 2 s e sc 2 1 4 c 2 1 A c o (1) (C.21) ProofofTheorem4.1 Proof. Inthisproof,wewillshowtheprobabilitythatthesamezeropredictorappearsin k baggingroundstendstozeroas k increases.Atthe˝rststep,wehave C = f 1 ;:::;p g and S = fg .Byproposition4.2weknowthattheprobabilitythatthe˝rstvariablebelongsto S 0 convergestooneundertheconditions. Atthe m th step,denotethecandidateset C m andtheselectedset S m .Assumethat SˆS 0 .Withoutlossofgenerality,consider ˙ 2 =1 .Ifnot,wemaydividetheresponseand coe˚cientsby ˙ .Considerthe˝rstcasethat C m \S 0 6 = ; Let C m \S 0 = f j 1 ;:::;j s 0 g .Bytheproofofproposition4.2,theprobabilitythatazero 217 variableisselectedisatmost P ( selectzerovariable ) 1 P j 2C m \S 0 e j P j 2C m e j =1 P j 2C m \S 0 e j P j 2C m \S 0 e j + P j 2C m \S C 0 e j = P j 2C m \S C 0 e j P j 2C m \S 0 e j + P j 2C m \S C 0 e j ( jC m j s 0 ) e ˝ n s 0 e n +( jC m j s 0 ) where jC m j = O ( p ) isthecardinalityof C m bytheoremcondition, min =min j =1 ;:::;s j , max =max j =1 ;:::;s j andbyassumption4.1 ˝ n = o ( n ) o ( min ) Ifwehave jC m j s 0 s 0 e ˝ n n ! 0 asn !1 (C.22) Thenwehave P ( selectzerovariable ) ! 0 asn !1 Inthiscase,theprobabilityoffalsepositiveintheENNSalgorithmgoestozero.However,it isnotalwaysthatequationC.22issatis˝ed.Ithappensthatthesignalstrengthofnonzero variableisnotbigenough.Thiscasecanbecombinedwiththeothercasethat C m \S 0 = ; 218 Inthiscase,itis(almost)guaranteedthatazerovariablewillbeselectedinthenextstep. However,wewillshowthatthoughazerovariableisselected,aslongasthenumberofzero variablein S isnottoobig,whichisguaranteedbythetheoremcondition s 0 Cs = o ( p ) theselectedzerovariablesindi˙erentroundsofthebaggingalgorithmtogetherwiththeneu- ralnetworkrandominitializationmaketheprobabilitythatthesamezerovariablesappears morethanthethresholdnumberoftimesconvergestozero.Nowwehave C m = f j 1 ;:::;j p 0 gˆf s +1 ;:::;p g Considerthescenariothatallbaggingroundsareindependent.Theresidual y ^ S m isnotrelatedto x j 1 ;:::;x j 0 p byassumption4.2,where S m istheselectedsetatthe m th step and ^ S m istheestimatedconditionalexpectationof y given x S m .Therefore,thevariables 219 x j , j 2C m areexchangeable.Wehave P ( j 2S m +1 \C m )= P 0 @ 1 B 2 B 2 X b =1 1 f c j c ( s 0 m j ) g p s 1 A = B 2 X k =[ B 2 p s ] B 2 k ( s 0 jS m j ) k ( p s 0 ) B 2 k ( p jS m j ) B 2 exp B 2 (1 p s )log (1 p s )( p jS m j ) p s 0 + p s log p s ( p jS m j ) s 0 jS m j (C.23) wherethelastinequalityisby[6].Sincewehave s 0 Cs = o ( p ) thenwehave P j 2S m +1 \C m ! 
0 assandB 2 !1 Considerthelastcasethatthereisatleastonevariablein C m thatisalsoin f 1 ;:::;s g and equationC.22doesnothold.Alsoconsiderthetruththatthebaggingroundsarenotfully independentinpractice.Considervariable j andtheestimator ^ s m j = 1 fk G mj k 2 t m g Conditioningontheobservations,thereexista˝xed t m suchthat ^ s m j indicateswhether variable j isnotselected( =1 )orselected( =0 ).Thebaggedestimatorisde˝nedas ^ s m j;B = E 1 fk G mj k 2 t m g 220 where G mj is G mj evaluatedonabootstrapsample.Bytheuniformlawoflargenumbers, seeforexample[61],wehave sup x ;y 1 B 2 B 2 X b =1 1 fk G mj;b k 2 t m g E 1 fk G mj k 2 t m g ! 0 asB 2 !1 (C.24) Let P B n betheempiricalmeasureofthebootstrapsample.It'seasytoverifythat ^ s m j isa smoothfunctionevaluatedat P B n .Byassumption4.3,wehaveindependentobservations. Thenaccordingto[58],seealso[19],thereexistanincreasingsequence f b n g n 2N suchthat b n ( k G mj k 2 c 0 ) !N (0 ;˙ 2 1 ) forsomeconstant c 0 < 1 and ˙ 2 1 < 1 .Byalgebra,inthe m th step,wehave k G mj k 2 = v u u t K X k =1 [ 2 n n X i =1 ( y i ^ i )^ a k ˙ 0 ( x T i ^ k + ^ t k ) x ij ] 2 = 2 n v u u t K X k =1 ^ a 2 k h ^ " T 0 ( x T i ^ k + ^ t k ) x ( j ) i 2 (C.25) wherein ^ k , ^ jk isestimatedfromdatafor j 2S m and ^ jk equalszerofor j 2C m , ^ i is theneuralnetworkestimateof y i basedon x S m , ^ " isthepredictionerrorbasedon x S m , 0 ( x T i ^ k + ^ t k ) isadiagonalmatrixconsistsof ˙ 0 ( ) evaluatedat x T i ^ k + ^ t k and x ( j ) isthe 221 j th columnof x .Wehave ^ " = y ^ f ( ^ ; ^ t ; ^ a ; ^ b; x S m ) = X j 2C m \f 1 ;:::;s g j x ( j ) + 2 4 X j 2S m \f 1 ;:::;s g j x ( j ) ^ f ( ^ ; ^ t ; ^ a ; ^ b; x S m ) 3 5 + " = X j 2C m \f 1 ;:::;s g j x ( j ) + " + O K 2 n r log( nK n ) n ! (C.26) Therefore,for j 2C m \f 1 ;:::;s g ,since x ( j ) isnormalizedand ˙ 0 ( ) 1 ,bynorminequality, wehave E k G mj k 2 ˇ E 2 6 4 2 n v u u u t K X k =1 ^ a 2 k 2 4 X j 0 2C m \f 1 ;:::;s g j 0 x T ( j 0 ) 0 ( x T i ^ k + ^ t k ) x ( j ) + " T 0 ( x T i ^ k + ^ t k ) x ( j ) 3 5 2 3 7 5 E 2 4 2 nK K X k =1 j ^ a k j X j 0 2C m \f 1 ;:::;s g j 0 x T ( j 0 ) 0 ( x T i ^ k + ^ t k ) x ( j ) + " T 0 ( x T i ^ k + ^ t k ) x ( j ) 3 5 c jC m \f 1 ;:::;s gj nK n (C.27) 222 For j 2C m \f s +1 ;:::;p g ,byJensen'sinequality,wehave E k G mj k 2 ˇ E 2 6 4 2 n v u u u t K X k =1 ^ a 2 k 2 4 X j 0 2C m \f 1 ;:::;s g j 0 x T ( j 0 ) 0 ( x T i ^ k + ^ t k ) x ( j ) + " T 0 ( x T i ^ k + ^ t k ) x ( j ) 3 5 2 3 7 5 2 n v u u u u t E 8 > < > : K X k =1 ^ a 2 k 2 4 X j 0 2C m \f 1 ;:::;s g j 0 x T ( j 0 ) 0 ( x T i ^ k + ^ t k ) x ( j ) + " T 0 ( x T i ^ k + ^ t k ) x ( j ) 3 5 2 9 > = > ; c 0 n p K (C.28) If n c 0 p K c ,wehave P min j 2C m \f 1 ;:::;s g E k G mj k 2 max j 2C m \f s +1 ;:::;p g E k G mj k 2 ! ! 1 Sincewehave s 0 Cs = o ( p ) ,taking t m tobethe ( jC m jjS m j ) th smallestvalueof k G mj k 2 ;j 2C m ,combinethiswithequationC.24,for j 2C m \f s +1 ;:::;p g ,wehave 1 B 2 B 2 X b =1 1 fk G mj;b k 2 t m g E 1 fk G mj k 2 t m g + ! 
b n ( t m c 0 ) ˙ 1 Z asnandB 2 !1 (C.29) wheretheresultisby[19], Z isstandardnormalrandomvariableand ) isthestandard normalCDF.Observethat b n isadivergingsequenceand s 0 Cs = o ( p ) ,thenwehavethe 223 probabilitythatazerovariableisselected P j 2C m \S m +1 c \f s +1 ;:::;p g = P j 2C m \S m +1 c \f s +1 ;:::;p g E k G mj k 2 t m P E k G mj k 2 t m + P j 2C m \S m +1 c \f s +1 ;:::;p g E k G mj k 2 t m P E k G mj k 2 t m P j 2C m \S m +1 c \f s +1 ;:::;p g E k G mj k 2 t m + P E k G mj k 2 t m ˇ 1 b n ( t m E k G mj k 2 ) ˙ 1 Z + s 0 jS m jjC m \f 1 ;:::;s gj p jS m j ! 0 asn !1 andB 2 !1 (C.30) Therefore,thefalsepositiverateoftheENNSalgorithmgoestozero. Intheclassi˝cationsetup,wehavesimilarlyforequationC.25that k G mj k 2 = v u u t K X k =1 [ 1 n n X i =1 ( y i ^ i )^ a k ˙ 0 ( x T i ^ k + ^ t k ) x ij ] 2 = 1 n v u u t K X k =1 ^ a 2 k h ^ " T 0 ( x T i ^ k + ^ t k ) x ( j ) i 2 (C.31) wherein ^ k , ^ jk isestimatedfromdatafor j 2S m and ^ jk equalszerofor j 2C m , ^ i isthe neuralnetworkestimateofthemeanof y i basedon x S m ,i.e. ^ i = ˙ 0 @ K X k =1 ^ k ˙ ( T k x i + t k )+ b 1 A ; ^ " isthepredictionerrorbasedon x S m , 0 ( x T i ^ k + ^ t k ) isadiagonalmatrixconsistsof ˙ 0 ( ) evaluatedat x T i ^ k + ^ t k and x ( j ) isthe j th columnof x .Theonlydi˙erencebetween theregressionsetupandtheclassi˝cationsetupistheformulaforthemean.UseTaylor 224 expansionwithLagrangeremainder,wehave ˙ 0 @ s X j =1 j x j 1 A = ˙ 0 @ X j 2C m \f 1 ;:::;s g j x j 1 A + ˙ 0 0 @ X j 2S m \f 1 ;:::;s g j x j + ˘ X j 2f 1 ;:::;s g = S m j x j 1 A X j 2f 1 ;:::;s g = S m j x j forsome ˘ 2 (0 ; 1) and 0 <˙ 0 0 @ X j 2S m \f 1 ;:::;s g j x j + ˘ X j 2f 1 ;:::;s g = S m j x j 1 A <˙ 0 @ X j 2S m \f 1 ;:::;s g j x j + ˘ X j 2f 1 ;:::;s g = S m j x j 1 A < 1 Then ^ " = " + ˙ 0 0 @ X j 2S m \f 1 ;:::;s g j x j + ˘ X j 2f 1 ;:::;s g = S m j x j 1 A X j 2f 1 ;:::;s g = S m j x j + O K 2 n r log( nK n ) n ! where " isthetheoreticalerrorofBernoullidistributionwiththeirmeans.Wedon'thavea directcontrolon " ,butbyCauchy-Schwarzinequalitywehaveforany > 0 that P 1 n " T 0 ( x T i ^ k + ^ t k ) x ( j ) > P 1 n max i 0 ( x T i ^ k + ^ t k ) " k 2 k x ( j ) k 2 > P 1 n k " k 2 > e 2 = 8 (C.32) 225 Therefore,theonlydi˙erencebetweenclassi˝cationandregressionisthe˝rstorderapprox- imationterm ˙ 0 0 @ X j 2S m \f 1 ;:::;s g j x j + ˘ X j 2f 1 ;:::;s g = S m j x j 1 A Observethatthistermonlydependsonthetruerelationshipandisindependentofany j 2C m ,thereforecanbetreatedasaconstantwhencomparing k G mj k 2 .This˝nishesthe prooffortheclassi˝cationcase. ProofofTheorem4.2 Proof. 
Intheproofoftheorem4.1,wehaveprovedthatwithprobabilitytendingto 1 ,the algorithmwon'tselectanyzerovariables.Therefore,hereitsu˚cestoshowthatthemodel willbeabletoincludeallnonzerovariablesinthemodel.Thoughitlookscomplicated,we onlyneedtoconsidertheworstcase: S m = f 1 ;:::;s 1 g and C m = f s;s +1 ;:::;p g andprovethatvariable s willbeselectedinthenextstep,sincevariable s hasthesmallest truecoe˚cient s among f 1 ;:::;s g andthusallothercaseshavegreaterprobabilitytoselected anonzerovariable.Notevariable s willbeselected s 2S m +1 ()k G ms k 2 =max j 2C m k G mj k 2 Nowwehave k G mj k 2 = 2 n v u u t K X k =1 ^ a 2 k h ^ " T 0 ( x s 1 i T ^ s 1 k + ^ t k ) x ( j ) i 2 226 where x s 1 i isthe˝rst s 1 th elementsin x i , ^ s 1 k isestimatedfromdataasthecoe˚cientof x s 1 i , ^ i istheneuralnetworkestimateof y i basedon x s 1 , ^ " isthepredictionerrorbasedon x s 1 , 0 ( x s 1 i T ^ s 1 k + ^ t k ) isadiagonalmatrixconsistsof ˙ 0 ( ) evaluatedat x s 1 i T ^ s 1 k + ^ t k and x ( j ) isthe j th columnof x . Hereweneedtheprobabilitythat k G ms k 2 beingthegreatestamongallcandidatesto beverybig,sothatitwillnotbemissedintheensemble˝ltering.For j 2f s +1 ;:::;p g ,we have P k G ms k 2 > k G mj k 2 = P 0 @ K X k =1 ^ 2 k h ^ " T 0 ( x s 1 i T ^ s 1 k + ^ t k ) x ( s ) i 2 > K X k =1 ^ 2 k h ^ " T 0 ( x s 1 i T ^ s 1 k + ^ t k ) x ( j ) i 2 1 A = P 0 @ K X k =1 ^ " T ^ k 0 ( x s 1 i T ^ s 1 k + ^ t k ) x ( s ) 2 ^ " T ^ k 0 ( x s 1 i T ^ s 1 k + ^ t k ) x ( s ) 2 > 0 1 A (C.33) Intheregressionsetup,observethat max i 0 ii ( x s 1 i T ^ s 1 k + ^ t k )=max i ˙ ( x s 1 i T ^ s 1 k + ^ t k ) 1+exp( x s 1 i T ^ s 1 k + ^ t k ) max i ˙ ( x s 1 i T ^ s 1 k + ^ t k ) 1 and ^ " = s x ( s ) + " + O K 2 n r log( nK n ) n ! Alsobythefactthat A = ) B = ) P ( A ) P ( B ) 227 wehaveforregressionthat P k G ms k 2 > k G mj k 2 = P 0 @ K X k =1 " s x ( s ) + " T k x ( s ) 2 s x ( s ) + " T k x ( j ) 2 # O KK 2 n r log( nK n ) n !! = P 0 @ K X k =1 2 s x T ( s ) k x ( s ) 2 x T ( s ) k x ( j ) 2 +2 s h x T ( s ) x ( s ) " T ( x ( s ) x ( j ) ) i + " T x ( s ) 2 " T x ( j ) 2 O KK 2 n r log( nK n ) n !! P 0 @ K X k =1 c 0 s h " T ( x ( s ) x ( j ) ) i + " T x ( s ) 2 " T x ( j ) 2 cK 2 s + O KK 2 n r log( nK n ) n !! (C.34) Observebyassumption4.3that x ( s ) and x ( j ) areindependentandidenticallydistributed, wehave P " T x ( s ) 2 > " T x ( j ) 2 = 1 2 228 Therefore,wehave P k G ms k 2 > k G mj k 2 P c 0 s h " T ( x ( s ) x ( j ) ) i + " T x ( s ) 2 " T x ( j ) 2 2 s + O K 2 n r log( nK n ) n !! K ! s k ( x ( s ) x ( j ) ) k 2 ! K (1 n ) 1 = ( p s ) asn !1 (C.35) underthetheoremconditionsforsomeasymptoticallynegligiblesequence n > 0 .Then considerthebaggingprocess,similartoC.30,accordingtotheorem6in[13],wehave P s= 2C m \S m +1 \C m +1 c = P 0 @ 1 B 2 B 2 X b =1 1 fk G ms k 2 6 =max j 2C m k G mj k 2 g 1 p r 1 A 1 1 p r E 2 4 1 B 2 B 2 X b =1 1 fk G ms k 2 6 =max j 2C m k G mj k 2 g 3 5 n 1 p r ! 0 asn !1 andB 2 !1 (C.36) wherethe˝rstinequalityisbyMarkov'sinequality,andthesecondinequalityisbyC.35. Therefore,theprobabilitythatvariable s willnotenterthemodelinthenextsteptendsto zero,thuswithprobabilitytendingto1,allnonzerovariablesareselectedintheregression setup. 229 Considertheclassi˝cationcase,wehavethesameasinregressionbut ^ " = " + ˙ 0 0 @ X j 2S m \f 1 ;:::;s g j x j + ˘ X j 2f 1 ;:::;s g = S m j x j 1 A X j 2f 1 ;:::;s g = S m j x j + O K 2 n r log( nK n ) n ! 
bytheproofoftheorem4.1,where " isthetheoreticalerrorofBernoullidistributionwith theirmeans.Herewenolongerhavethenormalityandhaveanextraterm ˙ 0 whichcanbe treatedasconstantinthisstep,butbythecentrallimittheoremwehave p n ( x ( s ) x ( j ) ) T " T ) N (0 ; V ) where V isboundedbyassumption4.4andthefactthat isdiagonalwiththelargest elementlessthan1.FeedingthisbackintoC.35,wehave P ( k G ms k 2 > k G mj k 2 ) (1 0 n ) 1 = ( p s ) where 0 n isgreaterthan n uptoafactorofconstantbutstillconvergestozeroas n !1 , undertheoremconditions.ThensimilartoC.36,wehave P s= 2C m \S m +1 \C m +1 c 0 n 1 p r ! 0 asn !1 andB 2 !1 This˝nishestheprooffortheclassi˝cationcase. 230 ProofofTheorem4.3 Proof. Inthissubsection,weprovetheestimationandpredictionofregressionandclassi˝- cation,respectively.Intheregressionsetup,underassumption4.2,wehave y = f ( x )+ = f ( x S )+ Wehave P E Z j f n ( x ^ S ) f ( x S ) j 2 ( dx ) ! 0 = P E Z j f n ( x ^ S ) f ( x S ) j 2 ( dx ) ! 0 ^ S = S P ^ S = S + P E Z j f n ( x ^ S ) f ( x S ) j 2 ( dx ) ! 0 ^ S6 = S P ^ S6 = S P E Z j f n ( x ^ S ) f ( x S ) j 2 ( dx ) ! 0 ^ S = S P ^ S = S = P E Z j f n ( x S ) f ( x S ) j 2 ( dx ) ! 0 P ^ S = S (C.37) Observethat ^ j j 1 K n Accordingto[61],whenweperformaneuralnetworkestimationonthetruesubsetof variables,wehavethatthetotalerrorisboundedbytheapproximationerror,whichis boundedaccordingto[43],plustheestimationerror,whichisboundedbythecovering number,thenbythepackingnumber,thenbytheVapnik-Chervonenkisdimension,and 231 ˝nallybythespacedimension,i.e. E Z j f n ( x S ) f ( x S ) j 2 ( dx ) = O L r k n n 1 ! + n (C.38) where L istheLipshitzcontinuitycoe˚cient, k n isthe˝rsthiddenlayersize,andby[61] n satis˝es P f sup n > g 8 384 K 2 n ( k n +1) (2 s +5) k n +1 e 2 = 128 2 4 K 4 n Undertheoremassumptions,theprobabilityaboveissummable,thuswehavethe˝rst probabilityinC.37convergesto1.Ontheotherhand,bytheorem4.2,wehavethesecond probabilityinC.37convergesto1.Therefore,theresultforregressionsetupisproved. Intheclassi˝cationsetup,similarly,wehave P R ( f n; ^ S ) R ( f S ) ! 0 = P R ( f n; ^ S ) R ( f S ) ! 0 ^ S = S P ^ S = S + P R ( f n; ^ S ) R ( f S ) ! 0 ^ S6 = S P ^ S6 = S P R ( f n; ^ S ) R ( f S ) ! 0 ^ S = S P ^ S = S = P R ( f n; S ) R ( f S ) ! 0 P ^ S = S (C.39) By[37],wehave R ( f n ) R ( f ) ! 0 asn !1 andfromtheorem4.2,wehavethesecondprobabilityinequationC.39tendsto 1 .Combine thesetworesults,theconsistencyofclassi˝cationcaseisproved. 232 BIBLIOGRAPHY 233 BIBLIOGRAPHY [1] FaridAlizadehandDonaldGoldfarb.Second-orderconeprogramming. Mathematical programming ,2003. [2] UmbertoAmato,AnestisAntoniadis,andItaliaDeFeis.Additivemodelselection. StatisticalMethods&Applications ,2016. [3] DavidFAndrews.Plotsofhigh-dimensionaldata. Biometrics ,pages1972. [4] MartinAnthonyandPeterLBartlett. Neuralnetworklearning:Theoreticalfounda- tions .cambridgeuniversitypress,2009. [5] AndreasArgyriou,TheodorosEvgeniou,andMassimilianoPontil.Multi-taskfeature learning.In Advancesinneuralinformationprocessingsystems ,pages2007. [6] RichardArratiaandLouisGordon.Tutorialonlargedeviationsforthebinomial distribution. Bulletinofmathematicalbiology ,1989. [7] SergeyBakin. Adaptiveregressionandmodelselectionindataminingproblems .PhD thesis,SchoolofMathematicalSciences,AustralianNationalUniversity,1999. [8] AndrewRBarron.Universalapproximationboundsforsuperpositionsofasigmoidal function. IEEETransactionsonInformationtheory ,,1993. 
BIBLIOGRAPHY

[1] Farid Alizadeh and Donald Goldfarb. Second-order cone programming. Mathematical Programming, 2003.

[2] Umberto Amato, Anestis Antoniadis, and Italia De Feis. Additive model selection. Statistical Methods & Applications, 2016.

[3] David F. Andrews. Plots of high-dimensional data. Biometrics, 1972.

[4] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

[5] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems, 2007.

[6] Richard Arratia and Louis Gordon. Tutorial on large deviations for the binomial distribution. Bulletin of Mathematical Biology, 1989.

[7] Sergey Bakin. Adaptive Regression and Model Selection in Data Mining Problems. PhD thesis, School of Mathematical Sciences, Australian National University, 1999.

[8] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 1993.

[9] A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19, 2013.

[10] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 2011.

[11] Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Publishing Company, Incorporated, 1st edition, 2011.

[12] Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 2014.

[13] Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 2008.

[14] Peter J. Bickel, Elizaveta Levina, et al. Some theory for Fisher's linear discriminant function, naive Bayes, and some alternatives when there are many more variables than observations. Bernoulli, 2004.

[15] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 2009.

[16] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2001.

[17] Leo Breiman. Bagging predictors. Machine Learning, 1996.

[18] Peter Bühlmann and Sara Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

[19] Peter Bühlmann, Bin Yu, et al. Analyzing bagging. The Annals of Statistics, 2002.

[20] Emmanuel Candes, Terence Tao, et al. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 2007.

[21] Timothy I. Cannings and Richard J. Samworth. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2017.

[22] A. Chatterjee and S. N. Lahiri. Rates of convergence of the adaptive lasso estimators to the oracle distribution and higher order refinements by the bootstrap. The Annals of Statistics, 2013.

[23] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, 2014.

[24] Jiahua Chen and Zehua Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 2008.

[25] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.

[26] François Chollet et al. Keras, 2015.

[27] Alexandra Chouldechova and Trevor Hastie. Generalized additive model selection. arXiv preprint arXiv:1506.03850, 2015.

[28] Charles K. Chui, Xin Li, and Hrushikesh Narhar Mhaskar. Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics, 1996.

[29] C. K. Chui, Xin Li, and Hrushikesh Narhar Mhaskar. Neural networks for localized approximation. Mathematics of Computation, 1994.

[30] Vasek Chvatal, Vaclav Chvatal, et al. Linear Programming. Macmillan, 1983.

[31] Pierre Comon. Independent component analysis, a new concept? Signal Processing, 1994.

[32] John B. Copas. Regression, prediction and shrinkage. Journal of the Royal Statistical Society: Series B (Methodological), 1983.

[33] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 1989.

[34] Debraj Das, Karl Gregory, and S. N. Lahiri. Perturbation bootstrap in adaptive lasso. arXiv preprint arXiv:1703.03165, 2017.

[35] C. de Boor. A Practical Guide to Splines (revised ed.). Springer, New York, 2001.
[36] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, 2011.

[37] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

[38] David L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 2006.

[39] David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 1994.

[40] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.

[41] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 2004.

[42] Paul H. C. Eilers and Brian D. Marx. Flexible smoothing with B-splines and penalties. Statistical Science, 1996.

[43] Fenglei Fan, Jinjun Xiong, and Ge Wang. Universal approximation with quadratic deep networks. Neural Networks, 2020.

[44] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6), 2008.

[45] Jianqing Fan, Yang Feng, and Rui Song. Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 2011.

[46] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 2001.

[47] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008.

[48] Jianqing Fan and Jinchi Lv. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory, 2011.

[49] Jianqing Fan, Heng Peng, et al. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 2004.

[50] Jianqing Fan, Rui Song, et al. Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 2010.

[51] Qingliang Fan and Wei Zhong. Nonparametric additive instrumental variable estimator: A group shrinkage estimation perspective. Journal of Business & Economic Statistics, 2018.

[52] Yingying Fan and Cheng Yong Tang. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2013.

[53] Jean Feng and Noah Simon. Sparse-input neural networks for high-dimensional nonparametric regression and classification. arXiv preprint arXiv:1711.07592, 2017.

[54] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 2010.

[55] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.

[56] Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 2002.

[57] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[58] Evarist Giné and Joel Zinn. Bootstrapping general empirical measures. The Annals of Probability, 1990.

[59] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[60] Eitan Greenshtein, Ya'Acov Ritov, et al. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 2004.

[61] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.

[62] Trevor Hastie and Robert Tibshirani. [Generalized additive models]: Rejoinder. Statistical Science, 1986.

[63] Trevor J. Hastie. Generalized additive models. In Statistical Models in S. Routledge, 2017.

[64] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 1933.

[65] Jian Huang, Joel L. Horowitz, and Fengrong Wei. Variable selection in nonparametric additive models. Annals of Statistics, 38(4), 2010.

[66] Bing'er Jiang, Tim O'Donnell, and Meghan Clayards. A deep neural network approach to investigate tone space in languages. The Journal of the Acoustical Society of America, 2019.

[67] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

[68] Ian T. Jolliffe, Nickolay T. Trendafilov, and Mudassir Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 2003.

[69] Sungkyu Jung, J. Stephen Marron, et al. PCA consistency in high dimension, low sample size context. The Annals of Statistics, 2009.

[70] Andre I. Khuri. Linear Model Methodology. CRC Press, 2009.

[71] Hea-Jung Kim. On the ratio of two folded normal distributions. Communications in Statistics - Theory and Methods, 2006.

[72] Stephen Cole Kleene. Representation of events in nerve nets and finite automata. Technical report, RAND Project Air Force, Santa Monica, CA, 1951.

[73] Keith Knight and Wenjiang Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 2000.

[74] Matthieu Kowalski. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis, 2009.

[75] Mark A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 1991.

[76] Neil D. Lawrence. A unifying probabilistic perspective for spectral dimensionality reduction: Insights and new models. Journal of Machine Learning Research, 13, 2012.

[77] Gee Y. Lee, Scott Manski, and Tapabrata Maiti. Actuarial applications of word embedding models. ASTIN Bulletin: The Journal of the IAA, 2020.

[78] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 1993.

[79] Yingjie Li and Taps Maiti. High dimensional discriminant analysis for spatially dependent data. 2018.

[80] You L. Li, Jessica Ducey-Wysling, Aurélie D'Hondt, Dongwoon Hyun, Bhavik Patel, and Jeremy J. Dahl. Vector flow imaging using a deep neural network. The Journal of the Acoustical Society of America, 2019.

[81] Andy Liaw, Matthew Wiener, et al. Classification and regression by randomForest. R News, 2002.

[82] Bo Liu, Ying Wei, Yu Zhang, and Qiang Yang. Deep neural networks for high dimension, low sample size data. In IJCAI, 2017.

[83] Rong Liu, Lijian Yang, and Wolfgang K. Härdle. Oracally efficient two-step estimation of generalized additive model. Journal of the American Statistical Association, 2013.

[84] Yang Liu, Quanxue Gao, Xinbo Gao, and Ling Shao. l_{2,1}-norm discriminant manifold learning. IEEE Access, 2018.

[85] Yufeng Liu and Yichao Wu. Variable selection via a combination of the L0 and L1 penalties. Journal of Computational and Graphical Statistics, 2007.

[86] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. A provably efficient algorithm for training deep networks. CoRR, abs/1304.7045, 2013.

[87] Lu Lu, Yeonjong Shin, Yanhui Su, and George Em Karniadakis. Dying ReLU and initialization: Theory and numerical examples. arXiv preprint arXiv:1903.06733, 2019.
[88] Jinchi Lv and Yingying Fan. A unified approach to model selection and sparse recovery using regularized least squares. The Annals of Statistics, 2009.

[89] Li Ma, Melba M. Crawford, and Jinwen Tian. Local manifold learning-based k-nearest-neighbor for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 2010.

[90] Alireza Makhzani and Brendan Frey. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013.

[91] Giampiero Marra and Simon N. Wood. Practical variable selection for generalized additive models. Computational Statistics & Data Analysis, 2011.

[92] Rahul Mazumder, Peter Radchenko, and Antoine Dedieu. Subset selection with shrinkage: Sparse linear modeling when the SNR is low. arXiv preprint arXiv:1708.03288, 2017.

[93] Lukas Meier, Sara Van De Geer, and Peter Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2008.

[94] Lukas Meier, Sara Van de Geer, Peter Bühlmann, et al. High-dimensional additive modeling. The Annals of Statistics, 2009.

[95] Nicolai Meinshausen, Peter Bühlmann, et al. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 2006.

[96] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[97] Mitsunori Mizumachi and Maya Origuchi. Superdirective non-linear beamforming with deep neural network. The Journal of the Acoustical Society of America, 2016.

[98] Siddhartha Nandy, Chae Young Lim, and Tapabrata Maiti. Additive model building for spatial regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2017.

[99] Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 1995.

[100] Andrew Ng et al. Sparse autoencoder. CS294A Lecture Notes, 2011.

[101] Kyoung-Su Oh and Keechul Jung. GPU implementation of neural networks. Pattern Recognition, 2004.

[102] Simon Perkins, Kevin Lacker, and James Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 2003.

[103] Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 1999.

[104] Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 2017.

[105] Frank Rosenblatt. The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.

[106] Frank Rosenblatt. Principles of neurodynamics. Perceptrons and the theory of brain mechanisms. Technical report, Cornell Aeronautical Lab Inc., Buffalo, NY, 1961.

[107] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000.

[108] S. Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 1991.

[109] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 1998.

[110] L. L. Schumaker. Spline Functions: Basic Theory. John Wiley & Sons, New York, 1981.

[111] Gideon Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 1978.

[112] Uri Shaham, Alexander Cloninger, and Ronald R. Coifman. Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis, 2018.

[113] Jonathan W. Siegel and Jinchao Xu. On the approximation properties of neural networks. arXiv preprint arXiv:1904.02311, 2019.

[114] Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.
[115] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[116] Donald F. Specht et al. A general regression neural network. IEEE Transactions on Neural Networks, 1991.

[117] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.

[118] Nicolas Städler, Peter Bühlmann, and Sara Van de Geer. Rejoinder: l1-penalization for mixture regression models. Test, 2010.

[119] Eduardo D. Sontag. Critical points for least-squares problems involving certain analytic functions, with applications to sigmoidal nets. Advances in Computational Mathematics, 1996.

[120] Charles J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 1985.

[121] Charles J. Stone. The dimensionality reduction principle for generalized additive models. The Annals of Statistics, 1986.

[122] Wesley Tansey and James G. Scott. A fast and flexible algorithm for the graph-fused lasso. arXiv preprint arXiv:1505.06475, 2015.

[123] Gülşen Taşkın and Melba M. Crawford. An out-of-sample extension to manifold learning via meta-modeling. IEEE Transactions on Image Processing, 2019.

[124] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 2000.

[125] Ambuj Tewari, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In Advances in Neural Information Processing Systems, 2011.

[126] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 1996.

[127] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2005.

[128] Ryan J. Tibshirani. A general framework for fast stagewise algorithms. The Journal of Machine Learning Research, 2015.

[129] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1999.

[130] Warren S. Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika, 1952.

[131] Wen-Jen Tsay, Cliff J. Huang, Tsu-Tan Fu, and I-Lin Ho. A simple closed-form approximation for the cumulative distribution function of the composite error of stochastic frontier models. Journal of Productivity Analysis, 2013.

[132] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 2009.

[133] John W. Tukey. Non-parametric estimation II. Statistically equivalent blocks and tolerance regions: the continuous case. The Annals of Mathematical Statistics, 1947.

[134] Gerhard Tutz and Harald Binder. Generalized additive modeling with implicit variable selection by likelihood-based boosting. Biometrics, 2006.

[135] Sara A. Van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 2008.

[136] A. W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, 1996.

[137] Hansheng Wang, Bo Li, and Chenlei Leng. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2009.

[138] Mingqiu Wang and Guo-Liang Tian. Adaptive group lasso for high-dimensional generalized linear models. Statistical Papers, 2019.

[139] Rongrong Wang and Xiaopeng Zhang. Capacity preserving mapping for high-dimensional data visualization. arXiv preprint arXiv:1909.13322, 2019.
[140] Fengrong Wei and Jian Huang. Consistent group selection in high-dimensional linear regression. Bernoulli: Official Journal of the Bernoulli Society for Mathematical Statistics and Probability, 16(4), 2010.

[141] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 1987.

[142] Stephen J. Wright. Coordinate descent algorithms. Mathematical Programming, 2015.

[143] Tong Tong Wu, Kenneth Lange, et al. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2008.

[144] Makoto Yamada, Wittawat Jitkrittum, Leonid Sigal, Eric P. Xing, and Masashi Sugiyama. High-dimensional feature selection by feature-wise kernelized lasso. Neural Computation, 2014.

[145] Kaixu Yang and Tapabrata Maiti. Ultrahigh-dimensional generalized additive model: consistency and tuning parameter selection. Technical report, Michigan State University, 2018.

[146] Kaixu Yang and Tapabrata Maiti. Statistical aspects of high-dimensional sparse artificial neural network models. Machine Learning and Knowledge Extraction, 2020.

[147] Yi Yang and Hui Zou. A fast unified algorithm for solving group-lasso penalize learning problems. Statistics and Computing, 2015.

[148] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2006.

[149] Cun-Hui Zhang and Jian Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 2008.

[150] Nancy R. Zhang and David O. Siegmund. A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics, 2007.

[151] Yiyun Zhang, Runze Li, and Chih-Ling Tsai. Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association, 2010.

[152] Peng Zhao and Bin Yu. On model selection consistency of lasso, 2006.

[153] S. Zhou, X. Shen, D. A. Wolfe, et al. Local asymptotics for regression splines and confidence regions. The Annals of Statistics, 1998.

[154] Bo Zhu, Jeremiah Z. Liu, Stephen F. Cauley, Bruce R. Rosen, and Matthew S. Rosen. Image reconstruction by domain-transform manifold learning. Nature, 2018.

[155] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 2006.

[156] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2005.

[157] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 2006.