EVOLUTIONARY MULTI-OBJECTIVE BI-LEVEL OPTIMIZATION FOR EFFICIENT DEEP NEURAL NETWORK ARCHITECTURE DESIGN

By

Zhichao Lu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering - Doctor of Philosophy

2020

ABSTRACT

EVOLUTIONARY MULTI-OBJECTIVE BI-LEVEL OPTIMIZATION FOR EFFICIENT DEEP NEURAL NETWORK ARCHITECTURE DESIGN

By

Zhichao Lu

Deep convolutional neural networks (CNNs) are the backbones of deep learning (DL) paradigms for numerous vision tasks, including object recognition, detection, segmentation, etc. Early advancements in CNN architectures were primarily driven by human expertise and elaborate design. Recently, neural architecture search (NAS) was proposed with the aim of automating the network design process and generating task-dependent architectures. While existing approaches have achieved competitive performance, they are still impractical for real-world deployment for three reasons: (1) the generated architectures are solely optimized for predictive performance, resulting in inefficiency in utilizing hardware resources, i.e., energy consumption, latency, memory size, etc.; (2) the search processes require vast computational resources in most approaches; (3) most existing approaches require one complete search for each deployment specification of hardware or objective requirement.

In this dissertation, we propose an efficient evolutionary NAS algorithm to address the aforementioned limitations. In particular, we first introduce Pareto-optimization to NAS, leading to a diverse set of architectures, trading off multiple objectives, being obtained simultaneously in one run. We then improve the algorithm's search efficiency through surrogate models. We further integrate a transfer learning scheme into the algorithm that allows a new task to leverage previous search efforts, which improves both the performance of the obtained architectures and the search efficiency. Therefore, the proposed algorithm enables an automated and streamlined process to efficiently generate task-specific custom neural network models that are competitive under multiple objectives.

TABLE OF CONTENTS

LIST OF TABLES ........ vi
LIST OF FIGURES ........ viii
Chapter 1 Introduction ........ 1
  1.1 Motivation ........ 1
  1.2 Challenges ........ 3
  1.3 Dissertation Overview ........ 5
Chapter 2 Background ........ 6
  2.1 Neural Architecture Search ........ 6
  2.2 Evolutionary Bi-level Optimization ........ 8
  2.3 Evolutionary Multi-objective Optimization ........ 11
  2.4 Evolutionary Surrogate-Assisted Optimization ........ 14
  2.5 Related Work ........ 15
Chapter 3 Multi-objective Evolution of Deep Neural Network Architectures ........ 19
  3.1 Proposed Approach ........ 20
    3.1.1 Search Space and Encoding ........ 22
    3.1.2 Algorithm Procedures ........ 24
  3.2 Experiments and Results ........ 30
    3.2.1 Baselines ........ 30
    3.2.2 Implementation Details ........ 31
    3.2.3 Effectiveness ........ 32
    3.2.4 Efficiency ........ 36
    3.2.5 Observations on Evolved Architectures ........ 38
  3.3 Ablation Study ........ 40
    3.3.1 Dataset for Search ........ 40
    3.3.2 Proxy Models ........ 41
  3.4 Conclusion ........ 42
Chapter 4 Surrogate Assisted Evolution of Deep Neural Network Architectures ........ 44
  4.1 Proposed Approach ........ 46
    4.1.1 Overall Algorithm Description ........ 46
    4.1.2 Speeding Up Upper Level Optimization ........ 49
    4.1.3 Speeding Up Lower Level Optimization ........ 50
  4.2 Experiments and Results ........ 50
    4.2.1 Performance of the Surrogate Predictors ........ 51
    4.2.2 Search Efficiency ........ 52
    4.2.3 Results on Standard Datasets ........ 53
  4.3 Scalability of MSuNAS ........ 56
    4.3.1 Types of Datasets ........ 56
    4.3.2 Number of Objectives ........ 57
  4.4 Conclusion ........ 59
Chapter 5 Transfer Learning Assisted Evolution of Deep Neural Network Architectures ........ 60
  5.1 Proposed Approach ........ 62
    5.1.1 Search Space and Encoding ........ 63
    5.1.2 Accuracy Predictor ........ 65
    5.1.3 Many-Objective Evolutionary Search ........ 68
    5.1.4 Many-Objective Selection ........ 71
    5.1.5 Supernet Adaptation ........ 73
    5.1.6 Choosing Best Trade-off Solution ........ 75
  5.2 Experiments and Results ........ 76
    5.2.1 Datasets ........ 76
    5.2.2 Supernet Preparation ........ 77
    5.2.3 ImageNet ........ 78
    5.2.4 Architecture Transfer ........ 79
    5.2.5 Scalability to Objectives ........ 84
      5.2.5.1 Three Objectives ........ 85
      5.2.5.2 Twelve Objectives ........ 86
  5.3 Ablation Study ........ 88
    5.3.1 Accuracy Predictor Performance ........ 88
    5.3.2 Search Efficiency ........ 90
    5.3.3 Analysis of Crossover ........ 91
    5.3.4 Analysis of Mutation Hyperparameters ........ 92
    5.3.5 Effectiveness of Supernet Adaptation ........ 93
  5.4 Conclusion ........ 95
Chapter 6 Applications of Evolutionary Neural Architecture Search ........ 96
  6.1 Medical Imaging ........ 96
  6.2 Robustness Against Adversarial Attack ........ 100
  6.3 Multi-view Car Alignment ........ 103
  6.4 Non-learnable Layer ........ 106
Chapter 7 Conclusion ........ 108
  7.1 Contributions ........ 108
  7.2 Discussion and Future Work ........ 110
  7.3 Concluding Remarks ........ 111
BIBLIOGRAPHY ........ 112

LIST OF TABLES

Table 3.1: Summary of Hyper-parameter Settings ........ 32
Table 3.2: Comparison between NSGANet and other baseline methods. NSGANet architectures are obtained by searching on CIFAR-100. NSGANet results on CIFAR-10 and ImageNet are obtained by re-training the weights with images from their respective datasets. Ratio-to-NSGANet indicates the resulting savings on #Params and #FLOPs. The search cost is compared in GPU-days, calculated by multiplying the number of GPU cards deployed with the execution time in days ........ 35
Table 4.1: Comparing the relative search efficiency of MSuNAS to other single-objective methods: "#Model" is the total number of architectures evaluated during search, "#Epochs" is the number of epochs used to train each architecture during search.
† and ‡ denote training epochs with and without a supernet to warm-start the weights, respectively ........ 52
Table 4.2: ImageNet [1]: MSuNAS comparison with manual and automated design of efficient networks. Models are grouped into sections for better visualization. Our results are underlined and the best result in each section is in bold. CPU latency (batch size = 1) is measured on Intel i7-8700K and GPU latency (batch size = 64) is measured on 1080Ti. † The search cost excludes the supernet training time cost. ‡ Estimated based on the claim that PNAS is 8x faster than NASNet from [2] ........ 54
Table 4.3: Non-standard Datasets for MSuNAS ........ 56
Table 5.1: Hyperparameter Settings ........ 77
Table 5.2: Benchmark Datasets for Evaluation ........ 77
Table 5.3: ImageNet-1K [1]: NATNets comparison with manual and automated design of efficient convolutional neural networks. Models are grouped into sections for better visualization. Our results are underlined and the best result in each section is in bold. CPU latency (batch size = 1) is measured on Intel i7-8700K and GPU latency (batch size = 64) is measured on 1080Ti. "WS" stands for weight sharing. All methods are under single crop and single model condition, without any additional data ........ 80
Table 5.4: NAT model performance corresponding to Figure 5.9 ........ 84
Table 5.5: Comparing the relative search efficiency of NAT to other methods. "-" denotes not applicable, "WS" stands for weight sharing and "SMBO" stands for sequential model-based optimization [3]. † is taken from [4], ‡ estimated based on the # of models evaluated during search (20K in [5], 1.2K in [2], 27K in [6]). * denotes a re-ranking stage where the top 100-250 models undergo extended training and evaluation for 300 epochs before selecting the final model ........ 90
Table 5.6: Effect of fine-tuning for 75 epochs on ImageNet ........ 93
Table 6.1: AUROC on ChestX-Ray14 testing set ........ 98
Table 6.2: Preliminary results on the CMU-Car alignment [7]. Notably, our proposed algorithm is able to find architectures with competitive performance while having 2x less parameters when compared to a human-designed architecture [8] ........ 105

LIST OF FIGURES

Figure 1.1: A visualization of the neural network architectures that have been proposed in the deep learning community [9]. The diversity in neural networks implies the importance of architectural choices ........ 2
Figure 2.1: General sketch of a bi-level problem taken from [10] ........ 9
Figure 2.2: A principle of Pareto-dominance based multi-objective optimization procedure is illustrated ........ 12
Figure 2.3: An illustration of hypervolume under bi-objective optimization ........ 13
Figure 3.1: Overview: Given a dataset and objectives, NSGANet designs a set of custom architectures spanning the trade-off front. NSGANet estimates the performance of an architecture through its proxy model, optimized by Stochastic Gradient Descent (SGD) in the lower level. The search proceeds in exploration via genetic operations, followed by exploitation via distribution estimation. See Algorithm 1 for pseudocode; colors are in correspondence ........ 20
Figure 3.2: Schematic of the NSGANet search space motivated from [5]: (a) An architecture is composed of stacked blocks. (b) The number of channels in each block is gradually increased with depth of the network. (c) Each block is composed of five nodes, where each node is a two-branched computation applied to outputs from either previous blocks or previous nodes within the same block. (d) A graphical visualization of (c) ........ 24
Figure 3.3: Illustration of node-level crossover ........ 26
Figure 3.4: Input and Operation Mutation: Dashed line boxes with red color highlight the mutation. h_{i-2}, h_{i-1} are outputs from the previous-previous and previous blocks, respectively.
h_i^(3) indicates the output from node 3 of the current block ........ 27
Figure 3.5: Illustrative example of the BN-based exploitation step in NSGANet: given past successful architectures, we construct a BN relating the dependencies between the four nodes inside the Normal block. A new architecture is then sampled from this BN and proceeds forward for performance estimation ........ 28
Figure 3.6: Examples from CIFAR-10, CIFAR-100 and ImageNet datasets. Images in each row belong to the same class with label names shown to the left ........ 31
Figure 3.7: Visualization of block-level structures for different architectures. The Normal and Reduction blocks are shown in the first and second rows, respectively, for NSGANet architectures. Examples of blocks that are designed manually by experts [11, 12] and from other peer methods [5] are also included in (d)-(f) for comparison ........ 33
Figure 3.8: (a) Accuracy vs. FLOPs of all architectures generated by NSGANet during the course of evolution on CIFAR-100. A subset of non-dominated architectures (see text), named NSGANet-A0 to A4, are re-trained thoroughly and compared with other peer methods in (b) ........ 34
Figure 3.9: Transferability of the NSGANet architectures to (a) CIFAR-10, and (b) ImageNet. We compare Top-1 Accuracy vs. Computational Complexity. Architectures joined by dashed lines are from multi-objective algorithms ........ 36
Figure 3.10: Search efficiency comparison between NSGANet and other baselines in terms of (a) HV, and (b) the required compute time in GPU-Days. The search cost is measured on CIFAR-10 for most methods, except NSGANet and Block-QNN [13], where the CIFAR-100 dataset is used for search ........ 37
Figure 3.11: Post Search Analysis: (a) Frequency of each operation selected during the search. (b) Effect of the number of input channels that are concatenated on the validation accuracy. More channels result in more reliable performance improvement but at the cost of computational efficiency ........ 39
Figure 3.12: (a) Mean accuracy distribution of randomly generated architectures and architectures from peer NAS methods on four datasets. Correlation in performance (red lines) vs. savings in gradient descent wall time (blue boxes) by reducing (b) the number of channels in layers, (c) the number of layers, and (d) the number of epochs to train.
Note that (b), (c) and (d) have two y-axis labels corresponding to the color of the lines ........ 41
Figure 4.1: Overview: Given a dataset and objectives, MSuNAS obtains a task-specific set of models that are competitive in all objectives with high search efficiency. It comprises two surrogates, one at the upper level to improve sample efficiency and one at the lower level, through a supernet, to improve weight learning efficiency ........ 45
Figure 4.2: Search Space: A candidate architecture comprises five computational blocks. Parameters we search for include image resolution, the number of layers (L) in each block, and the expansion rate (e) and the kernel size (k) in each layer ........ 47
Figure 4.3: A sample run of MSuNAS on ImageNet: In each iteration, accuracy-prediction surrogate models S_f are constructed from an archive of previously evaluated architectures (a). New candidate architectures (brown boxes in (b)) are obtained by solving the auxiliary single-level multi-objective problem F~ = {S_f, C} (line 10 in Algo 3). A subset of the candidate architectures is chosen to diversify the Pareto front (c)-(d). The selected candidate architectures are then evaluated and added to the archive (e). At the conclusion of the search, we report the non-dominated architectures from the archive. The x-axis in all subfigures is #MAdds ........ 48
Figure 4.4: Performance of the four surrogate model predictors considered. Top row compares the Spearman rank-order correlation coefficient as the number of training samples increases. Bottom row visualizes the true vs. predicted accuracy under 500 training samples. The root mean square error (RMSE) of the Adaptive Switching (AS) strategy is also provided ........ 51
Figure 4.5: (a) A sketch visualizing the hypervolume metric [14]. In the case of two objectives, it measures the dominated area achieved by a multi-objective algorithm. A larger hypervolume indicates a better Pareto front achieved. (b) Comparing the relative search efficiency of MSuNAS, NSGANet [15] and random search under a bi-objective setup. We plot the mean hypervolume (across 5 runs) on ImageNet, CIFAR-10 and CIFAR-100 against the number of architectures evaluated. Colored regions denote the standard deviation of the mean ........ 53
Figure 4.6: Accuracy vs. Efficiency: Top row compares predictive accuracy vs. GPU latency on a batch of 64 images. Bottom row compares predictive accuracy vs. the number of multi-adds in millions. Models from multi-objective approaches are joined with lines. Our models are obtained by directly searching on the respective datasets. In most problems, MSuNAS finds more accurate solutions with fewer parameters ........ 55
Figure 4.7: Performance of the set of task-specific models obtained by MSuNAS on three different types of non-standard datasets, comparing to SOTA from transfer learning [16, 17] and semi-/un-supervised learning [18, 19] ........ 57
Figure 4.8: Scalability of MSuNAS to different numbers and types of objectives: optimizing (a) a scalarized single-objective on ImageNet; (b) five objectives including accuracy, Params, MAdds, CPU and GPU latency, simultaneously. (c) Post-optimal analysis on the architectures that are non-dominated according to the different objectives ........ 58
Figure 5.1: Overview: Given a dataset and objectives to optimize, NAT designs custom architectures spanning the objective trade-off front. NAT comprises two main components, supernet adaptation and evolutionary search, that are iteratively executed. NAT also uses an online accuracy predictor model to improve its computational efficiency ........ 61
Figure 5.2: The architectures in our search space are variants of the MobileNet-v2 family of models [12, 17, 20, 21]. (a) Each network consists of five stages. Each stage has two to four layers. Each layer is an inverted residual bottleneck block. The search space includes input image resolution (R), width multiplier (W), the number of layers in each stage, the # of output channels (expansion ratio E) of the first 1x1 convolution, and the kernel size (K) of the depth-wise separable convolution in each layer. (b) Networks are represented as 22-integer strings, where the first two correspond to resolution and width multiplier, and the rest correspond to the layers. Each value indicates a choice, e.g., the third integer (L1) taking a value of "1" corresponds to using an expansion ratio of 3 and a kernel size of 3 in layer 1 of stage 1 ........ 64
Figure 5.3: Top Path: A typical process of evaluating an architecture in NAS algorithms. Bottom Path: The accuracy predictor aims to bypass the time-consuming components for evaluating a network's performance by directly regressing its accuracy f from a (the architecture in the encoded space) ........ 66
Figure 5.4: Accuracy predictor performance as a function of training samples. For each model, we show the mean and standard deviation of the Spearman rank correlation on 11 datasets (Table 5.2). The size of the RBF ensemble is 500 ........ 67
Figure 5.5: (a) Crossover Operator: new offspring architectures are created by recombining integers from two parent architectures. The probability of choosing from either one of the parents is equal. (b) Mutation Operator: histograms showing the probabilities of mutated values with the current value at 5 under different hyperparameter η_m settings ........ 69
Figure 5.6: (a) An example (assuming minimization of all objectives) of the selection process in Algo 7: We first create reference directions Z by joining reference points with the ideal solution (origin). Then, through non_dominated_sort, three non-dominated solutions are identified and associated with reference directions Z^(1), Z^(3) and Z^(5). We then select the remaining solutions by the orthogonal distances to the reference directions with no associated solutions, i.e., Z^(2) and Z^(4). This selection is scalable to a larger # of objectives. A tri-objective example is shown in (b) ........ 72
Figure 5.7: ImageNet Architectures from Trade-off Front ........ 80
Figure 5.8: MAdds vs. ImageNet Accuracy. NATNets outperform other models in both objectives. In particular, NAT-M4 achieves a new state-of-the-art top-1 accuracy of 80.5% under mobile setting (≤ 600M MAdds). NAT-M1 improves MobileNet-v3 top-1 accuracy by 2.3% with similar #MAdds ........ 81
Figure 5.9: MAdds vs. Accuracy trade-off curves comparing NAT with existing architectures on a diverse set of datasets. The datasets are arranged in ascending order of training set size. Methods shown in the legend pre-train on ImageNet and fine-tune the weights on the target dataset.
Methods in blue train from scratch or use external training data ........ 82
Figure 5.10: Transfer architectures (350M MAdds) on ten datasets ........ 83
Figure 5.11: Non-dominated architectures to {top-1 accuracy, #MAdds} obtained by NAT on different datasets ........ 84
Figure 5.12: Top row: NATNets obtained from tri-objective search to maximize ImageNet top-1 accuracy, minimize model size (#Params), and minimize {#MAdds, CPU latency, GPU latency}, from left to right. Bottom row: 2D projections from the above 3D scatter, showing top-1 accuracy vs. each of the four efficiency-related measurements. To better visualize (the comparison with MobileNet-v3 [21] and MUXNet [22]), partial solutions from the non-dominated frontiers are shown. All top-1 accuracies shown are without fine-tuning ........ 85
Figure 5.13: Left: Parallel Coordinate Plot (PCP) where each vertical bar is an objective and each line is a non-dominated architecture achieved by NAT from a 12-objective optimization of minimizing #MAdds and maximizing accuracy on the 11 datasets in Table 5.2. The model with the best trade-off (see Section 5.1.6 for details) is highlighted in dark blue. Right: 1-on-1 comparison between the selected NATNet (top-ranked in trade-off) and representative peer models on top-1 accuracy on various datasets. A method with larger area is better ........ 87
Figure 5.14: Trade-off values of all 45 architectures achieved from the 12-objective search presented in Figure 5.13 ........ 88
Figure 5.15: Top row: Spearman rank correlation between predicted accuracy and true accuracy of different surrogate models across many datasets. Each accuracy predictor is constructed from 250 samples (trained architectures). Error bars show mean and standard deviation over ten runs. Bottom row: Goodness of fit visualization of the RBF ensemble, the best accuracy predictor ........ 89
Figure 5.16: Ablation study on the crossover operator: (a) the median performance from eleven runs of our evolutionary algorithm with and without the crossover operator. (b) the median performance deteriorates as the crossover probability reduces from 0.9 to 0.2 ........ 91
Figure 5.17: Hyperparameter study on (a) mutation probability p_m and (b) mutation index parameter η_m. For each study, we run NAT eleven times on four datasets to maximize top-1 accuracy and report the median performance ........ 93
Figure 5.18: Comparing the performance of adapting supernet, adapting subnet and additional fine-tuning under a bi-objective search setup on four datasets. Details are provided in Section 5.3.5 ........ 93
Figure 6.1: Visualization of the ChestX-Ray14 dataset. Examples showing eight common thoracic diseases are from [23] ........ 97
Figure 6.2: NSGANet-X multi-label classification performance on ChestX-Ray14 (a) and the class-wise mean test AUROC comparison with peer methods (b) ........ 98
Figure 6.3: Examples of the class activation map [24] of NSGANet-X, highlighting the class-specific discriminative regions. The ground truth bounding boxes are plotted over the heatmaps ........ 99
Figure 6.4: Robustness Objective: We define a robustness objective under the FGSM [25] attack as follows: 1) obtain classification accuracy on adversarial images generated by FGSM as we vary ε; 2) compute the area under the curve (blue line), approximated by the area of the green region; 3) normalize the robustness value to the rectangular area formed by the Ideal point and Nadir point, where the ideal point is defined at 100% classification accuracy at the maximum ε value, and the nadir point is defined as the accuracy of random guessing at ε = 0 (clean images) ........ 101
Figure 6.5: Trade-off frontier of the robustness experiments. Color indicates the generation (iteration) at which a network architecture is eliminated from the surviving parent population. The size of each point is proportional to the network architecture's number of trainable parameters. We note that networks from latter generations form the Pareto front (dark blue points) ........ 102
Figure 6.6: (a) Examples of the computational blocks discovered with high classification accuracy. For these networks, the mean accuracy and robustness objectives are 0.8543 and 0.0535, respectively; (b) Examples of the computational blocks discovered with high robustness against FGSM attack; the mean accuracy and robustness objectives are 0.8415 and 0.1036, respectively; (c) Examples of the computational blocks discovered along the Pareto front that provide an efficient trade-off between classification accuracy and adversarial robustness. They are arranged in the order of descending accuracy and ascending robustness ........ 103
Figure 6.7: Parallel coordinate plots of the 1,200 network architectures sampled by NSGANet. Each line represents a network architecture; each vertical line is an attribute associated with the network. (a) Networks that have the skip connection bit inactive; we can see that none of them have good measurement on robustness against adversarial attacks. (b) Networks that have the skip connection bit active. This skip connection bit refers to the connection that goes past all computation within a phase, as a normal residual connection would. When the skip connection is active, the networks cover the full range of adversarial robustness ........ 104
Figure 6.8: Spatial-resolution-change path of the NSGANet-C0 architecture. Each circle encodes a residual module [11]. Circles colored in white are always executed at the beginning. The arrows and blue circles are parts of the search decisions. The total number of layers (L) is set to be nine ........ 105
Figure 6.9: Preliminary experiment on constructing DNN architectures using layers with non-learnable weights. Each layer is composed of several non-learnable operations in parallel. We manually constructed a handful of such layers and evaluate them on CIFAR-10 (a). An example configuration, chosen based on the trade-off between accuracy and the number of parameters, is shown in (b). We validate the effectiveness of non-learnable layers by replacing the original 3x3 convolution in different ResNet models with the chosen configuration on both CIFAR-100 and ImageNet (c). Evidently, a layer with non-learnable weights is capable of yielding competitive classification performance while being computationally efficient as opposed to conventional learnable convolutions ........ 107

Chapter 1

Introduction

The process of computers performing tasks, or automation, has been at the leading forefront of technological advances. Automation in engineering design, manufacturing, and logistics has dramatically transformed how goods are made and improved productivity by orders of magnitude in the recent past. However, applications of machine learning (ML) still largely rely on human expertise. Given the recent improvements in computing hardware (e.g., graphics processing units (GPUs)) and services (e.g., cloud/edge computing), it is clear that the availability of skilled researchers and practitioners is the bottleneck, rather than computational power. Automating the creation of ML applications is a plausible avenue to alleviate this bottleneck and significantly accelerate the progress of ML technologies. Through establishing an algorithmic foundation to automate ML applications, this dissertation hopes to contribute to this vision.

1.1 Motivation

Recent years have witnessed an explosive growth in ML applications, driven by both advancements in computing hardware and the amount of high-quality (annotated) data. In particular, a special form of ML, known as deep learning (DL), has emerged. DL refers to the use of deep (in terms of number of layers) neural networks (DNNs) to automatically extract discriminative feature representations from data in an end-to-end fashion. Within a few years of existence, DNNs have surpassed the state-of-the-art in a wide range of benchmarks that were previously dominated by other machine learning algorithms. These benchmarks spread from the computer vision [26], speech recognition [27], to natural language processing [28] domains.
Figure 1.1: A visualization of the neural network architectures that have been proposed in the deep learning community [9]. The diversity in neural networks implies the importance of architectural choices.

One of the key driving forces behind this success is the introduction of many task-specific DNN architectures (see Figure 1.1 for a visualization). In the domain of computer vision (the main focus of this dissertation), the first wave of architectural development focused solely on improving the predictive performance of convolutional neural networks (CNNs), including VGG [29], GoogLeNet [30], ResNet [11], etc. These models progressively increase depth and the number of parallel branches. The second wave focuses on improving networks' computational efficiency while trading off predictive performance to a small extent. Models in this category include DenseNet [31], ShuffleNet [32], MobileNet-v2 [12], etc. These models use sophisticated layer connections (e.g., dense connection [31], inverted residual connection [12]) and layer arrangements (e.g., replacing a conventional convolution layer with a depth-wise separable convolution followed by a point-wise convolution [12]) to squeeze the model size and maintain performance at the same time. Due to a lack of theoretical understanding of neural network architectures, these developments cost years of painstaking efforts and elaborate design from empirically testing network models on benchmark tasks. Furthermore, it is becoming apparent that such a manual process of adjusting network architectures is not only tedious, but also sub-optimal. An algorithmic approach to automate the design of network architectures is essential to sustain the steady growth in applications of CNNs.

1.2 Challenges

Towards designing an automatic method to search architectures that can scale with the complexity of the search space and the number of objectives required by vast deployment scenarios, there are a few key challenges that need to be addressed.

Non-differentiable: One major issue is that the objective landscape with respect to the architecture is not continuous; thereby, there is no gradient information for such a search. This is unlike the training process for DNNs, in which the gradient of the loss function (e.g., cross-entropy loss in the case of classification) can be computed with respect to the network parameters (weights) in order to minimize the loss [33]. This computation is possible because the network primitives (e.g., convolution, activation functions, etc.) are differentiable; hence, the loss can be mathematically formulated as a function of the weights in a neural network, i.e., it is differentiable. However, no such relationship exists between the network's architecture topology and the loss function. Without any gradient information, the search for a suitable DNN becomes a black-box optimization problem where only the value of the objective function (i.e., the performance of a DNN on a task) can be computed. Furthermore, the search space for network architectures is discrete (including both ordinal and nominal variables), and the arrangement of layers in a DNN can take on any arbitrary graph structure. Any practical automated method for optimizing network structure must be able to handle such a large and complex search space efficiently.
Multiple objectives: In addition to the need for high-performance results from DNNs, real-world applications of these models demand network architectures with different complexities for different deployment scenarios, e.g., IoT systems, mobile devices, embedded devices, automotives, cloud servers, etc. These computing devices are often constrained by a variety of different hardware resources, such as power consumption, available memory, and latency constraints, to name a few. These hardware- and efficiency-related objectives are oftentimes in conflict with the predictive performance of DNNs, leading to the need for multi-objective optimization, as ideal solutions that optimize all objectives simultaneously do not exist. Automated methods under such a scenario need to balance the trade-offs among these competing objectives and identify a diverse set of architectures to represent the trade-off frontier.

Expensive evaluation: In order for an automated architecture search algorithm to optimize DNNs, it must search through thousands, if not hundreds of thousands, of network topologies. Unfortunately, DNNs are computationally very intensive and require specialized hardware such as graphics processing units (GPUs) to train. For example, it takes one month to evaluate 27K DNN architectures with one hundred GPUs on the CIFAR-10 [34] dataset. Unsurprisingly, the total number of GPU hours/days required is extremely high (up to tens of thousands) for a single experimental run. Attempting to train thousands of networks equally and thoroughly is thus impractical even with cloud computing (e.g., Amazon Web Services).

Vast deployment scenarios: Given the diverse hardware platforms and efficiency constraints, designing specialized DNNs for every scenario is engineer-expensive and computationally expensive, either with human-based methods or automated systems, since such methods need to repeat the network design process and retrain the designed network from scratch for each case. Their total cost grows linearly as the number of deployment scenarios increases, which will result in excessive energy consumption.

This dissertation will address the challenges listed above by using a surrogate-assisted multi-objective evolutionary approach. Evolutionary algorithms (EAs) are black-box optimization algorithms that can efficiently search through high-dimensional, large, and complex search spaces [35]. Since there already exist EAs designed for efficient exploration of multiple objectives, expensive objective functions, and discrete variables, separately, EAs present a promising starting point for improving neural architecture search.

1.3 Dissertation Overview

The rest of the dissertation is structured as follows. In Chapter 2, we provide background and a literature survey on the related topics and methods. In Chapter 3, we incorporate multi-objective Pareto-optimization into neural architecture search and present the initial algorithm framework. In Chapter 4, we enhance the search efficiency of the proposed algorithm through surrogate modelling. We then further improve the method's scalability to diverse deployment scenarios with adaptive transfer learning in Chapter 5. Various applications of the proposed algorithm are presented in Chapter 6. Finally, we summarize the contributions and provide interesting directions for future work in Chapter 7.

Chapter 2

Background

The work in this dissertation, evolutionary neural architecture search, builds upon two major areas of research: deep learning and evolutionary algorithms. Thus, this chapter reviews foundations and related work from these two areas of research that are relevant to the topic of this dissertation. First, an introduction to deep neural networks is given. Second, relevant topics within evolutionary optimization are reviewed.

2.1 Neural Architecture Search

Neural architecture search (NAS) presents a promising path to alleviate the painful process of manual adjustment by posing the design of CNN architectures as an optimization problem.
By altering the architectural components in an algorithmic fashion, novel CNNs can be discovered that exhibit improved performance metrics on representative datasets. The huge surge in research and applications of NAS indicates the tremendous academic and industrial interest NAS has attracted, as teams seek to stake out some of this territory. It is now well recognized that designing bespoke neural network architectures for various tasks is one of the most challenging and practically beneficial components of the entire Deep Neural Network (DNN) development process, and is a fundamental step towards automated machine learning.

In general, the problem of designing CNN architectures for a target dataset D = {D_trn, D_vld, D_tst} can be viewed as a bi-level optimization problem [10]. It can be mathematically formulated as

\min_{\alpha \in \mathcal{A}} \; F\big(\alpha, w^{*}(\alpha)\big), \quad \text{subject to} \quad w^{*}(\alpha) \in \arg\min_{w \in \Omega} f(w; \alpha), \tag{2.1}

where the upper-level variable \alpha defines a CNN architecture, and the lower-level variable w(\alpha) defines its associated weights. The objective functions F at the upper level and f at the lower level denote the loss on the validation data D_vld and the training data D_trn, respectively.

The problem in Eq. (2.1) poses several challenges to conventional optimization methods. First, evaluating F for a given \alpha involves another prolonged optimization process (the lower-level optimization), as the relationship between an \alpha and its corresponding optimal w^{*}(\alpha) is not analytically defined due to the non-linear nature of modern CNNs. Such a lower-level optimization process needs to be repeated for every evaluation of F at the upper level. Secondly, the upper-level problem is not differentiable, as the gradient of F cannot be reliably computed or even approximated due to the categorical nature of the upper-level variable \alpha, which involves the choice of layer operations, activation functions, etc.

Early methods for NAS relied on Reinforcement Learning (RL) to navigate and search for architectures with high performance. A major limitation of these approaches [36, 37] is the steep computational requirement for the search process itself, often requiring weeks of wall clock time on hundreds of Graphics Processing Unit (GPU) cards. Recent relaxation-based methods [38, 39] seek to improve the computational efficiency of NAS approaches by approximating the connectivity between different layers in the CNN architectures by real-valued variables that are learned (optimized) through gradient descent together with the weights. However, such relaxation-based NAS methods suffer from excessive GPU memory requirements during search, resulting in constraints on the size of the search space (e.g., reduced layer operation choices).

In addition to the need for high-performance results from NAS models, real-world applications of these models demand finding network architectures with different complexities for different deployment scenarios, e.g., IoT systems, mobile devices, automotives, cloud servers, etc. These computing devices are often constrained by a variety of hardware resources, such as power consumption, available memory, and latency constraints, to name a few. Even though there are methods with more advanced techniques like weight sharing [40] to improve RL's search efficiency, and binary gating [41] to reduce the GPU memory footprint of relaxation-based NAS methods, most existing RL and relaxation-based methods are not readily applicable for multi-objective NAS.

Despite the many different NAS methods being continually proposed, evolutionary algorithms (EAs) are getting a plethora of attention, due to their population-based nature and flexibility in encoding. They offer a viable alternative to conventional machine learning (ML)-oriented approaches, especially under the scope of multi-objective NAS. An EA, in general, is an iterative process in which individuals in a population are made gradually better by applying variations to selected individuals and/or recombining parts of multiple individuals. Despite the ease of extending them to handle multiple objectives, most existing EA-based NAS methods [42-44] are still single-criterion driven.
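To make the nested structure of Eq. (2.1) concrete, the following is a minimal, self-contained sketch on a toy regression problem, not the dissertation's actual NAS pipeline: the upper-level variable is a discrete model choice (a polynomial degree standing in for an architecture alpha), the lower level fits the weights w by gradient descent on training data, and the upper-level objective F is available only as a black-box value computed on validation data after the inner optimization has finished. The synthetic data, the polynomial model family, and the optimizer settings are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "task": in real NAS, D_trn / D_vld would be image datasets.
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(200)
x_trn, y_trn, x_vld, y_vld = x[:150], y[:150], x[150:], y[150:]

def features(x, alpha):
    # alpha (the polynomial degree) plays the role of the architecture:
    # it fixes the model family before any weights exist.
    return np.vander(x, alpha + 1, increasing=True)

def lower_level(alpha, steps=2000, lr=0.1):
    """Inner problem of Eq. (2.1): given alpha, minimize the training loss
    f(w; alpha) over the weights w by plain gradient descent."""
    X = features(x_trn, alpha)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y_trn) / len(y_trn)
        w -= lr * grad
    return w

def upper_level(alpha):
    """Outer objective F: validation loss of the *trained* model. Every
    evaluation of F hides a complete run of the inner optimization."""
    w_star = lower_level(alpha)
    return np.mean((features(x_vld, alpha) @ w_star - y_vld) ** 2)

# The upper level is discrete and non-differentiable, so it is treated as a
# black box: here we simply enumerate the candidate degrees.
scores = {alpha: upper_level(alpha) for alpha in range(1, 7)}
print(min(scores, key=scores.get), scores)
```

Even in this toy setting, each upper-level evaluation hides an entire inner optimization run; this is exactly the cost that the surrogate models and weight-sharing schemes developed in the later chapters aim to amortize.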
2.2 Evolutionary Bi-level Optimization

Bi-level optimization problems are characterised by a hierarchical two-level structure, in which there is a nested inner optimization problem acting as a constraint to an outer optimization problem. The outer optimization problem is often referred to as the upper level task or the leader task, and the inner optimization problem is often referred to as the lower level task or the follower task. Bi-level optimization problems are typically challenging to handle because the feasibility of a solution (upper level) can only be determined after another optimization process (lower level). Figure 2.1 illustrates a general bi-level problem solving structure involving interlinked optimization and decision-making tasks at both levels.

Figure 2.1: General sketch of a bi-level problem taken from [10].

Practical applications that are hierarchical in nature span all areas of sciences and engineering. For instance, such problems arise in transportation and network design [45, 46], environmental economics [47], chemical engineering [48, 49], optimal design [50, 51], defense applications [52, 53], facility location [54, 55], machine learning [56], etc. In addition, many problems [57] in practice can be decomposed into multiple but interconnected levels so that each level of the problem has fewer variables and parameters, and is made up of objectives and constraints of its own, for easier understanding and handling.

Most of the evolutionary algorithms for bi-level optimization are nested in nature, where one optimization algorithm is used within the other. The outer algorithm handles the upper level problem and the inner algorithm handles the lower level problem. Such a structure necessitates that the inner algorithm is called for every upper level point generated by the outer algorithm. Therefore, nested approaches can be quite computationally demanding, and can only be applied to small scale problems. One can find studies with an evolutionary algorithm being used for the upper level problem and a classical approach being used for the lower level problem. If the lower level problem is complex, researchers have used evolutionary algorithms at both levels. Below we provide a review of evolutionary bi-level optimization algorithms from the past.

Mathieu et al. [58] was one of the first to propose a bi-level algorithm using evolutionary algorithms. He used a genetic algorithm to handle the upper level problem and linear programming to solve the lower level problem for every upper level member generated using genetic operations. This study was followed by nesting the Frank-Wolfe algorithm (reduced gradient method) within a genetic algorithm in Yin [59]. Other authors utilized similar nested schemes in [60, 61]. Studies involving evolutionary algorithms at both levels include [62, 63], where the authors used differential evolution at both levels in the first study, and differential evolution within ant colony optimization in the second study.

Replacing the lower level problem in bi-level optimization with its Karush-Kuhn-Tucker (KKT) conditions is a common approach for solving the problem both in the classical as well as the evolutionary computation literature. However, a KKT based reduction can only be applied to problems where the lower level is convex and adheres to certain regularity conditions [64]. Some of the past evolutionary studies that utilize this idea include [65, 66]. The approach has been popular, and even recently researchers are relying on reducing the bi-level problem into a single level problem using KKT and solving the reduced problem using an evolutionary algorithm; for example, see [67-70].

While KKT conditions can only be applied to problems where the lower level adheres to certain mathematically simplifying assumptions, researchers are exploring techniques that can solve more general instances of bi-level optimization problems. Some of the approaches are based on meta-modeling the mappings within bi-level optimization, while others may be based on meta-modeling the entire bi-level problem itself. Studies in this direction include [71].
2.3 Evolutionary Multi-objective Optimization

Evolutionary multi-objective optimization (EMO) methods based on evolutionary algorithms (EAs) have become popular and been widely used in the past few decades [72, 73]. EMO methods start with a population of solutions, which is made gradually better by applying genetic operators (e.g., crossover and mutation). This dissertation builds upon an evolutionary algorithm framework because of its flexibility and wide applicability. In particular, evolutionary methods typically do not require or assume any convexity or differentiability of the objective or constraint functions, and their population-based approach provides natural advantages in handling problems with locally optimal solutions.

Instead of finding a single Pareto-optimal solution at a time (through objective scalarization), Pareto-dominance based multi-objective optimization methods attempt to find a set of Pareto-optimal solutions in a single simulation run [72]. The goals of an EMO are:

1. Find a well-converged set of solutions, and
2. Find a well-diversified set of solutions across the entire trade-off front.

The population approach of an EMO and the flexibility in its operators allow both of the above goals to be achieved for multi-objective problems. EMO algorithms, such as NSGA-II [74] and SPEA2 [75], are popular methods for handling two- and three-objective problems. Recent extensions, such as NSGA-III [76] and MOEA/D [77], are able to handle four or more objectives (even more than ten objectives). The topic is so important today that handling such large numbers of objectives is given a separate name, evolutionary many-objective optimization. Figure 2.2 illustrates the principle of Pareto-dominance based multi-objective optimization.

Figure 2.2: A principle of Pareto-dominance based multi-objective optimization procedure is illustrated.

In each iteration, a set of points gets operated on by an EMO procedure and a new set of points, hopefully closer to the Pareto front and with more diversity, is generated. This process is expected to continue with generations before converging on the true Pareto front covering its entire range. When a Pareto-dominance based multi-objective optimization algorithm transitions from one set of non-dominated points to a hopefully better set of non-dominated points, there is a need to evaluate the convergence and diversity of the new set of solutions vis-a-vis the old solutions. This is because, if we can measure both aspects well through one or more performance metrics, they can also be used to determine the termination of the optimization algorithm.

Hypervolume (HV) is a widely-used metric for comparing algorithms under multiple objectives. HV offers numerous advantages, including Pareto compliance [78], and it does not require the true Pareto front beforehand. In the bi-objective case, hypervolume can be interpreted as the dominated area with respect to a reference point. One hypothetical example is provided in Figure 2.3, where the circled points are the solutions found by multi-objective algorithms and the corresponding hypervolume is shown in shade. The higher the hypervolume measure, the better are the solutions found in terms of both objectives. An estimate of the nadir point, defined as a vector comprised of the worst objective values of all Pareto-optimal solutions, is a common choice for setting the reference point.

Figure 2.3: An illustration of hypervolume under bi-objective optimization.

Despite several advantages, EAs have one major limitation: they can consume many function evaluations to find an approximated set of Pareto-optimal (PO) solutions [79]. This issue becomes more prominent when the problem to be solved involves computationally expensive functions, which is another challenge in solving industrial optimization problems. To obtain solutions for expensive problems in a limited number of expensive function evaluations, surrogates have been used in the literature as an alternative to expensive evaluations.
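Since the hypervolume metric recurs throughout the experiments in this dissertation, a minimal sketch of how the bi-objective case of Figure 2.3 can be computed is given below. It assumes both objectives are to be minimized (e.g., classification error and a normalized complexity measure) and that the reference point is an estimate of the nadir point; the numerical values in the example are made up purely for illustration.

```python
import numpy as np

def hypervolume_2d(points, reference):
    """Dominated area of a bi-objective minimization front with respect to a
    reference point (e.g., an estimated nadir point), as in Figure 2.3."""
    pts = np.asarray(points, dtype=float)
    ref = np.asarray(reference, dtype=float)
    pts = pts[np.all(pts < ref, axis=1)]          # ignore points beyond the reference
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(pts[:, 0])]              # sweep along the first objective
    hv, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < best_f2:                          # point is non-dominated in the sweep
            hv += (ref[0] - f1) * (best_f2 - f2)  # add the newly dominated rectangle
            best_f2 = f2
    return hv

# Example: three trade-off solutions given as (error, normalized complexity) pairs.
front = [(0.10, 0.9), (0.20, 0.5), (0.40, 0.2)]
print(hypervolume_2d(front, reference=(1.0, 1.0)))
```

A larger value indicates a front that is both better converged and better spread, which is also why hypervolume is a convenient termination criterion for an EMO run.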
2.4 Evolutionary Surrogate-Assisted Optimization

In most practical (real-world) optimization problems (including NAS), the evaluation of a solution typically involves computationally expensive simulations and procedures, which remains one of the main bottlenecks of completing an optimization run within a reasonable amount of time. However, since solutions created at the beginning of an optimization simulation are expected to be far from the optimal region, an exact and precise evaluation of the solution may not be necessary. As the solutions progress towards the optimal region, a more precise evaluation may be invoked.

The recent developments of optimization methods have led to an increasing interest in approximation models or surrogate models [80, 81]. The use of metamodels (or surrogate models) to approximate the functional form of exact objectives and constraints by using a few expensive solution evaluations is a common approach [82]. Among various methods, the Kriging (also known as Gaussian Process) method is one of the widely used surrogate models, which can provide an estimated function value and also simultaneously provide an error estimate of the approximation [83]. The optimization process is then continued using the surrogate model until the model is deemed to be accurate enough in the current region of focus. Due to this reason, surrogate models are effective in reducing the overall computational time required in real-world design optimization.

2.5 Related Work

Recent years have witnessed growing interest in neural architecture search (NAS). The promise of being able to automatically and efficiently search for task-dependent network architectures is particularly appealing as deep neural networks are widely deployed in diverse applications and computational environments. Early methods [84, 85] made efforts to simultaneously evolve the topology of neural networks along with weights and hyperparameters. These methods perform competitively with hand-crafted networks on control tasks with shallow fully connected networks.

Evolutionary Algorithm (EA): Designing neural networks through evolution has been a topic of interest for a long time. Recent evolutionary approaches focus on evolving solely the topology while leaving the learning of weights to gradient descent algorithms, and using hyper-parameter settings that are manually tuned. Xie and Yuille's work of Genetic CNN [42] is one of the early studies that shows the promise of using EAs for NAS. Real et al. [43] introduce perhaps the first truly large scale application of a simple EA to NAS. The extension of this method presented in [6], called AmoebaNet, provides the first large scale comparison of EA and RL methods. Their EA, using an age-based selection similar to [86], has demonstrated faster convergence to an accurate network when compared to RL and random search. Concurrently, Liu et al. [44] evolve a hierarchical representation that allows non-modular layer structures to emerge. Despite the impressive improvements achieved on various datasets, these EA methods are extremely computationally inefficient, e.g., one run of the regularized evolution method [6] takes one week on 450 GPU cards.

Concurrently, another streamlining of EA methods for use in budgeted NAS has emerged. Suganuma et al. [87] use Cartesian genetic programming to assemble an architecture from existing modular blocks (e.g., Residual blocks). Sun et al. in [88] use a random forest as an efficient surrogate model to predict the performance of architectures, partially eliminating the lower-level optimization via gradient descent. The reported results yield 3x savings in wall clock time with similar classification performance when compared to their previous works [89, 90]. However, results reported from these budgeted EA methods are far from state-of-the-art and are only demonstrated on small-scale datasets, i.e., CIFAR-10 and CIFAR-100.
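The predictors used by such budgeted methods are instances of the surrogate models reviewed in Section 2.4. As a concrete illustration (and not the specific predictor developed later in this dissertation), the sketch below fits a simple radial-basis-function surrogate that maps integer architecture encodings to predicted accuracies, so that most candidates can be screened without ever being trained; the 22-integer encoding and the synthetic accuracy values are assumptions made only for this example.

```python
import numpy as np

class RBFSurrogate:
    """Minimal radial-basis-function surrogate: fit on a handful of
    (architecture encoding, measured accuracy) pairs, then predict the
    accuracy of unseen encodings."""

    def __init__(self, length_scale=2.0, ridge=1e-8):
        self.length_scale, self.ridge = length_scale, ridge

    def _kernel(self, A, B):
        sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dist / (2.0 * self.length_scale ** 2))

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        K = self._kernel(self.X, self.X) + self.ridge * np.eye(len(self.X))
        self.coef = np.linalg.solve(K, np.asarray(y, dtype=float))
        return self

    def predict(self, X):
        return self._kernel(np.asarray(X, dtype=float), self.X) @ self.coef

# Illustration: 30 "evaluated" architectures with a made-up accuracy function.
rng = np.random.default_rng(1)
archs = rng.integers(0, 4, size=(30, 22))       # hypothetical 22-integer encodings
accs = 0.70 + 0.08 * archs.mean(axis=1) / 3 + 0.005 * rng.standard_normal(30)
surrogate = RBFSurrogate().fit(archs, accs)
print(surrogate.predict(rng.integers(0, 4, size=(5, 22))))
```

In practice such a model would be re-fitted periodically on the archive of architectures that have actually been trained, and its predictions used only to pre-screen candidates, which is the role surrogates play in Chapters 4 and 5 of this dissertation.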
Reinforcement Learning (RL): Q-learning [91] is a very popular value iteration method used for RL. The MetaQNN method [92] employs an epsilon-greedy Q-learning strategy with experience replay to search connections between convolution, pooling, and fully connected layers, and the operations carried out inside the layers. Zhong et al. [13] extended this idea with the BlockQNN method. BlockQNN searches the design of a computational block with the same Q-learning approach. The block is then repeated to construct a network, resulting in a much more general network that achieves better results than its predecessor on CIFAR-10 [34].

A policy gradient method seeks to approximate some non-differentiable reward function to train a model that requires parameter gradients, like a neural network architecture. Zoph and Le [36] apply this method in architecture search to train a recurrent neural network controller that constructs networks. The original method in [36] uses the controller to generate the entire network at once. This contrasts with its successor, NASNet [5], which designs a convolutional and pooling block that is repeated to construct a network. NASNet outperforms its predecessor and produces a network achieving state-of-the-art performance on CIFAR-10 and ImageNet. Hsu et al. [93] extend the NASNet approach to a multi-objective domain to optimize multiple linear combinations of accuracy and energy consumption criteria using different scalarization parameters.

Relaxation-based Approaches and Others: Approximating the connectivity between different layers in CNN architectures by real-valued variables weighting the importance of each layer is the common principle of relaxation-based NAS methods. Liu et al. implement this idea in the DARTS algorithm [38]. DARTS seeks to improve search efficiency by fine-tuning the weights while updating the architectures, showing convergence on both CIFAR-10 and Penn Treebank [94] within one day in wall clock time on a single GPU card. Subsequent approaches in this line of research include [95-98]. The search efficiency of these approaches stems from weight sharing during the search process. This idea is complementary to our approach and can be incorporated into an evolutionary framework as well. However, it is beyond the scope of this dissertation and is a topic of future study.

Methods not covered by the EA-, RL- or relaxation-based paradigms have also shown success in architecture search. Liu et al. [2] proposed a method that progressively expands networks from simple cells and only trains the best K networks that are predicted to be promising by an RNN meta-model of the encoding space. Li and Talwalkar [99] show that an augmented random search approach is an effective alternative to NAS. Kandasamy et al. [100] present a Gaussian-process-based approach to optimize network architectures, viewing the process through a Bayesian optimization lens.

Multi-objective NAS: In this work, the term multi-objective NAS refers to methods that simultaneously approximate the entire set of efficient architectures in one run [101]. Kim et al. [102] presented NEMO, one of the earliest evolutionary multi-objective approaches to evolve CNN architectures. NEMO uses NSGA-II [74] to maximize classification performance and inference time of a network and searches over the space of the number of output channels from each layer within a restricted space of seven different architectures. DPP-Net [103], an extension from [2], progressively expands networks from simple structures and only trains the top-K (based on Pareto-optimality) networks that are predicted to be promising by a surrogate model. Elsken et al. [104] present the LEMONADE method, which is formulated to develop networks with high predictive performance and lower resource constraints. LEMONADE reduces compute requirements through custom-designed approximate network morphisms [105], which allow newly generated networks to share parameters with their forerunners, obviating the need to train new networks from scratch. However, LEMONADE still requires close to 100 GPU-days to search on the CIFAR datasets [34].
Multi-objective NAS through Scalarization: A portfolio of works that aim to design hardware-specific network architectures has emerged. These include ProxylessNAS [41], MnasNet [20], FBNet [97], and MobileNetV3 [21], which use a scalarized objective that encourages high accuracy and penalizes compute at the same time, e.g., maximize Acc \times (Latency/Target)^{-0.07}. These methods require a preference weighting of the importance of the different objectives before the search, which in itself requires a number of trials.

Chapter 3

Multi-objective Evolution of Deep Neural Network Architectures

Practical applications of NAS can rarely be considered from the point of view of a single objective of maximizing performance; rather, they must be evaluated from at least one additional, conflicting objective that is specific to the deployment scenario. In this work, we approach the problem of designing high-performance architectures with diverse complexities for different deployment scenarios as a multi-objective bi-level optimization problem. We mathematically formulate the problem as

\min_{\alpha \in \mathcal{A}} \; \big\{ F\big(\alpha, w^{*}(\alpha)\big), \; C(\alpha) \big\}, \quad \text{subject to} \quad w^{*}(\alpha) \in \arg\min_{w \in \Omega} f(w; \alpha), \tag{3.1}

where C(\alpha) measures the complexity of architecture \alpha. The upper-level variable \alpha defines a CNN architecture, and the lower-level variable w(\alpha) defines its associated weights. The objective functions F at the upper level and f at the lower level denote the loss on the validation data D_vld and the training data D_trn, respectively.

Figure 3.1: Overview: Given a dataset and objectives, NSGANet designs a set of custom architectures spanning the trade-off front. NSGANet estimates the performance of an architecture through its proxy model, optimized by Stochastic Gradient Descent (SGD) in the lower level. The search proceeds in exploration via genetic operations, followed by exploitation via distribution estimation. See Algorithm 1 for pseudocode; colors are in correspondence.

3.1 Proposed Approach

Our proposed algorithm, NSGANet, is an iterative process in which initial architectures are made gradually better as a group, called a population. In every iteration, a group of offspring (i.e., new architectures) is created by applying variations through crossover and mutation to the more promising of the architectures already found, also known as parents, from the population. Every member in the population (including both parents and offspring) competes for survival and reproduction (becoming a parent) in each iteration. The initial population may be generated randomly or guided by prior knowledge, i.e., seeding past successful architectures directly into the initial population. Subsequent to initialization, NSGANet proceeds the search in two sequential stages: (i) exploration, with the goal of discovering diverse ways to construct architectures, and (ii) exploitation, which reinforces the emerging patterns commonly shared among the architectures successful during exploration. A set of architectures representing efficient trade-offs between network performance and complexity
8 [ F 1 ;F 2 ;::: ] NondominatedSort ( f; ~ f ( P )) 9 dist CrowdingDistance ( F 1 ;F 2 ;::: ) 10 while g˝ then update ˆ accordingtoEq.(3.2).; 33 end 34 Return parentpopulation P . 21 3.1.1SearchSpaceandEncoding Thesearchforoptimalnetworkarchitecturescanbeperformedovermanytsearch spaces.Thegeneralityofthechosensearchspacehasamajoronthequality ofresultsthatareevenpossible.MostexistingevolutionaryNASapproaches[42,87,88, 104]searchonlyoneaspectofthearchitecturespace|e.g.,theconnectionsand/orhyper- parameters.Incontrast,NSGANetsearchesoverbothoperationsandconnections|the searchspaceisthusmorecomprehensive,includingmostoftheprevioussuccessfularchitec- turesdesignedbothbyhumanexpertsandalgorithmically. ModernCNNarchitecturesareoftencomposedofanouterstructure( network-level ) designwherethewidth(i.e.,#ofchannels),thedepth(i.e.,#oflayers)andthespatial resolutionchanges(i.e.,locationsofpoolinglayers)aredecided;andaninnerstructure ( block-level )designwherethelayer-wiseconnectionsandcomputationsarespe.g., Inceptionblock[30],ResNetblock[11],andDenseNetblock[31],etc.AsseenintheCNN literature,thenetwork-leveldecisionsaremostlyhand-tunedbasedonmeta-heuristicsfrom priorknowledgeandthetaskathand,asisthecaseinthiswork.Forblock-leveldesign,we adopttheoneusedin[2,5,6,38]tobeconsistentwithpreviouswork. A block isasmallconvolutionalmodule,typicallyrepeatedmultipletimestoformthe entireneuralnetwork.Toconstructscalablearchitecturesforimagesoftresolutions, weusetwotypesofblockstoprocessintermediateinformation:(1)the Normal block,a blocktypethatreturnsinformationofthesamespatialresolution;and(2)the Reduction block,anotherblocktypethatreturnsinformationwithspatialresolutionhalvedbyusinga strideoftwo.SeeFig.3.2forapictorialillustration. Weusedirectedacyclicgraphs(DAGs)consistingofenodestoconstructbothtypes 22 ofblocks(aReductionblockusesastrideoftwo).Each node isatwo-branchedstructure, mappingtwoinputstooneoutput.Foreachnodeinblock i ,weneedtopicktwoinputs fromamongtheoutputofthepreviousblock h i 1 ,theoutputoftheprevious-previous block h i 2 ,andthesetofhiddenstatescreatedinanypreviousnodesofblock i .Forpairs ofinputschosen,wechooseacomputationoperationfromamongthefollowingoptions, collectedbasedontheirprevalenceintheCNNliterature: identity 3x3maxpooling 3x3averagepooling squeeze-and-excitation[106] 3x3localbinaryconv[107] 5x5localbinaryconv[107] 3x3dilatedconvolution 5x5dilatedconvolution 3x3depthwise-separableconv 5x5depthwise-separableconv 7x7depthwise-separableconv 1x7then7x1convolution Theresultscomputedfrombothbranchesarethenaddedtogethertocreateanewhidden state,whichisavailableforsubsequentnodesinthesameblock.SeeFig.3.2(b)-3.2(d)for pictorialillustrations.Thesearchspaceweconsiderinthispaperisanexpandedversionof the microsearchspace usedinourpreviouswork[15].Sp,thecurrentsearchspace (i)searchforafactorthatgraduallyincrementsthechannelsizeofeachblockwithdepth (seeFig.3.2(a))asopposedtosharplydoublingthechannelsizewhendown-sampling.(ii) considersanexpandedsetofprimitiveoperationstoincludebothmorerecentadvancedlayer primitivessuchassqueeze-and-excitation[106]andmoreparametetlayerprimitives likelocalbinaryconvolution[107]. 
Withtheabove-mentionedsearchspace,thereareintotal20decisionstoconstitutea blockstructure,i.e.choosetwopairsofinputandoperationforeachnode,andrepeatfor 23 (a) (b) (c) (d) Figure3.2:SchematicoftheNSGANetsearchspacemotivatedfrom[5]:(a)Anarchitecture iscomposedofstackedblocks.(b)Thenumberofchannelsineachblockisgradually increasedwithdepthofthenetwork.(c)Eachblockiscomposedofenodes,where eachnodeisatwo-branchedcomputationappliedtooutputsfromeitherpreviousblocksor previousnodeswithinthesameblock.(d)Agraphicalvisualizationof(c). enodes.Theresultingnumberofcombinationsforablockstructureis: B = ( n +1)! 2 ( n ops ) 2 n where n denotesthenumberofnodes, n ops denotesthenumberofconsideredoperations. Therefore,withoneNormalblockandoneReductionblockwithenodesineach,the overallsizeoftheencodedsearchspaceisapproximately10 33 . 3.1.2AlgorithmProcedures PerformanceEstimationStrategy: ToguideNSGANettowardsmoreaccurate andtarchitectures,weconsidertwometricsasobjectives,namely,ac- curacyandarchitecturecomplexity.Assessingtheaccuracyofanarchitecture duringsearchrequiresanotheroptimizationtoidentifytheoptimalvaluesoftheasso- ciatedweights(seeAlgorithm2).Eventhoughthereexistwell-establishedgradientdescent algorithmstotlysolvethisoptimization,repeatedlyexecutingsuchanalgorithmfor 24 everycandidatearchitecturerenderstheoverallprocesscomputationallyveryprohibitive. Therefore,toovercomethiscomputationalbottleneck,wecarefully(usingaseriesofabla- tionstudies)down-scalethearchitecturestocreatetheirproxymodels[5,6],whichcanbe optimizedtlyinthelower-levelthroughSGD(Algo.2)Theirperformancesbecome surrogatemeasurementstoselectarchitecturesduringsearch. Algorithm2: PerformanceEvaluationofaCNN Input: Thearchitecture ,trainingdata D trn , validationdata D vld ,number T ofepochs, weightdecay ,initiallearningrate max . 1 ! Randomlyinitializetheweightsin ; 2 t 0; 3 while t > < > > : a i +((2 u ) 1 = (1+ m ) 1)( a i a ( L ) i ) ; for u 0 : 5 ; a i +(1 2(1 u ) 1 = (1+ m ) )( a ( U ) i a i ) ; for u> 0 : 5 (5.1) where u isauniformrandomnumberintheinterval[0 ; 1]. a ( L ) i and a ( U ) i arethelower andupperboundsof a i ,respectively.Eachmutatedvalueinaningisroundedto thenearestinteger.ThePMoperatorinheritsthe parent-centric convention,inwhichthe areintentionallycreatedaroundtheparents.Thecentricityiscontrolledviaan indexhyperparameter m .Inparticular,high-valuesof m tendtocreatemutated aroundtheparent,andlow-valuesencouragemutatedtobefurtherawayfromthe parentarchitecture.SeeFig.5.5(b)foravisualizationoftheof m .Itistheworth notingthatthePMoperatorwasoriginallyproposedforcontinuousoptimizationwheredis- tancesbetweenvariablevaluesarenaturallyIncontrast,incontextofourencoding, ourvariablesarecategoricalinnature,indicatingaparticularlayerhyperparameter.Sowe sortthesearchedsubnetsinascendingorderof#MAdds,suchthat m nowcontrolsthe in#MAddsbetweentheparentandthemutated WeapplyPMtoeverymemberintheingpopulation(createdfromcrossover). Wethenmergethemutatedpopulationwiththeparentpopulationandselectthe tophalfusingmany-objectiveselectionoperatordescribedinAlgorithm7.Thisprocedure createstheparentpopulationforthenextgeneration.Werepeatthisoverallprocessfora 70 pre-spnumberofgenerationsandoutputtheparentpopulationattheconclusionof theevolution. 
5.1.4 Many-Objective Selection

In addition to high predictive accuracy, real-world applications demand NAS algorithms to simultaneously balance a few other objectives that are specific to the deployment scenarios. For instance, mobile or embedded devices often have restrictions in terms of model size, multiply-adds, latency, power consumption, and memory footprint. With no prior assumption on the correlation among these objectives, a selection that is scalable to the number of objectives is required to drive the search towards the high dimensional Pareto front. In this work, we adopt the reference point guided selection originally proposed in NSGA-III [76], which has been shown to be effective in handling problems with as many as 15 objectives. In the remainder of this section, we provide an overview of the NSGA-III procedure and refer readers to the original publication for more details.

Algorithm 7: Reference Point Based Selection
Input: A set of archs R, their objectives F, number of archs to select N, reference directions Z.
1: // put archs into different fronts (rank levels) based on domination.
2: (F_1, F_2, ...) <- non_dominated_sort(F)
3: S <- {}, i <- 1
4: while |S| + |F_i| ...

...

Avg. Gain(i, j) = sum_{k=1}^{M} max(0, f_k(i) - f_k(j)) / sum_{k=1}^{M} 1{f_k(i) > f_k(j)}

Thereafter, the solution having the highest trade-off value indicates that it causes the largest average loss in some objectives to make a unit average gain in other objectives when choosing any of its neighbors. If this highest trade-off value is statistically much larger than that of other solutions, then the highest trade-off solution is the preferred choice, in case of no preferences provided by users.

5.2 Experiments and Results

In this section we present experimental results to evaluate the effectiveness of Neural Architecture Transfer on multiple image classification tasks. In addition we also investigate the scalability of our approach to more than two objectives. For all the experiments in this section, we use the same set of hyperparameters (see Table 5.1) for the different components of NAT. These choices were guided by the ablation studies described in Section 5.3.

5.2.1 Datasets

We consider eleven image classification datasets for evaluation with sample size varying from 2,040 to 180,000 images (20 to 18,000 images per class; Table 5.2). These datasets span a wide variety of image classification tasks, including superordinate-level recognition (ImageNet [1], CIFAR-10 [34], CIFAR-100 [34], CINIC-10 [127], STL-10 [128]); fine-grained recognition (Food-101 [134], Stanford Cars [135], FGVC Aircraft [136], Oxford-IIIT Pets [138], Oxford Flowers102 [129]); and texture classification (DTD [137]). We use the ImageNet dataset for training the supernet, and use the other ten datasets for architecture transfer.

Table 5.1: Hyperparameter Settings

Category              Parameter                              Setting
global                archive size                           300
                      number of iterations                   30
accuracy predictor    train size                             100
                      ensemble size                          500
evolutionary search   population size                        100
                      number of generations per iteration    100
                      crossover probability                  0.9
                      mutation probability                   0.1
                      mutation index eta_m                   1.0
supernet              number of epochs per iteration         5

Table 5.2: Benchmark Datasets for Evaluation

Dataset                      Type           Train Size    Test Size    # Classes
ImageNet [1]                 multi-class    1,281,167     50,000       1,000
CINIC-10 [127]               multi-class    180,000       9,000        10
CIFAR-10 [34]                multi-class    50,000        10,000       10
CIFAR-100 [34]               multi-class    50,000        10,000       100
STL-10 [128]                 multi-class    5,000         8,000        10
Food-101 [134]               fine-grained   75,750        25,250       101
Stanford Cars [135]          fine-grained   8,144         8,041        196
FGVC Aircraft [136]          fine-grained   6,667         3,333        100
DTD [137]                    texture        3,760         1,880        47
Oxford-IIIT Pets [138]       fine-grained   3,680         3,369        37
Oxford Flowers102 [129]      fine-grained   2,040         6,149        102

5.2.2 Supernet Preparation

Our supernet is constructed by setting the architecture encoding at the maximum value, i.e. four layers in each stage and every layer uses an expand ratio of six and a kernel size of seven.
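Concretely, a minimal sketch of this maximal encoding is shown below; only the four-layers-per-stage, expand-ratio-six, kernel-size-seven settings come from the text, while the number of stages is an assumption for illustration.

# Maximal architecture encoding used to instantiate the supernet.
NUM_STAGES = 5  # assumed for illustration; not specified in this excerpt

supernet_encoding = [
    {"num_layers": 4, "expand_ratio": 6, "kernel_size": 7}  # every stage at its maximum
    for _ in range(NUM_STAGES)
]
print(supernet_encoding)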
Adapting subnets of a supernet with randomly initialized weights leads to training instability and large variance in its performance. Therefore, we warm-up the supernet weights on ImageNet following the progressive shrinking algorithm [124], where the supernet is first trained at full scale, with subnets corresponding to different options (expand ratio, kernel size, # of layers) being gradually activated during the training process. This procedure, which takes about 6 days on a server with eight V100 GPUs, is optimized with only the cross-entropy loss, i.e., a single objective. We note that the supernet preparation expense is a one-time cost that amortizes over any subsequent transfer to different datasets and objective combinations we show in the following subsections. (Many supernets, pre-trained on ImageNet, are already available if one does not want to train them from scratch; see https://github.com/mit-han-lab/once-for-all, https://github.com/meijieru/AtomNAS and https://github.com/JiahuiYu/slimmable_networks.)

5.2.3 ImageNet

Before we evaluate our approach for architecture transfer to other datasets, we first validate its effectiveness on the ImageNet-1K dataset. This experiment evaluates the effectiveness of NAT in adapting and searching for efficient architectures that span the trade-off between two objectives. For this experiment, we consider accuracy and #MAdds as the two objectives of interest. We run NAT for 30 iterations, and from the archive of architectures we select four models ranging from 200M MAdds to 600M MAdds (for high-end mobile devices). Following [124] we further fine-tune each model for 75 epochs. (Section 5.3.5 studies the impact of ablating this fine-tuning step.) Our training largely follows [20]: RMSProp optimizer with decay 0.9 and momentum 0.9; batch normalization momentum 0.99; weight decay 1e-5. We use a batch size of 512 and an initial learning rate of 0.012 that gradually reduces to zero following the cosine annealing schedule. Our regularization settings are similar to those in [17]: we use the augmentation policy of [139], drop connect ratio 0.2, and dropout ratio 0.2.

Table 5.3 shows the performance of NAT models obtained through bi-objective optimization of maximizing accuracy and minimizing #MAdds. Our models, referred to as NAT-M{1,2,3,4}, are in ascending order of #MAdds (Fig. 5.7). Figure 5.8 shows the full #MAdds-accuracy trade-off curve comparison between NAT and existing NAS methods.

Results indicate that NATNets completely dominate (i.e. are better in both #MAdds and accuracy) all existing designs, both manual and from other NAS algorithms, under mobile settings (<= 600M MAdds). Compared to manually designed networks, NAT is noticeably more efficient. NAT-M1 is 2.3% and 1.5% more accurate than MobileNet-v3 [21] and FBNetV2-F4 [4] respectively, while being equivalent in efficiency (i.e. #MAdds, CPU and GPU latency). Furthermore, NATNets are consistently 6% more accurate than MobileNet-v2 [12] scaled by width multiplier from 200M to 600M #MAdds. Our largest model, NAT-M4, achieves a new state-of-the-art ImageNet top-1 accuracy of 80.5% under mobile settings (<= 600M #MAdds). Interestingly, even though this experiment did not explicitly optimize for CPU or GPU latency, NATNets are faster than those (MobileNet-V3 [21], MnasNet [20]) that explicitly do optimize for latency.

5.2.4 Architecture Transfer

Existing NAS approaches are rarely applied to tasks beyond standard datasets (i.e. CIFAR-10 [34] and ImageNet [1]), where the classification task is at superordinate level and the # of training images is sufficiently large. Instead, they adopt a conventional transfer learning setup [144], in which the architectures found by searching on standard benchmark datasets are transferred as is, with weights fine-tuned to new datasets. We argue that such a process is conceptually contradictory to the goal of NAS. The architectures transferred from standard datasets are sub-optimal either with respect to accuracy, efficiency, or both. On the other hand, by transferring both architecture and weights NAT can indeed design bespoke models for each task.

Figure 5.7: ImageNet Architectures from Trade-off Front.
Table 5.3: ImageNet-1K [1]: NATNets comparison with manual and automated design of efficient convolutional neural networks. Models are grouped into sections for better visualization. Our results are underlined and the best result in each section is in bold. CPU latency (batch size = 1) is measured on an Intel i7-8700K and GPU latency (batch size = 64) is measured on a 1080Ti. "WS" stands for weight sharing. All methods are under single crop and single model condition, without any additional data.

Model                        Method          #Params   #Multi-Adds   CPU Lat (ms)   GPU Lat (ms)   Top-1 Acc (%)   Top-5 Acc (%)
NAT-M1                       WS+EA           6.0M      225M          9.1            30             77.5            93.5
MobileNet-v2 [12]            manual          3.5M      300M          8.3            23             72.0            91.0
ProxylessNAS [41]            RL/gradient     4.0M      465M          8.5            27             75.1            92.5
MnasNet-A1 [20]              RL              3.9M      312M          9.3            31             75.2            92.5
MobileNet-v3 [21]            RL/NetAdapt     5.4M      219M          10.6           33             75.2            -
MUXNet-m [22]                EA              3.4M      218M          14.7           42             75.3            92.5
FBNetV2-F4 [4]               gradient        7.0M      238M          15.6           44             76.0            -
EfficientNet-B0 [17]         RL/scaling      5.3M      390M          14.4           46             76.3            93.2
NAT-M2                       WS+EA           7.7M      312M          11.4           37             78.6            94.3
MUXNet-l [22]                EA              4.0M      318M          19.2           74             76.6            93.2
AtomNAS-C+ [126]             WS+shrinkage    5.9M      363M          -              -              77.6            93.5
AutoNL-L [140]               gradient        5.6M      353M          -              -              77.7            93.7
DNA-c [141]                  gradient        5.3M      466M          14.5           67             77.8            93.7
ResNet-152 [11]              manual          60M       11.3B         66.7           176            77.8            93.8
NAT-M3                       WS+EA           9.1M      490M          16.1           62             79.9            94.9
EfficientNet-B1 [17]         RL/scaling      7.8M      700M          19.5           67             78.8            94.4
MixNet-L [125]               RL              7.3M      565M          29.4           105            78.9            94.2
BigNASModel-L [142]          WS              6.4M      586M          -              -              79.5            -
OnceForAll [124]             WS+EA           9.1M      595M          16.5           72             80.0            94.9
NAT-M4                       WS+EA           9.1M      0.6B          17.3           78             80.5            95.2
Inception-v4 [143]           manual          48M       13B           84.6           206            80.0            95.0
Inception-ResNet-v2 [143]    manual          56M       13B           99.1           289            80.1            95.1

Figure 5.8: MAdds vs. ImageNet Accuracy. NATNets outperform other models in both objectives. In particular, NAT-M4 achieves a new state-of-the-art top-1 accuracy of 80.5% under mobile settings (<= 600M MAdds). NAT-M1 improves MobileNet-v3 top-1 accuracy by 2.3% with similar #MAdds.

We evaluated NAT on ten image classification datasets (see Table 5.2) that present different challenges in terms of diversity in classes (superordinate vs. fine-grained) and size of training set (large vs. small). For each dataset, we run NAT with two objectives: maximize top-1 accuracy on validation data (20% randomly separated from the training set) and minimize #MAdds. We start from the supernet trained on ImageNet (which is created once before all experiments; see Section 5.2.2) and transfer it to the new task. During this procedure the last linear layer is reset depending on the number of categories in the new task. NAT is then applied for a total of 30 iterations. In each iteration the supernet is adapted for 5 epochs using SGD with a momentum of 0.9. The learning rate is initialized at 0.01 and annealed to zero in 150 epochs (30 iterations with five epochs in each). All hyperparameters are set at default values from Table 5.1. For each dataset, the overall NAT process takes slightly under a day on a server with eight 2080Ti GPUs.

Figure 5.9: MAdds vs. Accuracy trade-off curves comparing NAT to existing architectures on a diverse set of datasets. The datasets are arranged in ascending order of training set size. Methods shown in the legend pre-train on ImageNet and fine-tune the weights on the target dataset. Methods in blue train from scratch or use external training data. (Actual values of each NATNet model are provided in Table 5.4.)
Figure 5.9 shows the accuracy and #MAdds for each dataset across a wide range of models, including NATNets and conventional transfer learning (fine-tuned weights) of existing NAS and hand-designed models. Across all datasets NATNets consistently achieve better accuracy while being an order of magnitude more efficient (#MAdds) than existing models. Under mobile settings (<= 600M MAdds), NATNets achieve the state-of-the-art on these datasets, and a new state-of-the-art accuracy on both STL-10 [128] and CINIC-10 [127] (according to [19] for STL-10, and [16] for CINIC-10). Surprisingly, on small scale datasets, e.g. Oxford Flowers102 [129], Oxford-IIIT Pets [138], DTD [137] and STL-10 [128], we observe that NATNets are significantly more effective than conventional fine-tuning. Even on fine-grained datasets such as Stanford Cars and FGVC Aircraft, where conventional fine-tuning did not improve upon training from scratch, NATNets improve accuracy while also being significantly more efficient.

Figure 5.10: Transferred architectures (350M MAdds) on ten datasets.

Figure 5.10 shows a visualization of architectures with 350M MAdds for each dataset. The lack of similarity in the networks suggests that NAT was able to generate networks customized for each task. Additional visualization of architectures searched on all datasets is provided in Figure 5.11.

Table 5.4: NAT model performance corresponding to Figure 5.9.

Dataset                      #Params   #MAdds   Top-1 Acc (%)
Oxford Flowers102 [129]      3.3M      152M     97.5
                             3.4M      195M     97.9
                             3.7M      250M     98.1
                             4.2M      400M     98.3
Oxford-IIIT Pets [138]       4.0M      160M     91.8
                             5.5M      306M     93.5
                             5.7M      471M     94.1
                             8.5M      744M     94.3
DTD [137]                    2.2M      136M     76.1
                             4.0M      297M     77.6
                             4.1M      347M     78.4
                             6.3M      560M     79.1
STL-10 [128]                 4.4M      240M     96.7
                             5.1M      303M     97.2
                             7.5M      436M     97.8
                             7.5M      573M     97.9
FGVC Aircraft [136]          3.2M      175M     87.0
                             3.4M      235M     89.0
                             5.1M      388M     90.1
                             5.3M      581M     90.8
Stanford Cars [135]          2.4M      165M     90.9
                             2.7M      222M     92.2
                             3.5M      289M     92.6
                             3.7M      369M     92.9
CIFAR-100 [34]               3.8M      261M     86.0
                             6.4M      398M     87.5
                             7.8M      492M     87.7
                             9.0M      796M     88.3
CIFAR-10 [34]                4.3M      232M     97.4
                             4.6M      291M     97.9
                             6.2M      392M     98.2
                             6.9M      468M     98.4
Food-101 [134]               3.1M      198M     87.4
                             4.1M      266M     88.5
                             3.9M      299M     89.0
                             4.5M      361M     89.4
CINIC-10 [127]               4.6M      317M     93.4
                             6.2M      411M     94.1
                             8.1M      501M     94.3
                             9.1M      710M     94.8

Figure 5.11: Non-dominated architectures with respect to {top-1 accuracy, #MAdds} obtained by NAT on different datasets (Flowers102, Pets, DTD, STL-10, Aircraft, Stanford Cars, CIFAR-10, Food-101, ImageNet).

5.2.5 Scalability to Objectives

Practical applications of NAS can rarely be considered from the point of view of a single objective, and most often, they must be evaluated from many different, possibly competing, objectives. We demonstrate the scalability of NAT to more than two objectives and evaluate its effectiveness.

Figure 5.12: Top row: NATNets obtained from tri-objective search to maximize ImageNet top-1 accuracy, minimize model size (#Params), and minimize {#MAdds, CPU latency, GPU latency} from left to right. Bottom row: 2D projections from the above 3D scatter, showing top-1 accuracy vs. each of the four efficiency related measurements. To better visualize (the comparison with MobileNet-v3 [21] and MUXNet [22]), partial solutions from the non-dominated frontiers are shown. All top-1 accuracies shown are without fine-tuning.

5.2.5.1 Three Objectives

We use NAT to simultaneously optimize for three objectives, namely, model accuracy on ImageNet, model size (#params), and model computational efficiency. We consider three different metrics to quantify computational efficiency: #MAdds, CPU latency, and GPU latency. In total, we run three instances of three-objective search, i.e. maximize accuracy, minimize #params, and minimize one of #MAdds, CPU latency or GPU latency. We follow the settings from the ImageNet experiment in Section 5.2.3, except the fine-tuning step.
After obtaining the non-dominated (trade-off) solutions, we visualize the objectives in Fig. 5.12. We observe that Pareto surfaces emerge at the higher model complexity regime (i.e. high #params, #MAdds, etc.), shown in the 3D scatter plot in the top row, suggesting that trade-offs exist between model size (#params) and model efficiency (#MAdds and latency). In other words, #params and {#MAdds, CPU, GPU latency} are not completely correlated; e.g. a model with fewer #params is not necessarily more efficient in #MAdds or latency than another model with more #params. This is one of the advantages of using a many-objective optimization algorithm compared to optimizing a single scalarized objective (such as a weighted sum of objectives).

Figure 5.12 visualizes, in 2D, the top-1 accuracy as a trade-off with each one of the four considered efficiency metrics in the bottom row. The 2D projection is obtained by ignoring the third objective. For better visualization we only show the architectures that are close to the performance of MobileNet-v3 [21]. NATNets obtained directly from the three-objective search, i.e., before any fine-tuning of their weights, consistently outperform MobileNet-v3 on ImageNet along all the objectives (top-1 accuracy, #params, #MAdds, CPU and GPU latency). Additionally, we compare to MUXNets [22], which are also obtained from a three-objective NAS optimizing {top-1 accuracy, #params, and #MAdds}. However, MUXNets adopt a search space that is specifically tailored for reducing model size. Therefore, in comparison to MUXNets, we observe that NATNets perform favourably on all the remaining three metrics, except for #params.

5.2.5.2 Twelve Objectives

To further validate the scalability of NAT to a large number of objectives, we consider the top-1 accuracy on each of the 11 datasets shown in Table 5.2 along with #MAdds, as separate objectives, resulting in a 12-objective optimization problem. Not only is such a large-scale optimization plausible with NAT, it also reveals important information, which a low-dimensional optimization may not. During search, the accuracy on each dataset is computed by inheriting weights from the dataset-specific supernets generated from previous experiments (Section 5.2.4). Since the supernets are already adapted to each dataset, we exclude the supernet adaptation step in NAT for this experiment.

Figure 5.13: Left: Parallel Coordinate Plot (PCP) where each vertical bar is an objective and each line is a non-dominated architecture achieved by NAT from a 12-objective optimization of minimizing #MAdds and maximizing accuracy on the 11 datasets in Table 5.2. The model with the best trade-off (see Section 5.1.6 for details) is highlighted in dark blue. Right: 1-on-1 comparison between the selected NATNet (top-ranked in trade-off) and representative peer models on top-1 accuracy on various datasets. The method with larger area is better.

Figure 5.13 (Left) shows the 12 objective values for all 45 non-dominated architectures obtained by NAT in a parallel coordinate plot (PCP), where each vertical bar is an objective and each line connecting all 12 vertical bars is an architecture. We now apply the trade-off decision analysis presented in Section 5.1.6 and observe that the highest trade-off solution is more than (mean + 3 sigma) trade-off away from the rest of the 44 solutions (see Figure 5.14). This solution is highlighted in dark blue in Fig. 5.13 (Left). Its intermediate performance in all objectives indicates that this best trade-off solution makes a good compromise on all 12 objectives among all 45 obtained solutions. In Fig. 5.13 (Right), we compare this solution with different baseline models that are fine-tuned to each dataset separately. Notably, our NATNet achieves better accuracy on all datasets with similar or fewer #MAdds than EfficientNet-B0 [17], MobileNet-v2 [12], NASNet-A [5], and ResNet-50 [11], making our highest trade-off solution a preferred one.

Figure 5.14: Trade-off values of all 45 architectures achieved from the 12-objective search presented in Figure 5.13.
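As a concrete illustration of this selection mechanism, below is a minimal Python sketch of the Avg. Gain quantity from Section 5.1.6 and the loss-per-unit-gain ratio between a pair of solutions. How these pairwise ratios are aggregated over a solution's neighbors into a single trade-off value follows Section 5.1.6 and is not reproduced here; all names and the toy objective vectors are ours.

import numpy as np

def avg_gain(f_i, f_j):
    """Avg. Gain(i, j): average per-objective improvement of i over j
    (all objectives are assumed to be maximized here):
        sum_k max(0, f_k(i) - f_k(j)) / sum_k 1{f_k(i) > f_k(j)}."""
    diff = np.asarray(f_i, float) - np.asarray(f_j, float)
    n_better = int(np.sum(diff > 0))
    return float(np.clip(diff, 0, None).sum()) / n_better if n_better else 0.0

def loss_per_unit_gain(F, i, j):
    """Average loss incurred per unit of average gain when moving from solution i
    to solution j; a solution-level trade-off value aggregates these ratios over
    the solution's neighbors."""
    gain = avg_gain(F[j], F[i])
    loss = avg_gain(F[i], F[j])
    return float("inf") if gain == 0 else loss / gain

# Toy example: three non-dominated solutions described by
# (accuracy on dataset A, accuracy on dataset B, -#MAdds), all maximized.
F = np.array([[0.90, 0.80, -300.0],
              [0.92, 0.78, -350.0],
              [0.95, 0.70, -500.0]])
ratios = [[loss_per_unit_gain(F, i, j) if i != j else 0.0 for j in range(3)]
          for i in range(3)]
print(np.round(ratios, 2))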
The above analysis alludes to a computational mechanism for choosing a single preferred solution from the Pareto solutions obtained by a many-objective optimization algorithm. If such an overwhelmingly high trade-off solution exists in the Pareto front, it becomes one of the best choices and can outperform solutions found by a single-objective optimization algorithm. Without resorting to a many-objective optimization to find multiple trade-off solutions, identification of such a high trade-off solution is very challenging.

5.3 Ablation Study

5.3.1 Accuracy Predictor Performance

In this subsection we assess the effectiveness of different accuracy predictor models. We uniformly sampled 350 architectures from our search space and trained them using SGD for 150 epochs on ImageNet. Each one of them is fine-tuned for 50 epochs on the other ten datasets (Table 5.2). From the 350 pairs of architectures and top-1 accuracy computed on each dataset, we reserved a subset (randomly chosen) of 50 pairs for testing, and the remaining 300 pairs are then available for training the predictor models.

Figure 5.15: Top row: Spearman rank correlation between predicted accuracy and true accuracy of different surrogate models across many datasets. Each accuracy predictor is constructed from 250 samples (trained architectures). Error bars show mean and standard deviation over ten runs. Bottom row: Goodness-of-fit visualization of the RBF ensemble, the best accuracy predictor.

Figure 5.4 compares the mean (over 11 datasets) Spearman rank correlation between the predicted and true accuracy for each accuracy predictor as the training sample size is varied from 50 to 300. Empirically, we observe that the radial basis function (RBF) has a higher Spearman rank correlation compared to the other three models. And, the proposed RBF ensemble model further improves performance over the standalone RBF model across all training sample size regimes. Figure 5.15 shows a visualization of the comparative performance of predictor models on different datasets. From the perspective of minimizing the number of training examples (which reduces the overall computational cost) and maximizing Spearman rank correlation in prediction (which improves the accuracy in ranking architectures during search), we chose the RBF ensemble as our accuracy predictor model and a training size of 100 for all our experiments.
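To illustrate how such predictors are evaluated, here is a small, self-contained Python sketch that fits a bootstrap ensemble of RBF interpolators on synthetic (encoding, accuracy) pairs and reports the Spearman rank correlation on a held-out set. The synthetic data, ensemble construction, and hyperparameters are illustrative assumptions, not the dissertation's exact predictor.

import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder data standing in for (architecture encoding, measured top-1 accuracy)
# pairs; in practice these come from the sampled and trained architectures above.
X = rng.random((300, 22))
y = 75 + 5 * X[:, :4].sum(1) + rng.normal(0, 0.3, 300)   # synthetic accuracies (%)

X_trn, y_trn, X_tst, y_tst = X[:250], y[:250], X[250:], y[250:]

# A small bootstrap ensemble of RBF interpolators, as a stand-in for the RBF
# ensemble predictor; the exact construction in the dissertation may differ.
member_preds = []
for _ in range(30):
    idx = rng.choice(len(X_trn), size=100, replace=False)
    rbf = RBFInterpolator(X_trn[idx], y_trn[idx], smoothing=1e-3)
    member_preds.append(rbf(X_tst))
pred = np.mean(member_preds, axis=0)

rho, _ = spearmanr(pred, y_tst)
print(f"Spearman rank correlation on 50 held-out architectures: {rho:.3f}")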
Table 5.5: Comparing the relative search cost of NAT to other methods. "-" denotes not applicable, "WS" stands for weight sharing and "SMBO" stands for sequential model-based optimization [3]. † is taken from [4], ‡ is an estimate based on the # of models evaluated during search (20K in [5], 1.2K in [2], 27K in [6]). The re-ranking stage refers to the top 100-250 models undergoing extended training and evaluation for 300 epochs before selecting the model.

                                         Top-1     #MAdds   Estimated Search Cost (GPU hours)
Method             Type                  Acc. (%)  (M)      Prior-search   During-search   Post-search   Total
ImageNet
MnasNet [20]       gradient              75.2      312      -              -               -             91k†
OnceForAll [124]   WS+EA                 76.9      230      1,200          40              75            1.3k
NAT (ours)         WS+EA                 77.5      225      1,200          150             75            1.4k
CIFAR-10
NASNet [5]         RL                    97.4      569      -              10,000‡         6000          16k
PNASNet [2]        SMBO                  96.6      588      -              600‡            36            0.6k
DARTS [38]         WS+gradient           97.3      595      -              96              36            0.1k
AmoebaNet [6]      EA                    97.5      555      -              13,500‡         2400          16k
NAT (ours)         transfer+EA           98.4      468      -              24              -             0.1k

5.3.2 Search Efficiency

The overall computation cost consumed by a NAS algorithm can be factored into three phases: (1) Prior-search: cost incurred prior to the architecture search, e.g. training the supernet in case of one-shot approaches [122, 124] or constructing the accuracy predictor [121], etc.; (2) During-search: cost associated with measuring the performance of candidate architectures sampled during search through inference. It also includes the cost of training the supernet in case it is coupled with the search, like in [38] and NAT; (3) Post-search: cost associated with choosing a final architecture, and/or fine-tuning or re-training the final architectures from scratch. For comparison, we select representative NAS algorithms, including those based on reinforcement learning (RL), gradients, evolutionary algorithms (EA), and weight sharing (WS). Table 5.5 shows results for ImageNet and CIFAR-10. The former is the dataset on which the supernet is trained and the latter is a proxy for transfer learning to a non-standard dataset. NAT consistently achieves better performance, both in terms of top-1 accuracy and model efficiency (e.g. #MAdds), compared to the baselines, while the computational cost is similar or lower. The primary computational cost of NAT is the prior-search training of the supernet for 1,200 hours. We emphasize, again, that it is a one-time cost that is amortized across all subsequent deployment scenarios (e.g. 10 additional datasets in Section 5.2.4).

5.3.3 Analysis of Crossover

Figure 5.16: Ablation study on the crossover operator: (a) Effect of crossover: the median performance from eleven runs of our evolutionary algorithm with and without the crossover operator. (b) Effect of crossover probability: the median performance deteriorates as the crossover probability reduces from 0.9 to 0.2.

Crossover is a standard operator in evolutionary algorithms, but has largely been avoided by existing EA-based NAS methods [6, 43, 44]. But as we demonstrate here, a carefully designed crossover operation can significantly improve search efficiency. We run the evolutionary search of NAT with and without the crossover operator on four datasets: ImageNet [1], CIFAR-10 [34], Oxford Flowers102 [129], and Stanford Cars [135]. The hyperparameter settings that we compare are:

1. w/ crx: crossover probability of 0.9; mutation probability of 0.1; mutation index eta_m of 3.
2. w/o crx: crossover probability of 0.0; mutation probability of 0.2; mutation index eta_m of 3.

We double the mutation probability when crossover is not used to compensate for the reduced exploration ability of the search. On each dataset, we run each setting to maximize the top-1 accuracy 11 times and report the median performance as a function of the number of architectures sampled in Fig. 5.16(a). On all four datasets, the crossover operator significantly improves the efficiency of the evolutionary search algorithm. To further validate, we sweep over the probability of crossover while maintaining the rest of the settings. The median performance (over 11 runs) deteriorates as the crossover probability is reduced from 0.9 to 0.2 (see Fig. 5.16(b)).
5.3.4 Analysis of Mutation Hyperparameters

The mutation operator used in NAT is controlled via two hyperparameters, namely, the mutation probability p_m and mutation index eta_m. To identify the optimal hyperparameter values, we conduct the following parameter sweep experiments. Setting the rest of the hyperparameters to their default values (see Table 5.1), we sweep the value of p_m from 0.1 to 0.8, and eta_m from 1.0 to 20. And for each setting, we run NAT eleven times on four datasets (same as the crossover experiment) to maximize the top-1 accuracy. Figures 5.17(a) and 5.17(b) show the effect of mutation probability p_m and mutation index eta_m, respectively. We observe that increasing the mutation probability has an adverse effect on performance. Similarly, low values of eta_m, which encourage the mutated offspring to be further away from the parent architectures, improve the performance. Based on these observations, we set the mutation probability p_m and mutation index eta_m parameters to 0.1 and 1.0, respectively, for all our experiments in Section 5.2.

Figure 5.17: Hyperparameter study on (a) mutation probability p_m and (b) mutation index parameter eta_m. For each study, we run NAT eleven times on four datasets to maximize top-1 accuracy and report the median performance.

5.3.5 Effectiveness of Supernet Adaptation

Figure 5.18: Comparing the performance of adapting supernet, adapting subnet and additional fine-tuning under a bi-objective search setup on four datasets. Details are provided in Section 5.3.5.

Table 5.6: Effect of fine-tuning for 75 epochs on ImageNet.

                   NAT-M1    NAT-M2    NAT-M3    NAT-M4
w/o fine-tuning    75.9      77.4      78.9      79.4
w/ fine-tuning     77.5      78.6      79.9      80.5
                   (+1.6)    (+1.2)    (+1.0)    (+1.1)

Recall that NAT adopts any supernet trained on a large-scale dataset, e.g. ImageNet, and seeks to efficiently transfer it to a task-specific supernet on a given dataset. Here we compare this procedure to a more conventional approach of adapting every subnet (candidate architecture in the search) directly. Specifically, we consider the following:

1. Supernet Adaptation: fine-tune the supernet for 5 epochs in each iteration and use the accuracy from inherited weights (without further training) to select architectures during search.
2. Subnet Adaptation: fine-tune each subnet for 5 epochs from the inherited weights, then measure the accuracy.

We apply these two approaches to a bi-objective search of maximizing top-1 accuracy and minimizing #MAdds on four datasets, including CIFAR-10, CIFAR-100, Oxford Flowers102, and STL-10. Figure 5.18 compares the Pareto fronts. Adapting the supernet yields significantly better performance than adapting individual subnets. Furthermore, we select a subset of searched subnets after subnet adaptation and fine-tune their weights for an additional 150 epochs. We refer to this as additional fine-tuning in Fig. 5.18. Empirically, we observe that further fine-tuning can match the performance of supernet adaptation on datasets with larger training samples per class (e.g. 4,000 in CIFAR-10). On datasets with fewer samples per class (e.g. 20 in Flowers102), there is still a large performance gap between supernet adaptation and additional fine-tuning. Overall the results suggest that supernet adaptation is more effective on tasks with limited training samples. Finally, we report the effect of fine-tuning the trade-off architectures on ImageNet in Section 5.2.3 for an additional 75 epochs in Table 5.6.
5.4 Conclusion

This section considered the problem of designing custom neural network architectures that trade off multiple objectives for a given image classification task. We introduced Neural Architecture Transfer (NAT), a practical and effective approach for this purpose. We described our approach to harness the concept of a supernet and an evolutionary search algorithm for designing task-specific neural networks trading off accuracy and computational complexity. We also showed how to use an online regressor as a surrogate model to predict the accuracy of subnets in the supernet. Experimental evaluation on eleven benchmark image classification datasets, ranging from large-scale multi-class to small-scale fine-grained tasks, showed that networks obtained by NAT outperform conventional fine-tuning based transfer learning while being orders of magnitude more efficient under mobile settings (<= 600M Multiply-Adds). NAT was especially effective for small-scale fine-grained tasks where fine-tuning pre-trained ImageNet models is ineffective. Finally, we also demonstrated the utility of NAT in optimizing up to twelve objectives with a subsequent trade-off analysis procedure for identifying a single preferred solution. Overall, NAT is the first large scale demonstration of many-objective neural architecture search for designing custom task-specific models on diverse image datasets.

Chapter 6

Applications of Evolutionary Neural Architecture Search

6.1 Medical Imaging

The ChestX-Ray14 benchmark was recently introduced in [23]. The dataset contains 112,120 high resolution frontal-view chest X-ray images from 30,805 patients, and each image is labeled with one or multiple common thorax diseases, or "Normal" otherwise. Visualization of example X-ray images from the database is provided in Fig. 6.1. The dataset is publicly available from NIH at https://nihcc.app.box.com/v/ChestXray-NIHCC. We follow the train_val_list.txt and test_list.txt provided along with the X-ray images to split the database for training, validation and testing.

To cope with the multi-label nature of the dataset, we use a multitask learning setup in which each disease is treated as an individual binary classification problem. We define a 14-dimensional label vector of binary values indicating the presence of one or more diseases, resulting in a regression loss as opposed to the standard cross-entropy used for single-label problems. Past approaches [23, 145, 146] typically extend from existing architectures, and the current state-of-the-art method [146] uses a variant of the DenseNet [31] architecture, which is designed manually by human experts.

The search is set up along the same lines as that of CIFAR-100 described in Chapter 3, except that the classification accuracy objective is replaced by the average area under the ROC curve (AUROC) metric to guide the selection of architectures. We split the dataset into three parts with 70% for training, 10% for validation, and the remaining 20% for testing. All images are pre-processed to 224 x 224 pixels using ImageNet data augmentation techniques (e.g., normalization, random horizontal flipping, etc.).

Figure 6.1: Visualization of the ChestX-Ray14 dataset. Examples showing eight common thoracic diseases are from [23].

Once the non-dominated architectures are obtained after the convergence of evolution, we select the architecture that maximizes the ratio between the gain in AUROC over the increase in FLOPs. We call this architecture NSGANet-X, and we re-train the weights thoroughly from scratch with an extended number of epochs. The learning rate is gradually reduced when the AUROC on the validation set plateaus.

Table 6.1 compares the performance of NSGANet-X with peer methods that are extended from existing manually designed architectures. This includes architectures used by the authors who originally introduced the ChestX-Ray14 dataset [23], and CheXNet [146], which is the current state-of-the-art on this dataset. We also include results from commercial AutoML systems, i.e. Google AutoML [147] and LEAF [148], as comparisons with NAS based methods. The setup details of these two AutoML systems are available in [148].
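Before turning to the results in Table 6.1, the multi-label setup described above can be sketched as follows in PyTorch; the placeholder backbone, the exact form of the regression loss, and the toy labels are our assumptions for illustration.

import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

NUM_DISEASES = 14  # one binary indicator per thorax disease ("Normal" = all zeros)

# Any backbone with a 14-way output head can be plugged in here; this tiny linear
# model is only a placeholder so the snippet runs end-to-end.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, NUM_DISEASES))

criterion = nn.MSELoss()   # regression-style loss over the multi-hot target
# criterion = nn.BCELoss() # a common alternative for multi-label problems

images = torch.rand(4, 3, 224, 224)
targets = torch.zeros(4, NUM_DISEASES)
targets[0, 0::2] = 1.0   # toy multi-hot labels; each disease is present in some
targets[1, 1::2] = 1.0   # samples and absent in others so that the per-disease
targets[2, :7] = 1.0     # AUROC below is well defined
targets[3, 7:] = 1.0

logits = model(images)
loss = criterion(torch.sigmoid(logits), targets)
loss.backward()

# Architecture selection during search is guided by the average per-disease AUROC.
with torch.no_grad():
    scores = torch.sigmoid(model(images)).numpy()
print(float(loss), roc_auc_score(targets.numpy(), scores, average="macro"))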
Table 6.1: AUROC on the ChestX-Ray14 testing set.

Method                         Type      #Params   Test AUROC (%)
Wang et al. (2017) [23]        manual    -         73.8
Yao et al. (2017) [145]        manual    -         79.8
CheXNet (2017) [146]           manual    7.0M      84.4
Google AutoML (2018) [147]     RL        -         79.7
LEAF (2019) [148]              EA        -         84.3
NSGANet-A3                     EA        5.0M      84.7
NSGANet-X                      EA        2.2M      84.6

Note: the Google AutoML result is taken from [148]; NSGANet-A3 represents results under the standard transfer learning setup.

Noticeably, the performance of NSGANet-X exceeds Google AutoML's by a large margin of nearly 4 AUROC points. In addition, NSGANet-X matches the state-of-the-art results from the human engineered CheXNet, while using 3.2x fewer parameters. For completeness, we also include the result from NSGANet-A3, which is evolved on CIFAR-100, to demonstrate the transfer learning capabilities of NSGANet.

Figure 6.2: NSGANet-X multi-label performance on ChestX-Ray14 (a) and the class-wise mean test AUROC comparison with peer methods (b).

More detailed results showing the disease-wise ROC curve of NSGANet-X and the disease-wise AUROC comparison with other peer methods are provided in Figs. 6.2(a) and 6.2(b), respectively.

Figure 6.3: Examples of class activation maps [24] of NSGANet-X, highlighting the class-specific discriminative regions (Atelectasis, Cardiomegaly, Pneumonia, Pneumothorax). The ground truth bounding boxes are plotted over the heatmaps.

To understand the pattern behind the disease decisions of NSGANet-X, we visualize the class activation map (CAM) [24], which is commonly adopted for localizing the discriminative regions for image classification. In the examples shown in Fig. 6.3, stronger CAM areas are covered with warmer colors. We also outline the bounding boxes provided by the ChestX-Ray14 dataset [23] as references.

These results further validate the ability of our proposed algorithm to generate task-dependent architectures automatically. Conventional approaches, e.g., transfer learning from existing architectures, can be effective in yielding similar performance; however, as demonstrated by NSGANet, simultaneously considering complexity along with performance in an algorithmic fashion allows architectures to be practically deployed in resource-constrained environments.

6.2 Against Adversarial Attack

Based on our analysis in Section V-B, years of architectural advancements have translated to minuscule improvements in robustness against adversarial examples. A simple one-iteration attack strategy like FGSM [25] is enough to construct examples that turn many modern DNNs to random guessing. In this section, we make an effort to improve adversarial robustness from the architectural perspective. The search space used in the main paper searches over both layer operations and layer connections (see Section III-A). To isolate the effect of these two aspects on adversarial robustness, we fix the layer operation to the basic residual module [11] and search over the connections among these modules to improve both accuracy on clean images and robustness against adversarial examples.
Designing a measure/objective for robustness against adversarial attacks is an area of active research (e.g., [149]). For our purposes, we present a possible measure here, illustrated in Fig. 6.4. Using the FGSM attack presented by [25], this robustness objective progressively increases the noise produced by FGSM. The epsilon axis in Fig. 6.4 refers to the hyper-parameter in the FGSM equation,

x' = x + epsilon * sign(grad_x L(x, y_true)),

where x is the original image, x' is the adversarial image, y_true is the true class label, and L is the cross-entropy loss. Therefore, for this experiment, we seek to maximize two objectives, namely, classification accuracy and the robustness objective above.

Figure 6.4: Robustness Objective: We define a robustness objective under the FGSM [25] attack as follows: 1) obtain classification accuracy on adversarial images generated by FGSM as we vary epsilon; 2) compute the area under the curve (blue line), approximated by the area of the green region; 3) normalize the robustness value to the rectangular area formed by the Ideal point and Nadir point, where the ideal point is at 100% classification accuracy at the maximum epsilon value, and the nadir point is defined as the accuracy of random guessing at epsilon = 0 (clean images).

The setup for the robustness experiment is as follows. For training we use 40,000 CIFAR-10 images from the official CIFAR-10 training data, 10,000 of which are reserved for validation. Each network is encoded with three blocks using the macro space encoding from our previous work [15]. In each phase a maximum of six nodes may be active, where the computation at each node is a 3x3 convolution followed by ReLU and batch normalization. Each network is trained for 20 epochs with SGD on a cosine annealed learning rate schedule. The epsilon values used in the FGSM robustness calculation are [0.0, 0.01, 0.03, 0.05, 0.07, 0.1, 0.15]. As before, NSGANet initiates the search with 40 randomly created network architectures, and 40 new network architectures are created at each generation (iteration) via genetic operations (see the main paper for details). The search is terminated at 30 generations.

Figure 6.5: Trade-off frontier of the robustness experiments. Color indicates the generation (iteration) at which a network architecture is eliminated from the surviving parent population. The size of each point is proportional to the network architecture's number of trainable parameters. We note that networks from latter generations form the Pareto front (dark blue points).

Empirically, we observe a clear trade-off between accuracy and robustness, as shown in Fig. 6.5. Visualizations of the non-dominated architectures are provided in Fig. 6.6(c). In our opinion, NSGANet is useful in capturing patterns that differentiate architectures that are good for competing objectives. We find that the "wide" networks (like ResNeXt [112] or Inception blocks [30]) appear to provide good accuracy on standard benchmark images, but are fragile to the FGSM attack. On the other hand, "deep" networks (akin to ResNet [11] or VGG [29]) are more robust to the FGSM attack, while having less accuracy. This phenomenon is illustrated with examples in Figs. 6.6(a) and 6.6(b), respectively. Furthermore, the skip connection that skips the entire block's computation appears to be critical in obtaining a network that is robust to adversarial attacks; see Figs. 6.7(a) and 6.7(b).

Figure 6.6: (a) Examples of the computational blocks discovered with high classification accuracy. For these networks, the mean accuracy and robustness objectives are 0.8543 and 0.0535, respectively; (b) Examples of the computational blocks discovered with high robustness against the FGSM attack; the mean accuracy and robustness objectives are 0.8415 and 0.1036, respectively; (c) Examples of the computational blocks discovered along the Pareto front that provide an efficient trade-off between classification accuracy and adversarial robustness. They are arranged in the order of descending accuracy and ascending robustness.
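For completeness, here is a minimal PyTorch sketch of the FGSM attack and the robustness objective used in this section; the helper names, the tiny stand-in model, and the exact normalization of the area (measured above the random-guess level) are our assumptions based on Fig. 6.4.

import numpy as np
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """FGSM adversarial examples: x' = x + eps * sign(grad_x L(x, y_true))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

def robustness_objective(model, x, y,
                         eps_list=(0.0, 0.01, 0.03, 0.05, 0.07, 0.1, 0.15),
                         random_guess=0.1):
    """Area under the adversarial accuracy-vs-epsilon curve, normalized by the
    rectangle spanned by the ideal point (100% accuracy at the largest epsilon)
    and the nadir point (random-guess accuracy at epsilon = 0)."""
    eps = np.array(eps_list)
    accs = np.array([accuracy(model, x if e == 0.0 else fgsm(model, x, y, e), y)
                     for e in eps])
    height = np.clip(accs - random_guess, 0.0, None)        # accuracy above random guess
    area = float(np.sum(0.5 * (height[1:] + height[:-1]) * np.diff(eps)))
    return area / ((eps[-1] - eps[0]) * (1.0 - random_guess))

# Example with a tiny stand-in classifier on random CIFAR-10-shaped data.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(robustness_objective(model, x, y))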
Figure 6.7: Parallel coordinate plots of the 1,200 network architectures sampled by NSGANet. Each line represents a network architecture, and each vertical line is an attribute associated with the network. (a) Networks that have the skip connection bit inactive; we can see that none of them have a good measurement on robustness against adversarial attacks. (b) Networks that have the skip connection bit active. This skip connection bit refers to the connection that goes past all computation within a phase, as a normal residual connection would. When the skip connection is active, the networks cover the full range of adversarial robustness.

6.3 Multi-view Car Alignment

In addition to object classification, dense image prediction (e.g. object alignment, human body pose estimation, semantic segmentation, etc.) is another class of problems that is of great importance to computer vision. Dense image prediction assigns a class label to each pixel in the query images, as opposed to one label for the entire image in the case of classification. In this section, we apply NSGANet to the problem of multi-view car key-point alignment. We use the CMU-Car dataset originally introduced in [7]. The dataset contains around 10,000 car images in different orientations, environments, and occlusion situations. In this case, we search for the path of image resolution changes, similar to [98]. The node-level structure is kept fixed using the basic residual unit [11]. The performance of architectures in this case is calculated using the root mean square (RMS) error between the predicted heatmap and the ground truth for each key-point; more details are available in [7]. We use FLOPs as the second objective for architecture complexity measurement. The obtained architectures are named NSGANet-C0 and -C1. The obtained results are provided in Table 6.2 and the visualization of the architectures is provided in Fig. 6.8.

Table 6.2: Preliminary results on the CMU-Car alignment [7]. Notably, our proposed algorithm is able to find architectures with competitive performance while having 2x fewer parameters when compared to the human-designed architecture [8].

Architecture     Params (M)   FLOPs (M)   Regression Error (%)
Hourglass [8]    3.38         3613        7.80
NSGANet-C0       1.53         2584        8.66
NSGANet-C1       1.61         2663        8.64

Figure 6.8: Spatial-resolution-change path of the NSGANet-C0 architecture. Each circle encodes a residual module [11]. Circles colored in white are always executed at the beginning. The arrows and blue circles are parts of the search decisions. The total number of layers (L) is set to be nine.

6.4 Non-learnable Layer

Our post-optimization analysis on the evolved architectures, shown in Chapter 3 under Section 3.2.5, has revealed some interesting findings, one of which is the effectiveness of non-parametric operations, e.g. identity mapping, average/max pooling, etc., in efficiently trading performance for architectural complexity. To further validate this observation, we consider an expanded range of operations including both non-parametric and weight-free operations, which we name non-learnable operations in this paper. We manually construct such layers by concatenating multiple non-learnable operations in parallel. The obtained results are shown in Figure 6.9.

Our preliminary results on manual construction of non-learnable layers are very encouraging. In addition to the comparable performance to regular fully learned layers, non-learnable layers offer unique advantages in terms of re-usable weights for multi-tasking network architectures, as the weights are task-agnostic (not specifically learned on a particular task). We believe designing a dedicated search algorithm to shape the construction of these non-learnable layers is a promising direction for NAS towards the automatic design of multi-tasking architectures.

Figure 6.9: Preliminary experiment on constructing DNN architectures using layers with non-learnable weights. Each layer is composed of several non-learnable operations in parallel. We manually constructed a handful of such layers and evaluated them on CIFAR-10 (a).
An example based on the trade-off between accuracy and the number of parameters is shown in (b). We validate the effectiveness of non-learnable layers by replacing the original 3x3 convolutions in different ResNet models with the chosen non-learnable layers on both CIFAR-100 and ImageNet (c). Evidently, layers with non-learnable weights are capable of yielding competitive performance while being computationally more efficient as opposed to conventional learnable convolutions.

Chapter 7

Conclusion

In this section, we list the different contributions of this dissertation, followed by the future directions and concluding remarks.

7.1 Contributions

Multi-objective: In this dissertation, we show that through an efficient evolutionary framework with highly customized operators (i.e. crossover, mutation, distribution estimation), a set of architectures that trade off multiple competing objectives can be obtained in a single run efficiently. This allows designers to choose a suitable network a-posteriori, as opposed to specifying a preference weighting of each objective prior to the search. Further post-optimal analysis of the set of non-dominated architectures oftentimes reveals valuable design principles, which stands as another benefit of posing NAS as a multi-objective optimization problem, as in the case of this dissertation.

Surrogate Modeling: In this dissertation, we show an alternative approach to solve the nested bi-level NAS problem, i.e., simultaneously optimizing the architecture and learning the optimal model weights. However, instead of gradient based relaxations, we advocate for surrogate models. At the upper level, we construct multiple predictor models and then adaptively select one predictor model for predicting an architecture's performance from its encoding. This process itself reduces the overall search cost by a factor of 45x compared to previous approaches. At the lower level, we deploy a supernet that allows the training of an architecture to start from a much more advanced stage as compared to a random initialization, leading to a 30x reduction in training cost (i.e. the required training epochs reduce from 150 to 5).

Scalability via Transfer Learning: Most existing NAS approaches require one complete search for each deployment specification of hardware or objective. This is a computationally impractical endeavor given the potentially large number of application scenarios. We use an online search-guided approach, as opposed to conventional transfer learning, to rapidly adapt a supernet that is learned on upstream tasks (e.g. ImageNet) to a new task at hand, including both different types of datasets and the required computational constraints. Such an approach not only allows an architecture search to be efficiently executed, but also ensures the performance is beyond the current best practice, i.e. transfer learning.

Competitive and Comprehensive Results: This dissertation demonstrates the scalability and practicality of the proposed approach on many datasets corresponding to diverse scenarios: large-scale multi-class (ImageNet, CINIC-10 [127]), medium-scale multi-class (CIFAR-10, CIFAR-100), small-scale multi-class (STL-10 [128]), large-scale fine-grained (Food-101 [134]), medium-scale fine-grained (Stanford Cars [135], FGVC Aircraft [136]) and small-scale fine-grained (DTD [137], Oxford-IIIT Pets [138], Oxford Flowers102 [129]) datasets. We further demonstrate the scalability and utility of the proposed approach across many objectives. Optimizing accuracy, model size and one of MAdds, CPU or GPU latency, the obtained architectures dominate MobileNet-v3 [21] across all objectives. We also consider a 12-objective problem of finding a common architecture across eleven datasets while minimizing MAdds. The best trade-off architecture dominates all baseline models across these datasets under mobile settings.

Real-world Applications: Towards the end of this dissertation, we apply our proposed algorithm to various practical applications, including medical imaging for classifying common thorax diseases, improving robustness against adversarial attacks, dense image prediction for multi-view car key-point alignment, and custom layer design with no trainable parameters.
7.2 Discussion and Future Work

Despite the steady stream of empirical improvements on many datasets and applications shown in this dissertation, there are still tremendous scopes and opportunities left to be explored. In the remainder of this section, we provide a brief discussion on some of the promising future extensions and directions.

Higher-level Tasks: This dissertation mainly focuses on object recognition, which is the fundamental task in computer vision. In addition, there exist many other more challenging tasks including object detection, semantic segmentation, panoptic segmentation, human pose analysis, alignment, and image generation, to name a few. All of these aforementioned tasks still rely heavily on human engineering innovations to progress the architectural design. Adapting the proposed algorithm in this dissertation to automate the design of network architectures for these tasks will be important and impactful for the computer vision community.

Search Space: Designing a search space (possibly with transformer architectures) may help bridge the gap between sequential/tabular data (natural language processing) and pixel data (computer vision), both in terms of research innovations and in the way information is reasoned about. There is significant interest in tasks that involve simultaneously handling visual and textual inputs, which is essentially required to achieve true autonomy in machine learning. These bimodal tasks have been historically known to be difficult with conventional architectures. NAS with such a search space could be a promising avenue to tackle this challenging but important problem.

Search based DNN Training: Gradient-based weight learning through back-propagation has been the cornerstone of deep learning. Its polynomial time nature with respect to the network parameters (along with advanced computing hardware) has made DNN training not only possible but also (relatively) efficient. However, it has also led to several compromises, including the choice of layers being simple or at least differentiable, single or scalarized objectives in training, and heavy reliance on initialization (He initialization [11]) and normalization (BatchNorm), to name a few. Freeing DNN training from gradients has great potential in discovering more novel architectures with complex operations, and can be easily extended to handle multiple objectives.

7.3 Concluding Remarks

The work in this dissertation demonstrated the potential of evolutionary computation in automating the design of neural network solutions to many ML applications. The topology and the weights of the architecture can be optimized simultaneously to meet the requirements of the task, resulting in superior performance under multiple objectives. This dissertation primarily showed that the computer vision domain (object recognition in particular) can greatly benefit from evolutionary neural architecture search; many more domains remain where the work in this dissertation can be applied to improve the user experience or productivity.

BIBLIOGRAPHY

[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.

[2] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in European Conference on Computer Vision (ECCV), 2018.

[3] F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Sequential model-based optimization for general algorithm configuration," in International Conference on Learning and Intelligent Optimization. Springer, 2011, pp. 507-523.

[4] A. Wan, X. Dai, P. Zhang, Z. He, Y. Tian, S. Xie, B. Wu, M. Yu, T. Xu, K. Chen et al., "FBNetV2: Differentiable neural architecture search for spatial and channel dimensions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[5] B.Zoph,V.Vasudevan,J.Shlens,andQ.V.Le,\Learningtransferablearchitectures forscalableimagerecognition,"in IEEEConferenceonComputerVisionandPattern Recognition(CVPR) ,2018. [6] E.Real,A.Aggarwal,Y.Huang,andQ.V.Le,\Regularizedevolutionforimage architecturesearch,"in AAAIConferenceonAIntelligence ,2019. [7] V.NareshBoddeti,T.Kanade,andB.VijayaKumar,\Correlationersforob- jectalignment,"in IEEEConferenceonComputerVisionandPatternRecognition (CVPR) ,2013. [8] A.Newell,K.Yang,andJ.Deng,\Stackedhourglassnetworksforhumanposeesti- mation,"in EuropeanConferenceonComputerVision(ECCV) ,2016. [9] F.VanVeenandS.Leijnen,\Theneuralnetworkzoo,"https://www.asimovinstitute. org/neural-network-zoo,accessed:2020-05-17. [10] A.Sinha,P.Malo,andK.Deb,\Areviewonbileveloptimization:Fromclassical toevolutionaryapproachesandapplications," IEEETransactionsonEvolutionary Computation ,vol.22,no.2,pp.276{295,2017. 113 [11] K.He,X.Zhang,S.Ren,andJ.Sun,\Deepresiduallearningforimagerecognition," in IEEEConferenceonComputerVisionandPatternRecognition(CVPR) ,2016. [12] M.Sandler,A.Howard,M.Zhu,A.Zhmoginov,andL.-C.Chen,\Mobilenetv2:In- vertedresidualsandlinearbottlenecks,"in IEEEConferenceonComputerVisionand PatternRecognition(CVPR) ,2018. [13] Z.Zhong,J.Yan,W.Wu,J.Shao,andC.-L.Liu,\Practicalblock-wiseneuralnet- workarchitecturegeneration,"in IEEEConferenceonComputerVisionandPattern Recognition(CVPR) ,2018. [14] E.ZitzlerandL.Thiele,\Multiobjectiveoptimizationusingevolutionaryalgorithms |acomparativecasestudy,"in ParallelProblemSolvingfromNature|PPSNV , A.E.Eiben,T.ack,M.Schoenauer,andH.-P.Schwefel,Eds.Berlin,Heidelberg: SpringerBerlinHeidelberg,1998,pp.292{301. [15] Z.Lu,I.Whalen,V.Boddeti,Y.Dhebar,K.Deb,E.Goodman,andW.Banzhaf, \Nsga-net:Neuralarchitecturesearchusingmulti-objectivegeneticalgorithm,"in GeneticandEvolutionaryComputationConference(GECCO) ,2019. [16] N.Nayman,A.Noy,T.Ridnik,I.Friedman,R.Jin,andL.Zelnik,\Xnas:Neural architecturesearchwithexpertadvice,"in AdvancesinNeuralInformationProcessing Systems(NeurIPS) ,2019. [17] M.TanandQ.V.Le,tnet:Rethinkingmodelscalingforconvolutionalneural networks,"in ICML ,2019. [18] D.Berthelot,N.Carlini,I.Goodfellow,N.Papernot,A.Oliver,andC.A. \Mixmatch:Aholisticapproachtosemi-supervisedlearning,"in AdvancesinNeural InformationProcessingSystems(NeurIPS) ,2019. [19] X.Wang,D.Kihara,J.Luo,andG.-J.Qi,\Enaet:Self-trainedensembleautoencoding transformationsforsemi-supervisedlearning," arXivpreprintarXiv:1911.09265 ,2019. [20] M.Tan,B.Chen,R.Pang,V.Vasudevan,M.Sandler,A.Howard,andQ.V.Le, \Mnasnet:Platform-awareneuralarchitecturesearchformobile,"in IEEEConference onComputerVisionandPatternRecognition(CVPR) ,2019. [21] A.Howard,M.Sandler,G.Chu,L.-C.Chen,B.Chen,M.Tan,W.Wang,Y.Zhu, R.Pang,V.Vasudevan,Q.V.Le,andH.Adam,\Searchingformobilenetv3,"in InternationalConferenceonComputerVision(ICCV) ,2019. 114 [22] Z.Lu,K.Deb,andV.N.Boddeti,\MUXConv:Informationmultiplexingincon- volutionalneuralnetworks,"in IEEEConferenceonComputerVisionandPattern Recognition(CVPR) ,2020. [23] X.Wang,Y.Peng,L.Lu,Z.Lu,M.Bagheri,andR.M.Summers,\Chestx-ray8: Hospital-scalechestx-raydatabaseandbenchmarksonweakly-supervised andlocalizationofcommonthoraxdiseases,"in IEEEConferenceonComputerVision andPatternRecognition(CVPR) ,2017. [24] B.Zhou,A.Khosla,A.Lapedriza,A.Oliva,andA.Torralba,\Learningdeepfeatures fordiscriminativelocalization,"in IEEEConferenceonComputerVisionandPattern Recognition(CVPR) ,2016. 
[25] I.J.Goodfellow,J.Shlens,andC.Szegedy,\Explainingandharnessingadversarial examples," arXivpreprintarXiv:1412.6572 ,2014. [26] A.Krizhevsky,I.Sutskever,andG.E.Hinton,\Imagenetwithdeep convolutionalneuralnetworks,"in AdvancesinNeuralInformationProcessingSystems (NeurIPS) ,2012. [27] A.Graves,A.-r.Mohamed,andG.Hinton,\Speechrecognitionwithdeeprecurrent neuralnetworks,"in IEEEInternationalconferenceonacoustics,speechandsignal processing(ICASSP) ,2013. [28] R.CollobertandJ.Weston,\Aarchitecturefornaturallanguageprocessing: Deepneuralnetworkswithmultitasklearning,"in InternationalConferenceonMa- chineLearning(ICML) ,2008. [29] K.SimonyanandA.Zisserman,\VeryDeepConvolutionalNetworksforLarge-scale ImageRecognition,"in InternationalConferenceonLearningRepresentations(ICLR) , 2015. [30] C.Szegedy,W.Liu,Y.Jia,P.Sermanet,S.Reed,D.Anguelov,D.Erhan,V.Van- houcke,andA.Rabinovich,\Goingdeeperwithconvolutions,"in IEEEConference onComputerVisionandPatternRecognition(CVPR) ,2015. [31] G.Huang,Z.Liu,L.VanDerMaaten,andK.Q.Weinberger,\Denselyconnected convolutionalnetworks,"in IEEEConferenceonComputerVisionandPatternRecog- nition(CVPR) ,2017. 115 [32] X.Zhang,X.Zhou,M.Lin,andJ.Sun,\ShAnextremelytconvolu- tionalneuralnetworkformobiledevices,"in IEEEConferenceonComputerVision andPatternRecognition(CVPR) ,2018. [33] R.Hecht-Nielsen,\Theoryofthebackpropagationneuralnetwork,"in Neuralnetworks forperception .Elsevier,1992,pp.65{93. [34] A.Krizhevsky,G.Hinton etal. ,\Learningmultiplelayersoffeaturesfromtinyimages," Citeseer,Tech.Rep.,2009. [35] K.DebandC.Myburgh,\Breakingthebillion-variablebarrierinreal-worldoptimiza- tionusingacustomizedevolutionaryalgorithm,"in ProceedingsoftheGeneticand EvolutionaryComputationConference2016 ,2016,pp.653{660. [36] B.ZophandQ.V.Le,\Neuralarchitecturesearchwithreinforcementlearning,"in InternationalConferenceonLearningRepresentations(ICLR) ,2017. [37] B.Baker,O.Gupta,N.Naik,andR.Raskar,\Designingneuralnetworkarchitectures usingreinforcementlearning,"in InternationalConferenceonLearningRepresenta- tions(ICLR) ,2017. [38] H.Liu,K.Simonyan,andY.Yang,\DARTS:tiablearchitecturesearch,"in InternationalConferenceonLearningRepresentations(ICLR) ,2019. [39] R.Luo,F.Tian,T.Qin,E.Chen,andT.-Y.Liu,\Neuralarchitectureoptimization," in AdvancesinNeuralInformationProcessingSystems(NeurIPS) ,2018. [40] H.Pham,M.Guan,B.Zoph,Q.Le,andJ.Dean,tneuralarchitecturesearch viaparameterssharing,"in InternationalConferenceonMachineLearning(ICML) , 2018. [41] H.Cai,L.Zhu,andS.Han,\ProxylessNAS:Directneuralarchitecturesearchontarget taskandhardware,"in InternationalConferenceonLearningRepresentations(ICLR) , 2019. [42] L.XieandA.Yuille,\Geneticcnn,"in InternationalConferenceonComputerVision (ICCV) ,2017. [43] E.Real,S.Moore,A.Selle,S.Saxena,Y.L.Suematsu,J.Tan,Q.V.Le,andA.Ku- rakin,\Large-scaleevolutionofimagein InternationalConferenceonMa- chineLearning(ICML) ,2017. 116 [44] H.Liu,K.Simonyan,O.Vinyals,C.Fernando,andK.Kavukcuoglu,\Hierarchicalrep- resentationsforcientarchitecturesearch,"in InternationalConferenceonLearning Representations(ICLR) ,2018. [45] A.Chen,J.Kim,S.Lee,andY.Kim,\Stochasticmulti-objectivemodelsfornetwork designproblem," ExpertSystemswithApplications ,vol.37,no.2,pp.1608{1619,2010. [46] W.FanandR.Machemehl,\Bi-leveloptimizationmodelforpublictransportation networkredesignproblem:Accountingforequityissues," TransportationResearch Record:JournaloftheTransportationResearchBoard ,vol.2263,pp.151{162,2011. 
[47] M.A.AmouzegarandK.Moshirvaziri,\Determiningoptimalpollutioncontrolpolicies: Anapplicationofbilevelprogramming," EuropeanJournalofOperationalResearch , vol.119,no.1,pp.100{120,1999. [48] P.A.ClarkandA.W.Westerberg,\Bilevelprogrammingforsteady-statechemical processdesign-i.fundamentalsandalgorithms," ComputersandChemicalEngineering , vol.14,no.1,pp.87{97,1990. [49] A.U.RaghunathanandL.T.Biegler,\Mathematicalprogramswithequilibriumcon- straints(mpecs)inprocessengineering," ComputersandChemicalEngineering ,vol.27, no.10,pp.1381{1392,2003. [50] S.Christiansen,M.Patriksson,andL.Wynter,\Stochasticbilevelprogrammingin structuraloptimization," Structuralandmultidisciplinaryoptimization ,vol.21,no.5, pp.361{371,2001. [51] J.Herskovits,A.Leontiev,G.Dias,andG.Santos,\Contactshapeoptimization:a bilevelprogrammingapproach," Structuralandmultidisciplinaryoptimization ,vol.20, no.3,pp.214{221,2000. [52] G.Brown,M.Carlyle,J.on,andK.Wood,\Defendingcriticalinfrastructure," Interfaces ,vol.36,no.6,pp.530{544,2006. [53] E.IsraeliandR.K.Wood,\Shortest-pathnetworkinterdiction," Networks ,vol.40, no.2,pp.97{111,2002. [54] H.Ku˘cukaydin,N.Aras,andI.K.\Competitivefacilitylocationproblem withattractivenessadjustmentofthefollower:Abilevelprogrammingmodelandits solution," EuropeanJournalofOperationalResearch ,vol.208,no.3,pp.206{220, 2011. 117 [55] T.Uno,H.Katagiri,andK.Kato,\Anevolutionarymulti-agentbasedsearchmethod forstackelbergsolutionsofbilevelfacilitylocationproblems," InternationalJournalof InnovativeComputing,InformationandControl ,vol.4,no.5,pp.1033{1042,2008. [56] J.LiangandR.Miikkulainen,\Evolutionarybileveloptimizationforcomplexcontrol tasks,"in GeneticandEvolutionaryComputationConference(GECCO) ,2015. [57] B.Barnhart,Z.Lu,M.Bostian,A.Sinha,K.Deb,L.Kurkalova,M.Jha,andG.Whit- taker,\Handlingpracticalitiesinagriculturalpolicyoptimizationforwaterqualityim- provements,"in GeneticandEvolutionaryComputationConference(GECCO) ,2017. [58] R.Mathieu,L.Pittard,andG.Anandalingam,\Geneticalgorithmbasedapproachto bi-levellinearprogramming," OperationsResearch ,vol.28,no.1,pp.1{21,1994. [59] Y.Yin,\Geneticalgorithmbasedapproachforbilevelprogrammingmodels," Journal ofTransportationEngineering ,vol.126,no.2,pp.115{120,2000. [60] H.LiandY.Wang,\Ahybridgeneticalgorithmforsolvingnonlinearbilevelprogram- mingproblemsbasedonthesimplexmethod," InternationalConferenceonNatural Computation ,vol.4,pp.91{95,2007. [61] X.Zhu,Q.Yu,andX.Wang,\Ahybridtialevolutionalgorithmforsolving nonlinearbilevelprogrammingwithlinearconstraints,"in CognitiveInformatics,2006. ICCI2006.5thIEEEInternationalConferenceon ,vol.1.IEEE,2006,pp.126{131. [62] J.Angelo,E.Krempser,andH.Barbosa,tialevolutionforbilevelprogram- ming,"in Proceedingsofthe2013CongressonEvolutionaryComputation(CEC-2013) . IEEEPress,2013. [63] J.S.AngeloandH.J.C.Barbosa,\Astudyontheuseofheuristicstosolveabilevel programmingproblem," InternationalTransactionsinOperationalResearch ,2015. [64] J.A.Mirrlees,\Thetheoryofmoralhazardandunobservablebehaviour:Parti," The ReviewofEconomicStudies ,vol.66,no.1,pp.3{21,1999. [65] S.R.Hejazi,A.Memariani,G.Jahanshahloo,andM.M.Sepehri,\Linearbilevelpro- grammingsolutionbygeneticalgorithm," Computers&OperationsResearch ,vol.29, no.13,pp.1913{1925,2002. [66] Y.Wang,Y.C.Jiao,andH.Li,\Anevolutionaryalgorithmforsolvingnonlinear bilevelprogrammingbasedonanewconstraint-handlingscheme," IEEETransactions 118 onSystems,Man,andCybernetics,PartC:ApplicationsandReviews ,vol.32,no.2, pp.221{232,2005. 
[67] Y.Wang,H.Li,andC.Dang,\Anewevolutionaryalgorithmforaclassofnonlinear bilevelprogrammingproblemsanditsglobalconvergence," INFORMSJournalon Computing ,vol.23,no.4,pp.618{629,2011. [68] Y.Jiang,X.Li,C.Huang,andX.Wu,\Applicationofparticleswarmoptimization basedonchkssmoothingfunctionforsolvingnonlinearbilevelprogrammingproblem," AppliedMathematicsandComputation ,vol.219,no.9,pp.4332{4339,2013. [69] H.Li,\Ageneticalgorithmusingasearchspaceforsolvingnonlinear/linear fractionalbilevelprogrammingproblems," AnnalsofOperationsResearch ,pp.1{16, 2015. [70] Z.Wan,G.Wang,andB.Sun,\Ahybridintelligentalgorithmbycombiningparti- cleswarmoptimizationwithchaossearchingtechniqueforsolvingnonlinearbilevel programmingproblems," SwarmandEvolutionaryComputation ,vol.8,pp.26{32, 2013. [71] A.Sinha,P.Malo,andK.Deb,\Evolutionaryalgorithmforbileveloptimizationusing approximationsofthelowerleveloptimalsolutionmapping," EuropeanJournalof OperationalResearch ,vol.257,no.2,pp.395{411,2017. [72] K.Deb, Multi-objectiveoptimizationusingevolutionaryalgorithms .Chichester,UK: Wiley,2001. [73] C.Coello,G.Lamont,andD.VanVeldhuizen, EvolutionaryAlgorithmsforSolving Multi-objectiveProblems .Springer,NewYork,2ndedition,2007. [74] K.Deb,A.Pratap,S.Agarwal,andT.Meyarivan,\Afastandelitistmultiobjective geneticalgorithm:Nsga-ii," IEEETransactionsonEvolutionaryComputation ,vol.6, no.2,pp.182{197,2002. [75] E.Zitzler,M.Laumanns,andL.Thiele,\SPEA2:ImprovingtheStrengthPareto EvolutionaryAlgorithm,"in EUROGEN2001.EvolutionaryMethodsforDesign,Op- timizationandControlwithApplicationstoIndustrialProblems ,Athens,Greece,2002, pp.95{100. [76] K.DebandH.Jain,\Anevolutionarymany-objectiveoptimizationalgorithmusing reference-point-basednondominatedsortingapproach,parti:Solvingproblemswith 119 boxconstraints," IEEETransactionsonEvolutionaryComputation ,vol.18,no.4,pp. 577{601,Aug2014. [77] Q.ZhangandH.Li,\Moea/d:Amultiobjectiveevolutionaryalgorithmbasedon decomposition," IEEETransactionsonevolutionarycomputation ,vol.11,pp.712{ 731,2007. [78] H.Ishibuchi,H.Masuda,Y.Tanigaki,andY.Nojima,\Modistancecalculationin generationaldistanceandinvertedgenerationaldistance,"in Internationalconference onevolutionarymulti-criterionoptimization .Springer,2015,pp.110{125. [79] J.Yaochu,\Surrogate-assistedevolutionarycomputation:Recentadvancesandfuture challenges," SwarmandEvolutionaryComputation ,vol.1,no.2,pp.61{70,2011. [80] A.Forrester,A.Sobester,andA.Keane, Engineeringdesignviasurrogatemodelling: Apracticalguide .JohnWiley&Sons,2008. [81] K.Shimoyama,K.Sato,S.Jeong,andS.Obayashi,\Updatingkrigingsurrogatemod- elsbasedonthehypervolumeindicatorinmulti-objectiveoptimization," Journalof MechanicalDesign ,vol.135,no.9,p.094503,2013. [82] A.CassioliandF.Schoen,\Globaloptimizationofexpensiveblackboxproblemswith aknownlowerbound," JournalofGlobalOptimization ,pp.177{190,2013. [83] D.R.Jones,\Ataxonomyofglobaloptimizationmethodsbasedonresponsesurfaces," Journalofglobaloptimization ,vol.21,no.4,pp.345{383,2001. [84] X.Yao,\Evolvingneuralnetworks," ProceedingsoftheIEEE ,vol.87,no.9, pp.1423{1447,1999. [85] K.O.StanleyandR.Miikkulainen,\Evolvingneuralnetworksthroughaugmenting topologies," EvolutionaryComputation ,vol.10,no.2,pp.99{127,2002. [86] G.S.Hornby,\Alps:theage-layeredpopulationstructureforreducingtheprob- lemofprematureconvergence,"in GeneticandEvolutionaryComputationConference (GECCO) ,2006. 
[87] M. Suganuma, S. Shirakawa, and T. Nagao, "A genetic programming approach to designing convolutional neural network architectures," in Genetic and Evolutionary Computation Conference (GECCO), 2017.
[88] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, and M. Zhang, "Surrogate-assisted evolutionary deep learning using an end-to-end random forest-based performance predictor," IEEE Transactions on Evolutionary Computation, 2020.
[89] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, "Completely automated CNN architecture design based on blocks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[90] Y. Sun, B. Xue, M. Zhang, G. G. Yen, and J. Lv, "Automatically designing CNN architectures using the genetic algorithm for image classification," IEEE Transactions on Cybernetics, pp. 1–15, 2020.
[91] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College, Cambridge, UK, May 1989. [Online]. Available: http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf
[92] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing neural network architectures using reinforcement learning," in International Conference on Learning Representations (ICLR), 2017.
[93] C.-H. Hsu, S.-H. Chang, J.-H. Liang, H.-P. Chou, C.-H. Liu, S.-C. Chang, J.-Y. Pan, Y.-T. Chen, W. Wei, and D.-C. Juan, "MONAS: Multi-objective neural architecture search using reinforcement learning," arXiv preprint arXiv:1806.10332, 2018.
[94] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger, "The Penn Treebank: annotating predicate argument structure," in Proceedings of the Workshop on Human Language Technology. Association for Computational Linguistics, 1994, pp. 114–119.
[95] X. Dong and Y. Yang, "Searching for a robust neural architecture in four GPU hours," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[96] S. Xie, H. Zheng, C. Liu, and L. Lin, "SNAS: stochastic neural architecture search," in International Conference on Learning Representations (ICLR), 2019.
[97] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, "FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[98] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei, "Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[99] L. Li and A. Talwalkar, "Random search and reproducibility for neural architecture search," arXiv preprint arXiv:1902.07638, 2019.
[100] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing, "Neural architecture search with Bayesian optimisation and optimal transport," in Advances in Neural Information Processing Systems (NeurIPS), 2018.
[101] Y. Jin and B. Sendhoff, "Pareto-based multiobjective machine learning: An overview and case studies," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 3, pp. 397–415, 2008.
[102] Y.-H. Kim, B. Reddy, S. Yun, and C. Seo, "NEMO: Neuro-evolution with multiobjective optimization of deep neural network for speed and accuracy," in International Conference on Machine Learning (ICML) AutoML Workshop, 2017.
[103] J.-D. Dong, A.-C. Cheng, D.-C. Juan, W. Wei, and M. Sun, "DPP-Net: Device-aware progressive search for Pareto-optimal neural architectures," in European Conference on Computer Vision (ECCV), 2018.
[104] T. Elsken, J. H. Metzen, and F. Hutter, "Efficient multi-objective neural architecture search via Lamarckian evolution," in International Conference on Learning Representations (ICLR), 2019.
[105] T. Wei, C. Wang, Y. Rui, and C. W. Chen, "Network morphism," in International Conference on Machine Learning (ICML), 2016.
[106] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[107] F. Juefei-Xu, V. N. Boddeti, and M. Savvides, "Local binary convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[108] Y. Leung, Y. Gao, and Z.-B. Xu, "Degree of population diversity - a perspective on premature convergence in genetic algorithms and its Markov chain analysis," IEEE Transactions on Neural Networks, vol. 8, no. 5, pp. 1165–1176, 1997.
[109] B. L. Miller, D. E. Goldberg et al., "Genetic algorithms, tournament selection, and the effects of noise," Complex Systems, vol. 9, no. 3, pp. 193–212, 1995.
[110] K. Deb and R. B. Agrawal, "Simulated binary crossover for continuous search space," Complex Systems, vol. 9, no. 2, pp. 115–148, 1995.
[111] M. Pelikan, D. E. Goldberg, and E. Cantú-Paz, "BOA: The Bayesian optimization algorithm," in Genetic and Evolutionary Computation Conference (GECCO), 1999.
[112] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[113] J. Malik, "Technical perspective: What led computer vision to deep learning?" Communications of the ACM, vol. 60, no. 6, pp. 82–83, 2017.
[114] S. Xie, A. Kirillov, R. Girshick, and K. He, "Exploring randomly wired neural networks for image recognition," in International Conference on Computer Vision (ICCV), 2019, pp. 1284–1293.
[115] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," in International Conference on Learning Representations (ICLR), 2017.
[116] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[117] K. Deb and A. Srinivasan, "Innovization: Innovating design principles through optimization," in Genetic and Evolutionary Computation Conference (GECCO), 2006.
[118] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in Advances in Neural Information Processing Systems (NeurIPS) Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[119] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms," arXiv preprint arXiv:1708.07747, 2017.
[120] B. Baker, O. Gupta, R. Raskar, and N. Naik, "Accelerating neural architecture search using performance prediction," arXiv preprint arXiv:1705.10823, 2017.
[121] X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia et al., "ChamNet: Towards efficient network design through platform-aware model adaptation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[122] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, "Understanding and simplifying one-shot architecture search," in International Conference on Machine Learning (ICML), 2018.
[123] A. Brock, T. Lim, J. Ritchie, and N. Weston, "SMASH: One-shot model architecture search through hypernetworks," in International Conference on Learning Representations (ICLR), 2018.
[124] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, "Once-for-all: Train one network and specialize it for efficient deployment," in International Conference on Learning Representations (ICLR), 2020.
[125] M. Tan and Q. V. Le, "MixConv: Mixed depthwise convolutional kernels," in British Machine Vision Conference (BMVC), 2019.
[126] J. Mei, Y. Li, X. Lian, X. Jin, L. Yang, A. Yuille, and J. Yang, "AtomNAS: Fine-grained end-to-end neural architecture search," in International Conference on Learning Representations (ICLR), 2020.
[127] L. N. Darlow, E. J. Crowley, A. Antoniou, and A. J. Storkey, "CINIC-10 is not ImageNet or CIFAR-10," arXiv preprint arXiv:1810.03505, 2018.
[128] A. Coates, A. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
[129] M. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
[130] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[131] I. Das and J. E. Dennis, "Normal-boundary intersection: A new method for generating the Pareto surface in nonlinear multicriteria optimization problems," SIAM Journal on Optimization, vol. 8, no. 3, pp. 631–657, Mar. 1998. [Online]. Available: https://doi.org/10.1137/S1052623496307510
[132] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, "Single path one-shot neural architecture search with uniform sampling," arXiv preprint arXiv:1904.00420, 2019.
[133] X. Chu, B. Zhang, R. Xu, and J. Li, "FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search," arXiv preprint arXiv:1907.01845, 2019.
[134] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 – Mining discriminative components with random forests," in European Conference on Computer Vision (ECCV), 2014.
[135] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, "3D object representations for fine-grained categorization," in IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[136] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi, "Fine-grained visual classification of aircraft," Tech. Rep., 2013.
[137] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, "Describing textures in the wild," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[138] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, "Cats and dogs," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[139] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, "RandAugment: Practical automated data augmentation with a reduced search space," arXiv preprint arXiv:1909.13719, 2019.
[140] Y. Li, X. Jin, J. Mei, X. Lian, L. Yang, C. Xie, Q. Yu, Y. Zhou, S. Bai, and A. Yuille, "Neural architecture search for lightweight non-local networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[141] C. Li, J. Peng, L. Yuan, G. Wang, X. Liang, L. Lin, and X. Chang, "Blockwisely supervised neural architecture search with knowledge distillation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[142] J. Yu, P. Jin, H. Liu, G. Bender, P.-J. Kindermans, M. Tan, T. Huang, X. Song, R. Pang, and Q. Le, "BigNAS: Scaling up neural architecture search with big single-stage models," arXiv preprint arXiv:2003.11142, 2020.
[143] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in AAAI Conference on Artificial Intelligence, 2017.
[144] S. Kornblith, J. Shlens, and Q. V. Le, "Do better ImageNet models transfer better?" in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[145] L. Yao, E. Poblenz, D. Dagunts, B. Covington, D. Bernard, and K. Lyman, "Learning to diagnose from scratch by exploiting dependencies among labels," arXiv preprint arXiv:1710.10501, 2017.
[146] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya et al., "CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning," arXiv preprint arXiv:1711.05225, 2017.
[147] Google Research Blog, "AutoML for large scale image classification and object detection," Google Research, https://research.googleblog.com/2017/11/automl-for-large-scaleimage.html, 2017.
[148] J. Liang, E. Meyerson, B. Hodjat, D. Fink, K. Mutch, and R. Miikkulainen, "Evolutionary neural AutoML for deep learning," in Genetic and Evolutionary Computation Conference (GECCO), 2019.
[149] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in IEEE Symposium on Security and Privacy (SP), 2017.