ADAPTIVE ON-DEVICE DEEP LEARNING SYSTEMS

By

Biyi Fang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering - Doctor of Philosophy

2019

ABSTRACT

ADAPTIVE ON-DEVICE DEEP LEARNING SYSTEMS

By

Biyi Fang

Mobile systems such as smartphones, drones, and augmented-reality headsets are revolutionizing our lives. On-device deep learning is regarded as the key enabling technology for realizing their full potential. This is because communication with the cloud adds additional latency or cost, and the applications must operate even with intermittent internet connectivity.

The key to achieving the full promise of these mobile vision systems is effectively analyzing the streaming video frames. However, processing streaming video frames taken in mobile settings is challenging in two ways. First, the processing usually involves multiple computer vision tasks. This multi-tenant characteristic requires mobile vision systems to concurrently run multiple applications that target different vision tasks. Second, the context in mobile settings can change frequently. This requires mobile vision systems to be able to switch applications to execute new vision tasks encountered in the new context.

In this article, we fill this critical gap by proposing NestDNN, a framework that enables resource-aware multi-tenant on-device deep learning for continuous mobile vision. NestDNN enables each deep learning model to offer flexible resource-accuracy trade-offs. At runtime, it dynamically selects the optimal resource-accuracy trade-off for each deep learning model to fit the model's resource demand to the system's available runtime resources. In doing so, NestDNN efficiently utilizes the limited resources in mobile vision systems to jointly maximize the performance of all the concurrently running applications.

Although NestDNN is able to efficiently utilize the resources by being resource-aware, it essentially treats the content of each input image equally and hence does not realize the full potential of such pipelines. To realize its full potential, we further propose FlexDNN, a novel content-adaptive framework that enables computation-efficient DNN-based on-device video stream analytics based on an early exit mechanism. Compared to state-of-the-art early exit-based solutions, FlexDNN addresses their key limitations and pushes the state-of-the-art forward through its innovative fine-grained design and automatic approach for generating the optimal network architecture.

Copyright by
BIYI FANG
2019

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction
Chapter 2  Related Work
Chapter 3  Resource-Aware Multi-Tenant On-Device Deep Learning
  3.1  Introduction of NestDNN
  3.2  Challenges and Our Solutions
  3.3  NestDNN Overview
  3.4  Design of NestDNN
       3.4.1  Filter-based Model Pruning
              3.4.1.1  Background on CNN Architecture
              3.4.1.2  Benefits of Filter Pruning
              3.4.1.3  Filter Importance Ranking
              3.4.1.4  Performance of Filter Importance Ranking
              3.4.1.5  Filter Pruning Roadmap
       3.4.2  Freeze-&-Grow based Model Recovery
              3.4.2.1  Motivation and Key Idea
              3.4.2.2  Model Freezing and Filter Growing
              3.4.2.3  Superiority of Multi-Capacity Model
       3.4.3  Resource-Aware Scheduler
              3.4.3.1  Motivation and Key Idea
              3.4.3.2  Cost Function
              3.4.3.3  Scheduling Schemes
              3.4.3.4  Cached Greedy Heuristic Approximation
  3.5  Evaluation
       3.5.1  Datasets, DNNs and Applications
              3.5.1.1  Datasets
              3.5.1.2  DNN Models
              3.5.1.3  Mobile Vision Applications
       3.5.2  Performance of Multi-Capacity Model
              3.5.2.1  Experimental Setup
              3.5.2.2  Optimized Resource-Accuracy Trade-offs
              3.5.2.3  Reduction on Memory Footprint
              3.5.2.4  Reduction on Model Switching Overhead
       3.5.3  Performance of Resource-Aware Scheduler
              3.5.3.1  Experimental Setup
              3.5.3.2  Improvement on Accuracy and Frame Rate
              3.5.3.3  Reduction on Energy Consumption
  3.6  Conclusion of NestDNN

Chapter 4  Content-Adaptive On-Device Deep Learning
  4.1  Introduction of FlexDNN
  4.2  Background & Motivation
       4.2.1  Dynamics of Mobile Video Contents & Benefit of Leveraging the Dynamics
       4.2.2  Drawbacks of Existing Solutions
  4.3  Design of FlexDNN
       4.3.1  Filter Ranking based on Collective Importance
              4.3.1.1  Motivation
              4.3.1.2  Collective Importance-based Filter Ranking
              4.3.1.3  Superiority of Collective Importance-based Filter Ranking
       4.3.2  Searching for the Optimal Early Exit Architecture
              4.3.2.1  Motivation
              4.3.2.2  Early Exit Architecture Search Scheme
              4.3.2.3  Early Exit Rate
       4.3.3  Early Exit Insertion Plan
              4.3.3.1  Motivation
              4.3.3.2  Problem Formulation
              4.3.3.3  Efficiency of Early Exit
       4.3.4  Runtime Adaptation to Workload and System Resource Dynamics
              4.3.4.1  Motivation
              4.3.4.2  Accuracy-Resource Profile
              4.3.4.3  Runtime Adaptation
  4.4  Implementation
  4.5  Evaluation
       4.5.1  Methodology
       4.5.2  Model Performance
              4.5.2.1  High Early Exit Rate
              4.5.2.2  Computation-Efficient Early Exits
              4.5.2.3  High Computational Consumption Reduction
              4.5.2.4  Compact Memory Footprint
       4.5.3  Runtime Performance
              4.5.3.1  Top-1 Accuracy vs. Frame Processing Time
              4.5.3.2  Reduction on Energy Consumption
              4.5.3.3  Performance of Runtime Adaptation
       4.5.4  Performance on ImageNet
  4.6  Conclusion of FlexDNN

Chapter 5  Conclusion

BIBLIOGRAPHY
LIST OF TABLES

Table 3.1: Defined terminologies and their explanations.
Table 3.2: Summary of datasets, DNN models, and mobile vision applications used in this work.
Table 3.3: Benefit of multi-capacity model on memory footprint reduction.
Table 3.4: Benefit of multi-capacity model on model switching (model upgrade) in terms of memory usage.
Table 3.5: Benefit of multi-capacity model on model switching (model downgrade) in terms of memory usage.
Table 4.1: Summary of three applications, two base models, and six generated FlexDNN models.

LIST OF FIGURES

Figure 3.1: NestDNN architecture.
Figure 3.2: Illustration of filter pruning [1]. By pruning filters, both model size and computational cost are reduced.
Figure 3.3: Filter importance profiling performance of (a) TRR and (b) L1-norm on VGG-16 trained on CIFAR-10.
Figure 3.4: Illustration of model freezing and filter growing.
Figure 3.5: Illustration of model switching (model upgrade vs. model downgrade) of multi-capacity model.
Figure 3.6: Top-1 test accuracy vs. model size comparison between descendant models and baseline models.
Figure 3.7: Computational cost comparison between descendant models and vanilla models.
Figure 3.8: Model switching energy consumption comparison between multi-capacity models and independent models. The energy consumption is measured on a Samsung Galaxy S8 smartphone.
Figure 3.9: Profile of all accumulated simulations generated by our designed benchmark.
Figure 3.10: Runtime performance comparison between baseline and NestDNN under scheduling scheme: (a) MinTotalCost; (b) MinMaxCost.
Figure 3.11: Energy consumption comparison between baseline and NestDNN under scheduling scheme: (a) MinTotalCost; (b) MinMaxCost.
Figure 4.1: A conceptual overview of FlexDNN. FlexDNN is built on top of a base model with the augmentation of one or more early exits inserted throughout the base model. For an easy input, it exits at the early exit inserted at an earlier location since the extracted features are good enough to classify the content in the easy input. As such, the easy input avoids further computational cost incurred onward. For a hard input, it proceeds deeper until the extracted features are good enough to exit the hard input.
Figure 4.2: Illustration of four frames of a video clip of biking captured using a mobile camera in the UCF-101 dataset: (a) and (d) are frames with contents that are easy to recognize; (b) and (c) are frames with contents that are hard to recognize.
Figure 4.3: Blue solid curve: minimum computational consumption to correctly recognize the content in each frame (optimal model). Red dotted curve: computational consumption of the one-size-fits-all model.
Figure 4.4: Benefit brought by the adaptation vs. model switching overhead of the dynamic configuration approach.
Figure 4.5: FlexDNN architecture.
Figure 4.6: Comparison of the minimum number of filters needed to achieve the same accuracy as using all filters within each layer, between the collective importance-based ranking scheme and the scheme based on independent ranking.
Figure 4.7: Illustration of derivation of quality of an early exit.
Figure 4.8: Accuracy-latency profile under different values.
Figure 4.9: (a) UCF-15: example video frame of biking (left) and skiing (right).
(b) Place-8: illustration of data collection using an ORDRO EP5 head-mounted camera (left); example video frame of parking lot (right). (c) TDrone: illustration of data collection using a DJI Mavic Pro drone (left); example video frame of traffic surveillance in the residential area (right).
Figure 4.10: Comparison between FlexDNN and baseline approaches in the accuracy-frame processing time space.
Figure 4.11: (a) Comparison between the accumulated computational cost of all the early exits and the computational cost of the base model. (b) Comparison between saving (the average computation saved from early exiting per frame) and overhead (the average computation consumed by the early exits each frame goes through but fails to exit).
Figure 4.12: Cumulative exit rate at each early exit without loss of accuracy. Early exits (marked as E1, E2, ...) are ordered based on their distances to the input layer (i.e., E1 is the earliest exit). FE denotes the regular exit of the base model.
Figure 4.13: Memory footprint comparison between the FlexDNN model and its bag-of-model counterpart.
Figure 4.14: Comparison between FlexDNN and baseline approaches in the accuracy-energy consumption space.
Figure 4.15: Performance of runtime adaptation to workload and system resource dynamics.
Figure 4.16: Top-1 accuracy comparison between MSDNet-mobile ...

Chapter 1

Introduction

Mobile vision systems such as smartphones, drones, and augmented reality headsets are ubiquitous today. As AI chipsets emerge, these systems begin to support on-device live video stream processing, which is the enabler of a wide range of continuous mobile vision applications. This trend is fueled by the recent advancement in deep learning (i.e., Deep Neural Networks (DNNs)) [2] due to its success in achieving impressively high accuracies in many important vision tasks.

Continuous mobile vision applications require continuously processing the streaming video input and returning the processing results with low latency. Unfortunately, DNNs are known to be memory and computationally intensive [3], and high computational demand translates into high processing latency and high energy consumption. Thus, considering that mobile systems are constrained by limited resources in terms of computation, memory, and battery capacities, reducing the resource demands of DNNs is crucial to realizing the full potential of continuous mobile vision applications.

The key to achieving the full promise of these mobile vision systems is effectively analyzing the streaming video frames. However, processing streaming video frames taken in mobile settings is challenging in two ways. First, the processing usually involves multiple computer vision tasks. This multi-tenant characteristic requires mobile vision systems to concurrently run multiple applications that target different vision tasks. Second, the context in mobile settings can change frequently. This requires mobile vision systems to be able to switch applications to execute new vision tasks encountered in the new context.

In Chapter 3, we fill this critical gap by proposing NestDNN, a framework that enables resource-aware multi-tenant on-device deep learning for continuous mobile vision. It takes the dynamics of runtime resources in a mobile vision system into consideration, and dynamically selects the optimal resource-accuracy trade-off and resource allocation for each of the concurrently running deep learning models to jointly maximize their performance. We evaluate NestDNN using six mobile vision applications that target some of the most important vision tasks for mobile vision systems. Results show that it outperforms the resource-agnostic counterpart significantly. From NestDNN, we discover yet another problem that it does not solve.
Although NestDNN is able to achieve resource-aware multi-tenant on-device deep learning, it essentially treats the content of each input image equally. In Chapter 4, to realize the full potential of the DNN-based processing pipeline, we further propose FlexDNN, a novel content-adaptive framework that enables computation-efficient DNN-based on-device video stream analytics based on an early exit mechanism. Compared to state-of-the-art early exit-based solutions, FlexDNN addresses their key limitations and pushes the state-of-the-art forward through its innovative fine-grained design and automatic approach for generating the optimal network architecture. We use FlexDNN to build three computation-efficient continuous mobile vision applications on top of MobileNets. Our evaluation results show that FlexDNN significantly outperforms both computation-efficient content-agnostic and state-of-the-art content-adaptive approaches, reducing computational consumption by a large margin.

Lastly, we conclude this report in Chapter 5.

Chapter 2

Related Work

Our related work contains four parts: 1) Mobile Sensing Systems; 2) Deep Neural Network Model Compression; 3) Continuous Mobile Vision; and 4) Content-Adaptive Video Stream Analytics.

Mobile Sensing Systems. Our work is also broadly related to research in mobile sensing systems. Prior mobile sensing systems have explored a variety of sensing modalities that have enabled a wide range of innovative applications. Among them, accelerometer, microphone, and physiological sensors are some of the most explored sensing modalities. For example, Mokaya et al. developed an accelerometer-based system to sense skeletal muscle vibrations for quantifying skeletal muscle fatigue in an exercise setting [4]. Nirjon et al. developed MusicalHeart [5], which integrated a microphone into an earphone to extract heartbeat information from audio signals. Nguyen et al. designed an in-ear sensing system in the form of earplugs that is able to capture EEG, EOG, and EMG signals for sleep monitoring [6]. Recently, researchers have started exploring wireless radio signals as a contactless sensing mechanism. For example, Wang et al. developed WiFall [7], which used wireless radio signals to detect accidental falls. Fang et al. used radio as a single sensing modality for integrated activities-of-daily-living and vital sign monitoring [8]. In this work, we explore infrared light as a new sensing modality in the context of ASL translation. It complements existing mobile sensing systems by providing a non-intrusive and high-resolution sensing scheme. We regard this work as an excellent example to demonstrate the usefulness of infrared sensing for mobile systems. With the incoming era of virtual/augmented reality, we envision that infrared sensing will be integrated into many future mobile systems such as smartphones and smart glasses.

Deep Neural Network Model Compression. Model compression for deep neural networks has attracted a lot of attention in recent years due to the imperative demand for running deep learning models on mobile systems. One of the most prevalent methods for compressing deep neural networks is pruning. Han et al. [9] proposed a parameter pruning method that removes node connections with small weights. Although this method is effective at reducing model sizes, it does not effectively reduce computational costs. To overcome this problem, Li et al. [1] proposed a filter pruning method that has achieved up to 38% reduction in computational cost. Our work also focuses on compressing deep neural networks via filter pruning. Our proposed filter pruning approach outperforms the state-of-the-art. Moreover, unlike existing model compression methods which produce pruned models with fixed resource-accuracy trade-offs, our proposed multi-capacity model is able to provide dynamic resource-accuracy trade-offs. This is similar to the concept of dynamic neural networks in the deep learning literature.

Continuous Mobile Vision. The concept of continuous mobile vision was first advanced by Bahl et al.
[13]. The last few years have witnessed many efforts towards realizing the vision of continuous mobile vision. In particular, in [14], LiKamWa et al. proposed a framework named Starfish, which enables efficiently running concurrent vision applications on mobile devices by sharing common computation and memory objects across applications. Our work is inspired by Starfish in terms of sharing. By sharing parameters among descendant models, our proposed multi-capacity model has a compact memory footprint and incurs little model switching overhead. Our work is also inspired by [15]. In [15], Han et al. proposed a framework named MCDNN, which applies various model compression techniques to generate a catalog of model variants to provide different resource-accuracy trade-offs. However, in MCDNN, the generated model variants are independent of each other, and it relies on cloud connectivity to retrieve the desired model variant. In contrast, our work focuses on developing an on-device deep learning framework which does not rely on cloud connectivity. Moreover, MCDNN focuses on model sharing across concurrently running applications. In contrast, NestDNN treats each of the concurrently running applications independently, and focuses on model sharing across different model variants within each application.

Content-Adaptive Video Stream Analytics. FlexDNN is closely related to content-adaptive video stream analytics. In [18], the authors proposed a content-adaptive video stream analytics system named Chameleon that dynamically changes the DNN models to adapt to the difficulty levels of the video frames. However, it requires the system to carry all the model variants with various capacities and thus does not fit resource-constrained mobile platforms. To address this limitation, BranchyNet [19] and MSDNet [20] use a single model with augmented early exits to realize the adaptation to the difficulty levels of video frames. Similar to BranchyNet and MSDNet, FlexDNN also uses a single model with augmented early exits for content adaptation. However, FlexDNN differs from these pioneering works in that it adopts a fine-grained approach to make early predictions at the granularity of filters, and it automatically generates the optimal early exit architecture and the optimal early exit insertion plan with the objective of maximizing the benefit brought by early exits.

Chapter 3

Resource-Aware Multi-Tenant On-Device Deep Learning

3.1 Introduction of NestDNN

Mobile systems with onboard video cameras such as smartphones, drones, wearable cameras, and augmented-reality headsets are revolutionizing the way we live, work, and interact with the world. By processing the streaming video inputs, these mobile systems are able to retrieve visual information from the world and are promised to open up a wide range of new applications and services. For example, a drone that can detect vehicles, identify road signs, and track traffic flows will enable mobile traffic surveillance with aerial views that traditional traffic surveillance cameras positioned at fixed locations cannot provide [21]. A wearable camera that can recognize everyday objects, identify people, and understand the surrounding environments can be a life-changer for blind and visually impaired individuals [22].

The key to achieving the full promise of these mobile vision systems is effectively analyzing the streaming video frames. However, processing streaming video frames taken in mobile settings is challenging in two ways. First, the processing usually involves multiple computer vision tasks. This multi-tenant characteristic requires mobile vision systems to concurrently run multiple applications that target different vision tasks [14]. Second, the context in mobile settings can change frequently. This requires mobile vision systems to be able to switch applications to execute new vision tasks encountered in the new context [15].
In the past few years, deep learning (e.g., Deep Neural Networks (DNNs)) [2] has become the dominant approach in computer vision due to its capability of achieving impressively high accuracies on a variety of important vision tasks. As deep learning chipsets emerge, there is a significant interest in leveraging the on-device computing resources to execute deep learning models on mobile systems without cloud support. Compared to the cloud, mobile systems are constrained by limited resources. Unfortunately, deep learning is known to be resource-demanding [3]. To enable on-device deep learning, one of the common techniques used by application developers is compressing the deep learning model to reduce its resource demand, at a modest loss in accuracy as the trade-off [15, 29]. Although this technique has gained considerable popularity and has been applied to developing state-of-the-art mobile deep learning systems [17], it has a key drawback: since application developers develop their applications independently, the resource-accuracy trade-off of the compressed model is predetermined based on a static resource budget at the application development stage and is fixed after the application is deployed. However, the available resources in mobile vision systems at runtime are always dynamic because the concurrently running applications change depending on the context. As a consequence, when the resources available at runtime do not meet the resource demands of the compressed deep learning models, resource contention among concurrently running applications occurs, forcing the streaming video to be processed at a much lower frame rate. On the other hand, when extra resources become available at runtime, the compressed deep learning models cannot utilize them to regain their sacrificed accuracies.

In this work, we present NestDNN, a framework that takes the dynamics of runtime resources into consideration to enable resource-aware multi-tenant on-device deep learning for mobile vision systems. It replaces fixed resource-accuracy trade-offs with flexible resource-accuracy trade-offs, and dynamically selects the optimal resource-accuracy trade-off of the deep learning model at runtime to fit the model's resource demand to the system's available runtime resources.

At its core, NestDNN employs a pruning and recovery scheme which transforms an off-the-shelf deep learning model into a single compact multi-capacity model. The multi-capacity model has two key features. First, the multi-capacity model is comprised of a set of sub-models. Each sub-model has a unique capacity to provide an optimized resource-accuracy trade-off. Second, unlike traditional model variants that are independent of each other, the sub-model with smaller capacity shares its model architecture and its model parameters with the sub-model with larger capacity, making itself nested inside the sub-model with larger capacity without taking extra memory space. By doing so, the multi-capacity model is able to provide various resource-accuracy trade-offs with a compact memory footprint.

The creation of the multi-capacity model enables NestDNN to jointly maximize the performance of concurrent vision applications running on mobile vision systems. This possibility comes from two key insights. First, while a certain amount of runtime resources can be traded for an accuracy gain in one vision application, the same amount of runtime resources can be traded for a much larger accuracy gain in another vision application.
Second, some vision applications do not require real-time response and thus can tolerate a relatively large inference latency (e.g., scene understanding). This presents an opportunity to reallocate some runtime resources from the latency-tolerant vision applications to vision applications that need more runtime resources to meet their real-time requirements (e.g., road sign identification). NestDNN exploits these insights by incorporating the accuracy and inference latency into a cost function for each vision application. Given the cost functions of all the concurrently running vision applications, NestDNN employs a runtime scheduler that selects the most suitable sub-model for each application and determines the optimal amount of runtime resources to allocate to each selected sub-model, so as to jointly maximize the accuracy and minimize the inference latency of the concurrent vision applications.

We have conducted a rich set of experiments to evaluate the performance of NestDNN. To examine the performance of the multi-capacity model, we evaluated it on six mobile vision applications that target some of the most important computer vision tasks for mobile vision systems. These applications are developed based on two widely used deep learning models, VGGNet and ResNet, and six commonly used datasets from the computer vision community. To examine the performance of runtime scheduling, we implemented NestDNN and the six mobile vision applications on three smartphones. The detailed contributions and important results are summarized as follows:

Multi-Capacity Model. We have designed a pruning and recovery scheme to transform an off-the-shelf deep learning model into a single compact multi-capacity model that is able to provide optimized resource-accuracy trade-offs. The pruning and recovery scheme consists of a model pruning phase and a model recovery phase. For model pruning, we have devised a state-of-the-art filter pruning approach named Triplet Response Residual (TRR) that significantly reduces not only the size of a deep learning model but also its computational cost. For model recovery, we have devised an innovative model freezing and filter growing (i.e., freeze-&-grow) approach that generates the multi-capacity model in an iterative manner. Our experimental results show that the NestDNN multi-capacity model is able to provide optimized resource-accuracy trade-offs, and significantly reduces model memory footprint as well as model switching overhead.

Resource-Aware Scheduler. We have designed a runtime scheduler that jointly maximizes the accuracy and minimizes the inference latency of concurrent vision applications running on mobile vision systems. The available runtime resources in mobile vision systems are dynamic and can change due to events such as starting new applications, closing existing applications, and application priority changes. The scheduler is able to select the most appropriate accuracy-resource trade-off for each of the concurrent vision applications and determine the amount of runtime resources to allocate to each application to optimize the scheduling objective. Our experimental results show that the NestDNN runtime scheduler outperforms the non-resource-aware counterpart on two state-of-the-art scheduling schemes, achieving as much as a 4.2% increase in accuracy, a 2.0x increase in frame rate, and a 1.7x reduction in energy consumption when running concurrent mobile vision applications.

To the best of our knowledge, NestDNN represents the first framework that enables resource-aware multi-tenant on-device deep learning for continuous mobile vision. It contributes novel techniques that address the unique challenges in continuous mobile vision. We believe that our work represents a significant step toward turning the envisioned continuous mobile vision into reality [13, 14, 34].

3.2 Challenges and Our Solutions

The design of NestDNN presents a number of challenges. In this section, we describe these challenges, followed by explaining how they can be effectively addressed by NestDNN.
Computational Intensity of Deep Learning Models. The foundation of NestDNN is the generation of a multi-capacity model from an off-the-shelf deep learning model. The multi-capacity model needs to be compact and computationally lightweight such that it is able to run efficiently on resource-limited mobile systems. However, deep learning models are known to be memory and computationally intensive [35, 36]. Running one deep learning model is sometimes challenging for mobile systems, let alone running multiple of them concurrently. To address this challenge, we propose a state-of-the-art deep learning model pruning approach named Triplet Response Residual (TRR). Most widely used pruning approaches focus on pruning model parameters [9, 37]. Although pruning parameters can significantly reduce model size, it does not necessarily reduce computational cost in terms of floating point operations (i.e., FLOPs) [1, 38, 39], making it less useful for mobile systems which need to provide real-time services. In contrast, our approach focuses on pruning filters, which effectively reduces both the size of a deep learning model and its computational cost.

Figure 3.1: NestDNN architecture.

Large Amount of Model Variants. One key feature of NestDNN is the provision of flexible trade-offs between accuracy and resource use. To achieve this resource-aware flexibility, one naive approach is to have all the possible model variants with various resource-accuracy trade-offs installed in the mobile system. However, since these model variants are independent of each other, this approach is not scalable and becomes infeasible when the mobile system concurrently runs multiple deep learning models, with each of them having multiple model variants. To address this challenge, we propose an innovative freeze-&-grow approach that uses the filter pruning roadmap generated during the model pruning phase as the guideline to produce a single multi-capacity model with all model variants within itself. More importantly, unlike traditional model variants that are independent of each other, these model variants share model architecture and parameters with each other, giving the multi-capacity model a compact memory footprint.

Lack of Resource-Aware Scheduler. The available runtime resources in mobile vision systems are dynamic and change due to events such as starting new applications, closing existing applications, and application priority changes. As such, using a model with a fixed resource-accuracy trade-off for a vision application will make the application underperform when extra resources become available. To address this problem, we propose a runtime scheduler that is resource-aware. Once runtime resources become available, the scheduler determines which resource-accuracy trade-off to adopt for each application and how many runtime resources to allocate to each application. As such, runtime resources are best utilized to maximize the performance of all the vision applications concurrently running on a mobile vision system.

3.3 NestDNN Overview

Figure 3.1 illustrates the architecture of NestDNN, which is split into an offline stage and an online stage. The offline stage consists of three phases: model pruning, model recovery, and model profiling.

In the model pruning phase, by following the filter importance ranking provided by our TRR approach, filters in a given trained off-the-shelf deep learning model (i.e., the vanilla model) are iteratively pruned. During each iteration, less important filters are pruned, and the pruned model is retrained to compensate for the accuracy degradation (if there is any) caused by filter pruning. The iteration ends when the pruned model cannot meet the minimum accuracy requirement set by the user. This smallest pruned model is called the seed model. As a result, a filter pruning roadmap is created where each footprint on the roadmap is a pruned model with its filter pruning record. This filter pruning roadmap, including the seed model, is passed to the model recovery phase.
In the model recovery phase, by following the filter pruning roadmap and the freeze-&-grow approach, the multi-capacity model of the vanilla model is iteratively generated. Model recovery uses the seed model as the starting point. During each iteration, the architecture and parameters of the model are first frozen. By following the filter pruning roadmap in the reverse order and adding the pruned filters back, a descendant model with a larger capacity is generated. Accuracy is regained by retraining the descendant model. By repeating the iteration, a new descendant model is grown upon the previous descendant model. As a result, the final descendant model has the capacities of all the previous descendant models and is thus named the multi-capacity model.

In the model profiling phase, given the system specs of a mobile vision system, a model profile is generated for the multi-capacity model, including the accuracy, memory footprint, and processing latency of each of its capacities.

Finally, in the online stage, the NestDNN scheduler continuously monitors events including changes in available runtime resources, starting new applications, closing existing applications, and changes in application priority, and is triggered once such an event occurs. Once triggered, the scheduler examines the model profiles of all running vision applications, selects the appropriate descendant model for each application, and determines the amount of runtime resources to allocate to each selected descendant model to jointly maximize the accuracy and minimize the inference latency of those applications.

For clarification purposes, Table 3.1 summarizes the terminologies defined in this work and their brief explanations. In the next section, we describe NestDNN in detail.

Table 3.1: Defined terminologies and their explanations.

  Vanilla Model:        Off-the-shelf deep learning model (e.g., ResNet) trained on a given dataset (e.g., ImageNet).
  Pruned Model:         Intermediate result obtained during model pruning.
  Seed Model:           Smallest pruned model generated during model pruning that meets the minimum accuracy requirement set by the user. It is also the starting point of model recovery.
  Descendant Model:     A sub-model grown upon the Seed Model during model recovery. It has a unique resource-accuracy capacity.
  Multi-Capacity Model: The lastly generated Descendant Model, which has the capacities of all the previously generated descendant models nested in a single model.

3.4 Design of NestDNN

3.4.1 Filter-based Model Pruning

3.4.1.1 Background on CNN Architecture

Before delving deep into filter pruning, it is important to understand the architecture of a convolutional neural network (CNN). In general, a CNN consists of four types of layers: convolutional layers, activation layers, pooling layers, and fully-connected layers. Due to the computational intensity of convolution operations, convolutional layers are the most computationally intensive among the four types of layers. Specifically, each convolutional layer is composed of a set of 3D filters, which play the role of "feature extractors". By convolving an image with these 3D filters, the layer generates a set of features organized in the form of feature maps, which are further sent to the following convolutional layers for further feature extraction.

3.4.1.2 Benefits of Filter Pruning

Figure 3.2 illustrates the details of filter pruning. Let $\mathcal{F}_{j-1} \in \mathbb{R}^{w_{j-1} \times h_{j-1} \times m_{j-1}}$ denote the input feature maps of the $j$-th convolutional layer $conv_j$ of a CNN, where $w_{j-1}$ and $h_{j-1}$ are the width and height of each of the input feature maps, and $m_{j-1}$ is the total number of the input feature maps. The convolutional layer $conv_j$ consists of $m_j$ 3D filters of size $k \times k \times m_{j-1}$ ($k \times k$ is the 2D kernel). It applies these filters onto the input feature maps $\mathcal{F}_{j-1}$ to generate the output feature maps $\mathcal{F}_j \in \mathbb{R}^{w_j \times h_j \times m_j}$, where one 3D filter generates one output feature map. This process involves a total of $m_j k^2 m_{j-1} w_j h_j$ floating point operations (i.e., FLOPs).

Since one 3D filter generates one output feature map, pruning one 3D filter in $conv_j$ (marked in green in $conv_j$) results in removing one output feature map in $\mathcal{F}_j$ (marked in green in $\mathcal{F}_j$), which leads to a reduction of $k^2 m_{j-1}$ parameters and $k^2 m_{j-1} w_j h_j$ FLOPs. Subsequently, the $m_{j+1}$ 2D kernels applied onto that removed output feature map in the convolutional layer $conv_{j+1}$ (marked in green in $conv_{j+1}$) are also removed. This leads to an additional reduction of $k^2 m_{j+1}$ parameters and $k^2 m_{j+1} w_{j+1} h_{j+1}$ FLOPs. Therefore, by pruning filters, both model size (i.e., model parameters) and computational cost (i.e., FLOPs) are reduced [1].

Figure 3.2: Illustration of filter pruning [1]. By pruning filters, both model size and computational cost are reduced.
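To make this arithmetic concrete, the following minimal Python sketch (our own illustration, not part of NestDNN; the function name and the example dimensions are hypothetical) computes the parameter and FLOP savings from pruning a single 3D filter in $conv_j$:

```python
def pruning_savings(k, m_prev, m_next, w, h, w_next, h_next):
    """Parameter and FLOP savings from pruning one 3D filter in conv_j.

    k:              2D kernel side length (k x k)
    m_prev:         number of input feature maps of conv_j (m_{j-1})
    m_next:         number of filters in conv_{j+1} (m_{j+1})
    w, h:           spatial size of conv_j's output feature maps
    w_next, h_next: spatial size of conv_{j+1}'s output feature maps
    """
    # Dropping the filter itself removes k^2 * m_{j-1} parameters and the
    # k^2 * m_{j-1} * w_j * h_j FLOPs spent on its output feature map.
    params = k * k * m_prev
    flops = k * k * m_prev * w * h
    # The m_{j+1} 2D kernels in conv_{j+1} that consumed the removed
    # feature map are also dropped, saving k^2 * m_{j+1} parameters and
    # k^2 * m_{j+1} * w_{j+1} * h_{j+1} FLOPs on top.
    params += k * k * m_next
    flops += k * k * m_next * w_next * h_next
    return params, flops

# Hypothetical example: 3x3 kernels, 64 input maps, 128 downstream filters,
# and 32x32 feature maps in both layers.
print(pruning_savings(k=3, m_prev=64, m_next=128, w=32, h=32,
                      w_next=32, h_next=32))
```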
3.4.1.3 Filter Importance Ranking

The key to filter pruning is identifying less important filters. By pruning those filters, the size and computational cost of a CNN model can be effectively reduced.

To this end, we propose a filter importance ranking approach named Triplet Response Residual (TRR) to measure the importance of filters and rank them based on their relative importance. Our TRR approach is inspired by one key intuition: since a filter plays the role of a "feature extractor", a filter is important if it is able to extract feature maps that are useful for differentiating images belonging to different classes. In other words, a filter is important if the feature maps it extracts from images belonging to the same class are more similar than the ones extracted from images belonging to different classes.

Let {anc, pos, neg} denote a triplet that consists of an anchor image (anc), a positive image (pos), and a negative image (neg), where the anchor image and the positive image are from the same class, while the negative image is from a different class. Following the key intuition, the TRR of filter $i$ is defined as:

$$\mathrm{TRR}_i = \sum \left( \left\| F_i(anc) - F_i(neg) \right\|_2^2 - \left\| F_i(anc) - F_i(pos) \right\|_2^2 \right) \tag{3.1}$$

where $F_i(\cdot)$ denotes the feature map generated by filter $i$. Essentially, TRR calculates the $L_2$ distances of feature maps between (anc, neg) and between (anc, pos), and measures the residual between the two distances. By summing up the residuals of all the triplets from the training dataset, the TRR value of a particular filter reflects its capability of differentiating images belonging to different classes, acting as a measure of the importance of the filter within the CNN model.
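As a minimal illustration of Equation 3.1 (our own sketch, not NestDNN's implementation), the following NumPy code scores one filter given the feature maps it produced for a batch of triplets, and ranks a set of filters by their TRR scores:

```python
import numpy as np

def trr_score(feat_anc, feat_pos, feat_neg):
    """Triplet Response Residual (Eq. 3.1) of one filter.

    Each argument has shape (num_triplets, w, h): the feature maps this
    filter produced for the anchor, positive, and negative image of each
    triplet drawn from the training set.
    """
    # Squared L2 distance between (anchor, negative) and (anchor, positive)
    # feature maps, summed over all triplets. A large residual means the
    # filter separates classes well, i.e., it is important.
    d_neg = np.sum((feat_anc - feat_neg) ** 2, axis=(1, 2))
    d_pos = np.sum((feat_anc - feat_pos) ** 2, axis=(1, 2))
    return float(np.sum(d_neg - d_pos))

def rank_filters(per_filter_feats):
    """Rank filters from least to most important by TRR score.

    per_filter_feats: list of (feat_anc, feat_pos, feat_neg) tuples,
    one tuple per filter.
    """
    scores = [trr_score(*feats) for feats in per_filter_feats]
    return np.argsort(scores)  # least important filters come first
```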
3.4.1.4 Performance of Filter Importance Ranking

Figure 3.3(a) illustrates the filter importance profiling performance of our TRR approach on VGG-16 [36] trained on the CIFAR-10 dataset [40]. The vanilla VGG-16 model contains 13 convolutional layers. Each of the 13 curves in the figure depicts the top-1 test accuracies when filters of one particular convolutional layer are pruned while the other convolutional layers remain unmodified. Each marker on a curve corresponds to the top-1 test accuracy when a particular percentage of filters is pruned. As an example, the topmost curve (blue dotted line with blue triangle markers) shows that the accuracies are 89.75%, 89.72%, and 87.40% when 0% (i.e., the vanilla model), 50%, and 90% of the filters in the 13th convolutional layer $conv_{13}$ are pruned, respectively.

Figure 3.3: Filter importance profiling performance of (a) TRR and (b) L1-norm on VGG-16 trained on CIFAR-10.

We have two key observations from the filter importance profiling result. First, we observe that our TRR approach is able to effectively identify redundant filters within each convolutional layer. In particular, the accuracy remains the same when 59.96% of the filters in $conv_{13}$ are pruned. This indicates that these pruned filters, identified by TRR, are redundant. By pruning these redundant filters, the vanilla VGG-16 model can be effectively compressed without any accuracy degradation. Second, we observe that our TRR approach is able to effectively identify convolutional layers that are more sensitive to filter pruning. This is reflected by the differences in accuracy drops when the same percentage of filters is pruned at different convolutional layers. This sensitivity difference across convolutional layers is taken into account in the iterative filter pruning process.

To demonstrate the superiority of our TRR approach, we have compared it with the state-of-the-art filter pruning approach, which uses the $L_1$-norm of a filter to measure its importance [1]. Figure 3.3(b) illustrates the filter importance profiling performance of the $L_1$-norm on the same vanilla VGG-16 model trained on the CIFAR-10 dataset. By comparing Figure 3.3(a) to Figure 3.3(b), we observe that TRR achieves better accuracy than the $L_1$-norm at almost every pruned filter percentage across all 13 curves. As a concrete example, TRR achieves an accuracy of 89.72% and 87.40% when 50% and 90% of the filters at $conv_{13}$ are pruned respectively, while the $L_1$-norm only achieves an accuracy of 75.45% and 42.65% correspondingly. This result indicates that the filters pruned by TRR have much less impact on accuracy than the ones pruned by the $L_1$-norm, demonstrating that TRR outperforms the $L_1$-norm at identifying less important filters.

3.4.1.5 Filter Pruning Roadmap

By following the filter importance ranking provided by TRR, we iteratively prune the filters in a CNN model. During each iteration, less important filters across convolutional layers are pruned, and the pruned model is retrained to compensate for the accuracy degradation (if there is any) caused by filter pruning. The iteration ends when the pruned model cannot meet the minimum accuracy goal set by the user. As a result, a filter pruning roadmap is created where each footprint on the roadmap is a pruned model with its filter pruning record. The smallest pruned model on the roadmap is called the seed model. This filter pruning roadmap is used to guide the model recovery process described below.
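The iterative procedure can be summarized by the following high-level sketch (ours; rank_filters, prune, retrain, and evaluate are placeholders for the model-specific routines such as TRR ranking and fine-tuning):

```python
def build_pruning_roadmap(model, min_accuracy, prune_ratio,
                          rank_filters, prune, retrain, evaluate):
    """Iteratively prune a model and record the filter pruning roadmap.

    Returns the seed model (the smallest pruned model that still meets
    min_accuracy) together with the roadmap, whose entries are
    (pruned_model, pruning_record) footprints.
    """
    roadmap = []
    while True:
        ranking = rank_filters(model)             # e.g., TRR-based ranking
        candidate, record = prune(model, ranking, prune_ratio)
        candidate = retrain(candidate)            # compensate accuracy loss
        if evaluate(candidate) < min_accuracy:
            break                                 # stop: goal no longer met
        roadmap.append((candidate, record))       # one footprint per step
        model = candidate                         # continue from pruned model
    return model, roadmap                         # model is the seed model
```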
3.4.2 Freeze-&-Grow based Model Recovery

3.4.2.1 Motivation and Key Idea

The filter pruning process generates a series of pruned models, each of which acts as a model variant of the vanilla model with a unique resource-accuracy trade-off. However, due to the retraining step within each pruning iteration, these pruned models have different model parameters, and thus are independent of each other. Therefore, although these pruned models provide different resource-accuracy trade-offs, keeping all of them locally in resource-limited mobile systems is practically infeasible.

To address this problem, we propose to generate a single multi-capacity model that acts equivalently to the series of pruned models in providing the various resource-accuracy trade-offs, but has a model size that is much smaller than the accumulated model size of all the pruned models. This is achieved by an innovative model freezing and filter growing (i.e., freeze-&-grow) approach. In the remainder of this section, we describe the details of the freeze-&-grow approach and how the multi-capacity model is iteratively generated.

3.4.2.2 Model Freezing and Filter Growing

The generation of the multi-capacity model starts from the seed model derived from the filter pruning process. By following the filter pruning roadmap and the freeze-&-grow approach, the multi-capacity model is iteratively created.

Figure 3.4: Illustration of model freezing and filter growing.

Figure 3.4 illustrates the details of model freezing and filter growing during the first iteration. For illustration purposes, only one convolutional layer is depicted. As shown, given the seed model, we first apply model freezing to freeze the parameters of all its filters (marked as blue squares). Next, since each footprint on the roadmap has its filter pruning record, we follow the filter pruning roadmap in the reverse order and apply filter growing to add the pruned filters back (marked as green striped squares). With the added filters, the capacity of this descendant model is increased. Lastly, we retrain this descendant model to regain accuracy. It is important to note that during retraining, since the seed model is frozen, its parameters are not changed; only the parameters of the added filters are changed (marked as green solid squares to indicate the parameters are changed). As such, we have generated a single model that not only has the capacity of the seed model but also has the capacity of the descendant model. Moreover, the seed model shares all its model parameters with the descendant model, making itself nested inside the descendant model without taking extra memory space.

By repeating the iteration, a new descendant model is grown upon the previous one. As such, the final descendant model has the capacities of all the previous descendant models and is thus named the multi-capacity model.

3.4.2.3 Superiority of Multi-Capacity Model

The generated multi-capacity model has the following three key advantages.

One Compact Model with Multiple Capabilities. The generated multi-capacity model is able to provide multiple capacities nested in a single model. This eliminates the need to install a potentially large number of independent model variants with different capacities. Moreover, by sharing parameters among descendant models, the multi-capacity model saves a large amount of memory space and thus significantly reduces its memory footprint.

Optimized Resource-Accuracy Trade-offs. Each capacity provided by the multi-capacity model has a unique optimized resource-accuracy trade-off. Our TRR approach provides state-of-the-art performance at identifying and pruning less important filters. As a result, the multi-capacity model delivers state-of-the-art inference accuracy under a given resource budget.

Efficient Model Switching. Because of parameter sharing, the multi-capacity model is able to switch between its models with little overhead. Switching between independent deep learning models causes significant overhead because it requires paging in and paging out the entire models. The multi-capacity model alleviates this problem in an elegant manner by only requiring a very small portion of the model to be paged in or out.

Figure 3.5: Illustration of model switching (model upgrade vs. model downgrade) of multi-capacity model.

Figure 3.5 illustrates the details of model switching with the multi-capacity model. For illustration purposes, only one convolutional layer is depicted. As shown, since each descendant model is grown upon its previous descendant models, when the multi-capacity model is switching to a descendant model with larger capability (i.e., model upgrade), it incurs zero page-out overhead, and only needs to page in the extra filters included in the larger descendant model (marked as green squares). When the multi-capacity model is switching to a descendant model with smaller capability (i.e., model downgrade), it incurs zero page-in overhead, and only needs to page out the filters that the smaller descendant model does not have (marked as gray squares). As a result, the multi-capacity model significantly reduces the overhead of model page-in and page-out, making model switching extremely efficient.
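To illustrate how nesting makes switching cheap, the sketch below (ours; the class and the paging stubs are hypothetical, not NestDNN's implementation) stores one convolutional layer of a multi-capacity model as a single filter tensor in which each descendant model is a prefix of the filters, so that an upgrade pages in only the delta and a downgrade pages out only the delta:

```python
import numpy as np

def page_in(filters):
    """Placeholder: load the given filters into runtime memory."""

def page_out(filters):
    """Placeholder: release the given filters from runtime memory."""

class MultiCapacityConvLayer:
    """One convolutional layer of a multi-capacity model (illustrative).

    `weights` holds the filters of the largest descendant model, ordered
    so that the first capacities[i] filters form descendant model i.
    """
    def __init__(self, weights, capacities):
        self.weights = weights              # shape: (m_j, k, k, m_{j-1})
        self.capacities = sorted(capacities)
        self.active = min(self.capacities)  # filters currently paged in

    def switch_to(self, capacity):
        assert capacity in self.capacities
        if capacity > self.active:
            # Model upgrade: zero page-out, page in only the extra filters.
            page_in(self.weights[self.active:capacity])
        elif capacity < self.active:
            # Model downgrade: zero page-in, page out only the extra filters.
            page_out(self.weights[capacity:self.active])
        self.active = capacity
        return self.weights[:capacity]      # filters of the active sub-model

# Hypothetical usage: five nested capacities of a 3x3 layer with 64 inputs.
layer = MultiCapacityConvLayer(np.zeros((256, 3, 3, 64)),
                               capacities=[64, 96, 128, 192, 256])
layer.switch_to(128)  # upgrade: pages in filters 64..127 only
layer.switch_to(96)   # downgrade: pages out filters 96..127 only
```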
3.4.3 Resource-Aware Scheduler

3.4.3.1 Motivation and Key Idea

The creation of the multi-capacity model enables NestDNN to jointly maximize the performance of vision applications that are concurrently running on a mobile vision system. This possibility comes from two key insights. First, while a certain amount of runtime resources can be traded for an accuracy gain in one application, the same amount of runtime resources may be traded for a larger accuracy gain in another application. Second, for applications that do not need real-time response and thus can tolerate a relatively large processing latency, we can reallocate some runtime resources from those latency-tolerant applications to other applications that need more runtime resources to meet their real-time goals. NestDNN exploits these two key insights by encoding the inference accuracy and processing latency into a cost function for each vision application, which serves as the foundation for resource-aware scheduling.

3.4.3.2 Cost Function

Let $V$ denote the set of vision applications that are concurrently running on a mobile vision system, and let $A_{min}(v)$ and $L_{max}(v)$ denote the minimum inference accuracy and the maximum processing latency goals set by the user for application $v \in V$. Additionally, let $M_v$ denote the multi-capacity model generated for application $v$, and let $m_v$ denote a descendant model $m_v \in M_v$. The cost function of the descendant model $m_v$ for application $v$ is defined as follows:

$$C(m_v, u_v, v) = \max\big(0,\, A_{min}(v) - A(m_v)\big) + \alpha \cdot \max\!\left(0,\, \frac{L(m_v)}{u_v} - L_{max}(v)\right) \tag{3.2}$$

where $A(m_v)$ is the inference accuracy of $m_v$, $u_v \in (0, 1]$ is the computing resource percentage allocated to $v$, and $L(m_v)$ is the processing latency of $m_v$ when 100% of the computing resources are allocated to $v$. The values of $A(m_v)$ and $L(m_v)$ are obtained via profiling at the multi-capacity model development stage.

Essentially, the first term in the cost function represents the penalty for selecting a descendant model $m_v$ that has an inference accuracy lower than the minimum accuracy goal. The second term represents the penalty for selecting a descendant model $m_v$ that has a processing latency higher than the maximum processing latency goal. $\alpha \in [0, 1]$ is a knob set by the user to determine the latency-accuracy trade-off preference. A large $\alpha$ weights the latency penalty more heavily, while a small $\alpha$ favors higher accuracy.
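A direct transcription of Equation 3.2 into Python (our sketch; the guard for u = 0 is our own addition to handle the unallocated case):

```python
def cost(accuracy, latency, u, acc_min, lat_max, alpha):
    """Cost C(m_v, u_v, v) of Equation 3.2 for one descendant model.

    accuracy: A(m_v); latency: L(m_v), profiled at 100% compute;
    u: fraction of compute allocated to v; alpha: latency-accuracy knob.
    """
    if u <= 0:
        return float("inf")  # no compute allocated: latency is unbounded
    accuracy_penalty = max(0.0, acc_min - accuracy)    # accuracy shortfall
    latency_penalty = max(0.0, latency / u - lat_max)  # latency overshoot
    return accuracy_penalty + alpha * latency_penalty
```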
3.4.3.3 Scheduling Schemes

Given the cost function of each descendant model of each concurrently running application, the resource-aware scheduler incorporates two widely used scheduling schemes to jointly maximize the performance of concurrent vision applications under two different optimization objectives.

MinTotalCost. The MinTotalCost (i.e., minimize the total cost) scheduling scheme aims to minimize the total cost of all concurrent applications. This optimization problem can be formulated as follows:

$$\min_{u_v,\, m_v \in M_v} \sum_{v \in V} C(m_v, u_v, v) \tag{3.3}$$
$$s.t. \quad \sum_{v \in V} S(m_v) \le S_{max}, \qquad \sum_{v \in V} u_v \le 1$$

where $S(m_v)$ denotes the runtime memory footprint of the descendant model $m_v$. The total memory footprint of all the concurrent applications cannot exceed the maximum memory space of the mobile vision system, denoted as $S_{max}$.

Under the MinTotalCost scheduling scheme, the resource-aware scheduler favors applications with lower costs and thus is optimized to allocate more runtime resources to them.

MinMaxCost. The MinMaxCost (i.e., minimize the maximum cost) scheduling scheme aims to minimize the cost of the application that has the highest cost. This optimization problem can be formulated as follows:

$$\min_{u_v,\, m_v \in M_v} k \tag{3.4}$$
$$s.t. \quad \forall v: C(m_v, u_v, v) \le k, \qquad \sum_{v \in V} S(m_v) \le S_{max}, \qquad \sum_{v \in V} u_v \le 1$$

where the cost of any of the concurrently running applications must not exceed $k$, and $k$ is minimized.

Under the MinMaxCost scheduling scheme, the resource-aware scheduler is optimized to fairly allocate runtime resources to all the concurrent applications to balance their performance.

3.4.3.4 Cached Greedy Heuristic Approximation

Solving the nonlinear optimization problems involved in the MinTotalCost and MinMaxCost scheduling schemes is computationally hard. To enable real-time online scheduling in mobile systems, we utilize a greedy heuristic inspired by [29] to obtain approximate solutions.

Specifically, we define a minimum indivisible runtime resource unit $\Delta u$ (e.g., 1% of the total computing resources in a mobile vision system) and start allocating the computing resources from scratch. For MinTotalCost, we allocate $\Delta u$ to the descendant model $m_v$ of application $v$ such that $C(m_v, \Delta u, v)$ has the smallest cost increase among all concurrent applications. For MinMaxCost, we select the application $v$ with the highest cost $C(m_v, u_v, v)$, allocate $\Delta u$ to $v$, and choose the optimal descendant model $m_v^* = \arg\min_{m_v} C(m_v, u_v, v)$ for $v$. For both MinTotalCost and MinMaxCost, the runtime resources are iteratively allocated until exhausted.

The runtime of executing the greedy heuristic can be further shortened via a caching technique. This is particularly attractive for mobile systems with very limited resources. Specifically, when allocating the computing resources, instead of starting from scratch, we start from a point where a certain amount of computing resources has already been allocated. For example, we can cache the unfinished allocation scheme where 70% of the computing resources have been allocated during optimization. In the next optimization iteration, we directly start from the unfinished allocation scheme and allocate the remaining 30% of the computing resources, thus saving 70% of the optimization time. To prevent falling into a local minimum over time, a complete execution of the greedy heuristic is performed periodically to keep the cached solution close to the optimal one.
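The following sketch shows the greedy allocation for MinTotalCost (ours, simplified; cost(m, u, v) wraps Equation 3.2 with the profiled accuracy and latency of model m, and mem(m) returns S(m)). To keep Equation 3.2 finite at the start, this sketch seeds every application with one unit, a small deviation from the from-scratch description above; the returned allocation can be cached and resumed to implement the caching technique:

```python
def greedy_min_total_cost(apps, models, delta_u, s_max, cost, mem,
                          cached=None):
    """Greedy approximation of MinTotalCost (illustrative sketch).

    apps: application ids; models[v]: list of descendant models of v;
    delta_u: minimum indivisible resource unit (e.g., 0.01 = 1%);
    cached: optional (pick, u) from a previous partial run, letting the
    next scheduling round resume instead of starting from scratch.
    """
    if cached:
        pick, u = dict(cached[0]), dict(cached[1])
    else:
        pick = {v: min(models[v], key=mem) for v in apps}  # smallest models
        u = {v: delta_u for v in apps}  # seed each app with one unit
    while sum(u.values()) + delta_u <= 1.0:
        best, best_total = None, None
        # Try granting one more unit to every (application, model) pair and
        # keep the choice with the lowest total cost that fits in memory.
        for v in apps:
            for m in models[v]:
                if sum(mem(pick[w]) for w in apps if w != v) + mem(m) > s_max:
                    continue
                total = sum(cost(pick[w], u[w], w) for w in apps if w != v)
                total += cost(m, u[v] + delta_u, v)
                if best_total is None or total < best_total:
                    best, best_total = (v, m), total
        if best is None:
            break                       # no feasible grant: stop allocating
        v, m = best
        u[v] += delta_u
        pick[v] = m
    return pick, u                      # cacheable partial/complete scheme
```

The MinMaxCost variant would differ only in the selection rule: grant the next unit to the application whose current cost is highest and re-pick its best descendant model.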
3.5 Evaluation

3.5.1 Datasets, DNNs and Applications

3.5.1.1 Datasets

To evaluate the generalization capability of NestDNN on different vision tasks, we select two types of tasks that are among the most important tasks for mobile vision systems.

Generic-Category Object Recognition. This type of vision task aims to recognize the generic category of an object (e.g., a road sign, a person, or an indoor place). Without loss of generality, we select three commonly used computer vision datasets, containing a small, a medium, and a large number of object categories respectively, representing an easy, a moderate, and a difficult vision task correspondingly.

- CIFAR-10 [40]. This dataset contains 50K training images and 10K testing images belonging to 10 generic categories of objects.
- ImageNet-50 [41]. This dataset is a subset of the ILSVRC ImageNet. It contains 63K training images and 2K testing images belonging to the top 50 most popular object categories based on the popularity ranking provided by the official ImageNet website.
- ImageNet-100 [41]. Similar to ImageNet-50, this dataset is a subset of the ILSVRC ImageNet. It contains 121K training images and 5K testing images belonging to the top 100 most popular object categories based on the popularity ranking provided by the official ImageNet website.

Class-Specific Object Recognition. This type of vision task aims to recognize the specific class of an object within a generic category (e.g., a stop sign, a female person, or a kitchen). Without loss of generality, we select three object categories: 1) road signs, 2) people, and 3) places, which are commonly seen in mobile settings.

- GTSRB [42]. This dataset contains over 50K images belonging to 43 classes of road signs such as speed limit signs and stop signs.
- Adience-Gender [43]. This dataset contains over 14K images of human faces of two genders.
- Places-32 [44]. This dataset is a subset of the Places365-Standard dataset. Places365-Standard contains 1.8 million images of 365 scene classes belonging to 16 higher-level categories. We select two representative scene classes (e.g., parking lot and kitchen) from each of the 16 higher-level categories and obtain a 32-class dataset that includes over 158K images.

3.5.1.2 DNN Models

To evaluate the generalization capability of NestDNN on different DNN models, we select two representative DNN models: 1) VGG-16 and 2) ResNet-50. VGG-16 [36] is considered one of the most straightforward DNN models to implement, and thus gains considerable popularity in both academia and industry. ResNet-50 [35] is considered one of the top-performing DNN models in computer vision due to its superior recognition accuracy.

3.5.1.3 Mobile Vision Applications

Without loss of generality, we randomly assign CIFAR-10, GTSRB, and Adience-Gender to VGG-16, and assign ImageNet-50, ImageNet-100, and Places-32 to ResNet-50 to create six mobile vision applications labeled as VC (i.e., VGG-16 trained on the CIFAR-10 dataset), RI-50, RI-100, VS, VG, and RP, respectively. We train and test all the vanilla DNN models and all the descendant models generated by NestDNN by strictly following the protocol provided by each of the six datasets described above.

The datasets, DNN models, and mobile vision applications are summarized in Table 3.2.

Table 3.2: Summary of datasets, DNN models, and mobile vision applications used in this work.

  Type              Dataset         DNN Model  Mobile Vision Application
  Generic           CIFAR-10        VGG-16     VC
  Category          ImageNet-50     ResNet-50  RI-50
                    ImageNet-100    ResNet-50  RI-100
  Class             GTSRB           VGG-16     VS
  Specific          Adience-Gender  VGG-16     VG
                    Places-32       ResNet-50  RP

3.5.2 Performance of Multi-Capacity Model

In this section, we evaluate the performance of the generated multi-capacity model with the goal of demonstrating its superiority listed in Section 3.4.2.3.

3.5.2.1 Experimental Setup

Selection of Descendant Models. Without loss of generality, for each mobile vision application, we generate a multi-capacity model that contains five descendant models. These descendant models are designed to have diverse resource-accuracy trade-offs. We select these descendant models with the purpose of demonstrating that our multi-capacity model enables applications to run even when the available resources are very limited. It should be noted that a multi-capacity model is neither limited to one particular set of resource-accuracy trade-offs nor limited to one particular number of descendant models. NestDNN provides the flexibility to design a multi-capacity model based on users' preferences.

Baseline. To make a fair comparison, we use the same architecture of descendant models for baseline models such that their model sizes and computational costs are identical. In addition, we pre-trained the baseline models on the ImageNet dataset and then fine-tuned them on each of the six datasets. Pre-training is an effective way to boost accuracy, and thus is adopted as a routine in the machine learning community [45, 46]. We also trained baseline models without pre-training, and observed that baseline models with pre-training consistently outperform those without pre-training. Therefore, we only report the accuracies of baseline models with pre-training.

3.5.2.2 Optimized Resource-Accuracy Trade-offs

Figure 3.6: Top-1 test accuracy vs. model size comparison between descendant models and baseline models.

Figure 3.6 illustrates the comparison between descendant models and baseline models across the six mobile vision applications. For each application, we show the top-1 test accuracies of both descendant models and baseline models as a function of model size. For better illustration, the horizontal axis is plotted using the logarithmic scale.
We have two key observations from the result. First, we observe that descendant models consistently achieve higher accuracies than baseline models at every model size across all six applications. On average, descendant models achieve 4.98% higher accuracy than baseline models. This indicates that our descendant model at each capacity is able to deliver state-of-the-art inference accuracy under a given memory budget. Second, we observe that smaller descendant models outperform baseline models by a larger margin than larger descendant models do. On average, the two smallest descendant models achieve 6.68% higher accuracy while the two largest descendant models achieve 3.72% higher accuracy compared to their corresponding baseline models. This is because our TRR approach is able to preserve important filters while pruning less important ones. Despite having a small capacity, a small descendant model benefits from these important filters while the corresponding baseline model does not.

Figure 3.7: Computational cost comparison between descendant models and vanilla models.

Figure 3.7 shows the computational costs of the five descendant models and the corresponding vanilla models of the six applications in GFLOPs (i.e., GigaFLOPs). As shown, all descendant models have fewer GFLOPs than the corresponding vanilla models. This result indicates that our filter pruning approach is able to effectively reduce the computational costs across the six applications, demonstrating the generalization of our filter pruning approach on different deep learning models trained on different datasets.

3.5.2.3 Reduction on Memory Footprint

Another key feature of the multi-capacity model is sharing parameters among its descendant models. To quantify the benefit of parameter sharing on reducing memory footprint, we compare the model size of the multi-capacity model with the accumulated model size of the five descendant models as if they were independent. This mimics the traditional model variants used in existing mobile deep learning systems.

Table 3.3: Benefit of multi-capacity model on memory footprint reduction.

  Application   Multi-Capacity    Accumulated       Reduced Memory
                Model Size (MB)   Model Size (MB)   Footprint (MB)
  VC            196.0             437.5             241.5
  VS            12.9              19.8              6.9
  VG            123.8             256.0             132.2
  RI-50         42.4              58.1              15.7
  RI-100        87.1              243.5             156.4
  RP            62.4              97.1              34.7
  All Included  524.6             1112.0            587.4

Table 3.3 lists the comparison results across the six mobile vision applications. Obviously, the model size of the multi-capacity model is smaller than the corresponding accumulated model size for each application. Moreover, deep learning models with larger model sizes benefit more from parameter sharing. For example, VC has the largest model size across the six applications. With parameter sharing, it achieves a reduced memory footprint of 241.5 MB. Finally, if we consider running all six applications concurrently, the multi-capacity model achieves a reduced memory footprint of 587.4 MB, demonstrating the enormous benefit of the multi-capacity model on memory footprint reduction.

3.5.2.4 Reduction on Model Switching Overhead

Another benefit of parameter sharing is reducing the overhead of model switching when the set of concurrent applications changes. To quantify this benefit, we consider all the possible model switching cases among all five descendant models of each multi-capacity model, and calculate the average page-in and page-out overhead for model upgrade and model downgrade, respectively. We compare it with the case where descendant models are treated as if they were independent, which again mimics traditional model variants.
Table 3.4 lists the comparison results of all six mobile vision applications for model upgrade in terms of memory usage. As expected, the average page-in and page-out memory usage of independent models during model switching is larger than that of multi-capacity models for every application. This is because during model switching, independent models need to page in and page out the entire models while multi-capacity models only need to page in a very small portion of the models. It should be noted that the page-out overhead of the multi-capacity model during model upgrade is zero. This is because the descendant model with smaller capability is part of the descendant model with larger capability, and thus it does not need to be paged out.

Table 3.5 lists the comparison results of all six applications for model downgrade. Similar results are observed. The only difference is that during model downgrade, the page-in overhead of the multi-capacity model is zero.

Besides memory usage, we also quantify the benefit of reducing the overhead of model switching in terms of energy consumption. Specifically, we measured the energy consumed by randomly switching models 250, 500, 750, and 1,000 times using descendant models and independent models, respectively.

Figure 3.8: Model switching energy consumption comparison between multi-capacity models and independent models. The energy consumption is measured on a Samsung Galaxy S8 smartphone.

Figure 3.8 shows the comparison results across all six mobile vision applications. As expected, the energy consumed by switching multi-capacity models is lower than that consumed by switching independent models for every application. This benefit becomes more prominent when the model size is large. For example, the size of the largest descendant model of VC and VS is 196.0 MB and 12.9 MB, respectively. The corresponding energy consumption reduction for every 1,000 model switches is 602.1 J and 4.0 J, respectively.

Taken together, the generated multi-capacity model is able to significantly reduce the overhead of model switching in terms of both memory usage and energy consumption. The benefit becomes more prominent when the model switching frequency increases. This is particularly important for memory- and battery-constrained mobile systems.

Application    Multi-Capacity Model         Independent Models
               Upgrade Overhead (MB)        Upgrade Overhead (MB)
               Page-In      Page-Out        Page-In      Page-Out
VC             81.4         0               128.2        46.8
VS             1.3          0               1.7          0.3
VG             50.0         0               76.2         26.2
RI-50          19.2         0               21.2         2.0
RI-100         38.3         0               67.9         29.5
RP             26.4         0               34.9         4.6

Table 3.4: Benefit of multi-capacity model on model switching (model upgrade) in terms of memory usage.

Application    Multi-Capacity Model         Independent Models
               Downgrade Overhead (MB)      Downgrade Overhead (MB)
               Page-In      Page-Out        Page-In      Page-Out
VC             0            81.4            46.8         128.2
VS             0            1.3             0.3          1.7
VG             0            50.0            26.2         76.2
RI-50          0            19.2            2.0          21.2
RI-100         0            38.3            29.5         67.9
RP             0            26.4            4.6          34.9

Table 3.5: Benefit of multi-capacity model on model switching (model downgrade) in terms of memory usage.

3.5.3 Performance of Resource-Aware Scheduler

3.5.3.1 Experimental Setup

Deployment Platforms. We implemented NestDNN and the six mobile vision applications on three smartphones: Samsung Galaxy S8, Samsung Galaxy S7, and LG Nexus 5, all running Android OS 7.0. We used the Monsoon power monitor [47] to measure the power consumption. We have achieved consistent experimental results across all three smartphones. We only report the best results, which were obtained from the Galaxy S8.

Baseline. We use the model located at the knee of every yellow curve in Figure 3.6 as our baseline for each mobile vision application. This is the model that offers the best resource-accuracy trade-off among all the model variants.
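The dissertation identifies these knee points from the curves directly. For readers who want to automate the choice, the following is a minimal sketch of one common knee heuristic (maximum distance from the chord joining the curve's endpoints); this heuristic is an illustrative assumption, not the method used in the evaluation.

```python
import numpy as np

def find_knee(sizes_mb, accuracies):
    """Return the index of the knee of an accuracy-vs-size curve.

    Heuristic: the knee is the point farthest from the straight line
    joining the first and last points. Sizes are compared on a log
    scale, matching how the curves are plotted in Figure 3.6.
    """
    x = np.log(np.asarray(sizes_mb, dtype=float))
    y = np.asarray(accuracies, dtype=float)
    # Normalize both axes so distances are comparable.
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    p0, p1 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    chord = (p1 - p0) / np.linalg.norm(p1 - p0)
    pts = np.stack([x, y], axis=1) - p0
    # Perpendicular distance of each point from the chord.
    dist = np.abs(pts[:, 0] * chord[1] - pts[:, 1] * chord[0])
    return int(np.argmax(dist))

# Hypothetical model variants: (size MB, top-1 accuracy).
sizes = [5.0, 15.0, 40.0, 90.0, 196.0]
accs = [0.78, 0.86, 0.91, 0.92, 0.925]
print("knee at variant", find_knee(sizes, accs))  # -> 2 for this curve
```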
Benchmark Design. We have designed a benchmark that simulates runtime application queries in different scenarios. Specifically, our benchmark creates a new application or kills a running application with certain probabilities at every second. The number of concurrently running applications ranges from 2 to 6. At the same time, the maximum available memory to run concurrent applications is set to 400 MB. Each simulation generated by our benchmark lasts for 60 seconds. We repeat the simulation 100 times and report the average runtime performance.

Figure 3.9 illustrates the profile of all accumulated simulations generated by our benchmark. Figure 3.9(a) shows the time distribution of different numbers of concurrent applications. As shown, the percentage of time when two, three, four, five, and six applications run concurrently is 9.8%, 12.7%, 18.6%, 24.5%, and 34.3%, respectively. This shows that our benchmark has covered all the numbers of concurrent applications. Figure 3.9(b) shows the running time distribution of each individual application. As shown, the running time of each application is evenly distributed, indicating that our benchmark ensures a fair time share among all applications.

Figure 3.9: Profile of all accumulated simulations generated by our designed benchmark.

Evaluation Metrics. We use the following two metrics to evaluate the scheduling performance.

Accuracy Gain. Since the absolute top-1 accuracies achieved by the six vision applications are not in the same range, we use the accuracy gain over the baseline within each application as a more meaningful metric.

Frame Rate. We use frame rate as the metric to measure the real-time performance of a vision application. It is a standard metric for mobile vision systems. Frame rate is inversely proportional to inference latency: the lower the inference latency, the higher the frame rate.

Figure 3.10: Runtime performance comparison between baseline and NestDNN under scheduling scheme: (a) MinTotalCost; (b) MinMaxCost.

3.5.3.2 Improvement on Accuracy and Frame Rate

Figure 3.10(a) shows the comparison between the baseline and NestDNN under the MinTotalCost scheduling scheme. Each blue diamond marker represents the runtime performance obtained by scheduling with a particular value of α in the cost function defined in Section 3.4.3.2. The yellow circle represents the runtime performance of the baseline.

We have two key observations from our comparison result. First, by adjusting the value of α, NestDNN is able to provide a variety of trade-offs between accuracy and frame rate. In contrast, the baseline is fixed and offers no trade-off between accuracy and frame rate. Second, the vertical dotted line and the horizontal dotted line together partition the figure into four quadrants. The upper right quadrant represents the region that has both higher accuracy gain and higher frame rate compared to the baseline. As shown, many blue diamond markers are located in this upper right quadrant. At each of those blue diamond markers, NestDNN is able to achieve both higher accuracy gain and higher frame rate compared to the baseline. In particular, we select three of those blue diamond markers to demonstrate the runtime performance improvement achieved by NestDNN. Specifically, when NestDNN has the same accuracy gain as the baseline, NestDNN achieves 2.0× the frame rate (in frames per second, FPS) of the baseline. When NestDNN has the same frame rate as the baseline, NestDNN achieves a 4.1% accuracy gain over the baseline. Finally, we select the knee of the blue diamond curve, which offers the best accuracy-frame rate trade-off among all values of α. At the knee, NestDNN achieves 1.5× the frame rate and a 2.6% accuracy gain compared to the baseline.
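To illustrate how an α-parameterized scheduler trades accuracy for latency under a memory budget, below is a minimal MinTotalCost-style sketch. The cost form (1 − accuracy) + α·latency and all variant numbers are illustrative assumptions; the actual cost function is defined in Section 3.4.3.2, and NestDNN solves the selection with the cached greedy heuristic of Section 3.4.3.4 rather than by exhaustive search.

```python
from itertools import product

# Hypothetical descendant variants per app: (memory MB, accuracy, latency ms).
APPS = {
    "VC": [(30, 0.80, 20), (80, 0.86, 35), (196, 0.90, 60)],
    "RP": [(10, 0.82, 10), (30, 0.88, 18), (62, 0.91, 30)],
}
BUDGET_MB = 400  # maximum memory available to concurrent applications

def cost(acc, latency_ms, alpha):
    # Illustrative cost: accuracy loss traded against latency via alpha.
    return (1.0 - acc) + alpha * (latency_ms / 1000.0)

def min_total_cost(apps, alpha, budget_mb):
    """Exhaustive MinTotalCost selection (fine for a handful of apps)."""
    best, best_choice = float("inf"), None
    names = list(apps)
    for combo in product(*(apps[n] for n in names)):
        if sum(v[0] for v in combo) > budget_mb:
            continue  # violates the memory budget
        total = sum(cost(v[1], v[2], alpha) for v in combo)
        if total < best:
            best, best_choice = total, dict(zip(names, combo))
    return best_choice

# A larger alpha favors faster (lower-latency) descendants.
for alpha in (0.1, 5.0):
    print(alpha, min_total_cost(APPS, alpha, BUDGET_MB))
```

Sweeping α in this way traces out exactly the kind of accuracy/frame-rate frontier shown by the blue diamond markers in Figure 3.10.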
Figure 3.10(b) shows the comparison between the baseline and NestDNN under the MinMaxCost scheduling scheme. When NestDNN has the same accuracy gain as the baseline, NestDNN achieves 1.9× the frame rate of the baseline. When NestDNN has the same frame rate as the baseline, NestDNN achieves a 4.2% accuracy gain over the baseline. At the knee, NestDNN achieves 1.5× the frame rate and a 2.1% accuracy gain compared to the baseline.

Figure 3.11: Energy consumption comparison between baseline and NestDNN under scheduling scheme: (a) MinTotalCost; (b) MinMaxCost.

3.5.3.3 Reduction on Energy Consumption

Finally, we evaluate the inference energy consumption during scheduling. In this experiment, we evaluate the inference energy consumption of NestDNN at the knee.

Figure 3.11(a) shows the comparison between the baseline and NestDNN under the MinTotalCost scheduling scheme. Across different numbers of inferences, NestDNN achieves an average 1.7× energy consumption reduction compared to the baseline. Similarly, Figure 3.11(b) shows the comparison between the baseline and NestDNN under the MinMaxCost scheduling scheme. NestDNN achieves an average 1.5× energy consumption reduction compared to the baseline.

3.6 Conclusion of NestDNN

In this chapter, we presented the design, implementation and evaluation of NestDNN, a framework that enables resource-aware multi-tenant on-device deep learning for mobile vision systems. It contributes novel techniques that address the unique challenges of mobile vision systems. We evaluated NestDNN using six mobile vision applications that target some of the most important vision tasks for mobile vision systems. Our results show that compared to the non-resource-aware counterpart, NestDNN achieves as much as a 4.2% increase in accuracy, a 2.0× increase in frame rate, and a 1.7× reduction in energy consumption when running multiple mobile vision applications concurrently.

Chapter 4

Content-Adaptive On-Device Deep Learning

4.1 Introduction of FlexDNN

Mobile vision systems such as mobile phones, drones, and augmented reality headsets are ubiquitous today. Driven by recent breakthroughs in Deep Neural Networks (DNNs) [2] and the emergence of AI chipsets, state-of-the-art mobile vision systems have started to use DNN-based processing pipelines for on-device video stream analytics, which acts as the enabler of a wide range of continuous mobile vision applications.

On-device video stream analytics requires processing streaming video frames at high throughput and returning the processing results with low latency. Unfortunately, DNNs are known to be computationally expensive [3], and high computational consumption directly translates to high processing latency and high energy consumption. Given that mobile systems are constrained by limited compute resources and battery capacities, reducing the computational consumption of DNN-based pipelines is crucial to high-throughput, low-latency, and low-energy on-device video stream analytics.

To reduce computational consumption, most existing work pursues model compression techniques [9, 37, 48]. However, model compression yields a one-size-fits-all network that requires the same set of feature maps to be extracted for all video frames, agnostic to the content of each frame.

In fact, the computation consumed by a DNN-based processing pipeline is heavily dependent on the content of the video frames [20]. For video frames with contents that are easy to recognize, a small low-capacity DNN model is sufficient while a large high-capacity DNN model that consumes more computation is overkill; on the other hand, for video frames with contents that are hard to recognize, it is necessary to employ large high-capacity DNN models in the processing pipeline to ensure the contents are correctly recognized. This is very similar to how the human vision system works, where a glimpse is sufficient to recognize simple scenes and objects in ordinary poses, whereas more attention and effort are needed to understand complex scenes and objects that are complicated or partially occluded [49].
Based on this key insight, content-adaptive video stream analytics systems such as Chameleon [18] have recently emerged. Leveraging the dynamics of video contents, these systems effectively reduce the computational consumption of DNN-based processing pipelines by dynamically changing the DNN models in the pipeline to adapt to the difficulty levels of the video frames. However, this dynamic configuration approach is a mismatch for resource-constrained mobile systems. This is because it requires all the model variants with various capacities to be installed in the mobile system, which results in a large memory footprint. More importantly, if a large number of model variants is incorporated and the content dynamics is substantial, the overhead of searching for the optimal model variant and switching models at runtime can be prohibitively expensive, which considerably dwarfs the benefit brought by adaptation.

The limitation of Chameleon is rooted in the constraint that it requires multiple model variants with various capacities to adapt to the various difficulty levels of video frames. To address this limitation, state-of-the-art approaches such as BranchyNet [19] and MSDNet [20] introduce the idea of constructing a single model by adding early exits at intermediate layers of a regular DNN model to make early predictions. With such an early prediction mechanism, easy frames do not need to go through all the layers, and their computational consumption is thus reduced. While the concept of early prediction is promising, these pioneer works are constrained by the following limitations:

These pioneer works are coarse-grained approaches where early exits are inserted at the outputs of convolutional layers of a DNN model. Each convolutional layer is composed of a large number of convolutional filters, and these filters dominate the computational consumption of the DNN model. However, not all the filters within each convolutional layer are needed to early-exit easy frames. As a consequence, computation consumed by those unnecessary filters is wasted by the coarse-grained approach due to its constraint of making early predictions at the granularity of layers.

The early exits themselves also consume computation. Computation consumed by frames that fail to exit at the early exits is wasted and becomes the overhead incurred by this approach. Unfortunately, the early exit architecture of prior works is designed based on heuristics without focusing on the trade-off between early exit rates and the incurred overheads. Without carefully accounting for this trade-off, the incurred overheads could diminish the benefit of early exits.

Lastly, the number and locations of the inserted early exits in prior works are also determined based on heuristics. While effective in comparison to models without early exits, considering the exponential number of combinations of early exit numbers and locations, it would be extremely challenging, even for developers with extensive domain expertise, to derive an early exit insertion plan that can fully leverage the computational consumption reduction benefit brought by early exits without considerable trial-and-error effort. Moreover, since early exits incur overheads, the number and locations of the inserted early exits play a critical role in determining the amount of computation that can be saved, making the derivation of the early exit insertion plan even more challenging.

Figure 4.1: A conceptual overview of FlexDNN. FlexDNN is built on top of a base model with the augmentation of one or more early exits inserted throughout the base model. For an easy input, it exits at the early exit inserted at an earlier location since the extracted features are good enough to classify the content in the easy input. As such, the easy input avoids further computational cost incurred onward. For a hard input, it proceeds deeper until the extracted features are good enough to exit the hard input.
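To make the early exit mechanism of Figure 4.1 concrete, the following is a minimal inference-loop sketch, assuming the base model is split into sequential stages with lightweight exit heads attached after some of them. All names are illustrative; for simplicity the sketch uses the top class probability as the confidence measure, whereas FlexDNN's actual entropy-based score is defined later in Eq. (4.1).

```python
def early_exit_inference(blocks, exit_heads, thresholds, final_head, frame):
    """Run one frame through the base model's stages, stopping early
    once an exit head is confident enough.

    blocks      -- base model split into sequential stages (callables)
    exit_heads  -- {stage index: exit head returning class probabilities}
    thresholds  -- {stage index: confidence threshold for that exit}
    final_head  -- classifier applied if no early exit fires
    """
    x = frame
    for i, block in enumerate(blocks):
        x = block(x)                       # computation paid for this stage
        if i in exit_heads:
            probs = exit_heads[i](x)       # lightweight early-exit head
            # Stand-in confidence: the top class probability.
            if max(probs) >= thresholds[i]:
                return probs, i            # easy frame exits here
    return final_head(x), len(blocks)      # hard frame runs the full model
```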
In this paper, we present FlexDNN, a content-adaptive framework for computation-efficient DNN-based mobile video stream analytics. Figure 4.1 provides a conceptual overview of FlexDNN. FlexDNN effectively addresses these limitations with three novel contributions:

Fine-Grained Early Prediction. Instead of making early predictions at the granularity of layers, FlexDNN breaks the natural boundaries at layers and adopts a fine-grained approach to make early predictions at the granularity of filters. For this, we have designed a novel filter importance ranking scheme based on the concept of collective importance to identify redundant filters within each convolutional layer. With such a fine-grained approach, FlexDNN is able to provide much more flexibility in making early predictions. Moreover, it cuts off unnecessary computation wasted by redundant filters within each layer, which coarse-grained approaches could not achieve.

Optimal Architecture. FlexDNN is able to generate the optimal architecture for each early exit and the optimal early exit insertion plan. For this, we have designed an architecture search-based scheme that is able to find the optimal architecture that balances the trade-off between early exit rate and computational overhead for each early exit. We have also formulated the determination of the number and locations of early exits as an optimization problem to derive the optimal early exit insertion plan that maximizes the benefit brought by content adaptation. As such, FlexDNN allows developers with limited domain expertise to build DNN-based computation-efficient continuous mobile vision applications with minimal effort.

Runtime Adaptation. At runtime, mobile vision systems may experience workload dynamics under different contexts as well as system resource dynamics due to multi-tenancy. As its final contribution, the design of FlexDNN provides a natural mechanism to adapt to both workload and system resource dynamics at runtime.

We implement FlexDNN in TensorFlow [50] and use it to build three representative continuous mobile vision applications: Activity Recognition, Scene Understanding, and Traffic Surveillance, based on VGGNet and MobileNets. We use real-world datasets with videos taken by mobile devices to profile these applications and evaluate their runtime performance on both mobile CPU and mobile GPU platforms.

While MobileNets (content-agnostic models) are known for their computation-efficient designs, our evaluation shows that FlexDNN significantly outperforms MobileNets due to its content-adaptive capability, achieving up to 31.0% reduction in computation consumption and up to 29.6% reduction in energy consumption while having the same accuracy. We also compare FlexDNN to BranchyNet, a recent state-of-the-art content-adaptive model that is based on a coarse-grained early exit design. Our results show that besides the benefit of automatically generating the optimal early exit architecture and the optimal early exit insertion plan, FlexDNN significantly outperforms BranchyNet due to its fine-grained design and architectural superiority, achieving up to 41.2% reduction in computation consumption and up to 38.4% reduction in energy consumption while having the same accuracy. Finally, our evaluation shows that under workload and system resource dynamics at runtime, FlexDNN is able to adapt to the dynamics and achieve much higher accuracies than BranchyNet.
4.2 Background & Motivation

4.2.1 Dynamics of Mobile Video Contents & Benefit of Leveraging the Dynamics

Due to the mobility of cameras, videos taken in real-world mobile settings exhibit substantial content dynamics in terms of difficulty level across frames over time. To illustrate this, Figure 4.2 shows four frames of a video clip of biking captured using a mobile camera in the human activity video dataset UCF-101 [51]. Among them, since the entirety of both the biker and her bike is captured, frames (a) and (d) are relatively easier to recognize as biking activity. In contrast, frames (b) and (c) capture the biker with only part of the bike, and are thus relatively harder to recognize. In such a case, a smaller model is sufficient for frames (a) and (d), but a more complex model is necessary for frames (b) and (c).

Figure 4.2: Illustration of four frames of a video clip of biking captured using a mobile camera in the UCF-101 dataset: (a) and (d) are frames with contents that are easy to recognize; (b) and (c) are frames with contents that are hard to recognize.

Figure 4.3: Blue solid curve: minimum computational consumption to correctly recognize the content in each frame (optimal model). Red dotted curve: computational consumption of the one-size-fits-all model.

The intrinsic dynamics of video contents create an opportunity to reduce computational consumption by matching the capacity of the DNN model to the difficulty level of each video frame. To quantify how much computational consumption can be reduced, we first profile the minimum computational consumption, in terms of the number of floating point operations (FLOPs), that is needed to correctly recognize the content in each frame of a 400-frame video clip. Specifically, we derive ten model variants with different capacities from MobileNetV1 by varying its numbers of layers and filters. For each frame, we select the model variant with the lowest FLOPs that is able to correctly recognize the content in that particular frame (optimal model). We then compare it, frame by frame, to the model variant with the lowest FLOPs that is able to correctly recognize the contents in all 400 frames (one-size-fits-all model).

Figure 4.3 shows our profiling result. As shown in the blue solid curve, the minimum computation consumed to correctly recognize the content in each frame changes frequently across frames. This observation strongly reflects the intrinsic dynamics of video contents illustrated in Figure 4.2. In addition, the difference between the areas under the two curves reflects the benefit brought by the optimal model. The large difference indicates that considerable computational consumption can be reduced by matching the capacity of the model to the difficulty level of each video frame.

4.2.2 Drawbacks of Existing Solutions

The benefit of computation reduction motivates dynamically changing the model capacity to adapt to the contents of video frames. To achieve content adaptation, existing solutions such as Chameleon [18] use dynamic configuration, in which multiple model variants with different capacities are installed and the system dynamically switches among them at runtime. While effective as a solution for resourceful systems, dynamic configuration is a mismatch for resource-constrained mobile platforms for the following two reasons.

First, dynamic configuration requires all the model variants with various capacities to be installed in the mobile system. This is, unfortunately, not a scalable solution and could lead to a large memory footprint. For the example used in Figure 4.3, a single MobileNetV1 model is 14 MB but the total memory footprint of ten model variants is 67 MB; the memory footprint would only increase if the number of model variants increases.
Second, dynamic configuration incurs large overheads in searching for the optimal model variant and switching models at runtime. Take model switching as an example. Model switching involves two steps: model initialization (i.e., allocating memory space for the model to switch to) and parameter loading (i.e., loading the model parameters into the allocated memory space). As shown in Figure 4.3, model switching could occur very frequently (110 times within 400 frames) because the contents of videos captured by mobile cameras can change quite drastically in a short period of time. To quantify the model switching overhead, we profile the average processing time of both model initialization and parameter loading over all the model switches that occurred in the same 400-frame video clip on the Samsung Galaxy S8 smartphone CPU. To make the result more meaningful, we also profile the average inference time per frame of the optimal model and the one-size-fits-all model.

Figure 4.4: Benefit brought by the adaptation vs. model switching overhead of the dynamic configuration approach.

Figure 4.4 shows our result. The average inference time per frame of the one-size-fits-all model and the optimal model is 79.5 ms and 44.1 ms, respectively. The difference between them measures the benefit brought by the adaptation: by using the optimal model instead of the one-size-fits-all model on each frame, we are able to reduce 44.5% of the computation. However, the average processing time of model initialization and parameter loading is 18.9 ms and 9.6 ms, respectively. This model switching overhead drops the actual computation reduction to only 8.7%, which cuts the benefit brought by the adaptation significantly.

Figure 4.5: FlexDNN architecture.

4.3 Design of FlexDNN

In this work, we present FlexDNN, a content-adaptive framework for computation-efficient DNN-based mobile video stream analytics that effectively addresses the drawbacks of existing solutions. Figure 4.5 illustrates the high-level architecture of FlexDNN. As an overview, to address the limitation caused by coarse-grained design, FlexDNN adopts a fine-grained design to make early predictions at the granularity of filters. It incorporates a collective importance-based filter ranking scheme to rank the importance of filters within each convolutional layer. Such filter ranking lays the foundation of FlexDNN and allows us to systematically generate the optimal architecture for each early exit through an architecture search scheme and to derive the optimal early exit insertion plan through an optimization formulation. Lastly, the flexible architecture of the content-adaptive computation-efficient model generated by FlexDNN provides a natural mechanism to offer a trade-off between accuracy and resource demand (cf. Section 3.4.2). Such a mechanism allows FlexDNN to adapt to both workload and system resource dynamics at runtime.

4.3.1 Filter Ranking based on Collective Importance

4.3.1.1 Motivation

The foundation of FlexDNN is to rank filters based on their importance and identify less important filters within each layer. To rank the importance of filters of a particular layer, existing work adopts an approach that ranks each filter independently based on a pre-defined importance indicator. For instance, in [1], the L2-norm is used as the indicator, based on the heuristic that important filters tend to have larger L2-norm values. Unfortunately, this approach has a key drawback: it ignores the dependence between filters in each layer. As a result, the information contained in the top-ranked filters can be highly overlapped, which is the root cause of redundancy.
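For reference, the independent ranking approach critiqued above can be sketched in a few lines; the array shape is an illustrative assumption.

```python
import numpy as np

def l2_norm_filter_ranking(conv_weights):
    """Rank filters of one convolutional layer independently by L2-norm.

    conv_weights: array of shape (num_filters, in_channels, k, k).
    Returns filter indices from most to least important. Note that two
    filters with large norms but near-identical weights both rank high,
    even though together they contribute largely redundant information.
    """
    norms = np.sqrt((conv_weights ** 2).sum(axis=(1, 2, 3)))
    return np.argsort(-norms)

# Example: 64 filters of a 3x3 conv over 32 input channels.
w = np.random.randn(64, 32, 3, 3)
print(l2_norm_filter_ranking(w)[:5])  # indices of the 5 "most important"
```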
4.3.1.2 Collective Importance-based Filter Ranking

To address the drawback of existing filter importance ranking approaches, we propose a filter importance ranking scheme that takes filter dependence into account to rank the collective importance, instead of the individual importance, of filters within each layer.

Our collective importance-based filter ranking scheme is inspired by the sequential backward selection (SBS) approach in the feature selection literature [52]. At a high level, our scheme starts with all filters, and iteratively removes from the filter set the filter whose removal least reduces the inference accuracy.

Specifically, our scheme maintains two lists, ranked and unranked, with ranked initialized as empty and unranked initialized with the full set of filters included in a convolutional layer. For each filter f_i in unranked, we temporarily drop it from unranked, add an early exit with the feature maps generated by the filters inside unranked as its input, and fine-tune the parameters of the early exit. Next, we obtain the validation accuracy of this early exit and store it in a table acc as a key-value pair, with the key being f_i and the value being the validation accuracy. The dropped filter f_i is then added back to unranked. This procedure iterates until all the filters have been gone through. Finally, the filter corresponding to the highest accuracy in acc is identified as the least important filter in unranked. Hence, we permanently drop this filter from unranked and insert it at the top of ranked. This process terminates when the number of filters in unranked falls below 1/5 of the filters of this layer, because empirically the feature maps extracted by fewer than 1/5 of the filters are not enough for an early exit to make early predictions with high confidence. We treat each of the remaining filters in unranked as equally important, and add all of these remaining filters inside unranked to the top of ranked. Based on the ranked filters inside ranked, FlexDNN is able to find the minimum number of filters that an early exit needs to achieve the same accuracy as using all the filters within each layer. This is achieved by iteratively dropping the lowest-ranked filters until the accuracy of the early exit starts to drop. (A code sketch of this procedure is given at the end of Section 4.3.1.)

4.3.1.3 Superiority of Collective Importance-based Filter Ranking

We compare our scheme to the independent filter importance ranking scheme based on the L2-norm. To do this, we use UCF-15 as the video dataset and MobileNetV1 as our base model to rank the importance of filters of each layer. We find the minimum number of filters needed to achieve the same accuracy as using all filters within each layer for both our scheme (Fine-Grained-SBS) and the independent filter importance ranking scheme based on the L2-norm (Fine-Grained-L2).

Figure 4.6: Comparison of the minimum number of filters needed to achieve the same accuracy as using all filters within each layer, between the collective importance-based ranking scheme and the scheme based on independent ranking.

Figure 4.6 shows the result. Due to space limitations, we do not show the results of L11 to L13 since they are similar to those of L10. As shown, the minimum number of filters obtained by our collective importance-based filter ranking scheme is lower than that obtained with the L2-norm across all the layers. In particular, our scheme is able to identify up to 28.1% more redundant filters compared to the L2-norm. This is because our scheme accounts for the dependence between filters within each layer and thus only selects filters that contain complementary information.
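The following sketch restates the ranking procedure above in code. The helpers attach_exit_and_finetune and validation_accuracy stand in for the early-exit fine-tuning and evaluation steps described in the text and are assumptions for illustration.

```python
def rank_filters_collective(filter_ids, attach_exit_and_finetune,
                            validation_accuracy):
    """Sequential backward selection over one conv layer's filters.

    attach_exit_and_finetune(subset) -> early exit head fine-tuned on
    the feature maps of `subset`; validation_accuracy(head) -> accuracy.
    Returns filter ids ordered from most to least important.
    """
    ranked, unranked = [], list(filter_ids)
    while len(unranked) > len(filter_ids) / 5:  # stop near 1/5 remaining
        acc = {}
        for f in unranked:
            trial = [g for g in unranked if g != f]   # temporarily drop f
            head = attach_exit_and_finetune(trial)
            acc[f] = validation_accuracy(head)
        least = max(acc, key=acc.get)    # dropping it hurts accuracy least
        unranked.remove(least)           # permanently drop it ...
        ranked.insert(0, least)          # ... and push it to the top of ranked
    # The survivors are treated as equally (and most) important.
    return unranked + ranked
```

Each outer iteration removes exactly one filter, so the loop performs O(N^2) early-exit fine-tunings for a layer of N filters; this is a one-time offline cost paid when the model is generated.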
4.3.2 Searching for the Optimal Early Exit Architecture

4.3.2.1 Motivation

An early exit is essentially a self-contained neural network that can make early predictions. However, designing the architecture of early exits is not trivial: early exits consume computation, and computation consumed by frames that fail to exit is wasted and becomes overhead. As such, it is necessary to minimize such overheads by using the smallest number of layers and filters to build each early exit. However, early exits with such extremely lightweight architectures could exit far fewer frames, diminishing the benefit of early exits. Therefore, there exists a trade-off between early exit rates and computational overheads in the design space of early exit architectures. Moreover, the locations at which early exits are inserted also affect the early exit rates. This requires us to design the architecture of each early exit based on its insertion location, which makes the task of early exit architecture design even more challenging.

Existing work such as BranchyNet designs the architecture of early exits based on heuristics and considerable trial-and-error effort, but does not provide detailed design guidelines due to the complexity of this design task. As a result, this approach may not generalize to other models. Moreover, it forces developers to have decent knowledge of DNN architectures and to spend considerable effort on trial and error, prohibiting its wide adoption in practice.

4.3.2.2 Early Exit Architecture Search Scheme

To address this issue, we adopt a fundamentally different approach. Instead of manually designing the architecture based on heuristics and trial and error, we propose a scheme based on architecture search to find, for each early exit, the optimal architecture that optimizes the trade-off between early exit rate and computational overhead.

In general, network architecture search techniques can be grouped into two categories: 1) the bottom-up approach that searches for an optimal cell structure based on reinforcement learning (RL) or genetic evolution and stacks cells together to form a network; and 2) the top-down approach that prunes an over-parameterized network until the optimal network architecture is found [1, 56]. Although both approaches work in our scenario, in this work we select the top-down approach due to its more efficient search process.

At a high level, our early exit architecture search scheme starts by inserting an over-parameterized early exit at every possible location in a DNN model. It then maximizes the early exit rate by training with emphasis on important filters, followed by minimizing the computational overheads of early exits via iterative layer and filter pruning. In doing so, the best trade-off between early exit rate and computational overhead is achieved for each individual early exit. We explain the details of each stage below.

Stage-1: Insert Over-Parameterized Early Exits. For each layer, we insert M early exits, where the k-th early exit uses the first ⌈kN/M⌉ most important filters in ranked as input, and N denotes the total number of filters of this layer. We initialize each early exit with four layers, with each layer having twice as many filters as its corresponding previous layer. We increase the number of filters by a factor of two so as to properly encode the increasingly richer representations as we go deeper. A similar setup can be found in popular DNNs such as MobileNets and InceptionV3.
Stage-2: Maximize Early Exit Rates by Training with Emphasis on Important Filters. Given how we insert early exits in the previous step, the higher a filter is ranked in ranked, the more frequently the filter is used by early exits. To force these frequently used filters to learn more salient features, we train those filters with emphasis by assigning them a lower dropout rate compared to filters that are ranked lower. In doing so, this process essentially maximizes the exit rate of each early exit. We detail the definition of early exit rate in Section 4.3.2.3.

Stage-3: Minimize Overheads by Pruning Layers and Filters. Although the exit rate of each early exit is maximized, the overhead of each over-parameterized early exit is still significant due to over-parameterization. FlexDNN minimizes this overhead by iteratively pruning layers and filters of each early exit until its exit rate starts to drop. Specifically, for each early exit, we start with layer-wise pruning by iteratively pruning its layers until its exit rate drops. We then apply filter-wise pruning by iteratively pruning lower-ranked filters within each remaining layer until its exit rate drops. As a result, the architecture of each inserted early exit achieves the optimized trade-off between the exit rate and computational overhead.

4.3.2.3 Early Exit Rate

The early exit rate of a given early exit reflects the probability of a frame being exited to avoid further computation. A frame is exited if its confidence score is higher than a pre-defined threshold. Formally, our confidence score is defined as

Conf(y) = 1 + \frac{1}{\log C} \sum_{c \in C} y_c \log y_c,    (4.1)

where y = [y_1, y_2, \ldots, y_c, \ldots, y_C] is the softmax classification probability vector generated by an early exit, C is the total number of classes, and \sum_{c \in C} y_c \log y_c is the negative entropy of the softmax classification probability distribution over all the classes. The threshold for each early exit such that frames are exited without loss of accuracy can be obtained using cross-validation [19]. We denote those thresholds as

T_{lossless} = (T_{EE_1}, \ldots, T_{EE_i}, \ldots, T_{EE_K}),    (4.2)

where EE_i denotes the i-th early exit of \widetilde{Net}, and K is the total number of early exits inserted in \widetilde{Net}.
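As a quick illustration of Eq. (4.1), the sketch below computes the confidence score for a softmax output and applies a per-exit threshold; the probability vectors and threshold value are illustrative.

```python
import math

def conf(y):
    """Confidence score of Eq. (4.1): 1 + (1/log C) * sum_c y_c * log y_c.

    Equals 1 for a one-hot (fully confident) distribution and 0 for a
    uniform (fully uncertain) one over C classes.
    """
    C = len(y)
    neg_entropy = sum(p * math.log(p) for p in y if p > 0.0)
    return 1.0 + neg_entropy / math.log(C)

# A peaked distribution exits; a flat one continues to deeper layers.
easy = [0.96, 0.02, 0.01, 0.01]
hard = [0.30, 0.28, 0.22, 0.20]
T_EE = 0.6  # illustrative per-exit threshold from cross-validation
for y in (easy, hard):
    print(f"Conf={conf(y):.2f} exit={conf(y) >= T_EE}")
```

For the easy vector the score is about 0.85 and the frame exits; for the hard vector it is about 0.01 and the frame proceeds deeper into the network.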
4.3.3 Early Exit Insertion Plan

4.3.3.1 Motivation

Although the optimized trade-off between the exit rate and computational overhead of each individual early exit is achieved by our architecture search scheme explained above, the optimized trade-off for the entire network is not. This is because early exits have been inserted at every possible location throughout the network and hence accumulate immense overhead altogether.

To obtain the globally optimized trade-off, in contrast to BranchyNet and MSDNet, which manually determine the early exit insertion locations by trial and error, FlexDNN adopts a systematic approach to derive an optimal early exit insertion plan. At a high level, we formulate the derivation of the early exit insertion plan as an optimization problem and solve it using an efficient greedy heuristic that prunes inefficient early exits in an iterative manner.

4.3.3.2 Problem Formulation

FlexDNN formulates the derivation of the early exit insertion plan as the following optimization problem:

Net = \arg\min_{\widetilde{Net}} Res(\widetilde{Net}),    (4.3)

where Net is the model with the optimal early exit insertion plan, \widetilde{Net} ranges over the set of candidates with all the possible insertion combinations, and Res(\cdot) evaluates the computational consumption of a specific insertion plan.

Solving Eq. (4.3) by searching all the possible insertion plans is computationally expensive because there are 2^K combinations, where K is the total number of insertion locations. To reduce the complexity of this process, FlexDNN utilizes a greedy heuristic to obtain an approximate solution based on the following key observation.

When a model is densely inserted with early exits, pruning any of the early exits leads to a reduction of computational consumption. Among all early exits, pruning an early exit with the smallest early exit rate and a large computational overhead (i.e., an inefficient early exit) leads to the largest reduction of computational consumption. Based on this observation, FlexDNN greedily prunes these inefficient early exits in an iterative manner. FlexDNN terminates this iterative process when the computational consumption of the model starts to increase. This is because at this stage, all the remaining early exits are contributing to the computational efficiency and hence are beneficial to the model. As such, the remaining early exits represent the optimal early exit insertion plan.

4.3.3.3 Efficiency of Early Exit

To identify inefficient early exits, we define a metric Q that quantifies the quality of the trade-off between the early exit rate and the computational overhead of a particular early exit. Specifically, for early exit j, we define its quality Q_j as the ratio between the gain G_j it brings and the cost C_j it incurs:

Q_j = G_j / C_j,    (4.4)

where C_j is the computation consumed by this early exit, and G_j is the computation avoided due to the existence of the early exit. Both C_j and G_j are measured by the number of floating point operations.

Figure 4.7: Illustration of the derivation of the quality of an early exit.

Figure 4.7 illustrates how G_j and C_j are calculated. In particular, the figure shows three consecutive early exits inserted at the i-th, j-th, and k-th early exit positions of the base model (i