COLLABORATIVE DISTRIBUTED DEEP LEARNING SYSTEMS ON THE EDGES

By

Xiao Zeng

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Electrical Engineering – Doctor of Philosophy

2021

ABSTRACT

COLLABORATIVE DISTRIBUTED DEEP LEARNING SYSTEMS ON THE EDGES

By

Xiao Zeng

Deep learning has revolutionized a wide range of fields. In spite of its success, most deep learning systems are proposed in the cloud, where data are processed in a centralized manner with abundant compute and network resources. This raises a problem when deep learning is deployed on the edge, where distributed compute resources are limited. In this dissertation, we propose three distributed systems to enable collaborative deep learning on the edge. These three systems target different scenarios and tasks. The first system, dubbed Distream, is a distributed live video analytics system based on the smart camera-edge cluster architecture. Distream fully utilizes the compute resources at both ends to achieve optimized system performance. The second system, dubbed Mercury, addresses the key bottleneck of collaborative learning. Mercury enhances the training efficiency of on-device collaborative learning without compromising the accuracies of the trained models. The third system, dubbed FedAce, is a distributed training system that improves training efficiency under the federated learning setting, where private on-device data are not allowed to be shared among local devices. Within each participating client, FedAce achieves such improvement by prioritizing important data. In the server where model aggregation is performed, FedAce exploits the client importance and prioritizes important clients to reduce stragglers and reduce the total number of rounds. In addition, FedAce conducts federated model compression to reduce the per-round communication cost and obtains a compact model after training completes. Extensive experiments show that the proposed three systems are able to achieve significant improvements over status-quo systems.

ACKNOWLEDGMENTS

I would like to express the deepest appreciation to many people. First, I am honored and grateful to have Dr. Mi Zhang as my academic advisor. He has been not only an excellent supervisor but also a great friend to me since we met for the first time. I would not have made so much achievement and completed my Ph.D. degree without him. I would also like to thank Dr. Xiaobo Tan, Dr. Xiaoming Liu and Dr. Jiliang Tang, who serve as my committee members and provide guidance for my academic career.

Second, I would like to thank my fellows in the lab, including Biyi Fang, Wei Du, Dong Chen, Jiajia Li, Yu Zheng, Wei Ao, Shen Yan, Xiaoyu Zhang, Haochen Sun, Kai Cao, Kaixiang Lin and Lai Wei for their encouragement and help during my life at Michigan State University. I am also grateful for all the staff and faculty in the ECE department.

Lastly, I would like to thank my family, who have been providing tremendous support and encouragement. I have left them for so many years and I am truly thankful for their understanding. Especially, I would like to thank my wife, who has been staying with me since we came to the U.S. to pursue my Ph.D. degree. I am also blessed to have my son, Lucas, during the last year of my Ph.D. career. He is the best gift ever given to me in my Ph.D. life.

TABLE OF CONTENTS

LIST OF TABLES ....................................... vi
LIST OF FIGURES ...................................... vii
Chapter 1 Collaborative Deep Learning Inference on the Edge ...........
1 1.1Introduction......................................1 1.2BackgroundandMotivation..............................6 1.2.1LiveVideoAnalyticsPipeline........................6 1.2.2WorkloadDynamicsinReal-worldDeployments..............8 1.2.3NeedforWorkloadAdaptation........................9 1.3DistreamDesign....................................12 1.3.1OverallArchitecture.............................12 1.3.2BalancingtheWorkloadsacrossSmartCameras...............13 1.3.3PartitioningtheWorkloadbetweenSmartCamerasandEdgeCluster....16 1.3.4WorkloadAdaptationController.......................19 1.4SystemImplementation................................23 1.5Evaluation.......................................25 1.5.1ExperimentalMethodology..........................26 1.5.2OverallPerformance.............................28 1.5.3Component-wiseAnalysis..........................32 1.5.4SensitivityAnalysis..............................33 1.5.5ScalingPerformance.............................35 1.5.6SystemOverheads..............................36 1.5.7toExistingSystems.........................36 1.6RelatedWork.....................................37 1.7ConclusionandFutureWork.............................39 Chapter2CollaborativeDeepLearningTrainingontheEdge ............ 40 2.1Introduction......................................40 2.2BackgroundandMotivation..............................45 2.2.1ArchitectureofOn-DeviceCollaborativeLearningSystems.........45 2.2.2ExistingApproachesandTheirLimitations.................46 2.3DistreamOverview..................................48 2.3.1KeyInsight..................................49 2.3.2RudimentaryFrameworkandPerformanceModel..............51 2.3.3OverallDesign................................52 2.4DesignDetails.....................................53 2.4.1Group-wiseImportanceComputationandSampling.............53 2.4.2Importance-awareDataResharding.....................56 2.4.3BACCScheduler...............................57 2.4.4ProofofTrainingCorrectness........................59 iv 2.5Implementation....................................60 2.6Evaluation.......................................62 2.6.1ExperimentalMethodology..........................62 2.6.2OverallPerformance.............................65 2.6.3Component-wiseAnalysis..........................67 2.6.4BandwidthAdaptationPerformance.....................70 2.6.5ScalingPerformance.............................71 2.7RelatedWork.....................................71 2.8Conclusion......................................72 Chapter3CollaborativeLearningforFederatedLearning ......... 
73 3.1Introduction......................................73 3.2RelatedWork.....................................76 3.3BackgroundandMotivation..............................78 3.3.1ArchitectureofFederatedLearning......................78 3.3.2FederatedLearningCharacteristics......................79 3.3.3ScopeandImprovingOpportunity......................80 3.4FedAceDesign....................................82 3.4.1OverallArchitecture.............................82 3.4.2FederatedImportanceSampling.......................83 3.4.2.1DataImportanceSampling.....................84 3.4.2.2ClientImportanceSampling....................85 3.4.3FederatedModelCompression........................87 3.5Experiments......................................91 3.5.1Setup.....................................91 3.5.2Results....................................92 3.5.2.1End-to-EndPerformanceComparison...............93 3.5.2.2SensitivityAnalysis........................93 3.6Conclusion......................................97 Chapter4Conclusion .................................. 98 BIBLIOGRAPHY ....................................... 99 v LISTOFTABLES Table1.1:ComparisonbetweenDistreamandstatusquolivevideoanalyticssystems....2 Table1.2:PerformancegainoverCamera-Only(CampusSurveillance)...........29 Table1.3:PerformancegainoverCamera-Only(TrafMonitoring).............30 Table1.4:Component-wiseanalysisofDistream.PerformancegainsareoverCamera-Only.33 Table1.5:PerformanceofDistreamunderlow(10Mbps)andhigh(50Mbps)network bandwidths.....................................35 Table1.6:Cross-cameraworkloadbalancingoverheadsofDistream.............36 Table3.1:Comparisonbetweenfederatedlearninganddistributedtraining.........80 Table3.2:End-to-endperformance.ThenumbersindicatethespeedupsofFedAceover FedAvg.......................................93 Table3.3:Performanceofcompressionrateschedule.....................96 Table3.4:Performanceofcompressionratescheduleunderhighcompressionrate.....96 vi LISTOFFIGURES Figure1.1:LivevideoanalyticspipelineusedinDistream..................8 Figure1.2:Workloaddynamicsinreal-worlddeployment..................9 Figure1.3:SystemarchitectureofDistream..........................10 Figure1.4:Comparisonofthroughput(left)andlatency(right)betweenworkload-agnostic andworkload-adaptive(ours)schemes......................11 Figure1.5:Illustrationonwheretheworkload-agnosticschemefallsshortunderdynamic workloadsinreal-worlddeployments.......................11 Figure1.6:Overviewofthecross-cameraworkloadbalancer.................14 Figure1.7:FullDAGThepercentageontopofeachvertexisthenormalized accumulatedinferencecost............................17 Figure1.8:StochasticDAGpartition.Vertexesinbluecolorareallocatedto runatthecamera,andvertexesinbrowncolorareallocatedtorun attheedgecluster.................................18 Figure1.9:Overviewofcamera-clusterworkloadpartitioninworkloadadaptationcon- troller.TheoptimalDAGpartitionpointsofallcamerasarejointlydetermined.22 Figure1.10:ThroughputandlatencydistributiononCampusSurveillance..........30 Figure1.11:ThroughputandlatencydistributiononTrafMonitoringapplication.....31 Figure1.12:LatencySLOmissratecomparison........................31 Figure1.13:IllustrationonwhyDistreamoutperformsbaselines...............32 Figure1.14:Impactofsystemhyper-parametersontheperformanceofDistream.......34 Figure1.15:ScalingperformanceofDistream.........................35 Figure1.16:Resource-accuracytradeoffcurvecomparisonbetweenVideoEdgeandVideoEdge +Distream.....................................37 
Figure2.1:Themulti-parameterserverarchitectureofon-devicecollaborativelearning systems.......................................45 vii Figure2.2:Comparisonincommunication-computationoverlappingbetweenhighband- widthandlowbandwidthsetting.........................47 Figure2.3:Distributionsofdataimportanceasthenumberofiterationsincreasesduring training.......................................48 Figure2.4:Aconceptualcomparisonbetween(a)randomsamplingand(b)importance sampling......................................50 Figure2.5:ConceptualComparisonbetweenrudimentaryimportancesampling-basedframe- workandMercury.................................52 Figure2.6:Distributionsofrankdifference..........................53 Figure2.7:Illustrationofgroup-wiseimportancecomputationandsampling.Eachdot representsonesampleandthedarknessindicatesitsimportance.........55 Figure2.8:Importance-agnosticdatareshardingvs.importance-awaredataresharding...57 Figure2.9:(a)UnoptimizedPipelinevs.(b)Optimizedpipeline. w t representsthemodel weightatiteration t .R=R1+R2+R3+R4inlength..............58 Figure2.10:Distreamimplementation.............................61 Figure2.11:OverallperformancecomparisononCifar10andCifar100...........64 Figure2.12:OverallperformancecomparisononSVHNandAID..............65 Figure2.13:OverallperformancecomparisononTwSpeechCommandandAG- News.......................................65 Figure2.14:Testaccuracycurvesduringthecompletetrainingprocessintermsofnumber ofcommunicationrounds(left)andwall-clocktime(right)...........67 Figure2.15:Testaccuracycurvesintermsofnumberofiterations(left)andwall-clocktime (right)usinggroup-wise(Groupwise-IS)andall-inclusiveimportancecompu- tationandsampling(All-IS)............................67 Figure2.16:Importance-awarevs.importance-agnosticdataresharding,vs.non-resharding.69 Figure2.17:Testaccuracycurvesintermsofnumberofiterations(left)andwall-clocktime (right)usingBACCandsequentialpipeline....................69 Figure2.18:SpeeduponCIFAR-10...............................70 viii Figure2.19:Bandwidthsnapshot................................70 Figure2.20:ScalingperformanceofDistreamon4,8,12edgedevicesundervariousnet- workbandwidths..................................71 Figure3.1:Federatedlearningoverview...........................80 Figure3.2:Trainingprocessfederatedlearning.......................82 Figure3.3:OverviewofFedAce...............................83 Figure3.4:Overviewoffederatedimportancesampling...................84 Figure3.5:Overviewoffederatedmodelcompression....................88 Figure3.6:Examplesofmaskcaching............................90 Figure3.7:Performanceunderdifferentnumberofepochs.................94 Figure3.8:PerformancegainofFedAceinthreefederatedlearningdatasets........95 Figure3.9:Testaccuracycomparisonunderdifferentmaskcachesettings..........97 ix Chapter1 CollaborativeDeepLearningInferenceon theEdge 1.1Introduction Videocamerasareubiquitous.Today,camerashavebeendeployedatscaleatplacessuchastraf intersections,universitycampuses,andgrocerystores.Drivenbytherecentbreakthroughindeep learning(DL)[49],organizationsthathavedeployedthesecamerasstarttouseDL-basedtech- niquesfor livevideoanalytics [86,35,40].Analyzinglivevideosstreamedfromthesedistributed camerasisthebackboneofawiderangeofapplicationssuchastrafcontrolandsecuritysurveil- lance.Asmanyoftheseapplicationsrequireproducinganalyticsresultsinreal-time,achieving low-latency , high-throughput ,and scalable videostreamprocessingiscrucial[77]. 
Live video analytics systems require high-resolution cameras to capture high-quality visual data for analytics. As the camera number scales up, these always-on cameras collectively generate hundreds of gigabytes of data every single second, making it infeasible to transmit such gigantic volume of data to datacenters in the cloud for real-time processing due to insufficient network bandwidth and long transmission latency between cameras and the cloud [88].

The key to overcoming this bottleneck is to move compute resources close to where data reside. Status quo live video analytics systems hence stream camera feeds to local edge clusters for data analytics where much higher network bandwidth is provided [86, 40, 44, 65]. To move compute resources even closer to data sources, major video analytics providers (e.g., Avigilon, Hikvision, NVIDIA) are replacing traditional video cameras with "smart cameras". Equipped with onboard DL accelerators, these smart cameras are not only able to perform basic video processing tasks such as background subtraction and motion detection, but also capable of executing complicated compute-intensive DL-based pipelines to detect and recognize the objects and a variety of their attributes [27, 82, 41]. Since each smart camera brings extra compute resources to process video streams generated by itself, this smart camera-edge cluster architecture is the key to enabling live video analytics at scale [35, 13, 41].

                      Architecture   Video Analytics       Workload   Cross-Camera
                                     Pipeline Partition    Adaptive   Workload Balancing
  VideoStorm [86]     Centralized    N/A                   No         N/A
  Chameleon [40]      Centralized    N/A                   No         N/A
  NoScope [44]        Centralized    N/A                   No         N/A
  FilterForward [13]  Camera-Only    Fixed                 No         No
  VideoEdge [35]      Distributed    Dynamic               No         No
  Distream            Distributed    Dynamic               Yes        Yes

Table 1.1: Comparison between Distream and status quo live video analytics systems.

Motivation & Limitations of Status Quo. In real-world deployments, depending on what areas the cameras are covering, the number of objects of interest (e.g., people, vehicles, bikes) captured by each camera is different and can vary over time. For example, for surveillance systems deployed on university campuses, a camera that covers the building entrance captures much larger numbers of people before and after classes than any other time; a camera that points at the emergency exit where people rarely visit has no objects of interest captured most of the time. As a consequence, the workload of recognizing the captured objects and their attributes produced at each camera is different and inherently dynamic over time.

As listed in Table 1.1, in recent years, live video analytics systems such as VideoStorm [86], Chameleon [40] and NoScope [44] have emerged. These systems enable efficient processing of a large number of camera streams, but are designed to process the streams within a centralized cluster.

With the emergence of smart cameras, recent systems start to leverage the compute resources inside smart cameras for distributed live video analytics. FilterForward [13] proposes a camera-only solution which identifies important video frames and filters out unimportant ones directly on smart cameras. VideoEdge [35], on the other hand, proposes a fully distributed framework that partitions the video analytics pipeline across cameras and cluster with the objective to optimize the resource-accuracy tradeoff. While these systems aim to optimize live video analytics from a variety of perspectives, they are agnostic to the workload dynamics in real-world deployments described above, making them fall short in two situations: on one hand, failing to utilize the compute resources inside idle cameras could considerably jeopardize system throughput; on the other hand, failing to alleviate the workloads from cameras that are overloaded by bursty workloads could incur high latency, causing the system to fail to meet the latency service level objective (SLO) imposed by the live video analytics applications.

Overview of the Proposed Approach.
In this paper, we present Distream, a distributed framework based on the smart camera-edge cluster architecture that is able to adapt to the workload dynamics in real-world deployments to achieve low-latency, high-throughput, and scalable DL-based live video analytics. The underpinning principle behind the design of Distream is to adaptively balance the workloads across smart cameras as well as partition the workloads between cameras and the edge cluster. In doing so, Distream is able to fully utilize the compute resources at both ends to jointly maximize the system throughput and minimize the latency without sacrificing the video analytics accuracy.

The design of Distream involves three key challenges.

Challenge #1: Cross-Camera Workload Balancing. One key obstacle to achieving high-throughput, low-latency live video analytics is caused by the imbalanced workloads across cameras. Therefore, the challenge lies in designing a scheme that balances the workloads across cameras. However, the cross-camera workload correlation, the heterogeneous onboard compute capacities of smart cameras, and the overhead of workload balancing altogether make designing such a scheme not trivial.

Challenge #2: Camera-Cluster Workload Partitioning. Another key obstacle to achieving high-throughput, low-latency live video analytics is caused by the imbalanced workloads between smart cameras and the edge cluster. To balance the workloads between cameras and edge cluster, the video analytics pipeline should be partitioned based on the workload ratio of the two sides. However, the possible options to partition the video analytics pipeline are quite limited in number, making workload partitioning between camera and edge cluster by nature coarse-grained. As a result, the partitioned video analytics pipeline may not match the workload ratio of the two sides.

Challenge #3: Adaptation to Workload Dynamics. Given the dynamics of workloads in real-world deployments, the optimal solutions for cross-camera workload balancing and camera-cluster workload partitioning vary over time. Being able to adapt to such workload dynamics is a must for high-performance live video analytics systems. Designing such an adaptation scheme, however, is not trivial, as the optimal pipeline partitioning solution for each camera can be different. More importantly, since the workloads are jointly executed between cameras and edge cluster, for the whole system to achieve the best performance, the optimal pipeline partitioning solutions for all the cameras need to be jointly determined. This is a much more challenging problem compared to the single-pair workload partitioning problem tackled in the literature [16, 14].

To address the first challenge, Distream incorporates a cross-camera workload balancer that takes the cross-camera workload correlation, heterogeneous compute capacities of smart cameras, as well as the overhead of workload balancing into account, and formulates the task of cross-camera workload balancing as an optimization problem. In particular, the proposed cross-camera workload balancer incorporates a long short-term memory (LSTM)-based recurrent neural network which is able to enhance the performance of cross-camera workload balancing by predicting incoming workloads in the near future to avoid migrating workloads to cameras that are going to experience high workloads.

To address the second challenge, Distream incorporates a stochastic partitioning scheme that partitions the video analytics pipeline in a stochastic manner. In doing so, it provides much more partition flexibility and much finer partition granularity. As such, Distream is able to partition the pipeline to match the workload ratio of the smart camera and edge server.

To address the third challenge, Distream incorporates a workload adaptation controller which triggers the cross-camera workload balancer when cross-camera workload imbalance is detected.
Moreover, it formulates the task of jointly identifying the optimal pipeline partitioning solutions for all the cameras as an optimization problem with the objective to maximize the overall system throughput subject to the latency SLO imposed by the live video analytics applications.

System Implementation & Summary of Evaluation Results. We implemented Distream and deployed it on a self-developed testbed that consists of 24 smart cameras and a 4-GPU edge cluster. We evaluate Distream with 500 hours of distributed video streams from two real-world video datasets: one from six traffic cameras deployed in Jackson Hole, WY [7] for a traffic monitoring application, and the other from 24 surveillance cameras deployed on a university campus for a security surveillance application. We compare Distream against three baselines: Centralized, which processes all the workloads on the edge cluster; Camera-Only, which processes all the workloads on smart cameras; and VideoEdge [35]. Our results show that Distream consistently outperforms the baselines in terms of throughput, latency, and latency SLO miss rate by a large margin due to its workload adaptation schemes. Moreover, our scaling experiments show that Distream is able to scale up system throughput nearly linearly while maintaining a low latency with negligible overheads. Finally, we show that the workload adaptation techniques proposed in Distream could benefit existing live video analytics systems and enhance their performance as well.

Summary of Contributions. To the best of our knowledge, Distream represents the first distributed framework that enables workload-adaptive live video analytics under the smart camera-edge cluster architecture. It identifies a key performance bottleneck and contributes novel techniques that address the limitations of existing systems. We believe our work represents a significant step towards turning the envisioned large-scale live video analytics into reality.

1.2 Background and Motivation

1.2.1 Live Video Analytics Pipeline

Modern live video analytics pipelines typically adopt a cascaded architecture which consists of a front-end object detector followed by a back-end classification module to perform a variety of analytics tasks on each of the detected objects of interest within a video frame.

There are two types of object detectors that have been commonly used in existing live video analytics systems [40]. The first type is the CNN-based object detector (e.g., YOLO [68], SSD [57] and Faster-RCNN [69]) which extracts and classifies all the objects of interest in a frame with one single inference. However, such an object detector has to be constantly extracting features from frames and performing inference even if there is no object of interest appearing in video streams. In many scenarios in real-world deployments, however, the objects of interest may only appear in video streams for short periods of time. In such case, a significant amount of compute resources is wasted. The second type of object detector is to use a light-weight background subtractor [93] to extract the regions where objects of interest reside from the frame. It sends these regions to a classifier to identify the object within each region. As such, it produces objects of interest only when they appear in the frame. While Distream is a generic framework which supports both types of object detectors, in this work, we focus on the background subtraction-based detector to illustrate our ideas.

The classification module in general can be represented as a directed acyclic graph (DAG). In this work, we use attribute recognition as a concrete example of the classification module to illustrate our ideas. Specifically, each vertex in the DAG represents a DNN that recognizes a particular attribute of the object; each directed edge represents a data flow from one classifier to another. For example, the DAG in Figure 1.1 consists of three branches. Based on the type of the detected object (vehicle, person, or others), the classification module selects one of the three branches, each of which employs a cascaded sequence of classifiers to further identify the attributes of the object^1.
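To make the DAG structure concrete, the sketch below shows one possible way such a classification module could be represented in code: one branch per detected object type, where each classifier along the selected branch corresponds to one DNN inference. This is purely illustrative; the dictionary layout and function names are assumptions (not Distream's actual code), and the branch contents follow the example in Figure 1.1.

```python
# Illustrative sketch only: the attribute-recognition classification module as a DAG,
# keyed by detected object type. Vertex names follow the example branches in Figure 1.1;
# the structure and function names are hypothetical, not Distream's implementation.

ATTRIBUTE_DAG = {
    "vehicle": ["Color", "Type", "Make"],      # upper branch
    "person":  ["Behavior", "Gender"],         # middle branch
    "others":  [],                             # no further attribute classifiers
}

def workloads_for_object(object_type):
    """Each classifier inference along the selected branch is treated as one workload."""
    return ATTRIBUTE_DAG.get(object_type, [])

# Example: a detected vehicle produces three workloads (Color, Type, Make inferences).
print(workloads_for_object("vehicle"))
```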
Based on the pipeline illustrated in Figure 1.1, going through each classifier within the DAG is regarded as an individual workload^2. Therefore, identifying an object of interest and its attributes within a video frame produces multiple workloads to be processed by the live video analytics system.

^1 Although we adopted the design choice of treating each task independently, Distream is flexible to support multi-task learning where a DNN inference produces multiple results. It should be noted that not all the tasks can be combined into a single multi-task DNN model. Thus our DAG formulation is general enough to support all the cases.

^2 We did not include capturing images and background extraction as workloads mainly because these steps consume much less computation compared to DNN-based inference and thus can be executed locally fast enough without offloading.

Figure 1.1: Live video analytics pipeline used in Distream.

1.2.2 Workload Dynamics in Real-world Deployments

To illustrate the workload dynamics in real-world deployments, we collected a dataset from 24 surveillance cameras deployed on a university campus (§2.6.1). Given the limited space, we pick a representative video clip to make our points. Specifically, Figure 1.2 shows the workloads generated by four surveillance cameras deployed at a university building on a weekday between 12pm to 12:30pm. Among them, CAM1 monitors a square next to the building entrance; CAM2 monitors a sideway outside the building; CAM3 and CAM4 cover two different corridors inside the building.

Figure 1.2: Workload dynamics in real-world deployment.

As illustrated, depending on the views captured by the cameras, the workloads of real-world video streams have three key characteristics. (i) The workloads are different across cameras: for CAM1 that monitors busy areas, more workloads are generated as pedestrians pass by more often; for CAM3 that monitors a less busy indoor corridor, much fewer workloads are generated sporadically and it has no workloads most of the time. (ii) The workload generated at each camera is dynamic over time: this is obvious because the content captured by each camera is changing over time. (iii) The amount of workloads generated at each camera can vary significantly: the workload difference between bursty and non-bursty durations can be more than 1000×.

1.2.3 Need for Workload Adaptation

Existing live video analytics systems based on the smart camera-edge cluster architecture, however, are agnostic to the real-world workload dynamics illustrated above. To demonstrate how existing systems fall short under such workload dynamics, we compare the system performance between Distream that is workload-adaptive and a workload-agnostic baseline that allocates compute resource proportional to the compute capacity. We use the same dataset in §2.2 and a testbed that consists of four smart cameras (2 Jetson-TX1 and 2 Jetson-TX2) and a 1-GPU edge cluster (§1.4) to
Tounderstandwhytheworkload-agnosticschemefallsshort,weuseonesmartcameraasan exampleandplotitsworkloadsgeneratedoveratwominsvideoclipinFigure1.5.,the blacksolidlinedepictsthegeneratedworkloadsovertime;thereddashedlinedepictsthecamera's processingcapability(i.e.,themaximumnumberofworkloadsitcanprocessperunittime)based ontheworkload-agnosticscheme.Asshown,undertheworkload-agnosticscheme,althoughthe workloadgeneratedatthesmartcameraisdynamicovertime,theprocessingcapabilityofthe camerastaysthesameanddoesnotadapttotheworkloadvariation.Consequently,whenthe generatedworkloadexceedsitsprocessingcapability,thecameraexperiences under-provisioning 10 Figure1.4:Comparisonofthroughput(left)andlatency(right)betweenworkload-agnosticand workload-adaptive(ours)schemes. Figure1.5:Illustrationonwheretheworkload-agnosticschemefallsshortunderdynamicwork- loadsinreal-worlddeployments. (redregion)sothatitfailstoprocesstheworkloadsintime;whentheworkloadisbelowits processingcapability,thecamerabecomes over-provisioning (greenregion)andthuswastesits computeresourcesthatcouldhavebeenusedtoprocessworkloadsgeneratedfromothercameras. Theseobservationsaltogetherhighlighttheneedforworkloadadaptationtoavoidtheoccurrence ofeitherunder-provisioningorover-provisioning,whichmotivatesthedesignofDistream. 11 1.3DistreamDesign 1.3.1OverallArchitecture Figure2.5illustratestheoverallarchitectureofDistream.Asshown,Distreamisdesignedasa distributedframeworkthatspansacrosssmartcamerasandtheedgecluster.Inthedataplane, assoonasthesmartcameracapturesavideoframe,itrunstheonboardbackgroundsubtractor toextractregionsofinterest(ROIs)withintheframeandappendsthemonebyoneintoalocal queue( 1 ).EachROIinthequeueiseitheroftoanothercamera( 2a )orpartiallypro- cessedinsidethelocal inferenceengine accordingtoitspartitionedDAGwhosepartitionpoint isdeterminedbythe DAGpartitioner ( 2b ).Theresultofthepartially-processedDAGisthen transferredtotheedgeclusterandbufferedinaqueue( 3 ),whichwillbeprocessedbythe in- ferenceengine insidetheedgecluster( 4 )tocompletetheremainingunprocessedpartofthe DAG.Alltheinferenceresultsfromthecamerasaregatheredattheedgecluster.Inthecontrol plane,the systemmonitor continuouslymonitorstheworkloadsateachcameraandedgecluster, andsendsthecollectedworkloadinformationtothe workloadadaptationcontroller ( i ).Once theworkloadimbalanceacrosscamerasisdetected,the workloadadaptationcontroller triggers the cross-cameraworkloadbalancer ( iia )tobalancetheworkloadsacrosscameras( iiia ).Mean- while,basedonthecurrentworkloadsateachcameraandedgecluster,the workloadadaptation controller adaptivelytheoptimalDAGpartitionpointforeachcamerathatbalancesthe workloadbetweeneachcameraandtheedgecluster.SuchoptimalDAGpartitionpointsaresent tothe DAGpartitioner ( iib )forDAGpartitioning,andthecorrespondingDAGpartitionresultis senttothe inferenceengine ateachcamera( iiib ). Inthefollowing,wedescribethethreekeycomponentsofDistream: cross-cameraworkload balancer (§1.3.2), DAGpartitioner (§1.3.3),and workloadadaptationcontroller (§1.3.4)indetail. 
1.3.2 Balancing the Workloads across Smart Cameras

The first key component of Distream is the cross-camera workload balancer, which balances the workloads at the level of the extracted regions of interest (ROIs) across cameras by migrating part of the workloads from heavily loaded cameras to idle or lightly loaded ones (Figure 1.6). To achieve this, the cross-camera workload balancer needs to accommodate three key considerations:

Figure 1.6: Overview of the cross-camera workload balancer.

Consideration #1: cross-camera workload correlation. As also being observed by many existing works [40, 41, 38], workloads on nearby cameras may exhibit strong correlations due to mobility of objects of interest: a camera may have high workload within a short time period if its nearby cameras are experiencing high workloads. The cross-camera workload balancer needs to take such correlations into account to avoid migrating workloads to cameras that are going to experience high workloads.

Consideration #2: heterogeneous compute capacities. Smart cameras may have heterogeneous onboard compute capacities caused by different generations of compute hardware. To prevent any camera from becoming the bottleneck, the cross-camera workload balancer needs to balance the workloads proportional to each camera's compute capacity.

Consideration #3: workload balancing overheads. In practice, migrating workloads across cameras incurs overheads. The cross-camera workload balancer needs to take such overheads into account such that the overheads do not overshadow the benefits brought by load balancing.

The cross-camera workload balancer takes these three considerations into account, and formulates the task of cross-camera workload balancing as an optimization problem.

Formally, let N denote the total number of smart cameras, w_i denote the workload of camera i before load balancing, and let Σ_j x_ji and Σ_j x_ij denote the workloads that are moved into and moved out of camera i respectively. Therefore, the workload of camera i after load balancing can be calculated as u_i = w_i + Σ_j x_ji − Σ_j x_ij. The goal of the cross-camera workload balancer is to minimize ν, the workload imbalance index across smart cameras after workload balancing, which can be calculated as

    ν = u_max / ū − 1                                                   (1.1)

where u_max and ū represent the maximum workload and average workload of all cameras respectively. By minimizing the workload imbalance index, the cross-camera workload balancer is able to achieve a good balance between workload balancing efficiency and fairness without incurring thrashing.

To accommodate Consideration #1, we take the cross-camera workload correlation into account by incorporating a workload predictor to predict the future workload at each camera (not from migration) based on their current workloads and the cross-camera workload correlations. Specifically, we employ a long short-term memory (LSTM)-based recurrent neural network [61] as the workload predictor because it shows better long-term forecasting ability in predicting time series data compared to other methods [60]. The predicted workload ŵ_i at camera i output by the workload predictor is added to u_i as an adjustment to obtain a more accurate future workload estimate as

    u_i = w_i + Σ_j x_ji − Σ_j x_ij + ŵ_i                               (1.2)

To accommodate Consideration #2, we take the heterogeneous compute capacities of smart cameras into account by normalizing u_i with a factor α_i = C_i / C̄, where C_i is the compute capacity of camera i and C̄ is the average compute capacity of all the smart cameras. As a result, the workload at each camera after load balancing gets normalized as u′_i = u_i / α_i, and the workload imbalance index ν gets normalized as

    ν′ = u′_max / ū′ − 1                                                (1.3)

To accommodate Consideration #3, a triggering threshold β is added into our optimization objective. The cross-camera workload balancing is not triggered when ν′ is below β.
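The sketch below is a minimal illustration of how the normalized imbalance index of Eqs. (1.1)-(1.3) could be computed, assuming the capacity normalization divides the adjusted workload by α_i = C_i / C̄ so that higher-capacity cameras are expected to carry proportionally more workload. Variable and function names are illustrative only, not from Distream's implementation.

```python
# Minimal sketch of the normalized workload imbalance index, Eqs. (1.1)-(1.3).
# Assumes x[i][j] holds the workload moved from camera i to camera j and that the
# capacity normalization divides by alpha_i = C_i / C_bar (an assumption of this sketch).

def normalized_imbalance(w, x, w_hat, C):
    """
    w:     current workloads w_i before balancing
    x:     x[i][j] = workload moved from camera i to camera j
    w_hat: LSTM-predicted incoming workloads per camera
    C:     per-camera compute capacities C_i
    Returns nu' = u'_max / mean(u') - 1 (assumes total workload is nonzero).
    """
    n = len(w)
    C_bar = sum(C) / n
    u_prime = []
    for i in range(n):
        moved_in = sum(x[j][i] for j in range(n))
        moved_out = sum(x[i][j] for j in range(n))
        u_i = w[i] + moved_in - moved_out + w_hat[i]   # Eq. (1.2)
        alpha_i = C[i] / C_bar                         # capacity normalization factor
        u_prime.append(u_i / alpha_i)                  # normalized workload u'_i
    u_bar = sum(u_prime) / n
    return max(u_prime) / u_bar - 1                    # Eq. (1.3)
```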
Putting all the pieces together, our cross-camera workload balancing scheme can be formulated as

    min_{x_ij ≥ 0}  max(ν′ − β, 0)                                      (1.4a)
    s.t.  u_i = w_i + Σ_j x_ji − Σ_j x_ij + ŵ_i                         (1.4b)
          α_i = C_i / C̄                                                 (1.4c)
          u′_i = u_i / α_i                                              (1.4d)
          ν′ = u′_max / ū′ − 1                                          (1.4e)

Solving the above nonlinear optimization problem is computationally hard. To enable real-time cross-camera workload balancing, we utilize a heuristic to efficiently solve the optimization problem. Specifically, our heuristic starts with all x_ij = 0 and increases x_ij iteratively. In each iteration, we select the cameras that have the largest and the smallest workloads to form a migration pair. Then we increase the migrated workload x_ij by Δx (i.e., x_ij = x_ij + Δx), where Δx = 1% · u′_i. We repeat such iteration until ν′ is below a threshold or ν′ does not improve between two consecutive iterations.

As shown in §1.5.6, such heuristic is able to solve the optimization problem in an efficient manner and incurs negligible overheads.

1.3.3 Partitioning the Workload between Smart Cameras and Edge Cluster

The second key component of Distream is the DAG partitioner, which partitions the workloads between smart cameras and edge cluster at the DAG level. To achieve this, the DAG partitioner needs to accommodate two key considerations:

Consideration #1: DAG is conditionally executed. As the contents of video streams are changing, the execution flow in the DAG is changing correspondingly. Take the DAG in Figure 1.1 as an example: if a 'Vehicle' is detected, the upper path of the DAG (i.e., the 'Color', 'Type' and 'Make' classifiers) will be executed; if a 'Person' is detected, the middle path of the DAG (i.e., the 'Behavior' and 'Gender' classifiers) will be executed. However, to identify the optimal DAG partition point, the DAG partitioner needs to consider all the possible partition points from all the possible execution paths.

Consideration #2: DAG partitioning is by nature coarse-grained. Ideally, to balance the workloads between cameras and the edge cluster, the DAG partition point should be set based on the workload ratio of the two sides. However, the possible partition points in a DAG are the vertices in the DAG, which are discrete and limited in number. This makes DAG partitioning by nature coarse-grained. As a result, the DAG partition point may not match the workload ratio of the two sides.

The DAG partitioner takes these two considerations into account, and designs a two-step scheme to partition the DAG.

Step #1: Full DAG Profiling. In our first step, we follow the topological order of the DAG to extract all the possible execution paths in the DAG. As an example, Figure 1.7 shows three execution paths extracted from the DAG in Figure 1.1. For each extracted execution path, we then profile the inference cost of each classifier, and label the normalized accumulated inference cost on top of each classifier. For example, in Figure 1.7, the normalized accumulated inference cost of executing the 'Vehicle' path up to B is 35%.

Figure 1.7: Full DAG profiling. The percentage on top of each vertex is the normalized accumulated inference cost.

Step #2: Stochastic DAG Partitioning. In our second step, we propose a stochastic DAG partitioning scheme to partition the DAG in a manner such that the partitioned DAG could better match the workload ratio of the camera and the edge cluster. Specifically, for each execution path obtained in Step #1, the goal of our stochastic DAG partitioning scheme is to find a stochastic path execution plan such that the expected normalized accumulated inference cost of executing the path matches the optimal DAG partition point par ∈ [0, 1] provided by the workload adaptation controller (§1.3.4).

Figure 1.8: Stochastic DAG partition. Vertices in blue color are allocated to run at the camera, and vertices in brown color are allocated to run at the edge cluster.
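As a small illustration of Step #2, the sketch below picks the two vertices whose normalized accumulated costs bracket the target partition point par and assigns them the partition probabilities that the worked 'Vehicle' example following this figure derives (Eq. (1.5)). Function and variable names are hypothetical, not Distream's actual code; the example path costs are the B (35%) and C (85%) values from Figures 1.7 and 1.8.

```python
import random

# Simplified sketch of stochastic path partitioning (§1.3.3, Step #2). `path_costs`
# lists (vertex, normalized accumulated inference cost) in execution order; `par` is
# the target partition point in [0, 1] from the workload adaptation controller.

def stochastic_partition(path_costs, par):
    # Find the two adjacent vertices whose accumulated costs bracket `par`.
    for (v_lo, c_lo), (v_hi, c_hi) in zip(path_costs, path_costs[1:]):
        if c_lo <= par <= c_hi:
            p_lo = (par - c_hi) / (c_lo - c_hi)   # Eq. (1.5): probability of cutting at v_lo
            # Cut at v_lo with probability p_lo, at v_hi otherwise, so the expected
            # on-camera cost equals par.
            return v_lo if random.random() < p_lo else v_hi
    # par falls outside the path's cost range: cut at the first or last vertex.
    return path_costs[0][0] if par < path_costs[0][1] else path_costs[-1][0]

# The 'Vehicle' path example below: Cost_B = 0.35, Cost_C = 0.85, par = 0.7
# gives P_B = 0.3 and P_C = 0.7.
vehicle_path = [("B", 0.35), ("C", 0.85)]
print(stochastic_partition(vehicle_path, 0.7))
```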
Take the 'Vehicle' path in Figure 1.8 as an illustrative example. For the 'Vehicle' path, assume we need to achieve par = 0.7 workload partitioning (i.e., 70% workload allocated to the camera and 30% workload allocated to the edge cluster) determined by the workload adaptation controller. Our stochastic DAG partitioning scheme first identifies two vertices (vertex B, vertex C) along the 'Vehicle' path such that par = 0.7 falls between their normalized accumulated inference costs (Figure 1.8(1)). Given the normalized accumulated inference costs of these two vertices (Cost_B = 0.35 for vertex B, Cost_C = 0.85 for vertex C), our stochastic DAG partitioning scheme then determines the probabilities for partitioning the 'Vehicle' path at vertex B (P_B) and vertex C (P_C) respectively as:

    P_B = (par − Cost_C) / (Cost_B − Cost_C),    P_C = 1 − P_B          (1.5)

In this example, the solutions are P_B = 0.3 and P_C = 0.7. Therefore, the stochastic path execution plan is to partition the path at vertex B with a probability of 0.3 and to partition the path at vertex C with a probability of 0.7 (Figure 1.8(2)). As such, the expected normalized accumulated inference cost of executing the path matches the optimal DAG partition point par (0.3 × 35% + 0.7 × 85% = 0.7).

Compared to the discrete DAG partitioning approach, our stochastic DAG partitioning scheme provides much more partition flexibility and granularity. With such flexibility and granularity, Distream is able to better partition the DAG between smart cameras and edge cluster to match the workload ratio of the two sides.

1.3.4 Workload Adaptation Controller

The third key component of Distream is the workload adaptation controller, which is the core to achieve our workload adaptation objective. The workload adaptation controller performs two tasks. First, when the workload imbalance index across smart cameras is greater than a threshold β, the workload adaptation controller triggers the cross-camera workload balancer to balance the workloads across the cameras as described in §1.3.2. Second, the workload adaptation controller follows a camera-cluster workload partition schedule to periodically (at interval γ) update the optimal DAG partition point for each camera to balance the workloads between each camera and edge cluster (Figure 1.9). Such optimal partition points are sent to the DAG partitioner for DAG partitioning as described in §1.3.3.

The DAG partition points serve as knobs to control the DAG execution across cameras and edge cluster, and thus control the workload split between them. To identify the optimal DAG partition points, the workload adaptation controller needs to accommodate two key considerations:

Consideration #1: the optimal DAG partition point is different for each camera. The smart cameras have heterogeneous compute capacities and the workload at each camera can be different due to the discretization of cross-camera workload balancing. This requires the workload
AsaDAGispartitionedintotwopartsandexecutedonacameraandedgecluster respectively,thethroughputoftheentirelivevideoanalyticssystemisthesumofthethroughput ofcamerasandthethroughputoftheedgecluster. Formally,let N denotethenumberofcamerasinthesystem, par i denotetheDAGpartition pointforcamera i ,and TP cam i ( par i ) denotethethroughputofcamera i givenDAGpartitionpoint par i ,and TP edge denotethethroughputattheedgeclusterwhichdependsontheallpartition points f par i g . TP cam i ( par i ) and TP edge canbeaccuratelymeasuredbymonitoringthenumber ofworkloadsprocessedpertimeunitatcamera i andedgeclusterrespectively.Therefore,the throughputof N cameras TP cams ,thethroughputoftheedgecluster TP edge ,andthethroughputof theentiresystem TP system canbecomputedas: 20 TP cams = N å i = 1 TP cam i ( par i ) (1.6a) TP edge = TP edge ( par 1 ; par 2 ;:::; par N ) (1.6b) TP system = TP cams + TP edge (1.6c) Latency. Thelatencyofalivevideoanalyticssystemisasthetimeelapsedbetween receivingaworkloadandproducingtheinferenceresultoftheworkload.Giventhat,thelatency isthesumoftwoparts:1)theworkloadqueuingtime,and2)theworkloadprocessingtimefor producingtheinferenceresult.Becauseworkloadsarepartitionedacrosscameraandedge,the latencyoftwosidesisthuscomputedseparately.Assuch,thelatencyatcamera i andatedge clusterarecomputedas: L cam i = w i TP cam i + 1 TP cam i (1.7a) L edge = w edge TP edge + 1 TP edge (1.7b) where w i isthenumberofpendingworkloadstobeprocessedatcamera i ,and w edge isthenumber ofpendingworkloadstobeprocessedattheedgecluster. CompleteOptimizationFormulation. Puttingallthepiecestogether,thecompleteoptimization problemcanbeformulatedas 3 : 3 DistreamoperatesatDAGnodelevelwhenpartitioningworkloadsbetweencamerasandedgecluster. theoptimizationobjectiveintermsofDAGnodeisabletotheperformancegaininamorestraightforward manner. 21 Figure1.9:Overviewofcamera-clusterworkloadpartitioninworkloadadaptationcontroller.The optimalDAGpartitionpointsofallcamerasarejointlydetermined. max par i 8 i TP system (1.8a) s : t : L cam i L Max ; 8 i (1.8b) L edge L Max (1.8c) where L Max isthelatencySLOimposedbytheapplication. Asshown,since L cam i and L edge aredeterminedby TP cam i , w i , TP edge and w edge ,whichare determinedby par i ,theworkloadsreceivedatcamerasandtheedgeclusterarenotindependent butnegativelycorrelatedwitheachother.Thistradeoffillustratestheofidentifying theoptimalDAGpartitionpointstobalanceworkloadsbetweencamerasandtheedgeclusterto maximizethethroughputunderthelatencySLO. Theoptimizationproblemformulatedaboveisnon-convex.Insteadofexhaustivelysearching overallpossiblepartitionpointsforallcameras,weadoptaheuristictosolvetheoptimization problemef.,ourheuristicisiterativeandeachiterationhastwophases.In 22 thephase,itstartswiththeobjectiveofthepreliminaryDAGpartitionpointsthat maximizethethroughputwithoutconsideringthelatencySLO.Inthesecondphase,weiteratively targetthebottleneckcamerawiththelargestlatencyandadjustitspreliminaryDAGpartitionpoint tomeetthelatencySLO.Inpractice,ourheuristicishighlyeffectivewithnegligibleoverheads evenifthenumberofcamerasscalesup(§1.5.6). 1.4SystemImplementation WeimplementedDistreamusingabout2500linesofGolangcodeand500linesofpythoncode. Inthissection,weprovidedetailsonhowthekeypartsofDistreamwereimplemented. Testbed. 
We followed the design choice that is widely adopted by existing live video analytics systems in real-world deployments to develop our own testbed to evaluate the performance of Distream. Specifically, our testbed consists of 24 smart cameras and a single edge cluster. Among the 24 cameras, 18 were prototyped using Nvidia Jetson TX1 [3] while the other six were prototyped using Nvidia Jetson TX2 [6]. Both Jetson TX1 and TX2 are designed for embedded systems with onboard DL workloads^4. Jetson TX2 is an upgraded version of Jetson TX1 with larger compute capacity. For the edge cluster, we use a desktop server equipped with a cluster of 4 Nvidia Titan X GPUs [1]. We followed the standard of existing IP video surveillance systems to connect and set up the network bandwidth between smart cameras and edge cluster [5]. Specifically, all the 24 smart cameras and the edge cluster are wire-powered and interconnected through a single switch (D-Link DGS-1510-28X [2]) to form a local network. Each smart camera has a Fast Ethernet link (10/100Mbps) to the switch, and the bandwidth from the switch to the edge cluster is 10Gbps. By default, we set the camera bandwidth to 10Mbps. We also evaluate the system performance under 50Mbps.

^4 As AI chipsets evolve, we conjecture that AI hardware with such compute capacity will be widely deployed inside various edge devices such as smart cameras.

LSTM Workload Predictor. We use a two-layer LSTM with 256 neurons in each layer as our workload predictor. Specifically, the workload predictor takes the arrived workloads of all cameras as input and predicts the upcoming workloads of each camera in the next second (see the sketch below). We observed that the amount of workloads generated at each camera is highly relevant to the time period in a day. Thus we divide a day into six periods as 00:00-04:00, 04:00-8:00, 8:00-12:00, 12:00-16:00, 16:00-20:00 and 20:00-24:00. For each period, we train a specialized LSTM using workloads observed from that period. We find that using six specialized LSTMs reduces 10.2% mean square error compared to using a universal LSTM. Since running a specialized LSTM is fast on CPU (less than 1ms for one step), and it only makes predictions each second, our LSTM workload predictor incurs negligible overheads. Although the predictor could potentially benefit from periodic re-training given the dynamics of workloads over time, it is not the focus of this work.

Video Analytics Pipeline. We use OpenCV 3.2.0 to read video streams and implemented the Gaussian mixture model-based background subtraction method in [4] to extract regions of interest (ROIs) with length of history set to 500 and threshold set to 60. It maintains an estimate of the background image and uses a subtraction operation to extract foreground regions, followed by blob detection operations to extract ROIs. The number of extracted ROIs per frame varies depending on the nature of the scenes captured by the cameras, ranging from 0 to 30. For the inference engine, we designed an efficient DNN model based on MobileNetV2 [70] for each classifier, and we are able to compress over 90% of model weights using pruning [58] and knowledge distillation [30] without accuracy loss. Given that each DNN model has only about 2MB memory footprint, the inference engine is able to load all the DNN models into a single GPU, avoiding model loading or switching time. Similar to [71], we adopt batched inference [15] to improve video analytics throughput. We profiled the inference latency with batch sizes of 1, 8, 16, 32 and 64 respectively and set the batch size to 8 at the camera side and 32 at the edge cluster given its better performance tradeoff.

Scheduling. We implemented a monitor proxy at each camera and edge cluster to keep track of the runtime information including workload status, network bandwidth, throughput and latency. The proxy periodically reports this information to the system monitor in the edge cluster every 50ms. When performing the cross-camera workload balancing, the ROI is encapsulated into the migrating workload because the target camera does not have the video frame from the source camera.
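Returning to the LSTM workload predictor described above, the following is a minimal PyTorch sketch of such a model: two stacked LSTM layers with 256 hidden units, taking the per-second arrived workloads of all cameras as input and predicting each camera's workload for the next second, with one specialized instance per 4-hour period. The tensor shapes, class names, and training details are assumptions for illustration only; this is not Distream's actual implementation.

```python
import torch
import torch.nn as nn

# Sketch of a workload predictor matching the description above (two LSTM layers,
# 256 units each, one specialized model per 4-hour period). Names are hypothetical.

class WorkloadPredictor(nn.Module):
    def __init__(self, num_cameras, hidden_size=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_cameras, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, num_cameras)

    def forward(self, x):
        # x: (batch, time_steps, num_cameras) -- per-second arrived workloads of all cameras
        out, _ = self.lstm(x)
        # Predict each camera's workload for the next second from the last hidden state.
        return self.head(out[:, -1, :])

# One specialized predictor per 4-hour period of the day, as described in the text.
predictors = {period: WorkloadPredictor(num_cameras=24) for period in range(6)}
```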
Communication .DistreamusesZeroMQ[8]forhigh-speedlow-costinter-processcommuni- cationamongcameras.Forremotecommunicationbetweencamerasandedgecluster,weim- plementedalow-costremoteprocedurecall(RPC)totransfercontroldata.,the cross-cameraworkloadbalancinginstructionsandDAGpartitionpointsaretransferredtocam- erasimmediatelyaftertheyaregeneratedbythe cross-cameraworkloadbalancer andthe DAG partitioner respectively.Whenperformingcross-cameraworkloadbalancingandcamera-cluster workloadpartitioning,weusebatchedRPCsforcommunicationbetweentwoentitiestoamortize thecommunicationoverhead.ThemaximumRPCbatchsizeis20. 1.5Evaluation Inthissection,weevaluatetheperformanceofDistreamwiththeaimtoanswerthefollowing questions: Q1(§1.5.2) : DoesDistreamoutperformstatusquolivevideoanalyticssystems?Ifso,what arethereasons? Q2(§1.5.3) : HoweffectiveiseachcoretechniqueincorporatedinthedesignofDistream? 25 Q3(§1.5.4) : HowistheperformanceofDistreamaffectedbythesystemhyper-parametersand networkbandwidth? Q4(§1.5.5) : DoesDistreamscalewell? Q5(§1.5.6) : HowmuchoverheadsdoesDistreamincur? Q6(§1.5.7) : DothetechniquesproposedinDistreamalsostatusquolivevideoanalyt- icssystems? 1.5.1ExperimentalMethodology Applications. Weuse TrMonitoring and CampusSurveillance astworepresentativeappli- cationstoevaluateDistream.Foreachapplication,weusearepresentativereal-worldmulti-camera datasettobenchmarktheperformance. Application#1:TMonitoring. Forthisapplication,weusepubliclyavailablelivevideo streamsfromsixtrafcamerasdeployedatdifferenttrafintersectionsandroadsinJackson Hole,WY[7].Thesesixlivevideostreamshavedynamiccontents,diversetraf conditionsinthecityacrossspaceandtime.Toobtainarepresentativedataset,wesampled 2005-minutevideoclipsacross48hoursofvideostreamsfromeachofthesixcamerasata framerateof15FPSand1280 720Pframeresolution.Thesampledvideoclipscoverdiverse trafconditionsatdifferenttimeofthedayincludingrushhours,lighttraftime,andnight time. Application#2:CampusSurveillance. Duetolackofpubliclyavailablelarge-scaledatasets, weself-collectedadatasetfrom24surveillancecamerasdeployedonauniversitycampus. Thesecamerascoverdiverseareasofthecampus,includingindoorareassuchasdininghalls, cafeterias,andhallwaysaswellasoutdoorareassuchasuniversityentrances,buildingen- 26 trances,squares,streets,andtrafintersections.Forsecuritypurposes,someofthesurveil- lancecamerasareparticularlycoveringareaswherepeoplerarelyvisit.Similartothe Tr Monitoring application,wesampled2005-minute-longvideoclipsacross48hoursofvideo streamsfromeachof24camerasataframerateof15FPSand1280 720Pframeresolution toobtainarepresentativedataset. Baselines. WecompareDistreamagainstthreebaselinesthatcoverallthreetypesofarchitectures listedinTable1.1. Centralized .Thisbaselinerepresentsthecentralizedapproachadoptedbyanumberofexist- inglivevideoanalyticssystemssuchas VideoStorm [86], Chameleon [40]and NoScope [44] wherevideostreamsaresenttoedgeclusterforprocessing. Camera-Only .Thisbaselinerepresentstheapproachattheotherendofthespectrumwhere videostreamsareonlyprocessedlocallyonsmartcameras(e.g.,FilterForward[13]). VideoEdge-Lossless(VideoEdge-L) . 
VideoEdge [35] is the status quo live video analytics system that utilizes compute resources across cameras and cluster to process video streams in a distributed manner. Different from Distream, VideoEdge is workload-agnostic and aims to achieve an optimized resource-accuracy tradeoff, which may trade for resources with a loss of accuracy. Since Distream does not sacrifice accuracy, for a fair comparison, we implemented VideoEdge with the resource-accuracy tradeoff disabled, which we refer to as VideoEdge-Lossless (VideoEdge-L) and use it as our third baseline. As VideoEdge is workload agnostic and the only prior knowledge given is the compute cost of each classifier in the DAG as well as the compute capacity ratio of cameras and cluster, we partitioned the DAG and placed the partitions according to the camera-cluster compute ratio to implement VideoEdge-L.

Evaluation Metrics. We use three metrics to evaluate the performance of Distream and the baselines.

Throughput. Live video analytics systems need to process streaming video frames in a continuous manner, and high-throughput processing is essential to keeping up with the incoming video streams. In our case, workloads are inferences involved in the DAG. Thus, we use the number of inferences processed per second (IPS) to measure the throughput.

Latency. Live video analytics applications require producing analytics results within a short period of time. We thus use latency, which is defined as the elapsed time from when the workload is generated to when the inference result of the workload is produced, as our second metric to measure the system's responsiveness. As such, latency can be calculated as the sum of network latency, workload queuing time, and workload processing latency.

Latency Service Level Objective (SLO) Miss Rate. Lastly, we measure the latency service level objective (SLO) miss rate, which is defined as the percent of the workloads that do not meet the latency requirement set by a live video analytics application. In this work, we set the latency SLO to 3 seconds, which is a reasonable requirement for live video analytics systems.

1.5.2 Overall Performance

We begin with comparing the overall performance of Distream and baselines on both applications. Since the Traffic Monitoring dataset involves six video streams, we use six smart cameras (4 TX1 and 2 TX2) with each allocated to one video stream and incorporate one GPU in the edge cluster. We will scale to 24 smart cameras and a 4-GPU edge cluster when evaluating the scaling performance of Distream in §1.5.5.

Throughput and Latency. Table 1.2 and Table 1.3 list the throughput and latency of Distream and baselines on the Campus Surveillance and Traffic Monitoring applications respectively. Specifically, we report the average, 50th (median), 75th, and 99th percentile of throughput and latency performance gain over Camera-Only.

                 Throughput gain               Latency gain
                 avg    med    75th   99th    avg     med     75th    99th
  Distream       2.9×   3.1×   2.9×   2.5×    128.2×  189.3×  147.9×  112×
  VideoEdge-L    2.1×   2.2×   1.9×   1.5×    1.5×    2.4×    1.6×    1.4×
  Centralized    1.9×   2.0×   1.7×   1.3×    1.4×    2.2×    1.5×    1.2×

Table 1.2: Performance gain over Camera-Only (Campus Surveillance).

We have three major observations. First, Distream has achieved higher throughput than the baselines. Specifically, it achieves 2.9× and 1.6× average throughput gain on the two applications, while the best baseline (VideoEdge-L) only achieves 2.1× and 1.2×. Second, Distream achieves higher peak throughput (99th) than the baselines. The higher peak throughput indicates that Distream is able to better utilize the distributed compute resources to survive through the workload bursts such as the ones illustrated in Figure 1.2. Third, Distream achieves significant latency reduction compared to the baselines. In particular, Distream achieves 128.2× and 184× average latency reduction, while VideoEdge-L only achieves 1.5×.
Figure 1.10 and Figure 1.11 provide a more comprehensive view of the comparison by plotting the full distribution of throughput and latency, including the 1st, 25th, 50th (median), 75th and 99th percentiles in the box plot. In terms of throughput, Distream has a much wider throughput range, which indicates that Distream is able to handle a much wider range of amounts of workloads than the baselines. In terms of latency, the average latency of Distream for the Campus Surveillance and Traffic Monitoring applications is 0.34s and 0.27s, which is much less than the baselines, which range from 28.54s to 43.93s for Campus Surveillance and 9.98s to 49.68s for Traffic Monitoring. This result indicates that the throughput improvement in Distream does not come at the cost of its latency.

                 Throughput gain               Latency gain
                 avg    med    75th   99th    avg     med     75th    99th
  Distream       1.6×   1.6×   1.9×   2.0×    184×    604.3×  446.8×  52.6×
  VideoEdge-L    1.2×   1.2×   1.2×   1.2×    1.5×    1.8×    1.5×    1.2×
  Centralized    1.1×   1.1×   1.1×   1.1×    1.2×    1.4×    1.2×    1.1×

Table 1.3: Performance gain over Camera-Only (Traffic Monitoring).

Figure 1.10: Throughput and latency distribution on Campus Surveillance.

Figure 1.11: Throughput and latency distribution on Traffic Monitoring application.

Latency SLO Miss Rate. Figure 1.12 compares the latency SLO miss rate of Distream against the baselines. As shown, Distream outperforms the baselines by a large margin. In particular, Distream is able to achieve a near-zero latency SLO miss rate (0.7% and 1.5%) on the Campus Surveillance and Traffic Monitoring applications respectively, while the baselines have much higher miss rates (at least 60.7% for Campus Surveillance and 93.8% for Traffic Monitoring).

Figure 1.12: Latency SLO miss rate comparison.

Why Distream Outperforms the Baselines? Distream outperforms both Centralized and Camera-Only because Distream is able to leverage the compute resources at both the camera and edge cluster sides while Centralized and Camera-Only could not. For VideoEdge-L, Distream is able to achieve better performance for the following two reasons:

Reason #1: Higher Resource Utilization. In terms of throughput, Distream outperforms the baselines because it is able to utilize the idle compute resources at both cameras and the edge cluster to process the workloads to enhance the throughput. To see this, we continuously track the workload imbalance index across smart cameras and the edge cluster. As shown in Figure 1.13(a), Distream is able to achieve a near-zero workload imbalance for about 60% of the time. In contrast, VideoEdge-L is experiencing significant workload imbalance (more than 200% imbalance index) for 90% of the time for being agnostic to the dynamics of the workloads.

Reason #2: Less Accumulated Workloads. In terms of latency, Distream outperforms the baselines because with higher compute resource utilization, Distream is able to process the workloads at a much faster rate than the baselines, preventing the workloads from being accumulated at cameras or edge cluster. As such, Distream has much fewer workloads waiting in the queue, which reduces the workload queuing time and hence the latency. To see this, we continuously track the accumulated workloads in the local queue at each camera and the edge cluster. As shown in Figure 1.13(b), Distream is able to achieve a near-zero amount of accumulated workloads for about 80% of the time. In contrast, VideoEdge-L is experiencing a high amount of accumulated workloads for most of the time. As such, the workload queuing time in Distream is much lower, which is the root cause of its low latency.

Figure 1.13: Illustration on why Distream outperforms baselines.

1.5.3 Component-wise Analysis

The design of Distream enables adaptive cross-camera workload balancing and adaptive camera-cluster workload partitioning. In this section, we implemented two breakdown versions of Distream to take a closer look at the contribution of each component.
Distream-L hasadaptivecross-cameraworkloadbalancing,buthasstaticcamera-clusterwork- loadpartitioningbasedonthecomputecapacityratioofcamerasandcluster. Distream-P hasadaptivecamera-clusterworkloadpartitioning,butdoesnotenablecross- cameraworkloadbalancing. Table1.4showsthecomparisonresultsonthe CampusSurveillance application.Asshown, theperformancegainsbroughtbyboth Distream-L and Distream-P aresimilarand Onaverage, Distream-L achieves2 : 4 throughputgainand5 : 9 latencyreductionrespectively comparedto Camera-Only . Distream-P achieve2 : 5 throughputgainand6 : 3 latencyreduc- tionrespectivelycomparedto Camera-Only .Thisresultindicatestheimportanceofbothadaptive cross-cameraworkloadbalancingandadaptivecamera-clusterworkloadpartitioningtothewhole 32 system.Distreamisabletocombinetheadvantagesofboth Distream-L and Distream-P to achievebetterperformancethanusingonlyoneofthem. ThroughputLatency avgmed75th99th avgmed75th99th Distream-L 2 : 4 2 : 5 2 : 3 2 : 2 5 : 9 8 : 2 5 : 5 5 : 2 Distream-P 2 : 5 2 : 5 2 : 3 2 : 1 6 : 3 9 : 2 6 : 6 5 : 1 Table1.4:Component-wiseanalysisofDistream.PerformancegainsareoverCamera-Only. 1.5.4SensitivityAnalysis ThedesignofDistreaminvolvestwokeysystemhyper-parameters:(i)thecross-cameraworkload balancingthreshold b (§1.3.2),and(ii)thecamera-clusterworkloadpartitionschedulinginter- val g (§1.3.4).Inthissection,weevaluatetheimpactofthesesystemhyper-parametersonthe performanceofDistream.Inaddition,sincecross-cameraworkloadbalancingrequiresmigrating workloadsthroughthenetwork,wealsoevaluatetheimpactofnetworkbandwidthontheperfor- manceofDistream. ImpactofCross-CameraWorkloadBalancingThreshold b .Figure1.14(a)showsDistream's sensitivitytothecross-cameraworkloadbalancingthreshold b .Weonlyshow99thpercentile throughput(lefty-axis,black)and99thpercentilelatency(righty-axix,red)becausetheyare sensitivetohyper-parameterchangesandcanbettertheimpactonthesystemperformance. Weobservethatwhen b isbelow20%,theperformanceofDistreamdoesnotchangemuch.When b exceeds20%,the99thpercentilethroughputbeginstodropand99thlatencyincreases.Thisis becausealarger b wouldtriggerloadbalancinglessoften.Asaresult,thesystemmaysufferfrom performancelossduetoworkloadimbalance.Therefore,weset b tobe20%. ImpactofCamera-ClusterWorkloadPartitionSchedulingInterval g .Figure1.14(b)shows Distream'ssensitivitytothecamera-clusterworkloadpartitionschedulinginterval g .Weseethat 33 the99thpercentilethroughputand99thpercentilelatencydonotchangemuchwhentheinterval islessthan0 : 5s.When g isbetween0 : 5sto1 : 5s,thesystemperformancehasslightlydeclined. When g isgreaterthan1 : 5s,thesystemperformancedeclinesastheintervalincreases. Therefore,weset g to0 : 5s. (a)Impactofcross-cameraworkloadbalancing threshold b . (b)Impactofcamera-clusterworkloadpartition scheduleinterval g . Figure1.14:Impactofsystemhyper-parametersontheperformanceofDistream. ImpactofNetworkBandwidth .WealsoexamineDistream'ssensitivitytonetworkbandwidth. WeevaluateDistreamwithnetworkbandwidthat10Mbps(lowbandwidth)and50Mbps(high bandwidth),whichcoversthebandwidthrangeinstandardvideosurveillancesystemsonthemar- ket.Table1.5listsourevaluationresults.Asshown,boththethroughputandlatencyarevery similarbetween10Mbpsto50Mbps.ThisresultindicatesthatDistreamisrobusttonetworkband- widthchangesandisabletoachievehighperformanceevenwithanetworkbandwidthof10Mbps. 
                           Throughput (IPS)                 Latency (s)
                         avg     med    75th   99th      avg    med    75th   99th
Low Bandwidth (10Mbps)   489.1   509    590    647       0.34   0.09   0.47   2.05
High Bandwidth (50Mbps)  490.9   510    593    650       0.31   0.07   0.41   1.75
Table 1.5: Performance of Distream under low (10Mbps) and high (50Mbps) network bandwidths.

1.5.5 Scaling Performance
To evaluate the scaling performance of Distream, we use the Campus Surveillance dataset and examine the throughput and latency as Distream scales up its number of smart cameras from 6 to 12, 18, and 24. To match the compute resources with the new workloads brought by the new cameras, we add one GPU to the edge cluster whenever six smart cameras are added to the system.
Figure 1.15 illustrates the scaling performance of Distream in terms of throughput (upper) and latency (lower). As shown, Distream is able to scale its throughput up nearly linearly. Specifically, Distream achieves a 3.95x increase in average throughput, from 489.1 IPS when processing videos streamed from 6 cameras to 1931.4 IPS when processing videos streamed from 24 cameras. Meanwhile, Distream is able to maintain a similarly low latency when scaling up.

Figure 1.15: Scaling performance of Distream.

1.5.6 System Overheads
The system overheads incurred by the design of Distream come from two sources. The first source comes from solving the optimization problems for cross-camera workload balancing in §1.3.2 and identifying the optimal DAG partition points in §1.3.4. Their overheads are 16us and 4us with six cameras and one GPU, and 21us and 8us when scaled up to 24 cameras and 4 GPUs. This result indicates that our proposed heuristics are highly efficient and incur negligible overheads. The second source of overhead comes from the RPC calls for migrating workloads across smart cameras (§1.3.4). As shown in Table 1.6, even under the low bandwidth condition (i.e., 10Mbps), each RPC call only takes 2.1ms on average for workload migration. When we employ batched RPC, this overhead is amortized as the RPC batch size increases, making the overheads of cross-camera workload balancing negligible.

Batch Size   Total Migration Time   Avg RPC Overhead
1            2.1ms                  2.1ms
10           7.4ms                  0.74ms
20           9.2ms                  0.46ms
Table 1.6: Cross-camera workload balancing overheads of Distream.

1.5.7 Benefits to Existing Systems
As summarized in Table 1.1, existing live video analytics systems are agnostic to the workload dynamics in real-world deployments. Instead, they focus on designing different techniques to optimize the resource-accuracy tradeoff of live video analytics systems. In this section, we add the workload adaptation techniques proposed in Distream onto VideoEdge [35] to examine how Distream could enhance its performance. We select VideoEdge instead of the other status quo live video analytics systems listed in Table 1.1 because VideoEdge is also a fully distributed framework that partitions the video analytics pipeline across cameras and cluster.
Figure 1.16 compares the performance between VideoEdge and VideoEdge+Distream. As illustrated, VideoEdge+Distream is able to generate a better resource-accuracy trade-off curve than VideoEdge. In particular, Distream is able to boost the accuracy by 21% when compute resource is limited, and by 7% when compute resource is adequate. Such improvement is made possible by the higher resource utilization brought by the workload adaptation mechanism of Distream. With higher resource utilization, VideoEdge does not need to sacrifice as much accuracy as before to trade for resources.

Figure 1.16: Resource-accuracy tradeoff curve comparison between VideoEdge and VideoEdge+Distream.

1.6 Related Work
Live video analytics is a killer application not only for mobile vision [36, 21, 20, 42] but also for large-scale distributed settings. Our work is closely related to DL-based live video analytics systems with distributed cameras. A majority of these systems are designed to process video streams in a centralized cluster. For example, Focus [32] customizes small DL classifiers to generate object indexes for each video frame to enable fast video querying.
NoScope [44]usesspecializedDL 37 modelstoreduceinferenceoverheadsforthroughputgainwithsmallaccuracyloss.Besidesthem, Chameleon [40], VideoStorm [86],and AWStream [85]leverageresource-accuracytradeoffs forsystemoptimization. Chameleon [40]reducestheoverheadcausedbysearchingforoptimal tradeoff VideoStorm [86]exploitsthevarietyofqualityandlaggoalsinvideo analyticstasksandoptimizescomputeresourcesinGPUclusters. AWStream [85]isfocusedon adaptingtonetworkbandwidthvariationtoachievelow-latencyhigh-accuracywide-areastream- inganalytics.Differentfromtheseworks,insteadoftradingaccuracyforlesscomputeresource consumption,Distreamachieveshighthroughputandlowlatencybymaximizingtheresourceuti- lizationacrosscamerasandedgeclusterwithouttheaccuracy. TheclosestworktoDistreamis VideoEdge [35].AlthoughDistreamand VideoEdge are bothbuiltuponadistributedarchitecturethatinvolvessmartcamerasandedgeclusterforlive videoanalytics,theyaredifferentinthreeaspects.First,Distreamfocusesonenablingworkload- adaptivelivevideoanalyticswhile VideoEdge isworkloadagnostic.Second, VideoEdge targets optimizingresource-accuracytradeoffwhileDistreamfocusesonmaximizingthroughputandat- taininglatencySLOwithoutcompromisingaccuracy.Third, VideoEdge performsvideoanalytics pipelineplacementacrosscameras,edgeandcloudwherethedynamicnetworkbandwidthisthe mainbottleneck.Incontrast,Distreamperformsworkloadallocationacrosscamerasandedge clusterbasedonworkloaddynamicstoeliminatethebottleneckincomputeresources.Infact, Distreamiscomplementaryto VideoEdge :asshownin§5.7,Distreamisabletoprovideabetter resource-accuracytradeoffcurvewhenintegratedwith VideoEdge . 38 1.7ConclusionandFutureWork Inthispaper,wepresentthedesign,implementation,andevaluationofDistream,adistributed workload-adaptivelivevideoanalyticssystembasedonthesmartcamera-edgeclusterarchitecture. Distreamaddressesthekeychallengesbroughtbyworkloaddynamicsinreal-worlddeployments, andcontributesnoveltechniquesthatcomplementexistinglivevideoanalyticssystems.Wehave implementedDistreamandconductedarichsetofexperimentswithatestbedconsistingof24 camerasanda4-GPUedgeclusterontworeal-worlddistributedvideodatasets.Ourexperimental resultsshowthatDistreamconsistentlyoutperformsthestatusquointermsofthroughput,latency, andlatencySLOmissrate.Therefore,webelieveDistreamrepresentsacontributionto enablinglivevideoanalyticsatscale.Inthecurrentform,Distreamtreatscross-cameraworkload balancingandcamera-clusterpartitioningastwoseparatecomponents.Weplantoworkonajoint solutionasourfuturework.Wewillalsoworkondesigningcommonprogrammingandnetwork abstractiontosupportlivevideoanalyticsapplicationdevelopmentbasedonourframework,and plantoexpandourframeworktothewirelesssettings. 39 Chapter2 CollaborativeDeepLearningTrainingon theEdge 2.1Introduction Therecentpasthaswitnessedthesuccessofdeeplearning(DL)inawidespectrumofareasinclud- ingcomputervision[18],automaticspeechrecognition[29],andnaturallanguageprocessing[76]. Partofthissuccessisattributedtothehighlyefdistributedtrainingsystems[51,84,87] thattrainDLmodelsinparallelmachinesinsidedatacentersownedbylargeorganizationssuchas Google,Facebook,Amazon,andMicrosoft.Asintelligenceismovingfromdatacenterstodevices, intelligentedgedevicessuchassmartphones,drones,robots,andsmartcamerasareequippedwith machineintelligencepoweredbydeeplearningtonotonlyperformon-deviceinferencesbutalso altogethertrainaDLmodelfromthedatacollectedbythemselves(i.e.,on-devicedistributedtrain- ing). 
Dependingontheapplicationsettings,on-devicedistributedtrainingcaningeneralbe intotwocategories: federatedlearning ,and collaborativelearning .Infederatedlearn- ing,dataprivacyisstrictlyenforced:duringtraining,participatingdevicesdonotsharetheirlocal datawitheachother.Inaddition,participatingdevicesareusuallylocatedatdifferentgeographical locationsandadifferentsubsetofthedevicesparticipateineachroundofthetrainingprocess.As 40 such,federatedlearningrequiresthesupportofacloudservertocoordinatethedistributedtraining processandselectwhichdevicestoparticipateineachround.Incontrast,incollaborativelearn- ing,dataprivacyisnotaconcerngiventhattheparticipatingdevicesusuallybelongtothesame individual,group,organization,orcompany.Moreover,participatingdevicesareedthrough- outthetrainingprocessandarelocatedwithinageographicallyclosearea.Thustheyareableto communicatewitheachotherviaalocalwirelessnetworkwithoutthesupportofacloudserver. Inthiswork,wefocusoncollaborativelearning.Thereareawiderangeofreal-worldap- plicationsthatarebuiltuponcollaborativelearning.Forexample,intheU.S.,eachconsumeron averageownsmorethanthreeconnecteddevices(e.g.,smartphones,tablets,smartwatches).In thissetting,personaldataaredistributedacrossallthoseconnecteddevices,andthereisnoprivacy concernsinceallthedevicesareownedbythesameperson.Withwirelessnetworkavailableat homeorintheofcollaborativelearningisthemostpracticalsolutiontocollaborativelylearn aDLmodelfromthosedistributedpersonaldata. Motivation&LimitationsofStatusQuo. Despiteitsconsiderablevalue,thekeybottleneckof makingcollaborativelearningsystemspracticallyusefulinreal-worlddeploymentsisthat they consumeaamountoftrainingtime .Therootcauseofsuchperformancebottleneck isthelimitedwirelessnetworkbandwidth:indatacenters,communicationbetweenparallelma- chinesisconductedviahigh-bandwidthnetworksuchas50GbpsEthernetor100Gbps Band[81].Incontrast,incollaborativelearningsystems,communicationbetweenparticipating devicesisconductedviawirelessnetworksuchasWi-Fi.Thebandwidthofwirelessnetwork, however,ismuchmoreconstrained,andcanbe 10,000timesless thanthebandwidthindatacen- ters.Suchlimitedbandwidthconsiderablyslowsdownthecommunicationandhencetheoverall trainingprocess. Our goal inthisworkistotacklethiscriticalbottleneckto enhancethetrainingef 41 ofon-devicecollaborativelearningwhileretainingtrainingcorrectnesswithoutcompromisingthe trainingquality (i.e.,accuraciesofthetrainedmodels).Althoughanumberoftrainingacceleration approacheshavebeenproposed[10,55,56,25,39,33],aswewilldiscussin§2.2indetail,these approacheseithercompromisethetrainingqualitytogaintrainingefyoraredesignedfor distributedtrainingindatacenterswheretheyachievelimitedtrainingefyenhancementin on-devicecollaborativelearninggiventhegapinnetworkbandwidthbetweenthedata centersettingandtheon-devicesetting. OverviewoftheProposedApproach. 
Thelimitationsofexistingapproachesmotivateustore- thinkthecollaborativelearningframeworkdesignbytakingadifferentpathfromexistingones.To thisend,wepresentDistream,animportancesampling-basedframeworkthatenablesefon- devicecollaborativelearningwithoutcompromisingthetrainingquality.Inexistingapproaches, eachparticipatingdeviceineachtrainingiteration randomly samplesamini-batchfromitslocal datatocomputeitslocalgradients.Thisrandomsamplingstrategy,atitscore,assumesthateach sampleinthelocaldataisequallyimportantduringmodeltraining.Inpractice,however, notall thesamplesinthelocaldatacontributeequallytomodeltrainingduetoinformationoverlapping amongdifferentsamples .Trainingonlessimportantsamplesnotonlycontributeslittletomodel accuracybutalsoconsiderablyslowsdownthetrainingprocess.Giventhat,insteadofrandom sampling,Distreamfocusesonsamplesthatprovidemoreimportantinformationineachiteration. Bydoingthis,thetrainingefyofeachiterationisimproved.Assuch,thetotalnumberof iterationscanbeconsiderablyreducedsoastospeedupthewholetrainingprocess. ThedesignofDistreaminvolvesthreekeychallenges. Challenge#1 :Importancesamplingincurscomputationcost,andwithoutcarefuldesign,could easilyovershadowtheitbrings.Reducingthecomputationcostofimportancesampling 42 withoutaffectingitseffectivenessandcompromisingthetrainingqualityrepresentsa cantchallenge. Challenge#2 :Sinceimportancecomputationisperformedonthelocaldataofeachdevice, theimportanceonlythelocalimportancedistributionwithineachdeviceratherthan theglobalimportancedistributionacrossallthedistributeddevices.Asaresult,adevicemay repeatedlylearngloballytrivialsamples,whichconsiderablylowersthetrainingefy. Challenge#3 :Eventhoughthecomputationcostofimportancesamplingcanbereduced,the costisstillnotzero.Moreover,comparedtoEthernetorusedinthedatacenter setting,wirelessnetworkintheon-devicesettingismoresusceptibletointerferenceandthus itsbandwidthcouldexperiencevariationsovertimeinreal-worlddeployments.Thedesignof Distreamshouldtakebandwidthvariationintoaccountaswell. Toaddressthechallenge,Distreamincorporatesa group-wiseimportancecomputationand samplingscheme thatcutsthecomputationcostofimportancesamplingbyonlyre-computingthe importanceofasubsetofsampleswithineachiteration.Moreover,byconstructingthemini-batch basedonthere-computedimportanceinastochasticmanner,thetrainingcorrectnessisguaranteed. Toaddressthesecondchallenge,Distreamincorporatesan importance-awaredataresharding scheme thatefcientlyreshufthedataamongdevicestobalancetheimportancedistribution. Itachievesthesameeffectwithmuchloweroverheadcomparedtoimportance-agnosticdatare- shardingbyprioritizingthetransferofmoreimportantsamples. Toaddressthethirdchallenge,Distreamincorporatesa bandwidth-adaptivecomputation-communication scheduler thatschedulestheexecutionofimportancecomputationanddatareshardinginabandwidth- adaptivemannertofurtherimprovethespeedupbycompletelymaskingoutthecostsofimportance samplinganddataresharding. 43 Implementation&EvaluationResults. WeimplementedDistreamusingTensorFlow1.10and deployeditonaself-developedtestbedthatconsistsof12NVIDIAJetsonTX1asedgedevices. 
WeevaluateDistreamonsixcommonlyuseddatasetsusingadiversesetofDLmodelsacrossthree mostimportanttaskdomains:computervision,speechrecognition,andnaturallanguageprocess- ing.WecompareDistreamagainsttwostatusquobaselines TicTac [25]and AdaComm [78].Our resultsshowthatDistreamconsistentlyoutperformsthebaselinesintrainingefy,achieving 3 : 7 ,3 : 2 ,4 : 0 ,2 : 2 ,2 : 7 ,1 : 8 onthesixdatasets.Moreover,Distreamisabletomain- taintrainingspeedupasthenumberofparticipatingdevicesscalesup,andisrobusttowireless bandwidthvariationsinreal-worlddeployments. Insummary,wemakethreemajorcontributions: Tothebestofourknowledge,Distreamistheefon-devicecollaborativelearning frameworkbasedonimportancesamplingthatachieveshightrainingspeedupwhileretaining trainingcorrectnesswithoutcompromisingtheaccuraciesofthetrainedmodels. Weprovideaperformancemodeloftheproposedimportancesampling-basedcollaborative learningframework.Guidedbytheperformancemodel,weproposethreenoveltechniques, namely,group-wiseimportancecomputationandsampling,importance-awaredataresharding, andbandwidth-adaptivecomputation-communicationscheduling,whichaltogetherfullyex- ploitthebenbroughtbyimportancesampling.Wealsoprovideatheoreticalproofonthe trainingcorrectnessofDistream. WeimplementedDistreamanddeployeditonaself-developedtestbed.Wedemonstrateits effectivenessandshowthatDistreamconsistentlyoutperformstwostatusquoframeworkson sixdatasetsacrosscomputervision,speechrecognition,andnaturallanguageprocessing. 44 Figure2.1:Themulti-parameterserverarchitectureofon-devicecollaborativelearningsystems. 2.2BackgroundandMotivation Inthissection,weintroducethearchitectureofon-devicecollaborativelearningsystems (§2.2.1).Wethendiscusstheexistingtechniquesforenhancingtheirtrainingefyfollowed byexplainingtheirlimitations,whicharethekeymotivationofthiswork(§2.2.2). 2.2.1ArchitectureofOn-DeviceCollaborativeLearningSystems Similartodistributedtrainingindatacenters,on-devicecollaborativelearningalsousesStochastic GradientDescent(SGD)oritsvariantstotraintheDLmodelinaniterativemanner.Ineach SGDiteration,eachdevicerandomlysamplesamini-batchfromitslocaldatatocomputeitslocal gradients.Theselocalgradientsarethenaggregatedfromthedistributeddevicestoupdatethe modelparameters w as: w t + 1 = w t h t 1 K K å k = 1 g k (2.1) where g k = 1 j B k j å i 2 B k Ñ l ( x i ; w t ) arethegradientsatmachine k and B k isthemini-batchofma- chine k . Moderndistributedtrainingsystemsuseparameterserver(PS)architectureforgradientaggre- 45 gation.InPS[87,80,52],oneormoremachinesplaytheroleofserverstoaggregatecomputed gradientsfromworkermachines,updatethemodel,andsendtheupdatedmodeloraggregated gradientsbacktoallworkers.AsshowninFigure2.1,Distreamadoptsamulti-PSarchitecture andtreatseachenddeviceasaparameterservertoo.Bydoingthis,thecommunicationisnot bottleneckedatasingledevicegiventhelownetworkbandwidthintheon-devicesetting. 2.2.2ExistingApproachesandTheirLimitations Thetotaltrainingtimeofcollaborativelearning T canbegenerallyestimatedasthetotalnumber oftrainingiterationsuntilconvergence E multipliesthesumofthetimeconsumedbylocalcom- putation T cp (e.g.,computethegradientsofthemodelparameters)andthecommunicationtime consumedbytransmittingthemodelgradientsbetweentheparameterserverandtheedgedevice T cm ineachiterationasfollows: T = E ( T cp + T cm ) : (2.2) Toreducethetotaltrainingtime T ,existingapproachescanbeingeneralgroupedintothefollow- ingthreecategories. 
GradientCompression .Toreduce T ,onecommonapproachistoreducethecommunicationtime ineachiteration T cm inEq.(2.2).Thisisachievedbyquantizinggradientsusingsmallernumber ofbits[81,10,55]orselectingimportantgradientstotransfervia[79,56,33].Com- municationinon-devicesettingisconductedthroughwirelessnetworkswhosebandwidthismuch moreconstrainedthanEthernetorusedindatacenters.Givensuchlimitedbandwidth, thecontributionofgradientcompressiontoreducing T cm inEq.(2.2)islimited.Althoughmany 46 worksadoptaggressivegradientcompressionthatisabletopushthelimit,theygaintrainingef ciencybycompromisingthetrainingqualityandhencetheaccuracyofthetrainedmodel, whichcontradictsthegoalofthiswork. Local-updateSGD .IndistributedSGD,globalsynchronizationamongnodesisperformedineach iterationforexchangingandaveraginggradientinformation,whichisanexpensiveandlengthyop- erationinbandwidth-constrainednetwork.Atechniquecalledlocal-updateSGD[89,19,62,78] isproposedtoreducethecommunicationoverhead.Theideaistoalloweachnodetoperform multiplelocalupdatesbeforeglobalsynchronizationineachiteration.Similartothegradient compressionapproach,althoughaggressivelyperformingmultiplelocalupdatesgainstrainingef- ybyreducingcommunicationiterations,itagaincompromisestheaccuracyofthetrained model. Figure2.2:Comparisonincommunication-computationoverlappingbetweenhighbandwidthand lowbandwidthsetting. Communication-ComputationOverlapping .DLmodelsingeneralcanberepresentedasdi- rectedacyclicgraphs(DAGs).ThestructuresofDAGsprovideanopportunitytooverlapgradi- entcomputationandgradientaggregationinapipelinedfashiontomaskoutthecommunication cost.Thisoptimizationtechniquecanbeseenasamechanismtoreducethesum ( T cp + T cm ) in Eq.(2.2).Suchreductioncanbeachievedeitherbyidentifyingtheoptimalgradienttransferor- 47 (a)#iterations=0 (b)#iterations=200 (c)#iterations=600 Figure2.3:Distributionsofdataimportanceasthenumberofiterationsincreasesduringtraining. der[25,39,17,87]orusingstaledmodelweights[54].Althoughthiscomputation-communication overlappingtechniquedoesnotcompromisetheaccuracyofthetrainedmodel,thelimitedband- widthinon-devicecollaborativelearningcausesthecommunication-to-computationratiotobe muchhigherthantheoneindatacentersetting.Asaresult,theeffectivenessofoverlapping techniquesdiminishesbecauseitisnolongerabletomaskoutthecommunication costundersuchhighcommunication-to-computationratio.Todemonstratethis,Figure2.2shows thecostbreakdownofoverlappingtechnique.Asshown,althoughitprovidescostre- duction(47%)underhighbandwidthnetworkindatacentersetting,itonlyreduces15%inlow bandwidthnetworkinon-devicecollaborativelearningsetting. 2.3DistreamOverview ThelimitationsofexistingapproachesmotivateustodesignDistreambytakingadifferentpath. Inthissection,weintroducethekeyinsightthatDistreamisdesignedupon(§2.3.1).We thenintroducearudimentaryframeworkdesigneduponthekeyinsightanditsperformancemodel (§2.3.2).Lastly,weoverviewthetechniqueswedesignundertheguidanceoftheperfor- mancemodelinDistreamthattransformtherudimentaryframeworkintoahighlyefonefor on-devicecollaborativelearning(§2.3.3). 
48 2.3.1KeyInsight ThekeyinsightbehindthedesignofDistreamisto exploitdataeftoimprovethetrain- ingefofon-devicecollaborativelearning .Instandarddistributedtraining,ineachSGD iteration,eachdevice randomly samplesamini-batchfromitslocaldatatocomputeitslocalgra- dients.Thisrandomsamplingstrategy,atitscore,assumesthateachdatasampleinthelocaldata isequallyimportantduringthewholemodeltrainingprocess.Inpractice,however, notallthe datacontributeequallytomodeltraining .Someofthemprovidemoreinformationtothetraining andthusaremoreimportant,whiletheothersprovidelessinformationandthusarelessimportant. Asanexample,Figure2.3illustratestheimportancedistributionofthedatainCIFAR-10asthe numberofiterationsincreasesduringtraining.Asshown,moreandmoredatasamplesbecome lessimportantasthetrainingprocessproceeds.Inparticular,after200iterations,theimportance ofdataquicklyconcentratestoasmallportionofdatasamplesandtherestcontributelittletothe modeltraining.ThisobservationshedslightonthelowtrainingefyofstandardSGDwhere mini-batchisgeneratedbyrandomsampling. Basedonthisinsight,weproposetoincorporate importancesampling [91]toimprovethe trainingef ofeachSGDiteration.Thekeyideaofimportancesamplingisthatineach trainingiteration,itfocusesonsamplesthatprovidemoreimportantinformationandcontribute moretothemodeltraining. Formally,let x denoteasampleinthemini-batch, p denotetheuniformdistributionadopted byrandomsampling,and q denoteanewdistributionadoptedbyimportancesampling.Toensure trainingcorrectness,thegradient Ñ l ( x ) (hereweleave w t outforsimplicity)shouldbemultiplied 49 Figure2.4:Aconceptualcomparisonbetween(a)randomsamplingand(b)importancesampling. by p ( x ) = q ( x ) whenshiftingfromrandomsamplingtoimportancesampling[9]: E x ˘ q p ( x ) q ( x ) Ñ l ( x ) = E x ˘ p [ Ñ l ( x )] if q ( x ) > 0whenever p ( x ) > 0.Toensuregradientvariancereductiontoacceleratethetraining process, q ( x ) shouldbeproportionaltothesample'sgradientnorm( jj Ñ l ( x ) jj )[9]: q ( x ) µ jj Ñ l ( x ) jj : Theabovederivationimpliesthat asample'sgradientnormcanbeusedasanindicatorofits importance .Thelargergradientnormasamplehas,themorecontributionthesamplecanmaketo themodeltraining. Figure2.4providesaconceptualcomparisonbetweenrandomandimportancesampling.The bluesolidcurvedepictsthedistributionoftheimportancevaluesofsamples.Inrandomsampling, eachsamplehastheequalprobabilitytobeselectedregardlessofitsimportance.Assuch,if therearemoresampleswithlowerimportanceinadataset,sampleswithlowerimportancehave higherlikelihoodtobeselected(greendottedcurveinFigure2.4a).Incontrast,inimportance sampling,sampleswithhigherimportancehavehigherlikelihoodtobeselected(greendotted 50 curveinFigure2.4b).Assuch,themini-batchgeneratedbyimportancesamplingcarriesmore importantinformationthanrandomsampling.Suchadvantageenhancesthetrainingefy withineachtrainingiteration,whichinturnreducesthetotalnumberofiterationsandspeedsup thewholetrainingprocess. 
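To make the sampling and reweighting above concrete, the following minimal NumPy sketch draws a mini-batch with probability proportional to a per-sample importance score and applies the p(x)/q(x) correction so that the gradient estimate remains unbiased. This is an illustration only, not Distream's implementation; all names are hypothetical, and the per-sample loss is used here merely as a cheap stand-in for the gradient norm.

```python
import numpy as np

def importance_sample_batch(importance, batch_size, rng=None):
    """Draw a mini-batch with probability proportional to per-sample importance.

    Returns the selected indices and the p(x)/q(x) correction weights that keep
    the gradient estimate unbiased (p is the uniform distribution, q is the
    importance-based one).
    """
    rng = rng or np.random.default_rng()
    n = len(importance)
    q = importance / importance.sum()          # importance-based sampling distribution q(x)
    idx = rng.choice(n, size=batch_size, replace=True, p=q)
    weights = (1.0 / n) / q[idx]               # p(x)/q(x) correction for each drawn sample
    return idx, weights

# Toy usage: treat per-sample losses as a proxy for gradient norms.
losses = np.abs(np.random.randn(1000)) + 1e-6  # importance scores (must be positive)
idx, w = importance_sample_batch(losses, batch_size=32)
# A training step would average the per-sample gradients scaled by w, e.g.
# g = (w[:, None] * per_sample_grads[idx]).mean(axis=0)
```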
2.3.2RudimentaryFrameworkandPerformanceModel Inspiredbythebroughtbyimportancesampling,weproposeanimportancesampling- basedframeworkforon-devicecollaborativelearning.Ourframeworkisbuiltuponthestandard distributedSGD.Figure2.5aillustratestheoperationsinvolvedineachSGDiteration.Asshown, theonlydifferencebetweenthemisthattheimportancesampling-basedframeworkreplaces ran- domsampling (i.e.,randomlyselectsamplestoconstructthemini-batch)inthestandarddistributed SGDwithtwonewoperations: importancecomputation whichcomputestheimportance(i.e.,gra- dientnorm)foreverysampleinthelocaldataset,and importancesampling whichselectssamples basedontheircomputedimportancetoconstructthemini-batch. Theperformancemodelofsuchimportancesampling-basedframeworkcanbeformulatedas follows:let T is denotetheincreasedlocalcomputationcostineachSGDiterationcausedbyimpor- tancesampling,and E is denotethecorrespondingtotalnumberofiterationsuntilconvergence.The trainingspeedupofimportancesampling-basedframeworkoverstandardrandomsampling-based distributedSGDis: Speedup = E ( T cp + T cm ) E is ( T cp + T cm + T is ) = 1 E is E ( 1 + T is T cp + T cm ) : (2.3) 51 Figure2.5:ConceptualComparisonbetweenrudimentaryimportancesampling-basedframework andMercury. AsshowninEq.(2.3),theimportancesampling-basedframeworkcouldperformworsethan thestandardrandomsampling-baseddistributedSGD(i.e., Speedup < 1)ifthecomputationcost incurredbyimportancesampling T is overshadowsthebroughtbythereductionoftotal trainingiterations(i.e., E is < E ). 2.3.3OverallDesign Figure2.5billustratestheoveralldesignoftheproposedDistreamframework.Comparedtothe rudimentaryframeworkillustratedinFigure2.5a,Distreamincorporatesthreekeyinnovations guidedbytheperformancemodelinEq.(2.3)whicheffectivelyavoidthepitfallsoftherudimen- taryframework.First,Distreamincorporatesagroup-wiseimportancecomputationandsampling techniquethatconsiderablyreducesthecomputationcostofimportancesamplingwithoutcom- 52 promisingitseffectivenessandtrainingquality.Second,itincorporatesanimportance-awaredata reshardingtechniqueforbalancingimportancedistributionamongedgedevicestofurtheracceler- atethetrainingprocess.Lastly,itincorporatesabandwidth-adaptivecomputation-communication schedulerthatschedulestheexecutionofimportancecomputationanddatareshardinginparallel inabandwidth-adaptivemannersuchthattheircostscanbecompletelyhiddenbehindthestandard distributedSGDDistreamisbuiltupon. Inthefollowing,wedescribethedetailsofthesethreetechniquesin§2.4.1,§2.4.2and§2.4.3, andproveDistream'strainingcorrectnessin§2.4.4. 2.4DesignDetails 2.4.1Group-wiseImportanceComputationandSampling ThekeytechniqueinDistreamis group-wiseimportancecomputationandsampling :atech- niquethatconsiderablyreducesthecomputationalcostofimportancesamplingwithoutcompro- misingitseffectivenessandunbiasedconvergence.Inthestandardimportancesampling-based frameworkintroducedin§2.3.2,ineachSGDiteration,theimportanceof all thelocaldataare computedtoderivetheirimportancedistribution.Theall-inclusivenessofthisapproachnaturally (a)Distributionofrankdifferenceafter1iteration (b)Distributionofrankdifferenceafter10iteration Figure2.6:Distributionsofrankdifference. 53 leadstocostlycomputation.Incontrast,ourgroup-wisestochasticimportancesamplingtechnique foregoessuchall-inclusivenesstocutthecomputationalcost:duringeachSGDiteration,itonly computestheimportanceofa subset (i.e.,group)ofthelocaldata.Toguaranteeunbiasedcon- vergence,itthenadoptsastochasticapproachtoconstructthemini-batchbasedonthecomputed group-wiseimportance. 
Ourtechniqueisinspiredbyanobservationthat theimportanceofeachsampleofthelocal datadoesnotchangeabruptlyacrossmultipleSGDiterations .Figure2.6depictsthedifference ofimportanceranksbetweentwoiterations.About50%ofdatasampleschangelessthan5%in importancerankingandabout70%ofdatasampleschangelessthan10%inimportanceranking afteronetrainingiteration(Figure2.6(a)).Evenafter10iterations,about65%ofdatasamples changelessthan10%inimportanceranking(Figure2.6(b)).Inotherwords,ifonesampleis asimportant,itwouldpossiblystayasanimportantsampleforthefollowingmultiple SGDiterations.Therefore,itisnotnecessarytore-computetheimportanceofeachsampleinthe localtrainingdatasetforeverySGDiteration.Instead,wecanreusetheimportancecomputation resultsfrompreviousiterationstofurthercutthecomputationcost. Basedonthisobservation,ourgroup-wiseimportancecomputationandsampling(Figure2.7) consistsoftwosteps: Step#1:ComputeGroup-wiseImportance .Inthestep,wesplitthecompletelocaldataset oneachedgedeviceinto D non-overlappinggroups.AteachSGDiteration,insteadofre-computing theimportanceforallthesamples,onlytheimportanceofsamplesinoneofthe D groupsisre- computed.Indoingso,thecomputationalcostofimportancesamplingbecomes1 = D oftheall- inclusiveoneperSGDiteration.Thegroupisselectedinaround-robinfashionsuchthatevery groupisselectedonceevery D SGDiterations. SinceweonlyupdatethesampleimportanceofonegroupineachSGDiterationandreusethe 54 Figure2.7:Illustrationofgroup-wiseimportancecomputationandsampling.Eachdotrepresents onesampleandthedarknessindicatesitsimportance. previouslycomputedimportancefortheremaining D 1groups,thereisadiscrepancybetween thederivedimportancedistributionandtheimportancedistributionderivedbyre-computingthe importanceofallthesamples.Thisisbecauseasmodelparametersusedtocomputethesample importanceareupdatedduetoSGD,thesampleimportancecomputedindifferentiterationsdonot belongtothesamedistribution. Step#2:ConstructMini-BatchbasedonGroup-wiseImportance. Tomitigatetheeffectcaused bythediscrepancy,inthesecondstep,weneedtoconsidertwolevelsofimportancetoconstruct amini-batch.Theisthe inter-grouplevelimportance ,whichtherelativefreshness ofagroupcomparedtotheothergroups.Thesecondisthe intra-grouplevelimportance ,which theimportanceofsampleswithinagroup. Toaccommodatethe inter-grouplevelimportance ,weselect d groupsoutof D groups, andassigntheprobabilityofselectinggroup i as: r i = exp ( b t i ) å D n = 1 exp ( b t n ) (2.4) where t i isthestepindexwhentheimportanceofgroup i isupdatedand b > 0istheamplifying factor.Alarger b encouragestheselectionofnewergroups.Toaccommodatethe intra-grouplevel 55 importance ,weselect j B j M i å d n = 1 M n fromeachselectedgroup i usingimportancesamplingwhere M i isthesizeofgroup i .Toguaranteeunbiasedconvergence,theprobabilityofasamplebeing selectedisproportionaltoitslosswithinitsgroup.Inotherwords,theprobabilityofselecting sample j ingroup i isgivenby: p i ; j = I i ; j å M i n = 1 I i ; n (2.5) where I i ; j istheimportanceofthesample j ingroup i .Finally,themini-batchisconstructedby combiningtheselectedsamplesfromselectedgroups.Toensureunbiasedness,wereweighthe gradientsamongsampleswhencomputingaveragegradientstoobtaintheunbiasedgradient asproposedin[46]. 
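The two steps above can be summarized in a short sketch. The code below is purely illustrative (class and variable names are hypothetical, and the per-sample loss again stands in for the true importance): the local data is split into D non-overlapping groups, only one group's importance is refreshed per iteration in round-robin order (Step #1), groups are then drawn with the freshness-weighted probabilities of Eq. (2.4), and samples within each chosen group are drawn proportionally to their cached importance as in Eq. (2.5) (Step #2).

```python
import numpy as np

class GroupwiseSampler:
    """Round-robin group refresh plus stochastic mini-batch construction (illustrative)."""

    def __init__(self, num_samples, num_groups, beta=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.groups = np.array_split(np.arange(num_samples), num_groups)  # D non-overlapping groups
        self.importance = [np.ones(len(g)) for g in self.groups]          # cached I_{i,j}
        self.last_update = np.zeros(num_groups)                           # t_i in Eq. (2.4)
        self.beta = beta                                                  # amplifying factor b
        self.step = 0
        self.next_group = 0

    def refresh(self, importance_fn):
        """Step #1: re-compute importance for one group only, in round-robin order."""
        i = self.next_group
        self.importance[i] = importance_fn(self.groups[i])                # e.g. per-sample loss
        self.last_update[i] = self.step
        self.next_group = (i + 1) % len(self.groups)
        self.step += 1

    def sample_batch(self, batch_size, num_groups_to_use=2):
        """Step #2: pick groups by freshness (Eq. 2.4), then samples by importance (Eq. 2.5)."""
        r = np.exp(self.beta * self.last_update)
        r /= r.sum()                                                      # inter-group probabilities r_i
        chosen = self.rng.choice(len(self.groups), size=num_groups_to_use,
                                 replace=False, p=r)
        sizes = np.array([len(self.groups[i]) for i in chosen])
        quota = np.maximum(1, (batch_size * sizes / sizes.sum()).astype(int))
        batch = []
        for i, k in zip(chosen, quota):
            p = self.importance[i] / self.importance[i].sum()             # intra-group probabilities p_{i,j}
            batch.extend(self.groups[i][self.rng.choice(len(p), size=k, p=p)])
        return np.array(batch)

# Toy usage: the importance function here simply returns random positive "losses".
sampler = GroupwiseSampler(num_samples=1000, num_groups=10)
sampler.refresh(lambda idx: np.abs(np.random.randn(len(idx))) + 1e-6)
batch_indices = sampler.sample_batch(batch_size=32)
```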
2.4.2Importance-awareDataResharding ThesecondkeytechniqueinDistreamis importance-awaredataresharding :atechniqueforbal- ancingimportancedistributionamongworkerstoacceleratethetrainingprocess.Asimportance updateisperformedonthelocaldatasetateachedgedevice,thelocalimportancerankwithineach edgedevicedoesnotecttheglobalimportancerankthataccumulativelyconsiderstheimpor- tanceofallthesamplesfromalledgedevices.Asaconsequence,anedgedevicemayrepeatedly learngloballytrivialsamples,whichconsiderablylowersthetrainingefy.Thisproblemis exacerbatedwhendataclassesarenotevenlydistributedacrossedgedevices,whichiscommonin on-devicesettings.Suchproblemcanberesolvedbydataresharding,atechniquethatrandomly redistributessamplesamongworkers.Indatacenters,datareshardingcanbeeasilyachieveddue totheavailabilityofhigh-bandwidthnetwork.However,inon-devicesettingswherethenetwork bandwidthismuchmoreconstrained,shufalargeamountofdataamongedgedevicessignif- 56 Figure2.8:Importance-agnosticdatareshardingvs.importance-awaredataresharding. icantlydelaysthetrainingprocess. Tothisend,weproposeimportance-awaredatareshardingthatonlyredistributesimportant samples.Insteadofblindlyshufdataamongworkers,weonlyselectnon-trivialsamplesto shuftomaximizedatareshardingefywithminimumcommunicationoverhead.Asshown inFigure2.8,supposeafterdatareshardingisdetermined,aworkerwith N samplesand D groups needstoswap N p < N sampleswithotherworkersandthebudgetonlyallows N o < N p samplesto betransferred.Forimportance-agnosticdataresharding,itdirectlytransfers N p sampleswithout takingthesampleimportanceinformationintoconsideration.Incontrast,importance-awaredata reshardingselects N o M i å j M j samplesfromeachgroup i wherethegroupsizeis M i .Theprobabilityof asamplebeingselectedisproportionaltoitsimportancewithinitsbelonginggroupasinEq.(2.5). Asaresult,weareabletoselectimportantsamplestoberedistributedwithoutchangeingroup sizeorlocaldatasetsize. 2.4.3BACCScheduler Anaiveimplementationofimportancesamplingforeachdeviceincollaborativelearningisshown inFigure2.9a.Theimportancecomputationanddatareshardingissequentiallyinsertedinthe originaltrainingprocess,incurringunnecessaryblockingandoverhead. 57 Figure2.9:(a)UnoptimizedPipelinevs.(b)Optimizedpipeline. w t representsthemodelweight atiteration t .R=R1+R2+R3+R4inlength. AsshowninFigure2.9b,Distreamincorporatesa bandwidth-adaptivecomputation-communication (BACC)scheduler whichschedulestheexecutionofimportancecomputationanddataresharding inparallelinabandwidth-adaptivemannertofurtherimprovethespeedupbycompletelymasking outthecostsofimportancesamplinganddataresharding.,sincegradientaggregation onlyusesnetworkresourceandgradientcomputationonlyusescomputeresource,thegroup-wise importancecomputationandimportance-awaredatareshardingcanbeoverlappedwithgradient aggregationandgradientcomputation,respectivelyandexecutedinparallel.Thiscanbeachieved bycreatingtwothreads,oneforstandarddistributedSGDoperationsandtheotherforgroup- wiseimportancecomputationandimportance-awaredataresharding.Indoingso,theoverheads incurredbythesetwotechniquescanbemaskedoutandthetrainingspeedupisfurtherimproved. Giventhatbothgradientaggregationanddatareshardingdependonnetworkbandwidthwhich 58 couldexperiencevariationsovertimeinreal-worlddeployments,achievingfulloverlappingthat completelymasksoutthecostsofimportancecomputationanddatareshardingrequiresthesched- ulertobebandwidth-adaptive.Weproposetwotoadapttothebandwidthvariation. 
First,tofullyoverlapimportancecomputationwithgradientaggregation,insteadofusingaed groupsize,Distreamadoptsvaryinggroupsizessuchthatcomputingtheimportanceofthesam- plesinagroupconsumesthesameamountoftimeasgradientaggregation.Tocompensateforthe reductionofimportancevarietyingroupsofsmallsizeandensureunbiasedness,theimportance ofeachdatasamplesampledfromgroupsofdifferentsizeswillbereweighedusingitsgroupsize. Second,tofullyoverlapdatareshardingwithgradientcomputation,Distreamadoptsbreakpoint- resumetechnique:datareshardingpauseswhengradientaggregationbeginsandresumeswhen itends.Bydoingthese,thecostsofimportancesamplingcanbecompletelyhiddenbehindthe standarddistributedSGDinabandwidth-adaptivemanner. Figure2.9billustrateshowourproposedBACCschedulerisabletoachievefulloverlappingto completelymaskoutthecostsofimportancecomputationanddataresharding.,after gradientcomputation,Thread1yieldsitscomputeresourcetoThread2toperformimportance computation.Aftergradientaggregationiscomplete,Thread2yieldsthecomputeresourcesto Thread1toperformgradientcomputationforthenextiteration t + 1.Similarly,aftergradient aggregation,Thread1yieldsnetworkresourcetoThread2toperformdataresharding.When gradientaggregationforthenextiterationbegins,Thread2pausesdatareshardingandyields networkresourcebacktoThread1toperformgradientaggregationforthenextiteration. 2.4.4ProofofTrainingCorrectness Lastly,duetopagelimitation,weprovidethetheoreticalresultshowingthatthecollab- orativelearningunderDistreamisguaranteedtoconvergetothesamesolutionasthestandard 59 distributedSGD. SketchofProof :weshowthatthestochasticgradientinformationaveragedattheserveris unbiased.Thenweshowthatitsvarianceisbounded. 1.Thegradientaveragedacrossallworkersisunbiased. Weuse Ñ l k ; i ; j torepresentthegradient fromthesample j ingroup i ofdevice k forthesimplicityofnotation.Usingtheproposedsampling techniquein§2 : 4 : 1,theexpectationofaggregatedgradient g is E ( g t )= 1 å K k = 1 N k K å k = 1 N k 1 N k D k å i = 1 M k ; i å j = 1 Ñ l k ; i ; j = Ñ l ( w t ) : where M k ; i isthenumberofsamplesingroup i atnode k ; N k and D k arethenumberofsamples andthenumberofgroupsatnode k .Itshowsthattheaveragedgradientisunbiased. 2.Thegradientaveragedacrossallthedeviceshasboundedvariance. Notethatthevariance of g t is Var ( g t )= K å k = 1 ( N k å j N j ) 2 Var ( g k ) : Sinceweonlychangetheprobabilityofchoosingthesamples,thevariance Var ( g k ) ateach node k isstillbounded.Therefore,thevarianceof g t isbounded.Forsimplicity,weusethesame notation V forthisbound. 2.5Implementation Testbed .Wedesignedanddevelopedourowntestbedduetolackofoff-the-shelfones.Specif- ically,weuse12NVIDIAJetsonTX1asedgedevices.EachTX1hasanintegratedsmallform 60 Figure2.10:Distreamimplementation. factormobileGPU,whichisdesignedfornext-generationintelligentedgedevicestoexecuteDL workloadsonboard.Forwirelessnetworking,weuseNetgearNighthawkX6SAC4000Tri-band Wi-FirouterstoconnectalltheTX1,anduseLinuxtools tc , qdisc ,and iptables tocontrolthe networkbandwidthtoconductourexperiments. 
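For reference, bandwidth control of this kind can be scripted around tc's token-bucket filter. The snippet below is only a sketch of the approach (the dissertation does not list its exact commands, and the interface name and rates are assumptions):

```python
import subprocess

IFACE = "eth0"  # assumption: substitute the testbed's actual network interface

def set_bandwidth(rate="10mbit", burst="32kbit", latency="400ms"):
    """Cap egress bandwidth on IFACE with a token-bucket filter (requires root)."""
    subprocess.run(["tc", "qdisc", "replace", "dev", IFACE, "root",
                    "tbf", "rate", rate, "burst", burst, "latency", latency],
                   check=True)

def clear_bandwidth():
    """Remove the shaping qdisc and restore the interface's default behavior."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)

# Example: emulate the 10Mbps low-bandwidth condition, run an experiment, then restore.
# set_bandwidth("10mbit"); ...; clear_bandwidth()
```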
FrameworkImplementation .WeimplementedDistreamusingTensorFlow1.10.Figure2.10 showstheimplementationofourDistreamframework.Distreamisdistributedtrainingframework thatspansacrossmultipleedgedevices.Ineachtrainingiteration, ImportanceSampler constructs amini-batchfromon-devicedatausinggroup-wiseimportancesampling( 1 )basedontheim- portancedistributionoflocaldatastoredin ImportanceCache .Themini-batchisfedinto Model Trainer tocomputelocalgradient( 2 ).Thelocalgradientsfromalldevicesareaggregatedand theaggregatedgradientsaresentto ModelTrainer toupdatethemodel( 3 ).Meanwhile, Im- portanceUpdater computesthedataimportanceandupdates ImportanceCache ( 4 )whenthe deviceisperforminggradientaggregation. ReshardExecutor importantdatasamples from ImportanceCache ( 5 )andcommunicateswithotherdevicestoperformimportance-aware dataresharding( 6 )whenthedeviceisperforminggradientcomputation.Theexecutionsof Re- 61 shardExecutor and ImportanceUpdater aretriggeredandscheduledbythe BACCScheduler at runtime. 2.6Evaluation Inthissection,weevaluatetheperformanceofDistreamwiththeaimtoanswerthefollowing questions: Q1(§2.6.2): DoesDistreamoutperformstatusquo?Ifso,whatarethereasons? Q2(§2.6.3): HoweffectiveiseachcoretechniqueincorporatedinthedesignofDistream? Q3(§2.6.4): CanDistreamadapttowirelessnetworkbandwidthvariationwell? Q4(§2.6.5): DoesDistreamscalewellwhenthenumberofedgedevicesincreases? 2.6.1ExperimentalMethodology TaskDomains,Datasets,andDLModels .TodemonstratethegeneralityofDistreamacross domains,datasets,andDLmodels,weevaluateDistreamonsixcommonlyuseddatasetsusing adiversesetofDLmodelsacrossthreemostimportanttaskdomains:computervision,speech recognition,andnaturallanguageprocessing. ComputerVision. Inthedomainofcomputervision,weusefourdatasets., weselectCIFAR-10,CIFAR-100,andSVHNastheyarethreeofthemostcommonlyused computervisiondatasetsfordeeplearningalgorithmevaluation.BothCIFAR-10andCIFAR- 100[48]consistof50 ; 000trainingimagesand10 ; 000testimagein10classes.SVHN[82] consistsof73 ; 257trainingand26 ; 032testimages.WeuseResNet18[28]totrainonCIFAR- 10andSVHNtraining,anduseRetNet50[28]totrainonCIFAR-100.Inaddition,weselect 62 AID[83]asourfourthcomputervisiondataset.AIDisalarge-scaleaerialimagedataset collectedfromGoogleEarth.Ithas10 ; 000imagesin30classes,includingairport,mountain, desert,forest,etc.Werandomlyselect80%imagesfortrainingandtheremaining20%images fortesting.Weselectthisdatasettoemulatetheapplicationofon-devicecollaborativelearning acrossaswarmofdrones.WeuseMobileNetV2[70]totrainonthisdataset. SpeechRecognition .Inthedomainofspeechrecognition,weselectTwSpeechCom- mand[75]asourdataset.Thisdatasetconsistsof105 ; 829audioutterancesof35shortwords, recordedbyavarietyofdifferentpeople.Therecordedaudioclipsareintendedforsimple speechinstructionssuchas go , stop , yes , no ,etc.Thetrainingsethas84 ; 843samplesandthe testsethas11 ; 005samples.Weextractthe2-Dspectrogramsfromrawaudioclips,anduse VGG-13[73]asthemodel. NaturalLanguageProcessing .Inthedomainofnaturallanguageprocessing,weselectAG NewsCorpus[90]asourdataset.ThisdatasetcontainsnewsarticlesfromtheAG'scorpus.It has120 ; 000trainingand7 ; 600testingsamples.Eachsampleconsistsofacoupleofsentences andislabeledasoneofthefourclasses: world , sports , business and sci/tech .Weuseatwo- layerLSTMwithattention[11,31]totrainonthisdataset. 
Baselines .ThegoalofDistreamistoenhancethetrainingefyofon-devicecollaborative learningwhileretainingalgorithmcorrectnessguaranteeswithoutcompromisingtheaccuraciesof thetrainedmodels.Forfaircomparison,wecompareDistreamagainsttwostatusquoframeworks whichsharethesamegoal 1 . TicTac [25]. TicTac isastatusquodistributedtrainingframeworkfortrainingacceleration. 1 Gaia [33]isthestatusquocommunication-efdistributedtrainingframeworkbasedongradientcompres- sion.However, Gaia obtainstrainingefybycompromisingtheaccuraciesofthetrainedmodels.Therefore,we didnotincludeitasabaseline. 63 Itproposestwoheuristicstore-orderparametertransfernodesinacomputationalgraphto increasetheoverlappingbetweencomputationandcommunication,outperformingTensorFlow by1 : 19 shorterintotaltrainingtime. AdaComm [78]. AdaComm isastatusquocommunicati-on-efdistributedtraining framework. AdaComm reducestrainingtimebyallowingeachworkertoperformmultiple localSGDiterationsbeforecommunicationandadaptivelybalancingbetweenthenumberof localSGDiterationsandcommunication. EvaluationMetrics .WeusetwometricstoevaluatetheperformanceofDistreamandthebase- lines. TotalTrainingTime .Weusetotaltrainingtimeasourtmetric.Totaltrainingtimeis asthewall-clocktimefromthebeginningtotheconvergenceoftrainingprocess.This metriccontainsnotonlyinformationaboutthenumberofiterationsuntilconvergencebutalso informationaboutthetrainingtimeperiteration. TrainingQuality .Thesecondmetricistrainingquality.Weusetop-1testaccuracyofthe trainedDLmodeltomeasurethetrainingquality. TrainingDetails .Duringtraining,thebatchsizeofeachdeviceissetto32,andweadoptSGD optimizerandcosine-annealinglearningratedecay[59]. (a)CIFAR-10 (b)CIFAR-10 Figure2.11:OverallperformancecomparisononCifar10andCifar100. 64 (a)SVHN (b)AID Figure2.12:OverallperformancecomparisononSVHNandAID. (a)TwSpeechCommand (b)AGNews Figure2.13:OverallperformancecomparisononTwSpeechCommandandAGNews. 2.6.2OverallPerformance WecomparetheoverallperformanceofDistreamwith TicTac and AdaComm .Todoso,we runtrainingexperimentsonfouredgedevices,andmeasuretheirtotaltrainingtimespeedupsover standarddistributedSGDonthesixdatasets.Toprovideacomprehensiveevaluation,wedidour experimentsundervariousnetworkbandwidthsrangingfrom10Mbpsto200Mbps,emulating scenarioswithdifferentwirelessnetworkbandwidthavailability. Figure2.11,Figure2.12andFigure2.13showstheresults.WeseethatDistreamcantly outperforms TicTac and AdaComm onallthesixdatasetsacrossallthenetworkbandwidths,with improvementsfrom1 : 6 to3 : 9 .,Distreamachievesupto3 : 74 ,3 : 21 ,4 : 08 , 2 : 21 ,2 : 74 and1 : 85 speedupsoverstandarddistributedSGDonsixdatasetsrespectively. Incontrast, TicTac achieves1 : 01 to1 : 21 speedupand AdaComm achieves1 : 2 to1 : 9 in 65 wirelesssettingswherebandwidthsareconsiderablyconstrained. WealsonoticethatDistreamachieveshigherspeedupswhenbandwidthbecomesmorecon- strained.Forexample,acrosssixdatasets,Distreamachievesthehighestperformancegainunder 10Mbps,thelowestbandwidthinourexperiment.Thisisbecauseundermoreconstrainedband- width,thecommunicationtimeislargerandDistreamcanspendmoretimeoncomputinglatest importancerank.Thisresultdemonstratesthatthemoreconstrainedthebandwidthis,themore superiorityDistreamhas. WhyDistreamOutperforms TicTac and AdaComm .TounderstandwhyDistreamisableto achievehigherspeedupscomparedto TicTac and AdaComm ,weuse CIFAR-10 asanexample, anddepictthetestaccuracyofDistream(blue), TicTac (orange)and AdaComm (green)during thecompletetrainingprocessinFigure2.14.Wehavetwoobservations. 
First,Distreamoutperforms TicTac byalargemarginfromthebeginninguntilconvergence. ,Distreamuses2 : 7 and2 : 4 fewercommunicationroundson CIFAR-10 and2 : 7 on TSpeechCommand toconvergecomparedto TicTac .Thisisbecauseunderwireless networksettingwherebandwidthislimited,thecommunicationtimedominatestrainingoverhead overlappingcomputationwithcommunicationisnoteffectiveanymore(Figure2.2).Therefore, TicTac provideslittleruntimereductionineachiteration.Incontrast,Distreamaimsatimproving thetrainingefy periteration toreducethetotalnumberofiterations(communicationrounds) withoutadditionaloverhead. Second,although AdaComm convergesfasterthanDistreaminearlystagewhenthenumber ofcommunicationsissmallerthan10 10 3 (Figure2.14(left)), AdaComm beginstoloseeffect andconvergesslowerthanDistream.Thisisbecause,although AdaComm canalsoreducecom- municationrounds,itachievessuchreductionbyperforminglocaltrainingstepsinearlystage oftraining.Toguaranteethetrainingquality, AdaComm hastoperformlesslocalstepsandthe 66 Figure2.14:Testaccuracycurvesduringthecompletetrainingprocessintermsofnumberof communicationrounds(left)andwall-clocktime(right). accelerationof AdaComm beginstoslowdownaftertheearlystage.Inaddition,performing multiplelocalstepsincreasethecomputationtimeineachround.Incontrast,withthetrainingcor- rectnessguarantee,Distreamisabletoincreasetrainingefythroughouttheentiretraining processwithoutadditionaloverheads.Therefore,eventhough AdaComm achievesfasterconver- genceduringearlystageoftraining,Distreameventuallyoutperforms AdaComm intermsofboth numberofcommunicationsandtotaltrainingtime. 2.6.3Component-wiseAnalysis Next,weevaluatetheeffectivenessofeachofthethreekeytechniquesincorporatedinDistream. Theexperimentalsetupisthesameasin§2.6.2.Weonlyincludetheresultsobtainedin CIFAR-10 Figure2.15:Testaccuracycurvesintermsofnumberofiterations(left)andwall-clocktime(right) usinggroup-wise(Groupwise-IS)andall-inclusiveimportancecomputationandsampling(All-IS). 67 heresinceweachievedsimilarresultsontheothervedatasets.Theexperimentresultsaltogether showthatwiththethreeproposedtechniques,Distreamcanachievehigherperformancegaincom- paredtotherudimentaryimportancesampling-basedframework. Component#1Analysis:Group-wisevs.All-inclusiveImportanceComputationandSam- pling .Weevaluatetheeffectivenessoftheproposedgroup-wiseimportancecomputationand samplingtechniquedescribedin§2.4.1.Todoso,wecompareitagainsttheall-inclusiveimpor- tancecomputationandsamplingstrategy,whichre-computestheimportanceofeverysamplein thelocaldatasetateachSGDiteration.TheresultsareshowninFigure2.15. Aswecansee,whilegroup-wiseimportancecomputationandsamplinguses20%moreit- erationstoconvergetothesametestaccuracyastheall-inclusiveimportancecomputationand samplingscheme(Figure2.15(left)),group-wiseis4 : 2 fasterthantheall-inclusivewhentrans- latedintototaltrainingtime(Figure2.15(right)).Thisisbecauseeventhoughre-computing theimportanceofeverysampleindeedimprovesthetrainingqualityperiteration,the computationcostitincursprolongsthetrainingtimeperiteration.Incontrast,bysac- marginaltrainingqualityperiteration,group-wisecomputationandimportancesampling isabletoconsiderablycutthetrainingtimeperiteration,leadingtoamuchreducedtotaltraining time. Component#2Analysis:Importance-awarevs.Impor-tance-agnosticDataResharding. Next, weevaluatetheeffectivenessoftheproposedimportance-awaredatareshardingtechniquede- scribedin§2.4.2.Todoso,wecompareitagainsttheimportance-agnosticdatareshardingstrategy. 
Differentfromimportance-awarereshardingwhichselectstheimportantsamplesduringreshard- ing,importance-agnostictreatseverysampleequallyandrandomlyreshufthesamplesacross edgedevices.Wealsocomparetooneextremecasewheredatareshardingwasnotperformedat 68 Figure2.16:Importance-awarevs.importance-agnosticdataresharding,vs.non-resharding. all.Figure2.16showstheresults.Wehavetwoobservations. First,similartothefrommanyotherworks[22,23,64],weobservethatcomparedto non-resharding,datareshardingisabletoboostthetestaccuracyupby1 : 84%,demonstratingthe necessityofdatareshardingforachievingstate-of-the-arttestaccuracy. Second,comparedtoimportance-agnosticdataresharding,theproposedimportance-aware strategyisabletoconvergetothesametestaccuracy,butwithonly50%ofitsdatatransferamount. Thisisbecauseimportance-awarereshardingprioritizesshufmoreimportantsamples,which considerablyimprovesthereshardingefyandreducingthenetworktraffordatareshard- ing. Figure2.17:Testaccuracycurvesintermsofnumberofiterations(left)andwall-clocktime(right) usingBACCandsequentialpipeline. Component#3Analysis:BACCSchedulervs.SequentialPipeline. Toevaluatetheeffective- nessoftheproposedBACCschedulerdescribedin§2.4.3,wecompareitagainstthesequential 69 Figure2.18:SpeeduponCIFAR-10. Figure2.19:Bandwidthsnapshot. pipeline,whichperformsdataresharding,group-wiseimportancecomputationandsampling,gra- dientcomputationandgradientaggregationsequentially.AsshowninFigure2.17,althoughBACC uses5%moreiterationstoconverge(Figure2.17(left)),whentranslatedintototaltrainingtime,it is2 : 2 fasterthanthesequentialpipeline(Figure2.17(right)). 2.6.4BandwidthAdaptationPerformance WetakeacloserlookatthebandwidthadaptationperformanceoftheBACCschedulerproposedin Distream.Whendeployingourtestbedinthereal-worldsettings,weobservedthatthebandwidth variationsacrossdifferentbandwidthsare10%onaverage(Figure2.19showsasnapshot).To examinethebandwidthadaptationperformanceoftheBACCscheduler,wecompareitagainsta bandwidth-agnosticsolutionwhereedgroupsizesanded-timedatareshardingareadopted. Figure2.18showsthespeedupcomparisonacrossfourbandwidths.Asshown,withoutbandwidth adaptation,thetrainingspeedupdropsfrom3 : 74 ,3 : 64 ,3 : 62 ,3 : 48 to3 : 43 ,3 : 38 ,3 : 33 , 3 : 25 ,whichvalidatestheeffectivenessofDistreaminadaptingtobandwidthvariations. 70 Figure2.20:ScalingperformanceofDistreamon4,8,12edgedevicesundervariousnetwork bandwidths. 2.6.5ScalingPerformance Finally,weevaluatethescalabilityofDistreambyrunningitonmoreedgedevicesunderdifferent bandwidths.Again,weonlyshowtheresultsofCIFAR-10hereduetopagelimitationsincewe observedsimilarresultsontheothervedatasets.AsshowninFigure2.20,whenscalingfrom4 devicesto8and12devices,Distreamgracefullymaintainsthespeedupoverstandarddistributed SGDundervariousnetworkbandwidths.ThisresultdemonstratesthatDistreamisabletomaintain itssuperiorityoverstandarddistributedSGDwhenthenumberofedgedevicesscalesup. 2.7RelatedWork ThedesignofDistreamwasinspiredbydistributedtrainingframeworksindatacentersetting. ThemostsimilarworkstoDistreamare AdaComm [78]whichadaptivelyadjuststhenumber oflocalSGDiterationstoreducecommunicationcost,and TicTac [25]whichoverlapsgradient computationwithcommunicationtoreduceper-iterationoverhead.Althoughmanytechniques wereproposedintheseworkstoacceleratetrainingprocess,theyaredesignedfordistributed trainingsystemsindatacenters.Unfortunately,thereislimitedperformancegainwhendirectly applyingthesetechniquestoon-devicesettinggiventhegapinnetworkbandwidth 71 betweenthesetwosettings. 
ThedesignofDistreamwasalsoinspiredbyexistingworksinfederatedlearning[62,47]Œthe othercategoryofon-devicedistributedtrainingaswediscussinintroduction.However,thetraining efyoptimizationtechniquesinvolvedinthoseworksachievetrainingefyatthecost ofcompromisingtheaccuraciesofthetrainedmodels,whichmotivatesourworkfordesigning techniquesthatimprovetrainingefywithoutsuchcompromise.Infact,ourproposedgroup- wiseimportancecomputationandsamplingtechniqueandBACCschedulercanalsobeappliedfor enhancingthetrainingefyoffederatedlearning. 2.8Conclusion Inthispaper,wepresentthedesign,implementation,andevaluationofDistream,animportance sampling-basedframeworkthatenablesefon-devicecollaborativelearningwithoutcompro- misingtheaccuraciesofthetrainedmodels.Distreamaddressesthekeybottleneckofcollaborative learning,andcontributesnoveltechniquesthattakeadifferentpathfromexistingapproaches.We implementedDistreamandconductedarichsetofexperimentswithaself-developedtestbedon sixcommonlyuseddatasetsacrosscomputervision,speechrecognition,andnaturallanguagepro- cessing.OurresultsshowthatDistreamconsistentlyoutperformsthestatusquo.Therefore,we believeDistreamrepresentsacontributiontomakingcollaborativelearningsystems practicallyusefulinreal-worlddeployments. 72 Chapter3 CollaborativeLearningfor FederatedLearning 3.1Introduction Recentyearshavewitnessedtheboomingofedgedevices.Itisexpectedthatby2030,thenumber ofedgedeviceswillincreasefrom8.74billionin2020to25.44billion[12].Theseedgedevices, suchassmartphones,drones,wearables,arerevolutionizingthewaywelive,work,andinteract withtheworld.Equippedwithavarietyofsensors,thesedevicesarecollectingamassivevolume ofdatafrompeopleaswellassurroundingenvironmentseveryday.Thesedataareofgreatvalue totechnologycompaniesforimprovingtheirproductsandprovidingbetterservicesandcustomer experience.Inordertogainrealvalueoutofthesedata,thestatusquoapproachistouploadthese decentralizeddataontocentralserversinthecloudwherepowerfulmachinelearningmodelssuch asdeepneuralnetworksaretrained.However,theseuserdataoftencontainprivateorprivacy- sensitiveinformationsuchasfaces,licenseplates,housenumbers,etc.Asaresult,thesedata areprotectedfromleakagebylaw,anduploadingthedatatocentralcloudserversforfurther processingisnotafeasiblesolution. WiththeemergenceofAIchips,themajorityoftheedgedevicesarenowabletonotonly performon-devicemodelinference,butalsotrainadeeplearningmodelusingthedatacollected 73 bythemselves.Thereisainterestinleveragingtheon-devicecomputationalcapabilities tocollaborativelytrainagloballysharedmodelfromthedecentralizeddatawhilekeepingthedata locallyoneachdevice[72].Manypeoplehaveshiftedfromcentralizedservertoedgesettingfor machinelearningmodeldevelopmentandtraining. Drivenbythistrend,Googleproposedfederatedlearning[47]in2016.Federatedlearningis anemergingmachinelearningparadigmthathasrecentlyattractedconsiderableattentionduetoits widerangeofapplicationsinmobilescenarios[63,47,74].Differentfromthestandardmachine learningapproachwhichrequiresallthetrainingdatatobecentralizedinaserverorinadatacenter, itenablesgeographicallydistributededgedevicessuchasmobilephonestocollaborativelylearna sharedmodelwithoutexposingthetrainingdataoneachedgedevice.Assuch,federatedlearning enablesdistributingtheknowledgeacrosssmartphoneswithoutexposingusers'privatedata. 
Federatedlearningusesdistributedstochasticgradientdescent(SGD)andrequiresacentralto coordinatetheiterativetrainingprocess.Beforeeachroundbegins,theserverselectsasubsetof clientsanddistributesthelatestglobalmodeltoalltheparticipatingdevices.Eachdevicecomputes thegradientorupdateofthemodelparametersusingitslocaldata.Theserveraggregatesthe gradientsormodelsupdatefromalldevices,averagesthem,andupdatestheglobalmodel.The centralserversendstheupdatedmodeltotheclientstoperformlocaltrainingofnextround.In suchmanner,eachdevicefromobtainingabettermodelthantheonetrainedonlyonthe locallystoredprivatedata. AlthoughFLaddressestheprivacyissue,itstillhasmanyproblemsandsuffersfromsigni performancedegradationduetothreeproblems.Theislowtrainingefy.Infederated learning,thetrainingdataismassivelydistributedacrossalargenumberofclients.Existingap- proachesadoptrandomsamplingwhenitselectsclientstoparticipateintrainingbeforeeachround begins.Ineachround,theclientsadoptrandomsamplingstrategytosamplemini-batcheswhen 74 performinglocaltraining.Consequently,thetrainingefyinfederatedlearningislowand itrequiresmorestepstoconverge.Thesecondproblemislimitednetworkbandwidth.Indata centers,communicationbetweentheserverandworkingnodesisconductedviaGbpsEthernet orniBandnetworkwithevenhigherbandwidth[81].Incontrast,communicationinfederated learningreliesonwirelessnetworkssuchas4GandWi-Fi.Bothuplinkanddownlinkbandwidths ofthosewirelessnetworksareatMbpsscale,whichismuchlowerthantheGbpsscaleinthedata centersetting.Thelimitedbandwidthinfederatedlearningillustratesthenecessityofreducing thecommunicationcosttoacceleratethetrainingprocess.Thelastproblemisfederatedlearn- ingislowaccuracyduetotheadoptionofsmallcapacitymodel.Althoughexistingedgedevices suchassmartphonesareabletotrainneuralnetworkmodels,theyhavemuchlesscomputere- sourcescomparedtocentralizedcloudservers.Moreover,thesedevicesoftenneedtoconcurrently runmultiplemodelsthatperformdifferenttasksinrealtime.Itthusrequiresthecollaboratively trainedmodelinfederatedlearningtobecomputationallyefTherefore,federatedlearning usuallyaccuracyandusesasmallmodelinordertoreducethecomputationalcost. Inthiswork,weproposeFedAce,aneffederatedlearningframeworkthataddressesthe aforementionedproblems.Itisfeaturedbytwokeycomponents.Thecomponentisfeder- atedimportancesampling,whichnotonlyperformsimportantdataselectionwithineachclient inlocalmodeltraining,butalsoselectsimportantclientstoparticipateineachroundoftrain- ing.Byfocusingonimportantclientsanddata,FedAceisabletoincreasetrainingefyand reducetrainingtime.Thesecondcomponentisfederatedmodelcompression,whichiteratively compressesthegloballysharedmodelthroughoutthewholetrainingprocess.Theglobalmodel iscompressedusingaglobalmaskaggregatedbylocallycomputedmasks,whichrepresentsthe opinionsonthelocationsofimportantparametersfromdistributedclients.Aftertrainingcom- pletes,onlyimportantparametersarekept,andtheglobalmodeliscomputationallyef 75 Astheexchangedinformationbetweenclientsandthecentralserverareintheformofmodel,we canreducethecommunicationcostbytransferringacompressedmodel.Asaresult, thetrainingtimeisreduced. WeimplementFedAceusingFedML[26],aresearchlibraryandbenchmarkforfederated learning.WeexaminetheperformanceofFedAceonfourfederatedlearningdatasetsacrosscom- putervision,speechrecognition,andnaturallanguageprocessingtasks.Ourexperimentresults showthatFedAceisabletoincreasetrainingefyandachievetrainingspeedup comparedtoexistingFLframeworks.Meanwhile,FedAceachievesmodelcompression andcommunicationreductionratioswithnoorlittleaccuracyloss. 
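To make the mask-aggregation idea concrete, the sketch below shows one plausible instantiation in which each client marks the locations of its largest-magnitude parameters and the server keeps a parameter only if enough clients voted for it. The specific choices here (magnitude-based top-k masks, majority voting, flattened weight vectors) are assumptions made for illustration; FedAce's actual mechanism is presented in §3.4.

```python
import numpy as np

def local_mask(weights, keep_ratio=0.3):
    """Client side: mark the locations of the largest-magnitude parameters."""
    k = max(1, int(keep_ratio * weights.size))
    thresh = np.partition(np.abs(weights).ravel(), -k)[-k]
    return (np.abs(weights) >= thresh).astype(np.uint8)

def aggregate_masks(masks, vote_ratio=0.5):
    """Server side: keep a parameter if at least vote_ratio of the clients kept it."""
    votes = np.mean(np.stack(masks), axis=0)
    return (votes >= vote_ratio).astype(np.uint8)

# Toy usage with a flattened weight vector per client.
rng = np.random.default_rng(0)
global_w = rng.normal(size=1000)
client_masks = [local_mask(global_w + 0.1 * rng.normal(size=1000)) for _ in range(8)]
global_mask = aggregate_masks(client_masks)
compressed_w = global_w * global_mask   # parameters outside the mask are pruned to zero
```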
3.2RelatedWork FederatedLearning. Asanemergingfederatedlearninghasrecentlyattractedalotofat- tention.[63]developedafederatedlearningapproachbasedoniterativemodelaveragingtotackle thestatisticalchallenge,especiallyforthemobilesetting.[74]proposedadistributedlearning frameworkbasedonmulti-tasklearninganddemonstratedthattheirapproachisabletoaddress thestatisticalchallengeandisrobusttostragglers.[47]developedstructuredandsketchedupdate techniquestoreduceuplinkcommunicationcostinfederatedlearning. DistributedSGDTraining. Ourworkisrelatedtodistributedtrainingaccelerationinfederated learning.Mostexistingworksarefocusedononreducingcommunicationtimetocutthewall- clocktime.Thisisachievedbyquantizinggradientsusingsmallernumberofbits[81,10,55]or selectingimportantgradientstotransfervia[79,56,33].Anotherresearchaims atincreasingmini-batchsizeforlesscommunication.In[37],theauthorsproposedtouselarger mini-batchsizetoreducecommunicationiterations.Althoughthisapproacheffectivelyreducesthe 76 wall-clocktime,itcomeswiththecostofloweringtheaccuracyofthetrainedmodel.Lastly,deep learningmodelsingeneralcanberepresentedasdirectedacyclicgraphs(DAGs).Thestructures ofDAGsprovideanopportunitytooverlapcomputationoperationsandcommunicationoperations inapipelinedfashiontomaskoutthecommunicationcost.Thisoptimizationtechniquecanbe seenasamechanismtoreducethetotaltrainingtimebyidentifyingtheoptimalgradienttransfer order[25,39,17,87]orusingstaledmodelweights[54]. ModelCompression. Ourworkisalsorelatedtomodelcompression.Deeplearningmodels havebeenincreasinglycomputationallyexpensive.Modelcompressionaimstoreducethecom- putationalintensityofdeepneuralnetworksbypruningoutredundantmodelparameters.Ithas receivedalotofattentionsinrecentyearsduetotheimperativedemandonrunningdeeplearning modelsonresource-constrainededgedevices[24].Themostpopulartechniqueformodelcom- pressionisparameterspruning,wheretheleastimportantparametersareremovedtoreducethe modelsizeandinferencecost.Dependingonthestructureofprunedparameters,therearetwo typesofpruningtechnique.Theisunstructuredpruning,whichfocusesonpruningmodel parameters[24].Althoughunstructuredpruningparametersachieveshighermodelcompression rateandeffectiveatreducingmodelsize,itdoesnotnecessarilyreducecomputationalcostswhen itisappliedtoCNNs.Theotherisstructuredpruning[50],whichandprunesunimpor- tantinCNNs.Structuredpruningachieveslowercompressionrate,itcaneffectivelyreduce runtimecost. Ourapproachalsodiffersfrompriorworkonfederatedlearning,wheretheyfocus oneithertacklingstatisticalandstragglerchallenges[63,74]orreducinguplinkcommunication costalone[47].Ourapproachisalsodifferentfrompriorworkonmodelcompression[24,66] andcommunicationreductionfordistributedlearning[10,81,56],wheretheytreatthemastwo separate tasks.Inotherwords,thenoveltyofourproposedapproachliesatintegratingmodel 77 compressionandcommunicationreductionintoasingleoptimizationframeworkbasedontraining datadistributedatdifferentdevices. 3.3BackgroundandMotivation Inthissection,weintroducethebackgroundknowledgeoffederatedlearning.Thenwecom- parethedifferencebetweenfederatedlearningandstandarddistributedlearning.Finallywedis- cussthedrawbacksofexistingfederatedlearningapproachesandprovideimprovingopportunities forfederatedlearning,whichisthemotivationofFedAce. 
3.3.1 Architecture of Federated Learning

Federated learning is a new training paradigm for training machine learning models using geographically distributed clients under the orchestration of a central server. The training process is conducted in the form of iterative rounds. In each round, the central server selects a group of clients randomly from all the available clients to participate in the training. The central server then sends the latest global model to the selected clients in the current round. After receiving the latest model from the central server, each selected client computes the model update or gradient using its local data, which will be sent to and averaged in the central server. The central server aggregates the model updates or gradients from the clients and updates the global model:

    W^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} W_k^{t+1}    (3.1)

where W^{t+1} is the updated global model for round t+1, K is the number of selected clients that participate in the training, and W_k^{t+1} is the updated model of client k. n_k is the number of data samples in client k, and we have \sum_{k=1}^{K} n_k = n. The overview of the training process is shown in Figure 3.1.

3.3.2 Federated Learning Characteristics

Although both federated learning and distributed training aim at collaboratively training a global model, the setting that federated learning focuses on is fundamentally different from distributed training. We present the key differences between federated learning and distributed training as follows.

Data Heterogeneity. The first difference is that there is data heterogeneity in federated learning. Specifically, the local data across clients are non-i.i.d. distributed. This includes not only the variation in the amount of local data, but also feature skewness and label skewness [43]. In contrast, in distributed training the data is independent and identically distributed across workers and the quantity of local data is the same. Data heterogeneity imposes a significant challenge in federated learning and is regarded as the major cause of model performance loss.

Client Heterogeneity. In federated learning, clients from millions of users across the world could be selected to participate in training. Thus, the hardware and compute capacities of participating clients could be different. Moreover, as the central server only selects a subset of clients to participate in each round, the client heterogeneity also changes from round to round. In distributed training, each worker has the same compute capability.

Network Heterogeneity. In federated learning, each client is unreliably connected to the central server and the bandwidth is small. As the clients are distributed across continents or the world, the network bandwidths between the clients and the central server are heterogeneous. In distributed training, as the workers are located in a local region, the bandwidths among workers are uniform and each worker is reliably connected to the central server.

These differences make existing optimization techniques for distributed SGD training fall short, and highlight the necessity of developing a new training framework to optimize the system performance for federated learning. Table 3.1 summarizes the key differences between distributed training and federated learning.

                          Federated Learning            Distributed Training
Data distribution         non-i.i.d.                    i.i.d.
Number of local data      small, typically < 500        large, typically > 1000
Data accessibility        invisible to other clients    visible to other clients
Number of clients         typically >= 500              typically <= 50
Participating clients     < 10%                         100%
Client location           wide area                     local region
Client connectivity       unreliably connected          always connected
Network bandwidth         low                           high

Table 3.1: Comparison between federated learning and distributed training.

Figure 3.1: Federated learning overview.
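As a concrete reference for the aggregation rule in Equation (3.1), the following Python sketch implements the server-side weighted averaging step. It is only an illustration, not the implementation used in this dissertation: models are assumed to be dictionaries mapping parameter names to NumPy arrays, and the function and variable names are hypothetical.

import numpy as np

def fedavg_aggregate(client_models, client_sizes):
    # Weighted average of the selected clients' updated models (Equation 3.1):
    #   W^{t+1} = sum_k (n_k / n) * W_k^{t+1}
    # client_models: list of dicts {parameter name -> np.ndarray}, one per client
    # client_sizes:  list of local dataset sizes n_k for the same clients
    n = float(sum(client_sizes))
    global_model = {}
    for name, value in client_models[0].items():
        acc = np.zeros_like(value, dtype=np.float64)
        for model, n_k in zip(client_models, client_sizes):
            acc += (n_k / n) * model[name]
        global_model[name] = acc
    return global_model

FedAvg applies this rule at the end of every round; FedAce keeps this aggregation step and instead changes how clients and mini-batches are selected and how the model is compressed.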
3.3.3 Scope and Improving Opportunity

Federated learning is a challenging research area and many problems remain to be solved, such as data non-i.i.d.-ness, slow convergence, and adversarial attacks. In this work, FedAce focuses on addressing three problems in federated learning. Next, we identify and discuss the possible improvement for each problem.

Low Training Efficiency. Current practices in federated learning adopt random sampling in both data selection and client selection. While this simple solution facilitates the implementation and deployment of federated learning applications, the training efficiency is low. This is because the local data at each client contribute differently to training. As there is data heterogeneity and the local data cannot be shared among clients, each client also contributes differently. Consequently, using a random sampling strategy for data and client selection wastes much training effort on unimportant samples and clients. This motivates us to adopt a non-random sampling strategy when we select clients for training and construct mini-batches for local training, in order to increase the training efficiency.

Large Communication Cost. As discussed before, the clients are distributed across the world, and their bandwidths are limited and heterogeneous. Hence, the communication overhead in each round is large, making it prohibitively impractical to train a global model within a short period of time. To mitigate this problem, we should not only increase the training efficiency to reduce the number of training steps, but also reduce the per-round communication cost. Therefore, the second improving opportunity lies in reducing the communication cost to further reduce the total training time.

Low Model Accuracy. In federated learning, the participating clients are usually resource-constrained edge devices (e.g., smartphones), where the network resources and compute resources are limited. Training a large model not only results in slow convergence due to the large per-round communication cost, but also yields a model that is inefficient for inference. Given such resource constraints, the common practice of existing work is to train a small neural network for efficient inference after training. However, the neural networks that achieve state-of-the-art accuracy in most learning tasks (especially difficult ones) are models with large capacities and deep architectures. The current practice of federated learning therefore cannot leverage the superior accuracy brought by large neural networks. As a sanity check, we train a large model and a small model (similar to the large model but with 10% of the weights) under the same federated setting. As shown in Figure 3.2, there is an accuracy gap between the two models. Therefore, it is not trivial to enhance the model performance while keeping the trained model compact and efficient.

Figure 3.2: Training process under federated learning.

3.4 FedAce Design

In this section, we present the detailed design of FedAce. First we describe the system architecture of FedAce. Then we introduce the two algorithmic components, which are the backbones of FedAce.
3.4.1 Overall Architecture

FedAce is a distributed framework that spans the clients and the central server, as illustrated in Figure 3.3. At the client side, the important data sampler selects important data to construct mini-batches to perform local training. After local training completes and before the client sends the locally updated model to the central server, the local compressor computes a local mask that indicates the important parameters to be preserved and the unimportant parameters to be pruned. At the central server side, the important client sampler selects important clients from the available clients to participate in training for the next round. The important client sampler is also responsible for maintaining the importance estimates of the available clients. The global compressor aggregates the local masks sent from the clients into a global mask, and prunes the global model using the global mask. Next we describe the detailed approaches behind these components.

Figure 3.3: Overview of FedAce.

3.4.2 Federated Importance Sampling

Federated learning employs a hierarchical selection strategy to train a global model: the central server selects a subset of clients to participate in training; then each participating client selects important data to perform local training. Accordingly, FedAce adopts federated importance sampling, which consists of two importance samplers to increase training efficiency in a hierarchical way: at the high level on the central server side, it uses the important client sampler to select important clients to participate in training; at the low level on the client side, each client uses data importance sampling to sample important data, as illustrated in Figure 3.4.

Figure 3.4: Overview of federated importance sampling.

3.4.2.1 Data Importance Sampling

The data sampler at each client maintains an estimate of the importance distribution P of its local data D. When constructing mini-batches, the data is sampled from the distribution P without replacement. We employ the data loss as the importance indicator when computing the importance distribution P, as the data loss is a good approximation of the data importance and computing the loss is cheap, as presented in Mercury. Therefore, the importance distribution P can be computed as

    p_i = \frac{l_i}{\sum_{n=1}^{|D|} l_n}    (3.2)

where |D| is the total number of data samples at the local client and l_i is the loss of the i-th sample.

Different from distributed training, the number of data samples in each client is small and the communication process takes a longer time as the clients are distributed around the world. As a result, the cost of computing the losses for all local data can be entirely hidden when it is parallelized with the communication process. More importantly, as federated learning adopts local SGD, where the local model is updated by performing multiple steps and the communication cost is large, the cost of loss computation is comparably small even if it is carried out without parallelization.
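To make the data importance sampler concrete, the following Python sketch draws a mini-batch with probabilities proportional to the per-sample losses, as in Equation (3.2). It is an illustrative sketch rather than FedAce's actual code: the losses are assumed to have been computed (or cached) in a forward pass of the latest model, e.g., overlapped with the communication phase as described above, and the function name and batch size are hypothetical.

import numpy as np

def sample_minibatch_indices(losses, batch_size, rng=None):
    # Loss-proportional sampling without replacement (Equation 3.2):
    #   p_i = l_i / sum_n l_n
    rng = rng or np.random.default_rng()
    losses = np.asarray(losses, dtype=np.float64)
    p = losses / losses.sum()
    return rng.choice(len(losses), size=batch_size, replace=False, p=p)

# Example: pick a mini-batch of 20 samples from a client's local dataset,
# where `losses` holds the most recent per-sample losses.
# batch_indices = sample_minibatch_indices(losses, batch_size=20)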
3.4.2.2 Client Importance Sampling

Instead of selecting clients randomly, the central server in FedAce prioritizes important clients to participate in training. Before each round begins, the central server collects client importance information from the connected clients and selects the important clients based on the collected importance information. To collect client importance, FedAce sends the latest model to the clients. Once a client receives the model sent by the central server for importance collection, it performs inference on all of its local data. The importance of client k is computed as:

    I_k = \bar{L}_k \frac{n_k}{n}    (3.3)

where \bar{L}_k is the average sample loss at client k, n_k is the number of local data samples at client k, and n is the total number of data samples. In other words, a client needs to have a large number of important data samples to be viewed as an important client. Then the probability of selecting a client k is computed as:

    p_k = \frac{I_k}{\sum_{k=1}^{K} I_k}    (3.4)

One limitation of this strategy is that the central server has to send the latest model to all clients so that all clients can estimate their importance and send it back to the central server to identify important clients. This may cause a large overhead because the server has to send the model to all clients even though only some of them will be selected to participate in the training. FedAce adopts two techniques to achieve a good balance between the selection of important clients and the incurred cost. In the first technique, the central server does not update the importance of all clients in every round and reuses the cached client importance. To calibrate the staleness in the cached importance information, the client importance is re-weighted using a staleness term for balancing exploration and exploitation:

    I_k = \bar{L}'_k \frac{n_k}{n} \exp\left( \beta \frac{\| W_k - W \|}{\| W \|} \right)    (3.5)

where \bar{L}'_k represents the last updated average loss of client k, W_k is the model weight used to compute \bar{L}'_k, and W is the latest model weight at the central server. The factor \exp(\beta \| W_k - W \| / \| W \|) is a staleness term that calibrates the importance of a client. If this term is large, it means the difference between W_k and W is large and the model W_k used to compute \bar{L}'_k is stale. In such a case, the central server should prioritize this client to participate in training, as this client has a higher chance of having become more important since the last update. \beta > 0 is a scaling factor determined empirically. A large \beta emphasizes the freshness of the importance estimate and encourages FedAce to focus more on exploration.

The second technique to reduce overhead is to allow a subset of clients to perform local training while allowing another non-overlapping subset of clients to perform importance updates simultaneously. Specifically, the least recently updated clients in the non-overlapping subset will be selected to update their client importance. Since the importance collection process does not depend on the local training process, it can be executed in parallel with the local training process. Because the importance collection process only requires feed-forward inference to obtain the sample losses, and only requires downloading the global model without the need to upload the locally updated model, it takes a shorter time than the local training process. Therefore, we can ensure that before each round begins, the central server has the client importance information updated for client sampling. This selection strategy is able to select important clients while ensuring on-time client importance updates with minimum overhead.
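The following Python sketch illustrates how the staleness-calibrated client importance of Equation (3.5) and the selection probabilities of Equation (3.4) could be computed. It is a simplified illustration under stated assumptions, not FedAce's implementation: model weights are assumed to be flattened into single vectors, and `beta`, the function names, and the number of selected clients are hypothetical.

import numpy as np

def client_importance(avg_loss, n_k, n_total, w_cached, w_latest, beta=1.0):
    # Staleness-calibrated client importance (Equation 3.5). `avg_loss` is the
    # cached average sample loss computed with the (possibly stale) weights
    # `w_cached`; `w_latest` is the current global model at the server.
    drift = np.linalg.norm(w_cached - w_latest) / np.linalg.norm(w_latest)
    return avg_loss * (n_k / n_total) * np.exp(beta * drift)

def select_clients(importances, num_selected, rng=None):
    # Sample clients with probability proportional to importance (Equation 3.4).
    rng = rng or np.random.default_rng()
    p = np.asarray(importances, dtype=np.float64)
    return rng.choice(len(p), size=num_selected, replace=False, p=p / p.sum())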
3.4.3 Federated Model Compression

Federated model compression aims to compress the model in each round in order to save per-round cost, and to obtain an efficient and compact model after training completes. To compress the model, we need to identify and prune unimportant parameters. To achieve that, we compute a binary mask that indicates the unimportant parameters in the model (0 for unimportant parameters and 1 for important parameters). In each round, FedAce computes such a binary mask and applies it to the global model to obtain a pruned model. The pruned model is then used to proceed with the training of the next round. Next, we describe the generation of the mask and two techniques for maintaining accuracy after compression.

Mask Generation. At a high level, the generation of the mask consists of three steps, as shown in Figure 3.5. First, after receiving the latest model, each client not only trains the local model using its own data, but also computes a local mask. The local mask is sent to the central server together with the updated model when local training ends. Then, the central server aggregates the local masks to generate the global mask, which is applied to the aggregated global model.

Figure 3.5: Overview of federated model compression.

FedAce adopts unstructured pruning and uses a magnitude-based approach to determine the important parameters. Specifically, the local mask at client k can be computed as:

    M_k = \sigma(W_k^{\mathrm{abs}}; s)    (3.6)

where W_k^{\mathrm{abs}} is obtained by taking the absolute value of the elements of the weights W_k at client k, and \sigma(W; s) is a thresholding function where the elements of W that are greater than s are set to one and to zero otherwise.

After the local masks are generated, they are sent to the central server together with the model weights. Then, the central server not only aggregates the model weights as it normally does in federated learning, but also aggregates the local masks to create a global mask. We use majority vote as the aggregation method for the local masks. Specifically, FedAce sums the collected local masks together. Summed elements smaller than a threshold s in the summed mask indicate that the corresponding weights of the global model will be pruned, based on the agreement of the majority of the clients:

    M_g = \sigma\left( \sum_{k=1}^{K} M_k; s \right)    (3.7)

In other words, the parameters that most local masks agree to preserve are kept and the others are pruned. After the global mask M_g is computed, the pruned global model is obtained by

    W_g = W \odot M_g    (3.8)

where \odot is the element-wise multiplication. Using masks brings extra communication cost to the training, as the clients need to upload the local masks to the central server. To reduce this cost, we use a bit operation to encode the binary mask into the last bit of each weight. Changing the last bit does not affect the model accuracy, which has been verified in many prior works on model quantization [92].
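As an illustration of the mask generation and aggregation just described (Equations 3.6 to 3.8), the Python sketch below computes a magnitude-based local mask, combines local masks by majority vote, and applies the resulting global mask. It is a simplified, per-tensor sketch under the assumption that weights are plain NumPy arrays; the threshold names and function names are hypothetical, and details such as the bit-level mask encoding are omitted.

import numpy as np

def local_mask(weights, threshold):
    # Magnitude-based local mask (Equation 3.6): 1 keeps a weight, 0 prunes it.
    return (np.abs(weights) > threshold).astype(np.uint8)

def global_mask(local_masks, vote_threshold):
    # Majority-vote aggregation (Equation 3.7): a parameter is kept only if the
    # number of local masks voting to keep it exceeds `vote_threshold`.
    votes = np.sum(np.stack(local_masks, axis=0), axis=0)
    return (votes > vote_threshold).astype(np.uint8)

def prune_global_model(global_weights, mask):
    # Apply the global mask to the aggregated global model (Equation 3.8).
    return global_weights * mask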
One problem of federated model compression is that the model accuracy after compression may drop, especially in the presence of data non-i.i.d.-ness, which causes the clients to disagree with each other on the pattern of the optimal mask. To address this issue, we propose two techniques to maintain the accuracy. The first is the compression rate schedule, and the second is mask caching.

Compression Rate Schedule. One key prerequisite for accurately identifying the unimportant weights to prune is to train the model to near convergence, when the weights become stable. However, this prerequisite does not hold in the case of federated model compression because pruning is performed throughout the training process, where the model weights may not yet be stabilized. As a result, generating local masks from local models that have not converged may falsely remove important weights and preserve redundant weights.

To satisfy this prerequisite, FedAce incorporates a warm-up stage and an adaptive model sparsity policy during the federated learning process. Specifically, FedAce does not perform model compression in the warm-up stage. The goal is to allow the model to quickly learn to distinguish important from unimportant parameters. After the warm-up stage, federated model compression begins and the compression rate increases exponentially. In doing so, we allow the compressed model to regain accuracy gradually, especially in the late stage where mistakenly pruning an important parameter leads to a significant accuracy drop.

Mask Caching. Given the non-i.i.d. characteristic of the federated paradigm, the local masks of the clients may have different views on the importance of the model weights. This issue is exacerbated by the practice that at each round of FL, only a subset of the clients is selected to participate in the training process. If the global mask is generated at the end of every round, the global mask could easily be biased by that particular subset of the clients, and thus deviate from the optimal global mask. As a consequence, the global model pruned using the biased global mask could lead to sub-optimal performance.

To address this challenge, FedAce caches the received local masks in the central server over a number of rounds, from more clients, before generating the global mask. As such, the local masks collected from a larger pool of clients are able to capture a better global view of the importance of the model weights. Figure 3.6 illustrates the idea of mask caching. Specifically, there is no caching when the cache size equals 1 (left chart in Figure 3.6). When the cache size equals 2, the masks are cached and aggregated every two rounds (right chart in Figure 3.6).

Figure 3.6: Examples of mask caching.

3.5 Experiments

In this section, we evaluate the effectiveness and performance of FedAce under federated learning on four federated datasets across computer vision, speech recognition, and natural language modeling tasks.

3.5.1 Setup

Implementation and hardware environment. We have implemented FedAce using FedML [26]. FedML is a research library and benchmark for federated machine learning. Users can develop variants of federated learning algorithms using this framework. We implemented FedAce using 900 lines of Python code. All the experiments are run on a desktop server with 4 RTX-8000 GPUs.

Datasets and deep learning models.

1) The first dataset is Fed-CIFAR10, which is constructed by partitioning CIFAR-10 into 1000 shards to simulate 1000 clients. In order to simulate the data heterogeneity across devices, we partition the data shards using latent Dirichlet allocation (LDA) [34] so that each shard has an imbalanced label distribution, also known as label skewness [43] (a sketch of this partitioning scheme is given after this list). We use ResNet18 [28] on this dataset. Because the statistics parameters of batch normalization rely on local data, we follow [67] to replace the batch normalization layers with group normalization layers.

2) The second dataset is FEMNIST [67]. It has 749,068 images of digits and English characters, with a total of 62 classes. The dataset is partitioned into 3400 shards with different shard sizes. For this dataset, we use a CNN model with two convolutional layers and two linear layers.

3) The third dataset is Fed-TSC, which is constructed by partitioning the Speech Command dataset into 1000 shards. This dataset consists of 105,829 audio utterances of 35 short words, recorded by a variety of different people [75]. Similarly, we partition the data shards using the LDA distribution to simulate label skewness. We convert the audio into spectrograms and use VGG-13 for this dataset.

4) The fourth dataset is StackOverflow [67]. StackOverflow is a language modeling dataset. It consists of questions and answers from the website Stack Overflow. The dataset contains 342,477 unique users, which we use as clients. We perform next-word prediction on this task. We use a one-layer LSTM with a hidden dimension of 670 to predict the next word in a sentence. Each word is embedded in a 96-dimensional space and fed to the RNN. The details of the datasets and the models used can be found in [67].
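For reference, the sketch below shows one common way to realize the LDA/Dirichlet-style label-skewed partitioning mentioned above, in the spirit of [34]. It is illustrative only and not necessarily the exact procedure used to build Fed-CIFAR10 or Fed-TSC; the concentration parameter `alpha`, the seed, and the function name are assumptions, with a smaller `alpha` producing a more skewed (more non-i.i.d.) label distribution.

import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.5, seed=0):
    # For each class, draw the share of its samples assigned to every client
    # from Dirichlet(alpha), then split that class's indices accordingly.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    shards = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        split_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, split_points)):
            shard.extend(part.tolist())
    return shards  # shards[k] holds the sample indices assigned to client k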
Baselines. While there are many works aiming at improving federated learning, most of these works are focused on addressing data non-i.i.d.-ness [45, 53] or preserving user privacy. The approaches proposed in FedAce are complementary to these works. Therefore, we adopt FedAvg as the baseline in this experiment, which is a state-of-the-art training method for federated learning, where each client trains for one or more epochs and the local model updates are sent to and averaged in the remote central server. For both FedAvg and FedAce, we use the same training hyperparameters. Specifically, at the client side, we adopt the SGD optimizer with a learning rate of 0.01 and cosine-annealing learning rate decay [59]. The batch size is 20. At the server side, we use the Adam optimizer with a learning rate of 0.001. The number of local epochs is set to 1.

3.5.2 Results

We show that FedAce achieves superior performance compared to FedAvg by highlighting the key results:

• FedAce achieves 3.24× to 6.54× speedup over FedAvg without loss of accuracy.
• FedAce reduces network traffic by 2.4× to 3.44× in each round.
• FedAce obtains a compact and efficient sparse model after training, with 17.2% to 41.2% of the original model size.

3.5.2.1 End-to-End Performance Comparison

We compare the end-to-end performance of FedAce and the baseline. Table 3.2 shows the speedups obtained by FedAce over FedAvg. The overall speedups are 5.28×, 4.7×, 6.54×, and 3.24× on Fed-CIFAR10, Femnist, Fed-TSC, and StackOverflow, respectively. Specifically, the speedups come from two sources. The first is the reduction in the number of rounds, contributed by adopting the importance-aware data selection and client selection strategies. The second source is the reduction in the communication cost, contributed by compressing the model. All of the speedups are obtained by FedAce without hurting accuracy, which indicates that FedAce can increase the training efficiency and reduce the training overhead.

Dataset        #Rounds Reduction   Avg. Comm. Cost Reduction   Speedup   Final Model Sparsity
Fed-CIFAR10    2.2×                2.4×                        5.28×     69.3%
Femnist        1.7×                2.76×                       4.7×      75.3%
Fed-TSC        1.9×                3.44×                       6.54×     82.8%
StackOverflow  1.6×                2.03×                       3.24×     58.8%

Table 3.2: End-to-end performance. The numbers indicate the speedups of FedAce over FedAvg.

3.5.2.2 Sensitivity Analysis

In this section, we perform a sensitivity analysis to show that FedAce is robust to various hyperparameters.

Impact of Number of Local Epochs. We analyze the impact of the number of local epochs. We run FedAce with different numbers of local epochs on Femnist in this experiment. The result is shown in Figure 3.7. We can see that when the number of local epochs increases, the speedups obtained by FedAce become smaller. This is because when the number of local epochs increases, the local model drifts further from the globally optimal model. As a result, the estimated importance distributions of local data and clients are less accurate, limiting the improvement brought by importance sampling. Furthermore, as the number of local epochs increases, the mask of the locally trained model is encoded with more local information, making it diverge from the masks generated by other clients. FedAce therefore needs more rounds to achieve the same accuracy. Even though the speedups become smaller, FedAce is still able to achieve a speedup over the baseline.

Figure 3.7: Performance under different numbers of epochs.

Impact of Number of Clients.
We then study the impact of the number of clients. We run FedAce with different numbers of clients on Femnist in this experiment. The number of clients affects the amount of local data on each device, and thus the non-i.i.d.-ness of the local data is also affected. We do not enable model compression in this experiment in order to eliminate its effect. Figure 3.8 shows the speedups brought by data sampling and client sampling under various numbers of clients. We can see that when the number of clients increases, the speedup achieved by data importance sampling becomes smaller, while the speedup achieved by client importance sampling increases. This is because when the number of clients increases, the importance diversity in the local data is small and data sampling does not provide much performance gain even if important data is selected. However, in this case, the speedup achieved by client sampling becomes larger, since the importance diversity of clients becomes dominant. This demonstrates the need to combine the two sampling strategies to achieve importance sampling in federated learning.

Figure 3.8: Performance gain of FedAce in three federated learning datasets.

Compression Rate Schedule. In this experiment, we analyze the effect of the compression rate schedule. We compare the proposed compression rate schedule with a constant compression rate (set to the final sparsity from the beginning till the end).

Table 3.3 and Table 3.4 show the accuracy of different compression rate schedule strategies on Femnist and StackOverflow. We can see that with the same sparsity, the constant schedule suffers from a significant accuracy drop. FedAce with the warm-up strategy only is able to mitigate the accuracy loss. FedAce with the warm-up and exponential-increase schedule suffers from the least accuracy drop. This is because by gradually increasing the compression rate, FedAce has more time to regain accuracy after pruning. This demonstrates that the compression rate schedule is effective in maintaining accuracy.

Dataset        Compression Rate Schedule         Model Sparsity   Accuracy Change (%)
Femnist        Constant                          75.3%            -1.93
               Warm-up only                      75.3%            -0.64
               Warm-up + exponential increase    75.3%            -0.05
StackOverflow  Constant                          58.8%            -0.92
               Warm-up only                      58.8%            -0.32
               Warm-up + exponential increase    58.8%            -0.02

Table 3.3: Performance of the compression rate schedule.

Dataset        Compression Rate Schedule                    Model Sparsity   Accuracy Drop
Femnist        Constant                                     90%              -3.32
               Schedule (warm-up only)                      90%              -2.92
               Schedule (warm-up + exponential increase)    90%              -2.53
StackOverflow  Constant                                     90%              -2.03
               Schedule (warm-up only)                      90%              -1.23
               Schedule (warm-up + exponential increase)    90%              -0.49

Table 3.4: Performance of the compression rate schedule under a high compression rate.

Number of Cached Masks. Finally, we study the impact of the number of cached masks in the central server. Figure 3.9 plots the accuracy of FedAce using different numbers of cached masks on two datasets. We have two observations. On one hand, using a larger number of cached masks can achieve better accuracy than using no cached masks (cache = 1). For example, using cache = 5 on the Femnist dataset achieves higher accuracy, and using cache = 10 achieves higher accuracy on StackOverflow. The reason is that, by increasing the number of cached masks, FedAce is able to aggregate more local information in order to determine the globally optimal mask. On the other hand, continuing to increase the number of cached masks may hurt accuracy. For example, using cache = 10 on Femnist and cache = 20 on StackOverflow is worse than using cache = 1. This is because the masks are computed by local models, which are updated in each round. Aggregating too many local masks computed in different rounds introduces staleness into the local mask aggregation.

Figure 3.9: Test accuracy comparison under different mask cache settings. (a) Femnist. (b) StackOverflow.
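To make the mask caching mechanism concrete, the sketch below shows one way a server could accumulate local masks and only regenerate the global mask every few rounds, as in the cache settings compared in Figure 3.9. This is an illustrative sketch, not FedAce's actual implementation; the class and parameter names are hypothetical, and the majority-vote threshold is assumed to be supplied by the caller.

import numpy as np

class MaskCache:
    # Server-side cache of local masks: the global mask is regenerated only
    # every `cache_size` rounds, using all masks received since the last
    # aggregation, so it reflects a larger pool of clients.
    def __init__(self, cache_size, vote_threshold):
        self.cache_size = cache_size
        self.vote_threshold = vote_threshold
        self.cached_masks = []
        self.current_mask = None  # no pruning until the first aggregation

    def add_round(self, round_masks):
        # round_masks: binary masks received from this round's selected clients.
        self.cached_masks.extend(round_masks)

    def maybe_aggregate(self, round_idx):
        # Aggregate by majority vote once every `cache_size` rounds.
        if (round_idx + 1) % self.cache_size == 0 and self.cached_masks:
            votes = np.sum(np.stack(self.cached_masks, axis=0), axis=0)
            self.current_mask = (votes > self.vote_threshold).astype(np.uint8)
            self.cached_masks = []
        return self.current_mask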
3.6 Conclusion

In this work, we propose FedAce, an efficient training framework for federated learning. FedAce addresses two key problems in federated learning. The first problem is that federated learning requires a large number of communication rounds due to low sampling efficiency in both data selection and client selection. The second problem is that the model trained in federated learning has to be a small model due to inference speed requirements and constrained bandwidth, resulting in poor performance. To address the above challenges, FedAce has two components: an importance sampler that selects important data for local training and important clients in each round to increase training efficiency, and a model compressor that progressively compresses the large model. Experiment results on four federated learning datasets show that FedAce can reduce the training overhead by 3.2× to 6.5× without accuracy drop.

Chapter 4

Conclusion

In this dissertation, we propose three collaborative distributed deep learning systems for the edge: Distream, Mercury, and FedAce, to enable deep learning-based serving and training for distributed edge systems, respectively. Distream addresses the key challenges brought by workload dynamics in real-world deployments, and contributes novel techniques that complement existing live video analytics systems. Mercury is a distributed training framework for efficient on-device deep learning under the distributed edge setting. Mercury improves training quality and reduces convergence wall-clock time without additional overhead by adapting to bandwidth changes using three novel techniques: group-wise importance sampling, importance-aware resharding, and overlapping. FedAce is an efficient federated learning system that improves training efficiency by combining data and client importance sampling with adaptive model compression.

Through experiments, we show that these three systems outperform state-of-the-art approaches and that existing solutions can benefit from the proposed techniques for additional performance gains. Therefore, we believe these three works represent a significant contribution to enabling large-scale deep learning serving and training systems on distributed edge architectures.

BIBLIOGRAPHY

[1] Nvidia Titan X, 2016.

[2] 24-port gigabit stackable smart managed switch with 4 10GbE SFP+ ports, 2017.

[3] Nvidia Jetson TX1, 2017.

[4] OpenCV background subtraction, 2017.

[5] Networking solutions for IP surveillance, 2018.

[6] Nvidia Jetson TX2, 2018.

[7] Jackson Hole, 2019.

[8] F. Akgul. ZeroMQ. Packt Publishing Ltd, 2013.

[9] G. Alain, A. Lamb, C. Sankar, A. Courville, and Y. Bengio. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.

[10] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proceedings of the Advances in Neural Information Processing Systems, pages 1709–1720, 2017.

[11] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[12] D. Bastos, M. Shackleton, and F. El-Moussa. Internet of things: A survey of technologies and security risks in smart home and city environments. 2018.

[13] C. Canel, T. Kim, G. Zhou, C. Li, H. Lim, D. G. Andersen, M. Kaminsky, and S. R. Dulloor. Scaling video analytics on constrained edge nodes. arXiv preprint arXiv:1905.13536, 2019.

[14] B.-G. Chun, S. Ihm, P. Maniatis, M. Naik, and A. Patti. CloneCloud: elastic execution between mobile device and cloud. In Proceedings of the Sixth Conference on Computer Systems, pages 301–314, 2011.

[15] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica. Clipper: A low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 613–627, 2017.

[16] E. Cuervo, A. Balasubramanian, D.-k. Cho, A. Wolman, S. Saroiu, R. Chandra, and P. Bahl.
Maui: making smartphones last longer with code offload. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, pages 49–62, 2010.

[17] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems, page 4. ACM, 2016.

[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[19] S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. arXiv preprint arXiv:1803.01113, 2018.

[20] B. Fang, X. Zeng, F. Zhang, H. Xu, and M. Zhang. FlexDNN: Input-adaptive on-device deep learning for efficient mobile vision. In Proceedings of the 5th ACM/IEEE Symposium on Edge Computing (SEC), 2020.

[21] B. Fang, X. Zeng, and M. Zhang. NestDNN: Resource-aware multi-tenant on-device deep learning for continuous mobile vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking (MobiCom), pages 115–127, 2018.

[22] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[23] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560, 2015.

[24] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[25] S. H. Hashemi, S. A. Jyothi, and R. H. Campbell. TicTac: Accelerating distributed deep learning with communication scheduling. In Proceedings of the 2nd SysML Conference, 2019.

[26] C. He, S. Li, J. So, M. Zhang, H. Wang, X. Wang, P. Vepakomma, A. Singh, H. Qiu, L. Shen, et al. FedML: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518, 2020.

[27] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.

[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[29] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.

[30] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[31] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[32] K. Hsieh, G. Ananthanarayanan, P. Bodik, P. Bahl, M. Philipose, P. B. Gibbons, and O. Mutlu. Focus: Querying large video datasets with low latency and low cost. arXiv preprint arXiv:1801.03493, 2018.

[33] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O. Mutlu. Gaia: Geo-distributed machine learning approaching LAN speeds. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 629–647, 2017.

[34] T.-M. H. Hsu, H. Qi, and M. Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

[35] C.-C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu, P. Bahl, and M. Philipose. VideoEdge: Processing camera streams using hierarchical clusters. In 2018 IEEE/ACM Symposium on Edge Computing (SEC), pages 115–131. IEEE, 2018.
[36] L. N. Huynh, Y. Lee, and R. K. Balan. DeepMon: Mobile GPU-based deep learning framework for continuous vision applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 82–95, 2017.

[37] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2592–2600, 2016.

[38] S. Jain, G. Ananthanarayanan, J. Jiang, Y. Shu, and J. Gonzalez. Scaling video analytics systems to large camera deployments. In Proceedings of the 20th International Workshop on Mobile Computing Systems and Applications, pages 9–14, 2019.

[39] A. Jayarajan, J. Wei, G. Gibson, A. Fedorova, and G. Pekhimenko. Priority-based parameter propagation for distributed DNN training. arXiv preprint arXiv:1905.03960, 2019.

[40] J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica. Chameleon: scalable adaptation of video analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 253–266. ACM, 2018.

[41] J. Jiang, Y. Zhou, G. Ananthanarayanan, Y. Shu, and A. A. Chien. Networked cameras are the new big data clusters. In Proceedings of the 2019 Workshop on Hot Topics in Video Analytics and Intelligent Edges, pages 1–7, 2019.

[42] S. Jiang, Z. Ma, X. Zeng, C. Xu, M. Zhang, C. Zhang, and Y. Liu. Scylla: QoE-aware continuous mobile vision with FPGA-based dynamic deep neural network reconfiguration. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 1369–1378. IEEE, 2020.

[43] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

[44] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11):1586–1597, 2017.

[45] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.

[46] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling. arXiv preprint arXiv:1803.00942, 2018.

[47] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[48] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[49] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.

[50] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[51] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, CO, Oct. 2014. USENIX Association.

[52] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014.

[53] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.

[54] Y. Li, M. Yu, S. Li, S. Avestimehr, N. S. Kim, and A. Schwing. Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training. In Proceedings of the Advances in Neural Information Processing Systems, pages 8045–8056, 2018.
[55] H. Lim, D. G. Andersen, and M. Kaminsky. 3LC: Lightweight and effective traffic compression for distributed machine learning. arXiv preprint arXiv:1802.07389, 2018.

[56] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.

[57] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[58] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.

[59] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[60] L. Ma, D. Van Aken, A. Hefny, G. Mezerhane, A. Pavlo, and G. J. Gordon. Query-based workload forecasting for self-driving database management systems. In Proceedings of the 2018 International Conference on Management of Data, pages 631–645. ACM, 2018.

[61] X. Ma, H. Zhong, Y. Li, J. Ma, Z. Cui, and Y. Wang. Forecasting transportation network speed using deep capsule networks with nested LSTM models. IEEE Transactions on Intelligent Transportation Systems, 2020.

[62] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

[63] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[64] Q. Meng, W. Chen, Y. Wang, Z.-M. Ma, and T.-Y. Liu. Convergence analysis of distributed stochastic gradient descent with shuffling. arXiv preprint arXiv:1709.10432, 2017.

[65] S. A. Noghabi, L. Cox, S. Agarwal, and G. Ananthanarayanan. The emerging landscape of edge computing. GetMobile: Mobile Computing and Communications, 23(4):11–20, 2020.

[66] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[67] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.

[68] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[69] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[70] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[71] H. Shen, L. Chen, Y. Jin, L. Zhao, B. Kong, M. Philipose, A. Krishnamurthy, and R. Sundaram. Nexus: a GPU cluster engine for accelerating DNN-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 322–337, 2019.

[72] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1310–1321. ACM, 2015.

[73] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[74] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, pages 4427–4437, 2017.

[75] Speech command dataset, 2018.

[76] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[77] S. Venkataraman, A. Panda, K. Ousterhout, M. Armbrust, A. Ghodsi, M. J. Franklin, B. Recht, and I. Stoica. Drizzle: Fast and adaptable stream processing at scale. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 374–389. ACM, 2017.

[78] J. Wang and G. Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv preprint arXiv:1810.08313, 2018.

[79] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Proceedings of the Advances in Neural Information Processing Systems, pages 1299–1309, 2018.

[80] P. Watcharapichat, V. L. Morales, R. C. Fernandez, and P. Pietzuch. Ako: Decentralised deep learning with partial gradient exchange. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 84–97. ACM, 2016.

[81] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.

[82] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2016.

[83] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu. AID: A benchmark dataset for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017.

[84] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.

[85] B. Zhang, X. Jin, S. Ratnasamy, J. Wawrzynek, and E. A. Lee. AWStream: adaptive wide-area streaming analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 236–252. ACM, 2018.

[86] H. Zhang, G. Ananthanarayanan, P. Bodik, M. Philipose, P. Bahl, and M. J. Freedman. Live video analytics at scale with approximation and delay-tolerance. In NSDI, volume 9, page 1, 2017.

[87] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 181–193, 2017.

[88] M. Zhang, F. Zhang, N. D. Lane, Y. Shu, X. Zeng, B. Fang, S. Yan, and H. Xu. Deep learning in the era of edge computing: Challenges and opportunities. Fog Computing: Theory and Practice, pages 67–78, 2020.

[89] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. Advances in Neural Information Processing Systems, 28:685–693, 2015.

[90] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Proceedings of the Advances in Neural Information Processing Systems, pages 649–657, 2015.

[91] P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the International Conference on Machine Learning, pages 1–9, 2015.

[92] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.

[93] Z. Zivkovic and F. Van Der Heijden. Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters, 27(7):773–780, 2006.