COLLABORATIVE DISTRIBUTED DEEP LEARNING SYSTEMS ON THE EDGES

By

Xiao Zeng

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Electrical Engineering – Doctor of Philosophy

2021

ABSTRACT

COLLABORATIVE DISTRIBUTED DEEP LEARNING SYSTEMS ON THE EDGES

By

Xiao Zeng

Deep learning has revolutionized a wide range of fields. In spite of its success, most deep learning systems are proposed in the cloud, where data are processed in a centralized manner with abundant compute and network resources. This raises a problem when deep learning is deployed on the edge, where distributed compute resources are limited. In this dissertation, we propose three distributed systems to enable collaborative deep learning on the edge. These three systems target different scenarios and tasks. The first system, dubbed Distream, is a distributed live video analytics system based on the smart camera-edge cluster architecture. Distream fully utilizes the compute resources at both ends to achieve optimized system performance. The second system, dubbed Mercury, addresses the key bottleneck of collaborative learning. Mercury enhances the training efficiency of on-device collaborative learning without compromising the accuracies of the trained models. The third system, dubbed FedAce, is a distributed training system that improves training efficiency under the federated learning setting, where private on-device data are not allowed to be shared among local devices. Within each participating client, FedAce achieves such improvement by prioritizing important data. In the server where model aggregation is performed, FedAce exploits the client importance and prioritizes important clients to reduce stragglers and reduce the total number of rounds. In addition, FedAce conducts federated model compression to reduce the per-round communication cost and obtains a compact model after training completes. Extensive experiments show that the proposed three systems are able to achieve significant improvements over status-quo systems.

ACKNOWLEDGMENTS

I would like to express the deepest appreciation to many people. First, I am honored and grateful to have Dr. Mi Zhang as my academic advisor. He has been not only an excellent supervisor but also a great friend to me since we met for the first time. I would not have made so much achievement and completed my Ph.D. degree without him. I would also like to thank Dr. Xiaobo Tan, Dr. Xiaoming Liu and Dr. Jiliang Tang, who serve as my committee members and provide guidance for my academic career.

Second, I would like to thank my fellows in the lab, including Biyi Fang, Wei Du, Dong Chen, Jiajia Li, Yu Zheng, Wei Ao, Shen Yan, Xiaoyu Zhang, Haochen Sun, Kai Cao, Kaixiang Lin and Lai Wei for their encouragement and help during my life at Michigan State University. I am also grateful for all the staff and faculty in the ECE department.

Lastly, I would like to thank my family, who have been providing tremendous support and encouragement. I have left them for so many years and I am truly thankful for their understanding. Especially, I would like to thank my wife, who has been staying with me since we came to the U.S. to pursue my Ph.D. degree. I am also blessed to have my son, Lucas, during the last year of my Ph.D. career. He is the best gift ever given to me in my Ph.D. life.

TABLE OF CONTENTS

LIST OF TABLES ....................................... vi
LIST OF FIGURES ...................................... vii
Chapter 1 Collaborative Deep Learning Inference on the Edge ...........
1 1.1Introduction......................................1 1.2BackgroundandMotivation..............................6 1.2.1LiveVideoAnalyticsPipeline........................6 1.2.2WorkloadDynamicsinReal-worldDeployments..............8 1.2.3NeedforWorkloadAdaptation........................9 1.3DistreamDesign....................................12 1.3.1OverallArchitecture.............................12 1.3.2BalancingtheWorkloadsacrossSmartCameras...............13 1.3.3PartitioningtheWorkloadbetweenSmartCamerasandEdgeCluster....16 1.3.4WorkloadAdaptationController.......................19 1.4SystemImplementation................................23 1.5Evaluation.......................................25 1.5.1ExperimentalMethodology..........................26 1.5.2OverallPerformance.............................28 1.5.3Component-wiseAnalysis..........................32 1.5.4SensitivityAnalysis..............................33 1.5.5ScalingPerformance.............................35 1.5.6SystemOverheads..............................36 1.5.7toExistingSystems.........................36 1.6RelatedWork.....................................37 1.7ConclusionandFutureWork.............................39 Chapter2CollaborativeDeepLearningTrainingontheEdge ............ 40 2.1Introduction......................................40 2.2BackgroundandMotivation..............................45 2.2.1ArchitectureofOn-DeviceCollaborativeLearningSystems.........45 2.2.2ExistingApproachesandTheirLimitations.................46 2.3DistreamOverview..................................48 2.3.1KeyInsight..................................49 2.3.2RudimentaryFrameworkandPerformanceModel..............51 2.3.3OverallDesign................................52 2.4DesignDetails.....................................53 2.4.1Group-wiseImportanceComputationandSampling.............53 2.4.2Importance-awareDataResharding.....................56 2.4.3BACCScheduler...............................57 2.4.4ProofofTrainingCorrectness........................59 iv 2.5Implementation....................................60 2.6Evaluation.......................................62 2.6.1ExperimentalMethodology..........................62 2.6.2OverallPerformance.............................65 2.6.3Component-wiseAnalysis..........................67 2.6.4BandwidthAdaptationPerformance.....................70 2.6.5ScalingPerformance.............................71 2.7RelatedWork.....................................71 2.8Conclusion......................................72 Chapter3CollaborativeLearningforFederatedLearning ......... 
73 3.1Introduction......................................73 3.2RelatedWork.....................................76 3.3BackgroundandMotivation..............................78 3.3.1ArchitectureofFederatedLearning......................78 3.3.2FederatedLearningCharacteristics......................79 3.3.3ScopeandImprovingOpportunity......................80 3.4FedAceDesign....................................82 3.4.1OverallArchitecture.............................82 3.4.2FederatedImportanceSampling.......................83 3.4.2.1DataImportanceSampling.....................84 3.4.2.2ClientImportanceSampling....................85 3.4.3FederatedModelCompression........................87 3.5Experiments......................................91 3.5.1Setup.....................................91 3.5.2Results....................................92 3.5.2.1End-to-EndPerformanceComparison...............93 3.5.2.2SensitivityAnalysis........................93 3.6Conclusion......................................97 Chapter4Conclusion .................................. 98 BIBLIOGRAPHY ....................................... 99 v LISTOFTABLES Table1.1:ComparisonbetweenDistreamandstatusquolivevideoanalyticssystems....2 Table1.2:PerformancegainoverCamera-Only(CampusSurveillance)...........29 Table1.3:PerformancegainoverCamera-Only(TrafMonitoring).............30 Table1.4:Component-wiseanalysisofDistream.PerformancegainsareoverCamera-Only.33 Table1.5:PerformanceofDistreamunderlow(10Mbps)andhigh(50Mbps)network bandwidths.....................................35 Table1.6:Cross-cameraworkloadbalancingoverheadsofDistream.............36 Table3.1:Comparisonbetweenfederatedlearninganddistributedtraining.........80 Table3.2:End-to-endperformance.ThenumbersindicatethespeedupsofFedAceover FedAvg.......................................93 Table3.3:Performanceofcompressionrateschedule.....................96 Table3.4:Performanceofcompressionratescheduleunderhighcompressionrate.....96 vi LISTOFFIGURES Figure1.1:LivevideoanalyticspipelineusedinDistream..................8 Figure1.2:Workloaddynamicsinreal-worlddeployment..................9 Figure1.3:SystemarchitectureofDistream..........................10 Figure1.4:Comparisonofthroughput(left)andlatency(right)betweenworkload-agnostic andworkload-adaptive(ours)schemes......................11 Figure1.5:Illustrationonwheretheworkload-agnosticschemefallsshortunderdynamic workloadsinreal-worlddeployments.......................11 Figure1.6:Overviewofthecross-cameraworkloadbalancer.................14 Figure1.7:FullDAGThepercentageontopofeachvertexisthenormalized accumulatedinferencecost............................17 Figure1.8:StochasticDAGpartition.Vertexesinbluecolorareallocatedto runatthecamera,andvertexesinbrowncolorareallocatedtorun attheedgecluster.................................18 Figure1.9:Overviewofcamera-clusterworkloadpartitioninworkloadadaptationcon- troller.TheoptimalDAGpartitionpointsofallcamerasarejointlydetermined.22 Figure1.10:ThroughputandlatencydistributiononCampusSurveillance..........30 Figure1.11:ThroughputandlatencydistributiononTrafMonitoringapplication.....31 Figure1.12:LatencySLOmissratecomparison........................31 Figure1.13:IllustrationonwhyDistreamoutperformsbaselines...............32 Figure1.14:Impactofsystemhyper-parametersontheperformanceofDistream.......34 Figure1.15:ScalingperformanceofDistream.........................35 Figure1.16:Resource-accuracytradeoffcurvecomparisonbetweenVideoEdgeandVideoEdge +Distream.....................................37 
Figure2.1:Themulti-parameterserverarchitectureofon-devicecollaborativelearning systems.......................................45 vii Figure2.2:Comparisonincommunication-computationoverlappingbetweenhighband- widthandlowbandwidthsetting.........................47 Figure2.3:Distributionsofdataimportanceasthenumberofiterationsincreasesduring training.......................................48 Figure2.4:Aconceptualcomparisonbetween(a)randomsamplingand(b)importance sampling......................................50 Figure2.5:ConceptualComparisonbetweenrudimentaryimportancesampling-basedframe- workandMercury.................................52 Figure2.6:Distributionsofrankdifference..........................53 Figure2.7:Illustrationofgroup-wiseimportancecomputationandsampling.Eachdot representsonesampleandthedarknessindicatesitsimportance.........55 Figure2.8:Importance-agnosticdatareshardingvs.importance-awaredataresharding...57 Figure2.9:(a)UnoptimizedPipelinevs.(b)Optimizedpipeline. w t representsthemodel weightatiteration t .R=R1+R2+R3+R4inlength..............58 Figure2.10:Distreamimplementation.............................61 Figure2.11:OverallperformancecomparisononCifar10andCifar100...........64 Figure2.12:OverallperformancecomparisononSVHNandAID..............65 Figure2.13:OverallperformancecomparisononTwSpeechCommandandAG- News.......................................65 Figure2.14:Testaccuracycurvesduringthecompletetrainingprocessintermsofnumber ofcommunicationrounds(left)andwall-clocktime(right)...........67 Figure2.15:Testaccuracycurvesintermsofnumberofiterations(left)andwall-clocktime (right)usinggroup-wise(Groupwise-IS)andall-inclusiveimportancecompu- tationandsampling(All-IS)............................67 Figure2.16:Importance-awarevs.importance-agnosticdataresharding,vs.non-resharding.69 Figure2.17:Testaccuracycurvesintermsofnumberofiterations(left)andwall-clocktime (right)usingBACCandsequentialpipeline....................69 Figure2.18:SpeeduponCIFAR-10...............................70 viii Figure2.19:Bandwidthsnapshot................................70 Figure2.20:ScalingperformanceofDistreamon4,8,12edgedevicesundervariousnet- workbandwidths..................................71 Figure3.1:Federatedlearningoverview...........................80 Figure3.2:Trainingprocessfederatedlearning.......................82 Figure3.3:OverviewofFedAce...............................83 Figure3.4:Overviewoffederatedimportancesampling...................84 Figure3.5:Overviewoffederatedmodelcompression....................88 Figure3.6:Examplesofmaskcaching............................90 Figure3.7:Performanceunderdifferentnumberofepochs.................94 Figure3.8:PerformancegainofFedAceinthreefederatedlearningdatasets........95 Figure3.9:Testaccuracycomparisonunderdifferentmaskcachesettings..........97 ix Chapter1 CollaborativeDeepLearningInferenceon theEdge 1.1Introduction Videocamerasareubiquitous.Today,camerashavebeendeployedatscaleatplacessuchastraf intersections,universitycampuses,andgrocerystores.Drivenbytherecentbreakthroughindeep learning(DL)[49],organizationsthathavedeployedthesecamerasstarttouseDL-basedtech- niquesfor livevideoanalytics [86,35,40].Analyzinglivevideosstreamedfromthesedistributed camerasisthebackboneofawiderangeofapplicationssuchastrafcontrolandsecuritysurveil- lance.Asmanyoftheseapplicationsrequireproducinganalyticsresultsinreal-time,achieving low-latency , high-throughput ,and scalable videostreamprocessingiscrucial[77]. 
Live video analytics systems require high-resolution cameras to capture high-quality visual data for analytics. As the camera number scales up, these always-on cameras collectively generate hundreds of gigabytes of data every single second, making it infeasible to transmit such gigantic volume of data to datacenters in the cloud for real-time processing due to insufficient network bandwidth and long transmission latency between cameras and the cloud [88].

The key to overcoming this bottleneck is to move compute resources close to where data reside. Status quo live video analytics systems hence stream camera feeds to local edge clusters for data analytics where much higher network bandwidth is provided [86, 40, 44, 65]. To move compute resources even closer to data sources, major video analytics providers (e.g., Avigilon, Hikvision, NVIDIA) are replacing traditional video cameras with "smart cameras". Equipped with onboard DL accelerators, these smart cameras are not only able to perform basic video processing tasks such as background subtraction and motion detection, but also capable of executing complicated compute-intensive DL-based pipelines to detect and recognize the objects and a variety of their attributes [27, 82, 41]. Since each smart camera brings extra compute resources to process video streams generated by itself, this smart camera-edge cluster architecture is the key to enabling live video analytics at scale [35, 13, 41].

                      Architecture   Video Analytics       Workload   Cross-Camera
                                     Pipeline Partition    Adaptive   Workload Balancing
  VideoStorm [86]     Centralized    N/A                   No         N/A
  Chameleon [40]      Centralized    N/A                   No         N/A
  NoScope [44]        Centralized    N/A                   No         N/A
  FilterForward [13]  Camera-Only    Fixed                 No         No
  VideoEdge [35]      Distributed    Dynamic               No         No
  Distream            Distributed    Dynamic               Yes        Yes

Table 1.1: Comparison between Distream and status quo live video analytics systems.

Motivation & Limitations of Status Quo. In real-world deployments, depending on what areas the cameras are covering, the number of objects of interest (e.g., people, vehicles, bikes) captured by each camera is different and can vary over time. For example, for surveillance systems deployed on university campuses, a camera that covers the building entrance captures much larger numbers of people before and after classes than any other time; a camera that points at the emergency exit where people rarely visit has no objects of interest captured most of the time. As a consequence, the workload of recognizing the captured objects and their attributes produced at each camera is different and inherently dynamic over time.

As listed in Table 1.1, in recent years, live video analytics systems such as VideoStorm [86], Chameleon [40] and NoScope [44] have emerged. These systems enable efficient processing of a large number of camera streams, but are designed to process the streams within a centralized cluster.

With the emergence of smart cameras, recent systems start to leverage the compute resources inside smart cameras for distributed live video analytics. FilterForward [13] proposes a camera-only solution which identifies important video frames and filters out unimportant ones directly on smart cameras. VideoEdge [35], on the other hand, proposes a fully distributed framework that partitions the video analytics pipeline across cameras and cluster with the objective to optimize the resource-accuracy tradeoff. While these systems aim to optimize live video analytics from a variety of perspectives, they are agnostic to the workload dynamics in real-world deployments described above, making them fall short in two situations: on one hand, failing to utilize the compute resources inside idle cameras could considerably jeopardize system throughput; on the other hand, failing to alleviate the workloads from cameras that are overloaded by bursty workloads could incur high latency, causing the system to fail to meet the latency service level objective (SLO) imposed by the live video analytics applications.

Overview of the Proposed Approach.
In this paper, we present Distream, a distributed framework based on the smart camera-edge cluster architecture that is able to adapt to the workload dynamics in real-world deployments to achieve low-latency, high-throughput, and scalable DL-based live video analytics. The underpinning principle behind the design of Distream is to adaptively balance the workloads across smart cameras as well as partition the workloads between cameras and the edge cluster. In doing so, Distream is able to fully utilize the compute resources at both ends to jointly maximize the system throughput and minimize the latency without sacrificing the video analytics accuracy.

The design of Distream involves three key challenges.

Challenge #1: Cross-Camera Workload Balancing. One key obstacle to achieving high-throughput, low-latency live video analytics is caused by the imbalanced workloads across cameras. Therefore, the challenge lies in designing a scheme that balances the workloads across cameras. However, the cross-camera workload correlation, the heterogeneous onboard compute capacities of smart cameras, and the overhead of workload balancing altogether make designing such a scheme not trivial.

Challenge #2: Camera-Cluster Workload Partitioning. Another key obstacle to achieving high-throughput, low-latency live video analytics is caused by the imbalanced workloads between smart cameras and the edge cluster. To balance the workloads between cameras and edge cluster, the video analytics pipeline should be partitioned based on the workload ratio of the two sides. However, the possible options to partition the video analytics pipeline are quite limited in number, making workload partitioning between camera and edge cluster by nature coarse-grained. As a result, the partitioned video analytics pipeline may not match the workload ratio of the two sides.

Challenge #3: Adaptation to Workload Dynamics. Given the dynamics of workloads in real-world deployments, the optimal solutions for cross-camera workload balancing and camera-cluster workload partitioning vary over time. Being able to adapt to such workload dynamics is a must for high-performance live video analytics systems. Designing such an adaptation scheme, however, is not trivial, as the optimal pipeline partitioning solution for each camera can be different. More importantly, since the workloads are jointly executed between cameras and edge cluster, for the whole system to achieve the best performance, the optimal pipeline partitioning solutions for all the cameras need to be jointly determined. This is a much more challenging problem compared to the single-pair workload partitioning problem tackled in the literature [16, 14].

To address the first challenge, Distream incorporates a cross-camera workload balancer that takes the cross-camera workload correlation, heterogeneous compute capacities of smart cameras, as well as the overhead of workload balancing into account, and formulates the task of cross-camera workload balancing as an optimization problem. In particular, the proposed cross-camera workload balancer incorporates a long short-term memory (LSTM)-based recurrent neural network which is able to enhance the performance of cross-camera workload balancing by predicting incoming workloads in the near future to avoid migrating workloads to cameras that are going to experience high workloads.

To address the second challenge, Distream incorporates a stochastic partitioning scheme that partitions the video analytics pipeline in a stochastic manner. In doing so, it provides much more partition flexibility and much finer partition granularity. As such, Distream is able to partition the pipeline to match the workload ratio of the smart camera and edge server.

To address the third challenge, Distream incorporates a workload adaptation controller which triggers the cross-camera workload balancer when cross-camera workload imbalance is detected.
Moreover, it formulates the task of jointly identifying the optimal pipeline partitioning solutions for all the cameras as an optimization problem with the objective to maximize the overall system throughput subject to the latency SLO imposed by the live video analytics applications.

System Implementation & Summary of Evaluation Results. We implemented Distream and deployed it on a self-developed testbed that consists of 24 smart cameras and a 4-GPU edge cluster. We evaluate Distream with 500 hours of distributed video streams from two real-world video datasets: one from six traffic cameras deployed in Jackson Hole, WY [7] for a traffic monitoring application, and the other from 24 surveillance cameras deployed on a university campus for a security surveillance application. We compare Distream against three baselines: Centralized, which processes all the workloads on the edge cluster; Camera-Only, which processes all the workloads on smart cameras; and VideoEdge [35]. Our results show that Distream consistently outperforms the baselines in terms of throughput, latency, and latency SLO miss rate by a large margin due to its workload adaptation schemes. Moreover, our scaling experiments show that Distream is able to scale up system throughput nearly linearly while maintaining a low latency with negligible overheads. Finally, we show that the workload adaptation techniques proposed in Distream could benefit existing live video analytics systems and enhance their performance as well.

Summary of Contributions. To the best of our knowledge, Distream represents the first distributed framework that enables workload-adaptive live video analytics under the smart camera-edge cluster architecture. It identifies a key performance bottleneck and contributes novel techniques that address the limitations of existing systems. We believe our work represents a significant step towards turning the envisioned large-scale live video analytics into reality.

1.2 Background and Motivation

1.2.1 Live Video Analytics Pipeline

Modern live video analytics pipelines typically adopt a cascaded architecture which consists of a front-end object detector followed by a back-end classification module to perform a variety of analytics tasks on each of the detected objects of interest within a video frame.

There are two types of object detectors that have been commonly used in existing live video analytics systems [40]. The first type is the CNN-based object detector (e.g., YOLO [68], SSD [57] and Faster-RCNN [69]) which extracts and classifies all the objects of interest in a frame with one single inference. However, such an object detector has to be constantly extracting features from frames and performing inference even if there is no object of interest appearing in video streams. In many scenarios in real-world deployments, however, the objects of interest may only appear in video streams for short periods of time. In such case, a significant amount of compute resources is wasted. The second type of object detector is to use a light-weight background subtractor [93] to extract the regions where objects of interest reside from the frame. It sends these regions to a classifier to identify the object within each region. As such, it produces objects of interest only when they appear in the frame. While Distream is a generic framework which supports both types of object detectors, in this work, we focus on the background subtraction-based detector to illustrate our ideas.

The classification module in general can be represented as a directed acyclic graph (DAG). In this work, we use attribute recognition as a concrete example of the classification module to illustrate our ideas. Specifically, each vertex in the DAG represents a DNN that recognizes a particular attribute of the object; each directed edge represents a data flow from one classifier to another. For example, the DAG in Figure 1.1 consists of three branches. Based on the type of the detected object (vehicle, person, or others), the classification module selects one of the three branches, each of which employs a cascaded sequence of classifiers to further identify the attributes of the object^1.
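To make the DAG structure concrete, the sketch below shows one possible way such a classification module could be represented in code: one branch per detected object type, where each classifier along the selected branch corresponds to one DNN inference. This is purely illustrative; the dictionary layout and function names are assumptions (not Distream's actual code), and the branch contents follow the example in Figure 1.1.

```python
# Illustrative sketch only: the attribute-recognition classification module as a DAG,
# keyed by detected object type. Vertex names follow the example branches in Figure 1.1;
# the structure and function names are hypothetical, not Distream's implementation.

ATTRIBUTE_DAG = {
    "vehicle": ["Color", "Type", "Make"],      # upper branch
    "person":  ["Behavior", "Gender"],         # middle branch
    "others":  [],                             # no further attribute classifiers
}

def workloads_for_object(object_type):
    """Each classifier inference along the selected branch is treated as one workload."""
    return ATTRIBUTE_DAG.get(object_type, [])

# Example: a detected vehicle produces three workloads (Color, Type, Make inferences).
print(workloads_for_object("vehicle"))
```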
Based on the pipeline illustrated in Figure 1.1, going through each classifier within the DAG is regarded as an individual workload^2. Therefore, identifying an object of interest and its attributes within a video frame produces multiple workloads to be processed by the live video analytics system.

^1 Although we adopted the design choice of treating each task independently, Distream is flexible to support multi-task learning where a DNN inference produces multiple results. It should be noted that not all the tasks can be combined into a single multi-task DNN model. Thus our DAG formulation is general enough to support all the cases.

^2 We did not include capturing images and background extraction as workloads mainly because these steps consume much less computation compared to DNN-based inference and thus can be executed locally fast enough without offloading.

Figure 1.1: Live video analytics pipeline used in Distream.

1.2.2 Workload Dynamics in Real-world Deployments

To illustrate the workload dynamics in real-world deployments, we collected a dataset from 24 surveillance cameras deployed on a university campus (§2.6.1). Given the limited space, we pick a representative video clip to make our points. Specifically, Figure 1.2 shows the workloads generated by four surveillance cameras deployed at a university building on a weekday between 12pm to 12:30pm. Among them, CAM1 monitors a square next to the building entrance; CAM2 monitors a sideway outside the building; CAM3 and CAM4 cover two different corridors inside the building.

Figure 1.2: Workload dynamics in real-world deployment.

As illustrated, depending on the views captured by the cameras, the workloads of real-world video streams have three key characteristics. (i) The workloads are different across cameras: for CAM1 that monitors busy areas, more workloads are generated as pedestrians pass by more often; for CAM3 that monitors a less busy indoor corridor, much fewer workloads are generated sporadically and it has no workloads most of the time. (ii) The workload generated at each camera is dynamic over time: this is obvious because the content captured by each camera is changing over time. (iii) The amount of workloads generated at each camera can vary significantly: the workload difference between bursty and non-bursty durations can be more than 1000×.

1.2.3 Need for Workload Adaptation

Existing live video analytics systems based on the smart camera-edge cluster architecture, however, are agnostic to the real-world workload dynamics illustrated above. To demonstrate how existing systems fall short under such workload dynamics, we compare the system performance between Distream that is workload-adaptive and a workload-agnostic baseline that allocates compute resource proportional to the compute capacity. We use the same dataset in §2.2 and a testbed that consists of four smart cameras (2 Jetson-TX1 and 2 Jetson-TX2) and a 1-GPU edge cluster (§1.4) to
Tounderstandwhytheworkload-agnosticschemefallsshort,weuseonesmartcameraasan exampleandplotitsworkloadsgeneratedoveratwominsvideoclipinFigure1.5.,the blacksolidlinedepictsthegeneratedworkloadsovertime;thereddashedlinedepictsthecamera's processingcapability(i.e.,themaximumnumberofworkloadsitcanprocessperunittime)based ontheworkload-agnosticscheme.Asshown,undertheworkload-agnosticscheme,althoughthe workloadgeneratedatthesmartcameraisdynamicovertime,theprocessingcapabilityofthe camerastaysthesameanddoesnotadapttotheworkloadvariation.Consequently,whenthe generatedworkloadexceedsitsprocessingcapability,thecameraexperiences under-provisioning 10 Figure1.4:Comparisonofthroughput(left)andlatency(right)betweenworkload-agnosticand workload-adaptive(ours)schemes. Figure1.5:Illustrationonwheretheworkload-agnosticschemefallsshortunderdynamicwork- loadsinreal-worlddeployments. (redregion)sothatitfailstoprocesstheworkloadsintime;whentheworkloadisbelowits processingcapability,thecamerabecomes over-provisioning (greenregion)andthuswastesits computeresourcesthatcouldhavebeenusedtoprocessworkloadsgeneratedfromothercameras. Theseobservationsaltogetherhighlighttheneedforworkloadadaptationtoavoidtheoccurrence ofeitherunder-provisioningorover-provisioning,whichmotivatesthedesignofDistream. 11 1.3DistreamDesign 1.3.1OverallArchitecture Figure2.5illustratestheoverallarchitectureofDistream.Asshown,Distreamisdesignedasa distributedframeworkthatspansacrosssmartcamerasandtheedgecluster.Inthedataplane, assoonasthesmartcameracapturesavideoframe,itrunstheonboardbackgroundsubtractor toextractregionsofinterest(ROIs)withintheframeandappendsthemonebyoneintoalocal queue( 1 ).EachROIinthequeueiseitheroftoanothercamera( 2a )orpartiallypro- cessedinsidethelocal inferenceengine accordingtoitspartitionedDAGwhosepartitionpoint isdeterminedbythe DAGpartitioner ( 2b ).Theresultofthepartially-processedDAGisthen transferredtotheedgeclusterandbufferedinaqueue( 3 ),whichwillbeprocessedbythe in- ferenceengine insidetheedgecluster( 4 )tocompletetheremainingunprocessedpartofthe DAG.Alltheinferenceresultsfromthecamerasaregatheredattheedgecluster.Inthecontrol plane,the systemmonitor continuouslymonitorstheworkloadsateachcameraandedgecluster, andsendsthecollectedworkloadinformationtothe workloadadaptationcontroller ( i ).Once theworkloadimbalanceacrosscamerasisdetected,the workloadadaptationcontroller triggers the cross-cameraworkloadbalancer ( iia )tobalancetheworkloadsacrosscameras( iiia ).Mean- while,basedonthecurrentworkloadsateachcameraandedgecluster,the workloadadaptation controller adaptivelytheoptimalDAGpartitionpointforeachcamerathatbalancesthe workloadbetweeneachcameraandtheedgecluster.SuchoptimalDAGpartitionpointsaresent tothe DAGpartitioner ( iib )forDAGpartitioning,andthecorrespondingDAGpartitionresultis senttothe inferenceengine ateachcamera( iiib ). Inthefollowing,wedescribethethreekeycomponentsofDistream: cross-cameraworkload balancer (§1.3.2), DAGpartitioner (§1.3.3),and workloadadaptationcontroller (§1.3.4)indetail. 
1.3.2 Balancing the Workloads across Smart Cameras

The first key component of Distream is the cross-camera workload balancer, which balances the workloads at the level of the extracted regions of interest (ROIs) across cameras by migrating part of the workloads from heavily loaded cameras to idle or lightly loaded ones (Figure 1.6). To achieve this, the cross-camera workload balancer needs to accommodate three key considerations:

Figure 1.6: Overview of the cross-camera workload balancer.

Consideration #1: cross-camera workload correlation. As also being observed by many existing works [40, 41, 38], workloads on nearby cameras may exhibit strong correlations due to mobility of objects of interest: a camera may have high workload within a short time period if its nearby cameras are experiencing high workloads. The cross-camera workload balancer needs to take such correlations into account to avoid migrating workloads to cameras that are going to experience high workloads.

Consideration #2: heterogeneous compute capacities. Smart cameras may have heterogeneous onboard compute capacities caused by different generations of compute hardware. To prevent any camera from becoming the bottleneck, the cross-camera workload balancer needs to balance the workloads proportional to each camera's compute capacity.

Consideration #3: workload balancing overheads. In practice, migrating workloads across cameras incurs overheads. The cross-camera workload balancer needs to take such overheads into account such that the overheads do not overshadow the benefits brought by load balancing.

The cross-camera workload balancer takes these three considerations into account, and formulates the task of cross-camera workload balancing as an optimization problem.

Formally, let N denote the total number of smart cameras, w_i denote the workload of camera i before load balancing, and let Σ_j x_ji and Σ_j x_ij denote the workloads that are moved into and moved out of camera i respectively. Therefore, the workload of camera i after load balancing can be calculated as u_i = w_i + Σ_j x_ji − Σ_j x_ij. The goal of the cross-camera workload balancer is to minimize ν, the workload imbalance index across smart cameras after workload balancing, which can be calculated as

    ν = u_max / ū − 1                                                   (1.1)

where u_max and ū represent the maximum workload and average workload of all cameras respectively. By minimizing the workload imbalance index, the cross-camera workload balancer is able to achieve a good balance between workload balancing efficiency and fairness without incurring thrashing.

To accommodate Consideration #1, we take the cross-camera workload correlation into account by incorporating a workload predictor to predict the future workload at each camera (not from migration) based on their current workloads and the cross-camera workload correlations. Specifically, we employ a long short-term memory (LSTM)-based recurrent neural network [61] as the workload predictor because it shows better long-term forecasting ability in predicting time series data compared to other methods [60]. The predicted workload ŵ_i at camera i output by the workload predictor is added to u_i as an adjustment to obtain a more accurate future workload estimate as

    u_i = w_i + Σ_j x_ji − Σ_j x_ij + ŵ_i                               (1.2)

To accommodate Consideration #2, we take the heterogeneous compute capacities of smart cameras into account by normalizing u_i with a factor α_i = C_i / C̄, where C_i is the compute capacity of camera i and C̄ is the average compute capacity of all the smart cameras. As a result, the workload at each camera after load balancing gets normalized as u′_i = u_i / α_i, and the workload imbalance index ν gets normalized as

    ν′ = u′_max / ū′ − 1                                                (1.3)

To accommodate Consideration #3, a triggering threshold β is added into our optimization objective. The cross-camera workload balancing is not triggered when ν′ is below β.
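The sketch below is a minimal illustration of how the normalized imbalance index of Eqs. (1.1)-(1.3) could be computed, assuming the capacity normalization divides the adjusted workload by α_i = C_i / C̄ so that higher-capacity cameras are expected to carry proportionally more workload. Variable and function names are illustrative only, not from Distream's implementation.

```python
# Minimal sketch of the normalized workload imbalance index, Eqs. (1.1)-(1.3).
# Assumes x[i][j] holds the workload moved from camera i to camera j and that the
# capacity normalization divides by alpha_i = C_i / C_bar (an assumption of this sketch).

def normalized_imbalance(w, x, w_hat, C):
    """
    w:     current workloads w_i before balancing
    x:     x[i][j] = workload moved from camera i to camera j
    w_hat: LSTM-predicted incoming workloads per camera
    C:     per-camera compute capacities C_i
    Returns nu' = u'_max / mean(u') - 1 (assumes total workload is nonzero).
    """
    n = len(w)
    C_bar = sum(C) / n
    u_prime = []
    for i in range(n):
        moved_in = sum(x[j][i] for j in range(n))
        moved_out = sum(x[i][j] for j in range(n))
        u_i = w[i] + moved_in - moved_out + w_hat[i]   # Eq. (1.2)
        alpha_i = C[i] / C_bar                         # capacity normalization factor
        u_prime.append(u_i / alpha_i)                  # normalized workload u'_i
    u_bar = sum(u_prime) / n
    return max(u_prime) / u_bar - 1                    # Eq. (1.3)
```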
Putting all the pieces together, our cross-camera workload balancing scheme can be formulated as

    min_{x_ij ≥ 0}  max(ν′ − β, 0)                                      (1.4a)
    s.t.  u_i = w_i + Σ_j x_ji − Σ_j x_ij + ŵ_i                         (1.4b)
          α_i = C_i / C̄                                                 (1.4c)
          u′_i = u_i / α_i                                              (1.4d)
          ν′ = u′_max / ū′ − 1                                          (1.4e)

Solving the above nonlinear optimization problem is computationally hard. To enable real-time cross-camera workload balancing, we utilize a heuristic to efficiently solve the optimization problem. Specifically, our heuristic starts with all x_ij = 0 and increases x_ij iteratively. In each iteration, we select the cameras that have the largest and the smallest workloads to form a migration pair. Then we increase the migrated workload x_ij by Δx (i.e., x_ij = x_ij + Δx), where Δx = 1% · u′_i. We repeat such iteration until ν′ is below a threshold or ν′ does not improve between two consecutive iterations.

As shown in §1.5.6, such heuristic is able to solve the optimization problem in an efficient manner and incurs negligible overheads.

1.3.3 Partitioning the Workload between Smart Cameras and Edge Cluster

The second key component of Distream is the DAG partitioner, which partitions the workloads between smart cameras and edge cluster at the DAG level. To achieve this, the DAG partitioner needs to accommodate two key considerations:

Consideration #1: DAG is conditionally executed. As the contents of video streams are changing, the execution flow in the DAG is changing correspondingly. Take the DAG in Figure 1.1 as an example: if a 'Vehicle' is detected, the upper path of the DAG (i.e., the 'Color', 'Type' and 'Make' classifiers) will be executed; if a 'Person' is detected, the middle path of the DAG (i.e., the 'Behavior' and 'Gender' classifiers) will be executed. However, to identify the optimal DAG partition point, the DAG partitioner needs to consider all the possible partition points from all the possible execution paths.

Consideration #2: DAG partitioning is by nature coarse-grained. Ideally, to balance the workloads between cameras and the edge cluster, the DAG partition point should be set based on the workload ratio of the two sides. However, the possible partition points in a DAG are the vertices in the DAG, which are discrete and limited in number. This makes DAG partitioning by nature coarse-grained. As a result, the DAG partition point may not match the workload ratio of the two sides.

The DAG partitioner takes these two considerations into account, and designs a two-step scheme to partition the DAG.

Step #1: Full DAG Profiling. In our first step, we follow the topological order of the DAG to extract all the possible execution paths in the DAG. As an example, Figure 1.7 shows three execution paths extracted from the DAG in Figure 1.1. For each extracted execution path, we then profile the inference cost of each classifier, and label the normalized accumulated inference cost on top of each classifier. For example, in Figure 1.7, the normalized accumulated inference cost of executing the 'Vehicle' path up to B is 35%.

Figure 1.7: Full DAG profiling. The percentage on top of each vertex is the normalized accumulated inference cost.

Step #2: Stochastic DAG Partitioning. In our second step, we propose a stochastic DAG partitioning scheme to partition the DAG in a manner such that the partitioned DAG could better match the workload ratio of the camera and the edge cluster. Specifically, for each execution path obtained in Step #1, the goal of our stochastic DAG partitioning scheme is to find a stochastic path execution plan such that the expected normalized accumulated inference cost of executing the path matches the optimal DAG partition point par ∈ [0, 1] provided by the workload adaptation controller (§1.3.4).

Figure 1.8: Stochastic DAG partition. Vertices in blue color are allocated to run at the camera, and vertices in brown color are allocated to run at the edge cluster.
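As a small illustration of Step #2, the sketch below picks the two vertices whose normalized accumulated costs bracket the target partition point par and assigns them the partition probabilities that the worked 'Vehicle' example following this figure derives (Eq. (1.5)). Function and variable names are hypothetical, not Distream's actual code; the example path costs are the B (35%) and C (85%) values from Figures 1.7 and 1.8.

```python
import random

# Simplified sketch of stochastic path partitioning (§1.3.3, Step #2). `path_costs`
# lists (vertex, normalized accumulated inference cost) in execution order; `par` is
# the target partition point in [0, 1] from the workload adaptation controller.

def stochastic_partition(path_costs, par):
    # Find the two adjacent vertices whose accumulated costs bracket `par`.
    for (v_lo, c_lo), (v_hi, c_hi) in zip(path_costs, path_costs[1:]):
        if c_lo <= par <= c_hi:
            p_lo = (par - c_hi) / (c_lo - c_hi)   # Eq. (1.5): probability of cutting at v_lo
            # Cut at v_lo with probability p_lo, at v_hi otherwise, so the expected
            # on-camera cost equals par.
            return v_lo if random.random() < p_lo else v_hi
    # par falls outside the path's cost range: cut at the first or last vertex.
    return path_costs[0][0] if par < path_costs[0][1] else path_costs[-1][0]

# The 'Vehicle' path example below: Cost_B = 0.35, Cost_C = 0.85, par = 0.7
# gives P_B = 0.3 and P_C = 0.7.
vehicle_path = [("B", 0.35), ("C", 0.85)]
print(stochastic_partition(vehicle_path, 0.7))
```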
Take the 'Vehicle' path in Figure 1.8 as an illustrative example. For the 'Vehicle' path, assume we need to achieve par = 0.7 workload partitioning (i.e., 70% workload allocated to the camera and 30% workload allocated to the edge cluster) determined by the workload adaptation controller. Our stochastic DAG partitioning scheme first identifies two vertices (vertex B, vertex C) along the 'Vehicle' path such that par = 0.7 falls between their normalized accumulated inference costs (Figure 1.8(1)). Given the normalized accumulated inference costs of these two vertices (Cost_B = 0.35 for vertex B, Cost_C = 0.85 for vertex C), our stochastic DAG partitioning scheme then determines the probabilities for partitioning the 'Vehicle' path at vertex B (P_B) and vertex C (P_C) respectively as:

    P_B = (par − Cost_C) / (Cost_B − Cost_C),    P_C = 1 − P_B          (1.5)

In this example, the solutions are P_B = 0.3 and P_C = 0.7. Therefore, the stochastic path execution plan is to partition the path at vertex B with a probability of 0.3 and to partition the path at vertex C with a probability of 0.7 (Figure 1.8(2)). As such, the expected normalized accumulated inference cost of executing the path matches the optimal DAG partition point par (0.3 × 35% + 0.7 × 85% = 0.7).

Compared to the discrete DAG partitioning approach, our stochastic DAG partitioning scheme provides much more partition flexibility and granularity. With such flexibility and granularity, Distream is able to better partition the DAG between smart cameras and edge cluster to match the workload ratio of the two sides.

1.3.4 Workload Adaptation Controller

The third key component of Distream is the workload adaptation controller, which is the core to achieve our workload adaptation objective. The workload adaptation controller performs two tasks. First, when the workload imbalance index across smart cameras is greater than a threshold β, the workload adaptation controller triggers the cross-camera workload balancer to balance the workloads across the cameras as described in §1.3.2. Second, the workload adaptation controller follows a camera-cluster workload partition schedule to periodically (at interval γ) update the optimal DAG partition point for each camera to balance the workloads between each camera and edge cluster (Figure 1.9). Such optimal partition points are sent to the DAG partitioner for DAG partitioning as described in §1.3.3.

The DAG partition points serve as knobs to control the DAG execution across cameras and edge cluster, and thus control the workload split between them. To identify the optimal DAG partition points, the workload adaptation controller needs to accommodate two key considerations:

Consideration #1: the optimal DAG partition point is different for each camera. The smart cameras have heterogeneous compute capacities and the workload at each camera can be different due to the discretization of cross-camera workload balancing. This requires the workload
AsaDAGispartitionedintotwopartsandexecutedonacameraandedgecluster respectively,thethroughputoftheentirelivevideoanalyticssystemisthesumofthethroughput ofcamerasandthethroughputoftheedgecluster. Formally,let N denotethenumberofcamerasinthesystem, par i denotetheDAGpartition pointforcamera i ,and TP cam i ( par i ) denotethethroughputofcamera i givenDAGpartitionpoint par i ,and TP edge denotethethroughputattheedgeclusterwhichdependsontheallpartition points f par i g . TP cam i ( par i ) and TP edge canbeaccuratelymeasuredbymonitoringthenumber ofworkloadsprocessedpertimeunitatcamera i andedgeclusterrespectively.Therefore,the throughputof N cameras TP cams ,thethroughputoftheedgecluster TP edge ,andthethroughputof theentiresystem TP system canbecomputedas: 20 TP cams = N å i = 1 TP cam i ( par i ) (1.6a) TP edge = TP edge ( par 1 ; par 2 ;:::; par N ) (1.6b) TP system = TP cams + TP edge (1.6c) Latency. Thelatencyofalivevideoanalyticssystemisasthetimeelapsedbetween receivingaworkloadandproducingtheinferenceresultoftheworkload.Giventhat,thelatency isthesumoftwoparts:1)theworkloadqueuingtime,and2)theworkloadprocessingtimefor producingtheinferenceresult.Becauseworkloadsarepartitionedacrosscameraandedge,the latencyoftwosidesisthuscomputedseparately.Assuch,thelatencyatcamera i andatedge clusterarecomputedas: L cam i = w i TP cam i + 1 TP cam i (1.7a) L edge = w edge TP edge + 1 TP edge (1.7b) where w i isthenumberofpendingworkloadstobeprocessedatcamera i ,and w edge isthenumber ofpendingworkloadstobeprocessedattheedgecluster. CompleteOptimizationFormulation. Puttingallthepiecestogether,thecompleteoptimization problemcanbeformulatedas 3 : 3 DistreamoperatesatDAGnodelevelwhenpartitioningworkloadsbetweencamerasandedgecluster. theoptimizationobjectiveintermsofDAGnodeisabletotheperformancegaininamorestraightforward manner. 21 Figure1.9:Overviewofcamera-clusterworkloadpartitioninworkloadadaptationcontroller.The optimalDAGpartitionpointsofallcamerasarejointlydetermined. max par i 8 i TP system (1.8a) s : t : L cam i L Max ; 8 i (1.8b) L edge L Max (1.8c) where L Max isthelatencySLOimposedbytheapplication. Asshown,since L cam i and L edge aredeterminedby TP cam i , w i , TP edge and w edge ,whichare determinedby par i ,theworkloadsreceivedatcamerasandtheedgeclusterarenotindependent butnegativelycorrelatedwitheachother.Thistradeoffillustratestheofidentifying theoptimalDAGpartitionpointstobalanceworkloadsbetweencamerasandtheedgeclusterto maximizethethroughputunderthelatencySLO. Theoptimizationproblemformulatedaboveisnon-convex.Insteadofexhaustivelysearching overallpossiblepartitionpointsforallcameras,weadoptaheuristictosolvetheoptimization problemef.,ourheuristicisiterativeandeachiterationhastwophases.In 22 thephase,itstartswiththeobjectiveofthepreliminaryDAGpartitionpointsthat maximizethethroughputwithoutconsideringthelatencySLO.Inthesecondphase,weiteratively targetthebottleneckcamerawiththelargestlatencyandadjustitspreliminaryDAGpartitionpoint tomeetthelatencySLO.Inpractice,ourheuristicishighlyeffectivewithnegligibleoverheads evenifthenumberofcamerasscalesup(§1.5.6). 1.4SystemImplementation WeimplementedDistreamusingabout2500linesofGolangcodeand500linesofpythoncode. Inthissection,weprovidedetailsonhowthekeypartsofDistreamwereimplemented. Testbed. 
We followed the design choice that is widely adopted by existing live video analytics systems in real-world deployments to develop our own testbed to evaluate the performance of Distream. Specifically, our testbed consists of 24 smart cameras and a single edge cluster. Among the 24 cameras, 18 were prototyped using Nvidia Jetson TX1 [3] while the other six were prototyped using Nvidia Jetson TX2 [6]. Both Jetson TX1 and TX2 are designed for embedded systems with onboard DL workloads^4. Jetson TX2 is an upgraded version of Jetson TX1 with larger compute capacity. For the edge cluster, we use a desktop server equipped with a cluster of 4 Nvidia Titan X GPUs [1]. We followed the standard of existing IP video surveillance systems to connect and set up the network bandwidth between smart cameras and edge cluster [5]. Specifically, all the 24 smart cameras and the edge cluster are wire-powered and interconnected through a single switch (D-Link DGS-1510-28X [2]) to form a local network. Each smart camera has a Fast Ethernet link (10/100Mbps) to the switch, and the bandwidth from the switch to the edge cluster is 10Gbps. By default, we set the camera bandwidth to 10Mbps. We also evaluate the system performance under 50Mbps.

^4 As AI chipsets evolve, we conjecture that AI hardware with such compute capacity will be widely deployed inside various edge devices such as smart cameras.

LSTM Workload Predictor. We use a two-layer LSTM with 256 neurons in each layer as our workload predictor. Specifically, the workload predictor takes the arrived workloads of all cameras as input and predicts the upcoming workloads of each camera in the next second (see the sketch below). We observed that the amount of workloads generated at each camera is highly relevant to the time period in a day. Thus we divide a day into six periods as 00:00-04:00, 04:00-8:00, 8:00-12:00, 12:00-16:00, 16:00-20:00 and 20:00-24:00. For each period, we train a specialized LSTM using workloads observed from that period. We find that using six specialized LSTMs reduces 10.2% mean square error compared to using a universal LSTM. Since running a specialized LSTM is fast on CPU (less than 1ms for one step), and it only makes predictions each second, our LSTM workload predictor incurs negligible overheads. Although the predictor could potentially benefit from periodic re-training given the dynamics of workloads over time, it is not the focus of this work.

Video Analytics Pipeline. We use OpenCV 3.2.0 to read video streams and implemented the Gaussian mixture model-based background subtraction method in [4] to extract regions of interest (ROIs) with length of history set to 500 and threshold set to 60. It maintains an estimate of the background image and uses a subtraction operation to extract foreground regions, followed by blob detection operations to extract ROIs. The number of extracted ROIs per frame varies depending on the nature of the scenes captured by the cameras, ranging from 0 to 30. For the inference engine, we designed an efficient DNN model based on MobileNetV2 [70] for each classifier, and we are able to compress over 90% of model weights using pruning [58] and knowledge distillation [30] without accuracy loss. Given that each DNN model has only about 2MB memory footprint, the inference engine is able to load all the DNN models into a single GPU, avoiding model loading or switching time. Similar to [71], we adopt batched inference [15] to improve video analytics throughput. We profiled the inference latency with batch sizes of 1, 8, 16, 32 and 64 respectively and set the batch size to 8 at the camera side and 32 at the edge cluster given its better performance tradeoff.

Scheduling. We implemented a monitor proxy at each camera and edge cluster to keep track of the runtime information including workload status, network bandwidth, throughput and latency. The proxy periodically reports this information to the system monitor in the edge cluster every 50ms. When performing the cross-camera workload balancing, the ROI is encapsulated into the migrating workload because the target camera does not have the video frame from the source camera.
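Returning to the LSTM workload predictor described above, the following is a minimal PyTorch sketch of such a model: two stacked LSTM layers with 256 hidden units, taking the per-second arrived workloads of all cameras as input and predicting each camera's workload for the next second, with one specialized instance per 4-hour period. The tensor shapes, class names, and training details are assumptions for illustration only; this is not Distream's actual implementation.

```python
import torch
import torch.nn as nn

# Sketch of a workload predictor matching the description above (two LSTM layers,
# 256 units each, one specialized model per 4-hour period). Names are hypothetical.

class WorkloadPredictor(nn.Module):
    def __init__(self, num_cameras, hidden_size=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_cameras, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, num_cameras)

    def forward(self, x):
        # x: (batch, time_steps, num_cameras) -- per-second arrived workloads of all cameras
        out, _ = self.lstm(x)
        # Predict each camera's workload for the next second from the last hidden state.
        return self.head(out[:, -1, :])

# One specialized predictor per 4-hour period of the day, as described in the text.
predictors = {period: WorkloadPredictor(num_cameras=24) for period in range(6)}
```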
Communication .DistreamusesZeroMQ[8]forhigh-speedlow-costinter-processcommuni- cationamongcameras.Forremotecommunicationbetweencamerasandedgecluster,weim- plementedalow-costremoteprocedurecall(RPC)totransfercontroldata.,the cross-cameraworkloadbalancinginstructionsandDAGpartitionpointsaretransferredtocam- erasimmediatelyaftertheyaregeneratedbythe cross-cameraworkloadbalancer andthe DAG partitioner respectively.Whenperformingcross-cameraworkloadbalancingandcamera-cluster workloadpartitioning,weusebatchedRPCsforcommunicationbetweentwoentitiestoamortize thecommunicationoverhead.ThemaximumRPCbatchsizeis20. 1.5Evaluation Inthissection,weevaluatetheperformanceofDistreamwiththeaimtoanswerthefollowing questions: Q1(§1.5.2) : DoesDistreamoutperformstatusquolivevideoanalyticssystems?Ifso,what arethereasons? Q2(§1.5.3) : HoweffectiveiseachcoretechniqueincorporatedinthedesignofDistream? 25 Q3(§1.5.4) : HowistheperformanceofDistreamaffectedbythesystemhyper-parametersand networkbandwidth? Q4(§1.5.5) : DoesDistreamscalewell? Q5(§1.5.6) : HowmuchoverheadsdoesDistreamincur? Q6(§1.5.7) : DothetechniquesproposedinDistreamalsostatusquolivevideoanalyt- icssystems? 1.5.1ExperimentalMethodology Applications. Weuse TrMonitoring and CampusSurveillance astworepresentativeappli- cationstoevaluateDistream.Foreachapplication,weusearepresentativereal-worldmulti-camera datasettobenchmarktheperformance. Application#1:TMonitoring. Forthisapplication,weusepubliclyavailablelivevideo streamsfromsixtrafcamerasdeployedatdifferenttrafintersectionsandroadsinJackson Hole,WY[7].Thesesixlivevideostreamshavedynamiccontents,diversetraf conditionsinthecityacrossspaceandtime.Toobtainarepresentativedataset,wesampled 2005-minutevideoclipsacross48hoursofvideostreamsfromeachofthesixcamerasata framerateof15FPSand1280 720Pframeresolution.Thesampledvideoclipscoverdiverse trafconditionsatdifferenttimeofthedayincludingrushhours,lighttraftime,andnight time. Application#2:CampusSurveillance. Duetolackofpubliclyavailablelarge-scaledatasets, weself-collectedadatasetfrom24surveillancecamerasdeployedonauniversitycampus. Thesecamerascoverdiverseareasofthecampus,includingindoorareassuchasdininghalls, cafeterias,andhallwaysaswellasoutdoorareassuchasuniversityentrances,buildingen- 26 trances,squares,streets,andtrafintersections.Forsecuritypurposes,someofthesurveil- lancecamerasareparticularlycoveringareaswherepeoplerarelyvisit.Similartothe Tr Monitoring application,wesampled2005-minute-longvideoclipsacross48hoursofvideo streamsfromeachof24camerasataframerateof15FPSand1280 720Pframeresolution toobtainarepresentativedataset. Baselines. WecompareDistreamagainstthreebaselinesthatcoverallthreetypesofarchitectures listedinTable1.1. Centralized .Thisbaselinerepresentsthecentralizedapproachadoptedbyanumberofexist- inglivevideoanalyticssystemssuchas VideoStorm [86], Chameleon [40]and NoScope [44] wherevideostreamsaresenttoedgeclusterforprocessing. Camera-Only .Thisbaselinerepresentstheapproachattheotherendofthespectrumwhere videostreamsareonlyprocessedlocallyonsmartcameras(e.g.,FilterForward[13]). VideoEdge-Lossless(VideoEdge-L) . 
VideoEdge [35] is the status quo live video analytics system that utilizes compute resources across cameras and cluster to process video streams in a distributed manner. Different from Distream, VideoEdge is workload-agnostic and aims to achieve an optimized resource-accuracy tradeoff, which may trade for resources with a loss of accuracy. Since Distream does not sacrifice accuracy, for a fair comparison, we implemented VideoEdge with the resource-accuracy tradeoff disabled, which we refer to as VideoEdge-Lossless (VideoEdge-L) and use it as our third baseline. As VideoEdge is workload agnostic and the only prior knowledge given is the compute cost of each classifier in the DAG as well as the compute capacity ratio of cameras and cluster, we partitioned the DAG and placed the partitions according to the camera-cluster compute ratio to implement VideoEdge-L.

Evaluation Metrics. We use three metrics to evaluate the performance of Distream and the baselines.

Throughput. Live video analytics systems need to process streaming video frames in a continuous manner, and high-throughput processing is essential to keeping up with the incoming video streams. In our case, workloads are inferences involved in the DAG. Thus, we use the number of inferences processed per second (IPS) to measure the throughput.

Latency. Live video analytics applications require producing analytics results within a short period of time. We thus use latency, which is defined as the elapsed time from when the workload is generated to when the inference result of the workload is produced, as our second metric to measure the system's responsiveness. As such, latency can be calculated as the sum of network latency, workload queuing time, and workload processing latency.

Latency Service Level Objective (SLO) Miss Rate. Lastly, we measure the latency service level objective (SLO) miss rate, which is defined as the percent of the workloads that do not meet the latency requirement set by a live video analytics application. In this work, we set the latency SLO to 3 seconds, which is a reasonable requirement for live video analytics systems.

1.5.2 Overall Performance

We begin with comparing the overall performance of Distream and baselines on both applications. Since the Traffic Monitoring dataset involves six video streams, we use six smart cameras (4 TX1 and 2 TX2) with each allocated to one video stream and incorporate one GPU in the edge cluster. We will scale to 24 smart cameras and a 4-GPU edge cluster when evaluating the scaling performance of Distream in §1.5.5.

Throughput and Latency. Table 1.2 and Table 1.3 list the throughput and latency of Distream and baselines on the Campus Surveillance and Traffic Monitoring applications respectively. Specifically, we report the average, 50th (median), 75th, and 99th percentile of throughput and latency performance gain over Camera-Only.

                 Throughput gain               Latency gain
                 avg    med    75th   99th    avg     med     75th    99th
  Distream       2.9×   3.1×   2.9×   2.5×    128.2×  189.3×  147.9×  112×
  VideoEdge-L    2.1×   2.2×   1.9×   1.5×    1.5×    2.4×    1.6×    1.4×
  Centralized    1.9×   2.0×   1.7×   1.3×    1.4×    2.2×    1.5×    1.2×

Table 1.2: Performance gain over Camera-Only (Campus Surveillance).

We have three major observations. First, Distream has achieved higher throughput than the baselines. Specifically, it achieves 2.9× and 1.6× average throughput gain on the two applications, while the best baseline (VideoEdge-L) only achieves 2.1× and 1.2×. Second, Distream achieves higher peak throughput (99th) than the baselines. The higher peak throughput indicates that Distream is able to better utilize the distributed compute resources to survive through the workload bursts such as the ones illustrated in Figure 1.2. Third, Distream achieves significant latency reduction compared to the baselines. In particular, Distream achieves 128.2× and 184× average latency reduction, while VideoEdge-L only achieves 1.5×.
Figure 1.10 and Figure 1.11 provide a more comprehensive view of the comparison by plotting the full distribution of throughput and latency, including the 1st, 25th, 50th (median), 75th and 99th percentiles in the box plot. In terms of throughput, Distream has a much wider throughput range, which indicates that Distream is able to handle a much wider range of amounts of workloads than the baselines. In terms of latency, the average latency of Distream for the Campus Surveillance and Traffic Monitoring applications is 0.34s and 0.27s, which is much less than the baselines, which range from 28.54s to 43.93s for Campus Surveillance and 9.98s to 49.68s for Traffic Monitoring. This result indicates that the throughput improvement in Distream does not come at the cost of its latency.

                 Throughput gain               Latency gain
                 avg    med    75th   99th    avg     med     75th    99th
  Distream       1.6×   1.6×   1.9×   2.0×    184×    604.3×  446.8×  52.6×
  VideoEdge-L    1.2×   1.2×   1.2×   1.2×    1.5×    1.8×    1.5×    1.2×
  Centralized    1.1×   1.1×   1.1×   1.1×    1.2×    1.4×    1.2×    1.1×

Table 1.3: Performance gain over Camera-Only (Traffic Monitoring).

Figure 1.10: Throughput and latency distribution on Campus Surveillance.

Figure 1.11: Throughput and latency distribution on Traffic Monitoring application.

Latency SLO Miss Rate. Figure 1.12 compares the latency SLO miss rate of Distream against the baselines. As shown, Distream outperforms the baselines by a large margin. In particular, Distream is able to achieve a near-zero latency SLO miss rate (0.7% and 1.5%) on the Campus Surveillance and Traffic Monitoring applications respectively, while the baselines have much higher miss rates (at least 60.7% for Campus Surveillance and 93.8% for Traffic Monitoring).

Figure 1.12: Latency SLO miss rate comparison.

Why Distream Outperforms the Baselines? Distream outperforms both Centralized and Camera-Only because Distream is able to leverage the compute resources at both the camera and edge cluster sides while Centralized and Camera-Only could not. For VideoEdge-L, Distream is able to achieve better performance for the following two reasons:

Reason #1: Higher Resource Utilization. In terms of throughput, Distream outperforms the baselines because it is able to utilize the idle compute resources at both cameras and the edge cluster to process the workloads to enhance the throughput. To see this, we continuously track the workload imbalance index across smart cameras and the edge cluster. As shown in Figure 1.13(a), Distream is able to achieve a near-zero workload imbalance for about 60% of the time. In contrast, VideoEdge-L is experiencing significant workload imbalance (more than 200% imbalance index) for 90% of the time for being agnostic to the dynamics of the workloads.

Reason #2: Less Accumulated Workloads. In terms of latency, Distream outperforms the baselines because with higher compute resource utilization, Distream is able to process the workloads at a much faster rate than the baselines, preventing the workloads from being accumulated at cameras or edge cluster. As such, Distream has much fewer workloads waiting in the queue, which reduces the workload queuing time and hence the latency. To see this, we continuously track the accumulated workloads in the local queue at each camera and the edge cluster. As shown in Figure 1.13(b), Distream is able to achieve a near-zero amount of accumulated workloads for about 80% of the time. In contrast, VideoEdge-L is experiencing a high amount of accumulated workloads for most of the time. As such, the workload queuing time in Distream is much lower, which is the root cause of its low latency.

Figure 1.13: Illustration on why Distream outperforms baselines.

1.5.3 Component-wise Analysis

The design of Distream enables adaptive cross-camera workload balancing and adaptive camera-cluster workload partitioning. In this section, we implemented two breakdown versions of Distream to take a closer look at the contribution of each component.
Distream-L hasadaptivecross-cameraworkloadbalancing,buthasstaticcamera-clusterwork- loadpartitioningbasedonthecomputecapacityratioofcamerasandcluster. Distream-P hasadaptivecamera-clusterworkloadpartitioning,butdoesnotenablecross- cameraworkloadbalancing. Table1.4showsthecomparisonresultsonthe CampusSurveillance application.Asshown, theperformancegainsbroughtbyboth Distream-L and Distream-P aresimilarand Onaverage, Distream-L achieves2 : 4 throughputgainand5 : 9 latencyreductionrespectively comparedto Camera-Only . Distream-P achieve2 : 5 throughputgainand6 : 3 latencyreduc- tionrespectivelycomparedto Camera-Only .Thisresultindicatestheimportanceofbothadaptive cross-cameraworkloadbalancingandadaptivecamera-clusterworkloadpartitioningtothewhole 32 system.Distreamisabletocombinetheadvantagesofboth Distream-L and Distream-P to achievebetterperformancethanusingonlyoneofthem. ThroughputLatency avgmed75th99th avgmed75th99th Distream-L 2 : 4 2 : 5 2 : 3 2 : 2 5 : 9 8 : 2 5 : 5 5 : 2 Distream-P 2 : 5 2 : 5 2 : 3 2 : 1 6 : 3 9 : 2 6 : 6 5 : 1 Table1.4:Component-wiseanalysisofDistream.PerformancegainsareoverCamera-Only. 1.5.4SensitivityAnalysis ThedesignofDistreaminvolvestwokeysystemhyper-parameters:(i)thecross-cameraworkload balancingthreshold b (§1.3.2),and(ii)thecamera-clusterworkloadpartitionschedulinginter- val g (§1.3.4).Inthissection,weevaluatetheimpactofthesesystemhyper-parametersonthe performanceofDistream.Inaddition,sincecross-cameraworkloadbalancingrequiresmigrating workloadsthroughthenetwork,wealsoevaluatetheimpactofnetworkbandwidthontheperfor- manceofDistream. ImpactofCross-CameraWorkloadBalancingThreshold b .Figure1.14(a)showsDistream's sensitivitytothecross-cameraworkloadbalancingthreshold b .Weonlyshow99thpercentile throughput(lefty-axis,black)and99thpercentilelatency(righty-axix,red)becausetheyare sensitivetohyper-parameterchangesandcanbettertheimpactonthesystemperformance. Weobservethatwhen b isbelow20%,theperformanceofDistreamdoesnotchangemuch.When b exceeds20%,the99thpercentilethroughputbeginstodropand99thlatencyincreases.Thisis becausealarger b wouldtriggerloadbalancinglessoften.Asaresult,thesystemmaysufferfrom performancelossduetoworkloadimbalance.Therefore,weset b tobe20%. ImpactofCamera-ClusterWorkloadPartitionSchedulingInterval g .Figure1.14(b)shows Distream'ssensitivitytothecamera-clusterworkloadpartitionschedulinginterval g .Weseethat 33 the99thpercentilethroughputand99thpercentilelatencydonotchangemuchwhentheinterval islessthan0 : 5s.When g isbetween0 : 5sto1 : 5s,thesystemperformancehasslightlydeclined. When g isgreaterthan1 : 5s,thesystemperformancedeclinesastheintervalincreases. Therefore,weset g to0 : 5s. (a)Impactofcross-cameraworkloadbalancing threshold b . (b)Impactofcamera-clusterworkloadpartition scheduleinterval g . Figure1.14:Impactofsystemhyper-parametersontheperformanceofDistream. ImpactofNetworkBandwidth .WealsoexamineDistream'ssensitivitytonetworkbandwidth. WeevaluateDistreamwithnetworkbandwidthat10Mbps(lowbandwidth)and50Mbps(high bandwidth),whichcoversthebandwidthrangeinstandardvideosurveillancesystemsonthemar- ket.Table1.5listsourevaluationresults.Asshown,boththethroughputandlatencyarevery similarbetween10Mbpsto50Mbps.ThisresultindicatesthatDistreamisrobusttonetworkband- widthchangesandisabletoachievehighperformanceevenwithanetworkbandwidthof10Mbps. 
                           Throughput (IPS)                 Latency (s)
                         avg     med    75th   99th      avg    med    75th   99th
Low Bandwidth (10Mbps)   489.1   509    590    647       0.34   0.09   0.47   2.05
High Bandwidth (50Mbps)  490.9   510    593    650       0.31   0.07   0.41   1.75
Table 1.5: Performance of Distream under low (10Mbps) and high (50Mbps) network bandwidths.

1.5.5 Scaling Performance
To evaluate the scaling performance of Distream, we use the Campus Surveillance dataset and examine the throughput and latency as Distream scales up its number of smart cameras from 6 to 12, 18, and 24. To match the compute resources with the new workloads brought by the new cameras, we add one GPU to the edge cluster whenever six smart cameras are added to the system.
Figure 1.15 illustrates the scaling performance of Distream in terms of throughput (upper) and latency (lower). As shown, Distream is able to scale its throughput up nearly linearly. Specifically, Distream achieves a 3.95x increase in average throughput, from 489.1 IPS when processing videos streamed from 6 cameras to 1931.4 IPS when processing videos streamed from 24 cameras. Meanwhile, Distream is able to maintain a similarly low latency when scaling up.

Figure 1.15: Scaling performance of Distream.

1.5.6 System Overheads
The system overheads incurred by the design of Distream come from two sources. The first source comes from solving the optimization problems for cross-camera workload balancing in §1.3.2 and identifying the optimal DAG partition points in §1.3.4. Their overheads are 16us and 4us with six cameras and one GPU, and 21us and 8us when scaled up to 24 cameras and 4 GPUs. This result indicates that our proposed heuristics are highly efficient and incur negligible overheads. The second source of overhead comes from the RPC calls for migrating workloads across smart cameras (§1.3.4). As shown in Table 1.6, even under the low bandwidth condition (i.e., 10Mbps), each RPC call only takes 2.1ms on average for workload migration. When we employ batched RPC, this overhead is amortized as the RPC batch size increases, making the overheads of cross-camera workload balancing negligible.

Batch Size   Total Migration Time   Avg RPC Overhead
1            2.1ms                  2.1ms
10           7.4ms                  0.74ms
20           9.2ms                  0.46ms
Table 1.6: Cross-camera workload balancing overheads of Distream.

1.5.7 Benefits to Existing Systems
As summarized in Table 1.1, existing live video analytics systems are agnostic to the workload dynamics in real-world deployments. Instead, they focus on designing different techniques to optimize the resource-accuracy tradeoff of live video analytics systems. In this section, we add the workload adaptation techniques proposed in Distream onto VideoEdge [35] to examine how Distream could enhance its performance. We select VideoEdge instead of the other status quo live video analytics systems listed in Table 1.1 because VideoEdge is also a fully distributed framework that partitions the video analytics pipeline across cameras and cluster.
Figure 1.16 compares the performance between VideoEdge and VideoEdge+Distream. As illustrated, VideoEdge+Distream is able to generate a better resource-accuracy trade-off curve than VideoEdge. In particular, Distream is able to boost the accuracy by 21% when compute resource is limited, and by 7% when compute resource is adequate. Such improvement is made possible by the higher resource utilization brought by the workload adaptation mechanism of Distream. With higher resource utilization, VideoEdge does not need to sacrifice as much accuracy as before to trade for resources.

Figure 1.16: Resource-accuracy tradeoff curve comparison between VideoEdge and VideoEdge+Distream.

1.6 Related Work
Live video analytics is a killer application not only for mobile vision [36, 21, 20, 42] but also for large-scale distributed settings. Our work is closely related to DL-based live video analytics systems with distributed cameras. A majority of these systems are designed to process video streams in a centralized cluster. For example, Focus [32] customizes small DL classifiers to generate object indexes for each video frame to enable fast video querying.
NoScope [44]usesspecializedDL 37 modelstoreduceinferenceoverheadsforthroughputgainwithsmallaccuracyloss.Besidesthem, Chameleon [40], VideoStorm [86],and AWStream [85]leverageresource-accuracytradeoffs forsystemoptimization. Chameleon [40]reducestheoverheadcausedbysearchingforoptimal tradeoff VideoStorm [86]exploitsthevarietyofqualityandlaggoalsinvideo analyticstasksandoptimizescomputeresourcesinGPUclusters. AWStream [85]isfocusedon adaptingtonetworkbandwidthvariationtoachievelow-latencyhigh-accuracywide-areastream- inganalytics.Differentfromtheseworks,insteadoftradingaccuracyforlesscomputeresource consumption,Distreamachieveshighthroughputandlowlatencybymaximizingtheresourceuti- lizationacrosscamerasandedgeclusterwithouttheaccuracy. TheclosestworktoDistreamis VideoEdge [35].AlthoughDistreamand VideoEdge are bothbuiltuponadistributedarchitecturethatinvolvessmartcamerasandedgeclusterforlive videoanalytics,theyaredifferentinthreeaspects.First,Distreamfocusesonenablingworkload- adaptivelivevideoanalyticswhile VideoEdge isworkloadagnostic.Second, VideoEdge targets optimizingresource-accuracytradeoffwhileDistreamfocusesonmaximizingthroughputandat- taininglatencySLOwithoutcompromisingaccuracy.Third, VideoEdge performsvideoanalytics pipelineplacementacrosscameras,edgeandcloudwherethedynamicnetworkbandwidthisthe mainbottleneck.Incontrast,Distreamperformsworkloadallocationacrosscamerasandedge clusterbasedonworkloaddynamicstoeliminatethebottleneckincomputeresources.Infact, Distreamiscomplementaryto VideoEdge :asshownin§5.7,Distreamisabletoprovideabetter resource-accuracytradeoffcurvewhenintegratedwith VideoEdge . 38 1.7ConclusionandFutureWork Inthispaper,wepresentthedesign,implementation,andevaluationofDistream,adistributed workload-adaptivelivevideoanalyticssystembasedonthesmartcamera-edgeclusterarchitecture. Distreamaddressesthekeychallengesbroughtbyworkloaddynamicsinreal-worlddeployments, andcontributesnoveltechniquesthatcomplementexistinglivevideoanalyticssystems.Wehave implementedDistreamandconductedarichsetofexperimentswithatestbedconsistingof24 camerasanda4-GPUedgeclusterontworeal-worlddistributedvideodatasets.Ourexperimental resultsshowthatDistreamconsistentlyoutperformsthestatusquointermsofthroughput,latency, andlatencySLOmissrate.Therefore,webelieveDistreamrepresentsacontributionto enablinglivevideoanalyticsatscale.Inthecurrentform,Distreamtreatscross-cameraworkload balancingandcamera-clusterpartitioningastwoseparatecomponents.Weplantoworkonajoint solutionasourfuturework.Wewillalsoworkondesigningcommonprogrammingandnetwork abstractiontosupportlivevideoanalyticsapplicationdevelopmentbasedonourframework,and plantoexpandourframeworktothewirelesssettings. 39 Chapter2 CollaborativeDeepLearningTrainingon theEdge 2.1Introduction Therecentpasthaswitnessedthesuccessofdeeplearning(DL)inawidespectrumofareasinclud- ingcomputervision[18],automaticspeechrecognition[29],andnaturallanguageprocessing[76]. Partofthissuccessisattributedtothehighlyefdistributedtrainingsystems[51,84,87] thattrainDLmodelsinparallelmachinesinsidedatacentersownedbylargeorganizationssuchas Google,Facebook,Amazon,andMicrosoft.Asintelligenceismovingfromdatacenterstodevices, intelligentedgedevicessuchassmartphones,drones,robots,andsmartcamerasareequippedwith machineintelligencepoweredbydeeplearningtonotonlyperformon-deviceinferencesbutalso altogethertrainaDLmodelfromthedatacollectedbythemselves(i.e.,on-devicedistributedtrain- ing). 
Dependingontheapplicationsettings,on-devicedistributedtrainingcaningeneralbe intotwocategories: federatedlearning ,and collaborativelearning .Infederatedlearn- ing,dataprivacyisstrictlyenforced:duringtraining,participatingdevicesdonotsharetheirlocal datawitheachother.Inaddition,participatingdevicesareusuallylocatedatdifferentgeographical locationsandadifferentsubsetofthedevicesparticipateineachroundofthetrainingprocess.As 40 such,federatedlearningrequiresthesupportofacloudservertocoordinatethedistributedtraining processandselectwhichdevicestoparticipateineachround.Incontrast,incollaborativelearn- ing,dataprivacyisnotaconcerngiventhattheparticipatingdevicesusuallybelongtothesame individual,group,organization,orcompany.Moreover,participatingdevicesareedthrough- outthetrainingprocessandarelocatedwithinageographicallyclosearea.Thustheyareableto communicatewitheachotherviaalocalwirelessnetworkwithoutthesupportofacloudserver. Inthiswork,wefocusoncollaborativelearning.Thereareawiderangeofreal-worldap- plicationsthatarebuiltuponcollaborativelearning.Forexample,intheU.S.,eachconsumeron averageownsmorethanthreeconnecteddevices(e.g.,smartphones,tablets,smartwatches).In thissetting,personaldataaredistributedacrossallthoseconnecteddevices,andthereisnoprivacy concernsinceallthedevicesareownedbythesameperson.Withwirelessnetworkavailableat homeorintheofcollaborativelearningisthemostpracticalsolutiontocollaborativelylearn aDLmodelfromthosedistributedpersonaldata. Motivation&LimitationsofStatusQuo. Despiteitsconsiderablevalue,thekeybottleneckof makingcollaborativelearningsystemspracticallyusefulinreal-worlddeploymentsisthat they consumeaamountoftrainingtime .Therootcauseofsuchperformancebottleneck isthelimitedwirelessnetworkbandwidth:indatacenters,communicationbetweenparallelma- chinesisconductedviahigh-bandwidthnetworksuchas50GbpsEthernetor100Gbps Band[81].Incontrast,incollaborativelearningsystems,communicationbetweenparticipating devicesisconductedviawirelessnetworksuchasWi-Fi.Thebandwidthofwirelessnetwork, however,ismuchmoreconstrained,andcanbe 10,000timesless thanthebandwidthindatacen- ters.Suchlimitedbandwidthconsiderablyslowsdownthecommunicationandhencetheoverall trainingprocess. Our goal inthisworkistotacklethiscriticalbottleneckto enhancethetrainingef 41 ofon-devicecollaborativelearningwhileretainingtrainingcorrectnesswithoutcompromisingthe trainingquality (i.e.,accuraciesofthetrainedmodels).Althoughanumberoftrainingacceleration approacheshavebeenproposed[10,55,56,25,39,33],aswewilldiscussin§2.2indetail,these approacheseithercompromisethetrainingqualitytogaintrainingefyoraredesignedfor distributedtrainingindatacenterswheretheyachievelimitedtrainingefyenhancementin on-devicecollaborativelearninggiventhegapinnetworkbandwidthbetweenthedata centersettingandtheon-devicesetting. OverviewoftheProposedApproach. 
Thelimitationsofexistingapproachesmotivateustore- thinkthecollaborativelearningframeworkdesignbytakingadifferentpathfromexistingones.To thisend,wepresentDistream,animportancesampling-basedframeworkthatenablesefon- devicecollaborativelearningwithoutcompromisingthetrainingquality.Inexistingapproaches, eachparticipatingdeviceineachtrainingiteration randomly samplesamini-batchfromitslocal datatocomputeitslocalgradients.Thisrandomsamplingstrategy,atitscore,assumesthateach sampleinthelocaldataisequallyimportantduringmodeltraining.Inpractice,however, notall thesamplesinthelocaldatacontributeequallytomodeltrainingduetoinformationoverlapping amongdifferentsamples .Trainingonlessimportantsamplesnotonlycontributeslittletomodel accuracybutalsoconsiderablyslowsdownthetrainingprocess.Giventhat,insteadofrandom sampling,Distreamfocusesonsamplesthatprovidemoreimportantinformationineachiteration. Bydoingthis,thetrainingefyofeachiterationisimproved.Assuch,thetotalnumberof iterationscanbeconsiderablyreducedsoastospeedupthewholetrainingprocess. ThedesignofDistreaminvolvesthreekeychallenges. Challenge#1 :Importancesamplingincurscomputationcost,andwithoutcarefuldesign,could easilyovershadowtheitbrings.Reducingthecomputationcostofimportancesampling 42 withoutaffectingitseffectivenessandcompromisingthetrainingqualityrepresentsa cantchallenge. Challenge#2 :Sinceimportancecomputationisperformedonthelocaldataofeachdevice, theimportanceonlythelocalimportancedistributionwithineachdeviceratherthan theglobalimportancedistributionacrossallthedistributeddevices.Asaresult,adevicemay repeatedlylearngloballytrivialsamples,whichconsiderablylowersthetrainingefy. Challenge#3 :Eventhoughthecomputationcostofimportancesamplingcanbereduced,the costisstillnotzero.Moreover,comparedtoEthernetorusedinthedatacenter setting,wirelessnetworkintheon-devicesettingismoresusceptibletointerferenceandthus itsbandwidthcouldexperiencevariationsovertimeinreal-worlddeployments.Thedesignof Distreamshouldtakebandwidthvariationintoaccountaswell. Toaddressthechallenge,Distreamincorporatesa group-wiseimportancecomputationand samplingscheme thatcutsthecomputationcostofimportancesamplingbyonlyre-computingthe importanceofasubsetofsampleswithineachiteration.Moreover,byconstructingthemini-batch basedonthere-computedimportanceinastochasticmanner,thetrainingcorrectnessisguaranteed. Toaddressthesecondchallenge,Distreamincorporatesan importance-awaredataresharding scheme thatefcientlyreshufthedataamongdevicestobalancetheimportancedistribution. Itachievesthesameeffectwithmuchloweroverheadcomparedtoimportance-agnosticdatare- shardingbyprioritizingthetransferofmoreimportantsamples. Toaddressthethirdchallenge,Distreamincorporatesa bandwidth-adaptivecomputation-communication scheduler thatschedulestheexecutionofimportancecomputationanddatareshardinginabandwidth- adaptivemannertofurtherimprovethespeedupbycompletelymaskingoutthecostsofimportance samplinganddataresharding. 43 Implementation&EvaluationResults. WeimplementedDistreamusingTensorFlow1.10and deployeditonaself-developedtestbedthatconsistsof12NVIDIAJetsonTX1asedgedevices. 
WeevaluateDistreamonsixcommonlyuseddatasetsusingadiversesetofDLmodelsacrossthree mostimportanttaskdomains:computervision,speechrecognition,andnaturallanguageprocess- ing.WecompareDistreamagainsttwostatusquobaselines TicTac [25]and AdaComm [78].Our resultsshowthatDistreamconsistentlyoutperformsthebaselinesintrainingefy,achieving 3 : 7 ,3 : 2 ,4 : 0 ,2 : 2 ,2 : 7 ,1 : 8 onthesixdatasets.Moreover,Distreamisabletomain- taintrainingspeedupasthenumberofparticipatingdevicesscalesup,andisrobusttowireless bandwidthvariationsinreal-worlddeployments. Insummary,wemakethreemajorcontributions: Tothebestofourknowledge,Distreamistheefon-devicecollaborativelearning frameworkbasedonimportancesamplingthatachieveshightrainingspeedupwhileretaining trainingcorrectnesswithoutcompromisingtheaccuraciesofthetrainedmodels. Weprovideaperformancemodeloftheproposedimportancesampling-basedcollaborative learningframework.Guidedbytheperformancemodel,weproposethreenoveltechniques, namely,group-wiseimportancecomputationandsampling,importance-awaredataresharding, andbandwidth-adaptivecomputation-communicationscheduling,whichaltogetherfullyex- ploitthebenbroughtbyimportancesampling.Wealsoprovideatheoreticalproofonthe trainingcorrectnessofDistream. WeimplementedDistreamanddeployeditonaself-developedtestbed.Wedemonstrateits effectivenessandshowthatDistreamconsistentlyoutperformstwostatusquoframeworkson sixdatasetsacrosscomputervision,speechrecognition,andnaturallanguageprocessing. 44 Figure2.1:Themulti-parameterserverarchitectureofon-devicecollaborativelearningsystems. 2.2BackgroundandMotivation Inthissection,weintroducethearchitectureofon-devicecollaborativelearningsystems (§2.2.1).Wethendiscusstheexistingtechniquesforenhancingtheirtrainingefyfollowed byexplainingtheirlimitations,whicharethekeymotivationofthiswork(§2.2.2). 2.2.1ArchitectureofOn-DeviceCollaborativeLearningSystems Similartodistributedtrainingindatacenters,on-devicecollaborativelearningalsousesStochastic GradientDescent(SGD)oritsvariantstotraintheDLmodelinaniterativemanner.Ineach SGDiteration,eachdevicerandomlysamplesamini-batchfromitslocaldatatocomputeitslocal gradients.Theselocalgradientsarethenaggregatedfromthedistributeddevicestoupdatethe modelparameters w as: w t + 1 = w t h t 1 K K å k = 1 g k (2.1) where g k = 1 j B k j å i 2 B k Ñ l ( x i ; w t ) arethegradientsatmachine k and B k isthemini-batchofma- chine k . Moderndistributedtrainingsystemsuseparameterserver(PS)architectureforgradientaggre- 45 gation.InPS[87,80,52],oneormoremachinesplaytheroleofserverstoaggregatecomputed gradientsfromworkermachines,updatethemodel,andsendtheupdatedmodeloraggregated gradientsbacktoallworkers.AsshowninFigure2.1,Distreamadoptsamulti-PSarchitecture andtreatseachenddeviceasaparameterservertoo.Bydoingthis,thecommunicationisnot bottleneckedatasingledevicegiventhelownetworkbandwidthintheon-devicesetting. 2.2.2ExistingApproachesandTheirLimitations Thetotaltrainingtimeofcollaborativelearning T canbegenerallyestimatedasthetotalnumber oftrainingiterationsuntilconvergence E multipliesthesumofthetimeconsumedbylocalcom- putation T cp (e.g.,computethegradientsofthemodelparameters)andthecommunicationtime consumedbytransmittingthemodelgradientsbetweentheparameterserverandtheedgedevice T cm ineachiterationasfollows: T = E ( T cp + T cm ) : (2.2) Toreducethetotaltrainingtime T ,existingapproachescanbeingeneralgroupedintothefollow- ingthreecategories. 
GradientCompression .Toreduce T ,onecommonapproachistoreducethecommunicationtime ineachiteration T cm inEq.(2.2).Thisisachievedbyquantizinggradientsusingsmallernumber ofbits[81,10,55]orselectingimportantgradientstotransfervia[79,56,33].Com- municationinon-devicesettingisconductedthroughwirelessnetworkswhosebandwidthismuch moreconstrainedthanEthernetorusedindatacenters.Givensuchlimitedbandwidth, thecontributionofgradientcompressiontoreducing T cm inEq.(2.2)islimited.Althoughmany 46 worksadoptaggressivegradientcompressionthatisabletopushthelimit,theygaintrainingef ciencybycompromisingthetrainingqualityandhencetheaccuracyofthetrainedmodel, whichcontradictsthegoalofthiswork. Local-updateSGD .IndistributedSGD,globalsynchronizationamongnodesisperformedineach iterationforexchangingandaveraginggradientinformation,whichisanexpensiveandlengthyop- erationinbandwidth-constrainednetwork.Atechniquecalledlocal-updateSGD[89,19,62,78] isproposedtoreducethecommunicationoverhead.Theideaistoalloweachnodetoperform multiplelocalupdatesbeforeglobalsynchronizationineachiteration.Similartothegradient compressionapproach,althoughaggressivelyperformingmultiplelocalupdatesgainstrainingef- ybyreducingcommunicationiterations,itagaincompromisestheaccuracyofthetrained model. Figure2.2:Comparisonincommunication-computationoverlappingbetweenhighbandwidthand lowbandwidthsetting. Communication-ComputationOverlapping .DLmodelsingeneralcanberepresentedasdi- rectedacyclicgraphs(DAGs).ThestructuresofDAGsprovideanopportunitytooverlapgradi- entcomputationandgradientaggregationinapipelinedfashiontomaskoutthecommunication cost.Thisoptimizationtechniquecanbeseenasamechanismtoreducethesum ( T cp + T cm ) in Eq.(2.2).Suchreductioncanbeachievedeitherbyidentifyingtheoptimalgradienttransferor- 47 (a)#iterations=0 (b)#iterations=200 (c)#iterations=600 Figure2.3:Distributionsofdataimportanceasthenumberofiterationsincreasesduringtraining. der[25,39,17,87]orusingstaledmodelweights[54].Althoughthiscomputation-communication overlappingtechniquedoesnotcompromisetheaccuracyofthetrainedmodel,thelimitedband- widthinon-devicecollaborativelearningcausesthecommunication-to-computationratiotobe muchhigherthantheoneindatacentersetting.Asaresult,theeffectivenessofoverlapping techniquesdiminishesbecauseitisnolongerabletomaskoutthecommunication costundersuchhighcommunication-to-computationratio.Todemonstratethis,Figure2.2shows thecostbreakdownofoverlappingtechnique.Asshown,althoughitprovidescostre- duction(47%)underhighbandwidthnetworkindatacentersetting,itonlyreduces15%inlow bandwidthnetworkinon-devicecollaborativelearningsetting. 2.3DistreamOverview ThelimitationsofexistingapproachesmotivateustodesignDistreambytakingadifferentpath. Inthissection,weintroducethekeyinsightthatDistreamisdesignedupon(§2.3.1).We thenintroducearudimentaryframeworkdesigneduponthekeyinsightanditsperformancemodel (§2.3.2).Lastly,weoverviewthetechniqueswedesignundertheguidanceoftheperfor- mancemodelinDistreamthattransformtherudimentaryframeworkintoahighlyefonefor on-devicecollaborativelearning(§2.3.3). 
48 2.3.1KeyInsight ThekeyinsightbehindthedesignofDistreamisto exploitdataeftoimprovethetrain- ingefofon-devicecollaborativelearning .Instandarddistributedtraining,ineachSGD iteration,eachdevice randomly samplesamini-batchfromitslocaldatatocomputeitslocalgra- dients.Thisrandomsamplingstrategy,atitscore,assumesthateachdatasampleinthelocaldata isequallyimportantduringthewholemodeltrainingprocess.Inpractice,however, notallthe datacontributeequallytomodeltraining .Someofthemprovidemoreinformationtothetraining andthusaremoreimportant,whiletheothersprovidelessinformationandthusarelessimportant. Asanexample,Figure2.3illustratestheimportancedistributionofthedatainCIFAR-10asthe numberofiterationsincreasesduringtraining.Asshown,moreandmoredatasamplesbecome lessimportantasthetrainingprocessproceeds.Inparticular,after200iterations,theimportance ofdataquicklyconcentratestoasmallportionofdatasamplesandtherestcontributelittletothe modeltraining.ThisobservationshedslightonthelowtrainingefyofstandardSGDwhere mini-batchisgeneratedbyrandomsampling. Basedonthisinsight,weproposetoincorporate importancesampling [91]toimprovethe trainingef ofeachSGDiteration.Thekeyideaofimportancesamplingisthatineach trainingiteration,itfocusesonsamplesthatprovidemoreimportantinformationandcontribute moretothemodeltraining. Formally,let x denoteasampleinthemini-batch, p denotetheuniformdistributionadopted byrandomsampling,and q denoteanewdistributionadoptedbyimportancesampling.Toensure trainingcorrectness,thegradient Ñ l ( x ) (hereweleave w t outforsimplicity)shouldbemultiplied 49 Figure2.4:Aconceptualcomparisonbetween(a)randomsamplingand(b)importancesampling. by p ( x ) = q ( x ) whenshiftingfromrandomsamplingtoimportancesampling[9]: E x ˘ q p ( x ) q ( x ) Ñ l ( x ) = E x ˘ p [ Ñ l ( x )] if q ( x ) > 0whenever p ( x ) > 0.Toensuregradientvariancereductiontoacceleratethetraining process, q ( x ) shouldbeproportionaltothesample'sgradientnorm( jj Ñ l ( x ) jj )[9]: q ( x ) µ jj Ñ l ( x ) jj : Theabovederivationimpliesthat asample'sgradientnormcanbeusedasanindicatorofits importance .Thelargergradientnormasamplehas,themorecontributionthesamplecanmaketo themodeltraining. Figure2.4providesaconceptualcomparisonbetweenrandomandimportancesampling.The bluesolidcurvedepictsthedistributionoftheimportancevaluesofsamples.Inrandomsampling, eachsamplehastheequalprobabilitytobeselectedregardlessofitsimportance.Assuch,if therearemoresampleswithlowerimportanceinadataset,sampleswithlowerimportancehave higherlikelihoodtobeselected(greendottedcurveinFigure2.4a).Incontrast,inimportance sampling,sampleswithhigherimportancehavehigherlikelihoodtobeselected(greendotted 50 curveinFigure2.4b).Assuch,themini-batchgeneratedbyimportancesamplingcarriesmore importantinformationthanrandomsampling.Suchadvantageenhancesthetrainingefy withineachtrainingiteration,whichinturnreducesthetotalnumberofiterationsandspeedsup thewholetrainingprocess. 
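To make the sampling and reweighting above concrete, the following minimal NumPy sketch draws a mini-batch with probability proportional to a per-sample importance score and applies the p(x)/q(x) correction so that the gradient estimate remains unbiased. This is an illustration only, not Distream's implementation; all names are hypothetical, and the per-sample loss is used here merely as a cheap stand-in for the gradient norm.

```python
import numpy as np

def importance_sample_batch(importance, batch_size, rng=None):
    """Draw a mini-batch with probability proportional to per-sample importance.

    Returns the selected indices and the p(x)/q(x) correction weights that keep
    the gradient estimate unbiased (p is the uniform distribution, q is the
    importance-based one).
    """
    rng = rng or np.random.default_rng()
    n = len(importance)
    q = importance / importance.sum()          # importance-based sampling distribution q(x)
    idx = rng.choice(n, size=batch_size, replace=True, p=q)
    weights = (1.0 / n) / q[idx]               # p(x)/q(x) correction for each drawn sample
    return idx, weights

# Toy usage: treat per-sample losses as a proxy for gradient norms.
losses = np.abs(np.random.randn(1000)) + 1e-6  # importance scores (must be positive)
idx, w = importance_sample_batch(losses, batch_size=32)
# A training step would average the per-sample gradients scaled by w, e.g.
# g = (w[:, None] * per_sample_grads[idx]).mean(axis=0)
```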
2.3.2RudimentaryFrameworkandPerformanceModel Inspiredbythebroughtbyimportancesampling,weproposeanimportancesampling- basedframeworkforon-devicecollaborativelearning.Ourframeworkisbuiltuponthestandard distributedSGD.Figure2.5aillustratestheoperationsinvolvedineachSGDiteration.Asshown, theonlydifferencebetweenthemisthattheimportancesampling-basedframeworkreplaces ran- domsampling (i.e.,randomlyselectsamplestoconstructthemini-batch)inthestandarddistributed SGDwithtwonewoperations: importancecomputation whichcomputestheimportance(i.e.,gra- dientnorm)foreverysampleinthelocaldataset,and importancesampling whichselectssamples basedontheircomputedimportancetoconstructthemini-batch. Theperformancemodelofsuchimportancesampling-basedframeworkcanbeformulatedas follows:let T is denotetheincreasedlocalcomputationcostineachSGDiterationcausedbyimpor- tancesampling,and E is denotethecorrespondingtotalnumberofiterationsuntilconvergence.The trainingspeedupofimportancesampling-basedframeworkoverstandardrandomsampling-based distributedSGDis: Speedup = E ( T cp + T cm ) E is ( T cp + T cm + T is ) = 1 E is E ( 1 + T is T cp + T cm ) : (2.3) 51 Figure2.5:ConceptualComparisonbetweenrudimentaryimportancesampling-basedframework andMercury. AsshowninEq.(2.3),theimportancesampling-basedframeworkcouldperformworsethan thestandardrandomsampling-baseddistributedSGD(i.e., Speedup < 1)ifthecomputationcost incurredbyimportancesampling T is overshadowsthebroughtbythereductionoftotal trainingiterations(i.e., E is < E ). 2.3.3OverallDesign Figure2.5billustratestheoveralldesignoftheproposedDistreamframework.Comparedtothe rudimentaryframeworkillustratedinFigure2.5a,Distreamincorporatesthreekeyinnovations guidedbytheperformancemodelinEq.(2.3)whicheffectivelyavoidthepitfallsoftherudimen- taryframework.First,Distreamincorporatesagroup-wiseimportancecomputationandsampling techniquethatconsiderablyreducesthecomputationcostofimportancesamplingwithoutcom- 52 promisingitseffectivenessandtrainingquality.Second,itincorporatesanimportance-awaredata reshardingtechniqueforbalancingimportancedistributionamongedgedevicestofurtheracceler- atethetrainingprocess.Lastly,itincorporatesabandwidth-adaptivecomputation-communication schedulerthatschedulestheexecutionofimportancecomputationanddatareshardinginparallel inabandwidth-adaptivemannersuchthattheircostscanbecompletelyhiddenbehindthestandard distributedSGDDistreamisbuiltupon. Inthefollowing,wedescribethedetailsofthesethreetechniquesin§2.4.1,§2.4.2and§2.4.3, andproveDistream'strainingcorrectnessin§2.4.4. 2.4DesignDetails 2.4.1Group-wiseImportanceComputationandSampling ThekeytechniqueinDistreamis group-wiseimportancecomputationandsampling :atech- niquethatconsiderablyreducesthecomputationalcostofimportancesamplingwithoutcompro- misingitseffectivenessandunbiasedconvergence.Inthestandardimportancesampling-based frameworkintroducedin§2.3.2,ineachSGDiteration,theimportanceof all thelocaldataare computedtoderivetheirimportancedistribution.Theall-inclusivenessofthisapproachnaturally (a)Distributionofrankdifferenceafter1iteration (b)Distributionofrankdifferenceafter10iteration Figure2.6:Distributionsofrankdifference. 53 leadstocostlycomputation.Incontrast,ourgroup-wisestochasticimportancesamplingtechnique foregoessuchall-inclusivenesstocutthecomputationalcost:duringeachSGDiteration,itonly computestheimportanceofa subset (i.e.,group)ofthelocaldata.Toguaranteeunbiasedcon- vergence,itthenadoptsastochasticapproachtoconstructthemini-batchbasedonthecomputed group-wiseimportance. 
Ourtechniqueisinspiredbyanobservationthat theimportanceofeachsampleofthelocal datadoesnotchangeabruptlyacrossmultipleSGDiterations .Figure2.6depictsthedifference ofimportanceranksbetweentwoiterations.About50%ofdatasampleschangelessthan5%in importancerankingandabout70%ofdatasampleschangelessthan10%inimportanceranking afteronetrainingiteration(Figure2.6(a)).Evenafter10iterations,about65%ofdatasamples changelessthan10%inimportanceranking(Figure2.6(b)).Inotherwords,ifonesampleis asimportant,itwouldpossiblystayasanimportantsampleforthefollowingmultiple SGDiterations.Therefore,itisnotnecessarytore-computetheimportanceofeachsampleinthe localtrainingdatasetforeverySGDiteration.Instead,wecanreusetheimportancecomputation resultsfrompreviousiterationstofurthercutthecomputationcost. Basedonthisobservation,ourgroup-wiseimportancecomputationandsampling(Figure2.7) consistsoftwosteps: Step#1:ComputeGroup-wiseImportance .Inthestep,wesplitthecompletelocaldataset oneachedgedeviceinto D non-overlappinggroups.AteachSGDiteration,insteadofre-computing theimportanceforallthesamples,onlytheimportanceofsamplesinoneofthe D groupsisre- computed.Indoingso,thecomputationalcostofimportancesamplingbecomes1 = D oftheall- inclusiveoneperSGDiteration.Thegroupisselectedinaround-robinfashionsuchthatevery groupisselectedonceevery D SGDiterations. SinceweonlyupdatethesampleimportanceofonegroupineachSGDiterationandreusethe 54 Figure2.7:Illustrationofgroup-wiseimportancecomputationandsampling.Eachdotrepresents onesampleandthedarknessindicatesitsimportance. previouslycomputedimportancefortheremaining D 1groups,thereisadiscrepancybetween thederivedimportancedistributionandtheimportancedistributionderivedbyre-computingthe importanceofallthesamples.Thisisbecauseasmodelparametersusedtocomputethesample importanceareupdatedduetoSGD,thesampleimportancecomputedindifferentiterationsdonot belongtothesamedistribution. Step#2:ConstructMini-BatchbasedonGroup-wiseImportance. Tomitigatetheeffectcaused bythediscrepancy,inthesecondstep,weneedtoconsidertwolevelsofimportancetoconstruct amini-batch.Theisthe inter-grouplevelimportance ,whichtherelativefreshness ofagroupcomparedtotheothergroups.Thesecondisthe intra-grouplevelimportance ,which theimportanceofsampleswithinagroup. Toaccommodatethe inter-grouplevelimportance ,weselect d groupsoutof D groups, andassigntheprobabilityofselectinggroup i as: r i = exp ( b t i ) å D n = 1 exp ( b t n ) (2.4) where t i isthestepindexwhentheimportanceofgroup i isupdatedand b > 0istheamplifying factor.Alarger b encouragestheselectionofnewergroups.Toaccommodatethe intra-grouplevel 55 importance ,weselect j B j M i å d n = 1 M n fromeachselectedgroup i usingimportancesamplingwhere M i isthesizeofgroup i .Toguaranteeunbiasedconvergence,theprobabilityofasamplebeing selectedisproportionaltoitslosswithinitsgroup.Inotherwords,theprobabilityofselecting sample j ingroup i isgivenby: p i ; j = I i ; j å M i n = 1 I i ; n (2.5) where I i ; j istheimportanceofthesample j ingroup i .Finally,themini-batchisconstructedby combiningtheselectedsamplesfromselectedgroups.Toensureunbiasedness,wereweighthe gradientsamongsampleswhencomputingaveragegradientstoobtaintheunbiasedgradient asproposedin[46]. 
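The two steps above can be summarized in a short sketch. The code below is purely illustrative (class and variable names are hypothetical, and the per-sample loss again stands in for the true importance): the local data is split into D non-overlapping groups, only one group's importance is refreshed per iteration in round-robin order (Step #1), groups are then drawn with the freshness-weighted probabilities of Eq. (2.4), and samples within each chosen group are drawn proportionally to their cached importance as in Eq. (2.5) (Step #2).

```python
import numpy as np

class GroupwiseSampler:
    """Round-robin group refresh plus stochastic mini-batch construction (illustrative)."""

    def __init__(self, num_samples, num_groups, beta=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.groups = np.array_split(np.arange(num_samples), num_groups)  # D non-overlapping groups
        self.importance = [np.ones(len(g)) for g in self.groups]          # cached I_{i,j}
        self.last_update = np.zeros(num_groups)                           # t_i in Eq. (2.4)
        self.beta = beta                                                  # amplifying factor b
        self.step = 0
        self.next_group = 0

    def refresh(self, importance_fn):
        """Step #1: re-compute importance for one group only, in round-robin order."""
        i = self.next_group
        self.importance[i] = importance_fn(self.groups[i])                # e.g. per-sample loss
        self.last_update[i] = self.step
        self.next_group = (i + 1) % len(self.groups)
        self.step += 1

    def sample_batch(self, batch_size, num_groups_to_use=2):
        """Step #2: pick groups by freshness (Eq. 2.4), then samples by importance (Eq. 2.5)."""
        r = np.exp(self.beta * self.last_update)
        r /= r.sum()                                                      # inter-group probabilities r_i
        chosen = self.rng.choice(len(self.groups), size=num_groups_to_use,
                                 replace=False, p=r)
        sizes = np.array([len(self.groups[i]) for i in chosen])
        quota = np.maximum(1, (batch_size * sizes / sizes.sum()).astype(int))
        batch = []
        for i, k in zip(chosen, quota):
            p = self.importance[i] / self.importance[i].sum()             # intra-group probabilities p_{i,j}
            batch.extend(self.groups[i][self.rng.choice(len(p), size=k, p=p)])
        return np.array(batch)

# Toy usage: the importance function here simply returns random positive "losses".
sampler = GroupwiseSampler(num_samples=1000, num_groups=10)
sampler.refresh(lambda idx: np.abs(np.random.randn(len(idx))) + 1e-6)
batch_indices = sampler.sample_batch(batch_size=32)
```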
2.4.2Importance-awareDataResharding ThesecondkeytechniqueinDistreamis importance-awaredataresharding :atechniqueforbal- ancingimportancedistributionamongworkerstoacceleratethetrainingprocess.Asimportance updateisperformedonthelocaldatasetateachedgedevice,thelocalimportancerankwithineach edgedevicedoesnotecttheglobalimportancerankthataccumulativelyconsiderstheimpor- tanceofallthesamplesfromalledgedevices.Asaconsequence,anedgedevicemayrepeatedly learngloballytrivialsamples,whichconsiderablylowersthetrainingefy.Thisproblemis exacerbatedwhendataclassesarenotevenlydistributedacrossedgedevices,whichiscommonin on-devicesettings.Suchproblemcanberesolvedbydataresharding,atechniquethatrandomly redistributessamplesamongworkers.Indatacenters,datareshardingcanbeeasilyachieveddue totheavailabilityofhigh-bandwidthnetwork.However,inon-devicesettingswherethenetwork bandwidthismuchmoreconstrained,shufalargeamountofdataamongedgedevicessignif- 56 Figure2.8:Importance-agnosticdatareshardingvs.importance-awaredataresharding. icantlydelaysthetrainingprocess. Tothisend,weproposeimportance-awaredatareshardingthatonlyredistributesimportant samples.Insteadofblindlyshufdataamongworkers,weonlyselectnon-trivialsamplesto shuftomaximizedatareshardingefywithminimumcommunicationoverhead.Asshown inFigure2.8,supposeafterdatareshardingisdetermined,aworkerwith N samplesand D groups needstoswap N p < N sampleswithotherworkersandthebudgetonlyallows N o < N p samplesto betransferred.Forimportance-agnosticdataresharding,itdirectlytransfers N p sampleswithout takingthesampleimportanceinformationintoconsideration.Incontrast,importance-awaredata reshardingselects N o M i å j M j samplesfromeachgroup i wherethegroupsizeis M i .Theprobabilityof asamplebeingselectedisproportionaltoitsimportancewithinitsbelonginggroupasinEq.(2.5). Asaresult,weareabletoselectimportantsamplestoberedistributedwithoutchangeingroup sizeorlocaldatasetsize. 2.4.3BACCScheduler Anaiveimplementationofimportancesamplingforeachdeviceincollaborativelearningisshown inFigure2.9a.Theimportancecomputationanddatareshardingissequentiallyinsertedinthe originaltrainingprocess,incurringunnecessaryblockingandoverhead. 57 Figure2.9:(a)UnoptimizedPipelinevs.(b)Optimizedpipeline. w t representsthemodelweight atiteration t .R=R1+R2+R3+R4inlength. AsshowninFigure2.9b,Distreamincorporatesa bandwidth-adaptivecomputation-communication (BACC)scheduler whichschedulestheexecutionofimportancecomputationanddataresharding inparallelinabandwidth-adaptivemannertofurtherimprovethespeedupbycompletelymasking outthecostsofimportancesamplinganddataresharding.,sincegradientaggregation onlyusesnetworkresourceandgradientcomputationonlyusescomputeresource,thegroup-wise importancecomputationandimportance-awaredatareshardingcanbeoverlappedwithgradient aggregationandgradientcomputation,respectivelyandexecutedinparallel.Thiscanbeachieved bycreatingtwothreads,oneforstandarddistributedSGDoperationsandtheotherforgroup- wiseimportancecomputationandimportance-awaredataresharding.Indoingso,theoverheads incurredbythesetwotechniquescanbemaskedoutandthetrainingspeedupisfurtherimproved. Giventhatbothgradientaggregationanddatareshardingdependonnetworkbandwidthwhich 58 couldexperiencevariationsovertimeinreal-worlddeployments,achievingfulloverlappingthat completelymasksoutthecostsofimportancecomputationanddatareshardingrequiresthesched- ulertobebandwidth-adaptive.Weproposetwotoadapttothebandwidthvariation. 
First,tofullyoverlapimportancecomputationwithgradientaggregation,insteadofusingaed groupsize,Distreamadoptsvaryinggroupsizessuchthatcomputingtheimportanceofthesam- plesinagroupconsumesthesameamountoftimeasgradientaggregation.Tocompensateforthe reductionofimportancevarietyingroupsofsmallsizeandensureunbiasedness,theimportance ofeachdatasamplesampledfromgroupsofdifferentsizeswillbereweighedusingitsgroupsize. Second,tofullyoverlapdatareshardingwithgradientcomputation,Distreamadoptsbreakpoint- resumetechnique:datareshardingpauseswhengradientaggregationbeginsandresumeswhen itends.Bydoingthese,thecostsofimportancesamplingcanbecompletelyhiddenbehindthe standarddistributedSGDinabandwidth-adaptivemanner. Figure2.9billustrateshowourproposedBACCschedulerisabletoachievefulloverlappingto completelymaskoutthecostsofimportancecomputationanddataresharding.,after gradientcomputation,Thread1yieldsitscomputeresourcetoThread2toperformimportance computation.Aftergradientaggregationiscomplete,Thread2yieldsthecomputeresourcesto Thread1toperformgradientcomputationforthenextiteration t + 1.Similarly,aftergradient aggregation,Thread1yieldsnetworkresourcetoThread2toperformdataresharding.When gradientaggregationforthenextiterationbegins,Thread2pausesdatareshardingandyields networkresourcebacktoThread1toperformgradientaggregationforthenextiteration. 2.4.4ProofofTrainingCorrectness Lastly,duetopagelimitation,weprovidethetheoreticalresultshowingthatthecollab- orativelearningunderDistreamisguaranteedtoconvergetothesamesolutionasthestandard 59 distributedSGD. SketchofProof :weshowthatthestochasticgradientinformationaveragedattheserveris unbiased.Thenweshowthatitsvarianceisbounded. 1.Thegradientaveragedacrossallworkersisunbiased. Weuse Ñ l k ; i ; j torepresentthegradient fromthesample j ingroup i ofdevice k forthesimplicityofnotation.Usingtheproposedsampling techniquein§2 : 4 : 1,theexpectationofaggregatedgradient g is E ( g t )= 1 å K k = 1 N k K å k = 1 N k 1 N k D k å i = 1 M k ; i å j = 1 Ñ l k ; i ; j = Ñ l ( w t ) : where M k ; i isthenumberofsamplesingroup i atnode k ; N k and D k arethenumberofsamples andthenumberofgroupsatnode k .Itshowsthattheaveragedgradientisunbiased. 2.Thegradientaveragedacrossallthedeviceshasboundedvariance. Notethatthevariance of g t is Var ( g t )= K å k = 1 ( N k å j N j ) 2 Var ( g k ) : Sinceweonlychangetheprobabilityofchoosingthesamples,thevariance Var ( g k ) ateach node k isstillbounded.Therefore,thevarianceof g t isbounded.Forsimplicity,weusethesame notation V forthisbound. 2.5Implementation Testbed .Wedesignedanddevelopedourowntestbedduetolackofoff-the-shelfones.Specif- ically,weuse12NVIDIAJetsonTX1asedgedevices.EachTX1hasanintegratedsmallform 60 Figure2.10:Distreamimplementation. factormobileGPU,whichisdesignedfornext-generationintelligentedgedevicestoexecuteDL workloadsonboard.Forwirelessnetworking,weuseNetgearNighthawkX6SAC4000Tri-band Wi-FirouterstoconnectalltheTX1,anduseLinuxtools tc , qdisc ,and iptables tocontrolthe networkbandwidthtoconductourexperiments. 
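For reference, bandwidth control of this kind can be scripted around tc's token-bucket filter. The snippet below is only a sketch of the approach (the dissertation does not list its exact commands, and the interface name and rates are assumptions):

```python
import subprocess

IFACE = "eth0"  # assumption: substitute the testbed's actual network interface

def set_bandwidth(rate="10mbit", burst="32kbit", latency="400ms"):
    """Cap egress bandwidth on IFACE with a token-bucket filter (requires root)."""
    subprocess.run(["tc", "qdisc", "replace", "dev", IFACE, "root",
                    "tbf", "rate", rate, "burst", burst, "latency", latency],
                   check=True)

def clear_bandwidth():
    """Remove the shaping qdisc and restore the interface's default behavior."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)

# Example: emulate the 10Mbps low-bandwidth condition, run an experiment, then restore.
# set_bandwidth("10mbit"); ...; clear_bandwidth()
```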
FrameworkImplementation .WeimplementedDistreamusingTensorFlow1.10.Figure2.10 showstheimplementationofourDistreamframework.Distreamisdistributedtrainingframework thatspansacrossmultipleedgedevices.Ineachtrainingiteration, ImportanceSampler constructs amini-batchfromon-devicedatausinggroup-wiseimportancesampling( 1 )basedontheim- portancedistributionoflocaldatastoredin ImportanceCache .Themini-batchisfedinto Model Trainer tocomputelocalgradient( 2 ).Thelocalgradientsfromalldevicesareaggregatedand theaggregatedgradientsaresentto ModelTrainer toupdatethemodel( 3 ).Meanwhile, Im- portanceUpdater computesthedataimportanceandupdates ImportanceCache ( 4 )whenthe deviceisperforminggradientaggregation. ReshardExecutor importantdatasamples from ImportanceCache ( 5 )andcommunicateswithotherdevicestoperformimportance-aware dataresharding( 6 )whenthedeviceisperforminggradientcomputation.Theexecutionsof Re- 61 shardExecutor and ImportanceUpdater aretriggeredandscheduledbythe BACCScheduler at runtime. 2.6Evaluation Inthissection,weevaluatetheperformanceofDistreamwiththeaimtoanswerthefollowing questions: Q1(§2.6.2): DoesDistreamoutperformstatusquo?Ifso,whatarethereasons? Q2(§2.6.3): HoweffectiveiseachcoretechniqueincorporatedinthedesignofDistream? Q3(§2.6.4): CanDistreamadapttowirelessnetworkbandwidthvariationwell? Q4(§2.6.5): DoesDistreamscalewellwhenthenumberofedgedevicesincreases? 2.6.1ExperimentalMethodology TaskDomains,Datasets,andDLModels .TodemonstratethegeneralityofDistreamacross domains,datasets,andDLmodels,weevaluateDistreamonsixcommonlyuseddatasetsusing adiversesetofDLmodelsacrossthreemostimportanttaskdomains:computervision,speech recognition,andnaturallanguageprocessing. ComputerVision. Inthedomainofcomputervision,weusefourdatasets., weselectCIFAR-10,CIFAR-100,andSVHNastheyarethreeofthemostcommonlyused computervisiondatasetsfordeeplearningalgorithmevaluation.BothCIFAR-10andCIFAR- 100[48]consistof50 ; 000trainingimagesand10 ; 000testimagein10classes.SVHN[82] consistsof73 ; 257trainingand26 ; 032testimages.WeuseResNet18[28]totrainonCIFAR- 10andSVHNtraining,anduseRetNet50[28]totrainonCIFAR-100.Inaddition,weselect 62 AID[83]asourfourthcomputervisiondataset.AIDisalarge-scaleaerialimagedataset collectedfromGoogleEarth.Ithas10 ; 000imagesin30classes,includingairport,mountain, desert,forest,etc.Werandomlyselect80%imagesfortrainingandtheremaining20%images fortesting.Weselectthisdatasettoemulatetheapplicationofon-devicecollaborativelearning acrossaswarmofdrones.WeuseMobileNetV2[70]totrainonthisdataset. SpeechRecognition .Inthedomainofspeechrecognition,weselectTwSpeechCom- mand[75]asourdataset.Thisdatasetconsistsof105 ; 829audioutterancesof35shortwords, recordedbyavarietyofdifferentpeople.Therecordedaudioclipsareintendedforsimple speechinstructionssuchas go , stop , yes , no ,etc.Thetrainingsethas84 ; 843samplesandthe testsethas11 ; 005samples.Weextractthe2-Dspectrogramsfromrawaudioclips,anduse VGG-13[73]asthemodel. NaturalLanguageProcessing .Inthedomainofnaturallanguageprocessing,weselectAG NewsCorpus[90]asourdataset.ThisdatasetcontainsnewsarticlesfromtheAG'scorpus.It has120 ; 000trainingand7 ; 600testingsamples.Eachsampleconsistsofacoupleofsentences andislabeledasoneofthefourclasses: world , sports , business and sci/tech .Weuseatwo- layerLSTMwithattention[11,31]totrainonthisdataset. 
Baselines .ThegoalofDistreamistoenhancethetrainingefyofon-devicecollaborative learningwhileretainingalgorithmcorrectnessguaranteeswithoutcompromisingtheaccuraciesof thetrainedmodels.Forfaircomparison,wecompareDistreamagainsttwostatusquoframeworks whichsharethesamegoal 1 . TicTac [25]. TicTac isastatusquodistributedtrainingframeworkfortrainingacceleration. 1 Gaia [33]isthestatusquocommunication-efdistributedtrainingframeworkbasedongradientcompres- sion.However, Gaia obtainstrainingefybycompromisingtheaccuraciesofthetrainedmodels.Therefore,we didnotincludeitasabaseline. 63 Itproposestwoheuristicstore-orderparametertransfernodesinacomputationalgraphto increasetheoverlappingbetweencomputationandcommunication,outperformingTensorFlow by1 : 19 shorterintotaltrainingtime. AdaComm [78]. AdaComm isastatusquocommunicati-on-efdistributedtraining framework. AdaComm reducestrainingtimebyallowingeachworkertoperformmultiple localSGDiterationsbeforecommunicationandadaptivelybalancingbetweenthenumberof localSGDiterationsandcommunication. EvaluationMetrics .WeusetwometricstoevaluatetheperformanceofDistreamandthebase- lines. TotalTrainingTime .Weusetotaltrainingtimeasourtmetric.Totaltrainingtimeis asthewall-clocktimefromthebeginningtotheconvergenceoftrainingprocess.This metriccontainsnotonlyinformationaboutthenumberofiterationsuntilconvergencebutalso informationaboutthetrainingtimeperiteration. TrainingQuality .Thesecondmetricistrainingquality.Weusetop-1testaccuracyofthe trainedDLmodeltomeasurethetrainingquality. TrainingDetails .Duringtraining,thebatchsizeofeachdeviceissetto32,andweadoptSGD optimizerandcosine-annealinglearningratedecay[59]. (a)CIFAR-10 (b)CIFAR-10 Figure2.11:OverallperformancecomparisononCifar10andCifar100. 64 (a)SVHN (b)AID Figure2.12:OverallperformancecomparisononSVHNandAID. (a)TwSpeechCommand (b)AGNews Figure2.13:OverallperformancecomparisononTwSpeechCommandandAGNews. 2.6.2OverallPerformance WecomparetheoverallperformanceofDistreamwith TicTac and AdaComm .Todoso,we runtrainingexperimentsonfouredgedevices,andmeasuretheirtotaltrainingtimespeedupsover standarddistributedSGDonthesixdatasets.Toprovideacomprehensiveevaluation,wedidour experimentsundervariousnetworkbandwidthsrangingfrom10Mbpsto200Mbps,emulating scenarioswithdifferentwirelessnetworkbandwidthavailability. Figure2.11,Figure2.12andFigure2.13showstheresults.WeseethatDistreamcantly outperforms TicTac and AdaComm onallthesixdatasetsacrossallthenetworkbandwidths,with improvementsfrom1 : 6 to3 : 9 .,Distreamachievesupto3 : 74 ,3 : 21 ,4 : 08 , 2 : 21 ,2 : 74 and1 : 85 speedupsoverstandarddistributedSGDonsixdatasetsrespectively. Incontrast, TicTac achieves1 : 01 to1 : 21 speedupand AdaComm achieves1 : 2 to1 : 9 in 65 wirelesssettingswherebandwidthsareconsiderablyconstrained. WealsonoticethatDistreamachieveshigherspeedupswhenbandwidthbecomesmorecon- strained.Forexample,acrosssixdatasets,Distreamachievesthehighestperformancegainunder 10Mbps,thelowestbandwidthinourexperiment.Thisisbecauseundermoreconstrainedband- width,thecommunicationtimeislargerandDistreamcanspendmoretimeoncomputinglatest importancerank.Thisresultdemonstratesthatthemoreconstrainedthebandwidthis,themore superiorityDistreamhas. WhyDistreamOutperforms TicTac and AdaComm .TounderstandwhyDistreamisableto achievehigherspeedupscomparedto TicTac and AdaComm ,weuse CIFAR-10 asanexample, anddepictthetestaccuracyofDistream(blue), TicTac (orange)and AdaComm (green)during thecompletetrainingprocessinFigure2.14.Wehavetwoobservations. 
First,Distreamoutperforms TicTac byalargemarginfromthebeginninguntilconvergence. ,Distreamuses2 : 7 and2 : 4 fewercommunicationroundson CIFAR-10 and2 : 7 on TSpeechCommand toconvergecomparedto TicTac .Thisisbecauseunderwireless networksettingwherebandwidthislimited,thecommunicationtimedominatestrainingoverhead overlappingcomputationwithcommunicationisnoteffectiveanymore(Figure2.2).Therefore, TicTac provideslittleruntimereductionineachiteration.Incontrast,Distreamaimsatimproving thetrainingefy periteration toreducethetotalnumberofiterations(communicationrounds) withoutadditionaloverhead. Second,although AdaComm convergesfasterthanDistreaminearlystagewhenthenumber ofcommunicationsissmallerthan10 10 3 (Figure2.14(left)), AdaComm beginstoloseeffect andconvergesslowerthanDistream.Thisisbecause,although AdaComm canalsoreducecom- municationrounds,itachievessuchreductionbyperforminglocaltrainingstepsinearlystage oftraining.Toguaranteethetrainingquality, AdaComm hastoperformlesslocalstepsandthe 66 Figure2.14:Testaccuracycurvesduringthecompletetrainingprocessintermsofnumberof communicationrounds(left)andwall-clocktime(right). accelerationof AdaComm beginstoslowdownaftertheearlystage.Inaddition,performing multiplelocalstepsincreasethecomputationtimeineachround.Incontrast,withthetrainingcor- rectnessguarantee,Distreamisabletoincreasetrainingefythroughouttheentiretraining processwithoutadditionaloverheads.Therefore,eventhough AdaComm achievesfasterconver- genceduringearlystageoftraining,Distreameventuallyoutperforms AdaComm intermsofboth numberofcommunicationsandtotaltrainingtime. 2.6.3Component-wiseAnalysis Next,weevaluatetheeffectivenessofeachofthethreekeytechniquesincorporatedinDistream. Theexperimentalsetupisthesameasin§2.6.2.Weonlyincludetheresultsobtainedin CIFAR-10 Figure2.15:Testaccuracycurvesintermsofnumberofiterations(left)andwall-clocktime(right) usinggroup-wise(Groupwise-IS)andall-inclusiveimportancecomputationandsampling(All-IS). 67 heresinceweachievedsimilarresultsontheothervedatasets.Theexperimentresultsaltogether showthatwiththethreeproposedtechniques,Distreamcanachievehigherperformancegaincom- paredtotherudimentaryimportancesampling-basedframework. Component#1Analysis:Group-wisevs.All-inclusiveImportanceComputationandSam- pling .Weevaluatetheeffectivenessoftheproposedgroup-wiseimportancecomputationand samplingtechniquedescribedin§2.4.1.Todoso,wecompareitagainsttheall-inclusiveimpor- tancecomputationandsamplingstrategy,whichre-computestheimportanceofeverysamplein thelocaldatasetateachSGDiteration.TheresultsareshowninFigure2.15. Aswecansee,whilegroup-wiseimportancecomputationandsamplinguses20%moreit- erationstoconvergetothesametestaccuracyastheall-inclusiveimportancecomputationand samplingscheme(Figure2.15(left)),group-wiseis4 : 2 fasterthantheall-inclusivewhentrans- latedintototaltrainingtime(Figure2.15(right)).Thisisbecauseeventhoughre-computing theimportanceofeverysampleindeedimprovesthetrainingqualityperiteration,the computationcostitincursprolongsthetrainingtimeperiteration.Incontrast,bysac- marginaltrainingqualityperiteration,group-wisecomputationandimportancesampling isabletoconsiderablycutthetrainingtimeperiteration,leadingtoamuchreducedtotaltraining time. Component#2Analysis:Importance-awarevs.Impor-tance-agnosticDataResharding. Next, weevaluatetheeffectivenessoftheproposedimportance-awaredatareshardingtechniquede- scribedin§2.4.2.Todoso,wecompareitagainsttheimportance-agnosticdatareshardingstrategy. 
Differentfromimportance-awarereshardingwhichselectstheimportantsamplesduringreshard- ing,importance-agnostictreatseverysampleequallyandrandomlyreshufthesamplesacross edgedevices.Wealsocomparetooneextremecasewheredatareshardingwasnotperformedat 68 Figure2.16:Importance-awarevs.importance-agnosticdataresharding,vs.non-resharding. all.Figure2.16showstheresults.Wehavetwoobservations. First,similartothefrommanyotherworks[22,23,64],weobservethatcomparedto non-resharding,datareshardingisabletoboostthetestaccuracyupby1 : 84%,demonstratingthe necessityofdatareshardingforachievingstate-of-the-arttestaccuracy. Second,comparedtoimportance-agnosticdataresharding,theproposedimportance-aware strategyisabletoconvergetothesametestaccuracy,butwithonly50%ofitsdatatransferamount. Thisisbecauseimportance-awarereshardingprioritizesshufmoreimportantsamples,which considerablyimprovesthereshardingefyandreducingthenetworktraffordatareshard- ing. Figure2.17:Testaccuracycurvesintermsofnumberofiterations(left)andwall-clocktime(right) usingBACCandsequentialpipeline. Component#3Analysis:BACCSchedulervs.SequentialPipeline. Toevaluatetheeffective- nessoftheproposedBACCschedulerdescribedin§2.4.3,wecompareitagainstthesequential 69 Figure2.18:SpeeduponCIFAR-10. Figure2.19:Bandwidthsnapshot. pipeline,whichperformsdataresharding,group-wiseimportancecomputationandsampling,gra- dientcomputationandgradientaggregationsequentially.AsshowninFigure2.17,althoughBACC uses5%moreiterationstoconverge(Figure2.17(left)),whentranslatedintototaltrainingtime,it is2 : 2 fasterthanthesequentialpipeline(Figure2.17(right)). 2.6.4BandwidthAdaptationPerformance WetakeacloserlookatthebandwidthadaptationperformanceoftheBACCschedulerproposedin Distream.Whendeployingourtestbedinthereal-worldsettings,weobservedthatthebandwidth variationsacrossdifferentbandwidthsare10%onaverage(Figure2.19showsasnapshot).To examinethebandwidthadaptationperformanceoftheBACCscheduler,wecompareitagainsta bandwidth-agnosticsolutionwhereedgroupsizesanded-timedatareshardingareadopted. Figure2.18showsthespeedupcomparisonacrossfourbandwidths.Asshown,withoutbandwidth adaptation,thetrainingspeedupdropsfrom3 : 74 ,3 : 64 ,3 : 62 ,3 : 48 to3 : 43 ,3 : 38 ,3 : 33 , 3 : 25 ,whichvalidatestheeffectivenessofDistreaminadaptingtobandwidthvariations. 70 Figure2.20:ScalingperformanceofDistreamon4,8,12edgedevicesundervariousnetwork bandwidths. 2.6.5ScalingPerformance Finally,weevaluatethescalabilityofDistreambyrunningitonmoreedgedevicesunderdifferent bandwidths.Again,weonlyshowtheresultsofCIFAR-10hereduetopagelimitationsincewe observedsimilarresultsontheothervedatasets.AsshowninFigure2.20,whenscalingfrom4 devicesto8and12devices,Distreamgracefullymaintainsthespeedupoverstandarddistributed SGDundervariousnetworkbandwidths.ThisresultdemonstratesthatDistreamisabletomaintain itssuperiorityoverstandarddistributedSGDwhenthenumberofedgedevicesscalesup. 2.7RelatedWork ThedesignofDistreamwasinspiredbydistributedtrainingframeworksindatacentersetting. ThemostsimilarworkstoDistreamare AdaComm [78]whichadaptivelyadjuststhenumber oflocalSGDiterationstoreducecommunicationcost,and TicTac [25]whichoverlapsgradient computationwithcommunicationtoreduceper-iterationoverhead.Althoughmanytechniques wereproposedintheseworkstoacceleratetrainingprocess,theyaredesignedfordistributed trainingsystemsindatacenters.Unfortunately,thereislimitedperformancegainwhendirectly applyingthesetechniquestoon-devicesettinggiventhegapinnetworkbandwidth 71 betweenthesetwosettings. 
ThedesignofDistreamwasalsoinspiredbyexistingworksinfederatedlearning[62,47]Œthe othercategoryofon-devicedistributedtrainingaswediscussinintroduction.However,thetraining efyoptimizationtechniquesinvolvedinthoseworksachievetrainingefyatthecost ofcompromisingtheaccuraciesofthetrainedmodels,whichmotivatesourworkfordesigning techniquesthatimprovetrainingefywithoutsuchcompromise.Infact,ourproposedgroup- wiseimportancecomputationandsamplingtechniqueandBACCschedulercanalsobeappliedfor enhancingthetrainingefyoffederatedlearning. 2.8Conclusion Inthispaper,wepresentthedesign,implementation,andevaluationofDistream,animportance sampling-basedframeworkthatenablesefon-devicecollaborativelearningwithoutcompro- misingtheaccuraciesofthetrainedmodels.Distreamaddressesthekeybottleneckofcollaborative learning,andcontributesnoveltechniquesthattakeadifferentpathfromexistingapproaches.We implementedDistreamandconductedarichsetofexperimentswithaself-developedtestbedon sixcommonlyuseddatasetsacrosscomputervision,speechrecognition,andnaturallanguagepro- cessing.OurresultsshowthatDistreamconsistentlyoutperformsthestatusquo.Therefore,we believeDistreamrepresentsacontributiontomakingcollaborativelearningsystems practicallyusefulinreal-worlddeployments. 72 Chapter3 CollaborativeLearningfor FederatedLearning 3.1Introduction Recentyearshavewitnessedtheboomingofedgedevices.Itisexpectedthatby2030,thenumber ofedgedeviceswillincreasefrom8.74billionin2020to25.44billion[12].Theseedgedevices, suchassmartphones,drones,wearables,arerevolutionizingthewaywelive,work,andinteract withtheworld.Equippedwithavarietyofsensors,thesedevicesarecollectingamassivevolume ofdatafrompeopleaswellassurroundingenvironmentseveryday.Thesedataareofgreatvalue totechnologycompaniesforimprovingtheirproductsandprovidingbetterservicesandcustomer experience.Inordertogainrealvalueoutofthesedata,thestatusquoapproachistouploadthese decentralizeddataontocentralserversinthecloudwherepowerfulmachinelearningmodelssuch asdeepneuralnetworksaretrained.However,theseuserdataoftencontainprivateorprivacy- sensitiveinformationsuchasfaces,licenseplates,housenumbers,etc.Asaresult,thesedata areprotectedfromleakagebylaw,anduploadingthedatatocentralcloudserversforfurther processingisnotafeasiblesolution. WiththeemergenceofAIchips,themajorityoftheedgedevicesarenowabletonotonly performon-devicemodelinference,butalsotrainadeeplearningmodelusingthedatacollected 73 bythemselves.Thereisainterestinleveragingtheon-devicecomputationalcapabilities tocollaborativelytrainagloballysharedmodelfromthedecentralizeddatawhilekeepingthedata locallyoneachdevice[72].Manypeoplehaveshiftedfromcentralizedservertoedgesettingfor machinelearningmodeldevelopmentandtraining. Drivenbythistrend,Googleproposedfederatedlearning[47]in2016.Federatedlearningis anemergingmachinelearningparadigmthathasrecentlyattractedconsiderableattentionduetoits widerangeofapplicationsinmobilescenarios[63,47,74].Differentfromthestandardmachine learningapproachwhichrequiresallthetrainingdatatobecentralizedinaserverorinadatacenter, itenablesgeographicallydistributededgedevicessuchasmobilephonestocollaborativelylearna sharedmodelwithoutexposingthetrainingdataoneachedgedevice.Assuch,federatedlearning enablesdistributingtheknowledgeacrosssmartphoneswithoutexposingusers'privatedata. 
Federatedlearningusesdistributedstochasticgradientdescent(SGD)andrequiresacentralto coordinatetheiterativetrainingprocess.Beforeeachroundbegins,theserverselectsasubsetof clientsanddistributesthelatestglobalmodeltoalltheparticipatingdevices.Eachdevicecomputes thegradientorupdateofthemodelparametersusingitslocaldata.Theserveraggregatesthe gradientsormodelsupdatefromalldevices,averagesthem,andupdatestheglobalmodel.The centralserversendstheupdatedmodeltotheclientstoperformlocaltrainingofnextround.In suchmanner,eachdevicefromobtainingabettermodelthantheonetrainedonlyonthe locallystoredprivatedata. AlthoughFLaddressestheprivacyissue,itstillhasmanyproblemsandsuffersfromsigni performancedegradationduetothreeproblems.Theislowtrainingefy.Infederated learning,thetrainingdataismassivelydistributedacrossalargenumberofclients.Existingap- proachesadoptrandomsamplingwhenitselectsclientstoparticipateintrainingbeforeeachround begins.Ineachround,theclientsadoptrandomsamplingstrategytosamplemini-batcheswhen 74 performinglocaltraining.Consequently,thetrainingefyinfederatedlearningislowand itrequiresmorestepstoconverge.Thesecondproblemislimitednetworkbandwidth.Indata centers,communicationbetweentheserverandworkingnodesisconductedviaGbpsEthernet orniBandnetworkwithevenhigherbandwidth[81].Incontrast,communicationinfederated learningreliesonwirelessnetworkssuchas4GandWi-Fi.Bothuplinkanddownlinkbandwidths ofthosewirelessnetworksareatMbpsscale,whichismuchlowerthantheGbpsscaleinthedata centersetting.Thelimitedbandwidthinfederatedlearningillustratesthenecessityofreducing thecommunicationcosttoacceleratethetrainingprocess.Thelastproblemisfederatedlearn- ingislowaccuracyduetotheadoptionofsmallcapacitymodel.Althoughexistingedgedevices suchassmartphonesareabletotrainneuralnetworkmodels,theyhavemuchlesscomputere- sourcescomparedtocentralizedcloudservers.Moreover,thesedevicesoftenneedtoconcurrently runmultiplemodelsthatperformdifferenttasksinrealtime.Itthusrequiresthecollaboratively trainedmodelinfederatedlearningtobecomputationallyefTherefore,federatedlearning usuallyaccuracyandusesasmallmodelinordertoreducethecomputationalcost. Inthiswork,weproposeFedAce,aneffederatedlearningframeworkthataddressesthe aforementionedproblems.Itisfeaturedbytwokeycomponents.Thecomponentisfeder- atedimportancesampling,whichnotonlyperformsimportantdataselectionwithineachclient inlocalmodeltraining,butalsoselectsimportantclientstoparticipateineachroundoftrain- ing.Byfocusingonimportantclientsanddata,FedAceisabletoincreasetrainingefyand reducetrainingtime.Thesecondcomponentisfederatedmodelcompression,whichiteratively compressesthegloballysharedmodelthroughoutthewholetrainingprocess.Theglobalmodel iscompressedusingaglobalmaskaggregatedbylocallycomputedmasks,whichrepresentsthe opinionsonthelocationsofimportantparametersfromdistributedclients.Aftertrainingcom- pletes,onlyimportantparametersarekept,andtheglobalmodeliscomputationallyef 75 Astheexchangedinformationbetweenclientsandthecentralserverareintheformofmodel,we canreducethecommunicationcostbytransferringacompressedmodel.Asaresult, thetrainingtimeisreduced. WeimplementFedAceusingFedML[26],aresearchlibraryandbenchmarkforfederated learning.WeexaminetheperformanceofFedAceonfourfederatedlearningdatasetsacrosscom- putervision,speechrecognition,andnaturallanguageprocessingtasks.Ourexperimentresults showthatFedAceisabletoincreasetrainingefyandachievetrainingspeedup comparedtoexistingFLframeworks.Meanwhile,FedAceachievesmodelcompression andcommunicationreductionratioswithnoorlittleaccuracyloss. 
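To make the mask-aggregation idea concrete, the sketch below shows one plausible instantiation in which each client marks the locations of its largest-magnitude parameters and the server keeps a parameter only if enough clients voted for it. The specific choices here (magnitude-based top-k masks, majority voting, flattened weight vectors) are assumptions made for illustration; FedAce's actual mechanism is presented in §3.4.

```python
import numpy as np

def local_mask(weights, keep_ratio=0.3):
    """Client side: mark the locations of the largest-magnitude parameters."""
    k = max(1, int(keep_ratio * weights.size))
    thresh = np.partition(np.abs(weights).ravel(), -k)[-k]
    return (np.abs(weights) >= thresh).astype(np.uint8)

def aggregate_masks(masks, vote_ratio=0.5):
    """Server side: keep a parameter if at least vote_ratio of the clients kept it."""
    votes = np.mean(np.stack(masks), axis=0)
    return (votes >= vote_ratio).astype(np.uint8)

# Toy usage with a flattened weight vector per client.
rng = np.random.default_rng(0)
global_w = rng.normal(size=1000)
client_masks = [local_mask(global_w + 0.1 * rng.normal(size=1000)) for _ in range(8)]
global_mask = aggregate_masks(client_masks)
compressed_w = global_w * global_mask   # parameters outside the mask are pruned to zero
```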
3.2RelatedWork FederatedLearning. Asanemergingfederatedlearninghasrecentlyattractedalotofat- tention.[63]developedafederatedlearningapproachbasedoniterativemodelaveragingtotackle thestatisticalchallenge,especiallyforthemobilesetting.[74]proposedadistributedlearning frameworkbasedonmulti-tasklearninganddemonstratedthattheirapproachisabletoaddress thestatisticalchallengeandisrobusttostragglers.[47]developedstructuredandsketchedupdate techniquestoreduceuplinkcommunicationcostinfederatedlearning. DistributedSGDTraining. Ourworkisrelatedtodistributedtrainingaccelerationinfederated learning.Mostexistingworksarefocusedononreducingcommunicationtimetocutthewall- clocktime.Thisisachievedbyquantizinggradientsusingsmallernumberofbits[81,10,55]or selectingimportantgradientstotransfervia[79,56,33].Anotherresearchaims atincreasingmini-batchsizeforlesscommunication.In[37],theauthorsproposedtouselarger mini-batchsizetoreducecommunicationiterations.Althoughthisapproacheffectivelyreducesthe 76 wall-clocktime,itcomeswiththecostofloweringtheaccuracyofthetrainedmodel.Lastly,deep learningmodelsingeneralcanberepresentedasdirectedacyclicgraphs(DAGs).Thestructures ofDAGsprovideanopportunitytooverlapcomputationoperationsandcommunicationoperations inapipelinedfashiontomaskoutthecommunicationcost.Thisoptimizationtechniquecanbe seenasamechanismtoreducethetotaltrainingtimebyidentifyingtheoptimalgradienttransfer order[25,39,17,87]orusingstaledmodelweights[54]. ModelCompression. Ourworkisalsorelatedtomodelcompression.Deeplearningmodels havebeenincreasinglycomputationallyexpensive.Modelcompressionaimstoreducethecom- putationalintensityofdeepneuralnetworksbypruningoutredundantmodelparameters.Ithas receivedalotofattentionsinrecentyearsduetotheimperativedemandonrunningdeeplearning modelsonresource-constrainededgedevices[24].Themostpopulartechniqueformodelcom- pressionisparameterspruning,wheretheleastimportantparametersareremovedtoreducethe modelsizeandinferencecost.Dependingonthestructureofprunedparameters,therearetwo typesofpruningtechnique.Theisunstructuredpruning,whichfocusesonpruningmodel parameters[24].Althoughunstructuredpruningparametersachieveshighermodelcompression rateandeffectiveatreducingmodelsize,itdoesnotnecessarilyreducecomputationalcostswhen itisappliedtoCNNs.Theotherisstructuredpruning[50],whichandprunesunimpor- tantinCNNs.Structuredpruningachieveslowercompressionrate,itcaneffectivelyreduce runtimecost. Ourapproachalsodiffersfrompriorworkonfederatedlearning,wheretheyfocus oneithertacklingstatisticalandstragglerchallenges[63,74]orreducinguplinkcommunication costalone[47].Ourapproachisalsodifferentfrompriorworkonmodelcompression[24,66] andcommunicationreductionfordistributedlearning[10,81,56],wheretheytreatthemastwo separate tasks.Inotherwords,thenoveltyofourproposedapproachliesatintegratingmodel 77 compressionandcommunicationreductionintoasingleoptimizationframeworkbasedontraining datadistributedatdifferentdevices. 3.3BackgroundandMotivation Inthissection,weintroducethebackgroundknowledgeoffederatedlearning.Thenwecom- parethedifferencebetweenfederatedlearningandstandarddistributedlearning.Finallywedis- cussthedrawbacksofexistingfederatedlearningapproachesandprovideimprovingopportunities forfederatedlearning,whichisthemotivationofFedAce. 
3.3.1 Architecture of Federated Learning

Federated learning is a new training paradigm for training machine learning models using geographically distributed clients under the orchestration of a central server. The training process is conducted in the form of iterative rounds. In each round, the central server selects a group of clients randomly from all the available clients to participate in the training. The central server then sends the latest global model to the selected clients in the current round. After receiving the latest model from the central server, each selected client computes the model update or gradient using its local data, which will be sent to and averaged in the central server. The central server aggregates the model updates or gradients from the clients and updates the global model:

    W^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} W_k^{t+1}    (3.1)

where W^{t+1} is the updated global model for round t+1, K is the number of selected clients that participate in the training, and W_k^{t+1} is the updated model of client k. n_k is the number of data samples in client k, and we have \sum_{k=1}^{K} n_k = n. The overview of the training process is shown in Figure 3.1.

3.3.2 Federated Learning Characteristics

Although both federated learning and distributed training aim at collaboratively training a global model, the setting that federated learning focuses on is fundamentally different from distributed training. We present the key differences between federated learning and distributed training as follows.

Data Heterogeneity. The first difference is that there is data heterogeneity in federated learning. Specifically, the local data across clients are non-i.i.d. distributed. This includes not only the variation in the amount of local data, but also feature skewness and label skewness [43]. In contrast, in distributed training the data is independent and identically distributed across workers and the quantity of local data is the same. Data heterogeneity imposes a significant challenge in federated learning and is regarded as the major cause of model performance loss.

Client Heterogeneity. In federated learning, clients from millions of users across the world could be selected to participate in training. Thus, the hardware and compute capacities of participating clients could be different. Moreover, as the central server only selects a subset of clients to participate in each round, the client heterogeneity also changes from round to round. In distributed training, each worker has the same compute capability.

Network Heterogeneity. In federated learning, each client is unreliably connected to the central server and the bandwidth is small. As the clients are distributed across continents or the world, the network bandwidths between the clients and the central server are heterogeneous. In distributed training, as the workers are located in a local region, the bandwidths among workers are uniform and each worker is reliably connected to the central server.

These differences make existing optimization techniques for distributed SGD training fall short, and highlight the necessity of developing a new training framework to optimize the system performance for federated learning. Table 3.1 summarizes the key differences between distributed training and federated learning.

                          Federated Learning            Distributed Training
Data distribution         non-i.i.d.                    i.i.d.
Number of local data      small, typically < 500        large, typically > 1000
Data accessibility        invisible to other clients    visible to other clients
Number of clients         typically >= 500              typically <= 50
Participating clients     < 10%                         100%
Client location           wide area                     local region
Client connectivity       unreliably connected          always connected
Network bandwidth         low                           high

Table 3.1: Comparison between federated learning and distributed training.

Figure 3.1: Federated learning overview.
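As a concrete reference for the aggregation rule in Equation (3.1), the following Python sketch implements the server-side weighted averaging step. It is only an illustration, not the implementation used in this dissertation: models are assumed to be dictionaries mapping parameter names to NumPy arrays, and the function and variable names are hypothetical.

import numpy as np

def fedavg_aggregate(client_models, client_sizes):
    # Weighted average of the selected clients' updated models (Equation 3.1):
    #   W^{t+1} = sum_k (n_k / n) * W_k^{t+1}
    # client_models: list of dicts {parameter name -> np.ndarray}, one per client
    # client_sizes:  list of local dataset sizes n_k for the same clients
    n = float(sum(client_sizes))
    global_model = {}
    for name, value in client_models[0].items():
        acc = np.zeros_like(value, dtype=np.float64)
        for model, n_k in zip(client_models, client_sizes):
            acc += (n_k / n) * model[name]
        global_model[name] = acc
    return global_model

FedAvg applies this rule at the end of every round; FedAce keeps this aggregation step and instead changes how clients and mini-batches are selected and how the model is compressed.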
3.3.3 Scope and Improving Opportunity

Federated learning is a challenging research area and many problems remain to be solved, such as data non-i.i.d.-ness, slow convergence, and adversarial attacks. In this work, FedAce focuses on addressing three problems in federated learning. Next, we identify and discuss the possible improvement for each problem.

Low Training Efficiency. Current practices in federated learning adopt random sampling in both data selection and client selection. While this simple solution facilitates the implementation and deployment of federated learning applications, the training efficiency is low. This is because the local data at each client contribute differently to training. As there is data heterogeneity and the local data cannot be shared among clients, each client also contributes differently. Consequently, using a random sampling strategy for data and client selection wastes much training effort on unimportant samples and clients. This motivates us to adopt a non-random sampling strategy when we select clients for training and construct mini-batches for local training, in order to increase the training efficiency.

Large Communication Cost. As discussed before, the clients are distributed across the world, and their bandwidths are limited and heterogeneous. Hence, the communication overhead in each round is large, making it prohibitively impractical to train a global model within a short period of time. To mitigate this problem, we should not only increase the training efficiency to reduce the number of training steps, but also reduce the per-round communication cost. Therefore, the second improving opportunity lies in reducing the communication cost to further reduce the total training time.

Low Model Accuracy. In federated learning, the participating clients are usually resource-constrained edge devices (e.g., smartphones), where the network resources and compute resources are limited. Training a large model not only results in slow convergence due to the large per-round communication cost, but also yields a model that is inefficient for inference. Given such resource constraints, the common practice of existing work is to train a small neural network for efficient inference after training. However, the neural networks that achieve state-of-the-art accuracy in most learning tasks (especially difficult ones) are models with large capacities and deep architectures. The current practice of federated learning therefore cannot leverage the superior accuracy brought by large neural networks. As a sanity check, we train a large model and a small model (similar to the large model but with 10% of the weights) under the same federated setting. As shown in Figure 3.2, there is an accuracy gap between the two models. Therefore, it is not trivial to enhance the model performance while keeping the trained model compact and efficient.

Figure 3.2: Training process under federated learning.

3.4 FedAce Design

In this section, we present the detailed design of FedAce. First we describe the system architecture of FedAce. Then we introduce the two algorithmic components, which are the backbones of FedAce.
3.4.1 Overall Architecture

FedAce is a distributed framework that spans the clients and the central server, as illustrated in Figure 3.3. At the client side, the important data sampler selects important data to construct mini-batches to perform local training. After local training completes and before the client sends the locally updated model to the central server, the local compressor computes a local mask that indicates the important parameters to be preserved and the unimportant parameters to be pruned. At the central server side, the important client sampler selects important clients from the available clients to participate in training for the next round. The important client sampler is also responsible for maintaining the importance estimates of the available clients. The global compressor aggregates the local masks sent from the clients into a global mask, and prunes the global model using the global mask. Next we describe the detailed approaches behind these components.

Figure 3.3: Overview of FedAce.

3.4.2 Federated Importance Sampling

Federated learning employs a hierarchical selection strategy to train a global model: the central server selects a subset of clients to participate in training; then each participating client selects important data to perform local training. Accordingly, FedAce adopts federated importance sampling, which consists of two importance samplers to increase training efficiency in a hierarchical way: at the high level on the central server side, it uses the important client sampler to select important clients to participate in training; at the low level on the client side, each client uses data importance sampling to sample important data, as illustrated in Figure 3.4.

Figure 3.4: Overview of federated importance sampling.

3.4.2.1 Data Importance Sampling

The data sampler at each client maintains an estimate of the importance distribution P of its local data D. When constructing mini-batches, the data is sampled from the distribution P without replacement. We employ the data loss as the importance indicator when computing the importance distribution P, as the data loss is a good approximation of the data importance and computing the loss is cheap, as presented in Mercury. Therefore, the importance distribution P can be computed as

    p_i = \frac{l_i}{\sum_{n=1}^{|D|} l_n}    (3.2)

where |D| is the total number of data samples at the local client and l_i is the loss of the i-th sample.

Different from distributed training, the number of data samples in each client is small and the communication process takes a longer time as the clients are distributed around the world. As a result, the cost of computing the losses for all local data can be entirely hidden when it is parallelized with the communication process. More importantly, as federated learning adopts local SGD, where the local model is updated by performing multiple steps and the communication cost is large, the cost of loss computation is comparably small even if it is carried out without parallelization.
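To make the data importance sampler concrete, the following Python sketch draws a mini-batch with probabilities proportional to the per-sample losses, as in Equation (3.2). It is an illustrative sketch rather than FedAce's actual code: the losses are assumed to have been computed (or cached) in a forward pass of the latest model, e.g., overlapped with the communication phase as described above, and the function name and batch size are hypothetical.

import numpy as np

def sample_minibatch_indices(losses, batch_size, rng=None):
    # Loss-proportional sampling without replacement (Equation 3.2):
    #   p_i = l_i / sum_n l_n
    rng = rng or np.random.default_rng()
    losses = np.asarray(losses, dtype=np.float64)
    p = losses / losses.sum()
    return rng.choice(len(losses), size=batch_size, replace=False, p=p)

# Example: pick a mini-batch of 20 samples from a client's local dataset,
# where `losses` holds the most recent per-sample losses.
# batch_indices = sample_minibatch_indices(losses, batch_size=20)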
3.4.2.2 Client Importance Sampling

Instead of selecting clients randomly, the central server in FedAce prioritizes important clients to participate in training. Before each round begins, the central server collects client importance information from the connected clients and selects the important clients based on the collected importance information. To collect client importance, FedAce sends the latest model to the clients. Once a client receives the model sent by the central server for importance collection, it performs inference on all of its local data. The importance of client k is computed as:

    I_k = \bar{L}_k \frac{n_k}{n}    (3.3)

where \bar{L}_k is the average sample loss at client k, n_k is the number of local data samples at client k, and n is the total number of data samples. In other words, a client needs to have a large number of important data samples to be viewed as an important client. Then the probability of selecting a client k is computed as:

    p_k = \frac{I_k}{\sum_{k=1}^{K} I_k}    (3.4)

One limitation of this strategy is that the central server has to send the latest model to all clients so that all clients can estimate their importance and send it back to the central server to identify important clients. This may cause a large overhead because the server has to send the model to all clients even though only some of them will be selected to participate in the training. FedAce adopts two techniques to achieve a good balance between the selection of important clients and the incurred cost. In the first technique, the central server does not update the importance of all clients in every round and reuses the cached client importance. To calibrate the staleness in the cached importance information, the client importance is re-weighted using a staleness term for balancing exploration and exploitation:

    I_k = \bar{L}'_k \frac{n_k}{n} \exp\left( \beta \frac{\| W_k - W \|}{\| W \|} \right)    (3.5)

where \bar{L}'_k represents the last updated average loss of client k, W_k is the model weight used to compute \bar{L}'_k, and W is the latest model weight at the central server. The factor \exp(\beta \| W_k - W \| / \| W \|) is a staleness term that calibrates the importance of a client. If this term is large, it means the difference between W_k and W is large and the model W_k used to compute \bar{L}'_k is stale. In such a case, the central server should prioritize this client to participate in training, as this client has a higher chance of having become more important since the last update. \beta > 0 is a scaling factor determined empirically. A large \beta emphasizes the freshness of the importance estimate and encourages FedAce to focus more on exploration.

The second technique to reduce overhead is to allow a subset of clients to perform local training while allowing another non-overlapping subset of clients to perform importance updates simultaneously. Specifically, the least recently updated clients in the non-overlapping subset will be selected to update their client importance. Since the importance collection process does not depend on the local training process, it can be executed in parallel with the local training process. Because the importance collection process only requires feed-forward inference to obtain the sample losses, and only requires downloading the global model without the need to upload the locally updated model, it takes a shorter time than the local training process. Therefore, we can ensure that before each round begins, the central server has the client importance information updated for client sampling. This selection strategy is able to select important clients while ensuring on-time client importance updates with minimum overhead.
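The following Python sketch illustrates how the staleness-calibrated client importance of Equation (3.5) and the selection probabilities of Equation (3.4) could be computed. It is a simplified illustration under stated assumptions, not FedAce's implementation: model weights are assumed to be flattened into single vectors, and `beta`, the function names, and the number of selected clients are hypothetical.

import numpy as np

def client_importance(avg_loss, n_k, n_total, w_cached, w_latest, beta=1.0):
    # Staleness-calibrated client importance (Equation 3.5). `avg_loss` is the
    # cached average sample loss computed with the (possibly stale) weights
    # `w_cached`; `w_latest` is the current global model at the server.
    drift = np.linalg.norm(w_cached - w_latest) / np.linalg.norm(w_latest)
    return avg_loss * (n_k / n_total) * np.exp(beta * drift)

def select_clients(importances, num_selected, rng=None):
    # Sample clients with probability proportional to importance (Equation 3.4).
    rng = rng or np.random.default_rng()
    p = np.asarray(importances, dtype=np.float64)
    return rng.choice(len(p), size=num_selected, replace=False, p=p / p.sum())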
3.4.3 Federated Model Compression

Federated model compression aims to compress the model in each round in order to save per-round cost, and to obtain an efficient and compact model after training completes. To compress the model, we need to identify and prune unimportant parameters. To achieve that, we compute a binary mask that indicates the unimportant parameters in the model (0 for unimportant parameters and 1 for important parameters). In each round, FedAce computes such a binary mask and applies it to the global model to obtain a pruned model. The pruned model is then used to proceed with the training of the next round. Next, we describe the generation of the mask and two techniques for maintaining accuracy after compression.

Mask Generation. At a high level, the generation of the mask consists of three steps, as shown in Figure 3.5. First, after receiving the latest model, each client not only trains the local model using its own data, but also computes a local mask. The local mask is sent to the central server together with the updated model when local training ends. Then, the central server aggregates the local masks to generate the global mask, which is applied to the aggregated global model.

Figure 3.5: Overview of federated model compression.

FedAce adopts unstructured pruning and uses a magnitude-based approach to determine the important parameters. Specifically, the local mask at client k can be computed as:

    M_k = \sigma(W_k^{\mathrm{abs}}; s)    (3.6)

where W_k^{\mathrm{abs}} is obtained by taking the absolute value of the elements of the weights W_k at client k, and \sigma(W; s) is a thresholding function where the elements of W that are greater than s are set to one and to zero otherwise.

After the local masks are generated, they are sent to the central server together with the model weights. Then, the central server not only aggregates the model weights as it normally does in federated learning, but also aggregates the local masks to create a global mask. We use majority vote as the aggregation method for the local masks. Specifically, FedAce sums the collected local masks together. Summed elements smaller than a threshold s in the summed mask indicate that the corresponding weights of the global model will be pruned, based on the agreement of the majority of the clients:

    M_g = \sigma\left( \sum_{k=1}^{K} M_k; s \right)    (3.7)

In other words, the parameters that most local masks agree to preserve are kept and the others are pruned. After the global mask M_g is computed, the pruned global model is obtained by

    W_g = W \odot M_g    (3.8)

where \odot is the element-wise multiplication. Using masks brings extra communication cost to the training, as the clients need to upload the local masks to the central server. To reduce this cost, we use a bit operation to encode the binary mask into the last bit of each weight. Changing the last bit does not affect the model accuracy, which has been verified in many prior works on model quantization [92].
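As an illustration of the mask generation and aggregation just described (Equations 3.6 to 3.8), the Python sketch below computes a magnitude-based local mask, combines local masks by majority vote, and applies the resulting global mask. It is a simplified, per-tensor sketch under the assumption that weights are plain NumPy arrays; the threshold names and function names are hypothetical, and details such as the bit-level mask encoding are omitted.

import numpy as np

def local_mask(weights, threshold):
    # Magnitude-based local mask (Equation 3.6): 1 keeps a weight, 0 prunes it.
    return (np.abs(weights) > threshold).astype(np.uint8)

def global_mask(local_masks, vote_threshold):
    # Majority-vote aggregation (Equation 3.7): a parameter is kept only if the
    # number of local masks voting to keep it exceeds `vote_threshold`.
    votes = np.sum(np.stack(local_masks, axis=0), axis=0)
    return (votes > vote_threshold).astype(np.uint8)

def prune_global_model(global_weights, mask):
    # Apply the global mask to the aggregated global model (Equation 3.8).
    return global_weights * mask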
One problem of federated model compression is that the model accuracy after compression may drop, especially in the presence of data non-i.i.d.-ness, which causes the clients to disagree with each other on the pattern of the optimal mask. To address this issue, we propose two techniques to maintain the accuracy. The first is the compression rate schedule, and the second is mask caching.

Compression Rate Schedule. One key prerequisite for accurately identifying the unimportant weights to prune is to train the model to near convergence, when the weights become stable. However, this prerequisite does not hold in the case of federated model compression because pruning is performed throughout the training process, where the model weights may not yet be stabilized. As a result, generating local masks from local models that have not converged may falsely remove important weights and preserve redundant weights.

To satisfy this prerequisite, FedAce incorporates a warm-up stage and an adaptive model sparsity policy during the federated learning process. Specifically, FedAce does not perform model compression in the warm-up stage. The goal is to allow the model to quickly learn to distinguish important from unimportant parameters. After the warm-up stage, federated model compression begins and the compression rate increases exponentially. In doing so, we allow the compressed model to regain accuracy gradually, especially in the late stage where mistakenly pruning an important parameter leads to a significant accuracy drop.

Mask Caching. Given the non-i.i.d. characteristic of the federated paradigm, the local masks of the clients may have different views on the importance of the model weights. This issue is exacerbated by the practice that at each round of FL, only a subset of the clients is selected to participate in the training process. If the global mask is generated at the end of every round, the global mask could easily be biased by that particular subset of the clients, and thus deviate from the optimal global mask. As a consequence, the global model pruned using the biased global mask could lead to sub-optimal performance.

To address this challenge, FedAce caches the received local masks in the central server over a number of rounds, from more clients, before generating the global mask. As such, the local masks collected from a larger pool of clients are able to capture a better global view of the importance of the model weights. Figure 3.6 illustrates the idea of mask caching. Specifically, there is no caching when the cache size equals 1 (left chart in Figure 3.6). When the cache size equals 2, the masks are cached and aggregated every two rounds (right chart in Figure 3.6).

Figure 3.6: Examples of mask caching.

3.5 Experiments

In this section, we evaluate the effectiveness and performance of FedAce under federated learning on four federated datasets across computer vision, speech recognition, and natural language modeling tasks.

3.5.1 Setup

Implementation and hardware environment. We have implemented FedAce using FedML [26]. FedML is a research library and benchmark for federated machine learning. Users can develop variants of federated learning algorithms using this framework. We implemented FedAce using 900 lines of Python code. All the experiments are run on a desktop server with 4 RTX-8000 GPUs.

Datasets and deep learning models.

1) The first dataset is Fed-CIFAR10, which is constructed by partitioning CIFAR-10 into 1000 shards to simulate 1000 clients. In order to simulate the data heterogeneity across devices, we partition the data shards using latent Dirichlet allocation (LDA) [34] so that each shard has an imbalanced label distribution, also known as label skewness [43] (a sketch of this partitioning scheme is given after this list). We use ResNet18 [28] on this dataset. Because the statistics parameters of batch normalization rely on local data, we follow [67] to replace the batch normalization layers with group normalization layers.

2) The second dataset is FEMNIST [67]. It has 749,068 images of digits and English characters, with a total of 62 classes. The dataset is partitioned into 3400 shards with different shard sizes. For this dataset, we use a CNN model with two convolutional layers and two linear layers.

3) The third dataset is Fed-TSC, which is constructed by partitioning the Speech Command dataset into 1000 shards. This dataset consists of 105,829 audio utterances of 35 short words, recorded by a variety of different people [75]. Similarly, we partition the data shards using the LDA distribution to simulate label skewness. We convert the audio into spectrograms and use VGG-13 for this dataset.

4) The fourth dataset is StackOverflow [67]. StackOverflow is a language modeling dataset. It consists of questions and answers from the website Stack Overflow. The dataset contains 342,477 unique users, which we use as clients. We perform next-word prediction on this task. We use a one-layer LSTM with a hidden dimension of 670 to predict the next word in a sentence. Each word is embedded in a 96-dimensional space and fed to the RNN. The details of the datasets and the models used can be found in [67].
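For reference, the sketch below shows one common way to realize the LDA/Dirichlet-style label-skewed partitioning mentioned above, in the spirit of [34]. It is illustrative only and not necessarily the exact procedure used to build Fed-CIFAR10 or Fed-TSC; the concentration parameter `alpha`, the seed, and the function name are assumptions, with a smaller `alpha` producing a more skewed (more non-i.i.d.) label distribution.

import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.5, seed=0):
    # For each class, draw the share of its samples assigned to every client
    # from Dirichlet(alpha), then split that class's indices accordingly.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    shards = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        split_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, split_points)):
            shard.extend(part.tolist())
    return shards  # shards[k] holds the sample indices assigned to client k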
Baselines. While there are many works aiming at improving federated learning, most of these works are focused on addressing data non-i.i.d.-ness [45, 53] or preserving user privacy. The approaches proposed in FedAce are complementary to these works. Therefore, we adopt FedAvg as the baseline in this experiment, which is a state-of-the-art training method for federated learning, where each client trains for one or more epochs and the local model updates are sent to and averaged in the remote central server. For both FedAvg and FedAce, we use the same training hyperparameters. Specifically, at the client side, we adopt the SGD optimizer with a learning rate of 0.01 and cosine-annealing learning rate decay [59]. The batch size is 20. At the server side, we use the Adam optimizer with a learning rate of 0.001. The number of local epochs is set to 1.

3.5.2 Results

We show that FedAce achieves superior performance compared to FedAvg by highlighting the key results:

• FedAce achieves 3.24× to 6.54× speedup over FedAvg without loss of accuracy.
• FedAce reduces network traffic by 2.4× to 3.44× in each round.
• FedAce obtains a compact and efficient sparse model after training, with 17.2% to 41.2% of the original model size.

3.5.2.1 End-to-End Performance Comparison

We compare the end-to-end performance of FedAce and the baseline. Table 3.2 shows the speedups obtained by FedAce over FedAvg. The overall speedups are 5.28×, 4.7×, 6.54×, and 3.24× on Fed-CIFAR10, Femnist, Fed-TSC, and StackOverflow, respectively. Specifically, the speedups come from two sources. The first is the reduction in the number of rounds, contributed by adopting the importance-aware data selection and client selection strategies. The second source is the reduction in the communication cost, contributed by compressing the model. All of the speedups are obtained by FedAce without hurting accuracy, which indicates that FedAce can increase the training efficiency and reduce the training overhead.

Dataset        #Rounds Reduction   Avg. Comm. Cost Reduction   Speedup   Final Model Sparsity
Fed-CIFAR10    2.2×                2.4×                        5.28×     69.3%
Femnist        1.7×                2.76×                       4.7×      75.3%
Fed-TSC        1.9×                3.44×                       6.54×     82.8%
StackOverflow  1.6×                2.03×                       3.24×     58.8%

Table 3.2: End-to-end performance. The numbers indicate the speedups of FedAce over FedAvg.

3.5.2.2 Sensitivity Analysis

In this section, we perform a sensitivity analysis to show that FedAce is robust to various hyperparameters.

Impact of Number of Local Epochs. We analyze the impact of the number of local epochs. We run FedAce with different numbers of local epochs on Femnist in this experiment. The result is shown in Figure 3.7. We can see that when the number of local epochs increases, the speedups obtained by FedAce become smaller. This is because when the number of local epochs increases, the local model drifts further from the globally optimal model. As a result, the estimated importance distributions of local data and clients are less accurate, limiting the improvement brought by importance sampling. Furthermore, as the number of local epochs increases, the mask of the locally trained model is encoded with more local information, making it diverge from the masks generated by other clients. FedAce therefore needs more rounds to achieve the same accuracy. Even though the speedups become smaller, FedAce is still able to achieve a speedup over the baseline.

Figure 3.7: Performance under different numbers of epochs.

Impact of Number of Clients.
We then study the impact of the number of clients. We run FedAce with different numbers of clients on Femnist in this experiment. The number of clients affects the amount of local data on each device, and thus the non-i.i.d.-ness of the local data is also affected. We do not enable model compression in this experiment in order to eliminate its effect. Figure 3.8 shows the speedups brought by data sampling and client sampling under various numbers of clients. We can see that when the number of clients increases, the speedup achieved by data importance sampling becomes smaller, while the speedup achieved by client importance sampling increases. This is because when the number of clients increases, the importance diversity in the local data is small and data sampling does not provide much performance gain even if important data is selected. However, in this case, the speedup achieved by client sampling becomes larger, since the importance diversity of clients becomes dominant. This demonstrates the need to combine the two sampling strategies to achieve importance sampling in federated learning.

Figure 3.8: Performance gain of FedAce in three federated learning datasets.

Compression Rate Schedule. In this experiment, we analyze the effect of the compression rate schedule. We compare the proposed compression rate schedule with a constant compression rate (set to the final sparsity from the beginning till the end).

Table 3.3 and Table 3.4 show the accuracy of different compression rate schedule strategies on Femnist and StackOverflow. We can see that with the same sparsity, the constant schedule suffers from a significant accuracy drop. FedAce with the warm-up strategy only is able to mitigate the accuracy loss. FedAce with the warm-up and exponential-increase schedule suffers from the least accuracy drop. This is because by gradually increasing the compression rate, FedAce has more time to regain accuracy after pruning. This demonstrates that the compression rate schedule is effective in maintaining accuracy.

Dataset        Compression Rate Schedule         Model Sparsity   Accuracy Change (%)
Femnist        Constant                          75.3%            -1.93
               Warm-up only                      75.3%            -0.64
               Warm-up + exponential increase    75.3%            -0.05
StackOverflow  Constant                          58.8%            -0.92
               Warm-up only                      58.8%            -0.32
               Warm-up + exponential increase    58.8%            -0.02

Table 3.3: Performance of the compression rate schedule.

Dataset        Compression Rate Schedule                    Model Sparsity   Accuracy Drop
Femnist        Constant                                     90%              -3.32
               Schedule (warm-up only)                      90%              -2.92
               Schedule (warm-up + exponential increase)    90%              -2.53
StackOverflow  Constant                                     90%              -2.03
               Schedule (warm-up only)                      90%              -1.23
               Schedule (warm-up + exponential increase)    90%              -0.49

Table 3.4: Performance of the compression rate schedule under a high compression rate.

Number of Cached Masks. Finally, we study the impact of the number of cached masks in the central server. Figure 3.9 plots the accuracy of FedAce using different numbers of cached masks on two datasets. We have two observations. On one hand, using a larger number of cached masks can achieve better accuracy than using no cached masks (cache = 1). For example, using cache = 5 on the Femnist dataset achieves higher accuracy, and using cache = 10 achieves higher accuracy on StackOverflow. The reason is that, by increasing the number of cached masks, FedAce is able to aggregate more local information in order to determine the globally optimal mask. On the other hand, continuing to increase the number of cached masks may hurt accuracy. For example, using cache = 10 on Femnist and cache = 20 on StackOverflow is worse than using cache = 1. This is because the masks are computed by local models, which are updated in each round. Aggregating too many local masks computed in different rounds introduces staleness into the local mask aggregation.

Figure 3.9: Test accuracy comparison under different mask cache settings. (a) Femnist. (b) StackOverflow.
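To make the mask caching mechanism concrete, the sketch below shows one way a server could accumulate local masks and only regenerate the global mask every few rounds, as in the cache settings compared in Figure 3.9. This is an illustrative sketch, not FedAce's actual implementation; the class and parameter names are hypothetical, and the majority-vote threshold is assumed to be supplied by the caller.

import numpy as np

class MaskCache:
    # Server-side cache of local masks: the global mask is regenerated only
    # every `cache_size` rounds, using all masks received since the last
    # aggregation, so it reflects a larger pool of clients.
    def __init__(self, cache_size, vote_threshold):
        self.cache_size = cache_size
        self.vote_threshold = vote_threshold
        self.cached_masks = []
        self.current_mask = None  # no pruning until the first aggregation

    def add_round(self, round_masks):
        # round_masks: binary masks received from this round's selected clients.
        self.cached_masks.extend(round_masks)

    def maybe_aggregate(self, round_idx):
        # Aggregate by majority vote once every `cache_size` rounds.
        if (round_idx + 1) % self.cache_size == 0 and self.cached_masks:
            votes = np.sum(np.stack(self.cached_masks, axis=0), axis=0)
            self.current_mask = (votes > self.vote_threshold).astype(np.uint8)
            self.cached_masks = []
        return self.current_mask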
3.6 Conclusion

In this work, we propose FedAce, an efficient training framework for federated learning. FedAce addresses two key problems in federated learning. The first problem is that federated learning requires a large number of communication rounds due to low sampling efficiency in both data selection and client selection. The second problem is that the model trained in federated learning has to be a small model due to inference speed requirements and constrained bandwidth, resulting in poor performance. To address the above challenges, FedAce has two components: an importance sampler that selects important data for local training and important clients in each round to increase training efficiency, and a model compressor that progressively compresses the large model. Experiment results on four federated learning datasets show that FedAce can reduce the training overhead by 3.2× to 6.5× without accuracy drop.

Chapter 4

Conclusion

In this dissertation, we propose three collaborative distributed deep learning systems for the edge: Distream, Mercury, and FedAce, to enable deep learning-based serving and training for distributed edge systems, respectively. Distream addresses the key challenges brought by workload dynamics in real-world deployments, and contributes novel techniques that complement existing live video analytics systems. Mercury is a distributed training framework for efficient on-device deep learning under the distributed edge setting. Mercury improves training quality and reduces convergence wall-clock time without additional overhead by adapting to bandwidth changes using three novel techniques: group-wise importance sampling, importance-aware resharding, and overlapping. FedAce is an efficient federated learning system that improves training efficiency by combining data and client importance sampling with adaptive model compression.

Through experiments, we show that these three systems outperform state-of-the-art approaches and that existing solutions can benefit from the proposed techniques for additional performance gains. Therefore, we believe these three works represent a significant contribution to enabling large-scale deep learning serving and training systems on distributed edge architectures.

BIBLIOGRAPHY

[1] Nvidia Titan X, 2016.

[2] 24-port gigabit stackable smart managed switch with 4 10GbE SFP+ ports, 2017.

[3] Nvidia Jetson TX1, 2017.

[4] OpenCV background subtraction, 2017.

[5] Networking solutions for IP surveillance, 2018.

[6] Nvidia Jetson TX2, 2018.

[7] Jackson Hole, 2019.

[8] F. Akgul. ZeroMQ. Packt Publishing Ltd, 2013.

[9] G. Alain, A. Lamb, C. Sankar, A. Courville, and Y. Bengio. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.

[10] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proceedings of the Advances in Neural Information Processing Systems, pages 1709–1720, 2017.

[11] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[12] D. Bastos, M. Shackleton, and F. El-Moussa. Internet of things: A survey of technologies and security risks in smart home and city environments. 2018.

[13] C. Canel, T. Kim, G. Zhou, C. Li, H. Lim, D. G. Andersen, M. Kaminsky, and S. R. Dulloor. Scaling video analytics on constrained edge nodes. arXiv preprint arXiv:1905.13536, 2019.

[14] B.-G. Chun, S. Ihm, P. Maniatis, M. Naik, and A. Patti. CloneCloud: elastic execution between mobile device and cloud. In Proceedings of the Sixth Conference on Computer Systems, pages 301–314, 2011.

[15] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica. Clipper: A low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 613–627, 2017.

[16] E. Cuervo, A. Balasubramanian, D.-k. Cho, A. Wolman, S. Saroiu, R. Chandra, and P. Bahl.
Maui: making smartphones last longer with code offload. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, pages 49–62, 2010.

[17] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems, page 4. ACM, 2016.

[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[19] S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. arXiv preprint arXiv:1803.01113, 2018.

[20] B. Fang, X. Zeng, F. Zhang, H. Xu, and M. Zhang. FlexDNN: Input-adaptive on-device deep learning for efficient mobile vision. In Proceedings of the 5th ACM/IEEE Symposium on Edge Computing (SEC), 2020.

[21] B. Fang, X. Zeng, and M. Zhang. NestDNN: Resource-aware multi-tenant on-device deep learning for continuous mobile vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking (MobiCom), pages 115–127, 2018.

[22] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[23] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560, 2015.

[24] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[25] S. H. Hashemi, S. A. Jyothi, and R. H. Campbell. TicTac: Accelerating distributed deep learning with communication scheduling. In Proceedings of the 2nd SysML Conference, 2019.

[26] C. He, S. Li, J. So, M. Zhang, H. Wang, X. Wang, P. Vepakomma, A. Singh, H. Qiu, L. Shen, et al. FedML: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518, 2020.

[27] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.

[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[29] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.

[30] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[31] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[32] K. Hsieh, G. Ananthanarayanan, P. Bodik, P. Bahl, M. Philipose, P. B. Gibbons, and O. Mutlu. Focus: Querying large video datasets with low latency and low cost. arXiv preprint arXiv:1801.03493, 2018.

[33] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O. Mutlu. Gaia: Geo-distributed machine learning approaching LAN speeds. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 629–647, 2017.

[34] T.-M. H. Hsu, H. Qi, and M. Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

[35] C.-C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu, P. Bahl, and M. Philipose. VideoEdge: Processing camera streams using hierarchical clusters. In 2018 IEEE/ACM Symposium on Edge Computing (SEC), pages 115–131. IEEE, 2018.
[36] L. N. Huynh, Y. Lee, and R. K. Balan. DeepMon: Mobile GPU-based deep learning framework for continuous vision applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 82–95, 2017.

[37] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2592–2600, 2016.

[38] S. Jain, G. Ananthanarayanan, J. Jiang, Y. Shu, and J. Gonzalez. Scaling video analytics systems to large camera deployments. In Proceedings of the 20th International Workshop on Mobile Computing Systems and Applications, pages 9–14, 2019.

[39] A. Jayarajan, J. Wei, G. Gibson, A. Fedorova, and G. Pekhimenko. Priority-based parameter propagation for distributed DNN training. arXiv preprint arXiv:1905.03960, 2019.

[40] J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica. Chameleon: scalable adaptation of video analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 253–266. ACM, 2018.

[41] J. Jiang, Y. Zhou, G. Ananthanarayanan, Y. Shu, and A. A. Chien. Networked cameras are the new big data clusters. In Proceedings of the 2019 Workshop on Hot Topics in Video Analytics and Intelligent Edges, pages 1–7, 2019.

[42] S. Jiang, Z. Ma, X. Zeng, C. Xu, M. Zhang, C. Zhang, and Y. Liu. Scylla: QoE-aware continuous mobile vision with FPGA-based dynamic deep neural network reconfiguration. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 1369–1378. IEEE, 2020.

[43] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

[44] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11):1586–1597, 2017.

[45] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.

[46] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling. arXiv preprint arXiv:1803.00942, 2018.

[47] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[48] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[49] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.

[50] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[51] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, CO, Oct. 2014. USENIX Association.

[52] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014.

[53] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.

[54] Y. Li, M. Yu, S. Li, S. Avestimehr, N. S. Kim, and A. Schwing. Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training. In Proceedings of the Advances in Neural Information Processing Systems, pages 8045–8056, 2018.
[55] H. Lim, D. G. Andersen, and M. Kaminsky. 3LC: Lightweight and effective traffic compression for distributed machine learning. arXiv preprint arXiv:1802.07389, 2018.

[56] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.

[57] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[58] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.

[59] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[60] L. Ma, D. Van Aken, A. Hefny, G. Mezerhane, A. Pavlo, and G. J. Gordon. Query-based workload forecasting for self-driving database management systems. In Proceedings of the 2018 International Conference on Management of Data, pages 631–645. ACM, 2018.

[61] X. Ma, H. Zhong, Y. Li, J. Ma, Z. Cui, and Y. Wang. Forecasting transportation network speed using deep capsule networks with nested LSTM models. IEEE Transactions on Intelligent Transportation Systems, 2020.

[62] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

[63] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[64] Q. Meng, W. Chen, Y. Wang, Z.-M. Ma, and T.-Y. Liu. Convergence analysis of distributed stochastic gradient descent with shuffling. arXiv preprint arXiv:1709.10432, 2017.

[65] S. A. Noghabi, L. Cox, S. Agarwal, and G. Ananthanarayanan. The emerging landscape of edge computing. GetMobile: Mobile Computing and Communications, 23(4):11–20, 2020.

[66] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[67] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.

[68] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[69] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[70] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[71] H. Shen, L. Chen, Y. Jin, L. Zhao, B. Kong, M. Philipose, A. Krishnamurthy, and R. Sundaram. Nexus: a GPU cluster engine for accelerating DNN-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 322–337, 2019.

[72] R. Shokri and V. Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1310–1321. ACM, 2015.

[73] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[74] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, pages 4427–4437, 2017.

[75] Speech command dataset, 2018.

[76] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[77] S. Venkataraman, A. Panda, K. Ousterhout, M. Armbrust, A. Ghodsi, M. J. Franklin, B. Recht, and I. Stoica. Drizzle: Fast and adaptable stream processing at scale. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 374–389. ACM, 2017.

[78] J. Wang and G. Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv preprint arXiv:1810.08313, 2018.

[79] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Proceedings of the Advances in Neural Information Processing Systems, pages 1299–1309, 2018.

[80] P. Watcharapichat, V. L. Morales, R. C. Fernandez, and P. Pietzuch. Ako: Decentralised deep learning with partial gradient exchange. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 84–97. ACM, 2016.

[81] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.

[82] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2016.

[83] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu. AID: A benchmark dataset for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017.

[84] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.

[85] B. Zhang, X. Jin, S. Ratnasamy, J. Wawrzynek, and E. A. Lee. AWStream: adaptive wide-area streaming analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 236–252. ACM, 2018.

[86] H. Zhang, G. Ananthanarayanan, P. Bodik, M. Philipose, P. Bahl, and M. J. Freedman. Live video analytics at scale with approximation and delay-tolerance. In NSDI, volume 9, page 1, 2017.

[87] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 181–193, 2017.

[88] M. Zhang, F. Zhang, N. D. Lane, Y. Shu, X. Zeng, B. Fang, S. Yan, and H. Xu. Deep learning in the era of edge computing: Challenges and opportunities. Fog Computing: Theory and Practice, pages 67–78, 2020.

[89] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. Advances in Neural Information Processing Systems, 28:685–693, 2015.

[90] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Proceedings of the Advances in Neural Information Processing Systems, pages 649–657, 2015.

[91] P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the International Conference on Machine Learning, pages 1–9, 2015.

[92] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.

[93] Z. Zivkovic and F. Van Der Heijden. Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters, 27(7):773–780, 2006.