COLLABORATIVE LEARNING: THEORY, ALGORITHMS, AND APPLICATIONS

By

Kaixiang Lin

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2020

ABSTRACT

COLLABORATIVE LEARNING: THEORY, ALGORITHMS, AND APPLICATIONS

By

Kaixiang Lin

Human intelligence prospers with the advantage of collaboration. To solve one or a set of challenging tasks, we can effectively interact with peers, fuse knowledge from different sources, and continuously inspire, contribute, and develop the expertise for the benefit of shared objectives. Human collaboration is flexible, adaptive, and scalable: it supports various cooperative constructions, collaboration across interdisciplinary and even seemingly unrelated domains, and the building of large-scale, disciplined organizations for extremely complex tasks. On the other hand, while machine intelligence has achieved tremendous success in the past decade, its ability to collaboratively solve complicated tasks is still limited compared to human intelligence.

In this dissertation, we study the problem of collaborative learning - building flexible, generalizable, and scalable collaborative strategies to facilitate the efficiency of learning one or a set of objectives. Towards this goal, we investigate the following concrete and fundamental problems: 1. In the context of multi-task learning, can we enforce flexible forms of interactions among multiple tasks and adaptively incorporate human expert knowledge to guide the collaboration? 2. In reinforcement learning, can we design methods that collaborate effectively among heterogeneous learning agents to improve sample efficiency? 3. In multi-agent learning, can we develop a scalable collaborative strategy to coordinate a massive number of learning agents accomplishing a shared task? 4. In federated learning, can we obtain a provable benefit from increasing the number of collaborating learning agents?

This thesis provides the first line of research that views the above learning fields in a unified framework, which includes novel algorithms for flexible, adaptive collaboration, real-world applications using scalable collaborative learning solutions, and fundamental theories that propel the understanding of collaborative learning.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor, Dr. Jiayu Zhou, for his advice, encouragement, inspiration, and endless support for my research and career. Throughout the past five years at Michigan State University, Dr. Zhou has always influenced me with his curiosity, passion, and persistence in research. He is willing to discuss the grand picture of the research and provide constructive suggestions on the technical details. Meanwhile, despite being creative and productive, he also gives me the freedom to work on a variety of problems, even some that are not aligned with his interests. I would like to thank Drs. Jiliang Tang, Zhaojian Li, and Anil K. Jain for serving on my thesis committee.

I am very happy to have had the opportunity to collaborate with a wonderful group of colleagues, faculty, and researchers throughout my Ph.D. For the work presented in this dissertation, I enjoyed working with Dr. Jianpeng Xu, Dr. Inci M. Baytas, Dr. Shuiwang Ji, Dr. Shu Wang, Renyu Zhao, Dr. Zhe Xu, Zhaonan Qu, Dr. Zhaojian Li, Dr. Zhengyuan Zhou, and Dr. Jiayu Zhou. I thank them for their contributions and for everything they have taught me. Besides the work presented in this thesis, I also had the pleasure of working with many outstanding researchers, including Liyang Xie, Dr. Fei Wang, Dr. Pang-Ning Tan, Fengyi Tang, Ikechukwu Uchendu, Boyang Liu, Ding Wang, Zhuangdi Zhu, and Dr. Bo Dai. I would like to thank all of my amazing colleagues in ILLIDAN lab: Qi Wang, Dr. Inci M. Baytas, Liyang Xie, Mengying Sun, Fengyi Tang, Boyang Liu, Zhuangdi Zhu, Junyuan Hong, Xitong Zhang, and Ikechukwu Uchendu for a collaborative, friendly, and productive environment.
I also want to express my sincere thanks to the amazing colleagues I met during my internships, including Dr. Pinghua Gong, Wei Chen, Guojun Wu, Zhengtian Xu, Hongyu Zheng, Jintao Ke, Huaxiu Yao, Dan Wang, Lili Cao, Lingkai Yang, Qiqi Wang, Dr. Yaguang Li, Dr. Peng Wang, Dr. Jie Wang, Chao Tao, Dr. Jia Chen, and Dr. Youjie Zhou. Many thanks to Dr. Pinghua Gong and Dr. Peng Wang for hosting me as an intern at Didi Chuxing in 2017 and 2018. I am also most thankful to Dr. Jia Chen and Dr. Youjie Zhou for their patience and endless help during my internship at Google in 2019.

Finally, I thank my parents for their unconditional love and support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1  Introduction
  1.1  Dissertation Contributions
    1.1.1  Model-driven collaboration
    1.1.2  Data-driven collaboration
    1.1.3  Large-scale Collaborative Multi-agent Learning
    1.1.4  The Provable Advantage of Collaborative Learning
  1.2  Dissertation Structure

Chapter 2  Background
  2.1  Collaborative Learning Problem Formulation
  2.2  A Taxonomy of Collaboration
    2.2.1  Model-Driven Collaboration
    2.2.2  Data-driven Collaboration
    2.2.3  Collaborative Multi-agent Learning

Chapter 3  Model-Driven Collaborative Learning
  3.1  Multi-Task Feature Interaction Learning
    3.1.1  Introduction
    3.1.2  Related Work
    3.1.3  Task relatedness in high order feature interactions
    3.1.4  Formulations and algorithms of the two MTIL approaches
      3.1.4.1  Preliminary
      3.1.4.2  Shared Interaction Approach
      3.1.4.3  Embedded Interaction Approach
    3.1.5  Experiments
    3.1.6  Synthetic Dataset
      3.1.6.1  Effectiveness of modeling feature interactions
      3.1.6.2  Effectiveness of MTIL
    3.1.7  School Dataset
    3.1.8  Modeling Alzheimer's Disease
    3.1.9  Discussion
  3.2  Multi-Task Relationship Learning
    3.2.1  Introduction
    3.2.2  Related Work
    3.2.3  Interactive Multi-Task Relationship Learning
      3.2.3.1  Revisit the Multi-task Relationship Learning
      3.2.3.2  The iMTRL Framework
      3.2.3.3  A knowledge-aware extension of MTRL
      3.2.3.4  Efficient Optimization for kMTRL
      3.2.3.5  Batch Mode Pairwise Constraints Active Learning
    3.2.4  Experiments
      3.2.4.1  Importance of High-Quality Task Relationship
      3.2.4.2  Effectiveness of Query Strategy
      3.2.4.3  Interactive Scheme for Query Strategy
      3.2.4.4  Performance on Real Datasets
    3.2.5  Case Study: Brain Atrophy and Alzheimer's Disease

Chapter 4  Data-Driven Collaborative Learning
  4.1  Collaborative Deep Reinforcement Learning
    4.1.1  Introduction
    4.1.2  Related Work
    4.1.3  Background
      4.1.3.1  Reinforcement Learning
      4.1.3.2  Asynchronous Advantage Actor-Critic algorithm (A3C)
      4.1.3.3  Knowledge distillation
    4.1.4  Collaborative deep reinforcement learning framework
    4.1.5  Collaborative deep reinforcement learning
    4.1.6  Deep knowledge distillation
    4.1.7  Collaborative Asynchronous Advantage Actor-Critic
    4.1.8  Experiments
      4.1.8.1  Training and Evaluation
      4.1.8.2  Certificated Homogeneous Transfer
      4.1.8.3  Certificated Heterogeneous Transfer
      4.1.8.4  Collaborative Deep Reinforcement Learning
  4.2  Ranking Policy Gradient
    4.2.1  Introduction
    4.2.2  Related works
    4.2.3  Notations and Problem Setting
    4.2.4  Ranking Policy Gradient
    4.2.5  Off-policy Learning as Supervised Learning
    4.2.6  An algorithmic framework for off-policy learning
    4.2.7  Sample Complexity and Generalization Performance
    4.2.8  Supervision stage: Learning efficiency
    4.2.9  Exploration stage: Exploration efficiency
    4.2.10  Joint Analysis Combining Exploration and Supervision
    4.2.11  Experimental Results
    4.2.12  Ablation Study
    4.2.13  Conclusion

Chapter 5  Collaborative Multi-Agent Learning
  5.1  Introduction
  5.2  Related Works
  5.3  Problem Statement
  5.4  Contextual Multi-Agent Reinforcement Learning
    5.4.1  Independent DQN
    5.4.2  Contextual DQN
    5.4.3  Contextual Actor-Critic
  5.5  Efficient allocation with linear programming
  5.6  Simulator Design
  5.7  Experiments
    5.7.1  Experimental settings
    5.7.2  Performance comparison
    5.7.3  On the Efficiency of Reallocations
    5.7.4  The effectiveness of averaged reward design
    5.7.5  Ablations on policy context embedding
    5.7.6  Ablation study on grouping the locations
    5.7.7  Qualitative study
  5.8  Conclusion

Chapter 6  The Provable Advantage of Collaborative Learning
  6.1  Introduction
  6.2  Setup
    6.2.1  The Federated Averaging (FedAvg) Algorithm
    6.2.2  Assumptions
  6.3  Linear Speedup Analysis of FedAvg
    6.3.1  Strongly Convex and Smooth Objectives
    6.3.2  Convex Smooth Objectives
  6.4  Linear Speedup Analysis of Nesterov Accelerated FedAvg
    6.4.1  Strongly Convex and Smooth Objectives
    6.4.2  Convex Smooth Objectives
  6.5  Geometric Convergence of FedAvg in the Overparameterized Setting
    6.5.1  Geometric Convergence of FedAvg in the Overparameterized Setting
    6.5.2  Overparameterized Linear Regression Problems
  6.6  Numerical Experiments

Chapter 7  Conclusion

APPENDICES
  Appendix A  Ranking Policy Gradient
  Appendix B  Federated Learning

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Examples of common smooth loss functions.

Table 3.2: Performance comparison of MTIL and baselines on the School dataset.

Table 3.3: Performance comparison of MTIL and baselines on the ADNI dataset.

Table 3.4: The average RMSE of the query and random strategies on the testing dataset over 5 random splits of training and validation samples.

Table 3.5: The RMSE comparison of kMTRL and baselines.

Table 3.6: The names of the brain regions in Figure 3.8, where (C) denotes cortical parcellation and (W) denotes white matter parcellation.

Table 4.1: Notations for Section 4.2.

Table 5.1: Performance comparison of competing methods in terms of GMV and order response rate without reposition cost.

Table 5.2: Performance comparison of competing methods in terms of GMV, order response rate (ORR), and return on investment (ROI) in Xian, considering reposition cost.

Table 5.3: Performance comparison of competing methods in terms of GMV, order response rate (ORR), and return on investment (ROI) in Wuhan, considering reposition cost.

Table 5.4: Effectiveness of contextual multi-agent actor-critic considering reposition costs.

Table 5.5: Effectiveness of averaged reward design.

Table 5.6: Effectiveness of context embedding.

Table 5.7: Effectiveness of group regularization design.

Table 6.1: Convergence results for FedAvg and accelerated FedAvg. Throughout the paper, $N$ is the total number of local devices, and $K \le N$ is the maximal number of devices that are accessible to the central server. $T$ is the total number of stochastic updates performed by each local device, and $E$ is the number of local steps between two consecutive server communications (hence $T/E$ is the number of communications). (†) In the linear regression setting, we have $\kappa = \kappa_1$ for FedAvg and $\kappa = \sqrt{\kappa_1 \tilde{\kappa}}$ for accelerated FedAvg, where $\kappa_1$ and $\tilde{\kappa}$ are condition numbers defined in Section 6.5. Since $\tilde{\kappa} \le \kappa_1$, this implies a speedup factor of $\sqrt{\kappa_1 / \tilde{\kappa}}$ for accelerated FedAvg.

Table A.1: A comparison of studies reducing RL to SL. The Objective column denotes whether the goal is to maximize long-term reward. The Cont. Action column denotes whether the method is applicable to both continuous and discrete action spaces. The Optimality column denotes whether the algorithms can model the optimal policy.
X† denotes that the optimality achieved by ERL is with respect to the entropy-regularized objective instead of the original objective on return. The Off-Policy column denotes whether the algorithms enable off-policy learning. The No Oracle column denotes whether the algorithms need access to a certain type of oracle (expert policy or expert demonstrations).

Table A.2: Hyperparameters of the RPG network.

Table B.1: A high-level summary of the convergence results in this paper compared to prior state-of-the-art FL algorithms. This table only highlights the dependence on $T$ (number of iterations), $E$ (the maximal number of local steps), $N$ (the total number of devices), and $K \le N$ (the number of participating devices). $\kappa$ is the condition number of the system, and the remaining rate parameter lies in $(0, 1)$. We denote Nesterov accelerated FedAvg as N-FedAvg in this table.

LIST OF FIGURES

Figure 3.1: Illustration of MTL with feature interactions. (a) The feature interactions from multiple tasks can be collectively represented as a tensor $\mathcal{Q}$; group sparse structures (c) and low-rank structures (b) in feature interactions can be used to facilitate multi-task models.

Figure 3.2: RMSE comparison between RR and STIL on two synthetic datasets with sample sizes of 1k and 5k, respectively.

Figure 3.3: Synthetic dataset (multi-task): Root Mean Square Error (RMSE) comparisons among all the methods. The Y-axis is RMSE; the X-axis is the dimension of features.

Figure 3.4: Overview of the proposed iMTRL framework, which involves human experts in the loop of multi-task learning. The framework consists of three phases: (1) Knowledge-aware multi-task learning: learning multi-task models from knowledge and data; (2) Solicitation: soliciting the most informative knowledge from human experts using an active-learning-based query strategy; (3) Encoding: encoding the domain knowledge to facilitate inductive transfer.

Figure 3.5: Performance of MTRL and eMTRL as the number of features changes, in terms of (a) Frobenius norm and (b) RMSE. MTRL [227] learns both task models and the task relationship at the same time, while eMTRL here learns the task models while the task relationship is fixed to the ground truth, i.e., encoding the correct domain knowledge about the task relationship.

Figure 3.6: The averaged RMSE of kMTRL using different settings of the query strategy. kMTRL-10-100 means selecting 10 pairwise constraints at the end of each iteration, starting from zero and adding 10 pairwise constraints at a time, until 100 constraints. For all 4 schemes, kMTRL with zero constraints is equivalent to MTRL. Results are the average over 5-fold random splitting.

Figure 3.7: The distribution of competence on (a) intra-region covariance and (b) inter-region covariance. kMTRL performs better than MTRL when competence > 1. Higher competence indicates better performance achieved by kMTRL as compared to MTRL. We see that in a majority of regions kMTRL outperforms MTRL.

Figure 3.8: Comparison of sub-matrices of covariance among (left) the task covariance computed using 90% of all data points, which is considered as the ground truth, (middle) the covariance matrix learned via MTRL on 20% data, and (right) the covariance matrix learned via kMTRL on 20% data with 0.8% pairwise constraints queried by the proposed query scheme.

Figure 4.1: Illustration of the Collaborative Deep Reinforcement Learning framework.

Figure 4.2: Deep knowledge distillation. In (a), the teacher's output logits $z$ are mapped through a deep alignment network and the aligned logits $F_{\omega}(z)$ are used as the supervision to train the student. In (b), an extra fully connected layer for distillation is added for learning knowledge from the teacher.
For simplicity's sake, the time step $t$ is omitted here.

Figure 4.3: Performance of online homogeneous knowledge distillation.

Figure 4.4: Performance of online knowledge distillation from a heterogeneous task. (a) Distillation from a Pong expert using the policy layer to train a Bowling student (KD-policy). (b) Distillation from a Pong expert to a Bowling student using an extra distillation layer (KD-distill).

Figure 4.5: The action probability distributions of a Pong expert, a Bowling expert, and an aligned Pong expert.

Figure 4.6: Performance of offline and online deep knowledge distillation, and collaborative learning.

Figure 4.7: Off-policy learning framework.

Figure 4.8: The binary tree structure MDP ($M_1$) with one initial state, similar to the one discussed in [184]. In this subsection, we focus on MDPs that have no duplicated states. The initial state distribution of the MDP is uniform and the environment dynamics are deterministic. For $M_1$ the worst-case exploration is random exploration, and each trajectory will be visited with the same probability under random exploration. Note that in this type of MDP, Assumption 5 is satisfied.

Figure 4.9: The training curves of the proposed RPG and the state-of-the-art. All results are averaged over random seeds from 1 to 5. The x-axis represents the number of steps interacting with the environment (we update the model every four steps) and the y-axis represents the averaged training episodic return. The error bars are plotted with a confidence interval of 95%.

Figure 4.10: The trade-off between sample efficiency and optimality.

Figure 4.11: Expected exploration efficiency of the state-of-the-art. The results are averaged over random seeds from 1 to 10.

Figure 5.1: The grid-world system and a spatial-temporal illustration of the problem setting.

Figure 5.2: Illustration of contextual multi-agent actor-critic. The left part shows the coordination of decentralized execution based on the output of the centralized value network. The right part illustrates embedding context into the policy network.

Figure 5.3: The simulator calibration in terms of GMV. The red curves plot the GMV values of real data averaged over 7 days with standard deviation, in 10-minute time granularity. The blue curves are simulated results averaged over 7 episodes.

Figure 5.4: Simulator timeline in one time step (10 minutes).

Figure 5.5: Illustration of allocations of cA2C and LP-cA2C at 18:40 and 19:40, respectively.

Figure 5.6: Convergence comparison of cA2C and its variations without using context embedding in both settings, with and without reposition costs. The X-axis is the number of episodes. The left Y-axis denotes the number of conflicts and the right Y-axis denotes the normalized GMV in one episode.

Figure 5.7: Illustration of the repositions near the airport at 01:50 am and 06:40 pm. The darker color denotes the higher state value and the blue arrows denote the repositions.

Figure 5.8: The normalized state value and demand-supply gap over one day.

Figure 6.1: The linear speedup of FedAvg in full participation, partial participation, and the linear speedup of Nesterov accelerated FedAvg, respectively.

Figure A.1: The binary tree structure MDP with two initial states.

Figure A.2: The directed graph that describes the conditional independence of the pairwise relationship of actions, where $Q_1$ denotes the return of taking action $a_1$ at state $s$, following policy $\pi$ in $\mathcal{M}$, i.e., $Q^{\pi}_{\mathcal{M}}(s, a_1)$.
$I_{1,2}$ is a random variable that denotes the pairwise relationship of $Q_1$ and $Q_2$, i.e., $I_{1,2} = 1$ if $Q_1 \ge Q_2$, and $I_{1,2} = 0$ otherwise.

Figure B.1: The convergence of FedAvg with respect to the number of local steps $E$.

LIST OF ALGORITHMS

Algorithm 3.1  knowledge-aware Multi-Task Relationship Learning (kMTRL)
Algorithm 3.2  Projection algorithm
Algorithm 3.3  Query Strategy of Pairwise Constraints
Algorithm 3.4  iMTRL framework
Algorithm 4.1  Online cA3C
Algorithm 4.2  Off-Policy Learning for Ranking Policy Gradient (RPG)
Algorithm 5.1  epsilon-greedy policy for cDQN
Algorithm 5.2  Contextual Deep Q-learning (cDQN)
Algorithm 5.3  Contextual Multi-agent Actor-Critic Policy forward
Algorithm 5.4  Contextual Multi-agent Actor-Critic Algorithm for N agents

Chapter 1

Introduction

Human intelligence is remarkable at collaboration. Besides independent learning, our learning process is greatly improved by summarizing what has been learned, communicating it with peers, and subsequently fusing knowledge from different sources to assist the current learning goal. This collaborative learning procedure ensures that knowledge is shared, continuously refined, and consolidated from different perspectives to construct an increasingly profound understanding, which can significantly improve learning efficiency.

On the other hand, machine intelligence still pales in comparison to human intelligence in some respects, despite its phenomenal development in recent years: machine learning systems are in general designed for one specific task, with an isolated, data-inefficient, and computationally expensive learning paradigm.

The research goal presented in this dissertation is to build an intelligent system with multiple learning agents that collaboratively resolves one or a set of tasks more efficiently. In particular, we tackle the following challenges in various domains of collaborative learning.

- Flexible and interactive collaboration. How can models of multiple learning agents interact to leverage the knowledge from related tasks in a flexible, stable, and interactive way? More concretely, how can we incorporate higher-order interactions into multiple learning models during training? How can we continuously guide the learning of multiple models and selectively solicit human expert knowledge to escort their collaboration interactively?

- Heterogeneous collaboration. One limitation in collaborative learning is that the learning models, in general, have a homogeneous structure. How can we design collaborative strategies among heterogeneous learning agents to improve sample efficiency?

- Large-scale collaboration. In practice, an effective and efficient collaboration among a large number of learning agents is desired. How can we scale the collaboration to thousands of agents?

- Theoretical guarantee of collaboration. Besides the practical algorithms and applications, what are the theoretical advantages of collaborative learning? Does the learning benefit from more learning agents?

1.1 Dissertation Contributions

To resolve the aforementioned challenges of collaborative learning, this thesis presents how collaboration is achieved to improve sample efficiency in various scenarios. More concretely, the contributions of this thesis are summarized in the following sections.
1.1.1 Model-driven collaboration

We discuss model-driven collaboration in the context of multi-task learning. The first part of Chapter 3 discusses how we capture the high-order feature interactions among related tasks collaboratively. Traditional multi-task learning with linear models is widely used in various data mining and machine learning algorithms. One major limitation of such models is the lack of capability to capture predictive information from interactions between features. While introducing high-order feature interaction terms can overcome this limitation, this approach dramatically increases the model complexity and imposes significant challenges in avoiding overfitting. When there are multiple related learning tasks, feature interactions from these tasks are usually related, and modeling such relatedness is the key to improving their generalization. Here, we present a novel Multi-Task feature Interaction Learning (MTIL) framework to exploit the task relatedness from high-order feature interactions. Specifically, we collectively represent the feature interactions from multiple tasks as a tensor, and prior knowledge of task relatedness can be incorporated into different structured regularizations on this tensor. We formulate two concrete approaches under this framework, namely the shared interaction approach and the embedded interaction approach. The former assumes tasks share the same set of interactions, and the latter assumes feature interactions from multiple tasks share a common subspace. We have provided efficient algorithms for solving the two formulations.

The second part of Chapter 3 investigates soliciting and incorporating task relatedness information from human experts into the model, which guides the direction of the model-based collaboration. At the center of MTL algorithms is how the relatedness of tasks is modeled and encoded in learning formulations to facilitate knowledge transfer. Among the MTL algorithms, multi-task relationship learning (MTRL) has attracted much attention in the community because it learns the task relationship from data to guide knowledge transfer, instead of imposing a prior task relatedness assumption. However, this method heavily depends on the quality of the training data. When there is insufficient training data or the data is too noisy, the algorithm could learn an inaccurate task relationship that misleads the learning towards suboptimal models. To address this challenge, we propose a novel interactive multi-task relationship learning (iMTRL) framework that efficiently solicits partial order knowledge of the task relationship from human experts and effectively incorporates the knowledge in a proposed knowledge-aware MTRL (kMTRL) formulation. We propose an efficient optimization algorithm for kMTRL and comprehensively study query strategies that identify the critical pairs that are most influential to the learning. We present extensive empirical studies on both synthetic and real datasets to demonstrate the effectiveness of the proposed framework.

1.1.2 Data-driven collaboration

In Chapter 4, we discuss data-driven collaboration in the context of reinforcement learning and use data as a medium to facilitate collaboration among multiple learning agents, which can then largely improve sample efficiency.
In that chapter, we first leverage knowledge distillation to enable collaboration among heterogeneous learning agents. The idea of knowledge transfer has led to many advances in machine learning and data mining, but significant challenges remain, especially when it comes to reinforcement learning, heterogeneous model structures, and different learning tasks. Motivated by human collaborative learning, we propose a collaborative deep reinforcement learning (CDRL) framework that performs adaptive knowledge transfer among heterogeneous learning agents. Specifically, the proposed CDRL conducts a novel deep knowledge distillation method to address the heterogeneity among different learning tasks with a deep alignment network. Furthermore, we present an efficient collaborative Asynchronous Advantage Actor-Critic (cA3C) algorithm to incorporate deep knowledge distillation into the online training of agents, and demonstrate the effectiveness of the CDRL framework using extensive empirical evaluation on OpenAI Gym.

In addition to knowledge transfer among different tasks, we can further coordinate homogeneous learning agents on the same task, which leads to more stable optimization and more sample-efficient learning. The main idea is an off-policy learning framework that disentangles exploration and exploitation in reinforcement learning, building upon the connection between imitation learning and reinforcement learning. The state-of-the-art estimates the optimal action values, but it usually involves an extensive search over the state-action space and unstable optimization. Towards sample-efficient RL, we propose ranking policy gradient (RPG), a policy gradient method that learns the optimal rank of a set of discrete actions. To accelerate the learning of policy gradient methods, we establish the equivalence between maximizing a lower bound of the return and imitating a near-optimal policy without accessing any oracles. These results lead to a general off-policy learning framework, which preserves optimality, reduces variance, and improves sample efficiency. We conduct extensive experiments showing that, when consolidated with the off-policy learning framework, RPG substantially reduces the sample complexity compared to the state-of-the-art.

1.1.3 Large-scale Collaborative Multi-agent Learning

In Chapter 5, we apply collaborative multi-agent reinforcement learning to a real-world fleet management application, which is an essential component of online ride-sharing platforms. Large-scale online ride-sharing platforms have substantially transformed our lives by reallocating transportation resources to alleviate traffic congestion and promote transportation efficiency. An efficient fleet management strategy not only can significantly improve the utilization of transportation resources but also increase revenue and customer satisfaction. It is a challenging task to design an effective fleet management strategy that can adapt to an environment involving complex dynamics between demand and supply. Existing studies usually work on a simplified problem setting that can hardly capture the complicated stochastic demand-supply variations in high-dimensional space. We propose to tackle the large-scale fleet management problem using reinforcement learning, and propose a contextual multi-agent reinforcement learning framework including two concrete algorithms, namely contextual deep Q-learning and contextual multi-agent actor-critic, to achieve explicit coordination among a large number of agents adaptive to different contexts. We show significant improvements of the proposed framework over state-of-the-art approaches through extensive empirical studies.
1.1.4 The Provable Advantage of Collaborative Learning

The preceding chapters propose heuristic collaborative approaches to coordinate a large number of learning agents and to resolve a real-world application. In addition, we would like to provide a rigorous answer to whether there is a provable benefit from increasing the number of collaborative learning agents. We investigate this problem in federated learning, which is a critical scenario in both industry and academia. Federated learning (FL) learns a model jointly from a set of participating devices without sharing each other's privately held data. The characteristics of non-iid data across the network, low device participation, and the mandate that data remain private bring challenges to understanding the convergence of FL algorithms, particularly with regard to how convergence scales with the number of participating devices.

Here, we focus on Federated Averaging (FedAvg), the most widely used and effective FL algorithm in use today, and provide a comprehensive study of its convergence rate. Although FedAvg has recently been studied by an emerging line of literature, it remains open how FedAvg's convergence scales with the number of participating devices in the FL setting, a crucial question whose answer would shed light on the performance of FedAvg in large FL systems. We fill this gap by establishing convergence guarantees for FedAvg under three classes of problems: strongly convex smooth, convex smooth, and overparameterized strongly convex smooth problems. We show that FedAvg enjoys linear speedup in each case, although with different convergence rates. For each class, we also characterize the corresponding convergence rates for the Nesterov accelerated FedAvg algorithm in the FL setting: to the best of our knowledge, these are the first linear speedup guarantees for FedAvg when Nesterov acceleration is used. To accelerate FedAvg further, we also design a new momentum-based FL algorithm that improves the convergence rate in overparameterized linear regression problems. Empirical studies of the algorithms in various settings support our theoretical results.

1.2 Dissertation Structure

The remainder of this dissertation is organized as follows. We introduce the background of collaborative learning in Chapter 2. In Chapter 3, we start with learning linear models for multiple tasks while incorporating flexible forms of interactions, and develop an interactive approach to solicit human expert knowledge for model collaborations. This chapter was previously published as "Multi-task Feature Interaction Learning" [115] and "Interactive Multi-task Relationship Learning" [117]. In Chapter 4, we present data-driven collaboration methods that enable interaction among heterogeneous learning agents, which can largely improve the sample efficiency of reinforcement learning algorithms. The materials in this chapter are based on "Collaborative Deep Reinforcement Learning" [114] and "Ranking Policy Gradient" [118]. In Chapter 5, we study a real-world application and design a coordination strategy that can scale to a large number of learning agents. The materials in this chapter were published as "Efficient large-scale fleet management via multi-agent deep reinforcement learning" [116]. In Chapter 6, we present rigorous theories on the improvement of convergence rates with respect to the increasing number of collaborative learning agents, which advocate the advantage of collaborative learning. The materials in this chapter are based on "Federated Learning's Blessing: FedAvg has Linear Speedup" [157]. We conclude this dissertation in Chapter 7.

Chapter 2

Background

In this chapter, we first give a coherent definition of collaborative learning as used throughout this dissertation, and then discuss connections and discrepancies among four specific scenarios under this overarching framework.
2.1 Collaborative Learning Problem Formulation

In the disciplines of cognitive science, education, and psychology, collaborative learning, a situation in which a group of people learn to achieve a set of tasks together, has been advocated throughout previous studies [50]. Motivated by the phenomenal success of human collaborative learning, we study collaborative learning in the domain of artificial intelligence. We first provide a general definition of collaborative learning as used in this thesis.

Definition 1 (Collaborative learning). Collaborative learning is a general learning paradigm in which multiple learning agents collaborate to solve one or a set of tasks.

Here, we clarify the terminology used in Definition 1.

- multiple: in contrast to individual learning, collaborative learning covers a wide range of scales, from a pair of learning agents to thousands of learning agents.

- learning agents: a learning agent refers to a machine learning model that behaves differently from the others. For example, learning agents can be parameterized by different deep neural networks, and those networks can have different domains or architectures. The central requirement is that each learning agent can learn individually and conduct decision making independently.

- collaborate: the interaction among different learning agents. The strategy of this interaction is the central design of a collaborative learning algorithm.

- solve one or a set of tasks: in machine learning, solving one or a set of tasks refers to optimizing one or several objective functions that generalize well to unseen scenarios.

More concretely, we provide the problem formulation of collaborative learning as follows:

$$\min_{W = \{w_i\}_{i=1}^K} \; \sum_{i=1}^K F_i(W) \quad \text{s.t.} \quad w_i \in \mathcal{C}_i(W), \;\; \forall i = 1, \ldots, K, \qquad (2.1)$$

where $F_i$, $i = 1, \ldots, K$, refers to the set of tasks we want to solve. The model parameter $w_i$ denotes the $i$-th learning agent. It is worth noting that $w_i$ is not necessarily represented by a single instance, e.g., a neural network, a decision tree, etc. We use $w_i$ to denote all variables that need to be determined for a decision process, which constructs a mapping from the input of task $i$ to the action, such as regression, classification, etc. We use the set $\mathcal{C}_i$, $\forall i = 1, \ldots, K$, to denote the interactions between learning agent $i$ and the others, which can encode various types of collaboration strategies into the learning process, as we will discuss shortly. For simplicity, we denote the union of the models of all learning agents as $W = \{w_i\}_{i=1}^K$. The rationale of collaborative learning is that a proper design of the interactions $\mathcal{C}$ among the learning agents facilitates the optimization of the objectives.

It is worth noting that the collaboration set is a more general expression than a regularization term. A regularization term enforces a specific form on the formulation, while the collaboration set can integrate more flexible algorithmic designs of interaction. In this thesis, although there are distinctions in terms of how different learning agents interact, we follow the common practice and use cooperation and collaboration interchangeably [50].

2.2 A Taxonomy of Collaboration

In this section, we present different categories of collaboration, which lead to several subfields in the machine learning community. We discuss the connections and discrepancies of those related subfields and explore the possible advantages of organizing them in a unified view.
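Before turning to the specific categories, the following minimal Python sketch spells out the template in formulation (2.1). It is an illustration only: the function and variable names (collaborative_learning, collaborate, average_models) are ours and do not correspond to any released implementation, and each agent here solves a toy least-squares task.

```python
import numpy as np

def collaborative_learning(tasks, collaborate, n_rounds=100, lr=0.1):
    """Generic template for formulation (2.1): K agents, K task objectives,
    and a pluggable collaboration step that maps the models back into the
    collaboration sets C_1, ..., C_K."""
    # Each task is a pair (X_i, y_i); agent i keeps its own parameters w_i.
    W = [np.zeros(X.shape[1]) for X, _ in tasks]
    for _ in range(n_rounds):
        # Individual learning: each agent takes a gradient step on its own
        # objective F_i (a least-squares loss in this toy example).
        for i, (X, y) in enumerate(tasks):
            grad = X.T @ (X @ W[i] - y) / len(y)
            W[i] = W[i] - lr * grad
        # Collaboration: the design of this step is the central design of a
        # collaborative learning algorithm (regularization, averaging, ...).
        W = collaborate(W)
    return W

def average_models(W):
    """One trivial collaboration set: all agents agree on one shared model,
    which recovers a FedAvg-style aggregation (Chapter 6)."""
    mean = np.mean(W, axis=0)
    return [mean.copy() for _ in W]
```

With `collaborate = lambda W: W` the template degenerates to K independent learners; the three categories discussed next correspond to different choices of this collaboration step.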
2.2.1 Model-Driven Collaboration

The first category of collaborative learning is model-driven collaboration, which directly enforces the interaction of learning agents in the parameter space. From the perspective of transfer learning, these approaches implement knowledge transfer by introducing an inductive bias during the learning. They specify conditions that the learned solution needs to satisfy, such as sparsity or a low-rank property. In this case, the collaboration constraint reduces to various regularizations, and collaborative learning reduces to multi-task learning and federated learning. More concretely, we set $\mathcal{C}_i = \mathcal{R}(W)$, where $\mathcal{R}(\cdot)$ is the regularization imposed on $W$. For example, when $W$ is a matrix (each learning agent's model is a vector), a common choice is the trace norm constraint $\mathcal{R}(W) = \{W \mid \|W\|_{tr} \le c\}$ for some constant $c$, which controls the subspace shared by the multiple models.

Multi-Task Learning (MTL) is a principled learning paradigm that leverages useful information contained in multiple related tasks to help improve the generalization performance of all the tasks [226]. The goal of MTL is to learn $K$ functions for the tasks such that $f_k(x_{ik}) = y_{ik}$, based on the assumption that all task functions are related to some extent, where each function $f_k$ is parameterized by $w_k$. The general multi-task learning formulation is given by:

$$\min_{W} \; \sum_{k=1}^K F_k(w_k) + \mathcal{R}(W). \qquad (2.2)$$

Another field that falls into model-driven collaboration is federated learning. Federated learning (FL) learns a single model jointly from a set of learning agents. In general, each learning agent corresponds to a local device, and training is performed without sharing each other's privately held data. At present, the prevalent collaboration strategy is the aggregation of all learning agents' models. The challenge of federated learning lies in the practical constraints on collaboration: reducing the communication cost (the frequency of collaboration), dealing with system heterogeneity, and understanding the theoretical properties of this simple collaborative strategy. We provide rigorous answers to those questions in Chapter 6.

2.2.2 Data-driven Collaboration

One limitation of traditional model-driven collaboration is that the model structure is restricted due to the usage of inductive transfer. To overcome this issue, data-driven collaboration leverages techniques such as knowledge distillation and mimic learning. In this case, the data-driven collaboration constraint is given by

$$\mathcal{C}_i(w_i) = \Big\{ \arg\min_{w_i} \; \ell\big(w_i; f_{w_j}(x), y\big), \; \forall (x, y) \in \mathcal{B} \Big\},$$

where $\mathcal{B}$ denotes the replay buffer that contains a set of data selected according to task-specific criteria. Notice that the interaction between learning agents is now conducted through the data collected in $\mathcal{B}$. Since another learning agent's model labeled the data in $\mathcal{B}$, the buffer contains information learned by agent $j$, which is then distilled to agent $i$ through the loss function $\ell(\cdot)$; a minimal sketch of this constraint is given at the end of this chapter. In this way, we can allow flexible network structures among different agents and thus achieve collaboration among heterogeneous learning agents. These approaches will be introduced in Chapter 4.

2.2.3 Collaborative Multi-agent Learning

In collaborative multi-agent learning, multiple learning agents interact with each other to achieve a common task. Each learning agent can perform the learning process individually while interacting with a shared environment. We emphasize this problem as a distinct type of collaboration since the agents can adapt their collaboration through environment feedback, though this trial and error can be computationally intractable. To improve the sample efficiency in this scenario, we can enforce a task-specific model-driven or data-driven approach during the learning. We provide a concrete real-world application to demonstrate this category in Chapter 5.
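To close this chapter, here is a deliberately simplified sketch of the data-driven constraint from Section 2.2.2: one agent's logits label the data in the buffer, and another agent fits the softened predictions. The helper names and the temperature parameter are assumptions for illustration; the actual distillation methods used in Chapter 4 are defined there.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions:
    one concrete choice of the loss l(w_i; f_{w_j}(x), y) over the buffer B."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1))

# The replay buffer B holds inputs labeled by agent j's model; agent i fits
# those soft targets, so the two agents interact only through data and may
# use entirely different network architectures.
```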
Chapter 3

Model-Driven Collaborative Learning

In this chapter, we discuss model-driven collaboration in the context of multi-task learning. More specifically, we first propose a novel Multi-Task feature Interaction Learning (MTIL) framework to exploit the task relatedness from high-order feature interactions, which provides better generalization performance through inductive transfer among tasks via shared representations of feature interactions. We formulate two concrete approaches under this framework: the shared interaction approach and the embedded interaction approach. The former assumes tasks share the same set of interactions, and the latter assumes feature interactions from multiple tasks come from a shared subspace. We provide efficient algorithms for solving both approaches. Secondly, classical multi-task relationship learning can learn an inaccurate task relationship when there is insufficient training data or the data is too noisy, and such a relationship misleads the learning towards suboptimal models. To address this, we propose a novel interactive multi-task relationship learning (iMTRL) framework that efficiently solicits partial order knowledge of the task relationship from human experts and effectively incorporates the knowledge in a proposed knowledge-aware MTRL (kMTRL) formulation. We propose an efficient optimization algorithm for kMTRL and comprehensively study query strategies that identify the critical pairs that are most influential to the learning.

3.1 Multi-Task Feature Interaction Learning

3.1.1 Introduction

Linear models are simple yet powerful machine learning and data mining models that are widely used in many applications. Due to their additive nature, linear models can fully unleash the power of feature engineering, allowing crafted features to be easily integrated into the learning system. This is a desired property in many practical applications, in which high-quality features are the key to predictive performance. Moreover, efficient parallel algorithms are readily available to learn linear models from large-scale datasets. Despite these attractive properties, one apparent limitation of such models is that they can only learn a set of individual effects of features contributing to the response, due to the linear additive property. Thus, when part of the response is derived from interactions between features, such models cannot detect this non-linear predictive information, leading to poor predictive performance.

In practice, high-order feature interactions are common in many domains. For example, in genetics studies, environmental effects and gene-environment interactions are found to have a strong relationship with the variability in adoptee aggressivity, conduct disorder, and adult antisocial behavior [29]. Similarly, the interaction effect between continuance commitment and affective commitment was found useful in predicting annexed absences [177]. Also, a recent study of depression found that genotype, sex, environmental risk, and their interaction have a combined influence on depression symptoms [52]. It is also reported that the interaction of brain-derived neurotrophic factor and early life stress exposure is identified in predicting syndromal depression and anxiety, and associated alterations in cognition [63]. In biomedical studies, many human diseases are the result of complicated interactions among genetic variants and environmental factors [79]. One intuitive solution to overcome this limitation is to augment linear models with interaction terms, explicitly modeling the effects from the interactions. However, this dramatically increases the model complexity and leads to poor generalization performance when there is a limited amount of data [35, 39, 124, 158, 216].
On the other hand, when there are multiple related learning tasks, the multi-task learning (MTL) paradigm [10, 19, 33] has offered a principled way to improve the generalization performance of such learning tasks by leveraging the relatedness among tasks and performing inductive transfer among them. The past decade has witnessed a great amount of success in applying MTL to tackle problems where large amounts of labeled data are not available or creating such datasets incurs prohibitive cost. Such problems are especially prevalent in biological and medical domains, where MTL has achieved significant success, including data analysis on genotype and gene expression [101], breast cancer diagnosis [228], and progression modeling of Alzheimer's disease [68]. MTL improves generalization performance by learning a shared representation from all tasks, which serves as the agent for knowledge transfer. Structured regularization has provided an effective means of modeling such shared representations and encoding various types of domain knowledge on tasks [10, 89, 142, 199].

The attractive benefits provided by MTL make it an ideal scheme when learning problems involve multiple related tasks with feature interactions, because tasks may be related to each other through shared structures on feature interactions. For example, predicting various cognitive functions may involve a shared set of interactions among brain regions.

However, many existing MTL frameworks are based on linear models [10] in the original input space. Thus they cannot be directly applied to explore task relatedness in the form of high-order feature interactions. On the other hand, although traditional nonlinear MTL methods based on neural networks (e.g., [13]) can exploit non-linear feature interactions to some extent, it is generally difficult to encode prior knowledge on task relatedness into such models. In this chapter, we propose a novel multi-task feature interaction learning framework, which learns a set of related tasks by exploiting task relatedness in the form of shared representations in both the original input space and the interaction space among features. We study two concrete approaches under this framework, according to different prior knowledge about the relatedness via feature interactions. The shared interaction approach assumes that there are only a small number of interactions that are relevant to the predictions, and all tasks share the same set of interactions; the embedded interaction approach assumes that, for each task, the feature interactions are derived from a low-dimensional subspace that is shared across different tasks. We provide formulations and efficient algorithms for both approaches. We conduct empirical studies on both synthetic and real datasets to demonstrate the effectiveness of the proposed framework in leveraging feature interactions across tasks. The contributions of this work are threefold:

- Our novel framework extends the MTL paradigm, for the first time, to allow high-order representations to be shared among tasks, by exploiting predictive information from feature interactions.

- We propose two novel approaches under our framework to model different types of task relatedness over feature interactions.

- Our comprehensive empirical studies on both synthetic and real data have led to practical insights into the proposed framework.

The remainder of this section is organized as follows: Section 3.1.2 reviews related work on MTL and models involving feature interactions. Section 3.1.3 introduces the framework for MTIL. The two approaches under MTIL are given in Section 3.1.4. Section 3.1.5 presents the experimental results on both synthetic and real datasets.

3.1.2 Related Work

The proposed research is related to existing work on MTL and feature interaction learning. In this section, we briefly summarize these related works and show how our work advances these areas.

Multi-Task Learning.
MTL has been extensively studied over the last two decades. At the center of most MTL algorithms is how task relationships are assumed and encoded into the learning formulations. The concept of learning multiple related tasks in parallel was first introduced in [33]. It was demonstrated in multiple real-world applications that adding a shared representation in neural network tasks can help the other tasks obtain better models. Such discovery has inspired many subsequent research efforts in the community and applications in diverse domains. Among these studies, the regularized MTL framework was pioneered by [55]. The regularization scheme can easily integrate various task relationships into existing learning formulations to couple MTL, thus providing a flexible multi-task extension to existing algorithms. It is well adopted and was soon generalized to a rich family of MTL algorithms.

MTL via Regularization. Among the work in the regularization-based MTL scheme, there are many different assumptions about how tasks are related, leading to different regularization terms in the formulation. For example, one common assumption is that the tasks share a subset of features, and the task relatedness can be captured by imposing a group sparsity penalty on the models to achieve simultaneous feature selection across tasks [199, 142]. Another common assumption is that the models of the tasks come from the same subspace, leading to a low-rank structure within the model matrix. Directly penalizing the rank function leads to NP-hard problems, and one convex alternative is to penalize the convex envelope of the rank function, i.e., the trace norm. This encourages low rank by introducing sparsity in the singular values of the model matrix [89]. In [10], the authors studied an MTL formulation that learns a common feature mapping for the tasks and assumed all tasks share the same features after the mapping. The authors have shown that this assumption can also be equivalently expressed by a low-rank regularization on the model. There are many more formulations that fall into this category, capturing task relatedness by designing different shared representations and regularization terms, such as cluster structures [232] and tree/graph structures [101, 38]. However, to the best of our knowledge, none of these formulations consider feature interactions in the model, and extensions to consider interactions are not straightforward. In this work, we extend the MTL framework to enable knowledge transfer not only in the original input space, but also in the higher-order feature interaction space.

Multilinear MTL. The use of tensors in MTL has been shown to be very effective in representing structural information underlying MTL problems. In [162], Romera-Paredes et al. proposed a multilinear multitask (MLMTL) framework that arranges the parameters of linear effects from all tasks into a tensor $\mathcal{W}$, by which they are able to represent the multi-modal relationships among tasks. In a dataset containing multi-modal relationships, tasks can be referenced by multiple indices. In MLMTL, the authors employed a regularizer on $\mathcal{W}$ to induce a low-rank structure that transfers knowledge among tasks. The optimization problem contains the minimization of the tensor's rank, which leads to solving a non-convex problem. Thus the authors develop an alternating algorithm, employing the Tucker decomposition and a convex relaxation using the tensor trace norm.

Figure 3.1: Illustration of MTL with feature interactions. (a) Tensor representation of feature interactions: the feature interactions from multiple tasks can be collectively represented as a tensor $\mathcal{Q}$. (b) Structured sparsity of an interaction tensor and (c) low-rank structure of an interaction tensor: group sparse structures and low-rank structures in feature interactions can be used to facilitate multi-task models.
Although the authors also used a tensor representation in MTL, the learning formulations, implications, and the meaning of the tensor are fundamentally different from those in our work. The proposed MTIL framework utilizes a tensor to capture the relatedness among tasks and transfer knowledge through high-order feature interactions, which cannot be achieved by any existing MTL formulation. Note that the tensor in MLMTL is indexed by multi-modal tasks. In MTIL, the tensor is indexed by features and tasks, which is clearly different from the aforementioned work. In the proposed embedded interaction approach for MTIL, however, we face a challenge similar to MLMTL in seeking a solution involving a low-rank tensor.

Feature Interaction. In many machine learning tasks, we are interested in learning a linear predictive model. Given the input feature vector of a sample, the response is given by a linear combination of these features, i.e., a weighted sum of the features. For this reason we call them linear effects. There is strong evidence in many complex applications that, in addition to the linear effects, there are also effects from high-order interactions between such features. As a result, there are considerable efforts from both academia and industry aiming to address this limitation by removing the additive assumption and including interaction effects.

To overcome the dimensionality issues introduced by interaction effects, two types of heredity constraints have been studied [20]: strong hierarchy, in which an interaction effect can be selected into the model only if both of its corresponding linear effects have been selected, and weak hierarchy, in which an interaction effect can be selected if at least one of its corresponding linear effects has been selected. In [39], the authors proposed an approach known as SHIM to identify the important interaction effects. SHIM extends the classical Lasso [194] and enforces a strong hierarchy. An iterative algorithm was proposed based on Lasso, which may not scale to problems with a high-dimensional feature space. Radchenko et al. proposed the VANISH method to address the problem [158]. They developed a convex formulation with a refined penalty that can not only learn a sparse solution, but also treat the linear and interaction effects using different weights. This way, the main effects can have more influence on the prediction. In [20], a hierarchical lasso was proposed to search for interactions with large main effects instead of considering all possible interactions. The authors proposed an algorithm based on ADMM for the strong hierarchy lasso and a generalized gradient descent for the weak hierarchical lasso. More recently, Liu et al. [124] proposed an efficient algorithm for solving the non-convex weak hierarchical Lasso directly, based on the framework of general iterative shrinkage and thresholding (GIST) [67]. The authors proposed a closed-form solution of the proximal operator and further improved the efficiency of solving the proximal operator subproblem from quadratic to linearithmic time complexity.

In many real-world applications there are multiple related tasks. When these tasks involve interaction effects, the tasks could be related via the high-order feature interactions. In this work, we propose to address the model complexity issue arising from interaction effects from a new perspective, by leveraging such relatedness.

3.1.3 Task relatedness in high order feature interactions

In this section, we present the framework of Multi-Task feature Interaction Learning (MTIL). For completeness, we give a self-contained introduction of our work. We will derive concrete learning algorithms under this framework in Section 3.1.4.

Linear and Interaction Effects. Consider traditional linear models. For an input feature vector $x \in \mathbb{R}^d$ and a scalar response $y$, we assume the following underlying linear generative model:

$$y = \sum_{i=1}^{d} x_i w_i + \epsilon,$$

where $w \in \mathbb{R}^d$ is the weight vector for the linear effects, and $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise.
A linear model $f(x; w) = x^T w$ can be a quite effective prediction function. However, the underlying generative model may also include effects from feature interactions, i.e.,

$$y = \sum_{i=1}^{d} x_i w_i + \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j Q_{i,j} + \epsilon,$$

where $x_i x_j Q_{i,j}$ is the joint effect between the $i$-th feature and the $j$-th feature, and $Q_{i,j}$ is the weight for this joint effect. This type of feature interaction is commonly found in many applications. If the training data follow this distribution, then a linear model is not enough to capture the relationship between input features and output responses. One approach is to introduce non-linear feature interaction terms into the linear model. That is, we can use a quadratic function:

$$f(x; w, Q) = x^T w + x^T Q x, \qquad (3.1)$$

where $w \in \mathbb{R}^d$ and $Q \in \mathbb{R}^{d \times d}$ collectively represent the parameters for linear effects and interaction effects, respectively. We note that $Q$ is typically symmetric because this representation includes two terms involving features $i$ and $j$, namely $x_i x_j (Q_{i,j} + Q_{j,i})$, and it also includes second-order transformations of the original features, $x_i^2 Q_{i,i}$.

Discussions on Feature Interactions. In supervised learning, we seek a predictive function that maps an input vector $x \in \mathbb{R}^d$ to a corresponding output $y \in \mathbb{R}$. Let $(X, y) = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ be a training dataset, in which each data point is drawn i.i.d. from a certain distribution $\mathcal{P}$. The goal of learning is to find the best predictor $\hat{f} \in \mathcal{H}$ so that the predicted value $\hat{y}_i$ for the input data $x_i$ is as close as possible to the ground truth $y_i$, $\forall (x_i, y_i) \in (X, y)$, given a loss function $L(\cdot, \cdot)$. We hope that the predictor $f$ learned in this way is close to the optimal model that minimizes the expected loss according to the distribution:

$$R(f) = \mathbb{E}_{(X, y) \sim \mathcal{P}} \, L(f(X), y). \qquad (3.2)$$

Such a predictor is given by the minimizer of the empirical risk:

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} L(f(x_i), y_i).$$

The error caused by learning the best predictor on the training dataset is called the estimation error. The error caused by using a restricted $\mathcal{H}$ is called the approximation error. For a fixed data size, the smaller the hypothesis space $\mathcal{H}$, the larger the approximation error, and vice versa. The trade-off between approximation error and estimation error is controlled by selecting the size of $\mathcal{H}$. By including feature interactions we enlarge the hypothesis space, and we may be able to dramatically reduce the approximation error compared to the traditional hypothesis space of linear models. On the other hand, given a limited amount of data, a large hypothesis space may result in models with poor generalization performance. We will need to either increase our training data, or provide effective regularizations to narrow down the hypothesis space.

Multi-task Feature Interactions. We consider the setting in which there are multiple learning tasks that are related not only in the original feature space, but also in terms of feature interactions. The proposed framework simultaneously learns all related tasks and provides an effective regularization on the hypothesis space using the relatedness of the interactions.

Let $\mathcal{D} = \{(X_1, y_1), \ldots, (X_T, y_T)\}$ be the training data for the $T$ learning tasks, where the i.i.d. training samples for task $t$ are drawn from $(\mathcal{P}_t)^{m_t}$ and $m_t$ is the number of data points available for task $t$. We collectively denote the distribution as $\mathcal{D} \sim \mathcal{P} = \prod_{t=1}^{T} (\mathcal{P}_t)^{m_t}$. All tasks have a $d$-dimensional feature space (i.e., $x_i \in \mathbb{R}^d$), and the corresponding features are homogeneous and have the same semantic meaning. The training data points are:

$$(X_t, y_t) = \{(x_{1t}, y_{1t}), (x_{2t}, y_{2t}), \ldots, (x_{m_t t}, y_{m_t t})\}, \quad t = 1, \ldots, T.$$

The goal of MTL is to learn $T$ functions for the tasks such that $f_t(x_{it}) = y_{it}$, based on the assumption that all task functions are related to some extent.
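Before extending Eq. (3.1) to multiple tasks, the following sketch (ours, for illustration only; the variable names and the synthetic data generation are assumptions) writes out the quadratic predictive function and shows why a purely linear fit misses the interaction effects.

```python
import numpy as np

def predict(x, w, Q):
    """Quadratic predictive function of Eq. (3.1): f(x; w, Q) = x'w + x'Qx."""
    return x @ w + x @ Q @ x

rng = np.random.default_rng(0)
d, n = 5, 200
w_true = rng.normal(size=d)
Q_true = rng.normal(size=(d, d))
Q_true = (Q_true + Q_true.T) / 2          # interaction effects are symmetric

X = rng.normal(size=(n, d))
y = np.array([predict(x, w_true, Q_true) for x in X]) + 0.1 * rng.normal(size=n)

# A linear-only least-squares fit cannot explain the part of the response that
# comes from the x_i * x_j interaction terms, so its residual stays large.
w_linear, *_ = np.linalg.lstsq(X, y, rcond=None)
linear_mse = np.mean((X @ w_linear - y) ** 2)
```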
In order to consider interactions for each task, we use the quadratic predictive function in Eq. (3.1) for all tasks. We collectively represent the linear effects from all tasks as a matrix $W = [w_1, \ldots, w_T] \in \mathbb{R}^{d \times T}$, $w_i \in \mathbb{R}^d$, and the interaction effects as a tensor $\mathcal{Q} \in \mathbb{R}^{d \times d \times T}$, in which the $t$-th frontal slice $Q_t \in \mathbb{R}^{d \times d}$ represents the interaction effects for task $t$. We illustrate this interaction tensor in Figure 3.1(a).

Given a specific loss function $\hat{\ell}$ for samples from one task (e.g., the squared loss for regression and the logistic loss for classification; see Table 3.1), the loss function for each task is $\ell_t(f, w, Q; X, y) = \sum_{i=1}^{m_t} \hat{\ell}(f(x_i; w, Q), y_i)$. Our multi-task feature interaction loss function is given by:

$$L(W, \mathcal{Q}; f, X, Y) = \sum_{t=1}^{T} \ell_t(f, w_t, Q_t; X_t, Y_t). \qquad (3.3)$$

Note that it is not necessary for all tasks to have the same loss function. In MTL, the learning of each task benefits from the knowledge of the other tasks, which effectively reduces the hypothesis space for all tasks. In order to achieve knowledge transfer among tasks, we impose shared representations via regularization terms on both $W$ and $\mathcal{Q}$, which specify how tasks are related in the original feature space and in the feature interactions, respectively.

The MTIL Framework. The proposed Multi-Task feature Interaction Learning (MTIL) framework is then given by the following learning objective:

$$\min_{W, \mathcal{Q}} \; L(W, \mathcal{Q}; f, X, Y) + \lambda_R R_F(W) + \lambda_I R_I(\mathcal{Q}), \qquad (3.4)$$

where $R_F(W)$ is the regularization providing task relatedness in the original feature space, $R_I(\mathcal{Q})$ is the regularization encoding our knowledge about how feature interactions are related among tasks, and $\lambda_R$ and $\lambda_I$ are the corresponding regularization coefficients. For $\lambda_I \to \infty$, the problem reduces to traditional MTL when $R_I$ is chosen properly. In this work, we formulate two concrete approaches to capture the feature interaction patterns; a sketch of the resulting objective follows after the two approaches.

Shared Interaction Approach. In many applications, even though we have a large number of feature interactions, only a few interactions may be related to the response [20, 39]. When learning with multiple tasks, different tasks may share exactly the same set of feature interactions, but with different effects. As such, we can design MTIL formulations that learn a set of common feature interactions, which could effectively reduce the hypothesis space. During the learning process, the selected feature interactions for one task contribute to the shared representation: a set of indices of common interactions. An analogy in traditional MTL is the joint feature learning approach [142, 199], in which tasks share the same set of features. One way to achieve this approach is to use structured sparsity to induce the same sparsity patterns on the interaction effects. An illustration of this approach is given in Figure 3.1(b).

Embedded Interaction Approach. When the response of one task is related to complicated feature interactions, the patterns of such interactions may be captured by a low-dimensional space, resulting in a low-rank interaction matrix. When there are multiple related tasks, they could have a shared low-dimensional space, i.e., different interaction matrices may share the same set of rank-1 basis matrices, but have different weights associated with these basis matrices. When collectively represented by a tensor, we end up with a low-rank tensor. During the learning process, each task contributes its subspace information to facilitate learning of the shared low-dimensional subspace, which in turn improves the feature space. The analogy in traditional MTL is the low-rank MTL approach, in which task models share a low-dimensional subspace.
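For intuition, the following sketch evaluates the MTIL objective of Eq. (3.4) for the shared interaction approach. It is an illustration under assumptions of our own choosing: it uses the squared loss from Table 3.1, an $\ell_1$ penalty on $W$ for $R_F$, and an $\ell_{2,1}$-style group penalty over the interaction tensor for $R_I$; the exact regularizers used in Section 3.1.4 may differ, and the function and parameter names (mtil_objective, lam_R, lam_I) are ours.

```python
import numpy as np

def mtil_objective(tasks, W, Q, lam_R=0.1, lam_I=0.1):
    """Objective (3.4) with the squared loss of Table 3.1.
    W is d x T (linear effects); Q is d x d x T (interaction tensor)."""
    loss = 0.0
    for t, (X, y) in enumerate(tasks):
        # f(x; w_t, Q_t) = x'w_t + x'Q_t x, evaluated for every sample of task t.
        pred = X @ W[:, t] + np.einsum('ni,ij,nj->n', X, Q[:, :, t], X)
        loss += 0.5 * np.sum((pred - y) ** 2)
    # R_F: an l1 penalty on the linear effects (one simple choice).
    r_f = np.sum(np.abs(W))
    # R_I for the shared interaction approach: a group penalty that couples the
    # (i, j) interaction across all T tasks, so the tasks tend to select the
    # same set of interactions (Figure 3.1(b)).
    r_i = np.sum(np.sqrt(np.sum(Q ** 2, axis=2)))
    return loss + lam_R * r_f + lam_I * r_i
```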
r Q t L i LogisticLoss [ log ( g ( x i )) y ti +(1 y ti )( log (1 g ( x i )))] ( g ( x i ) y ti ) x i ( g ( x i ) y ti ) x i x T i SquaredLoss 1 2 jj x T i w t + x T i Q t x i y ti jj 2 2 x i ( x T i w t + x T i Q t x i y ti ) x i ( x T i w t + x T i Q t x i y ti ) x T i SquaredHinge y h ( y ti ( x T i w t + x T i Q t x i )) y ti x i h 0 ( x T i w t + x T i Q t x i ) y ti x i x T i h 0 ( x T i w t + x T i Q t x i ) g ( x ) isthesigmoidfunctionde˝nedas g ( x i )=1 = 1+ exp ( ( x T i w t + x T i Q t x i )) y h 0 ( z )= 1 for z 0 ;z 1 for 0 1 ,we de˝nethe l p;q normof A as k A k p;q =( P d i =1 k e a i k q p ) 1 p .Thesetof K integersisdenotedas N K =[1 ;:::;K ] .Weuse I d todenotea d d identitymatrix,and 1 d todenotea d dimension vectorwithallelementsare1.Unlessstatedotherwise,allvectorsarecolumnvectors. 3.2.2RelatedWork Multi-tasklearning. MTLhasbeensuccessfullyappliedtosolvemanychallengingmachine learningproblemsinvolvingmultiplerelatedtasks.RecentlytheregularizationbasedMTL approachhasreceivedalotofattentionbecauseofits˛exibilityande˚cientimplementations. OnemajorresearchdirectioninregularizedMTListoencodetherelationshipamong tasks[ 54 , 96 , 58 , 227 , 171 , 22 ].TheregularizedMTLalgorithmscanberoughlyclassi˝edinto twotypes:the˝rstinvolvesassumptionsabouttaskrelatedness,whicharethen intoproperregularizationtermsintheregularizationtoinferasharedrepresentation,that servesasthemediaofknowledgetransfer.Anexampleisthelow-rankMTL[ 54 , 96 , 171 ], whichseeksasharedlow-dimensionalsubspaceintaskmodels,andthetasksarerelated throughthesharedsubspace.Onepotentialissueinsuchmethodsisthatthepriorknowledge maynotalwaysaccurateandtheassumptionmaynotbesuitableforalltasks.Lateronsome studiesfocusoninferthetaskrelationshipfromthedataset[ 227 , 58 , 22 ],e.g,bylearninga varianceovertasks.Sincethelearnedcovariancematrixgoverningtheknowledge transferisalsolearnedfromdata,thesemethodsisheavilydependentonthequalityand quantityofthetrainingsamplesavailable.Whenaninaccuratetaskrelationshipislearned,it willleadtopointtheknowledgetransferinawrongdirectionandleadtosuboptimalmodels, 46 aswillbeshowninourempiricalstudies.Toalleviatetheproblemofexistingmodels,we proposeanactivelearningframeworkwhichcaninteractivelylabelthegroundtruthoftask relationshipintolearningmodelandguidecorrectknowledgetransfer. ActiveLearning. Therearetwocommoncategoriesofactivelearning:thepoolbasedand thebatchmode.Thepoolbasedactivelearningapproachesselectthemostinformative unlabeledinstanceiteratively,whichisthenlabeledbyuser,withthegoaloflearningabetter modelwithlesse˙orts[ 173 ].Theselectionprocessisoftenreferredasa query .However,such sequentialqueryselectionstrategyisine˚cientinmanycases,i.e.addingonelabeleddata pointatatimeistypicallyinsu˚cienttosubstantiallyimprovetheperformanceofmodel, andthusthetrainingprocedureisveryslow.Incontrast,thebatch-modeactivelearning approachesselectasetofmostinformativequeryinstancessimultaneously.Tothebestof ourknowledge,allpreviousactivelearningfocusonhowtoselectagroupofmostinformative instancesortrainingsamples.Here,weinsteadproposeanovelquerystrategytoquery anothertypeofsupervision:taskrelationship.Thissupervisionisintuitivebutcomeswitha signi˝cantchallenge,i.e.,mostpreviousactivelearningstrategiescannotbedirectlyapplied. 
Inourstudythetasksupervisionisrepresentedbypartialorderswhichleadtopair- wiseconstraints.Thereareafewpreviousstudiesonthee˙ectivenessofthepairwise constraints[ 215 , 70 ]underactivelearningframework.In[ 70 ],aclusteringalgorithmnamed Active-PCCAwasproposedtoconsiderwhethertwodatapointsshouldbeassignedtothe sameclusterornot,bywhichitbiasesthecategorizationtowardstheoneexpected.Themost informativepairwiseconstraintsareselectedusingthedatapointsonthefrontierofthoseleast well-de˝nedclusters.In[ 215 ],theauthorsstudiedasemi-supervisedclusteringalgorithmwith aquerystrategytochoosepairwiseconstraintsbyselectingthemostinformativeinstance,as wellasdatapointsinitsneighborhoods.ThepairwiseconstraintsareintheformofMust-link 47 andCannot-link,whichrestricttwodatapointsshouldbeinthesameclassornot.However, thosemethodsaredevelopedforclusteringalgorithms.Howtoselectpairwiseconstraintson taskrelationshipthataresuitablefortheMTLframeworkremainstobeanopenproblem. Inthiswork,westudyquerystrategiesfortaskrelationshipsupervision,includingonenovel strategybasedontheinconsistencyoflearningmodel. InteractiveMachineLearning. Interactivemachinelearning(IML)isasystematicwayto includehumaninthelearningloop,observingtheresultsoflearningandprovidingfeedback toimprovethegeneralizationperformanceoflearningmodel[ 6 ].Ithasprovidedanatural waytointegratebackgroundknowledgeintothelearningprocedure[ 7 , 9 , 210 , 8 ].Forexample, thesystemcallederception-based(PBC)[ 9 ]hasbeenpioneeredtoo˙eran interactivewaytoconstructdecision.ThePBCisabletoconstructasmallerdecisiontree buttheaccuracyachieveddoesn'thassigni˝cantimprovecomparedtootherdecisiontree methodssuchasC4.5.Thedecisionconstructionhasbeenfurtherextendin[ 210 ].Theyalso foundoutthatuserscanbuildgoodmodelsonlywhenthevisualizationareapparentintwo dimension.Manualclassi˝erconstructionisnotsuccessfulforlargedatasetinvolvinghigh dimensioninteraction.In[ 7 ],anend-userIMLsystem(ReGroup)areproposedtobeableto helppeoplecreatecustomizedgroupsinsocialnetworks.In[ 8 ],theauthorsdevelopedan IMLsystemnamedas(CueT)tolearnthetriagingdecisionaboutnetworkalarminahighly dynamicenvironment.Inthispaper,iMTRLisproposedtocombinethedomainknowledge intermsoftaskrelationshiptobuildlearningmodels.Ourworkisexploringacompletely novelproblemcomparedtothepreviousstudiesininteractivemachinelearning. 48 3.2.3InteractiveMulti-TaskRelationshipLearning Inthissection,we˝rstreviewthestrengthsandpotentialissuesofthemulti-taskrelationship learninginSubsection3.2.3.1,whichmotivatetheoverarchingframeworkoftheproposed interactivemulti-taskrelationshiplearning(iMTRL)inSubsection3.2.3.2.Subsection3.2.3.3 presentstheknowledge-awareMTRL(kMTRL)formulationandalgorithm.Subsection3.2.3.5 introducesthenovelbatchmodeknowledgequerystrategybasedonactivelearning. 
3.2.3.1RevisittheMulti-taskRelationshipLearning BeforediscussingtheiMTRLframework,werevisitthemulti-taskrelationshiplearning (MTRL)[ 227 ],onepopularMTLmodelthatlearnsnotonlythepredictionmodelsbutalsotask relationship.TheMTRLframeworkhasawellfoundedBayesianbackground.Assumewehave K relatedlearningtasks,andineachtaskwearegivenadatamatrixandtheircorresponding responses.Let d bethenumberoffeatures.Forthetask k ,wearegiven m samplesandtheir correspondingresponses,collectivelydenotedby X k =[( x k 1 ) T ;( x k 2 ) T ; ::: ;( x k m ) T ] 2 R m d and y k 2 R m .Weassumethattheresponsescomefromalinearcombinationoffeatures withaGaussiannoise,sothatforsample j fromtask i ,wehave y i j = w T i x i j + b i + i ,where distributionofthenoiseisgivenby i ˘N (0 ; 2 i ) .Thegoalofthelearningistoestimatethe taskparameters W =[ w 1 ;:::; w K ] andbiasterm b =[ b 1 ;:::;b K ] forall K tasksfromdata. Basedontheassumptionwecanwritethelikelihoodof y i j given x i j ; w i ;b i ,and i isgiven by: p ( y i j j x i j ; w i ;b i ; i ) ˘N ( w T i x i j + b i ; 2 i ) ; where N ( m ; representsthemultivariatedistributionwithmean m andcovariancematrix 49 [21].Theprioron W =( w 1 ;:::; w K ) isgivenby: p ( W j i ) ˘ 0 @ K Y i =1 N ( w i j 0 d ;˙ 2 i I d ) 1 A q ( W ) ; where I d 2 R d d istheidentitymatrix.The˝rsttermistheextensionofridgepriortothe multi-tasklearningsetting,whichcontrolsthemodelcomplexityofeachtask w i .Thesecond termreferstothetaskrelationship,inwhichMTRLtriestolearnthecovarianceof W using amatrix-variatenormaldistributionfor q ( W ) q ( W )= MN d K ( W j 0 d K ; I d ) ; where MN d K ( M ; A B ) denotesmatrix-variatenormaldistributionwithmean M 2 R d K , rowcovariancematrix A 2 R d d andcolumncovariancematrix B 2 R K K .Accordingto theBayes'stheorem,theposteriordistributionfor W isproportionaltotheproductofthe priordistributionandthelikelihoodfunction[21]: p ( W j X ; y ; b ;˙; ) / p ( y j X ; W ; b ; ) p ( W j ;˙ ) ; (3.9) where X collectivelydenotesthedatamatrixfor K tasksand y =[ y 1 ;:::; y k ] denoteslabels foralldatapoints. BytakingnegativelogarithmofEq.(3.9),themaximumaposterioriestimationof W andmaximumlikelihoodestimationof isgivenby: min W ; K X k =1 1 2 k k y X k w k b k 1 n k k 2 F + 1 ˙ 2 k tr( WW T )+tr( 1 W T )+ d ln( ) : (3.10) 50 Intheaboveformulation,thelastterm d ln ( ) controlsthecomplexityof andisa concavefunction.Inordertoobtainaconvexobjectivefunction,theMTRLproposedto use tr ( )=1 insteadtocontrolthecomplexityandproject tobeapositivesemi-de˝nite matrix.Assuch,theobjectivefunctionofMTRLisderivedasfollows: min W ; K X k =1 1 n k k y k X k w k b k 1 n k k 2 F + 1 2 tr( WW T ) (3.11) + 2 2 tr( 1 W T ) : s.t. 
0 ; tr( )=1 Analternatingalgorithmisproposedin[ 227 ]tosolvethisformulation.Thealgorithm iterativelysolvestwosteps:˝rstitoptimizesEq(3.11)withrespectto W and b when is˝xed;itthenoptimizestheobjectivefunctionwithrespectiveto ,whichadmitsa closed-formsolution: =( W T W ) 1 = 2 = tr(( W T W ) 1 = 2 ) : (3.12) WenotethatthereisafeedbackloopinthelearningofMTRLasillustratedabove.The MTRLachievesknowledgetransferamongtaskmodelsviathetaskrelationmatrix ,and thetaskmodelswillbeusedtoestimate .Ifthe canbelearnedcorrectlyorcanclosely representthetruetasksrelationship,itwillbene˝tlearningonthetasksparameters W by guidingtheknowledgetransferinagooddirection.Inturn,thebettertasksparameterswill helpthealgorithmtoidentifyamoreaccurateestimationof .Thepositivefeedbackloop isthekeytohelpbuildingagoodMTRLmodel.Onthecontrary,thetrainingprocedure willbebiasedtowrongdirectiononcewekeepgettingmisleadingfeedbacksintheloop.To bemorespeci˝c,oncedataiseitherlow-qualityorinsu˚cient-quantity,the willindicate aninaccuratedirectiontotransfertheknowledgeamongtasks,whichleadstoanegative 51 feedbackintheloop.Thiswillenduplearningamodelwithpoorgeneralizationperformance, examplesofwhichwillbeelaboratedintheempiricalstudies. AnotherremarkisthatinEq.(3.11),duetotherelaxation,thesolutionof isnolonger theextractsolutionfromthemaximumlikelihoodestimationofcolumncovariancematrix derivedfromEq.(3.10).TheadvantagesoftheobjectivefunctioninEq.(3.11)comparedto Eq.(3.10)havebeendiscussedindetailsin[ 227 ].Wewouldliketofurtherpointoutthatthe learned isactuallyabetterrepresentationoftasksrelationshipthanthecolumncovariance matrix.Recallthatthecovariancesuggeststheextentthatelementsintwovectorsmove tothesamedirection.Supposewehavetasksparameters W 2 R d K ,theunbiasedsample covariancecanbecomputedby C = W T c W c = ( d 1) ,where W c = W 1 T d 1 d W =d isthe centralizedtasksmodels.Thismeasureisonlymeaningfulwhenthereareenoughnumber ofdimension d andthevariancecontainsintasksparameters.If W =[1 ; 2;1 ; 2] ,the covariancematrixwillreturnanall-zeromatrixwhichwillnotindicateacorrectrelationship. Instead,anaccurateestimationcanbeinferredbyusingEq.(3.12).Wecanobtaina correlationmatrix Corr =[1 1; 1 ; 1] from . Theabovediscussionsleadtotwoimportantconclusions:(1)The canindicateagenuine taskrelationship.(2)Maintaininganaccurate isthekeyinthislearningprocedure. 3.2.3.2TheiMTRLFramework InMTLscenarios,thequalityandquantityoftrainingdatausuallyimposesigni˝cant challengestothelearningalgorithms.Thetaskcovariancematrix inferredfromthedata maynotalwaysgiveanaccuratedescriptionofthetruetaskrelationship,whichinturn wouldprevente˙ectiveknowledgetransfer.Fortunately,inmanyreal-worldapplications, humanexpertspossessindispensabledomainknowledgeaboutrelatednessamongsometasks. 52 Forexample,whenbuildingmodelspredictingdi˙erentregionsofthebrainfromclinical features,neuroscientistandmedicalresearchercanrevealimportantrelationshipamongthe regions.Assuch,solicitfeedbackfromhumanexpertsontaskrelationshipandencodethem assupervisionisespeciallyattractive.Toachievethisgoalweneedtoanswerthefollowing problems: 1. Whattypeofknowledgerepresentationcanbee˚cientlysolicitedfromhumanexperts, andalsocanbeusedtoe˙ectivelyguidethelearningalgorithms? 2. HowtodesignMTLalgorithmthatcombinesthedomainknowledgeanddata-driven insights? 3. Howtoe˙ectivelysolicitknowledge,reducingtheworkloadofthehumanexpertsby supplyingonlythemostimportantknowledgethata˙ectsthelearningsystem? 
Inthispaperweproposeaframeworkofinteractivemulti-taskMachinelearning(iMTRL), whichprovidesanintegratedsolutiontoaddresstheabovechallengingquestions.The frameworkisillustratedinFig3.4.TheiMTRLisaniterativelearningprocedurethat involveshumanexpertsintheloop.Ineachiteration,thelearningprocedureinvolvesthe following: 1. Encoding .Thedomainknowledgeoftaskrelationshipisrepresentedaspartialorders, andcanbeencodedinthelearningaspairwiseconstraints. 2. Knowledge-AwareMulti-TaskLearning .WeproposeanovelMTLalgorithmthatinfers modelsandtaskrelationshipfromdataandconformthesolicitedknowledge. 3. ActiveLearningbasedKnowledgeQuery .Tomaximizetheusefulnessofsolicited knowledge,weproposeaknowledgequerystrategybasedonactivelearning. 53 Itisnaturalandintuitivetousepartialordersastheknowledgepresentationfortask relationship.Queryaquestionthatwhetherthetask i and j aremorerelatedthantask i and k ismucheasierthanaskingtowhichextentthetask i and j arerelatedtoeachother.For example, i thtaskand j thtaskshaspositiverelationshipwhilethe i thtaskand k thtaskhas negativerelationship,thenthisrelationshipisrepresentedbyapartialorder i;j i;k .The focusofthispaperisthealgorithmdevelopmentforiMTRLandwemakeafewassumptions toalleviatecommonissuesinusingthispresentationandsimplyourdiscussions: Assumption1. Thedomainknowledgeacquiredfromhumanexpertisaccurate.Theexpert maychoosenottolabelifhe/sheisnotcon˝dent. Assumption2. Theacquiredpartialordersarecompatible,i.e.when i;j > i;k and i;k > k;p areestablished,the i;j < k;p cannotbeincluded. Ifthissituationhappens,wecandiscardthelessimportantconstraintsandmakethe remainconstraintsbecompatible.Theimportanceofconstraintscanbemeasuredbythe InconsistencywhichwewillintroducedinDe˝nition2. 3.2.3.3Aknowledge-awareextensionofMTRL AssumeinthecurrentiterationofiMTRL,ourdomainknowledgeisstoredinaset T de˝ned by: T = f : i 1 ;j 1 i 2 ;j 2 8 ( i 1 ;j 1 ;i 2 ;j 2 ) 2 S g ; (3.13) whereeachpairwiseconstrainthasspeci˝edapreferredhalf-spacethatanidealsolution shouldbelongto,andtheset S containstheindexesoftasksselectedbyourquerystrategy. Thepartialorderinformationismoreimportantthanthemagnitudeof .Thereasonisthat ifwemultiplyeachelementin withascalar a ,it'sequaltosolvetheEq.(3.15)replacing 54 2 with 2 [ 51 ].Hence,themagnitudeofelementsin canbeadjustedsimultaneously withoutchangingtheresults.Buttheorderofpairsin isamoreimportantstructureto encode.Thesealgorithmicadvantagesreinforcedourchoiceofusingpairwiseconstraintsto representdomainknowledge. WenotethattheconstraintsinEq.(3.13)wouldleadtoatrivialsolutionthat i 1 ;j 1 = i 2 ;j 2 8 ( i 1 ;j 1 ;i 2 ;j 2 ) 2 S ,whichisapparentlynotthee˙ectweseek.Toovercomethis problem,weaddapositiveparameter c sothatwecanassuretheelementsin preservethe truepairwiseorder.Hence,theconvexset T ischangedto: T = f : i 1 ;j 1 i 2 ;j 2 + c; 8 ( i 1 ;j 1 ;i 2 ;j 2 ) 2 S g : (3.14) Theproposedknowledge-awaremulti-taskrelationship(kMTRL)learningextendsthe MTRLbyenforcingafeasiblespacefor speci˝edby T .Tothisend,thekMTRLformulation isgivenbythefollowingoptimizationproblem: min W ; b ; F ( W ; b ; )= K X k =1 1 n k k y k X k w k b k 1 n k k 2 F + 1 2 tr( WW T )+ 2 2 tr( 1 W T ) s.t. 0 ; tr( )=1 ; 2T (3.15) WenotethateventhoughtheproblemofkMTRLisconsideredtobemorechallenging tosolvethanMTRLbecauseofadditionalconstraintsintroducedin T ,thesolutionspace ofkMTRLismuchsmallerbecauseeachconstraintcutsthesolutionspaceinhalf,andthe optimizationalgorithmsmayconvergefasterinthiscase. 
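One way to picture the constraint set T of Eq. (3.14) is as a list of index quadruples, each asserting that one entry of Ω must exceed another by the margin c. The sketch below stores the solicited partial orders this way and checks feasibility of a candidate Ω; the data structures, variable names, and the example constraints are illustrative assumptions, not code from the dissertation.

```python
import numpy as np

# Each solicited partial order  Omega[i1, j1] >= Omega[i2, j2] + c
# is stored as an index quadruple (i1, j1, i2, j2), mirroring the set S in Eq. (3.13).
constraints = [
    (0, 2, 0, 1),   # hypothetical expert feedback: tasks (0, 2) are more related than tasks (0, 1)
    (1, 2, 0, 1),
]

def satisfies(Omega, constraints, c):
    """Check membership in the convex set T of Eq. (3.14)."""
    return all(Omega[i1, j1] >= Omega[i2, j2] + c for i1, j1, i2, j2 in constraints)

# A symmetric candidate with unit trace, as required by the kMTRL constraints.
Omega = np.array([[0.5, -0.2, 0.3],
                  [-0.2, 0.3, 0.1],
                  [0.3, 0.1, 0.2]])
print(satisfies(Omega, constraints, c=0.1))   # True: 0.3 >= -0.2 + 0.1 and 0.1 >= -0.2 + 0.1
```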
3.2.3.4 Efficient Optimization for kMTRL

The proposed kMTRL is a convex optimization problem, and we propose to solve it with an alternating algorithm:

Step 1: We first optimize the objective function with respect to W and b for a fixed Ω. This step can be solved either through the linear system in [227] or with off-the-shelf solvers such as CVX [69] and FISTA [16]. Different solvers suit different data regimes: first-order solvers such as FISTA are more scalable when there are many samples, while solving the linear system can be more efficient when the feature dimension is high.

Step 2: Given W and b, the objective with respect to Ω admits the analytical solution in Eq. (3.12).

Step 3: Ω is projected onto the convex set T̄ = {Ω | Ω ∈ T, Ω ⪰ 0, tr(Ω) = 1} by solving the Euclidean projection problem

min_Ω ‖Ω − Ω̂‖_F^2  s.t.  Ω ∈ T̄,

where Ω̂ is the analytical solution obtained from Eq. (3.12). This problem can be solved efficiently by a successive projection algorithm [76] that iteratively projects the solution onto each constraint in the set.

The KKT analysis [35] of this projection problem leads to the property summarized in Theorem 1 and to Algorithm 3.2. To simplify the discussion, we require the true pairwise orders to be of the form Ω_{i1,j1} ≥ Ω_{i2,j2}.

Theorem 1. Suppose that T = {Ω : Ω_{i1,j1} ≥ Ω_{i2,j2} + c}. Then, for any Ω ∈ R^{K×K}, the projection of Ω onto the convex set T is given by Proj(Ω) = Ω if Ω ∈ T; otherwise Proj(Ω) = Ω̃, where

Ω̃_{i1,j1} = (Ω_{i1,j1} + Ω_{i2,j2} + c)/2,
Ω̃_{i2,j2} = (Ω_{i1,j1} + Ω_{i2,j2} − c)/2,
Ω̃_{p,q} = Ω_{p,q} for all (p, q) ∉ {(i1, j1), (i2, j2)}.

In practice, the term WᵀW is not guaranteed to be full rank. In fact, in a typical MTL setting W is a low-rank matrix, so the Ω computed by Eq. (3.12) is also rank deficient. Moreover, projecting Ω onto a convex set is itself very likely to produce a singular matrix. The numerical problems that arise when inverting a singular Ω yield a meaningless inverse of the task relation matrix and corrupt the training procedure. We therefore solve a perturbed version of the original objective in Eq. (3.15):

min_{W,b,Ω} F(W, b, Ω) = Σ_{k=1}^{K} (1/n_k) ‖y_k − X_k w_k − b_k 1_{n_k}‖_F^2 + (λ_1/2) tr(WWᵀ) + (λ_2/2) tr(Ω^{-1}(WᵀW + εI)),
s.t. Ω ⪰ 0, tr(Ω) = 1, Ω ∈ T,   (3.16)

where T follows the definition in Eq. (3.14) and ε is a small positive perturbation. As a result, the analytical solution of Ω in Step 2 is replaced by

Ω = (WᵀW + εI)^{1/2} / tr((WᵀW + εI)^{1/2}).   (3.17)

Algorithm 3.1: knowledge-aware Multi-Task Relationship Learning (kMTRL)
Require: training data {X_k, y_k}_{k=1}^{K}, constraint set S, regularization parameters λ_1, λ_2, a positive number c. Randomly initialize W_0; Ω_0 = I/d.
1: while W and Ω have not converged do
2:   compute {W, b} = argmin_{W,b} F(W, b, Ω)
3:   compute Ω using Eq. (3.17)
4:   Ω = Proj(Ω, S, n, c)
5: end while
6: return W, b, Ω

Algorithm 3.2: Projection algorithm
Require: task correlation matrix Ω, constraint set S, max iterations n, a positive number c.
1: for i = 1, ..., n do
2:   for each (i1, j1, i2, j2) ∈ S do
3:     if Ω_{i1,j1} < Ω_{i2,j2} then
4:       update Ω_{i1,j1} ← (Ω_{i1,j1} + Ω_{i2,j2} + c)/2 and Ω_{i2,j2} ← (Ω_{i1,j1} + Ω_{i2,j2} − c)/2, using the values of Ω before this update (Theorem 1)
5:     end if
6:   end for
7:   dynamically update c ← 0.9 c
8:   project Ω onto the positive semi-definite cone
9:   if all constraints are satisfied then break
10: end for
11: return Ω

The algorithm for solving the objective in Eq. (3.16) is presented in Algorithm 3.1. This algorithm can be interpreted as alternately performing supervised and unsupervised steps; the sketch below illustrates the Ω side of one such iteration.
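The following NumPy sketch covers Steps 2 and 3 only; the supervised step (solving for W and b given Ω) is assumed to be handled by a standard ridge-style solver elsewhere. The helper names, the value of eps, and the decision to test the margin c directly inside the sweep are illustrative choices, not details fixed by the dissertation.

```python
import numpy as np

def update_omega(W, eps=1e-4):
    """Perturbed closed-form update of Eq. (3.17): (W^T W + eps*I)^{1/2} / tr(.)."""
    K = W.shape[1]
    M = W.T @ W + eps * np.eye(K)
    vals, vecs = np.linalg.eigh(M)                       # M is symmetric PSD
    root = (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    return root / np.trace(root)

def project_constraints(Omega, constraints, c, max_iter=50):
    """Successive projection onto T (Algorithm 3.2 / Theorem 1), then onto the PSD cone."""
    Omega = Omega.copy()
    for _ in range(max_iter):
        for i1, j1, i2, j2 in constraints:
            a, b = Omega[i1, j1], Omega[i2, j2]          # values before the update
            if a < b + c:                                # constraint violated: half-space projection
                Omega[i1, j1] = 0.5 * (a + b + c)
                Omega[i2, j2] = 0.5 * (a + b - c)
        c *= 0.9                                         # dynamic update of the margin
        vals, vecs = np.linalg.eigh((Omega + Omega.T) / 2)
        Omega = (vecs * np.clip(vals, 0, None)) @ vecs.T  # project back onto the PSD cone
        if all(Omega[i1, j1] >= Omega[i2, j2] + c for i1, j1, i2, j2 in constraints):
            break
    return Omega / np.trace(Omega)                       # keep tr(Omega) = 1

# One alternating iteration, given task models W (d x K) from the supervised step:
W = np.random.default_rng(1).normal(size=(20, 3))
Omega = update_omega(W)
Omega = project_constraints(Omega, [(0, 2, 0, 1)], c=0.05)
```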
In the supervised step we learn the task-specific parameters (W and b). In the unsupervised step we obtain the task relationship matrix Ω from the task parameters. Finally, in the last step we encode the prior knowledge into the task relationship matrix Ω. We repeat these steps iteratively until convergence.

Algorithm 3.3: Query Strategy of Pairwise Constraints
Require: the task correlation matrix Ω, the model parameter matrix W for all tasks, the number of pairwise constraints n to be queried.
1: compute Ω̂ = (WᵀW)^{1/2} / tr((WᵀW)^{1/2})
2: for each candidate quadruple (i1, j1, i2, j2) do
3:   compute ΔΩ_(i1,j1,i2,j2) and ΔΩ̂_(i1,j1,i2,j2)
4: end for
5: for each candidate quadruple (i1, j1, i2, j2) do
6:   compute Inc(i1, j1, i2, j2)
7: end for
8: select the n pairs with the highest scores into the set T
9: return T

Algorithm 3.4: iMTRL framework
Require: training sets {X_k, y_k}_{k=1}^{K}, number of selected queries q, regularization parameters λ_1, λ_2, positive number c, T_0 = ∅.
1: for i = 1, ..., n do
2:   (Ω_i, W_i, b_i) = kMTRL({X_k, y_k}_{k=1}^{K}, T_{i−1}, λ_1, λ_2, c)
3:   T_i = query(W_i, Ω_i, q)
4:   T_i = T_i ∪ T_{i−1}
5: end for
6: Ω = Ω_i, W = W_i, b = b_i
7: return Ω, W, b

3.2.3.5 Batch Mode Pairwise Constraints Active Learning

There are too many possible pairs for human experts to label them all, so the efficiency of the iMTRL framework relies heavily on the quality of the pairs selected by the system. In this subsection, we discuss the important question of how to solicit the domain knowledge efficiently. Specifically, we would like to select the pairs that are most informative to the learning process. We propose an efficient heuristic query strategy, elaborated as follows.

We first design a score function for pairwise constraints based on the inconsistency in the model. To explain the inconsistency, we denote the analytical solution calculated from W as Ω̂ = (WᵀW)^{1/2} / tr((WᵀW)^{1/2}), and the difference between the elements Ω_{i1,j1} and Ω_{i2,j2} of the learned Ω as ΔΩ_(i1,j1,i2,j2) = Ω_{i1,j1} − Ω_{i2,j2}. The inconsistency in the model is then defined as follows:

Definition 2. The inconsistency is defined as

Inc(i1, j1, i2, j2) = sign(i1, j1, i2, j2) · |ΔΩ_(i1,j1,i2,j2) − ΔΩ̂_(i1,j1,i2,j2)|,

where sign(i1, j1, i2, j2) = (ΔΩ_(i1,j1,i2,j2) · ΔΩ̂_(i1,j1,i2,j2)) / |ΔΩ_(i1,j1,i2,j2) · ΔΩ̂_(i1,j1,i2,j2)|.

Inc(i1, j1, i2, j2) captures two types of inconsistency:

Negative inconsistency: if the pairwise orders in the two relationship matrices (Ω and Ω̂) are not consistent, i.e., Ω_{i1,j1} > Ω_{i2,j2} but Ω̂_{i1,j1} < Ω̂_{i2,j2} or vice versa, then Inc(i1, j1, i2, j2) is always negative. The smaller Inc(i1, j1, i2, j2) is, the higher the heuristic score.

Positive inconsistency: if the pairwise orders of the two relationship matrices are consistent, the inconsistency comes from |ΔΩ_(i1,j1,i2,j2) − ΔΩ̂_(i1,j1,i2,j2)|. The larger Inc(i1, j1, i2, j2) is, the higher the heuristic score.

Note that a disagreement in the order of two pairs matters more than a difference in their magnitudes, so all pairs with negative inconsistency have priority over those with positive inconsistency. At the first iteration, before any pairwise constraints are added to the training procedure, the learned Ω is very close to the analytical solution calculated from W, i.e., ΔΩ_(i1,j1,i2,j2) ≈ ΔΩ̂_(i1,j1,i2,j2), except for the disturbance of the numerical term εI. The inconsistency in the first round is therefore caused only by numerical issues, and there is no negative inconsistency at the first training iteration. As constraints are added to the model, inconsistency appears and the query strategy becomes more effective. Algorithm 3.3 describes the query strategy.
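A compact sketch of the score in Definition 2 and of the top-n selection in Algorithm 3.3 follows. Enumerating candidate quadruples over the upper-triangular entries of Ω, and ranking disagreements by how negative Inc is before ranking the consistent pairs by how large Inc is, are my reading of the description above rather than details prescribed by the text.

```python
import numpy as np
from itertools import combinations

def analytic_omega(W):
    """Omega_hat = (W^T W)^{1/2} / tr((W^T W)^{1/2}), as used in Algorithm 3.3."""
    vals, vecs = np.linalg.eigh(W.T @ W)
    root = (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    return root / np.trace(root)

def inconsistency(Omega, Omega_hat, quad):
    """Inc of Definition 2 for one candidate quadruple (i1, j1, i2, j2)."""
    i1, j1, i2, j2 = quad
    d_learned = Omega[i1, j1] - Omega[i2, j2]
    d_analytic = Omega_hat[i1, j1] - Omega_hat[i2, j2]
    sign = np.sign(d_learned * d_analytic)          # -1 when the pairwise orders disagree
    return sign * abs(d_learned - d_analytic)

def query_pairs(Omega, W, n):
    """Pick n candidate constraints: order disagreements first (most negative Inc),
    then the largest positive inconsistencies."""
    K = Omega.shape[0]
    entries = list(combinations(range(K), 2))        # upper-triangular entries of Omega
    quads = [(p[0], p[1], q[0], q[1]) for p, q in combinations(entries, 2)]
    Omega_hat = analytic_omega(W)
    scores = {quad: inconsistency(Omega, Omega_hat, quad) for quad in quads}
    negative = sorted((q for q in quads if scores[q] < 0), key=lambda q: scores[q])
    positive = sorted((q for q in quads if scores[q] >= 0), key=lambda q: -scores[q])
    return (negative + positive)[:n]                 # quadruples to show to the human expert
```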
Finally, we summarize all the procedures of iMTRL in Algorithm 3.4. Line 1 states that n iterations of the learning procedure are conducted. Line 2 corresponds to the knowledge-aware MTL step in our iMTRL framework. Line 3 solicits the domain knowledge, and line 4 answers the query and encodes the knowledge into the model.

3.2.4 Experiments

3.2.4.1 Importance of High-Quality Task Relationship

In this subsection, we conduct experiments to show that encoding an accurate task relationship significantly enhances the performance of MTRL. The effectiveness of MTRL has already been demonstrated in [227], in which the authors showed that MTRL can infer an accurate task relationship from a relatively clean dataset with sufficient training samples. Here we use a toy example to show that MTRL infers a misleading relationship when noise is present and training samples are insufficient. The toy dataset is generated as follows. There are three tasks with data sampled from y = 3x + 10, y = −2x + 5 and y = 10x + 1, respectively. For each task we generate 5 samples from a uniform distribution on [0, 10]. The function outputs of the three tasks are corrupted by Gaussian noise with zero mean and standard deviation equal to 30, 10 and 10, respectively. According to the generative regression functions, we expect the correlation between the first task and the third task to be close to 1, and the correlations of the remaining pairs to be close to −1. We use the linear kernel of MTRL with λ_1 = 0.01 and λ_2 = 0.05. The learned Ω gives the following correlation matrix:

[  1        0.9999   −0.9999 ]
[  0.9999   1        −1      ]
[ −0.9999  −1         1      ]

From the above matrix we see that the learned relationship for task 1 is opposite to the expected relationship, because of the highly noisy data. This leads to a suboptimal solution W = [−3.7283, 2.6605, 3.0105], as compared to the ground truth W = [3, −2, 10]. On the other hand, if we encode the true task relationship by fixing Ω to the ground truth during the learning process, with exactly the same parameter setting as above, we learn a model W = [0.6850, −0.3878, 2.5840] that is closer to the ground truth in terms of the ℓ2 norm and keeps the correct task relationship. This procedure is denoted as truth-encoded multi-task relationship learning (eMTRL) in this subsection.

This observation motivates us to further explore the effectiveness of eMTRL. We created a synthetic dataset by generating K = 10 task parameters w_i and b_i from a uniform distribution between 0 and 1. Each task contains 25 samples drawn from a Gaussian distribution with zero mean and variance equal to 10. The function response is also corrupted by Gaussian noise with zero mean and a variance of 5. We split this synthetic dataset into training, validation and testing sets. Out of the 25 samples for each task, 20% are used for training, 30% for validation and 50% for testing. We fix the number of samples and the number of tasks and vary the number of features from 20 to 100. The parameters λ_1 and λ_2 are tuned in [1×10^{-3}, 1×10^{-2}, 1×10^{-1}] and [0, 1×10^{-3}, 1×10^{-2}, 1×10^{-1}, 1, 10, 1×10^{2}, 1×10^{3}], respectively.

The performance is evaluated using the Root Mean Square Error (RMSE) and the Frobenius norm between the learned task model and the ground-truth task model. The results shown in Figure 3.5 indicate that encoding the knowledge about the task relationship significantly benefits the prediction. Even though eMTRL is not a practical model, because we can never know the true task relationship, the experimental results confirm that there is a huge potential to improve predictive performance if we can take advantage of domain knowledge. The experimental results in the next section will show how to efficiently solicit the domain knowledge about the task relationship and incorporate it into the learning.

Table 3.4: The average RMSE of the query and random strategies on the testing dataset over 5 random splittings of training and validation samples.
number of constraints    0        5        10       15       20       25       30       35       40
Query Strategy           1.1387   1.1267   1.1224   1.1117   1.1125   1.1101   1.1102   1.1137   1.1168
Random Selection         1.1387   1.1255   1.1390   1.1284   1.1165   1.1285   1.1379   1.1382   1.1364

Table 3.5: The RMSE comparison of kMTRL and baselines.
School | RR            | MTL-L         | MTL-L21       | MTRL          | kMTRL-20      | kMTRL-40      | kMTRL-60      | kMTRL-80
5%     | 1.1737±0.0041 | 1.1799±0.0047 | 1.176±0.0043  | 1.0615±0.0167 | 1.0584±0.0128 | 1.0553±0.0155 | 1.0551±0.0158 | 1.0551±0.0159
10%    | 1.1428±0.0306 | 1.1485±0.0293 | 1.1477±0.0282 | 0.9872±0.0057 | 0.9823±0.0030 | 0.9805±0.0014 | 0.9803±0.0018 | 0.9803±0.0018
15%    | 1.0665±0.0395 | 1.0699±0.0405 | 1.0700±0.0399 | 0.9491±0.0060 | 0.9334±0.0057 | 0.9321±0.0081 | 0.9322±0.0083 | 0.9323±0.0082
20%    | 0.9756±0.0157 | 0.9774±0.0153 | 0.9776±0.0149 | 0.9047±0.0031 | 0.8966±0.0123 | 0.8906±0.0123 | 0.8844±0.0022 | 0.8843±0.0019
MMSE   | RR            | MTL-L         | MTL-L21       | MTRL          | kMTRL-5       | kMTRL-10      | kMTRL-15      | kMTRL-20
2%     | 0.9503±0.1467 | 0.9319±0.1497 | 0.9314±0.1693 | 0.9106±0.0976 | 0.9113±0.0982 | 0.9058±0.0926 | 0.9058±0.0926 | 0.9058±0.0926

Figure 3.5: Performance of MTRL and eMTRL as the number of features changes, in terms of (a) Frobenius norm and (b) RMSE. MTRL [227] learns both the task models and the task relationship at the same time, while eMTRL here learns the task models while the task relationship Ω is fixed to the ground truth, i.e., encoding the correct domain knowledge about the task relationship.

3.2.4.2 Effectiveness of Query Strategy

In this subsection, we conduct experiments to show that encoding the domain knowledge in the form of partial orders is useful. We follow the same synthetic dataset with 20 feature dimensions generated above. The same splitting of training, testing and validation data and the same 5-fold random-split validation are applied. The parameters λ_1 and λ_2 are tuned in [1×10^{-3}, 1×10^{-2}, 1×10^{-1}] and [0, 1×10^{-3}, 1×10^{-2}, 1×10^{-1}, 1, 10, 1×10^{2}, 1×10^{3}], respectively. After the learning algorithm converges, we compare the pairwise constraints chosen by the proposed query strategy against those chosen by random selection. The results of the two strategies are reported in Table 3.4. We see the trend that both the proposed query strategy and random selection reach better generalization performance as the number of incorporated pairwise constraints increases. To be more specific, the result in the first column is worse than all the results using the query strategy and most of the results using random selection. This shows that soliciting the domain knowledge in terms of pairwise constraints is effective. On the other hand, when comparing the results of the proposed query strategy and random selection, we see that our query strategy selects important pairwise constraints, leading to a better model than the random query. When the number of pairwise constraints is larger than 5, the proposed query strategy works consistently better than random selection.

3.2.4.3 Interactive Scheme for Query Strategy

To further analyze our query strategy, we also explore different interactive schemes for it. There are multiple ways to query a certain amount of partial orders: we can either query many times with less labeling effort each time, or vice versa. We use kMTRL-a-b to denote a total of b constraints, with a constraints queried each time (the human expert needs to interact with the system b/a times). The choice of interactive scheme strongly affects the user experience. For example, kMTRL-10-100 needs to query the experts 10 times, and the experts need to label 10 constraints each time; it also takes 10 training iterations, which is much more expensive than other schemes. In contrast, kMTRL-100-100 only needs to query the experts once, which is the most efficient scheme. However, this scheme cannot benefit from the iterative process of iMTRL. The pairwise constraints added in previous iterations will affect the model and won't be selected again. This will reveal other important constraints.
Takingaoneiterationschemecannotutilizethisinformation.Theresultsaresummarized 64 Figure3.6:TheaveragedRMSEofkMTRLusingdi˙erentsettingofquerystrategy.The kMTRL-10-100meansselecting10pairwiseconstraintsattheendofeachiteration,start fromzero,add10pairwiseconstraintsatatime,until100constraints.Forall4schemes, kMTRLwithzeroconstraintsisequivalenttoMTRL.Resultsaretheaverageover5fold randomsplitting. inFigure3.6.WeseethatkMTRL-50-100achievesthebestperformance.Therefore,the bestschemeindicatethatourquerystrategyismostlye˙ectivewhenwebalancethetwo parameters,andthusitdoesnotrequireintensivelyinteractionwithexpertsandmeanwhile utilizesthepreviousinformatione˙ectively 1 . 3.2.4.4PerformanceonRealDatasets Theschooldatasetisawidelyusedbenchmarkdatasetformulti-taskregressionproblem.It contains15372studentswith28featuresfrom139secondaryschoolsintheyearof1985,1986 and1987,providedbytheInnerLondonEducationAuthority(ILEA).Thetaskistopredict thescoreforstudentsin139schools.Theexperimentalsettingsareexplainedasfollows.We ˝rstsplitthedatasetintotraining,validationandtestingdatasets.Thepercentageoftesting samplesvariesfrom 10% to 25% ofallsampleseachtasksinoriginaldataset.Takingthe 10% testingdatasetasanexample,weperform3-foldrandomsplitontherest 90% data. 1 Codeispubliclyavailableat https://github.com/illidanlab/iMTL 65 Eachfoldhas 20% samplesfortrainingand 70% fortesting.Thesamerandomsplittingare appliedtothethreedatasets. AnotherrealdatasetweusedhereisAlzheimer'sDiseaseNeuroimagingInitiative(ADNI) database 2 .Theexperimentalsetupissameasdescribedinthepaper[ 235 ].Thegoalis topredictthesuccessivecognitionstatusofpatientsbasedonthemeasurementsatthe screeningorthebaselinevisit.Weuse 2% samplesfortraining, 10% fortestingandthe restforvalidation.Wealsoperform3-foldrandomsplitonthisdataset.Thepredictive performanceofthecompetingmethodslistedbelowarereportedontherealdatasets: RR:Thisapproachreferstoridgeregression. MTL-L:Thisapproachreferstothelow-rankmulti-tasklearningwithtracenorm regularization[10]. MTL-L21:Thisapproachreferstomulti-taskjointfeaturelearningusing l 2 ; 1 norm thatselectsasubsetoffeaturessharedbyalltasks[121]. MTRL:Thisapproachreferstothemulti-taskrelationshiplearningaswedescribedin Section3.2.3[227]. kMTRL- N :ThisapproachreferstotheproposedkMTRLmethodwith N pairwise encodedintothemodel. Wetunetheregularizationparameterson W in [ 1 10 3 ; 1 10 2 ; 1 10 1 ] forRR, MTL-LandMTL-L21.Theregularizationparameters 1 and 2 inEq.(3.16)aretunedin [ 1 10 3 , 1 10 2 , 1 10 1 ]and[ 0 , 1 10 3 , 1 10 2 , 1 10 1 , 1 , 10 , 1 10 2 , 1 10 3 ] respectively.Thebestparametersareselectedbasedontheperformanceonthevalidation 2 Dataispubliclyavailableat http://adni.loni.usc.edu/ 66 Table3.6:ThenameofthebrainregionsinFigure3.8,where(C)denotescorticalparcellation and(W)denoteswhitematterparcellation. 
#  | Intra-region                        | Inter-region (Row)                   | Inter-region (Column)
1  | (C) Right Caudal Middle Frontal     | (W) Right Putamen                    | (C) Right Inferior Temporal
2  | (C) Right Pericalcarine             | (W) Left Cerebral Cortex             | (C) Left Rostral Middle Frontal
3  | (W) Corpus Callosum Mid Anterior    | (W) Right Ventral Diencephalon       | (C) Right Pars Triangularis
4  | (W) Right Cerebellum Cortex         | (C) Right Caudal Anterior Cingulate  | (C) Right Precentral
5  | (W) Corpus Callosum Central         | (C) Left Temporal Pole               | (C) Right Medial Orbitofrontal
6  | (C) Left Bankssts                   | (C) Right Postcentral                | (C) Left Pars Triangularis
7  | (C) Right Pars Opercularis          | (C) Right Precentral                 | (C) Right Superior Parietal
8  | (C) Left Isthmus Cingulate          | (W) Right Cerebral Cortex            | (C) Right Inferior Parietal
9  | (C) Left Supramarginal              | (C) Left Isthmus Cingulate           | (C) Left Pars Orbitalis
10 | (C) Right Inferior Temporal         | (C) Left Superior Frontal            | (W) Corpus Callosum Central

set. The performance of the learned models is measured by RMSE on the testing dataset. The experimental results are shown in Table 3.5, from which we see that kMTRL achieves the best results. In this experiment, we adopt the scheme kMTRL-20-80 for the School dataset and kMTRL-5-20 for the MMSE dataset, as described in the previous subsection.

3.2.5 Case Study: Brain Atrophy and Alzheimer's Disease

In this section we apply the proposed iMTRL framework to study brain atrophy patterns and how changes in the brain are associated with different clinical dementia scores and symptoms related to Alzheimer's disease (AD). It is estimated that 5 million Americans currently have AD, and AD has become one of the leading causes of death in the United States. Since AD is characterized by structural atrophy in the brain, there is a pressing demand for understanding how brain atrophy is related to the progression of the disease.

In this work we study how the structural features of brain regions can be related to 51 cognitive markers, such as the Alzheimer's Disease Assessment Scale (ADAS), Clinical Dementia Rating (CDR), Global Deterioration Scale (GDS), Hachinski, Neuropsychological Battery, WMS-R Logic, and other neuropsychological assessment scores. We are interested in predicting the volume of brain areas extracted from structural magnetic resonance imaging (MRI).

Figure 3.7: The distribution of competence on (a) intra-region covariance and (b) inter-region covariance. kMTRL performs better than MTRL when competence > 1. Higher competence indicates better performance achieved by kMTRL as compared to MTRL. We see that in a majority of regions kMTRL outperforms MTRL.

Figure 3.8: Comparison of sub-matrices of covariance among (left) the task covariance using 90% of all data points, which is considered as ground truth, (middle) the covariance matrix learned via MTRL on 20% of the data, and (right) the covariance matrix learned via kMTRL on 20% of the data with 0.8% pair-wise constraints queried by the proposed query scheme. (a) Intra-region covariance; (b) inter-region covariance.

We use the ADNI cohort consisting of 648 subjects whose baseline MRI images passed quality control. We used the FreeSurfer tool to extract the 99 brain volumes from regions of interest (ROIs) of the baseline MRI images. Considering the prediction of the volume of each ROI as a learning task, we thus have a collection of 99 learning tasks, with each task having 648 samples and 51 features. Since the brain regions are related during the aging process and Alzheimer's progression, the MTL approach can be used to improve the performance by considering such relatedness among brain regions.
Weadoptthesameexperimentalsettingasinthepreviousexperiments,wherewecompare theMTRLwiththeproposedkMTRLbyqueryingandaddingpair-wiseexpertknowledge 68 andinspectingthee˙ectivenessofthequeriedtaskrelationshipsupervision.Weshowthe di˙erencesamongthe(1)taskcovarianceusing 90% alldatapointsthatisconsideredas (2)thecovariancematrixlearnedviaMTRLon 10% dataand(3)the covariancematrixlearnedviakMTRLon 10% datawith0.8%pair-wiseconstraintsqueried bytheproposedqueryscheme.Sincethecomplete 99 99 covariancematricesarehardto visualize,wechooseinvestigatetwotypesofsubregionsofthecovariancematrices:(a)a randomintraregionofthecovarianceofthesize 10 10 (rowregionsandcolumnregionsare thesame)and(b)arandominterregionofthecovarianceofthesize 10 10 (rowregionsand columnregionsaredi˙erent).Wede˝nethe competence metrictoquantifyhowthequality ofthesub-covariance: k MTRL real k F = k kMTRL real k F ; (3.18) wherethekMTRLperformsbetterthanMTRLwhencompetence > 1 ,andthehigherthe better.Werepeatedlychooserandomsub-covariancesandthedistributionofthecompetence isshownintheFigure3.7,indicatingthatinamajorityofcasesknowledgecanimprove relationshipestimation. Wevisualizetwosub-covariancematricesinFigure3.8,whoseregionsareshownin Table3.6.InFigure3.8(a),weseethatthecovariancesfromboththegroundtruthandthe kMTRLdiscouragethepositiveknowledgetransferfrom RightCerebellumCortex ,which agreeswiththepathologicalcharacteristicsofAD[ 182 ],wherecerebellumdoesnotcorrelate withtheprogressionofAD.Alsothepositivecorrelationbetween CorpusCallosumMid Anterior and CorpusCallosumCentral isidenti˝edinboththegroundtruthandthekMTRL, andignoredbyMTRL.Thesigni˝cantreducedcorpuscallosumsizewaspreviouslyreported 69 inADstudies[ 192 ],andtheprogressionpatternsofthetworegionscanbesimilarbecauseof thephysicaldistancebetweenthetworegions.Figure3.8(b),weseethattheunsubstantiated strongcorrelationbetween RightPrecentral and LeftParsTriangularis asfoundinMTRLhas beenlargelysuppressedbythedomainknowledge.However,sinceweonlyspeci˝edpartial orderrelationship,therearechancestheproposedkMTRLalgorithmmayvthe supervision,aswenoticethatsomeunsubstantiatedpositivecorrelationsinvolving Right VentralDiencephalon areintroducedtothecovariance.Weplantofurtherelaboratethe ˝ndingsandclinicalinsightsofADanddementiainthejournalextensionofthispaper. 70 Chapter4 Data-DrivenCollaborativeLearning Inthischapter,wediscussdata-drivencollaborationinreinforcementlearning.Morespeci˝- cally,we˝rstproposeacollaborativedeepreinforcementlearningframeworkthatcanaddress theknowledgetransferamongheterogeneoustasks.Underthisframework,weproposedeep knowledgedistillationtoadaptivelyalignthedomainofdi˙erenttaskswiththeutilizationof deepalignmentnetwork.Secondly,wefurtherconstructheterogeneouslearningagentsin thesametasktoimproveitssample-e˚ciency.Thecentralideaistodisentangleexploration andexploitationagentsandthenconductdata-driventransferthroughimitationlearning, whichleadstoano˙-policylearningframeworklargelyfacilitatesthelearninge˚ciency.The o˙-policylearningframeworkusesgeneralizedpolicyiterationforexplorationandexploits thestablenessofsupervisedlearningforderivingpolicy,whichaccomplishestheunbiasedness, variancereduction,o˙-policylearning,andsamplee˚ciencyatthesametime. 4.1CollaborativeDeepReinforcementLearning 4.1.1Introduction Ontheotherhand,thestudyofhumanlearninghaslargelyadvancedthedesignofmachine learninganddataminingalgorithms,especiallyinreinforcementlearningandtransferlearning. 
Therecentsuccessofdeepreinforcementlearning(DRL)hasattractedincreasingattention 71 Figure4.1:IllustrationofCollaborativeDeepReinforcementLearningFramework. fromthecommunity,asDRLcandiscoververycompetitivestrategiesbyhavinglearning agentsinteractingwithagivenenvironmentandusingrewardsfromtheenvironmentas thesupervision(e.g.,[ 132 , 86 , 107 , 174 ]).EventhoughmostofcurrentresearchonDRL hasfocusedonlearningfromgames,itpossessesgreattransformativepowertoimpact manyindustrieswithdataminingandmachinelearningtechniquessuchasclinicaldecision support[ 193 ],marketing[ 3 ],˝nance[ 2 ],visualnavigation[ 236 ],andautonomousdriving[ 32 ]. Althoughtherearemanyexistinge˙ortstowardse˙ectivealgorithmsforDRL[ 131 , 137 ], thecomputationalcoststillimposessigni˝cantchallengesastrainingDRLforevenasimple gamesuchas Pong [ 24 ]remainsveryexpensive.Theunderlyingreasonsfortheobstacle ofe˚cienttrainingmainlylieintwoaspects:First,thesupervision(rewards)fromthe environmentisverysparseandimplicitduringtraining.Itmaytakeanagenthundredsor eventhousandsactionstogetasinglereward,andwhichactionsthatactuallyleadtothis rewardareambiguous.Besidestheinsu˚cientsupervision,trainingdeepneuralnetwork itselftakeslotsofcomputationalresources. 72 Duetotheaforementioneddi˚culties,performingknowledgetransferfromotherrelated tasksorwell-traineddeepmodelstofacilitatetraininghasdrawnlotsofattentioninthe community[ 159 , 191 , 151 , 86 , 166 ].Existingtransferlearningcanbecategorizedintotwo classesaccordingtothemeansthatknowledgeistransferred: datatransfer [ 82 , 151 , 166 ]and modeltransfer [ 53 , 227 , 229 , 151 ].Modeltransfermethodsimplementknowledgetransfer fromintroducinginductivebiasduringthelearning,andhasbeenextensivelystudiedin bothtransferlearning/multi-tasklearning(MTL)communityanddeeplearningcommunity. Forexample,intheregularizedMTLmodelssuchas[ 55 , 233 ],taskswiththesamefeature spacearerelatedthroughsomestructuredregularization.Anotherexampleisthemulti-task deepneuralnetwork,wheredi˙erenttaskssharepartsofthenetworkstructures[ 229 ].One obviousdisadvantageofmodeltransferisthelackof˛exibility:usuallythefeasibilityof inductivetransferhaslargelyrestrictedthemodelstructureoflearningtask,whichmakes itnotpracticalinDRLbecausefordi˙erenttaskstheoptimalmodelstructuresmaybe radicallydi˙erent.Ontheotherhand,therecentlydevelopeddatatransfer(alsoknownas knowledgedistillationormimiclearning)[ 82 , 166 , 151 ]embedsthesourcemodelknowledge intodatapoints.Thentheyareusedasknowledgebridgetotraintargetmodels,whichcan havedi˙erentstructuresascomparedtothesourcemodel[ 82 , 25 ].Becauseofthestructural ˛exibility,thedatatransferisespeciallysuitabletodealwithstructurevariantmodels. TherearetwosituationsthattransferlearningmethodsareessentialinDRL: Certi˝catedheterogeneoustransfer. TrainingaDRLagentiscomputationalexpensive. Ifwehaveawell-trainedmodel,itwillbebene˝cialtoassistthelearningofothertasksby transferringknowledgefromthismodel.Thereforeweconsiderfollowingresearchquestion: Givenone certi˝cated task(i.e.themodeliswell-designed,extensivelytrainedandperforms verywell),howcanwemaximizetheinformationthatcanbeusedinthetrainingofother 73 relatedtasks?Somemodeltransferapproachesdirectlyusetheweightsfromthetrainedmodel toinitializethenewtask[ 151 ],whichcanonlybedonewhenthemodelstructuresarethesame. 
Thus,thisstrictrequirementhaslargelylimiteditsgeneralapplicabilityonDRL.Ontheother hand,theinitializationmaynotworkwellifthetasksaresigni˝cantlydi˙erentfromeach otherinnature[ 151 ].Thischallengecouldbepartiallysolvedbygeneratinganintermediate dataset(logits)fromtheexistingmodeltohelplearningthenewtask.However,newproblems wouldarisewhenwearetransferringknowledgebetween heterogeneoustasks .Notonlythe actionspacesaredi˙erentindimension,theintrinsicactionprobabilitydistributionsand semanticmeaningsoftwotaskscoulddi˙eralot.Speci˝cally,oneactionin Pong may refertomovethepaddleupwardswhilethesameactionindexin Riverraid [ 24 ]would correspondto˝re.Therefore,thedistilleddatasetgeneratedfromthetrainedsourcetask cannotbedirectlyusedtotraintheheterogeneoustargettask.Inthisscenario,the˝rstkey challengeweidenti˝edinthisworkisthathowtoconductdatatransferamongheterogeneous taskssothatwecanmaximallyutilizetheinformationfromacerti˝catedmodelwhilestill maintainthe˛exibilityofmodeldesignfornewtasks.Duringthetransfer,thetransferred knowledgefromothertasksmaycontradicttotheknowledgethatagentslearnedfromits environment.Onerecentlywork[ 159 ]useanattentionnetworkselectiveeliminatetransferif thecontradictionpresents,whichisnotsuitableinthissettingsincewearegivenacerti˝cated tasktotransfer.Hence,thesecondchallengeishowtoresolvethecon˛ictandperforma meaningfultransfer. Lackofexpertise. AmoregeneraldesiredbutalsomorechallengingscenarioisthatDRL agentsaretrainedformultipleheterogeneoustaskswithoutanypre-trainedmodelsavailable. Onefeasiblewaytoconducttransferunderthisscenarioisthatagentsofmultipletasksshare partoftheirnetworkparameters[ 229 , 166 ].However,aninevitabledrawbackis,multiple 74 modelslosetheirtask-speci˝cdesignssincethesharedpartneedstobethesame.Another solutionistolearnadomaininvariantfeaturespacesharedbyalltasks[ 4 ].However,some task-speci˝cinformationisoftenlostwhileconvertingtheoriginalstatetoanewfeature subspace.Inthiscase,anintriguingquestionsisthat:canwedesignaframeworkthat fullyutilizestheoriginalenvironmentinformationandmeanwhileleveragestheknowledge transferredfromothertasks? Thispaperinvestigatestheaforementionedproblemssystematicallyandproposesanovel CollaborativeDeepReinforcementLearning(CDRL)framework(illustratedinFigure4.1)to resolvethem.Ourmajorcontributionisthreefold: First,inordertotransferknowledgeamongheterogeneoustaskswhileremainingthe task-speci˝cdesignofmodelstructure,anoveldeepknowledgedistillationisproposed toaddresstheheterogeneityamongtasks,withtheutilizationofdeepalignmentnetwork designedforthedomainadaptation. Second,inordertoincorporatethetransferredknowledgefromheterogeneoustasksinto theonlinetrainingofcurrentlearningagents,similartohumancollaborativelearning,an e˚cientcollaborativeasynchronouslyadvantageactor-criticlearning(cA3C)algorithm isdevelopedundertheCDRLframework.IncA3C,thetargetagentsareabletolearn fromenvironmentsanditspeerssimultaneously,whichalsoensuretheinformationfrom originalenvironmentissu˚cientlyutilized.Further,theknowledgecon˛ictamong di˙erenttasksisresolvedbyaddinganextradistillationlayertothepolicynetwork underCDRLframework,aswell. LastbutnotleastwepresentextensiveempiricalstudiesonOpenAIgymtoevaluate theproposedCDRLframeworkanddemonstrateitse˙ectivenessbyachievingmore 75 than10%performanceimprovementcomparedtothecurrentstate-of-the-art. 
Notations: Inthispaper,weuseteachernetwork/sourcetaskdenotesthenetwork/task containedtheknowledgetobetransferredtoothers.Similarly,thestudentnetwork/target taskisreferredtothosetasksutilizingtheknowledgetransferredfromotherstofacilitateits owntraining.Theexpertnetworkdenotesthenetworkthathasalreadyreachedarelative highaveragedrewardinitsownenvironment.InDRL,anagentisrepresentedbyapolicy networkandavaluenetworkthatshareasetofparameters.Homogeneousagentsdenotes agentsthatperformandlearnunderindependentcopiesofsameenvironment.Heterogeneous agentsrefertothoseagentsthataretrainedindi˙erentenvironments. 4.1.2RelatedWork Multi-agentlearning. Onecloselyrelatedareatoourworkismulti-agentreinforcement learning.Amulti-agentsystemincludesasetofagentsinteractinginoneenvironment. Meanwhiletheycouldpotentiallyinteractwitheachother[ 28 , 103 , 73 , 190 ].Incollaborative multi-agentreinforcementlearning,agentsworktogethertomaximizeasharedrewardmea- surement[ 103 , 73 ].ThereisacleardistinctionbetweentheproposedCDRLframeworkand multi-agentreinforcementlearning.InCDRL,eachagentinteractswithitsownenvironment copyandthegoalistomaximizetherewardofthetargetagents.Theformalde˝nitionof theproposedframeworkisgiveninSection4.1.5. Transferlearning. Anotherrelevantresearchtopicisdomainadaptioninthe˝eldof transferlearning[ 149 , 183 , 200 ].Theauthorsin[ 183 ]proposedatwo-stagedomainadaptation frameworkthatconsidersthedi˙erencesamongmarginalprobabilitydistributionsofdomains, aswellasconditionalprobabilitydistributionsoftasks.Themethod˝rstre-weightsthedata 76 fromthesourcedomainusingMaximumMeanDiscrepancyandthenre-weightsthepredictive functioninthesourcedomaintoreducethedi˙erenceonconditionalprobabilities.In[ 200 ], themarginaldistributionsofthesourceandthetargetdomainarealignedbytraining anetwork,whichmapsinputsintoadomaininvariantrepresentation.Also,knowledge distillationwasdirectlyutilizedtoalignthesourceandtargetclassdistribution.Oneclear limitationhereisthatthesourcedomainandthetargetdomainarerequiredtohavethe samedimensionality(i.e.numberofclasses)withsamesemanticsmeanings,whichisnotthe caseinourdeepknowledgedistillation. In[ 4 ],aninvariantfeaturespaceislearnedtotransferskillsbetweentwoagents.However, projectingthestateintoafeaturespacewouldloseinformationcontainedintheoriginalstate. Thereisatrade-o˙betweenlearningthecommonfeaturespaceandpreservingthemaximum informationfromtheoriginalstate.Inourwork,weusedatageneratedbyintermediate outputsintheknowledgetransferinsteadofasharedspace.Ourapproachthusretains completeinformationfromtheenvironmentandensureshighqualitytransfer.Therecently proposedA2Tapproach[ 159 ]canavoidnegativetransferamongdi˙erenttasks.However, itispossiblethatsomenegativetransfercasesmaybecauseoftheinappropriatedesignof transferalgorithms.Inourwork,weshowthatwecanperformsuccessfultransferamong tasksthatseeminglycausenegativetransfer. Knowledgetransferindeeplearning. 
Sincethetrainingofeachagentinanenvironment canbeconsideredasalearningtask,andtheknowledgetransferamongmultipletasksbelongs tothestudyofmulti-tasklearning.Themulti-taskdeepneuralnetwork(MTDNN)[ 229 ] transfersknowledgeamongtasksbysharingparametersofseverallow-levellayers.Since thelow-levellayerscanbeconsideredtoperformrepresentationlearning,theMTDNN islearningasharedrepresentationforinputs,whichisthenusedbyhigh-levellayersin 77 thenetwork.Di˙erentlearningtasksarerelatedtoeachotherviathissharedfeature representation.IntheproposedCDRL,wedonotusethesharerepresentationduetothe inevitableinformationlosswhenweprojecttheinputsintoasharedrepresentation.We insteadperformexplicitlyknowledgetransferamongtasksbydistillingknowledgethatare independentofmodelstructures.In[ 82 ],theauthorsproposedtocompresscumbersome models(teachers)tomoresimplemodels(students),wherethesimplemodelsaretrainedby adataset(knowledge)distilledfromtheteachers.However,thisapproachcannothandlethe transferamongheterogeneoustasks,whichisonekeychallengeweaddressedinthispaper. Knowledgetransferindeepreinforcementlearning. Knowledgetransferisalsostudied indeepreinforcementlearning.[ 131 ]proposedmulti-threadedasynchronousvariantsofseveral mostadvanceddeepreinforcementlearningmethodsincludingSarsa,Q-learning,Q-learning andadvantageactor-critic.Amongallthosemethods,asynchronousadvantageactor-critic (A3C)achievesthebestperformance.Insteadofusingexperiencereplayasinpreviouswork, A3Cstabilizesthetrainingprocedurebytrainingdi˙erentagentsinparallelusingdi˙erent explorationstrategies.Thiswasshowntoconvergemuchfasterthanpreviousmethodsand uselesscomputationalresources.WeshowinSection4.1.5thattheA3Cissubsumedto theproposedCDRLasaspecialcase.In[ 151 ],asinglemulti-taskpolicynetworkistrained byutilizingasetofexpertDeepQ-Network(DQN)ofsourcegames.Atthisstage,the goalistoobtainapolicynetworkthatcanplaysourcegamesasclosetoexpertsaspossible. Thesecondstepistotransfertheknowledgefromsourcetaskstoanewbutrelatedtarget task.TheknowledgeistransferredbyusingtheDQNinlaststepastheinitializationof theDQNforthenewtask.Assuch,thetrainingtimeofthenewtaskcanbesigni˝cantly reduced.Di˙erentfromtheirapproach,theproposedtransferstrategyisnottodirectly mimicexperts'actionsorinitializebyapre-trainedmodel.In[ 166 ],knowledgedistillation 78 wasadoptedtotrainamulti-taskmodelthatoutperformssingletaskmodelsofsometasks. Theexpertsforalltasksare˝rstlyacquiredbysingletasklearning.Theintermediateoutputs fromeachexpertarethendistilledtoasimilarmulti-tasknetworkwithanextracontroller layertocoordinatedi˙erentactionsets.Oneclearlimitationisthatmajorcomponentsof themodelareexactlythesamefordi˙erenttasks,whichmayleadtodegradedperformance onsometasks.Inourwork,transfercanhappenevenwhentherearenoexpertsavailable. Also,ourmethodalloweachtasktohavetheirownmodelstructures.Furthermore,even themodelstructuresarethesameformultipletasks,thetasksarenottrainedtoimprove theperformanceofothertasks(i.e.itdoesnotmimicexpertsfromothertasksdirectly). Thereforeourmodelcanfocusonmaximizingitsownreward,insteadofbeingdistractedby others. 
4.1.3Background 4.1.3.1ReinforcementLearning Inthiswork,weconsiderthestandardreinforcementlearningsettingwhereeachagent interactswithit'sownenvironmentoveranumberofdiscretetimesteps.Giventhecurrent state s t 2S atstep t ,agent g i selectsanaction a t 2A accordingtoitspolicy ˇ ( a t j s t ) , andreceivesareward r t +1 fromtheenvironment.Thegoaloftheagentistochoosean action a t atstep t thatmaximizethesumoffuturerewards f r t g inadecayingmanner: R t = P 1 i =0 i r t + i ,wherescalar 2 (0 ; 1] isadiscountrate.Basedonthepolicy ˇ ofthis agent,wecanfurtherde˝neastatevaluefunction V ( s t )= E [ R t j s = s t ] ,whichestimatesthe expecteddiscountedreturnstartingfromstate s t ,takingactionsfollowingpolicy ˇ untilthe gameends.Thegoalinreinforcementlearningalgorithmistomaximizetheexpectedreturn. 79 Sincewearemainlydiscussingonespeci˝cagent'sdesignandbehaviorthroughoutthepaper, weleaveoutthenotationoftheagentindexforconciseness. 4.1.3.2AsynchronousAdvantageactor-criticalgorithm(A3C) Theasynchronousadvantageactor-critic(A3C)algorithm[ 131 ]launchesmultipleagentsin parallelandasynchronouslyupdatesaglobalsharedtargetpolicynetwork ˇ ( a j s; p ) aswell asavaluenetwork V ( s; v ) .parametrizedby p and v ,respectively.Eachagentinteracts withtheenvironment,independently.Ateachstep t theagenttakesanactionbasedon theprobabilitydistributiongeneratedbypolicynetwork.Afterplayingan-steprolloutor reachingtheterminalstate,therewardsareusedtocomputetheadvantagewiththeoutput ofvaluefunction.Theupdatesofpolicynetworkisconductedbyapplyingthegradient: r p log ˇ ( a t j s t ; p ) A ( s t ;a t ; v ) ; wheretheadvantagefunction A ( s t ;a t ; v ) isgivenby: X T t 1 i =0 i r t + i + T t V ( s T ; v ) V ( s t ; v ) : Term T representsthestepnumberforthelaststepofthisrollout,itiseitherthemax numberofrolloutstepsorthenumberofstepsfrom t totheterminalstate.Theupdateof valuenetworkistominimizethesquareddi˙erencebetweentheenvironmentrewardsand valuefunctionoutputs,i.e., min v ( X T t 1 i =0 i r t + i + T t V ( s T ; v ) V ( s t ; v )) 2 : 80 Thepolicynetworkandthevaluenetworksharethesamelayersexceptforthelastoutput layer.Anentropyregularizationofpolicy ˇ isaddedtoimproveexploration,aswell. 4.1.3.3Knowledgedistillation Knowledgedistillation[ 82 ]isatransferlearningapproachthatdistillstheknowledgefroma teachernetworktoastudentnetworkusingatemperatureparameterized"softtargets"(i.e. aprobabilitydistributionoverasetofclasses).Ithasbeenshownthatitcanacceleratethe trainingwithlessdatasincethegradientfrom"softtargets"containsmuchmoreinformation thanthegradientobtainedfrom"hardtargets"(e.g.0,1supervision). Tobemorespeci˝c,logitsvector z 2R d for d actionscanbeconvertedtoaprobability distribution h 2 (0 ; 1) d byasoftmaxfunction,raisedwithtemperature ˝ : h ( i )= softmax ( z =˝ ) i = exp ( z ( i ) =˝ ) P j exp ( z ( j ) =˝ ) ; (4.1) where h ( i ) and z ( i ) denotesthe i -thentryof h and z ,respectively. ThentheknowledgedistillationcanbecompletedbyoptimizethefollowingKullback- Leiblerdivergence(KL)withtemperature ˝ [166,82]. L KL ( D; p )= X t =1 softmax ( z t =˝ )ln softmax ( z t =˝ ) softmax ( z t ) (4.2) where z t isthelogitsvectorfromteachernetwork(notation representsteacher)atstep t ,while z t isthelogitsvectorfromstudentnetwork(notation representsstudent)ofthis step. p denotestheparametersofthestudentpolicynetwork. D isasetoflogitsfrom teachernetwork. 
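The distillation loss above is easy to state in code. The sketch below follows Eq. (4.1) and Eq. (4.2): only the teacher logits are softened by the temperature τ, while the student distribution uses τ = 1. The variable names and the toy batch are illustrative only.

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-raised softmax of Eq. (4.1); tau > 1 softens the distribution."""
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_kl(teacher_logits, student_logits, tau=1.0):
    """Single-step KL term of Eq. (4.2): KL(softmax(z*/tau) || softmax(z)).

    teacher_logits : logits z* from the teacher policy network
    student_logits : logits z from the student policy network
    """
    p = softmax(teacher_logits, tau)     # softened teacher targets
    q = softmax(student_logits)          # student distribution (tau = 1, as in Eq. (4.2))
    return float(np.sum(p * np.log(p / q)))

# Toy example: a batch D of teacher logits and the corresponding student logits.
teacher_batch = [np.array([2.0, 0.5, -1.0]), np.array([0.1, 0.1, 3.0])]
student_batch = [np.array([1.0, 0.2, -0.5]), np.array([0.0, 0.3, 2.0])]
loss = sum(distillation_kl(t, s, tau=2.0) for t, s in zip(teacher_batch, student_batch))
```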
81 4.1.4Collaborativedeepreinforcementlearningframework Inthissection,weintroducetheproposedcollaborativedeepreinforcementlearning(CDRL) framework.Underthisframework,acollaborativeAsynchronousAdvantageActor-Critic (cA3C)algorithmisproposedtocon˝rmthee˙ectivenessofthecollaborativeapproach. Beforeweintroduceourmethodindetails,oneunderlyingassumptionweusedisasfollows: Assumption3. Ifthereisauniversethatcontainsallthetasks E = f e 1 ;e 2 ;:::;e 1 g and k i representsthecorrespondingknowledgetomastereachtask e i ,then 8 i;j;k i \ k j 6 = ; . Thisisaformaldescriptionofourcommonsensethatanypairoftasksarenotabsolutely isolatedfromeachother,whichhasbeenimplicitlyusedasafundamentalassumptionby mostpriortransferlearningstudies[ 151 , 166 , 55 , 37 , 235 ].Therefore,wefocusonminingthe sharedknowledgeacrossmultipletasksinsteadofprovidingstrategyselectingtasksthatshare knowledgeasmuchaspossible,whichremainstobeunsolvedandmayleadtoourfuture work.Thegoalhereistoutilizetheexistingknowledgeaswellaspossible.Forexample, wemayonlyhaveawell-trainedexpertonplayingPonggame,andwewanttoutilizeits expertisetohelpusperformbetteronothergames.Thisisoneofthesituationsthatcanbe solvedbyourcollaborativedeepreinforcementlearningframework. (a)Distillationprocedure(b)Studentnetworkstructure. Figure4.2:Deepknowledgedistillation.In(a),theteacher'soutputlogits z ismapped throughadeepalignmentnetworkandthealignedlogits F ! ( z ) isusedasthesupervision totrainthestudent.In(b),theextrafullyconnectedlayerfordistillationisaddedfor learningknowledgefromteacher.Forsimplicity'ssake,timestep t isomittedhere. 82 4.1.5Collaborativedeepreinforcementlearning Indeepreinforcementlearning,sincethetrainingofagentsarecomputationalexpensive,the well-trainedagentsshouldbefurtherutilizedassourceagents(agentswherewetransferred knowledgefrom)tofacilitatethetrainingoftargetagents(agentsthatareprovidedwith theextraknowledgefromsource).Inordertoincorporatethistypeofcollaborationtothe trainingofDRLagents,weformallyde˝nethecollaborativedeepreinforcementlearning (CDRL)frameworkasfollows: De˝nition3. Given m independentenvironments f " 1 ;" 2 ;:::;" m g of m tasks f e 1 ;e 2 ;:::;e m g ,thecorresponding m agents f g 1 ;g 2 ;:::;g m g arecollaborativelytrainedinparalleltomaximize therewards(mastereachtask)withrespecttotargetagents. Environments.Thereisnorestrictionontheenvironments:The m environmentscan betotallydi˙erentorwithsomeduplications. Inparallel.Eachenvironment " i onlyinteractswiththeonecorrespondingagent g i ,i.e., theaction a j t fromagent g j atstep t hasnoin˛uenceonthestate s i t +1 in " i ; 8 i 6 = j . Collaboratively.Thetrainingprocedureofagent g i consistsofinteractingwithenvi- ronment " i andinteractingwithotheragentsaswell.Theagent g i isnotnecessary tobeatsamelevelas"collaborative"de˝nedincognitivescience[ 50 ].E.g., g 1 canbe anexpertfortask e 1 (environment " 1 )whileheishelpingagent g 2 whichisastudent agentintask e 2 . Targetagents.ThegoalofCDRLcanbesetasmaximizingtherewardsthatagent g i obtainsinenvironment " i withthehelpofinteractingwithotheragents,similar toinductivetransferlearningwhere g i isthetargetagentfortargettaskandothers 83 aresourcetasks.Theknowledgeistransferedfromsourcetotarget g i byinteraction. Whenwesetthegoaltomaximizetherewardsofmultipleagentsjointly,itissimilarto multi-tasklearningwherealltasksaresourcetasksandtargettasksatthesametime. 
Notice that our definition is very different from the previously defined collaborative multiagent Markov Decision Process (collaborative multiagent MDP) [103, 73], where a set of agents select a global joint action to maximize the sum of their individual rewards and the environment transitions to a new state based on that joint action. First, an MDP is not a requirement in the CDRL framework. Second, in CDRL, each agent has its own copy of the environment and maximizes its own cumulative rewards. The goal of collaboration is to improve the performance of the collaborative agents compared with isolated ones, which is different from maximizing the sum of global rewards in a collaborative multiagent MDP. Third, CDRL focuses on how agents collaborate among heterogeneous environments, instead of how a joint action affects the rewards. In CDRL, different agents act in parallel, and the actions taken by other agents do not directly influence the current agent's rewards, whereas in a collaborative multiagent MDP the agents must coordinate their action choices, since the rewards are directly affected by the action choices of the other agents.

Furthermore, CDRL includes different types of interaction, which makes it a general framework. For example, the current state-of-the-art A3C [131] can be categorized as a homogeneous CDRL method with advantage actor-critic interaction. Specifically, multiple agents in A3C are trained in parallel with the same environment. All agents first synchronize parameters from a global network, and then update the global network with their individual gradients. This procedure can be seen as each agent maintaining its own model (a different version of the global network) and interacting with other agents by sending and receiving gradients. In this paper, we propose a novel interaction method named deep knowledge distillation under the CDRL framework. It is worth noting that the interaction in A3C only deals with homogeneous tasks, i.e., all agents have the same environment and the same model structure so that their gradients can be accumulated and exchanged. With deep knowledge distillation, the interaction can be conducted among heterogeneous tasks.

4.1.6 Deep knowledge distillation

As introduced before, knowledge distillation [82] trains a student network to behave similarly to a teacher network by utilizing the logits from the teacher as supervision. However, transferring knowledge among heterogeneous tasks faces several difficulties. First, the action spaces of different tasks may have different dimensions. Second, even if the dimensionality of the action space is the same among tasks, the action probability distributions of different tasks can vary a lot, as illustrated in Figure 4.5 (a) and (b). Thus, the action patterns represented by the logits of different policy networks are usually different from task to task. If we directly force a student network to mimic the action pattern of a teacher network trained for a different task, it could be trained in a wrong direction and finally end up with worse performance than isolated training. In fact, this suspicion has been empirically verified in our experiments.

Based on the above observation, we propose deep knowledge distillation to transfer knowledge between heterogeneous tasks. As illustrated in Figure 4.2 (a), the approach is straightforward. We use a deep alignment network to map the logits of the teacher network from a heterogeneous source task $e'$ (environment $\varepsilon'$); the aligned logits are then used as our supervision to update the student network of the target task $e$ (environment $\varepsilon$). This procedure is performed by minimizing the following objective function over the student policy network parameters $\theta_{p'}$:
\[ L_{KL}(D, \theta_{p'}, \tau) = \sum_t l_{KL}(F_\omega(z_t), z'_t; \tau), \quad (4.3) \]
where
\[ l_{KL}(F_\omega(z_t), z'_t; \tau) = \mathrm{softmax}(F_\omega(z_t)/\tau) \ln \frac{\mathrm{softmax}(F_\omega(z_t)/\tau)}{\mathrm{softmax}(z'_t)}. \]
Here $\omega$ denotes the parameters of the deep alignment network, which transforms the logits $z_t$ from the teacher policy network for knowledge distillation through the function
$F_\omega(z_t)$ at step $t$. As we show in Figure 4.2 (b), $\theta_p$ denotes the student policy network parameters (including the parameters of the CNN, LSTM, and policy layer) for task $e$, while $\theta_{p'}$ denotes the student network parameters of the CNN, LSTM, and distillation layer. Note that the distillation logits $z'_t$ from the student network do not determine the action probability distribution directly; that distribution is established by the policy logits $z_t$, as illustrated in Figure 4.2 (b). We add another fully connected distillation layer to deal with the mismatch of action space dimensionality and the contradiction between the knowledge transferred from the source domain and the knowledge learned from the target domain. The input to both the teacher and the student network is the state of environment $\varepsilon$ of the target task $e$. This means that we want to transfer the expertise of an expert on task $e'$ towards the current state. The symbol $D$ is a set of logits from the teacher network in one batch, and $\tau$ is the same temperature as described in Eq (4.1). In the trivial case where the teacher network and the student network are trained for the same task ($e'$ equals $e$), the deep alignment network $F_\omega$ reduces to an identity mapping, and the problem reduces to single-task policy distillation, which has been proved to be effective in [166]. Before we can apply deep knowledge distillation, we first need to train a good deep alignment network. In this work, we provide two types of training protocols for different situations:

Offline training: This protocol first trains two teacher networks in both environments $\varepsilon'$ and $\varepsilon$. We then use the logits of the two teacher networks to train a deep alignment network $F_\omega$. After acquiring the pre-trained $F_\omega$, we train a student network of task $e$ from scratch, while the teacher network of task $e'$ and $F_\omega$ are used for deep knowledge distillation.

Online training: Suppose we only have a teacher network of task $e'$, and we want to use the knowledge from task $e'$ to train the student network for task $e$ from scratch to reach higher performance. The pipeline of this method is that we first train the student network by interacting with the environment $\varepsilon$ for a certain number of steps $T_1$, and then start to train the alignment network $F_\omega$ using the logits from the teacher network and the student network. Afterwards, at step $T_2$, we start performing deep knowledge distillation. Obviously $T_2$ is larger than $T_1$, and their values are task-specific, decided empirically in this work.

Offline training can be useful if we already have a reasonably good model for task $e$ and want to further improve its performance using the knowledge from task $e'$. Online training is used when we need to learn the student network from scratch. Both types of training protocols can be extended to multiple heterogeneous tasks.

4.1.7 Collaborative Asynchronous Advantage Actor-Critic

In this section, we introduce the proposed collaborative asynchronous advantage actor-critic (cA3C) algorithm. As described in Section 4.1.5, the agents run in parallel. Each agent goes through the same training procedure, described in Algorithm 4.1. As it shows, the training of agent $g_1$ can be separated into two parts: the first part is to interact with the environment, obtain the reward, and compute the gradients that minimize the value loss and policy loss based on Generalized Advantage Estimation (GAE) [169]. The second part is to interact with the source agent $g_2$ so that the logits distilled from agent $g_2$ can be transformed by the deep alignment network and used as supervision to bias the training of agent $g_1$.
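To make Eq (4.3) concrete, the following numpy sketch aligns teacher logits through a small fully connected network and evaluates the distillation loss against the student's distillation-layer logits $z'_t$. The alignment network here is a two-layer illustration with arbitrary widths, not the architecture used in the experiments, and the shapes (a 6-action source task, a 4-action target task) are hypothetical.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=np.float64) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def alignment_net(z_teacher, W1, b1, W2, b2):
    """F_omega: maps source-task logits into the target action space."""
    h = np.maximum(0.0, z_teacher @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                          # aligned logits F_omega(z)

def deep_kd_loss(teacher_logits, student_distill_logits, params, tau=2.0):
    """Eq (4.3): KL between softened aligned teacher logits and the student's
    distillation logits z'_t, summed over the batch of steps."""
    W1, b1, W2, b2 = params
    loss = 0.0
    for z_t, z_prime_t in zip(teacher_logits, student_distill_logits):
        p_teacher = softmax(alignment_net(z_t, W1, b1, W2, b2), tau)
        p_student = softmax(z_prime_t)
        loss += np.sum(p_teacher * np.log(p_teacher / p_student))
    return loss

rng = np.random.default_rng(0)
params = (rng.normal(scale=0.1, size=(6, 16)), np.zeros(16),
          rng.normal(scale=0.1, size=(16, 4)), np.zeros(4))
print(deep_kd_loss(rng.normal(size=(5, 6)), rng.normal(size=(5, 4)), params))
```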
To be more concrete, the pseudocode in Algorithm 4.1 is an evolved version of A3C based on online training of deep knowledge distillation. At the $T$-th iteration, the agent interacts with the environment for $t_{max}$ steps or until the terminal state is reached (Line 6 to Line 15). The value network and the policy network are then updated using GAE. This variation of A3C was first implemented in the OpenAI universe starter agent [147]. Since the main asynchronous framework is the same as A3C, we still use A3C to denote this algorithm, although the update is not the same as the advantage actor-critic update used in the original A3C paper [131].

The online training of deep knowledge distillation is mainly completed from Line 25 to Line 32 in Algorithm 4.1. The training of the deep alignment network starts after $T_1$ steps (Lines 25-28). After $T_1$ steps, the student network is able to generate a representative action probability distribution, and we have suitable supervision to train the deep alignment network, parameterized by $\omega$. After $T_2$ steps, $\omega$ will gradually converge to a local optimum, and we start deep knowledge distillation. As illustrated in Figure 4.2 (b), we use the symbol $\theta_{p'}$ to represent the parameters of the CNN, LSTM, and the fully connected distillation layer, since we do not want the logits from the heterogeneous task to directly affect the action pattern of the target task. To simplify the discussion, the algorithm is described based on interacting with a single agent from one heterogeneous task. In Algorithm 4.1, the logits $z_t$ can be acquired from multiple teacher networks of different tasks; each task trains its own deep alignment network $\omega$ and distills the aligned logits to the student network.

As described in Section 4.1.5, there are two types of interactions in this algorithm: 1) the GAE interaction, which uses the gradients shared by all homogeneous agents, and 2) the distillation interaction, which is the deep knowledge distillation from the teacher network. The GAE interaction is performed only among homogeneous tasks. By synchronizing the parameters from a global student network in Algorithm 4.1 (Line 3), the current agent receives the GAE updates from all the other agents that interact with the same environment. In Lines 21 and 22, the current agent sends its gradients to the global student network, which is then synchronized with the other homogeneous agents. The distillation interaction is conducted in Line 31, where we use the aligned logits $F_\omega(z_t)$ and the distillation logits $z'_t$ to compute the gradients for minimizing the distillation loss. The gradients of distillation are also sent to the global student network. The global student network can be regarded as a parameter server that relays interactions among the homogeneous agents. From a different angle, each homogeneous agent maintains its own version of the global student network. Therefore, both types of interactions affect all homogeneous agents, which means that the distillation interactions between agent $g_2$ and agent $g_1$ also affect all agents homogeneous with agent $g_1$.

Algorithm 4.1: Online cA3C
Require: Global shared parameter vectors $\theta_p$ and $\theta_v$ and a global shared counter $T = 0$; agent-specific parameter vectors $\theta'_p$ and $\theta'_v$; GAE [169] parameters $\gamma$ and $\lambda$; time steps to start training the deep alignment network and deep knowledge distillation, $T_1, T_2$.
1: while $T \leq T_{max}$ do
2-15: Synchronize the agent-specific parameters $\theta'_p \leftarrow \theta_p$, $\theta'_v \leftarrow \theta_v$, reset the gradient accumulators $d\theta_p$ and $d\theta_v$, and interact with the environment for up to $t_{max}$ steps or until a terminal state is reached, storing $s_i$, $a_i$, $r_i$, and $v_i = V(s_i; \theta'_v)$.
16: $R = v_t = \begin{cases} 0 & \text{for terminal } s_t \\ V(s_t; \theta'_v) & \text{for non-terminal } s_t \end{cases}$
17: for $i \in \{t-1, \dots, t_{start}\}$ do
18: $\delta_i = r_i + \gamma v_{i+1} - v_i$
19: $A = \delta_i + (\gamma\lambda) A$
20: $R = r_i + \gamma R$
21: $d\theta_p \leftarrow d\theta_p + \nabla_{\theta'_p} \log \pi(a_i|s_i; \theta'_p)\, A$
22: $d\theta_v \leftarrow d\theta_v + \partial (R - v_i)^2 / \partial \theta'_v$
23: end for
24: Perform asynchronous update of $\theta_p$ using $d\theta_p$ and of $\theta_v$ using $d\theta_v$.
25: if $T \geq T_1$ then
26: // Train the deep alignment network.
27: $\min_\omega \sum_t l_{KL}(F_\omega(z_t), z^{\mathcal{S}}_t; \tau)$, where $z^{\mathcal{S}}_t$ are the student policy logits and $l_{KL}$ is defined in Eq (4.3).
28: end if
29: if $T \geq T_2$ then
30: // Online deep knowledge distillation.
31: $\min_{\theta_{p'}} \sum_t l_{KL}(F_\omega(z_t), z'_t)$
32: end if
33: end while

4.1.8 Experiments

4.1.8.1 Training and Evaluation

In this work, training and evaluation are conducted in OpenAI Gym [24], a toolkit that includes a collection of benchmark problems such as classic Atari games using the Arcade Learning Environment (ALE) [18], classic control games, etc. As in the standard RL setting, an agent is simulated in an environment, taking an action and receiving rewards and observations at each time step. The training of the agent is divided into episodes, and the goal is to maximize the expectation of the total reward per episode, or to reach higher performance using as few episodes as possible.

4.1.8.2 Certificated Homogeneous Transfer

In this subsection, we verify the effectiveness of knowledge distillation as a type of interaction in collaborative deep reinforcement learning for homogeneous tasks. This also verifies the effectiveness of the simplest case of deep knowledge distillation. Although the effectiveness of policy distillation in deep reinforcement learning has been verified in [166] based on DQN, there is no prior study on asynchronous online distillation. Therefore, our first experiment is to demonstrate that the knowledge distilled from a certificated task can be used to train a decent student network for a homogeneous task. Otherwise, the even more challenging setting of transferring among heterogeneous sources may not work. We note that in this case Assumption 3 is fully satisfied, given $k_1 = k_2$, where $k_1$ and $k_2$ are the knowledge needed to master tasks $e_1$ and $e_2$, respectively. We conduct experiments in a gym environment named Pong, a classic Atari game in which an agent controls a paddle to bounce a ball past another player agent. The maximum reward that each episode can reach is 21.

First, we train a teacher network that learns from its own environment by asynchronously performing GAE updates. We then train a student network using only online knowledge distillation from the teacher network. For fair comparisons, we use 8 agents for all environments in the experiments. Specifically, both the student and the teacher are trained in Pong with 8 agents. The 8 agents of the teacher network are trained using the A3C algorithm (equivalent to CDRL with GAE updates in one task). The 8 agents of the student network are trained using normal policy distillation, which uses the logits generated from the teacher network as supervision to train the policy network directly. From the results in Figure 4.3 (a), we see that the student network can achieve a very competitive performance, almost the same as the state of the art, using online knowledge distillation from a homogeneous task. It also suggests that the teacher does not necessarily need to be an expert before it can guide the training of a student in the homogeneous case. Before 2 million steps, the teacher itself is still learning from the environment, while the knowledge distilled from the teacher can already be used to train a reasonable student network. Moreover, we see that the hybrid of the two types of interactions in CDRL has a positive effect on training instead of causing performance deterioration.

Figure 4.3: Performance of online homogeneous knowledge distillation. (a) online KD only; (b) online KD with GAE.

In the second experiment, the student network learns from both the online knowledge distillation and the GAE updates from the environment. We find that the convergence is much faster than the state of the art, as shown in Figure 4.3 (b). In this experiment, the knowledge is distilled from the teacher to the student in the first one million steps, and the distillation is stopped after that. We note that in homogeneous CDRL, knowledge distillation is used directly with the policy logits rather than the distillation logits. The knowledge transfer setting in this experiment is not a practical one, because we already have a well-trained model of Pong, but it shows that when knowledge is correctly transferred, the combination of online knowledge distillation and GAE updates is an effective training procedure.
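The GAE updates referred to throughout these experiments follow the backward recursion of Algorithm 4.1 (Lines 16-20). A minimal numpy sketch of that recursion is given below; the rewards, value estimates, and hyperparameter values are toy inputs.

```python
import numpy as np

def gae_rollout_targets(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Backward recursion of Algorithm 4.1 (Lines 16-20).

    rewards: r_i for the rollout steps, shape (n,)
    values:  v_i = V(s_i) for the same steps, shape (n,)
    bootstrap_value: 0 for a terminal state, else V(s_t) of the state
                     reached at the end of the rollout (Line 16).
    Returns the GAE advantages A_i and discounted returns R_i per step.
    """
    n = len(rewards)
    advantages, returns = np.zeros(n), np.zeros(n)
    A, R = 0.0, bootstrap_value
    values_ext = np.append(values, bootstrap_value)   # provides v_{i+1} for the last step
    for i in reversed(range(n)):
        delta = rewards[i] + gamma * values_ext[i + 1] - values_ext[i]   # Line 18
        A = delta + gamma * lam * A                                      # Line 19
        R = rewards[i] + gamma * R                                       # Line 20
        advantages[i], returns[i] = A, R
    return advantages, returns

adv, ret = gae_rollout_targets(np.array([0.0, 0.0, 1.0]),
                               np.array([0.2, 0.4, 0.6]),
                               bootstrap_value=0.0)
print(adv, ret)   # advantages weight the policy gradient; returns feed the value loss
```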
4.1.8.3 Certificated Heterogeneous Transfer

In this subsection, we design experiments to illustrate the effectiveness of CDRL in certificated heterogeneous transfer with the proposed deep knowledge distillation. Given a certificated task Pong, we want to utilize the existing expertise and apply it to facilitate the training of a new task, Bowling. In the following experiments, we do not tune any model-specific parameters such as the number of layers, filter size, or network structure for Bowling. We first directly perform transfer learning from Pong to Bowling by knowledge distillation. Since the two tasks have different action patterns and action probability distributions, direct knowledge distillation with a policy layer is not successful, as shown in Figure 4.4 (a). In fact, the knowledge distilled from Pong contradicts the knowledge learned from Bowling, which leads to much worse performance than the baseline. We show in Figure 4.5 (a) and (b) that the action distributions of Pong and Bowling are very different. To resolve this, we distill the knowledge through an extra distillation layer, as illustrated in Figure 4.2 (b). As such, the knowledge distilled from the certificated heterogeneous task can be successfully transferred to the student network with improved performance after the learning is complete.

Figure 4.4: Performance of online knowledge distillation from a heterogeneous task. (a) Distillation from a Pong expert using the policy layer to train a Bowling student (KD-policy). (b) Distillation from a Pong expert to a Bowling student using an extra distillation layer (KD-distill).

Figure 4.5: The action probability distributions of (a) a Pong expert, (b) a Bowling expert, and (c) an aligned Pong expert.

However, this leads to much slower convergence than the baseline, as shown in Figure 4.4 (b), because it takes time to learn a good distillation layer to align the knowledge distilled from Pong to the current learning task. An interesting question is: is it possible to have both improved performance and faster convergence?

Deep knowledge distillation: Offline training. To handle the heterogeneity between Pong and Bowling, we first verify the effectiveness of deep knowledge distillation with an offline training procedure. Offline training is split into two stages. In the first stage, we train a deep alignment network with four fully connected layers using the ReLU activation function. The training data are logits generated from an expert Pong network and a Bowling network; the rewards of these networks at convergence are 20 and 60, respectively. In stage 2, with the Pong teacher network and the trained deep alignment network, we train a Bowling student network from scratch. The student network is trained with both the GAE interactions with its environment and the distillation interactions from the teacher network and the deep alignment network. The results in Figure 4.6 (a) show that deep knowledge distillation can transfer knowledge from Pong to Bowling both efficiently and effectively.

Figure 4.6: Performance of offline and online deep knowledge distillation, and collaborative learning. (a) Offline; (b) Online Strategy 1; (c) Online Strategy 2; (d) Collaborative.

Deep knowledge distillation: Online training. A more practical setting of CDRL is online training, where we simultaneously train the deep alignment network and conduct the online deep knowledge distillation. We use two online training strategies: 1) the training of the deep alignment network starts after 4 million steps, when the student Bowling network can perform reasonably well, and the knowledge distillation starts after 6 million steps; 2) the training of the deep alignment network starts after 0.1 million steps, and the knowledge distillation starts after 1 million steps. Results are shown in Figure 4.6 (b) and (c), respectively.
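For concreteness, a deep alignment network with four fully connected layers and ReLU activations, as used in the offline protocol above, can be sketched as follows. The hidden widths, the linear output layer, and the 6-to-4 action-space shapes are illustrative assumptions; the dissertation does not report these details here.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def make_alignment_net(rng, dims):
    """Four weight layers, e.g. dims = [source_actions, h, h, h, target_actions]."""
    return [(rng.normal(scale=0.1, size=(d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def align(z_teacher, layers):
    """F_omega: map Pong-expert logits towards the Bowling action space."""
    h = np.asarray(z_teacher, dtype=np.float64)
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:      # ReLU on all hidden layers, linear output
            h = relu(h)
    return h

rng = np.random.default_rng(0)
net = make_alignment_net(rng, dims=[6, 64, 64, 64, 4])
print(align(rng.normal(size=6), net))
```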
The results show that both strategies reach higher performance than the baseline. Moreover, the results suggest that we do not have to wait until the student network reaches a reasonable performance before we start to train the deep alignment network. This is because the deep alignment network is trained to align the two distributions of Pong and Bowling, instead of transferring the actual knowledge. Recall that the action probability distributions of Pong and Bowling are quite different, as shown in Figure 4.5 (a) and (b). After we project the logits of Pong using the deep alignment network, the distribution is very similar to that of Bowling, as shown in Figure 4.5 (c).

4.1.8.4 Collaborative Deep Reinforcement Learning

In the previous experiments, we assume that there is a well-trained Pong expert, and we transfer knowledge from the Pong expert to the Bowling student via deep knowledge distillation. A more challenging setting is one in which both Bowling and Pong are trained from scratch. In this experiment, we show that the CDRL framework can still be effective in this setting. We train a Bowling network and a Pong network from scratch using the proposed cA3C algorithm. The Pong agents are trained with GAE interactions only, and the target Bowling agents receive supervision from both GAE interactions and knowledge distilled from Pong via a deep alignment network. We start to train the deep alignment network after 3 million steps and perform deep knowledge distillation after 4 million steps, while the Pong agents are still updating from the environment. We note that in this setting the teacher network is constantly being updated, as knowledge is distilled from the teacher until 15 million steps. Results in Figure 4.6 (d) show that the proposed cA3C is able to converge to a higher performance than the current state of the art. The reward of the last one hundred episodes of A3C is $61.48 \pm 1.48$, while cA3C achieves $68.35 \pm 1.32$, a significant reward improvement of $11.2\%$.

4.2 Ranking Policy Gradient

4.2.1 Introduction

To utilize the collaborative strategy for improving sample-efficiency in single-agent reinforcement learning, we disentangle exploration and exploitation into two separate agents and conduct data-driven collaboration through imitation learning, which leads to a more sample-efficient off-policy learning framework. We first approach sample-efficient reinforcement learning from a ranking perspective. Instead of estimating the optimal action value function, we concentrate on learning the optimal rank of actions. The rank of actions depends on the relative action values. As long as the relative action values preserve the same rank of actions as the optimal action values ($Q$-values), we choose the same optimal action. To learn optimal relative action values, we propose the ranking policy gradient (RPG), which optimizes the actions' rank with respect to the long-term reward by learning the pairwise relationship among actions.
Ranking Policy Gradient (RPG), which directly optimizes relative action values to maximize the return, is a policy gradient method. The track of off-policy actor-critic methods [46, 72, 208] has made substantial progress on improving the sample-efficiency of policy gradient. However, the fundamental difficulty of learning stability associated with the bias-variance trade-off remains [136]. In this work, we first exploit the equivalence between RL that optimizes a lower bound of the return and supervised learning that imitates a specific optimal policy. Building upon this theoretical foundation, we propose a general off-policy learning framework that equips generalized policy iteration [187, Chap. 4] with an external step of supervised learning. The proposed off-policy learning not only enjoys the property of optimality preserving (unbiasedness), but also largely reduces the variance of the policy gradient because of its independence of the horizon and the reward scale. Furthermore, this learning paradigm leads to a sample-complexity analysis of large-scale MDPs in a non-tabular setting without linear dependence on the state space. Based on our sample-complexity analysis, we define the exploration efficiency, which quantitatively evaluates different exploration methods. Besides, we empirically show that there is a trade-off between optimality and sample-efficiency, which is well aligned with our theoretical indication. Last but not least, we demonstrate that the proposed approach, consolidating RPG with off-policy learning, significantly outperforms the state-of-the-art [80, 17, 42, 132].

4.2.2 Related works

Sample Efficiency. Sample-efficient reinforcement learning can be roughly divided into two categories. The first category includes variants of $Q$-learning [132, 167, 203, 80]. The main advantage of $Q$-learning methods is the use of off-policy learning, which is essential towards sample efficiency. The representative DQN [132] introduced deep neural networks into $Q$-learning, which further inspired a track of successful DQN variants such as Double DQN [203], Dueling networks [209], prioritized experience replay [167], and Rainbow [80]. The second category is the actor-critic approaches. Most recent works [46, 208, 71] in this category leverage importance sampling by re-weighting the samples to correct the estimation bias and reduce variance. Their main advantage is in wall-clock time, due to the distributed framework first presented in [131], rather than in sample-efficiency. As of the time of writing, the variants of DQN [80, 42, 17, 167, 203] are among the most sample-efficient algorithms, and they are adopted as our baselines for comparison.

RL as Supervised Learning. Many efforts have focused on developing the connections between RL and supervised learning, such as Expectation-Maximization algorithms [45, 152, 102, 1], Entropy-Regularized RL [145, 74], and Interactive Imitation Learning (IIL) [44, 188, 163, 165, 184, 81, 148]. EM-based approaches apply a probabilistic framework to formulate the RL problem of maximizing a lower bound of the return as a re-weighted regression problem, but they require on-policy estimation in the expectation step. Entropy-Regularized RL optimizes entropy-augmented objectives and can lead to off-policy learning without the use of importance sampling, but it converges to soft optimality [74].
Of the three tracks of prior work, IIL is most closely related to ours. The IIL works first pointed out the connection between imitation learning and reinforcement learning [163, 188, 165] and explored the idea of facilitating reinforcement learning by imitating experts. However, most imitation learning algorithms assume access to an expert policy or expert demonstrations. The off-policy learning framework proposed in this thesis can be interpreted as an online imitation learning approach that constructs expert demonstrations during exploration without soliciting experts, and conducts supervised learning to maximize the return at the same time. In short, our approach is different from prior art in terms of at least one of the following aspects: objectives, oracle assumptions, the optimality of the learned policy, and on-policy requirements. More concretely, the proposed method is able to learn an optimal policy in terms of long-term reward without access to an oracle (such as an expert policy or expert demonstrations), and it can be trained both empirically and theoretically in an off-policy fashion. A more detailed discussion of the related work on reducing RL to supervised learning is provided in Appendix A.

PAC Analysis of RL. Most existing studies on sample complexity analysis [95, 180, 97, 179, 105, 91, 90, 223] are established on value function estimation. The proposed approach leverages the probably approximately correct framework [202] in a different way such that it does not rely on the value function. Such independence directly leads to a practically sample-efficient algorithm for large-scale MDPs, as we demonstrate in the experiments.

4.2.3 Notations and Problem Setting

Here we consider a finite-horizon ($T$), discrete-time Markov Decision Process (MDP) with a finite discrete state space $\mathcal{S}$, and for each state $s \in \mathcal{S}$ the action space $\mathcal{A}_s$ is finite. The environment dynamics is denoted as $P = \{p(s'|s,a), \forall s, s' \in \mathcal{S}, a \in \mathcal{A}_s\}$. We note that the dimension of the action space can vary across states. We use $m = \max_s \|\mathcal{A}_s\|$ to denote the maximal action dimension among all possible states. Our goal is to maximize the expected sum of positive rewards, or return, $J(\theta) = \mathbb{E}_{\tau, \pi_\theta}[\sum_{t=1}^{T} r(s_t, a_t)]$. The notations used in this section are summarized as follows:

$p_{ij}$: The probability that the $i$-th action is ranked higher than the $j$-th action. Notice that $p_{ij}$ is controlled by $\theta$ through $\lambda_i, \lambda_j$.

$\tau$: A trajectory $\tau = \{s(\tau,t), a(\tau,t)\}_{t=1}^{T}$ collected from the environment. It is worth noting that this trajectory is not associated with any policy; it only represents a series of state-action pairs. We also use the abbreviations $s_t = s(\tau,t)$, $a_t = a(\tau,t)$.

$r(\tau)$: The trajectory reward $r(\tau) = \sum_{t=1}^{T} r(s_t, a_t)$ is the sum of rewards along one trajectory.

$R_{max}$: The maximal possible trajectory reward, i.e., $R_{max} = \max_\tau r(\tau)$. Since we focus on MDPs with finite horizon and immediate rewards, the trajectory reward is bounded.

$\sum_\tau$: The summation over all possible trajectories $\tau$.

$p_\theta(\tau)$: The probability that a specific trajectory is collected from the environment given policy $\pi_\theta$: $p_\theta(\tau) = p(s_0) \prod_{t=1}^{T} \pi_\theta(a_t|s_t) p(s_{t+1}|s_t, a_t)$.

$\mathcal{T}$: The set of all possible near-optimal trajectories; $|\mathcal{T}|$ denotes the number of near-optimal trajectories in $\mathcal{T}$.

$n$: The number of training samples, or equivalently state-action pairs, sampled from the uniformly (near)-optimal policy.

$m$: The number of discrete actions.

To act optimally, it is enough to compare the actions and then choose the best action, which can lead to a relatively higher return than the others. Therefore, an alternative solution is to learn the optimal rank of the actions instead of deriving the policy from the action values. In this section, we show how to optimize the rank of actions to maximize the return, and thus avoid the need for an accurate estimation of the optimal action value function. To learn the rank of actions, we focus on learning relative action values ($\lambda$-values), defined as follows:

Definition 4 (Relative action value ($\lambda$-values)).
For a state $s$, the relative action values of $m$ actions ($\lambda(s, a_k), k = 1, \dots, m$) are a list of scores that denote the rank of actions. If $\lambda(s, a_i) > \lambda(s, a_j)$, then action $a_i$ is ranked higher than action $a_j$.

The optimal relative action values should preserve the same optimal action as the optimal action values:
\[ \arg\max_a \lambda(s, a) = \arg\max_a Q^*(s, a), \]
where $Q^*(s, a_i)$ and $\lambda(s, a_i)$ represent the optimal action value and the relative action value of action $a_i$, respectively. We omit the model parameter $\theta$ in $\lambda(s, a_i)$ for concise presentation.

Remark 1. The $\lambda$-values are different from the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$. The advantage function quantitatively shows the difference in return when taking different actions and then following the current policy $\pi$. The $\lambda$-values only determine the relative order of actions, and their magnitudes are not estimates of returns.

To learn the $\lambda$-values, we can construct a probabilistic model of $\lambda$-values such that the best action has the highest probability of being selected. Inspired by learning to rank [26], we consider the pairwise relationship among all actions by modeling the probability (denoted as $p_{ij}$) of an action $a_i$ being ranked higher than any action $a_j$ as follows:
\[ p_{ij} = \frac{\exp(\lambda(s, a_i) - \lambda(s, a_j))}{1 + \exp(\lambda(s, a_i) - \lambda(s, a_j))}, \quad (4.4) \]
where $p_{ij} = 0.5$ means the relative action value of $a_i$ is the same as that of action $a_j$, and $p_{ij} > 0.5$ indicates that action $a_i$ is ranked higher than $a_j$. Given the independence Assumption 4, we can represent the probability of selecting one action as the product of a set of pairwise probabilities in Eq (4.4). Formally, we define the pairwise ranking policy in Eq (4.5). Please refer to Section A in the Appendix for a discussion of the feasibility of Assumption 4.

Definition 5. The pairwise ranking policy is defined as:
\[ \pi_\theta(a = a_i | s) = \prod_{j=1, j \neq i}^{m} p_{ij}, \quad (4.5) \]
where $p_{ij}$ is defined in Eq (4.4). The probability depends on the relative action values $q = [\lambda_1, \dots, \lambda_m]$. The highest relative action value leads to the highest probability of being selected.

Assumption 4. For a state $s$, the set of events $E = \{e_{ij} \mid \forall i \neq j\}$ is conditionally independent, where $e_{ij}$ denotes the event that action $a_i$ is ranked higher than action $a_j$. The independence of the events is conditioned on an MDP and a stationary policy.

Our ultimate goal is to maximize the long-term reward by optimizing the pairwise ranking policy, or equivalently by optimizing the pairwise relationships among the action pairs. Ideally, we would like the pairwise ranking policy to select the best action with the highest probability and the highest $\lambda$-value. To achieve this goal, we resort to the policy gradient method. Formally, we propose the ranking policy gradient method (RPG), as shown in Theorem 2.

Theorem 2 (Ranking Policy Gradient Theorem). For any MDP, the gradient of the expected long-term reward $J(\theta) = \sum_\tau p_\theta(\tau) r(\tau)$ w.r.t. the parameter $\theta$ of a pairwise ranking policy (Def 5) can be approximated by:
\[ \nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta} \Big[ \sum_{t=1}^{T} \nabla_\theta \sum_{j=1, j \neq i}^{m} \frac{\lambda_i - \lambda_j}{2} \; r(\tau) \Big], \quad (4.6) \]
and the deterministic pairwise ranking policy $\pi_\theta$ is $a = \arg\max_i \lambda_i, i = 1, \dots, m$, where $\lambda_i$ denotes the relative action value of the action taken at step $t$ ($\lambda(s_t, a_t)$, $a_i = a_t$), $s_t$ and $a_t$ denote the $t$-th state-action pair in trajectory $\tau$, and $\lambda_j, \forall j \neq i$ denote the relative action values of all other actions that were not taken given state $s_t$ in trajectory $\tau$, i.e., $\lambda(s_t, a_j), \forall a_j \neq a_t$.
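To make the pairwise ranking policy concrete, the following minimal numpy sketch evaluates Eq (4.4) and Eq (4.5) for a single state; the $\lambda$ values are arbitrary toy numbers.

```python
import numpy as np

def pairwise_probs(lam):
    """p_ij of Eq (4.4): a sigmoid of the difference of relative action values."""
    diff = lam[:, None] - lam[None, :]           # lambda_i - lambda_j
    return 1.0 / (1.0 + np.exp(-diff))

def pairwise_ranking_policy(lam):
    """pi(a_i|s) of Eq (4.5): product of p_ij over all j != i.

    As the text notes later, these products do not in general sum to one, so
    sampling from them would require an extra normalization step; the
    deterministic policy simply takes argmax_i lambda_i.
    """
    p = pairwise_probs(np.asarray(lam, dtype=np.float64))
    np.fill_diagonal(p, 1.0)                     # exclude j == i from the product
    return p.prod(axis=1)

lam = np.array([0.3, 1.2, -0.5, 0.0])            # relative action values for one state
print(pairwise_ranking_policy(lam), "greedy action:", int(np.argmax(lam)))
```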
The proof of Theorem 2 is provided in Appendix A. Theorem 2 states that optimizing the discrepancy between the action values of the best action and all other actions is optimizing the pairwise relationships that maximize the return. One limitation of RPG is that it is not convenient for tasks where only optimal stochastic policies exist, since the pairwise ranking policy takes extra effort to construct a probability distribution [see Appendix A]. In order to learn stochastic policies, we introduce the Listwise Policy Gradient (LPG), which optimizes the probability of ranking a specific action on the top of a set of actions with respect to the return. In the context of RL, this top-one probability is the probability of action $a_i$ being chosen, which is equal to the sum of the probabilities of all possible permutations that place action $a_i$ at the top. This probability is computationally prohibitive, since we would need to consider the probabilities of $m!$ permutations. Inspired by the listwise learning-to-rank approach [31], the top-one probability can be modeled by the softmax function (see Theorem 3). Therefore, LPG is equivalent to the Reinforce [212] algorithm with a softmax layer. LPG provides another interpretation of the Reinforce algorithm from the perspective of learning the optimal ranking, and it enables learning both deterministic and stochastic policies (see Theorem 4).

Theorem 3 ([31], Theorem 6). Given the action values $q = [\lambda_1, \dots, \lambda_m]$, the probability of action $i$ being chosen (i.e., being ranked on the top of the list) is:
\[ \pi(a_t = a_i | s_t) = \frac{\phi(\lambda_i)}{\sum_{j=1}^{m} \phi(\lambda_j)}, \quad (4.7) \]
where $\phi(\cdot)$ is any increasing, strictly positive function. A common choice of $\phi$ is the exponential function.

Theorem 4 (Listwise Policy Gradient Theorem). For any MDP, the gradient of the long-term reward $J(\theta) = \sum_\tau p_\theta(\tau) r(\tau)$ w.r.t. the parameter $\theta$ of the listwise ranking policy takes the following form:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \Big[ \sum_{t=1}^{T} \nabla_\theta \Big( \log \frac{e^{\lambda_i}}{\sum_{j=1}^{m} e^{\lambda_j}} \Big) r(\tau) \Big], \quad (4.8) \]
where the listwise ranking policy $\pi_\theta$ parameterized by $\theta$ is given by Eq (4.9) for tasks with deterministic optimal policies:
\[ a = \arg\max_i \lambda_i, \quad i = 1, \dots, m, \quad (4.9) \]
or by Eq (4.10) for stochastic optimal policies:
\[ a \sim \pi_\theta(\cdot \mid s), \quad i = 1, \dots, m, \quad (4.10) \]
where the policy takes the form in Eq (4.11):
\[ \pi_\theta(a = a_i | s_t) = \frac{e^{\lambda_i}}{\sum_{j=1}^{m} e^{\lambda_j}}, \quad (4.11) \]
which is the probability that action $i$ is ranked highest, given the current state and all the relative action values $\lambda_1, \dots, \lambda_m$.

The proof of Theorem 4 exactly follows the direct policy differentiation [153, 212] by replacing the policy with the softmax function. The action probabilities $\pi(a_i|s), \forall i = 1, \dots, m$ form a probability distribution over the set of discrete actions [31, Lemma 7]. Theorem 4 states that the vanilla policy gradient [212] parameterized by a softmax layer is optimizing the probability of each action being ranked highest, with respect to the long-term reward. Furthermore, it enables learning both deterministic and stochastic policies.

To this end, seeking sample-efficiency motivates us to learn the relative relationship of actions (RPG (Theorem 2) and LPG (Theorem 4)) instead of deriving the policy from action value estimations. However, both RPG and LPG belong to policy gradient methods, which suffer from large variance and the on-policy learning requirement [187]. Therefore, naive implementations of RPG or LPG are still far from sample-efficient. In the next section, we describe a general off-policy learning framework empowered by supervised learning, which provides an alternative way to accelerate learning, preserve optimality, and reduce variance.
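Before moving on, the listwise policy of Eq (4.11) and the per-step term inside Eq (4.8) can be sketched as follows. The gradient below is taken with respect to the $\lambda$ vector itself (back-propagating further into $\theta$ is omitted), and the numbers are toy inputs.

```python
import numpy as np

def listwise_policy(lam):
    """Eq (4.11): softmax over relative action values (top-one probability)."""
    z = np.asarray(lam, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()

def log_policy_grad_wrt_lam(lam, action):
    """d/d lambda_k of log softmax(lambda)_a = 1{k == a} - pi_k, i.e. the
    per-step factor of Eq (4.8) expressed in the lambda coordinates."""
    pi = listwise_policy(lam)
    grad = -pi
    grad[action] += 1.0
    return grad

lam = np.array([0.3, 1.2, -0.5, 0.0])
rng = np.random.default_rng(0)
a = int(rng.choice(len(lam), p=listwise_policy(lam)))   # stochastic policy, Eq (4.10)
trajectory_reward = 1.0
print(log_policy_grad_wrt_lam(lam, a) * trajectory_reward)  # REINFORCE-style weight
```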
4.2.5 Off-policy Learning as Supervised Learning

In this section, we discuss the connections and discrepancies between RL and supervised learning, and our results lead to a sample-efficient off-policy learning paradigm for RL. The main result in this section is Theorem 5, which casts the problem of maximizing the lower bound of the return into a supervised learning problem, given one relatively mild Assumption 5 and the practical Assumptions 4 and 6. It can be shown that these assumptions are valid in a range of common RL tasks, as discussed in Lemma 6 in Appendix A. The central idea is to collect only the near-optimal trajectories when the learning agent interacts with the environment, and to imitate the near-optimal policy by maximizing the log-likelihood of the state-action pairs from these near-optimal trajectories. With this roadmap in mind, we now introduce our approach.

In a discrete-action MDP with finite states and horizon, given the near-optimal policy $\pi^*$, the stationary state distribution is given by $p_{\pi^*}(s) = \sum_\tau p(s|\tau) p_{\pi^*}(\tau)$, where $p(s|\tau)$ is the probability of a certain state given a specific trajectory $\tau$ and is not associated with any policies; only $p_{\pi^*}(\tau)$ is related to the policy parameters. The stationary distribution of state-action pairs is thus $p_{\pi^*}(s,a) = p_{\pi^*}(s)\, \pi^*(a|s)$. In this section, we consider MDPs in which each initial state leads to at least one (near)-optimal trajectory. For the more general case, please refer to the discussion in Appendix A. In order to connect supervised learning (i.e., imitating a near-optimal policy) with RL and enable sample-efficient off-policy learning, we first introduce trajectory reward shaping (TRS), defined as follows:

Definition 6 (Trajectory Reward Shaping, TRS). Given a fixed trajectory $\tau$, its trajectory reward is shaped as follows:
\[ w(\tau) = \begin{cases} 1, & \text{if } r(\tau) \geq c \\ 0, & \text{otherwise,} \end{cases} \]
where $c = R_{max} - \epsilon$ is a problem-dependent near-optimal trajectory reward threshold that indicates the least reward of a near-optimal trajectory, with $\epsilon \geq 0$ and $\epsilon \ll R_{max}$. We denote the set of all possible near-optimal trajectories as $\mathcal{T} = \{\tau \mid w(\tau) = 1\}$, i.e., $w(\tau) = 1, \forall \tau \in \mathcal{T}$.

Remark 2. The threshold $c$ indicates a trade-off between sample-efficiency and optimality. The higher the threshold, the less frequently near-optimal trajectories are hit during exploration, which means higher sample complexity, while the final performance is better (see Figure 4.10).

Remark 3. The trajectory reward can be reshaped to any positive function that is not related to the policy parameter $\theta$. For example, if we set $w(\tau) = r(\tau)$, the conclusions in this section still hold (see Eq (A.6) in Appendix A). For the sake of simplicity, we set $w(\tau) = 1$.

Different from the reward shaping work [139], where shaping happens at each step on $r(s_t, a_t)$, the proposed approach directly shapes the trajectory reward $r(\tau)$, which facilitates a smooth transformation from RL to SL. After shaping the trajectory reward, we can transfer the goal of RL from maximizing the return to maximizing the long-term performance (Def 7).

Definition 7 (Long-term Performance). The long-term performance is defined by the expected shaped trajectory reward:
\[ \sum_\tau p_\theta(\tau) w(\tau). \quad (4.12) \]
According to Def 6, the expectation over all trajectories is equal to that over the near-optimal trajectories in $\mathcal{T}$, i.e., $\sum_\tau p_\theta(\tau) w(\tau) = \sum_{\tau \in \mathcal{T}} p_\theta(\tau) w(\tau)$.
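A minimal sketch of the TRS filter of Def 6 is given below: trajectories whose return reaches the threshold $c$ contribute their state-action pairs to the supervised training set, all others are discarded. The rollouts and the value of $c$ are toy placeholders.

```python
def shaped_trajectory_reward(trajectory_return, c):
    """Def 6: w(tau) = 1 if the trajectory return reaches c = R_max - epsilon."""
    return 1 if trajectory_return >= c else 0

# hypothetical rollouts: (list of (state, action) pairs, trajectory return)
rollouts = [([("s0", 1), ("s1", 0)], 21.0),
            ([("s0", 0), ("s1", 0)], -3.0),
            ([("s0", 1), ("s1", 1)], 19.5)]

c = 19.0   # task-specific threshold; treated as a tuning parameter in the text
near_optimal_pairs = [pair
                      for pairs, ret in rollouts
                      if shaped_trajectory_reward(ret, c) == 1
                      for pair in pairs]
print(near_optimal_pairs)   # state-action pairs approximating samples from the UNOP
```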
The optimality is preserved after trajectory reward shaping ($\epsilon = 0, c = R_{max}$), since an optimal policy $\pi^*$ maximizing the long-term performance is also an optimal policy for the original MDP, i.e., $\sum_\tau p_{\pi^*}(\tau) r(\tau) = \sum_{\tau \in \mathcal{T}} p_{\pi^*}(\tau) r(\tau) = R_{max}$, where $\pi^* = \arg\max_\pi \sum_\tau p_\pi(\tau) w(\tau)$ and $p_{\pi^*}(\tau) = 0, \forall \tau \notin \mathcal{T}$ (see Lemma 4 in Appendix A). Similarly, when $\epsilon > 0$, the optimal policy after trajectory reward shaping is a near-optimal policy for the original MDP. Note that most policy gradient methods use the softmax function, in which case we have $\exists \tau \notin \mathcal{T}, p_\pi(\tau) > 0$ (see Lemma 5 in Appendix A). Therefore, when softmax is used to model a policy, it will not converge to an exact optimal policy. On the other hand, ideally, the discrepancy of the performance between them can be made arbitrarily small based on the universal approximation [83] with general conditions on the activation function and Theorem 1 in [188].

Essentially, we use TRS to filter out the near-optimal trajectories, and then we maximize the probabilities of the near-optimal trajectories to maximize the long-term performance. This procedure can be approximated by maximizing the log-likelihood of near-optimal state-action pairs, which is a supervised learning problem. Before we state our main results, we first introduce the definition of the uniformly near-optimal policy (Def 8) and a prerequisite (Asm. 5) specifying the applicability of the results.

Definition 8 (Uniformly Near-Optimal Policy, UNOP). The Uniformly Near-Optimal Policy $\pi^*$ is the policy whose probability distribution over near-optimal trajectories ($\mathcal{T}$) is a uniform distribution, i.e., $p_{\pi^*}(\tau) = \frac{1}{|\mathcal{T}|}, \forall \tau \in \mathcal{T}$, where $|\mathcal{T}|$ is the number of near-optimal trajectories. When we set $c = R_{max}$, it is an optimal policy in terms of both maximizing the return and the long-term performance. In the case of $c = R_{max}$, the corresponding uniform policy is an optimal policy; we denote this type of optimal policy as the uniformly optimal policy (UOP).

Assumption 5 (Existence of Uniformly Near-Optimal Policy). We assume the existence of a Uniformly Near-Optimal Policy (Def. 8).

Based on Lemma 6 in Appendix A, Assumption 5 is satisfied for certain MDPs that have deterministic dynamics. Other than Assumption 5, all other assumptions in this work (Assumptions 4, 6) can almost always be satisfied in practice, based on empirical observations. With these relatively mild assumptions, we present the following long-term performance theorem, which shows the close connection between supervised learning and RL.

Theorem 5 (Long-term Performance Theorem). Maximizing the lower bound of the expected long-term performance in Eq (4.12) is maximizing the log-likelihood of state-action pairs sampled from a uniformly (near)-optimal policy $\pi^*$, which is a supervised learning problem:
\[ \arg\max_\theta \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}_s} p_{\pi^*}(s, a) \log \pi_\theta(a|s). \quad (4.13) \]
The optimal policy of maximizing the lower bound is also the optimal policy of maximizing the long-term performance and the return.

Remark 4. It is worth noting that Theorem 5 does not require the uniformly near-optimal policy $\pi^*$ to be deterministic. The only requirement is the existence of a uniformly near-optimal policy.

Remark 5. Maximizing the lower bound of the long-term performance is maximizing a lower bound of the long-term reward, since we can set $w(\tau) = r(\tau)$ and $\sum_\tau p_\theta(\tau) r(\tau) \geq \sum_{\mathcal{T}} p_\theta(\tau) w(\tau)$. An optimal policy that maximizes this lower bound is also an optimal policy maximizing the long-term performance when $c = R_{max}$, and thus maximizing the return.
The proof of Theorem 5 can be found in Appendix A. Theorem 5 indicates that we break the dependency between the current policy $\pi_\theta$ and the environment dynamics, which means off-policy learning can be conducted by the above supervised learning approach. Furthermore, we point out that there is a potential discrepancy between imitating the UNOP by maximizing the log-likelihood (even when the optimal policy's samples are given) and reinforcement learning, since we are maximizing a lower bound of the expected long-term performance (or equivalently the return over the near-optimal trajectories only) instead of the return over all trajectories. In practice, the state-action pairs from an optimal policy are hard to construct, while the uniform characteristic of the UNOP can alleviate this issue (see Sec 4.2.6). Towards sample-efficient RL, we apply Theorem 5 to RPG, which reduces the ranking policy gradient to a classification problem by Corollary 1.

Corollary 1 (Ranking performance policy gradient). The lower bound of the expected long-term performance (defined in Eq (4.12)) using the pairwise ranking policy (Eq (4.5)) can be approximately optimized by the following loss:
\[ \min_\theta \sum_{s, a_i} p_{\pi^*}(s, a_i) \sum_{j=1, j \neq i}^{m} \max(0,\, 1 + \lambda(s, a_j) - \lambda(s, a_i)). \quad (4.14) \]

Corollary 2 (Listwise performance policy gradient). Optimizing the lower bound of the expected long-term performance by the listwise ranking policy (Eq (4.11)) is equivalent to:
\[ \max_\theta \sum_s p_{\pi^*}(s) \sum_{i=1}^{m} \pi^*(a_i|s) \log \frac{e^{\lambda_i}}{\sum_{j=1}^{m} e^{\lambda_j}}. \quad (4.15) \]
The proof of this corollary is a direct application of Theorem 5, replacing the policy with the softmax function.

The proof of Corollary 1 can be found in Appendix A. Similarly, we can reduce LPG to a classification problem (see Corollary 2). One advantage of casting RL to SL is variance reduction. With the proposed off-policy supervised learning, we can reduce the upper bound of the policy gradient variance, as shown in Corollary 3. Before introducing the variance reduction results, we first make the common assumption on MDP regularity (Assumption 6), similar to [43, 46, A1]. Furthermore, Assumption 6 is guaranteed for bounded, continuously differentiable policies such as the softmax function.

Assumption 6. We assume the existence of a maximum norm of the log-policy gradient over all possible state-action pairs, i.e., $C = \max_{s,a} \|\nabla_\theta \log \pi_\theta(a|s)\|_\infty$.

Corollary 3 (Policy gradient variance reduction). Given a stationary policy, the upper bound of the variance of each dimension of the policy gradient is $O(T^2 C^2 R_{max}^2)$. The upper bound of the gradient variance of maximizing the lower bound of the long-term performance, Eq (4.13), is $O(C^2)$, where $C$ is the maximum norm of the log-policy gradient from Assumption 6. Supervised learning has thus reduced the upper bound of the gradient variance by an order of $O(T^2 R_{max}^2)$ compared to the regular policy gradient, considering $R_{max} \geq 1, T \geq 1$, which is a very common situation in practice.
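Corollary 1's objective is the loss that the supervision stage actually minimizes later in this chapter. A minimal numpy sketch of Eq (4.14) on a small batch of near-optimal state-action pairs is given below; the batch average is used as an empirical stand-in for the expectation under $p_{\pi^*}(s,a)$, and the numbers are toy inputs.

```python
import numpy as np

def rpg_hinge_loss(lam_batch, actions):
    """Eq (4.14) on a minibatch of near-optimal state-action pairs.

    lam_batch: array (n, m) of relative action values lambda(s, a_j) per state.
    actions:   array (n,) with the near-optimal action index a_i (the label).
    """
    n, _ = lam_batch.shape
    loss = 0.0
    for lam, a_i in zip(lam_batch, actions):
        margins = 1.0 + lam - lam[a_i]   # 1 + lambda(s, a_j) - lambda(s, a_i)
        margins[a_i] = 0.0               # exclude the j == i term
        loss += np.maximum(0.0, margins).sum()
    return loss / n                       # empirical average over the batch

lam_batch = np.array([[0.3, 1.2, -0.5, 0.0],
                      [0.9, 0.1,  0.4, 0.2]])
actions = np.array([1, 0])               # labels from near-optimal trajectories
print(rpg_hinge_loss(lam_batch, actions))
```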
The proof of Corollary 3 can be found in Appendix A. This corollary shows that the variance of the regular policy gradient is upper-bounded by the square of the time horizon and the maximum trajectory reward. It is aligned with our intuition and empirical observation: the longer the horizon, the harder the learning. Also, common reward shaping tricks such as truncating the reward to $[-1, 1]$ [34] can help learning, since they reduce variance by decreasing $R_{max}$. With supervised learning, we concentrate the difficulty of the long time horizon into the exploration phase, which is an inevitable issue for all RL algorithms, and we drop the dependence on $T$ and $R_{max}$ for the policy variance. Thus, it is more stable and efficient to train the policy using supervised learning. One potential limitation of this method is that the trajectory reward threshold $c$ is task-specific, which is crucial to the final performance and sample-efficiency. In many applications such as dialogue systems [111], recommender systems [130], etc., we design the reward function to guide the learning process, in which case $c$ is naturally known. For the cases where we have no prior knowledge of the reward function of the MDP, we treat $c$ as a tuning parameter to balance optimality and efficiency, as we empirically verify in Figure 4.10. The major theoretical uncertainty on general tasks is the existence of a uniformly near-optimal policy, which is negligible to the empirical performance. A rigorous theoretical analysis of this problem is beyond the scope of this work.

Figure 4.7: Off-policy learning framework.

4.2.6 An algorithmic framework for off-policy learning

Based on the discussions in Section 4.2.5, we exploit the advantage of reducing RL to supervised learning via a proposed two-stage off-policy learning framework. As illustrated in Figure 4.7, the proposed framework contains the following two stages:

Generalized Policy Iteration for Exploration. The goal of the exploration stage is to collect different near-optimal trajectories as frequently as possible. Under the off-policy framework, the exploration agent and the learning agent can be separated. Therefore, any existing RL algorithm can be used during exploration. The principle of this framework is to use the most advanced RL agents as an exploration strategy in order to collect more near-optimal trajectories, and to leave the policy learning to the supervision stage.

Supervision. In this stage, we imitate the uniformly near-optimal policy, UNOP (Def 8). Although we have no access to the UNOP, we can approximate the state-action distribution of the UNOP by collecting only the near-optimal trajectories. The near-optimal samples are constructed online, and we are not given any expert demonstrations or expert policy beforehand. This step provides a sample-efficient approach to conduct exploitation, which enjoys the superiority of stability (Figure 4.9), variance reduction (Corollary 3), and optimality preserving (Theorem 5).

The two-stage algorithmic framework can be directly incorporated into RPG and LPG to improve sample efficiency. The implementation of RPG is given in Algorithm 4.2, and LPG follows the same procedure except for the difference in the loss function. The main requirement of Alg. 4.2 is on the exploration efficiency and the MDP structure. During the exploration stage, a sufficient amount of different near-optimal trajectories needs to be collected for constructing a representative supervised learning training dataset. Theoretically, this requirement always holds [see Appendix Section A, Lemma 7], while the number of episodes explored could be prohibitively large, which would make the algorithm sample-inefficient. This could be a practical concern of the proposed algorithm. However, according to our extensive empirical observations, we notice that long before the value-function-based state-of-the-art converges to near-optimal performance, a sufficient amount of near-optimal trajectories has already been explored.
Therefore, we point out that instead of estimating optimal action value functions and then choosing actions greedily, using the value function to facilitate exploration and imitating the UNOP is a more sample-efficient approach. As illustrated in Figure 4.7, value-based methods with off-policy learning, bootstrapping, and function approximation can lead to divergent optimization [187, Chap. 11]. In contrast to resolving the instability, we circumvent this issue by constructing a stationary target using the samples from (near)-optimal trajectories and performing imitation learning. This two-stage approach can avoid extensive exploration of the suboptimal state-action space and reduce the substantial number of samples needed for estimating optimal action values. In MDPs where we have a high probability of hitting near-optimal trajectories (such as Pong), the supervision stage can further facilitate exploration. It should be emphasized that our work focuses on improving sample-efficiency through more effective exploitation, rather than developing a novel exploration method.

Algorithm 4.2: Off-Policy Learning for Ranking Policy Gradient (RPG)
Require: The near-optimal trajectory reward threshold $c$, the number of maximal training episodes $N_{max}$, the maximum number of time steps in each episode $T$, and the batch size $b$.
1: while episode $\leq N_{max}$ do
2-11: Run the exploration agent in the environment for one episode of at most $T$ steps, storing the transitions in a regular replay buffer, and update the RPG agent by minimizing Eq (4.14) on minibatches of size $b$ sampled from the near-optimal replay buffer.
12: if the return $\sum_{t=1}^{T} r_t \geq c$ then
13: Take the near-optimal trajectory $e_t, t = 1, \dots, T$ of the latest episode from the regular replay buffer, and insert the trajectory into the near-optimal replay buffer.
14: end if
15: if $t$ % evaluation step == 0 then
16: Evaluate the RPG agent by greedily choosing the action. If the best performance is reached, then stop training.
17: end if
18: end while

4.2.7 Sample Complexity and Generalization Performance

In this section, we present a theoretical analysis of the sample complexity of RPG with the off-policy learning framework in Section 4.2.6. The analysis leverages results from the Probably Approximately Correct (PAC) framework and provides an alternative approach to quantify the sample complexity of RL from the perspective of the connection between RL and SL (see Theorem 5), which is significantly different from the existing approaches that use value function estimations [95, 180, 97, 179, 105, 91, 90, 223]. We show that the sample complexity of RPG (Theorem 6) depends on the properties of the MDP, such as horizon, action space, and dynamics, and on the generalization performance of supervised learning. It is worth mentioning that the sample complexity of RPG has no linear dependence on the state space, which makes it suitable for large-scale MDPs. Moreover, we also provide a formal quantitative definition (Def 9) of the exploration efficiency of RL.

Corresponding to the two-stage framework in Section 4.2.6, the sample complexity of RPG also splits into two problems:

Learning efficiency: How many state-action pairs from the uniformly optimal policy do we need to collect in order to achieve good generalization performance in RL?

Exploration efficiency: For a certain type of MDP, what is the probability of collecting $n$ training samples (state-action pairs from the uniformly near-optimal policy) in the first $k$ episodes in the worst case? This question leads to a quantitative evaluation metric for different exploration methods.

The first stage is resolved by Theorem 6, which connects the lower bound of the generalization performance of RL to the supervised learning generalization performance. We then discuss the exploration efficiency of the worst-case performance for a binary-tree MDP in Lemma 2. Jointly, we show how to link the two stages to give a general theorem that studies how many samples we need to collect in order to achieve a certain performance in RL.
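Before turning to the analysis, the two-stage procedure of Algorithm 4.2 can be summarized by the following schematic Python loop. The environment, exploration policy, and update routine are toy stand-ins (not the released implementation), and the threshold value is arbitrary.

```python
import numpy as np

class ChainEnv:
    """Tiny stand-in environment: reward 1 for action 1, episodes of length 5."""
    def reset(self):
        self.t = 0
        return 0
    def step(self, a):
        self.t += 1
        return self.t, float(a == 1), self.t >= 5

def off_policy_rpg(env, explore_policy, rpg_update, c, n_episodes=50, batch_size=8):
    """Schematic loop in the spirit of Algorithm 4.2 (not the exact listing):
    explore, keep only trajectories whose return reaches the threshold c,
    and fit the RPG agent on the collected near-optimal state-action pairs."""
    rng = np.random.default_rng(0)
    near_optimal_buffer = []
    for _ in range(n_episodes):
        s, done, ret, trajectory = env.reset(), False, 0.0, []
        while not done:
            a = explore_policy(s, rng)                 # exploration stage
            s_next, r, done = env.step(a)
            trajectory.append((s, a))
            ret += r
            s = s_next
        if ret >= c:                                   # trajectory reward shaping filter
            near_optimal_buffer.extend(trajectory)
        if len(near_optimal_buffer) >= batch_size:     # supervision stage
            idx = rng.choice(len(near_optimal_buffer), batch_size, replace=False)
            rpg_update([near_optimal_buffer[i] for i in idx])   # e.g. minimize Eq (4.14)
    return len(near_optimal_buffer)

print(off_policy_rpg(ChainEnv(),
                     explore_policy=lambda s, rng: int(rng.integers(2)),
                     rpg_update=lambda batch: None,
                     c=4.0))
```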
In this section, we restrict our discussion to MDPs with a fixed action space and assume the existence of a deterministic optimal policy. The policy $\pi_\theta = \hat h = \arg\min_{h \in \mathcal{H}} \hat\epsilon(h)$ corresponds to the empirical risk minimizer (ERM) in the learning theory literature, which is the policy we obtain by learning on the training samples. $\mathcal{H}$ denotes the hypothesis class from which we select the policy. Given a hypothesis (policy) $h$, the empirical risk is given by $\hat\epsilon(h) = \sum_{i=1}^{n} \frac{1}{n} \mathbf{1}\{h(s_i) \neq a_i\}$. Without loss of generality, we can normalize the reward function to set the upper bound of the trajectory reward equal to one (i.e., $R_{max} = 1$), similar to the assumption in [90]. It is worth noting that the training samples are generated i.i.d. from an unknown distribution, which is perhaps the most important assumption in statistical learning theory. The i.i.d. assumption is satisfied in this case since the state-action pairs (training samples) are collected by filtering the samples during the learning stage, and we can manually manipulate the samples to follow the distribution of the UOP (Def 8) by only storing the unique near-optimal trajectories.

4.2.8 Supervision stage: Learning efficiency

To simplify the presentation, we restrict our discussion to the finite hypothesis class (i.e., $|\mathcal{H}| < \infty$), since this dependence is not germane to our discussion. However, we note that the theoretical framework in this section is not limited to the finite hypothesis class. For example, we can simply use the VC dimension [204] or the Rademacher complexity [15] to generalize our discussion to infinite hypothesis classes, such as neural networks. For completeness, we first revisit the sample complexity result from PAC learning in the context of RL.

Lemma 1 (Supervised Learning Sample Complexity [133]). Let $|\mathcal{H}| < \infty$, and let $\delta, \gamma$ be fixed. The inequality $\epsilon(\hat h) \leq (\min_{h \in \mathcal{H}} \epsilon(h)) + 2\gamma$ holds with probability at least $1 - \delta$ when the training set size $n$ satisfies:
\[ n \geq \frac{1}{2\gamma^2} \log \frac{2|\mathcal{H}|}{\delta}, \quad (4.16) \]
where the generalization error (expected risk) of a hypothesis $\hat h$ is defined as:
\[ \epsilon(\hat h) = \sum_{s,a} p_{\pi^*}(s,a)\, \mathbf{1}\{\hat h(s) \neq a\}. \]

Condition 1 (Action values). We restrict the action values of RPG to a certain range, i.e., $\lambda_i \in [0, c_q]$, where $c_q$ is a positive constant.

This condition can be easily satisfied; for example, we can use a sigmoid to cast the action values into $[0, 1]$. We can impose this constraint since in RPG we only focus on the relative relationship of action values. Given this mild condition, and building on prior work in statistical learning theory, we introduce the following results that connect supervised learning and reinforcement learning.

Theorem 6 (Generalization Performance). Given an MDP where the UOP (Def 8) is deterministic, let $|\mathcal{H}|$ denote the size of the hypothesis space, and let $\delta, n$ be fixed. The following inequality holds with probability at least $1 - \delta$:
\[ \sum_\tau p_\theta(\tau) r(\tau) \geq D\, (1+e)^{(1-m)T\epsilon}, \]
where $D = |\mathcal{T}| \big(\prod_{\tau \in \mathcal{T}} p_d(\tau)\big)^{\frac{1}{|\mathcal{T}|}}$, $p_d(\tau) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)$ denotes the environment dynamics, and $\epsilon$ is the upper bound of the supervised learning generalization performance, defined as $\epsilon = (\min_{h \in \mathcal{H}} \epsilon(h)) + 2\sqrt{\frac{1}{2n} \log \frac{2|\mathcal{H}|}{\delta}} = 2\sqrt{\frac{1}{2n} \log \frac{2|\mathcal{H}|}{\delta}}$.

Corollary 4 (Sample Complexity).
Given an MDP where the UOP (Def 8) is deterministic, let $|\mathcal{H}|$ denote the size of the hypothesis space, and let $\epsilon, \delta$ be fixed. Then for the following inequality to hold with probability at least $1 - \delta$:
\[ \sum_\tau p_\theta(\tau) r(\tau) \geq 1 - \epsilon, \]
it suffices that the number of state-action pairs (training sample size $n$) from the uniformly optimal policy satisfies:
\[ n \geq \frac{2(m-1)^2 T^2}{\big(\log_{1+e} \frac{D}{1-\epsilon}\big)^2} \log \frac{2|\mathcal{H}|}{\delta} = O\Big( \frac{m^2 T^2}{\big(\log \frac{D}{1-\epsilon}\big)^2} \log |\mathcal{H}| \Big). \]

The proofs of Theorem 6 and Corollary 4 are provided in Appendix A. Theorem 6 establishes the connection between the generalization performance of RL and the sample complexity of supervised learning. The lower bound of the generalization performance decreases exponentially with respect to the horizon $T$ and the action space dimension $m$. This is aligned with our empirical observation that it is more difficult to learn MDPs with a longer horizon and/or a larger action space. Furthermore, the generalization performance has a linear dependence on $D$, the transition probability of optimal trajectories. Therefore, $T$, $m$, and $D$ jointly determine the difficulty of learning the given MDP. As pointed out by Corollary 4, the smaller $D$ is, the higher the sample complexity. Note that $T$, $m$, and $D$ all characterize intrinsic properties of MDPs, which cannot be improved by our learning algorithms. One advantage of RPG is that its sample complexity has no dependence on the state space, which enables RPG to resolve large-scale, complicated MDPs, as demonstrated in our experiments. In the supervision stage, our goal is the same as in traditional supervised learning: to achieve better generalization performance.

4.2.9 Exploration stage: Exploration efficiency

The exploration efficiency is highly related to the MDP properties and the exploration strategy. To provide interpretation of how the MDP properties (state space dimension, action space dimension, horizon) affect the sample complexity through exploration efficiency, we characterize a simplified MDP as in [184], in which we explicitly compute the exploration efficiency of a stationary policy (random exploration), as shown in Figure 4.8.

Definition 9 (Exploration Efficiency). We define the exploration efficiency of a certain exploration algorithm ($\mathcal{A}$) within an MDP ($\mathcal{M}$) as the probability of sampling $i$ distinct optimal trajectories in the first $k$ episodes. We denote the exploration efficiency as $p_{\mathcal{A},\mathcal{M}}(n_{traj} \geq i \mid k)$. When $\mathcal{M}$, $k$, $i$, and the optimality threshold $c$ are fixed, the higher $p_{\mathcal{A},\mathcal{M}}(n_{traj} \geq i \mid k)$, the better the exploration efficiency. We use $n_{traj}$ to denote the number of near-optimal trajectories in this subsection. If the exploration algorithm derives a series of learning policies, then we have $p_{\mathcal{A},\mathcal{M}}(n_{traj} \geq i \mid k) = p_{\{\pi_i\}_{i=0}^{t},\mathcal{M}}(n_{traj} \geq i \mid k)$, where $t$ is the number of steps at which the algorithm $\mathcal{A}$ updated the policy. If we would like to study the exploration efficiency of a stationary policy, then we have $p_{\mathcal{A},\mathcal{M}}(n_{traj} \geq i \mid k) = p_{\pi,\mathcal{M}}(n_{traj} \geq i \mid k)$.

Definition 10 (Expected Exploration Efficiency). The expected exploration efficiency of a certain exploration algorithm ($\mathcal{A}$) within an MDP ($\mathcal{M}$) is defined as:
\[ E_{\mathcal{A},k,\mathcal{M}} = \sum_{i=0}^{k} p_{\mathcal{A},\mathcal{M}}(n_{traj} = i \mid k)\, i. \]

Figure 4.8: The binary-tree-structured MDP ($\mathcal{M}_1$) with one initial state, similar to the one discussed in [184]. In this subsection, we focus on MDPs that have no duplicated states. The initial state distribution of the MDP is uniform and the environment dynamics is deterministic. For $\mathcal{M}_1$, the worst-case exploration is random exploration, and each trajectory will be visited with the same probability under random exploration. Note that in this type of MDP, Assumption 5 is satisfied.

The definitions provide a quantitative metric to evaluate the quality of exploration. Intuitively, the quality of exploration should be determined by how frequently it hits different good trajectories. We use Def 9 for theoretical analysis and Def 10 for practical evaluation.

Lemma 2 (The Exploration Efficiency of a Random Policy).
The exploration efficiency of the random exploration policy in a binary tree MDP ($\mathcal{M}_1$) is given by:
\[ p_{\pi_r,\mathcal{M}}(n_{traj} \geq i \mid k) = 1 - \sum_{i'=0}^{i-1} C_{|\mathcal{T}|}^{i'} \frac{\sum_{j=0}^{i'} (-1)^j C_{i'}^{j} \,(N - |\mathcal{T}| + i' - j)^k}{N^k}, \]
where $N$ denotes the total number of different trajectories in the MDP. In the binary tree MDP $\mathcal{M}_1$, $N = |\mathcal{S}_0| |\mathcal{A}|^T$, where $|\mathcal{S}_0|$ denotes the number of distinct initial states. $|\mathcal{T}|$ denotes the number of optimal trajectories. $\pi_r$ denotes the random exploration policy, which means the probability of hitting each trajectory in $\mathcal{M}_1$ is equal.

The proof of Lemma 2 is available in Appendix A.

4.2.10 Joint Analysis Combining Exploration and Supervision

In this section, we jointly consider learning efficiency and exploration efficiency to study the generalization performance. Concretely, we would like to study, if we interact with the environment for a certain number of episodes, what is the worst generalization performance we can expect with a certain probability, if RPG is applied.

Corollary 5 (RL Generalization Performance). Given an MDP where the UOP (Def 8) is deterministic, let $|\mathcal{H}|$ be the size of the hypothesis space, and let $\delta_0, n, k$ be fixed. The following inequality holds with probability at least $1 - \delta_0$:
\[ \sum_\tau p_\theta(\tau) r(\tau) \geq D\, (1+e)^{(1-m)T\epsilon}, \]
where $k$ is the number of episodes we have explored in the MDP, $n$ is the number of distinct optimal state-action pairs we need from the UOP (i.e., the size of the training data), $n'$ denotes the number of distinct optimal state-action pairs collected by the random exploration, and
\[ \epsilon = 2 \sqrt{\frac{1}{2n} \log \frac{2|\mathcal{H}|\, p_{\pi_r,\mathcal{M}}(n' \geq n \mid k)}{p_{\pi_r,\mathcal{M}}(n' \geq n \mid k) - 1 + \delta_0}}. \]

The proof of Corollary 5 is provided in Appendix A. Corollary 5 states that the probability of sampling optimal trajectories is the main bottleneck of exploration and generalization, rather than the state space dimension. In general, the optimal exploration strategy depends on the properties of the MDP. In this work, we focus on improving learning efficiency, i.e., learning the optimal ranking instead of estimating value functions. The discussion of optimal exploration is beyond the scope of this work.

Figure 4.9: The training curves of the proposed RPG and the state-of-the-art. All results are averaged over random seeds from 1 to 5. The x-axis represents the number of steps interacting with the environment (we update the model every four steps) and the y-axis represents the averaged training episodic return. The error bars are plotted with a confidence interval of 95%.

4.2.11 Experimental Results

To evaluate the sample-efficiency of Ranking Policy Gradient (RPG), we focus on Atari 2600 games in OpenAI gym [18, 24], without randomly repeating the previous action. We compare our method with state-of-the-art baselines including DQN [132], C51 [17], IQN [42], Rainbow [80], and self-imitation learning (SIL) [145]. For reproducibility, we use the implementation provided in the Dopamine framework (https://github.com/google/dopamine) [34] for all baselines and proposed methods, except for SIL, for which we use the official implementation (https://github.com/junhyukoh/self-imitation-learning). Following the standard practice [145, 80, 42, 17], we report the training performance of all baselines as the increase of interactions with the environment, or proportionally the number of training iterations. We run the algorithms with five random seeds and report the average rewards with 95% confidence intervals. The implementation details of the proposed RPG and its variants are given as follows (code is available at https://github.com/illidanlab/rpg):

EPG: EPG is the stochastic listwise policy gradient (see Eq (4.10)) incorporated with the proposed off-policy learning. More concretely, we apply trajectory reward shaping (TRS, Def 6) to all trajectories encountered during exploration and train the vanilla policy gradient using the off-policy samples. This is equivalent to minimizing the cross-entropy loss (see Eq (4.15)) over the near-optimal trajectories.
LPG :LPGisthedeterministiclistwisepolicygradientwiththeproposedo˙-policylearn- ing.Theonlydi˙erencebetweenEPGandLPGisthatLPGchoosesactiondeterministically (seeAppendixEq(4.9))duringevaluation. RPG :RPGexplorestheenvironmentusingaseparateEPGagentin Pong andIQNin othergames.ThenRPGconductssupervisedlearningbyminimizingthehingelossEq(4.14). Itisworthnotingthattheexplorationagent(EPGorIQN)canbereplacedbyanyexisting explorationmethod.InourRPGimplementation,wecollectalltrajectorieswiththetrajectory rewardnolessthanthethreshold c withouteliminatingtheduplicatedtrajectoriesandwe empiricallyfounditisareasonablesimpli˝cation. Sample-e˚ciency. AstheresultsshowninFigure4.9,ourapproach,RPG,signi˝cantly outperformsthestate-of-the-artbaselinesintermsofsample-e˚ciencyatalltasks.Further- more,RPGnotonlyachievedthemostsample-e˚cientresults,butalsoreachedthehighest ˝nalperformanceat Robotank , DoubleDunk , Pitfall ,and Pong ,comparingtoany 3 Codeisavailableathttps://github.com/illidanlab/rpg. 124 Figure4.10:Thetrade-o˙betweensamplee˚ciencyandoptimality. model-freestate-of-the-art.Inreinforcementlearning,thestabilityofalgorithmshouldbe emphasizedasanimportantissue.Aswecanseefromtheresults,theperformanceofbaselines variesfromtasktotask.Thereisnosinglebaselineconsistentlyoutperformsothers.In contrast,duetothereductionfromRLtosupervisedlearning,RPGisconsistentlystableand e˙ectiveacrossdi˙erentenvironments.Inadditiontothestabilityande˚ciency,RPGenjoys simplicityatthesametime.Intheenvironment Pong ,itissurprisingthatRPGwithout anycomplicatedexplorationmethodlargelysurpassedthesophisticatedvalue-functionbased approaches.MoredetailsofhyperparametersareprovidedintheAppendixSectionA. 4.2.12AblationStudy Thee˙ectivenessofpairwiserankingpolicyando˙-policylearningassupervised learning. TogetabetterunderstandingoftheunderlyingreasonsthatRPGismoresample- e˚cientthanDQNvariants,weperformedablationstudiesinthe Pong environmentby varyingthecombinationofpolicyfunctionswiththeproposedo˙-policylearning.Theresults ofEPG,LPG,andRPGareshowninthebottomright,Figure4.9.RecallthatEPGand LPGuselistwisepolicygradient(vanillapolicygradientusingsoftmaxaspolicyfunction)to conductexploration,theo˙-policylearningminimizesthecross-entropylossEq(4.15).In contrast,RPGsharesthesameexplorationmethodasEPGandLPGwhileusespairwise 125 Figure4.11:Expectedexploratione˚ciencyofstate-of-the-art.Theresultsareaveragedover randomseedsfrom1to10. rankingpolicyEq(4.5)ino˙-policylearningthatminimizeshingelossEq(4.14).Wecansee thatRPGismoresample-e˚cientthanEPG/LPGinlearningdeterministicoptimalpolicy. Wealsocomparedtheadvancedon-policymethodProximalPolicyOptimization(PPO)[ 170 ] withEPG,LPG,andRPG.Theproposedo˙-policylearninglargelysurpassedthebest on-policymethod.Therefore,weconcludethato˙-policyassupervisedlearningcontributes tothesample-e˚ciencysubstantially,whilethepairwiserankingpolicycanfurtheraccelerate thelearning.Inaddition,wecompareRPGtorepresentativeo˙-policypolicygradient approach:ACER[ 208 ].Astheresultsshown,theproposedo˙-policylearningframeworkis moresample-e˚cientthanthestate-of-the-arto˙-policypolicygradientapproaches. OntheTrade-o˙betweenSample-E˚ciencyandOptimality. 
ResultsinFigure4.10 showthatthereisatrade-o˙betweensamplee˚ciencyandoptimality,whichiscontrolledby thetrajectoryrewardthreshold c .Recallthat c determineshowcloseisthelearnedUNOP tooptimalpolicies.Ahighervalueof c leadstoalessfrequencyofnear-optimaltrajectories beingcollectedandandthusalowersamplee˚ciency,andhoweverthealgorithmisexpected toconvergetoastrategyofbetterperformance.Wenotethat c istheonlyparameterwe tunedacrossallexperiments. 126 ExplorationE˚ciency. WeempiricallyevaluatetheExpectedExplorationE˚ciency (Def9)ofthestate-of-the-arton Pong .ItisworthnotingthattheRLgeneralization performanceisdeterminedbybothoflearninge˚ciencyandexploratione˚ciency.Therefore, higherexploratione˚ciencydoesnotnecessarilyleadtomoresamplee˚cientalgorithmdue tothelearningine˚ciency,asdemonstratedby RainBow and DQN (seeFigure4.11).Also, theImplicitQuantileachievesthebestperformanceamongbaselines,sinceitsexploration e˚ciencylargelysurpassesotherbaselines. 4.2.13Conclusion Inthiswork,weintroducedrankingpolicygradientmethodsthat,forthe˝rsttime,approach theRLproblemfromarankingperspective.Furthermore,towardsthesample-e˚cientRL, weproposeano˙-policylearningframework,whichtrainsRLagentsinasupervisedlearning mannerandthuslargelyfacilitatesthelearninge˚ciency.Theo˙-policylearningframework usesgeneralizedpolicyiterationforexplorationandexploitsthestablenessofsupervised learningforderivingpolicy,whichaccomplishestheunbiasedness,variancereduction,o˙- policylearning,andsamplee˚ciencyatthesametime.Besides,weprovideanalternative approachtoanalyzethesamplecomplexityofRL,andshowthatthesamplecomplexityof RPGhasnodependencyonthestatespacedimension.Lastbutnotleast,empiricalresults showthatRPGachievessuperiorperformanceascomparedtothestate-of-the-art. 127 Chapter5 CollaborativeMulti-AgentLearning Inthischapter,weinvestigatethescalabilityofcollaborativelearninginthecontextof multi-agentlearningforareal-world˛eetmanagementapplication.Weproposetotransfer thecoordinationofalargenumberoflearningagentsintoalinearprogrammingproblem, withproperdomainknowledgetoguidetheoptimization.Weshowthesuperiorityofthis globalcollaborationcomparedtoindividuallearningthroughextensiveevaluationonthe real-worldtra˚cdata. 5.1Introduction Large-scaleonlineride-sharingplatformssuchasUber[ 201 ],Lift[ 126 ],andDidiChuxing[ 40 ] havetransformedthewaypeopletravel,liveandsocialize.Byleveragingtheadvances inandwideadoptionofinformationtechnologiessuchascellularnetworksandglobal positioningsystems,theride-sharingplatformsredistributeunderutilizedvehiclesonthe roadstopassengersinneedoftransportation.Theoptimizationoftransportationresources greatlyalleviatedtra˚ccongestionandcalibratedtheoncesigni˝cantgapbetweentransport demandandsupply[112]. Onekeychallengeinride-sharingplatformsistobalancethedemandsandsupplies,i.e., ordersofthepassengersanddriversavailableforpickinguporders.Inlargecities,although millionsofride-sharingordersareservedeveryday,anenormousnumberofpassengersrequests 128 remainunservicedduetothelackofavailabledriversnearby.Ontheotherhand,there areplentyofavailabledriverslookingforordersinotherlocations.Iftheavailabledrivers weredirectedtolocationswithhighdemand,itwillsigni˝cantlyincreasethenumberof ordersbeingserved,andthussimultaneouslybene˝tallaspectsofthesociety:utilityof transportationcapacitywillbeimproved,incomeofdriversandsatisfactionofpassengers willbeincreased,andmarketshareandrevenueofthecompanywillbeexpanded. 
˛eet management isakeytechnicalcomponenttobalancethedi˙erencesbetweendemandand supply,byreallocatingavailablevehiclesaheadoftime,toachievehighe˚ciencyinserving futuredemand. Eventhoughrichhistoricaldemandandsupplydataareavailable,usingthedatato seekanoptimalallocationpolicyisnotaneasytask.Onemajorissueisthatchanges inanallocationpolicywillimpactfuturedemand-supply,anditishardforsupervised learningapproachestocaptureandmodelthesereal-timechanges.Ontheotherhand,the reinforcementlearning(RL)[ 186 ],whichlearnsapolicybyinteractingwithacomplicated environment,hasbeennaturallyadoptedtotacklethe˛eetmanagementproblem[ 64 , 65 , 211 ]. However,thehigh-dimensionalandcomplicateddynamicsbetweendemandandsupplycan hardlybemodeledaccuratelybytraditionalRLapproaches. Recentyearswitnessedtremendoussuccessindeepreinforcementlearning(DRL)in modelingintellectualchallengingdecision-makingproblems[ 132 , 174 , 175 ]thatwerepreviously intractable.Inthelightofsuchadvances,inthischapterweproposeanovelDRLapproachto learnhighlye˚cientallocationpoliciesfor˛eetmanagement.Therearesigni˝canttechnical challengeswhenmodeling˛eetmanagementusingDRL: 1) Feasibilityofproblemsetting. TheRLframeworkisreward-driven,meaningthatasequence of actions fromthepolicyisevaluatedsolelybythe reward signalfromenvironment[ 11 ]. 129 Thede˝nitionsofagent,rewardandactionspaceareessentialforRL.Ifwemodelthe allocationpolicyusingacentralizedagent,theactionspacecanbeprohibitivelylargesince anactionneedstodecidethenumberofavailablevehiclestorepositionfromeachlocationto itsnearbylocations.Also,thepolicyissubjecttoafeasibilityconstraintenforcingthatthe numberofrepositionedvehiclesneedstobenolargerthanthecurrentnumberofavailable vehicles.Tothebestofourknowledge,thishigh-dimensionalexact-constrainsatisfaction policyoptimizationisnotcomputationallytractableinDRL:applyingitinaverysmall-scale problemcouldalreadyincurhighcomputationalcosts[154]. 2) Large-scaleAgents. Onealternativeapproachistoinsteaduseamulti-agentDRLsetting, whereeachavailablevehicleisconsideredasanagent.Themulti-agentrecipeindeedalleviates thecurseofdimensionalityofactionspace.However,suchsettingcreatesthousandsofagents interactingwiththeenvironmentateachtime.TrainingalargenumberofagentsusingDRL isagainchallenging:theenvironmentforeachagentisnon-stationarysinceotheragentsare learninganda˙ectingtheenvironmentatsamethetime.Mostofexistingstudies[ 125 , 60 , 189 ] allowcoordinationamongonlyasmallsetofagentsduetohighcomputationalcosts. 3) CoordinationsandContextDependenceofActionspace Facilitatingcoordinationamong large-scaleagentsremainsachallengingtask.Sinceeachagenttypicallylearnsitsownpolicy oraction-valuefunctionthatarechangingovertime,itisdi˚culttocoordinateagentsfor alargenumberofagents.Moreover,theactionspaceisdynamicchangingovertimesince agentsarenavigatingtodi˙erentlocationsandthenumberoffeasibleactionsdependson thegeographiccontextofthelocation. Inthispaper,weproposeacontextualmulti-agentDRLframeworktoresolvetheafore- mentionedchallenges.Ourmajorcontributionsarelistedasfollows: 130 Weproposeane˚cientmulti-agentDRLsettingforlarge-scale˛eetmanagement problembyaproperdesignofagent,rewardandstate. 
Weproposecontextualmulti-agentreinforcementlearningframeworkinwhichthree concretealgorithms: contextualmulti-agentactor-critic (cA2C), contextualdeepQ- learning (cDQN),and Contextualmulti-agentactor-criticwithlinearprogramming (LP-cA2C)aredeveloped.Forthe˝rsttimeinmulti-agentDRL,thecontextual algorithmscannotonlyachievee˚cientcoordinationamongthousandsoflearning agentsateachtime,butalsoadapttodynamicallychangingactionspaces. InordertotrainandevaluatetheRLalgorithm,wedevelopedasimulatorthatsimulates real-worldtra˚cactivitiesperfectlyaftercalibratingthesimulatorusingrealhistorical dataprovidedbyDidiChuxing[40]. Lastbutnotleast,theproposedcontextualalgorithmssigni˝cantlyoutperformthe state-of-the-artmethodsinmulti-agentDRLwithamuchlessnumberofrepositions needed. Therestofthischapterisorganizedasfollows.We˝rstgivealiteraturereviewon therelatedworkinSec5.2.ThentheproblemstatementiselaboratedinSec5.3andthe simulationplatformwebuiltfortrainingandevaluationareintroducedinSec5.6.The methodologyisdescribedinSec5.4.Quantitativeandqualitativeresultsarepresentedin Sec6.6.Finally,weconcludeourworkinSec5.8. 131 5.2RelatedWorks IntelligentTransportationSystem. Advancesinmachinelearningandtra˚cdata analyticsleadtowidespreadapplicationsofmachinelearningtechniquestotacklechallenging tra˚cproblems.Onetrendingdirectionistoincorporatereinforcementlearningalgorithms incomplicatedtra˚cmanagementproblems.Therearemanypreviousstudiesthathave demonstratedthepossibilityandbene˝tsofreinforcementlearning.Ourworkhasclose connectionstothesestudiesintermsofproblemsetting,methodologyandevaluation.Among thetra˚capplicationsthatarecloselyrelatedtoourwork,suchastaxidispatchsystemsor tra˚clightcontrolalgorithms,multi-agentRLhasbeenexploredtomodeltheintricatenature ofthesetra˚cactivities[ 14 , 172 , 128 ].Thepromisingresultsmotivatedustousemulti-agent modelinginthe˛eetmanagementproblem.In[ 64 ],anadaptivedynamicprogramming approachwasproposedtomodelstochasticdynamicresourceallocation.Itestimatesthe returnsoffuturestatesusingapiecewiselinearfunctionanddeliversactions(assigningorders tovehicles,reallocateavailablevehicles)givenstatesandonestepfuturestatesvalues,by solvinganintegerprogrammingproblem.In[ 65 ],theauthorsfurtherextendedtheapproach tothesituationsthatanactioncanspanacrossmultipletimeperiods.Thesemethodsare hardtobedirectlyutilizedinthereal-worldsettingwhereorderscanbeservedthroughthe vehicleslocatedinmultiplenearbylocations. Multi-agentreinforcementlearning. Anotherrelevantresearchtopicismulti-agent reinforcementlearning[ 27 ]whereagroupofagentssharethesameenvironment,inwhich theyreceiverewardsandtakeactions.[ 190 ]comparedandcontrastedindependent Q -learning andacooperativecounterpartindi˙erentsettings,andempiricallyshowedthatthelearning speedcanbene˝tfromthecooperationamongagents.Independent Q -learningisextended 132 intoDRLin[ 189 ],wheretwoagentsarecooperatingorcompetingwitheachotheronly throughthereward.In[ 60 ],theauthorsproposedacounterfactualmulti-agentpolicy gradientmethodthatusesacentralizedadvantagetoestimatewhethertheactionofone agentwouldimprovetheglobalreward,anddecentralizedactorstooptimizetheagentpolicy. Ryan etal. alsoutilizedtheframeworkofdecentralizedexecutionandcentralizedtrainingto developmulti-agentmulti-agentactor-criticalgorithmthatcancoordinateagentsinmixed cooperative-competitiveenvironments[ 125 ].However,noneofthesemethodswereapplied whentherearealargenumberofagentsduetothecommunicationcostamongagents. 
Recently,fewworks[ 230 , 217 ]scaledDRLmethodstoalargenumberofagents,whileitis notapplicabletoapplythesemethodstocomplexrealapplicationssuchas˛eetmanagement. In[ 140 , 141 ],theauthorsstudiedlarge-scalemulti-agentplanningfor˛eetmanagementwith explicitlymodelingtheexpectedcountsofagents. Deepreinforcementlearning. DRLutilizesneuralnetworkfunctionapproximationsand areshowntohavelargelyimprovedtheperformanceoverchallengingapplications[ 175 , 132 ]. ManysophisticatedDRLalgorithmssuchasDQN[ 132 ],A3C[ 131 ]weredemonstratedto bee˙ectiveinthetasksinwhichwehaveaclearunderstandingofrulesandhaveeasy accesstomillionsofsamples,suchasvideogames[ 24 , 18 ].However,DRLapproachesare rarelyseentobeappliedincomplicatedreal-worldapplications,especiallyinthosewith high-dimensionalandnon-stationaryactionspace,lackofwell-de˝nedrewardfunction,andin needofcoordinationamongalargenumberofagents.Inthischapter,weshowthatthrough carefulreformulation,theDRLcanbeappliedtotacklethe˛eetmanagementproblem. 133 5.3ProblemStatement Inthischapter,weconsidertheproblemofmanagingalargesetofavailablehomogeneous vehiclesforonlineride-sharingplatforms.Thegoalofthemanagementistomaximizethe grossmerchandisevolume(GMV:thevalueofalltheordersserved)oftheplatformby repositioningavailablevehiclestothelocationswithlargerdemand-supplygapthanthe currentone.Thisproblembelongstoavariantoftheclassical˛eetmanagementproblem[ 47 ]. Aspatial-temporalillustrationoftheproblemisavailableinFigure5.1.Inthisexample,we use hexagonal-gridworld torepresentthemapandsplitthedurationofonedayinto T =144 timeintervals(onefor10minutes).Ateachtimeinterval,theordersemergestochasticallyin eachgridandareservedbytheavailablevehiclesinthesamegridorsixnearbygrids.The goalof˛eetmanagementhereistodecidehowmanyavailablevehiclestorelocatefromeach gridtoitsneighborsinaheadoftime,sothatmostorderscanbeserved. Totacklethisproblem,weproposetoformulatetheproblemusing multi-agentreinforce- mentlearning [ 27 ].Inthisformulation,weuseasetofhomogeneousagentswithsmallaction spaces,andsplittheglobalrewardintoeachgrid.Thiswillleadtoamuchmoree˚cient learningprocedurethanthesingleagentsetting,duetothesimpli˝edactiondimensionand theexplicitcreditassignmentbasedonsplitreward.Formally,wemodelthe˛eetmanagement problemasaMarkovgame G for N agents,whichisde˝nedbyatuple G =( N; S ; A ; P ; R ; ) , where N; S ; A ; P ; R ; arethenumberofagents,setsofstates,jointactionspace,transition probabilityfunctions,rewardfunctions,andadiscountfactorrespectively.Thede˝nitions aregivenasfollows: Agent :Weconsideranavailablevehicle(orequivalentlyanidledriver)asanagent, andthevehiclesinthesamespatial-temporalnodearehomogeneous,i.e.,thevehicles 134 locatedatthesameregionatthesametimeintervalareconsideredassameagents (whereagentshavethesamepolicy).Althoughthenumberofuniqueheterogeneous agentsisalways N ,thenumberofagents N t ischangingovertime. State s t 2S :Wemaintainaglobalstate s t ateachtime t ,consideringthespatial distributionsofavailablevehiclesandorders(i.e.thenumberofavailablevehiclesand ordersineachgrid)andcurrenttime t (usingone-hotencoding).Thestateofanagent i , s i t ,isde˝nedastheidenti˝cationofthegriditlocatedandthesharedglobalstate i.e. s i t =[ s t ; g j ] 2 R N 3+ T ,where g j istheone-hotencodingofthegridID.Wenote thatagentslocatedatsamegridhavethesamestate s i t . 
Action a t 2A = A 1 ::: A N t :a jointaction a t = f a i t g N t 1 instructingtheallocation strategyofallavailablevehiclesattime t .Theactionspace A i ofanindividualagent speci˝eswheretheagentisabletoarriveatthenexttime,whichgivesasetofseven discreteactionsdenotedby f k g 7 k =1 .The˝rstsixdiscreteactionsindicateallocatingthe agenttooneofitssixneighboringgrids,respectively.Thelastdiscreteaction a i t =7 meansstayinginthecurrentgrid.Forexample,theaction a 1 0 =2 meanstorelocate the 1 stagentfromthecurrentgridtothesecondnearbygridattime 0 ,asshownin Figure5.1.Foraconcisepresentation,wealsouse a i t , [ g 0 ; g 1 ] torepresentagent i movingfromgrid g 0 to g 1 .Furthermore,theactionspaceofagentsdependsontheir locations.Theagentslocatedatcornergridshaveasmalleractionspace.Wealso assumethattheactionisdeterministic:if a i t , [ g 0 ; g 1 ] ,thenagent i willarriveatthe grid g 1 attime t +1 . Rewardfunction R i 2R = SA! R :Eachagentisassociatedwithareward function R i andallagentsinthesamelocationhavethesamerewardfunction.The 135 i -thagentattemptstomaximizeitsownexpecteddiscountedreturn: E h P 1 k =0 k r i t + k i . Theindividualreward r i t forthe i -thagentassociatedwiththeaction a i t isde˝nedasthe averagedrevenueofallagentsarrivingatthesamegridasthe i -thagentattime t +1 . Sincetheindividualrewardsatsametimeandthesamelocationaresame,wedenote thisrewardofagentsattime t andgrid g j as r t ( g j ) .Suchdesignofrewardsaimsat avoidinggreedyactionsthatsendtoomanyagentstothelocationwithhighvalueof orders,andaligningthemaximizationofeachagent'sreturnwiththemaximizationof GMV(valueofallservedordersinoneday).Itse˙ectivenessisempiricallyveri˝edin Sec6.6. Statetransitionprobability p ( s t +1 j s t ;a t ): SAS! [0 ; 1] :Itgivestheproba- bilityoftransitingto s t +1 givenajointaction a t istakeninthecurrentstate s t .Notice thatalthoughtheactionisdeterministic,newvehiclesandorderswillbeavailable atdi˙erentgridseachtime,andexistingvehicleswillbecomeo˙-lineviaarandom process. Tobemoreconcrete,wegiveanexamplebasedontheaboveproblemsettinginFigure5.1. Attime t =0 ,agent 1 isrepositionedfrom g 0 to g 2 byaction a 1 0 ,andagent 2 isalso repositionedfrom g 1 to g 2 byaction a 2 0 .Attime t =1 ,twoagentsarriveat g 2 ,andanew orderwithvalue 10 alsoemergesatsamegrid.Therefore,thereward r 1 forboth a 1 0 and a 2 0 istheaveragedvaluereceivedbyagentsat g 2 ,whichis 10 = 2=5 . It'sworthtonotethatthisrewarddesignmaynotleadtotheoptimalreallocationstrategy thoughitempiricallyleadstogoodreallocationpolicy.Wegiveasimpleexampletoillustrate thisproblem.WeusethegridworldmapasshowinFigure5.1.Attime t =1 ,thereisan orderwithvalue100emergedin g 1 andanotherorderwithvalue10emergedin g 0 .Suppose 136 Figure5.1:Thegridworldsystemandaspatial-temporalillustrationoftheproblemsetting. wehavetwoagentsthatareavailableingrid g 0 attime t =0 .Theoptimalreallocation strategyinthiscaseistoaskoneagentstayin g 0 andanothergoto g 1 ,bywhichwecan receivethetotalreward110.However,inthecurrentsetting,eachagenttrystomaximize itsownreward.Asaresult,bothofthemwillgoto g 1 andreceive50rewardandnoneof themwillgoto g 1 sincetherewardtheycanreceiveislessthan50.However,weshowthat therearefewwaystoapproximatethisglobaloptimalallocationstrategyusingtheindividual actionfunctionofeachagent. 5.4ContextualMulti-AgentReinforcementLearning Inthissection,wepresenttwonovelcontextualmulti-agentRLapproaches:contextual multi-agentactor-critic(cA2C)andcontextualDQN(cDQN)algorithm.We˝rstbrie˛y introducethebasicmulti-agentRLmethod. 
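Before moving on to the learning algorithms, the state construction of Sec 5.3 can be made concrete with a minimal sketch. The array shapes follow the definitions above (N grids, T = 144 ten-minute intervals; N = 504 corresponds to the Chengdu grid world used later); the Poisson counts in the usage example are purely illustrative and are not part of the original formulation.

```python
import numpy as np

N, T = 504, 144  # number of hexagonal grids (Chengdu map) and 10-minute intervals per day

def global_state(num_vehicles, num_orders, t):
    """s_t: per-grid vehicle counts, per-grid order counts, and a one-hot time index (R^{2N+T})."""
    time_onehot = np.zeros(T)
    time_onehot[t] = 1.0
    return np.concatenate([num_vehicles, num_orders, time_onehot])

def agent_state(s_t, grid_id):
    """s_i_t = [s_t; g_j]: the shared global state plus a one-hot grid ID (R^{3N+T})."""
    grid_onehot = np.zeros(N)
    grid_onehot[grid_id] = 1.0
    return np.concatenate([s_t, grid_onehot])

# Usage: every agent currently located in grid 7 at interval 10 shares the same state.
s_t = global_state(np.random.poisson(3.0, N), np.random.poisson(2.0, N), t=10)
s_agent = agent_state(s_t, grid_id=7)
```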
5.4.1IndependentDQN IndependentDQN[ 189 ]combinesindependent Q -learning[ 190 ]andDQN[ 132 ].Astraightfor- wardextensionofindependentDQNfromsmallscaletoalargenumberofagents,istoshare 137 networkparametersanddistinguishagentswiththeirIDs[ 230 ].Thenetworkparameters canbeupdatedbyminimizingthefollowinglossfunction,withrespecttothetransitions collectedfromallagents: E 2 4 Q ( s i t ;a i t ; ) 0 @ r i t +1 + max a i t +1 Q ( s i t +1 ;a i t +1 ; 0 ) 1 A 3 5 2 ; (5.1) where 0 includesparametersofthetarget Q networkupdatedperiodically,and includes parametersofbehavior Q networkoutputtingtheactionvaluefor -greedypolicy,sameas thealgorithmdescribedin[ 132 ].Thismethodcouldworkreasonablywellafterextensive tunningbutitsu˙ersfromhighvarianceinperformance,anditalsorepositionstoomany vehicles.Moreover,coordinationamongmassiveagentsishardtoachievesinceeachunique agentexecutesitsactionindependentlybasedonitsactionvalues. 5.4.2ContextualDQN Sinceweassumethatthelocationtransitionofanagentaftertheallocationactionis deterministic,theactionsthatleadtheagentstothesamegridshouldhavethesameaction value.Inthiscase,thenumberofuniqueaction-valuesforallagentsshouldbeequalto thenumberofgrids N .Formally,foranyagent i where s i t =[ s t ; g i ] , a i t , [ g i ; g d ] and g i 2 Ner ( g d ) ,thefollowingholds: Q ( s i t ;a i t )= Q ( s t ; g d ) (5.2) 138 Hence,ateachtimestep,weonlyneed N uniqueaction-values( Q ( s t ; g j ) ; 8 j =1 ;:::;N )and theoptimizationofEq(5.1)canbereplacedbyminimizingthefollowingmean-squaredloss: " Q ( s t ; g d ; ) r t +1 ( g d )+ max g p 2 Ner ( g d ) Q ( s t +1 ; g p ; 0 ) !# 2 : (5.3) Thisacceleratesthelearningproceduresincetheoutputdimensionoftheactionvaluefunction isreducedfrom R j s t j ! R 7 to R j s t j ! R .Furthermore,wecanbuildacentralizedaction- valuetableateachtimeforallagents,whichcanserveasthefoundationforcoordinatingthe actionsofagents. Geographiccontext. Inhexagonalgridssystems,bordergridsandgridssurroundedby infeasiblegrids(e.g.,alake)havereducedactiondimensions.Toaccommodatethis,foreach gridwecomputea geographiccontext G g j 2 R 7 ,whichisabinaryvectorthat˝ltersout invalidactionsforagentsingrid g j .The k thelementofvector G g j representsthevalidity ofmovingtoward k thdirectionfromthegrid g j .Denote g d asthegridcorrespondstothe k thdirectionofgrid g j ,thevalueofthe k thelementof G g j isgivenby: [ G t; g j ] k = 8 > < > : 1 ; if g d isvalidgrid ; 0 ; otherwise ; (5.4) where k =0 ;:::; 6 andlastdimensionofthevectorrepresentsdirectionstayinginsamegrid, whichisalways1. Collaborativecontext. 
Toavoidthesituationthatagentsaremovingincon˛ictdirections (i.e.,agentsarerepositionedfromgrid g 1 to g 2 and g 2 to g 1 atthesametime.),weprovide a collaborativecontext C t; g j 2 R 7 foreachgrid g j ateachtime.Basedonthecentralized actionvalues Q ( s t ; g j ) ,werestrictthevalidactionssuchthatagentsatthegrid g j are 139 navigatingtotheneighboringgridswithhigheractionvaluesorstayingunmoved.Therefore, thebinaryvector C t; g j eliminatesactionstogridswithloweractionvaluesthantheaction stayingunmoved.Formally,the k thelementofvector C t; g j thatcorrespondstoactionvalue Q ( s t ; g i ) isde˝nedasfollows: [ C t; g j ] k = 8 > < > : 1 ; if Q ( s t ; g i ) > = Q ( s t ; g j ) ; 0 ; otherwise : (5.5) Aftercomputingbothcollaborativeandgeographiccontext,the -greedypolicyisthen performedbasedontheactionvaluessurvivedfromthetwocontexts.Supposetheoriginal actionvaluesofagent i attime t is Q ( s i t ) 2 R 7 0 ,givenstate s i t ,thevalidactionvaluesafter applyingcontextsisasfollows: q ( s i t )= Q ( s i t ) C t; g j G t; g j : (5.6) Thecoordinationisenabledbecausethattheactionvaluesofdi˙erentagentsleadtothe samelocationarerestrictedtobesamesothattheycanbecompared,whichisimpossiblein independentDQN.Thismethodrequiresthatactionvaluesarealwaysnon-negative,which willalwaysholdbecausethatagentsalwaysreceivenonnegativerewards.Thealgorithmof cDQNiselaboratedinAlg5.2. 5.4.3ContextualActor-Critic Wenowpresentthecontextualmulti-agentactor-critic(cA2C)algorithm,whichisamulti- agentpolicygradientalgorithmthattailorsitspolicytoadapttothedynamicallychanging actionspace.Meanwhile,itachievesnotonlyamorestableperformancebutalsoamuch 140 Algorithm5.1: -greedypolicyforcDQN Require: Globalstate s t 1: Computecentralizedactionvalue Q ( s t ; g j ) ; 8 j =1 ;:::;N 2: for i =1 to N t do 3: Computeactionvalues Q i byEq(5.2),where ( Q i ) k = Q ( s i t ;a i t = k ) . 4: Computecontexts C t; g j and G t; g j foragent i . 5: Computevalidactionvalues q i t = Q i t C t; g j G t; g j . 6: a i t = argmax k q i t withprobability 1 otherwisechooseanactionrandomlyfromthe validactions. 7: endfor 8: return Jointaction a t = f a i t g N t 1 . Algorithm5.2: ContextualDeepQ-learning(cDQN) 1: Initializereplaymemory D tocapacity M 2: Initializeaction-valuefunctionwithrandomweights orpre-trainedparameters. 3: for m =1 to max-iterations do 4: Resettheenvironmentandreachtheinitialstate s 0 . 5: for t =0 to T do 6: Samplejointaction a t usingAlg.5.1,given s t . 7: Execute a t insimulatorandobservereward r t andnextstate s t +1 8: Storethetransitionsofallagents( s i t ;a i t ;r i t ; s i t +1 ; 8 i =1 ;:::;N t )in D . 9: endfor 10: for k =1 to M 1 do 11: Sampleabatchoftransitions( s i t ;a i t ;r i t ; s i t +1 )from D , 12: Computetarget y i t = r i t + max a i t +1 Q ( s i t +1 ;a i t +1 ; 0 ) . 13: Update Q -networkas + r ( y i t Q ( s i t ;a i t ; )) 2 , 14: endfor 15: endfor moree˚cientlearningprocedureinanon-stationaryenvironment.Therearetwomainideas inthedesignofcA2C:1)Acentralizedvaluefunctionsharedbyallagentswithanexpected update;2)Policycontextembeddingthatestablishesexplicitcoordinationamongagents, enablesfastertrainingandenjoysthe˛exibilityofregulatingpolicytodi˙erentactionspaces. 
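Before detailing cA2C's value and policy updates, the contextual action masking that cDQN applies at execution time (Eqs (5.2), (5.4)-(5.6) and Alg 5.1), which cA2C reuses for its policy embedding below, can be illustrated with a minimal sketch. The `neighbors` adjacency encoding is a hypothetical convention of the sketch, and the Q-values are assumed non-negative as required in Sec 5.4.2.

```python
import numpy as np

def contextual_epsilon_greedy(q_grid, neighbors, grid_id, geo_mask, eps=0.1):
    """Sketch of Alg 5.1 for a single agent located in grid `grid_id`.

    q_grid    : length-N array of centralized action values Q(s_t, g_j), assumed non-negative
    neighbors : neighbors[g][k] = destination grid of direction k (k = 0..5), or -1 if invalid
                (hypothetical adjacency encoding); direction 6 means staying in g
    geo_mask  : length-7 binary geographic context G_{g_j} from Eq (5.4)
    """
    # Eq (5.2): the value of an action equals the value of its destination grid.
    q = np.zeros(7)
    for k in range(6):
        dest = neighbors[grid_id][k]
        q[k] = q_grid[dest] if dest >= 0 else 0.0
    q[6] = q_grid[grid_id]                        # staying in the current grid

    # Eq (5.5): collaborative context keeps directions at least as valuable as staying.
    collab_mask = (q >= q_grid[grid_id]).astype(float)
    q_valid = q * collab_mask * geo_mask          # Eq (5.6)

    valid = np.flatnonzero(q_valid > 0)
    if valid.size == 0:                           # degenerate case: fall back to staying
        return 6
    if np.random.rand() < eps:                    # epsilon-greedy over the surviving actions
        return int(np.random.choice(valid))
    return int(valid[np.argmax(q_valid[valid])])
```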
Thecentralizedstate-valuefunctionislearnedbyminimizingthefollowinglossfunction 141 derivedfromBellmanequation: L ( v )=( V v ( s i t ) V target ( s t +1 ; 0 v ;ˇ )) 2 ; (5.7) V target ( s t +1 ; 0 v ;ˇ )= X a i t ˇ ( a i t j s i t )( r i t +1 + V 0 v ( s i t +1 )) : (5.8) whereweuse v todenotetheparametersofthevaluenetworkand 0 v todenotethetarget valuenetwork.Sinceagentsstayingunmovedatthesametimearetreatedhomogeneousand sharethesameinternalstate,thereare N uniqueagentstates,andthus N uniquestate-values ( V ( s t ; g j ) ; 8 j =1 ;:::;N )ateachtime.Thestate-valueoutputisdenotedby v t 2 R N ,where eachelement ( v t ) j = V ( s t ; g j ) istheexpectedreturnreceivedbyagentarrivingatgrid g j ontime t .Inordertostabilizelearningofthevaluefunction,we˝xatargetvaluenetwork parameterizedby 0 v ,whichisupdatedattheendofeachepisode.Notethattheexpected updateinEq(5.7)andtrainingactor/criticinano˜inefashionaredi˙erentfromtheupdates in n -stepactor-criticonlinetrainingusingTDerror[ 131 ],whereastheexpectedupdates andtrainingparadigmarefoundtobemorestableandsample-e˚cient.Thisisalsoin linewithpriorworkinapplyingactor-critictorealapplications[ 12 ].Furthermore,e˚cient coordinationamongmultipleagentscanbeestablisheduponthiscentralizedvaluenetwork. PolicyContextEmbedding. Coordinationisachievedbymaskingavailableactionspace basedonthecontext.Ateachtimestep,thegeographiccontextisgivenbyEq(5.4)andthe collaborativecontextiscomputedaccordingtothevaluenetworkoutput: [ C t; g j ] k = 8 > < > : 1 ;ifV ( s t ; g i ) > = V ( s t ; g j ) ; 0 ; otherwise ; (5.9) wherethe k thelementofvector C t; g j correspondstotheprobabilityofthe k thaction 142 ˇ ( a i t = k j s i t ) .Let P ( s i t ) 2 R 7 > 0 denotetheoriginallogitsfromthepolicynetworkoutputfor the i thagentconditionedonstate s i t .Let q valid ( s i t )= P ( s i t ) C t; g j G g j denotethevalid logitsconsideringbothgeographicandcollaborativecontextforagent i atgrid g j ,where denotesanelement-wisemultiplication.Inordertoachievee˙ectivemasking,werestrictthe outputlogits P ( s i t ) tobepositive.Theprobabilityofvalidactionsforallagentsinthegrid g j aregivenby: ˇ p ( a i t = k j s i t )=[ q valid ( s i t )] k = [ q valid ( s i t )] k k q valid ( s i t ) k 1 : (5.10) Thegradientofpolicycanthenbewrittenas: r p J ( p )= r p log ˇ p ( a i t j s i t ) A ( s i t ;a i t ) ; (5.11) where p denotestheparametersofpolicynetworkandtheadvantage A ( s i t ;a i t ) iscomputed asfollows: A ( s i t ;a i t )= r i t +1 + V 0 v ( s i t +1 ) V v ( s i t ) : (5.12) ThedetaileddescriptionofcA2CissummarizedinAlg5.4. 5.5E˚cientallocationwithlinearprogramming Inthissection,wepresenttheproposedLP-cA2Cthatutilizesthestatevaluefunctions learnedbycA2Candcomputethereallocationsinacentralizedview,whichachievesthebest performancewithhighere˚ciency. 143 Algorithm5.3: ContextualMulti-agentActor-CriticPolicyforward Require: Theglobalstate s t . 1: Computecentralizedstate-value v t 2: for i=1to N t do 3: Computecontexts C t; g j and G t; g j foragent i . 4: Computeactionprobabilitydistribution q valid ( s i t ) foragent i ingrid g j (Eq(5.10)). 5: Sampleactionforagent i ingrid g j basedonactionprobability p i . 6: endfor 7: return Jointaction a t = f a i t g N t 1 . Figure5.2:Illustrationofcontextualmulti-agentactor-critic.Theleftpartshowsthe coordinationofdecentralizedexecutionbasedontheoutputofcentralizedvaluenetwork. Therightpartillustratesembeddingcontexttopolicynetwork. 
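The policy context embedding illustrated in the right part of Figure 5.2 (Eqs (5.9)-(5.12)) can be summarized in a short sketch. This is a minimal illustration rather than the full cA2C implementation: `v_grid` stands for the centralized value output v_t, `neighbors` is the same hypothetical adjacency encoding used in the cDQN sketch above, and the policy logits are assumed positive (ReLU + 1 output layer, as in Sec 5.7).

```python
import numpy as np

def collaborative_mask(v_grid, neighbors, grid_id):
    """Eq (5.9): permit moves only toward grids whose state value is no lower than the
    current grid's value; direction 6 (staying) is always permitted."""
    c = np.zeros(7)
    for k in range(6):
        dest = neighbors[grid_id][k]
        if dest >= 0 and v_grid[dest] >= v_grid[grid_id]:
            c[k] = 1.0
    c[6] = 1.0
    return c

def masked_policy_probs(logits, collab_mask, geo_mask):
    """Eq (5.10): pi(a = k | s) = [P(s) * C * G]_k / ||P(s) * C * G||_1,
    where P(s) are the positive policy-network logits."""
    q_valid = logits * collab_mask * geo_mask
    return q_valid / q_valid.sum()

def advantage(r_next, v_next, v_now, gamma=0.99):
    """Eq (5.12): A(s_i_t, a_i_t) = r_{t+1} + gamma * V'(s_{t+1}) - V(s_t)."""
    return r_next + gamma * v_next - v_now
```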
Fromanotherperspective,ifweformulatethisproblemasaMDPwherewehavea meta-agentthatcontrolsthedecisionsofalldrivers,ourgoalistomaximizethelongterm rewardoftheplatform: Q c ( s ; a )= E [ 1 X t =1 t 1 r t ( s t ; a t ) j s 0 = s ; a 0 = a ;ˇ ] : The ˇ inaboveformulationdenotestheoptimalglobalreallocationstrategy.Althoughthe sumofimmediaterewardreceivedbyallagentsisequaltothetotalrewardoftheplatform, maximizingthelongtermrewardofeachagentisnotequaltomaximizethelongterm rewardoftheplatform,i.e. P i max a i Q ( s i ;a i ) 6 =max a Q c ( s ; a ) .Incooperativemulti-agent 144 Algorithm5.4: ContextualMulti-agentActor-CriticAlgorithmfor N agents 1: Initialization: 2: Initializethevaluenetworkwith˝xedvaluetable. 3: for m =1 tomax-iterations do 4: Resetenvironment,getinitialstate s 0 . 5: Stage1:Collectingexperience 6: for t =0 to T do 7: Sampleactions a t accordingtoAlg5.3,given s t . 8: Execute a t insimulatorandobservereward r t andnextstate s t +1 . 9: ComputevaluenetworktargetasEq(5.8)andadvantageasEq(5.12) forpolicynetworkandstorethetransitions. 10: endfor 11: Stage2:Updatingparameters 12: for m 1 =1 to M 1 do 13: Sampleabatchofexperience: s i t ;V target ( s i t ; 0 v ;ˇ ) 14: UpdatevaluenetworkbyminimizingthevaluelossEq(5.7)overthebatch. 15: endfor 16: for m 2 =1 to M 2 do 17: Sampleabatchofexperience: s i t ;a i t ;A ( s i t ;a i t ) ; C t;g j ; G g j . 18: Updatepolicynetworkas p p + r p J ( p ) . 19: endfor 20: endfor reinforcementlearning,thesumofrewardsofmultipleagentsistheglobalrewardwewant tomaximize.Inthiscase,givenacentralizedpolicy( ˇ )forallagents,thesummationof longtermrewardshouldbeequaltothegloballongtermreward. N X i =1 Q i ( s i ;a i )= N X i =1 E ˇ " 1 X t =1 t 1 r i t s i 0 = s i ;a i 0 = a i # = E ˇ 2 4 1 X t =1 t 1 N X i =1 r i t s 0 = s ; a 0 = a 3 5 = E ˇ " 1 X t =1 t 1 r t s 0 = s ; a 0 = a # = Q c ( s ; a ) However,inthiswork,thissimplerelationshipdoesnotholdmainlysincethenumberof agents( N t )isnotstatic.AsshowninEq(5.13),theglobalrewardattime t +1 oftheplatform 145 isnotequaltothesumofallcurrentagents'reward(i.e. P N t i =1 r i t +1 6 = P N t +1 i =1 r i t +1 = r t +1 ) evengivenacentralizedpolicy ˇ . N t X i =1 Q ( s i t ;a i t )= N t X i =1 E ˇ [ r i t +1 + max a i t +1 Q ( s i t +1 ;a i t +1 )] (5.13) Ideally,wewouldliketodirectlylearnthecentralizedactionvaluefunction Q c whileit's computationalintractabletoexploreandoptimizethe Q c inthecasewehavesubstantially largeactionspace.Therefore,weneedtoleveragetheaveragedlongtermrewardofeach agenttoapproximatethemaximizationofthecentralizedaction-valuefunction Q c .IncDQN, weapproximatethisallocationbyavoidingthegreedyallocationwith greedystrategyeven duringtheevaluationstage.IncA2C,thepolicywillallocatetheagentsinthesamelocation toitsnearbylocationswithcertainprobabilityaccordingtothestate-values.Infact,we usesthisempiricalstrategytobetteralignthejointactionsofeachindividualagentwiththe actionfromoptimalreallocation.However,bothofthecA2CandcDQNtrytocoordinate agentsfromalocalizedview,inwhicheachagentonlyconsideritsnearbysituationwhen theyarecoordinating.Therefore,theredundantreallocationstillexistsinthosetwomethods. Othermethodsthatcanapproximatethecentralizedaction-valuefunctionsuchasVDN[ 185 ] andQMIX[160]arenotabletoscaletolargenumberofagents. Inthiswork,weproposetoapproximatethecentralizedpolicybyformulatingthe reallocationasalinearprogrammingproblem. max y ( s t ) v ( s t ) T A t c T t y ( s t ) k D ( o t +1 A t y ( s t )) k 2 2 (5.14) s.t. 
y ( s t ) 0 B t y ( s t )= d t 146 wherethevector y ( s t ) 2 R N r ( t ) 1 denotesthefeasiblerepositionsforallagentsatcurrent timestep t .Eachelementin y ( s t ) representsonerepositionfromcurrentgridtoitsnearbygrid. N r ( t ) isthetotalnumberoffeasiblerepositiondirection.Thenumberoffeasiblerepositions dependsonthecurrentstatevaluesineachgridsincewereallocateagentsfromlocationwith lowerstatevaluetothegridwithhigherstatevalue. A 2 R N N r ( t ) isaindicatormatrix thatdenotestheallocationsthatdispatchdriversintothegrid,i.e. A i;j 2f 0 ; 1 g . A i;j =1 meansthe j -threpositionreallocatesagentsintothe i -thgrid.Similarly, B 2 R N N r ( t ) istheindicatormatrixthatdenotestheallocationsthatdispatchdriversoutofthegrid. D 2f 0 ; 1 g N N istheadjacencymatrixdenotestheconnectivityofthegridworld. o t +1 denotestheestimatednumberofordersineachgridatnexttimestep. c t 2 R N r ( t ) 1 denotes thecostassociatedwitheachrepositionand s ( s t ) 2 R N 1 denotesthestatevalueforeach gridintimestep t . The˝rstterminEq(5.14)approximatesourgoalthatwewanttomaximizethelong termrewardoftheplatform.Sincethestatevaluecanbeinterpretedastheaveragedlong termrewardoneagentwillreceiveifitappearsincertaingrid,the˝rsttermrepresentsthe totalrewardminusthetotalcostassociatedwiththerepositions.However,optimizingthe ˝rsttermwillleadtoagreedysolutionthatreallocatesalltheagentstothenearbygrid withhigheststatevalueminusthecost.Toalleviatethisgreedyreallocation,weaddthe secondtermtoregularizethenumberofagentsreallocatedtoeachgrid.Sincetheagentin currentgridcanpickuptheordersemergedinnearbygrids,weutilizetheadjacencymatrix toregularizethenumberofagentsreallocatedintoagroupofnearbygridsshouldbecloseto thenumberofordersemergedinagroupofnearbygrids.Fromanotherpointofview,the secondtermmorefocusontheimmediaterewardsinceitpreferthesolutionthatallocates rightamountofagentstopick-uptheorderswithoutconsiderthefutureincomethatan 147 agentcanreceivebythatreposition.Theregularizationparameter isusedtobalancethe longtermrewardandtheimmediatereward.Thetwo˛owconservationconstrainsrequires thenumberofrepositionsshouldbepositiveandthenumberofrepositionsfromcurrentgrid shouldbeequaltothenumberofavailableagentsincurrentgrids. Ideally,weneedtosolveaintegerprogrammingproblemwhereoursolutionsatis˝es y ( s t ) 2Z N r .However,solvingintegerprogrammingisNP-hardinworstcasewhilesolving itslinearprogrammingrelaxationisinP.Inpractice,wesolvethelinearprogramming relaxationandroundthesolutionintointegers[49]. 5.6SimulatorDesign AfundamentalchallengeofapplyingRLalgorithminrealityisthelearningenvironment. Unlikethestandardsupervisedlearningproblemswherethedataisstationarytothelearning algorithmsandcanbeevaluatedbythetraining-testingparadigm,theinteractivenature ofRLintroducesintricatedi˚cultiesontrainingandevaluation.Onecommonsolutionin tra˚cstudiesistobuildsimulatorsfortheenvironment[ 211 , 172 , 128 ].Inthissection,we introduceasimulatordesignthatmodelsthegenerationoforders,procedureofassigning ordersandkeydriverbehaviorssuchasdistributionsacrossthecity,on-line/o˙-linestatus controlintherealworld.ThesimulatorservesasthetrainingenvironmentforRLalgorithms, aswellastheirevaluation.Moreimportantly,oursimulatorallowsustocalibratethekey performanceindexwiththehistoricaldatacollectedfroma˛eetmanagementsystem,and thusthepolicieslearnedarewellalignedwithreal-worldtra˚cs. 
TheDataDescription ThedataprovidedbyDidiChuxingincludesordersandtrajectories ofvehiclesintwocitiesincludingChengduandWuhan.Chengduiscoveredbyahexagonal 148 gridsworldconsistingof504grids.Wuhancontainsmorethanonethousandsgrids.The orderinformationincludesorderprice,origin,destinationandduration.Thetrajectories containthepositions(latitudeandlongitude)andstatus(on-line,o˙-line,on-service)ofall vehicleseveryfewseconds. TimelineDesign. Inonetimeinterval(10minutes),themainactivitiesareconducted sequentially,alsoillustratedinFigure5.4. Vehiclestatusupdates: Vehicleswillbestochasticallyseto˜ine(i.e.,o˙fromservice) oronline(i.e.,startworking)followingaspatiotemporaldistributionlearnedfromreal datausingthemaximumlikelihoodestimation(MLE).Othertypesofvehiclestatus updatesinclude˝nishingcurrentserviceorallocation.Inotherwords,ifavehicle isaboutto˝nishitsserviceatthecurrenttimestep,orarrivingatthedispatched grid,thevehiclesareavailablefortakingnewordersorbeingrepositionedtoanew destination. Ordergeneration: Thenewordersgeneratedatthecurrenttimesteparebootstrapped fromrealordersoccurredinthesametimeinterval.Sincetheorderwillnaturally repositionvehiclesinawiderange,thisprocedurekeepstherepositionfromorders similartotherealdata. Interactwithagents: Thisstepcomputesstateasinputto˛eetmanagementalgorithm andappliestheallocationsforagents. Orderassignments: Allavailableordersareassignedthroughatwo-stageprocedure. Inthe˝rststage,theordersinonegridareassignedtothevehiclesinthesamegrid. Inthesecondstage,theremainingun˝lledordersareassignedtothevehiclesinits neighboringgrids.Inreality,theplatformdispatchesordertoanearbyvehiclewithin 149 Figure5.3:ThesimulatorcalibrationintermsofGMV.TheredcurvesplottheGMVvalues ofrealdataaveragedover7dayswithstandarddeviation,in10-minutetimegranularity. Thebluecurvesaresimulatedresultsaveragedover7episodes. acertaindistance,whichisapproximatelytherangecoveredbythecurrentgridand itsadjacentgrids.Therefore,theabovetwo-stageprocedureisessentialtostimulate thesereal-worldactivitiesandthefollowingcalibration.Thissettingdi˙erentiates ourproblemfromtheprevious˛eetmanagementproblemsetting(i.e.,demandsare servedbythoseresourcesatthesamelocationonly.)andmakeitimpossibletodirectly applytheclassicmethodssuchasadaptivedynamicprogrammingapproachesproposed in[64,65]. Calibration. Thee˙ectivenessofthesimulatorisguaranteedbycalibrationagainstthereal dataregardingthemostimportantperformancemeasurement:thegrossmerchandisevolume (GMV).AsshowninFigure5.3,afterthecalibrationprocedure,theGMVinthesimulatoris verysimilartothatfromtheride-sharingplatform.The r 2 betweensimulatedGMVand realGMVis 0 : 9331 andthePearsoncorrelationis 0 : 9853 with p -value p< 0 : 00001 . 150 Figure5.4:Simulatortimelineinonetimestep(10minutes). 5.7Experiments Inthissection,weconductextensiveexperimentstoevaluatethee˙ectivenessofourproposed method. 
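As a concrete illustration of the two-stage order-assignment step of the simulator described above (Sec 5.6), the following is a minimal sketch that matches order counts to idle-vehicle counts per grid. It is a simplification that ignores order attributes such as price, destination, and duration, which the actual simulator tracks.

```python
def assign_orders(orders_per_grid, idle_per_grid, neighbors):
    """Two-stage dispatch (Sec 5.6). Inputs are dicts mapping grid_id -> count and are
    modified in place; returns the total number of served orders."""
    served = 0
    # Stage 1: orders are matched to idle vehicles in the same grid.
    for g, n_orders in orders_per_grid.items():
        matched = min(n_orders, idle_per_grid.get(g, 0))
        orders_per_grid[g] = n_orders - matched
        idle_per_grid[g] = idle_per_grid.get(g, 0) - matched
        served += matched
    # Stage 2: remaining orders are offered to idle vehicles in the six neighboring grids.
    for g, n_orders in orders_per_grid.items():
        for nb in neighbors[g]:
            if n_orders == 0:
                break
            matched = min(n_orders, idle_per_grid.get(nb, 0))
            idle_per_grid[nb] = idle_per_grid.get(nb, 0) - matched
            n_orders -= matched
            served += matched
        orders_per_grid[g] = n_orders
    return served
```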
5.7.1Experimentalsettings Inthefollowingexperiments,bothoftrainingandevaluationareconductedonthesimulator introducedinSec5.6.Forallthecompetingmethods,weprescribetwosetsofrandomseed thatcontrolthedynamicsofthesimulatorfortrainingandevaluation,respectively.Examples ofdynamicsinsimulatorincludeordergenerations,andstochasticallystatusupdateofall vehicles.Inthissetting,wecantestthegeneralizationperformanceofalgorithmswhenit encountersunseendynamicsasinrealscenarios.TheperformanceismeasuredbyGMV(the totalvalueofordersservedinthesimulator)gainedbytheplatformoveroneepisode(144 timestepsinthesimulator),andorderresponserate(ORR),whichistheaveragednumber ofordersserveddividedbythenumberofordersgenerated.Weusethe˝rst15episodesfor trainingandconductevaluationonthefollowingtenepisodesforalllearningmethods.The numberofavailablevehiclesateachtimeindi˙erentlocationsiscountedbyapre-dispatch procedure.Thisprocedurerunsavirtualtwo-stageorderdispatchingprocesstocomputethe remainingavailablevehiclesineachlocation.Onaverage,thesimulatorhas5356agentsper timestepwaitingformanagement.Allthequantitativeresultsoflearningmethodspresented inthissectionareaveragedoverthreeruns. 151 5.7.2Performancecomparison Inthissubsection,theperformanceoffollowingmethodsareextensivelyevaluatedbythe simulation. Simulation: Thisbaselinesimulatestherealscenariowithoutany˛eetmanagement. ThesimulatedresultsarecalibratedwithrealdatainSec5.6. Di˙usion: Thismethoddi˙usesavailablevehiclestoneighboringgridsrandomly. Rule-based: Thisbaselinecomputesa T N valuetable V rule ,whereeachelement V rule ( t;j ) representstheaveragedrewardofanagentstayingingrid g j attimestep t .Therewardsareaveragedovertenepisodescontrolledbyrandomseedsthatare di˙erentwithtestingepisodes.Withthevaluetable,theagentsamplesitsactionbased ontheprobabilitymassfunctionnormalizedfromthevaluesofneighboringgridsat thenexttimestep.Forexample,ifanagentlocatedin g 1 attime t andthecurrent validactionsare [ g 1 ; g 2 ] and [ g 1 ; g 1 ] ,therule-basedmethodsampleitsactionsfrom p ( a i t , [ g 1 ; g j ])= V rule ( t +1 ;j ) = ( V rule ( t +1 ; 2)+ V rule ( t +1 ; 1)) ; 8 j =1 ; 2 . Value-Iter: Itdynamicallyupdatesthevaluetablebasedonpolicyevaluation[ 186 ]. Theallocationpolicyiscomputedbasedonthenewvaluetable,thesameusedinthe rule-basedmethod,whilethecollaborativecontextisconsidered. T- Q learning :Thestandardindependenttabular Q -learning[ 186 ]learnsatable q tabular 2 R T N 7 with -greedypolicy.Inthiscasethestatereducestotimeandthe locationoftheagent. T-SARSA :TheindependenttabularSARSA[ 186 ]learnsatable q sarsa 2 R T N 7 withsamesettingofstatesasT- Q learning. 152 DQN :TheindependentDQNiscurrentlythestate-of-the-artasweintroducedin Sec5.4.1.Our Q networkisparameterizedbyathree-layerELUs[ 41 ]andweadopt the -greedypolicyastheagentpolicy.The isannealedlinearlyfrom0.5to0.1across the˝rst15trainingepisodesand˝xedas =0 : 1 duringthetesting. cDQN :ThecontextualDQNasweintroducedinSec5.4.2.The isannealedthesame asinDQN.Attheendofeachepisode,the Q -networkisupdatedover4000batches, i.e. M 1 =4000 inAlg5.2.Toensureavalidcontextmasking,theactivationfunction oftheoutputlayerofthe Q -networkisReLU+1. cA2C :Thecontextualmulti-agentactor-criticasweintroducedinSec5.4.3.Atthe endofeachepisode,boththepolicynetworkandthevaluenetworkareupdatedover 4000batches,i.e. M 1 = M 2 =4000 inAlg5.2.SimilartocDQN,Theoutputlayerof thepolicynetworkusesReLU+1astheactivationfunctiontoensurethatallelements intheoriginallogits P ( s i t ) arepositive. 
LP-cA2C :Thecontextualmulti-agentactor-criticwithlinearprogrammingasintro- ducedinSec5.5.Duringthetrainingstate,weusecA2Ctoexploretheenvironment andlearnthestatevaluefunction.Duringtheevaluation,weconductthepolicygiven bylinearprogramming. Exceptforthe˝rstbaseline,thegeographiccontextisconsideredinallmethodssothat theagentswillnotnavigatetotheinvalidgrid.Unlessotherspeci˝ed,thevaluefunction approximationsandpolicynetworkincontextualalgorithmsareparameterizedbyathree- layerReLU[ 78 ]withnodesizesof128,64and32,fromthe˝rstlayertothethirdlayer.The batchsizeofalldeeplearningmethodsis˝xedas3000,andweuse AdamOptimizer witha learningrateof 1 e 3 .SinceperformanceofDQNvariesalotwhentherearealargenumber 153 ofagents,the˝rstcolumnintheTable5.1forDQNisaveragedoverthebestthreerunsout ofsixruns,andtheresultsforallothermethodsareaveragedoverthreeruns.Also,the centralizedcriticsofcDQNandcA2Careinitializedfromapre-trainedvaluenetworkusing thehistoricalmeanofordervaluescomputedfromtenepisodessimulation,withdi˙erent randomseedsfrombothtrainingandevaluation. Totesttherobustnessofproposedmethod,weevaluateallcompetingmethodsunder di˙erentnumbersofinitialvehiclesaccrossdi˙erentcities.Theresultsaresummarizedin Table5.1,5.2,5.3.Theresultsof Di˙usion improvedtheperformancealotinTable5.1, possiblybecausethatthemethodsometimesencouragestheavailablevehiclestoleavethe gridwithhighdensityofavailablevehicles,andthustheimbalancedsituationisalleviated. However,inamorerealisticsettingthatweconsiderrepositioncost,thismethodcanlead tonegativee˙ectiveduetothehighlyine˚cientreallocations.The Rule-based methodthat repositionsvehiclestothegridswithahigherdemandvalue,improvestheperformance ofrandomrepositions.The Value-Iter dynamicallyupdatesthevaluetableaccordingto thecurrentpolicyappliedsothatitfurtherpromotestheperformanceupon Rule-based . Comparingtheresultsof Value-Iter , T-Qlearning and T-SARSA ,the˝rstmethodconsistently outperformsthelattertwo,possiblybecausethattheusageofacentralizedvaluetableenables coordinations,whichhelpstoavoidcon˛ictrepositions.Theabovemethodssimplifythestate representationintoaspatial-temporalvaluerepresentation,whereastheDRLmethodsaccount bothcomplexdynamicsofsupplyanddemandusingneuralnetworkfunctionapproximations. AstheresultsshowninlastthreerowsofTable5.1,5.2,5.3,themethodswithdeeplearning outperformsthepreviousone.Furthermore,thecontextualalgorithmslargelyoutperform theindependentDQN(DQN),whichisthestate-of-the-artamonglarge-scalemulti-agent DRLmethodandallothercompetingmethods.Lastbutnotleast,thelp-cA2Cacheivethe 154 Table5.1:PerformancecomparisonofcompetingmethodsintermsofGMVandorder responseratewithoutrepositioncost. 
100% initialvehicles 90% initialvehicles 10% initialvehicles NormalizedGMVORR NormalizedGMVORR NormalizedGMVORR Simulation 100 : 00 0 : 6081 : 80% 0 : 37% 98 : 81 0 : 5080 : 64% 0 : 37% 92 : 78 0 : 7970 : 29% 0 : 64% Di˙usion 105 : 68 0 : 6486 : 48% 0 : 54% 104 : 44 0 : 5784 : 93% 0 : 49% 99 : 00 0 : 5174 : 51% 0 : 28% Rule-based 108 : 49 0 : 4090 : 19% 0 : 33% 107 : 38 0 : 5588 : 70% 0 : 48% 100 : 08 0 : 5075 : 58% 0 : 36% Value-Iter 110 : 29 0 : 7090 : 14% 0 : 62% 109 : 50 0 : 6889 : 59% 0 : 69% 102 : 60 0 : 6177 : 17% 0 : 53% T-Qlearning 108 : 78 0 : 5190 : 06% 0 : 38% 107 : 71 0 : 4289 : 11% 0 : 42% 100 : 07 0 : 5575 : 57% 0 : 40% T-SARSA 109 : 12 0 : 4990 : 18% 0 : 38% 107 : 69 0 : 4988 : 68% 0 : 42% 99 : 83 0 : 5075 : 40% 0 : 44% DQN 114 : 06 0 : 6693 : 01% 0 : 20% 113 : 19 0 : 6091 : 99% 0 : 30% 103 : 80 0 : 9677 : 03% 0 : 23% cDQN 115 : 19 0 : 4694 : 77% 0 : 32% 114.29 0 : 66 94.00% 0 : 53% 105 : 29 0 : 7079 : 28% 0 : 58% cA2C 115.27 0 : 70 94.99% 0 : 48% 113 : 85 0 : 6993 : 99% 0 : 47% 105.62 0 : 66 79.57% 0 : 51% Table5.2:PerformancecomparisonofcompetingmethodsintermsofGMV,orderresponse rate(ORR),andreturnoninvest(ROI)inXianconsideringrepositioncost. 100% initialvehicles 90% initialvehicles 10% initialvehicles NormalizedGMVORRROI NormalizedGMVORRROI NormalizedGMVORRROI Simulation 100 : 00 0 : 6081 : 80% 0 : 37% - 98 : 81 0 : 5080 : 64% 0 : 37% - 92 : 78 0 : 7970 : 29% 0 : 64% - Di˙usion 103 : 02 0 : 4186 : 49% 0 : 42% 0.5890 102 : 35 0 : 5185 : 00% 0 : 47% 0.7856 97 : 41 0 : 5574 : 51% 0 : 46% 1.5600 Rule-based 106 : 21 0 : 4390 : 00% 0 : 43% 1.4868 105 : 30 0 : 4288 : 58% 0 : 37% 1.7983 99 : 37 0 : 3675 : 83% 0 : 48% 3.2829 Value-Iter 108 : 26 0 : 6590 : 28% 0 : 50% 2.0092 107 : 69 0 : 8289 : 53% 0 : 56% 2.5776 101 : 56 0 : 6577 : 11% 0 : 44% 4.5251 T-Qlearning 107 : 55 0 : 5890 : 12% 0 : 52% 2.9201 106 : 60 0 : 5289 : 17% 0 : 41% 4.2052 99 : 99 1 : 2875 : 97% 0 : 91% 5.2527 T-SARSA 107 : 73 0 : 4689 : 93% 0 : 34% 3.3881 106 : 88 0 : 4588 : 82% 0 : 37% 5.1559 99 : 11 0 : 4075 : 23% 0 : 35% 6.8805 DQN 110 : 81 0 : 6892 : 50% 0 : 50% 1.7811 110 : 16 0 : 6091 : 79% 0 : 29% 2.3790 103 : 40 0 : 5177 : 14% 0 : 26% 4.3770 cDQN 112 : 49 0 : 4294 : 88% 0 : 33% 2.2207 112 : 12 0 : 4094 : 17% 0 : 36% 2.7708 104 : 25 0 : 5579 : 41% 0 : 48% 4.8340 cA2C 112 : 70 0 : 6494 : 74% 0 : 57% 3.1062 112 : 05 0 : 4593 : 97% 0 : 37% 3.8085 104 : 19 0 : 7079 : 25% 0 : 68% 5.2124 LP-cA2C 113.60 0 : 56 95.27% 0 : 36% 4.4633 112.75 0 : 65 94.62% 0 : 47% 5.2719 105.37 0 : 58 80.15% 0 : 46% 7.2949 bestperformanceintermsofreturnoninvestment(thegmvgainperreallocation),GMV, andorderresponserate. 5.7.3OntheE˚ciencyofReallocations Inreality,eachrepositioncomeswithacost.Inthissubsection,weconsidersuchreposition costsandestimatedthembyfuelcosts.Sincethetraveldistancefromonegridtoanother isapproximately1.2kmandthefuelcostisaround0.5RMB/km,wesetthecostofeach repositionas c =0 : 6 .Inthissetting,thede˝nitionofagent,state,actionandtransition probabilityissameaswestatedinSec5.3.Theonlydi˙erenceisthattherepositioningcost isincludedintherewardwhentheagentisrepositionedtodi˙erentlocations.Therefore,the GMVofoneepisodeisthesumofallservedordervaluesubstractedbythetotalofreposition costinoneepisode.Forexample,theobjectivefunctionforDQNnowincludesthereposition 155 Table5.3:PerformancecomparisonofcompetingmethodsintermsofGMV,orderresponse rate(ORR),andreturnoninvest(ROI)inWuhanconsideringrepositioncost. 
NormalizedGMVORRROI Simulation 100 : 00 0 : 4876 : 56% 0 : 45% - Di˙usion 98 : 84 0 : 4480 : 07% 0 : 24% -0.2181 Rule-based 103 : 84 0 : 6384 : 91% 0 : 25% 0.5980 Value-Iter 107 : 13 0 : 7085 : 06% 0 : 45% 1.6156 T-Qlearning 107 : 10 0 : 6185 : 28% 0 : 28% 1.8302 T-SARSA 107 : 14 0 : 6484 : 99% 0 : 28% 2.0993 DQN 108 : 45 0 : 6286 : 67% 0 : 33% 1.0747 cDQN 108 : 93 0 : 5789 : 03% 0 : 26% 1.1001 cA2C 113 : 31 0 : 5488 : 57% 0 : 45% 4.4163 LP-cA2C 114.92 0 : 65 89.29% 0 : 39% 6.1417 Table5.4:E˙ectivenessofcontextualmulti-agentactor-criticconsideringrepositioncosts. NormalizedGMV ORR Repositions DQN 110 : 81 0 : 68 92 : 50% 0 : 50% 606932 cDQN 112 : 49 0 : 42 94 : 88% 0 : 33% 562427 cA2C 112 : 70 0 : 64 94 : 74% 0 : 57% 408859 LP-cA2C 113 : 60 0 : 56 95 : 27% 0 : 36% 304752 costasfollows: E Q ( s i t ;a i t ; ) r i t +1 c + max a i t +1 Q ( s i t +1 ;a i t +1 ; 0 2 ; (5.15) where a i t , [ g o ; g d ] ,andif g d = g o then c =0 ,otherwise c =0 : 6 .Similarly,wecanconsider thecostsincA2C.However,itishardtoapplythemtocDQNbecausethattheassumption, thatdi˙erentactionsthatleadtothesamelocationshouldsharethesameactionvalue, whichisnotheldinthissetting.Therefore,insteadofconsideringtherepositioncostinthe objectivefunction,weonlyincorporatetherepositioncostwhenweactuallyconductour policybasedoncDQN.Underthissetting,thelearningobjectiveofactionvalueofcDQNis 156 sameasinEq(5.3)whilethecontextembeddingischangedfromEq(5.4)tothefollowing: [ C t; g j ] k = 8 > < > : 1 ; if Q ( s t ; g i ) > = Q ( s t ; g j )+ c; 0 ; otherwise : (5.16) ForLP-cA2C,thecoste˙ectisnaturallyincorporatedintheobjectivefunctionas inEq(5.14).AstheresultsshowninTable5.4,theDQNtendstorepositionmoreagents whilethecontextualalgorithmsachievebetterperformanceintermsofbothGMVand orderresponserate,withlowercost.Moreimportantly,theLP-cA2Coutperformsother methodsinbothoftheperformanceande˚ciency.Thereasonisthatthismethodformulate thecoordinationamongagentsintoanoptimizationproblem,whichapproximatesthe maximizationoftheplatform'slongtermrewardinacentralizedversion.Thecentralized optimizationproblemcanavoidlotsofredundantreallocationscomparedtopreviousmethods. Thetrainingproceduresandthenetworkarchitecturearethesameasdescribedintheprevious section. Tobemoreconcrete,wegiveaspeci˝cscenariotodemonstratethatthee˚ciencyof LP-cA2C.Imagingwewouldliketoaskdriverstomovefromgrid A tonearbygrid B while thereisagrid C thatisadjacenttobothgrid A and B .Inthepreviousalgorithms,since theallocationisjointlygivenbyeachagent,it'sverylikelythatwereallocateagentsby theshortpath A ! B andlongerpath A ! C ! B whentherearesu˚cientamountof agentscanarriveat B from A .Theseine˚cientreallocationscanbeavoidedbyLP-cA2C naturallysincethelongerpathonlyincursahighercostwhichwillbethesuboptimalsolution toourobjectivefunctioncomparedtothesolutiononlycontainsthe˝rstpath.Asshown inFigure5.5(a),theallocationcomputedbycA2Ccontainsmany triangle repositionsas denotedbytheblackcircle,whilewedidn'tobservetheseine˚cientallocationsinFigure5.5 157 (a)cA2C(b)LP-cA2C Figure5.5:IllustrationofallocationsofcA2CandLP-cA2Cat18:40and19:40,respsectively. (b).Therefore,theallocationpolicydeliveredbyLP-cA2Cismoree˚cientthanthosegiven bypreviousalgorithms. 
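The reallocation program of Eq (5.14) that LP-cA2C solves at each time step can be written as a small convex program. The sketch below uses the cvxpy modeling library, which the thesis does not prescribe, so the solver choice and the simple rounding step are assumptions of the sketch; following Sec 5.5, the continuous relaxation is solved and then rounded to integers.

```python
import cvxpy as cp
import numpy as np

def lp_ca2c_repositions(v, A, B, D, o_next, cost, d_avail, lam=1.0):
    """Continuous relaxation of Eq (5.14).

    v       : (N,)   state values per grid from the cA2C critic
    A, B    : (N,Nr) indicator matrices for repositions into / out of each grid
    D       : (N,N)  adjacency (grouping) matrix of the hexagonal grid world
    o_next  : (N,)   estimated orders per grid at the next time step
    cost    : (Nr,)  cost of each feasible reposition
    d_avail : (N,)   idle agents per grid (right-hand side of the flow constraint)
    """
    y = cp.Variable(A.shape[1], nonneg=True)            # y(s_t) >= 0
    objective = cp.Maximize(
        v @ A @ y - cost @ y                            # long-term value minus reposition cost
        - lam * cp.sum_squares(D @ (o_next - A @ y))    # grouped demand-supply regularizer
    )
    problem = cp.Problem(objective, [B @ y == d_avail]) # repositions out of a grid = idle agents there
    problem.solve()
    # Round the relaxed solution to integers [49]; a practical implementation would also
    # repair any flow-constraint violations introduced by rounding.
    return np.rint(y.value).astype(int)
```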
5.7.4Thee˙ectivenessofaveragedrewarddesign Inmulti-agentRL,therewarddesignforeachagentisessentialforthesuccessoflearning.In fullycooperativemulti-agentRL,therewardforallagentsisasingleglobalreward[ 27 ],while itsu˙ersfromthecreditassignmentproblemforeachagent'saction.Splittingtherewardto eachagentwillalleviatethisproblem.Inthissubsection,wecomparetwodi˙erentdesigns fortherewardofeachagent:theaveragedrewardofagridasstatedinSec5.3andthetotal rewardofagridthatdoesnotaverageonthenumberofavailablevehiclesatthattime.As shownintable5.5,themethodswithaveragedreward(cA2C,cDQN)largelyoutperform thoseusingtotalreward,sincethisdesignnaturallyencouragesthecoordinationsamong agents.Usingtotalreward,ontheotherhand,islikelytorepositionanexcessivenumberof agentstothelocationwithhighdemand. 158 Table5.5:E˙ectivenessofaveragedrewarddesign. Proposedmethods RawReward NormalizedGMV/ORR NormalizedGMV/ORR cA2C 115 : 27 0 : 70 / 94 : 99% 0 : 48% 105 : 75 1 : 17 / 88 : 09% 0 : 74% cDQN 115 : 19 0 : 46 / 94 : 77% 0 : 32% 108 : 00 0 : 35 / 89 : 53% 0 : 31% (a)Withoutrepositioncost(b)Withrepositioncost Figure5.6:ConvergencecomparisonofcA2Canditsvariationswithoutusingcontext embeddinginbothsettings,withandwithoutrepositioncosts.TheX-axisisthenumberof episodes.TheleftY-axisdenotesthenumberofcon˛ictsandtherightY-axisdenotesthe normalizedGMVinoneepisode. Table5.6:E˙ectivenessofcontextembedding. NormalizedGMV/ORR Repositions Withoutrepositioncost cA2C 115 : 27 0 : 70 / 94 : 99% 0 : 48% 460586 cA2C-v1 114 : 78 0 : 67 / 94 : 52% 0 : 49% 704568 cA2C-v2 111 : 39 1 : 65 / 92 : 12% 1 : 03% 846880 Withrepositioncost cA2C 112 : 70 0 : 64 / 94 : 74% 0 : 57% 408859 cA2C-v3 110 : 43 1 : 16 / 93 : 79% 0 : 75% 593796 5.7.5Ablationsonpolicycontextembedding Inthissubsection,weevaluatethee˙ectivenessofcontextembedding,includingexplicitly coordinatingtheactionsofdi˙erentagentsthroughthecollaborativecontext,andeliminating theinvalidactionswithgeographiccontext.Thefollowingvariationsofproposedmethods areinvestigatedindi˙erentsettings. 159 cA2C-v1:ThisvariationdropscollaborativecontextofcA2Cinthesettingthatdoes notconsiderrepositioncost. cA2C-v2:ThisvariationdropsbothgeographicandcollaborativecontextofcA2Cin thesettingthatdoesnotconsiderrepositioncost. cA2C-v3:ThisvariationdropscollaborativecontextofcA2Cinthesettingthat considersrepositioncost. TheresultsofabovevariationsaresummarizedinTable5.6andFigure5.6.Asseenin the˝rsttworowsofTable5.6andthered/bluecurvesinFigure5.6(a),inthesettingof zerorepositioncost,cA2Cachievesthebestperformancewithmuchlessrepositions( 65 : 37% ) comparingwithcA2C-v1.Furthermore,collaborativecontextembeddingachievessigni˝cant advantageswhentherepositioncostisconsidered,asshowninthelasttworowsinTable5.6 andFigure5.6(b).Itnotonlygreatlyimprovestheperformancebutalsoacceleratesthe convergence.Sincethecollaborativecontextlargelynarrowsdowntheactionspaceand leadstoabetterpolicysolutioninthesenseofbothe˙ectivenessande˚ciency,wecan concludethatcoordinationbasedoncollaborativecontextise˙ective.Also,comparingthe performancesofcA2CandcA2C-v2(red/greencurvesinFigure5.6(a)),apparentlythe policycontextembedding(consideringbothgeographicandcollaborativecontext)isessential toperformance,whichgreatlyreducestheredundantpolicysearch. 
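Returning to the two reward designs compared in Sec 5.7.4 (Table 5.5), the only difference is whether the order value collected at a grid is divided by the number of agents arriving there. The following minimal sketch, with hypothetical per-grid bookkeeping, shows the two variants side by side.

```python
def split_rewards(order_value_per_grid, arrivals_per_grid, averaged=True):
    """Reward of every agent arriving at grid g at time t+1 (Sec 5.3 / Sec 5.7.4).

    averaged=True  : r_t(g) = total order value collected at g / number of arriving agents
    averaged=False : the 'raw reward' baseline, where each arriving agent receives the total value
    """
    rewards = {}
    for g, value in order_value_per_grid.items():
        n = arrivals_per_grid.get(g, 0)
        if n > 0:
            rewards[g] = value / n if averaged else value
    return rewards
```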
5.7.6Ablationstudyongroupingthelocations Thissectionstudiesthee˙ectivenessofourregularizationdesignforLP-cA2C.Onekey di˙erencebetweenourworkandtraditional˛eetmanagementworks[ 64 , 65 ]isthatwedidn't assumethedriversinonelocationcanonlypickuptheordersinthesamelocation.On 160 thecontrary,oneagentcanalsoservetheordersemergedinthenearbylocations,whichis amorerealisticandcomplicatedsetting.Inthiscase,weregularizethenumberofagents repositionedintoasetofnearbygridsclosetothenumberofestimatedordersatnexttime step.ThisgroupingregularizationinEq(5.14)ismoree˚cientthantheregularizationin Eq(5.17)requiringthenumberofagentsrepositionedintoeachgridisclosetothenumber ofestimatedordersatthatgirdsincelotsofrepositioninsidethesamegroupcanbeavoided. AstheresultsshowninTable5.7,usingthegroupregularizationinEq(5.14)reallocatesless agentswhileachievessamebestperformanceastheoneinEq(5.17)(LP-cA2C'). max y ( s t ) ( v ( s t ) T A t c T t ) y ( s t ) ( o t A t y ( s t )) 2 (5.17) Table5.7:E˙ectivenessofgroupregularizationdesign NormalizedGMV ORR Repositions ROI LP-cA2C 113 : 56 0 : 61 95 : 24% 0 : 40% 341774 3 : 9663 LP-cA2C' 113 : 60 0 : 56 95 : 27% 0 : 36% 304752 4 : 4633 5.7.7Qualitativestudy Inthissection,weanalyzewhetherthelearnedvaluefunctioncancapturethedemand-supply relationaheadoftime,andtherationalityofallocations.Toseethis,wepresentacasestudy ontheregionnearbytheairport.ThestatevalueandallocationpolicyisacquiredfromcA2C thatwastrainedfortenepisodes.Wethenrunthewell-trainedcA2Cononetestingepisode, andqualitativelyexamthestatevalueandallocationsundertheunseendynamics.Thesum ofstatevaluesanddemand-supplygap(de˝nedasthenumberofordersminusthenumberof vehicles)ofsevengridsthatcovertheCTUairportisvisualized.AsseeninFigure5.8,the statevaluecancapturethefuturedramaticchangesofdemand-supplygap.Furthermore,the 161 spatialdistributionofstatevaluescanbeseeninFigure5.7.Afterthemidnight,theairport hasalargenumberoforders,andlessavailablevehicles,andthereforethestatevaluesof airportarehigherthanotherlocations.Duringthedaytime,morevehiclesareavailableat theairportsothateachwillreceivelessrewardandthestatevaluesarelowerthanother regions,asshowninFigure5.7(b).InFigure5.7andFigure5.8,wecanconcludethatthe valuefunctioncanestimatetherelativeshiftofdemand-supplygapfrombothspatialand temporalperspectives.ItiscrucialtotheperformanceofcA2Csincethecoordinationis builtuponthestatevalues.Moreover,asillustratedbybluearrowsinFigure5.7,wesee thattheallocationpolicygivesconsecutiveallocationsfromlowervaluegridstohighervalue grids,whichcanthus˝llthefuturedemand-supplygapandincreasetheGMV. (a)At01:50am.(b)At06:40pm. Figure5.7:Illustrationontherepositionsnearbytheairportat1:50amand06:40pm.The darkercolordenotesthehigherstatevalueandthebluearrowsdenotetherepositions. 5.8Conclusion Inthischapter,we˝rstformulatethelarge-scale˛eetmanagementproblemintoafeasible settingfordeepreinforcementlearning.Giventhissetting,weproposecontextualmulti-agent reinforcementlearningframework,inwhichtwocontextualalgorithmscDQNandcA2Care 162 Figure5.8:Thenormalizedstatevalueanddemand-supplygapoveroneday. 
developed, and both achieve large-scale coordination of agents in the fleet management problem. cA2C enjoys both flexibility and efficiency by capitalizing on a centralized value network and decentralized policy execution embedded with contextual information. It is able to adapt to different action spaces in an end-to-end training paradigm. A simulator is developed and calibrated with the real data provided by Didi Chuxing, which served as our training and evaluation platform. Extensive empirical studies under different settings in the simulator have demonstrated the effectiveness of the proposed framework.

Chapter 6
The Provable Advantage of Collaborative Learning

6.1 Introduction

Federated learning (FL) is a machine learning setting where many clients (e.g., mobile devices or organizations) collaboratively train a model under the orchestration of a central server (e.g., a service provider), while keeping the training data decentralized [176, 94]. In recent years, FL has swiftly emerged as an important learning paradigm [129, 109] that enjoys widespread success in applications such as personalized recommendation [36], virtual assistants [106], and keyboard prediction [77], to name a few, for at least two reasons: First, the rapid proliferation of smart devices that are equipped with both computing power and data-capturing capabilities provided the infrastructure core for FL. Second, the rising awareness of privacy and the exponential growth of computational power (blessed by Moore's law) in mobile devices have made it increasingly attractive to push the computation to the edge.

Despite its promise and broad applicability in our current era, the potential value FL delivers is coupled with the unique challenges it brings forth. In particular, when FL learns a single statistical model using data from across all the devices while keeping each individual device's data isolated (and hence protecting privacy) [94], it faces two challenges that are absent in centralized optimization and distributed (stochastic) optimization [231, 178, 99, 113, 205, 213, 206, 92, 219, 218, 98, 104]:

1) Data heterogeneity: data distributions on different devices are different (and data cannot be shared);
2) System heterogeneity: only a subset of devices may access the central server at any given time, both because the communication bandwidth profiles vary across devices and because there is no central server that has control over when a device is active.

To address these challenges, Federated Averaging (FedAvg) [129] was proposed as a particularly effective heuristic, which has enjoyed great empirical success [77]. This success has since motivated a growing line of research efforts into understanding its theoretical convergence guarantees in various settings. For instance, [75] analyzed FedAvg (for non-convex smooth problems satisfying PL conditions) under the assumption that each local device's minimizer is the same as the minimizer of the joint problem (if all devices' data is aggregated together), an overly restrictive assumption. Very recently, [110] furthered the progress and established an O(1/T) convergence rate of FedAvg for strongly convex smooth problems. At the same time, [84] studied Nesterov accelerated FedAvg for non-convex smooth problems and established an O(1/√T) convergence rate to stationary points.

However, despite these very recent fruitful pioneering efforts in understanding the theoretical convergence properties of FedAvg, it remains open how the number of devices that participate in the training affects the convergence speed. In particular, do we get linear speedup of FedAvg? What about when FedAvg is accelerated? These aspects are currently unexplored in FL. We fill in the gaps here by providing affirmative answers.
Our Contributions. We provide a comprehensive convergence analysis of FedAvg and its accelerated variants in the presence of both data and system heterogeneity. Our contributions are threefold.

Table 6.1: Convergence results for FedAvg and accelerated FedAvg. Throughout the paper, $N$ is the total number of local devices, and $K \le N$ is the maximal number of devices that are accessible to the central server. $T$ is the total number of stochastic updates performed by each local device, and $E$ is the number of local steps between two consecutive server communications (hence $T/E$ is the number of communications). † In the linear regression setting, we have $\kappa = \kappa_1$ for FedAvg and $\kappa = \sqrt{\kappa_1 \tilde{\kappa}}$ for accelerated FedAvg, where $\kappa_1$ and $\tilde{\kappa}$ are condition numbers defined in Section 6.5. Since $\tilde{\kappa} \le \kappa_1$, this implies a speedup factor of $\sqrt{\kappa_1/\tilde{\kappa}}$ for accelerated FedAvg.

Participation | Strongly convex | Convex | Overparameterized (general case) | Overparameterized (linear regression)
Full | $O\big(\frac{1}{NT} + \frac{E^2}{T^2}\big)$ | $O\big(\frac{1}{\sqrt{NT}} + \frac{NE^2}{T}\big)$ | $O\big(\exp(-\frac{NT}{E\kappa_1})\big)$ | $O\big(\exp(-\frac{NT}{E\kappa})\big)$†
Partial | $O\big(\frac{E^2}{KT} + \frac{E^2}{T^2}\big)$ | $O\big(\frac{E^2}{\sqrt{KT}} + \frac{KE^2}{T}\big)$ | $O\big(\exp(-\frac{KT}{E\kappa_1})\big)$ | $O\big(\exp(-\frac{KT}{E\kappa})\big)$†

First, we establish an $O(1/(KT))$ convergence rate under FedAvg for strongly convex and smooth problems and an $O(1/\sqrt{KT})$ convergence rate for convex and smooth problems (where $K$ is the number of participating devices), thereby establishing that FedAvg enjoys the desirable linear speedup property in the FL setup. Prior to our work here, the best and most related convergence analysis is given by [110], which established an $O(1/T)$ convergence rate for strongly convex smooth problems under FedAvg. Our rate matches the same (and optimal) dependence on $T$, but also completes the picture by establishing the linear dependence on $K$.

Second, we establish the same convergence rates, $O(1/(KT))$ for strongly convex and smooth problems and $O(1/\sqrt{KT})$ for convex and smooth problems, for Nesterov accelerated FedAvg. We analyze the accelerated version of FedAvg here because empirically it tends to perform better; yet, its theoretical convergence guarantee is unknown. To the best of our knowledge, these are the first results that provide a linear speedup characterization of Nesterov accelerated FedAvg in those two problem classes (that FedAvg and Nesterov accelerated FedAvg share the same convergence rate is to be expected: this is the case even for centralized stochastic optimization).

Third, we study a subclass of strongly convex smooth problems where the objective is over-parameterized and establish a faster $O(\exp(-KT))$ convergence rate for FedAvg. Within this class, we further consider the linear regression problem and establish an even sharper rate under FedAvg. In addition, we propose a new variant of accelerated FedAvg and establish for it a faster convergence rate (compared to the case where no acceleration is used). This stands in contrast to generic (strongly) convex stochastic problems, where theoretically no rate improvement is obtained when one accelerates FedAvg. The detailed convergence results are summarized in Table 6.1.

6.2 Setup

In this chapter, we study the following federated learning problem:
$$\min_{\mathbf{w}} \Big\{ F(\mathbf{w}) \triangleq \sum_{k=1}^{N} p_k F_k(\mathbf{w}) \Big\}, \qquad (6.1)$$
where $N$ is the number of local devices (users/nodes/workers) and $p_k$ is the $k$-th device's weight satisfying $p_k \ge 0$ and $\sum_{k=1}^{N} p_k = 1$. On the $k$-th local device, there are $n_k$ data points: $\mathbf{x}_k^1, \mathbf{x}_k^2, \ldots, \mathbf{x}_k^{n_k}$. The local objective $F_k(\cdot)$ is defined as $F_k(\mathbf{w}) \triangleq \frac{1}{n_k}\sum_{j=1}^{n_k} \ell(\mathbf{w}; \mathbf{x}_k^j)$, where $\ell$ denotes a user-specified loss function. Each device only has access to its local data, which gives rise to its own local objective $F_k$. Note that we do not make any assumptions on the data distributions of the local devices. The local minimum $F_k^* = \min_{\mathbf{w}\in\mathbb{R}^d} F_k(\mathbf{w})$ can be far from the global minimum of Eq (6.1).
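To make the setup in Eq (6.1) concrete, the following minimal sketch evaluates the global objective as the $p_k$-weighted sum of local empirical risks. The squared loss stands in for the user-specified loss $\ell$, and all identifiers are illustrative rather than taken from the chapter.

```python
import numpy as np

# Sketch of Eq (6.1): F(w) = sum_k p_k F_k(w), with F_k the local empirical risk.
# A squared loss is used only as a placeholder for the user-specified loss l.

def local_objective(w, X_k, y_k):
    """F_k(w) = (1/n_k) * sum_j l(w; x_k^j), here with a squared-loss placeholder."""
    residuals = X_k @ w - y_k
    return 0.5 * np.mean(residuals ** 2)

def global_objective(w, devices, weights):
    """F(w) = sum_k p_k F_k(w); `devices` is a list of (X_k, y_k) pairs."""
    assert abs(sum(weights) - 1.0) < 1e-8 and all(p >= 0 for p in weights)
    return sum(p * local_objective(w, X, y)
               for p, (X, y) in zip(weights, devices))
```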
6.2.1 The Federated Averaging (FedAvg) Algorithm

We first introduce the standard Federated Averaging (FedAvg) algorithm [129]. FedAvg updates the model on each device by local Stochastic Gradient Descent (SGD) and sends the latest model to the central server every $E$ steps. The central server conducts a weighted average over the model parameters received from the active devices and broadcasts the latest averaged model to all devices. Formally, the update of FedAvg at iteration $t$ is described as follows:
$$\mathbf{v}_{t+1}^k = \mathbf{w}_t^k - \eta_t\, \mathbf{g}_{t,k}, \qquad
\mathbf{w}_{t+1}^k = \begin{cases} \mathbf{v}_{t+1}^k & \text{if } t+1 \notin \mathcal{I}_E, \\ \sum_{k \in \mathcal{S}_{t+1}} \mathbf{v}_{t+1}^k & \text{if } t+1 \in \mathcal{I}_E, \end{cases}$$
where $\mathbf{w}_t^k$ is the local model parameter maintained on the $k$-th device at the $t$-th iteration, $\mathbf{g}_{t,k} := \nabla F_k(\mathbf{w}_t^k; \xi_t^k)$ is the stochastic gradient based on $\xi_t^k$, the data sampled from the $k$-th device's local data uniformly at random, and $\mathcal{I}_E = \{E, 2E, \ldots\}$ is the set of global communication steps. We use $\mathcal{S}_{t+1}$ to denote the set of active devices at iteration $t+1$.

Since federated learning usually involves an enormous number of local devices, it is often more realistic to assume that only a subset of local devices is active at each communication round (system heterogeneity). In this work, we consider both the case of full participation, where the model is averaged over all devices at the communication round, i.e., $\mathbf{w}_{t+1}^k = \sum_{k=1}^{N} p_k \mathbf{v}_{t+1}^k$, and the case of partial participation, where only a subset $\mathcal{S}_{t+1}$ of $K < N$ devices is active at each communication round.

6.4.1 Strongly Convex Smooth Objectives

The updates of Nesterov accelerated FedAvg take the form
$$\mathbf{v}_{t+1}^k = \mathbf{w}_t^k - \eta_t\, \mathbf{g}_{t,k}, \qquad
\mathbf{w}_{t+1}^k = \begin{cases} \mathbf{v}_{t+1}^k + \beta_t\big(\mathbf{v}_{t+1}^k - \mathbf{v}_t^k\big) & \text{if } t+1 \notin \mathcal{I}_E, \\ \sum_{k \in \mathcal{S}_{t+1}} \big[\mathbf{v}_{t+1}^k + \beta_t\big(\mathbf{v}_{t+1}^k - \mathbf{v}_t^k\big)\big] & \text{if } t+1 \in \mathcal{I}_E, \end{cases}$$
where $\mathbf{g}_{t,k} := \nabla F_k(\mathbf{w}_t^k; \xi_t^k)$ is the stochastic gradient sampled on the $k$-th device at time $t$.

Theorem 9. Let $\bar{\mathbf{v}}_T = \sum_{k=1}^N p_k \mathbf{v}_T^k$ and set suitably decaying learning rates $\eta_t, \beta_t = \Theta(1/t)$. Then under Assumptions 7, 8, 9, 10, with full device participation,
$$\mathbb{E}\, F(\bar{\mathbf{v}}_T) - F^* = O\Big(\frac{\nu_{\max}^2 \sigma^2}{NT} + \frac{E^2 G^2}{T^2}\Big),$$
and with partial device participation with $K$ sampled devices at each communication round,
$$\mathbb{E}\, F(\bar{\mathbf{v}}_T) - F^* = O\Big(\frac{\nu_{\max}^2 \sigma^2}{NT} + \frac{G^2}{KT} + \frac{E^2 G^2}{T^2}\Big),$$
where $\nu_{\max} = \max_k N p_k$ and factors depending only on $L$ and $\mu$ are absorbed into the $O(\cdot)$.

To our knowledge, this is the first convergence result for Nesterov accelerated FedAvg in the strongly convex and smooth setting. The same discussion about the linear speedup of FedAvg applies to the Nesterov accelerated variant. In particular, to achieve the $O(1/(NT))$ linear speedup, $T$ iterations of the algorithm require only $O(\sqrt{NT})$ communication rounds with full participation.

6.4.2 Convex Smooth Objectives

We now show that the optimality gap of Nesterov accelerated FedAvg has an $O(1/\sqrt{NT})$ rate. This result complements the strongly convex case in the previous part, as well as the non-convex smooth setting in [84, 218, 109], where a similar $O(1/\sqrt{NT})$ rate is given in terms of the averaged gradient norm.

Theorem 10. Set learning rates $\eta_t = \beta_t = O(\sqrt{N/T})$. Then under Assumptions 7, 9, 10, Nesterov accelerated FedAvg with full device participation has rate
$$\min_{t \le T} \mathbb{E}\, F(\bar{\mathbf{v}}_t) - F^* = O\Big(\frac{\nu_{\max}^2 \sigma^2}{\sqrt{NT}} + \frac{N E^2 L G^2}{T}\Big),$$
and with partial device participation with $K$ sampled devices at each communication round,
$$\min_{t \le T} \mathbb{E}\, F(\bar{\mathbf{v}}_t) - F^* = O\Big(\frac{\nu_{\max}^2 \sigma^2}{\sqrt{KT}} + \frac{E^2 G^2}{\sqrt{KT}} + \frac{K E^2 L G^2}{T}\Big).$$

It is possible to extend the results in this section to accelerated FedAvg algorithms with other momentum-based updates. However, in the stochastic optimization setting, none of these methods can achieve a better rate than the original FedAvg with SGD updates for general problems [100]. For this reason, we will instead turn to the overparameterized setting [127, 119, 30] in the next section, where we show that FedAvg enjoys geometric convergence and that it is possible to improve its convergence rate with momentum-based updates.
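Before turning to the overparameterized setting, the FedAvg and Nesterov accelerated FedAvg updates above can be summarized in a short simulation-style sketch. It makes several simplifying assumptions: every device takes a local step at every iteration, partial participation samples $K$ of the $N$ devices uniformly without replacement, and the weights $p_k$ are simply renormalized over the sampled set at communication rounds. The identifiers (e.g., stochastic_grad) are placeholders, and this is not the exact sampling and averaging scheme analyzed in Appendix Section B.

```python
import numpy as np

# Simulation-style sketch of FedAvg and its Nesterov accelerated variant:
# each device runs a local (momentum) SGD step every iteration, and the server
# averages the sampled devices' models every E steps. `stochastic_grad(k, w)`
# stands in for g_{t,k} = grad F_k(w; xi^k_t); the re-weighting used at
# communication rounds is one simple illustrative choice.

def fed_train(w0, N, K, E, T, eta, beta, p, stochastic_grad, nesterov=False,
              rng=np.random.default_rng(0)):
    w = [w0.copy() for _ in range(N)]         # w_t^k
    v_prev = [w0.copy() for _ in range(N)]    # v_t^k, used by the momentum term
    for t in range(T):
        v_new = [w[k] - eta(t) * stochastic_grad(k, w[k]) for k in range(N)]
        if nesterov:
            local = [v_new[k] + beta(t) * (v_new[k] - v_prev[k]) for k in range(N)]
        else:
            local = v_new
        if (t + 1) % E == 0:                  # communication round: t+1 in I_E
            S = rng.choice(N, size=K, replace=False)
            avg = sum(p[k] * local[k] for k in S) / sum(p[k] for k in S)
            w = [avg.copy() for _ in range(N)]
        else:
            w = local
        v_prev = v_new
    return sum(p[k] * w[k] for k in range(N))
```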
6.5GeometricConvergenceofFedAvgintheOverparam- eterizedSetting Overparameterizationisaprevalentmachinelearningsettingwherethestatisticalmodelhas muchmoreparametersthanthenumberoftrainingsamplesandtheexistenceofparameter choiceswithzerotraininglossisensured[ 5 , 224 ].Duetothepropertyof automaticvariance reduction inoverparameterization,alineofrecentworksprovedthatSGDandaccelerated methodsachievegeometricconvergence[ 127 , 134 , 138 , 168 , 181 ].Anaturalquestionis whethersucharesultstillholdsinthefederatedlearningsetting.Inthissection,weprovide the˝rstgeometricconvergencerateofFedAvgfortheoverparameterizedstronglyconvex andsmoothproblems,andshowthatitpreserveslinearspeedupatthesametime.Wethen sharpenthisresultinthespecialcaseoflinearregression.Inspiredbyrecentadvancesin acceleratingSGD[ 123 , 87 ],wefurtherproposeanovelmomentum-basedFedAvgalgorithm, 176 whichenjoysanimprovedconvergencerateoverFedAvg.Detailedproofsaredeferredto AppendixSectionB.Inparticular,wedonotneedAssumptions9and10andusemodi˝ed versionsofAssumptions7and8detailedinthissection. 6.5.1GeometricConvergenceofFedAvgintheOverparameterized Setting RecalltheFLproblem min w P N k =1 p k F k ( w ) with F k ( w )= 1 n k P n k j =1 ` ( w ; x j k ) .Inthissection, weconsiderthestandardEmpiricalRiskMinimization(ERM)settingwhere ` isnon-negative, l -smooth,andconvex,andasbefore,each F k ( w ) is L -smoothand -stronglyconvex.Notethat l L .Thissetupincludesmanyimportantproblemsinpractice.Intheoverparameterized setting,thereexists w 2 argmin w P N k =1 p k F k ( w ) suchthat ` ( w ; x j k )=0 forall x j k .We ˝rstshowthatFedAvgachievesgeometricconvergencewithlinearspeedupinthenumberof workers. Theorem11. Intheoverparameterizedsetting,FedAvgwithcommunicationevery E itera- tionsandconstantstepsize = O ( 1 E N l max + L ( N min ) ) hasgeometricconvergence: E F ( w T ) L 2 (1 ) T k w 0 w k 2 = O L exp E NT l max + L ( N min ) k w 0 w k 2 : LinearspeedupandCommunicationComplexity Thelinearspeedupfactorison theorderof O ( N=E ) for N O ( l L ) ,i.e.FedAvgwith N workersandcommunicationevery E iterationsprovidesageometricconvergencespeedupfactorof O ( N=E ) ,for N O ( l L ) . When N isabovethisthreshold,however,thespeedupisalmostconstantinthenumberof workers.Thismatchesthe˝ndingsin[ 127 ].Ourresultalsoillustratesthat E canbetaken O ( T ) forany < 1 toachievegeometricconvergence,achievingbettercommunication 177 e˚ciencythanthestandardFLsetting. 6.5.2OverparameterizedLinearRegressionProblems WenowturntoquadraticproblemsandshowthattheboundinTheorem11canbeimproved to O ( exp ( N E 1 t )) foralargerrangeof N .WethenproposeavariantofFedAvgthathas provableaccelerationoverFedAvgwithSGDupdates.Thelocaldeviceobjectivesarenow givenbythesumofsquares F k ( w )= 1 2 n k P n k j =1 ( w T x j k z j k ) 2 ,andthereexists w suchthat F ( w ) 0 .Twonotionsofconditionnumberareimportantinourresults: 1 whichisbased onlocalHessians,and ~ ,whichistermedthestatisticalconditionnumber[ 119 , 87 ].Fortheir detailedde˝nitions,pleaserefertoAppendixSectionB.Hereweusethefact ~ 1 .Recall max =max k p k N and min =min k p k N . Theorem12. 
Fortheoverparamterizedlinearregressionproblem,FedAvgwithcommuni- cationevery E iterationswithconstantstepsize = O ( 1 E N l max + ( N min ) ) hasgeometric convergence: E F ( w T ) O L exp( NT E ( max 1 +( N min )) ) k w 0 w k 2 : When N = O ( 1 ) ,theconvergencerateis O ((1 N E 1 ) T )= O ( exp ( NT E 1 )) ,which exhibitslinearspeedupinthenumberofworkers,aswellasa 1 1 dependenceonthe conditionnumber 1 .Inspiredby[ 119 ],weproposethe MaSSacceleratedFedAvg algorithm (FedMaSS): w k t +1 = 8 > > > < > > > : u k t k 1 g t;k if t +1 = 2I E ; P k 2S t +1 h u k t k 1 g t;k i if t +1 2I E ; 178 u k t +1 = w k t +1 + k ( w k t +1 w k t )+ k 2 g t;k : When k 2 0 ,thisalgorithmreducestotheNesterovacceleratedFedAvgalgorithm.Inthe nexttheorem,wedemonstratethatFedMaSSimprovestheconvergenceto O ( exp ( NT E p 1 ~ )) . Toourknowledge,thisisthe˝rstaccelerationresultofFedAvgwithmomentumupdates overSGDupdates. Theorem13. Fortheoverparamterizedlinearregressionproblem,FedMaSSwithcom- municationevery E iterationsandconstantstepsizes 1 = O ( 1 E N l max + ( N min ) ) ; 2 = 1 (1 1 ~ ) 1+ 1 p 1 ~ ; = 1 1 p 1 ~ 1+ 1 p 1 ~ hasgeometricconvergence: E F ( w T ) O L exp( NT E ( max p 1 ~ +( N min )) ) k w 0 w k 2 : SpeedupofFedMaSSoverFedAvg Tobetterunderstandthesigni˝canceofthe aboveresult,webrie˛ydiscussrelatedworksonacceleratingSGD.NesterovandHeavyBall updatesareknowntofailtoaccelerateoverSGDinboththeoverparameterizedandconvex settings[ 119 , 100 , 122 , 221 ].Thusingeneralonecannothopetoobtainaccelerationresults fortheFedAvgalgorithmwithNesterovandHeavyBallupdates.Luckily,recentworks inSGD[ 87 , 119 ]introducedanadditionalcompensationtermtotheNesterovupdatesto addressthenon-accelerationissue.Surprisingly,weshowthesameapproachcane˙ectively improvetherateofFedAvg.ComparingtheconvergencerateofFedMass(Theorem13)and FedAvg(Theorem12),when N = O ( p 1 ~ ) ,theconvergencerateis O ((1 N E p 1 ~ ) T )= O ( exp ( NT E p 1 ~ )) asopposedto O ( exp ( NT E 1 )) .Since 1 ~ ,thisimpliesaspeedupfactor of q 1 ~ forFedMaSS.Ontheotherhand,thesamelinearspeedupinthenumberofworkers 179 holdsfor N inasmallerrangeofvalues. 6.6NumericalExperiments Inthissection,weempiricallyexaminethelinearspeedupconvergenceofFedAvgandNesterov acceleratedFedAvginvarioussettings,includingstronglyconvexfunction,convexsmooth function,andoverparameterizedobjectives,asanalyzedinprevioussections. Setup. Followingtheexperimentalsettingin[ 178 ],weconductexperimentsonboth syntheticdatasetsandreal-worlddatasetw8a[ 155 ] ( d =300 ;n =49749) .Weconsiderthe distributedobjectives F ( w )= P N k =1 p k F k ( w ) ,andtheobjectivefunctiononthe k -thlocal deviceincludesthreecases:1) Stronglyconvexobjective :theregularizedbinarylogistic regressionproblem, F k ( w )= 1 N k P N k i =1 log (1+ exp ( y k i w T x k i )+ 2 k w k 2 .Theregularization parameterissetto =1 =n ˇ 2 e 5 .2) Convexsmoothobjective :thebinarylogistic regressionproblemwithoutregularization.3) Overparameterizedsetting :thelinear regressionproblemwithoutaddingnoisetothelabel, F k ( w )= 1 N k P N k i =1 ( w T x k i + b y k i ) 2 . LinearspeedupofFedAvgandNesterovacceleratedFedAvg. 
Toverifythelinear speedupconvergenceasshowninTheorems78910,weevaluatethenumberofiterations neededtoreach -accuracyinthreeobjectives.Weinitializeallrunswith w 0 = 0 d and measurethenumberofiterationstoreachthetargetaccuracy .Foreachcon˝guration ( E;K ) ,weextensivelysearchthelearningratefrom min ( 0 ; nc 1+ t ) ,where 0 2f 0 : 1 ; 0 : 12 ; 1 ; 32 g accordingtodi˙erentproblemsand c cantakethevalues c =2 i 8 i 2 Z .Astheresultsshown inFigure6.1,thenumberofiterationsdecreasesasthenumberof(active)workersincreasing, whichisconsistentforFedAvgandNesterovacceleratedFedAvgacrossallscenarios.For additionalexperimentsontheimpactof E ,detailedexperimentalsetup,andhyperparameter 180 (a)Stronglyconvexobjective(b)Convexsmoothobjective(c)Linearregression Figure6.1:ThelinearspeedupofFedAvginfullparticipation,partialparticipation,andthe linearspeedupofNesterovacceleratedFedAvg,respectively. setting,pleaserefertotheAppendixSectionB. 181 Chapter7 Conclusion Inthisdissertation,weconsideredtheproblemofcollaborativelearning,aimingto˝nd e˙ectivewaystoleverageknowledgefrompeersfore˚cientlearningandbettergeneralization. Tostart,weformallyde˝nedthecollaborativelearningproblemanddiscussedseveral challengesweneedtoresolveunderthissystematicframework.The˝rstchallengewefocus onisthe˛exibilityandinteractivemodel-drivencollaboration.Wepresentalgorithmsthat capturehigh-orderinteractionsandinteractivelyincorporatethehumanexpertknowledge toguidethecollaboration.Thentogeneralizethecollaborationtoheterogeneouslearning agentsandheterogeneoustasks,weproposedata-drivencollaborativealgorithms,wherethe learningagentstransferknowledgefromaselectiveanddynamicdataset.Inadditionto thevariousformofcollaborations,wealsostudythescalabilityofcollaboration,wherewe proposelinearprogrammingbasedcollaborativemulti-agentlearningalgorithminthecontext ofalarge-scale˛eetmanagementapplication.Lastbutnotleast,theempiricalsuccessof collaborativelearningmotivatesustodigintothereasonwhycollaborativelearningcanbe bene˝cial.Weproviderigoroustheoreticalanalysisontheconvergenceimprovementwith respecttotheincreasingnumberoflearningagents. Therearevariousdomainsthatcanbene˝tfromcollaborativelearning,includingbut notlimitedtomulti-taskmeta-learning,transferlearning,federatedlearning,multi-agent reinforcementlearning,etc.Theresearchinthecommunityhasbeendevotedtopushing 182 thefrontierofeachdomainin-depth,whileseldomstudytheirintrinsicconnections,which canbeessentialtowardsbuildingcollaborativemachineintelligence.Thereareemerging researchestorevealtherelationsacrossdi˙erent˝eldssuchastheconnectionbetween federatedlearningandmulti-tasklearning[ 176 ],federatedlearningandmeta-learning[ 57 ], etc,whichcouldserveastheinitialsteptowardsbridgingthegapsacrossmultiple˝elds.One centralmotivationofthisdissertationthatviewsthosedomainsasanintegratedframework isthatthecollaborationshouldbeemphasizedasasigni˝cantlearningobjectiveinstead ofanauxiliaryproduct,alongwithaccomplishingothergoals.Ourvisionisthattowards thebuildingthemachineintelligencethatiscomparabletohumanintelligence,therigorous understandingofcollaborativelearningisinevitable. 
Moreconcretely,theremanyfuturedirectionsunderthegrantpictureofcollaborative learning.Firstandforemost,onefundamentalquestioniswhattypeoftaskscanbelearned collaboratively,Orwhencanweexpectcollaborativelearningbene˝ttheperformance comparingtolearningindividually?Thisiscloselyrelatedtothenegativetransfer[ 207 ]and andtaskinterference[ 220 ]inmulti-tasklearning.InChapter6,wequantifyasimpli˝ed settinginsupervisedlearningwherethegradientvarianceacrossheterogeneoustasksare bounded,whilethisisfarfromdesire.Inpractice,whatisthee˚cientandtestingstandard beforeconsideringcollaboration?Anotherperspectiveofthinkingthisproblemisthatis therealwaysexistsacollaborationstrategythatworksbetterthanindividuallearning? Despitethelong-termgoalofcollaborativelearning,apromisingdirectionwouldbe learningtocollaborate.Thecurrentcollaborationstrategiesaremostlyprede˝ned.We manuallysetuptherulesofthecollaborationaccordingtocertaindomainknowledge. Canweparameterizethecollaborationandlearntheintrinsicprincipleofcollaborative learningthatisgeneralizable?Recently,wenoticeatrendofmeta-learningandAI-generating 183 algorithms[ 146 , 59 ],whilesimilare˙ortshaven'tbeenfoundincollaborativelearning.Human caneasilygeneralizethestructureoforganizations,communicationprotocol,interaction patternstosolvedi˙erenttasks.Todevelopcollaborativelearningsolutionsalongthis directionwouldbeincrediblyvaluableforgeneratinghuman-likeintelligence. 184 APPENDICES 185 AppendixA RankingPolicyGradient DiscussionofExistingE˙ortsonConnectingReinforce- mentLearningtoSupervisedLearning. Therearetwomaindistinctionsbetweensupervisedlearningandreinforcementlearning.In supervisedlearning,thedatadistribution D isstaticandtrainingsamplesareassumedtobe sampled i.i.d. from D .Onthecontrary,thedatadistributionisdynamicinreinforcement learningandthesamplingprocedureisnotindependent.First,sincethedatadistribution inRLisdeterminedbybothenvironmentdynamicsandthelearningpolicy,andthepolicy keepsbeingupdatedduringthelearningprocess.Thisupdatedpolicyresultsindynamicdata distributioninreinforcementlearning.Second,policylearningdependsonpreviouslycollected samples,whichinturndeterminesthesamplingprobabilityofincomingdata.Therefore,the trainingsampleswecollectedarenotindependentlydistributed.Theseintrinsicdi˚culties ofreinforcementlearningdirectlycausethesample-ine˚cientandunstableperformanceof currentalgorithms. Ontheotherhand,moststate-of-the-artreinforcementlearningalgorithmscanbeshown tohaveasupervisedlearningequivalent.Toseethis,recallthatmostreinforcementlearning algorithmseventuallyacquirethepolicyeitherexplicitlyorimplicitly,whichisamapping fromastatetoanactionoraprobabilitydistributionovertheactionspace.Theuseofsuch 186 amappingimpliesthatultimatelythereexistsasupervisedlearningequivalenttotheoriginal reinforcementlearningproblem,ifoptimalpoliciesexist.Theparadoxisthatitisalmost impossibletoconstructthissupervisedlearningequivalentonthe˛y,withoutknowingany optimalpolicy. Althoughthequestionofhowtoconstructandapplypropersupervisionisstillanopen probleminthecommunity,therearemanyexistinge˙ortsprovidinginsightfulapproachesto reducereinforcementlearningintoitssupervisedlearningcounterpartoverthepastseveral decades.Roughly,wecanclassifytheexistinge˙ortsintothefollowingcategories: Expectation-Maximization(EM) :[45,152,102,1],etc. Entropy-RegularizedRL(ERL) :[144,145,74],etc. InteractiveImitationLearning(IIL) :[44,188,163,165,184],etc. 
TheearlyapproachesintheEMtrackappliedJensen'sinequalityandapproximation techniquestotransformthereinforcementlearningobjective.Algorithmsarethenderived fromthetransformedobjective,whichresembletheExpectation-Maximizationprocedure andprovidepolicyimprovementguarantee[ 45 ].Theseapproachestypicallyfocusona simpli˝edRLsetting,suchasassumingthattherewardfunctionisnotassociatedwiththe state[ 45 ],approximatingthegoaltomaximizetheexpectedimmediaterewardandthestate distributionisassumedtobe˝xed[ 153 ].Lateronin[ 102 ],theauthorsextendedtheEM frameworkfromtargetingimmediaterewardintoepisodicreturn.Recently,[ 1 ]usedthe EM-frameworkonarelativeentropyobjective,whichaddsaparameterpriorasregularization. Ithasbeenfoundthattheestimationstepusing Retrace [ 135 ]canbeunstableevenwitha linearfunctionapproximation[ 197 ].Ingeneral,theestimationstepinEM-basedalgorithms involveson-policyevaluation,whichisonechallengesharedamongpolicygradientmethods. 187 Ontheotherhand,o˙-policylearningusuallyleadstoamuchbettersamplee˚ciency,and isonemainmotivationthatwewanttoreformulateRLintoasupervisedlearningtask. Toachieveo˙-policylearning,PGQ[ 144 ]connectedtheentropy-regularizedpolicygradient withQ-learningundertheconstraintofsmallregularization.Inthesimilarframework,Soft Actor-Critic[ 74 ]wasproposedtoenablesample-e˚cientandfasterconvergenceunder theframeworkofentropy-regularizedRL.Itisabletoconvergetotheoptimalpolicythat optimizesthelong-termrewardalongwithpolicyentropy.Itisane˚cientwaytomodel thesuboptimalbehaviorandempiricallyitisabletolearnareasonablepolicy.Although recentlythediscrepancybetweentheentropy-regularizedobjectiveandoriginallong-term rewardhasbeendiscussedin[ 143 , 56 ],theyfocusonlearningstochasticpolicywhilethe proposedframeworkisfeasibleforbothlearningdeterministicoptimalpolicy(Corollary1) andstochasticoptimalpolicy(Corollary2).In[ 145 ],thisworksharessimilaritytoour workintermsofthemethodwecollectingthesamples.Theycollectgoodsamplesbased onthepastexperienceandthenconducttheimitationlearningw.r.tthosegoodsamples. However,wedi˙erentiateathowdowelookattheproblemtheoretically.Thisself-imitation learningprocedurewaseventuallyconnectedtolower-bound-soft-Q-learning,whichbelongs toentropy-regularizedreinforcementlearning.Wecommentthatthereisatrade-o˙between sample-e˚ciencyandmodelingsuboptimalbehaviors.Themorestrictrequirementwehave onthesamplescollectedwehavelesschancetohitthesampleswhilewearemorecloseto imitatingtheoptimalbehavior. Fromthetrackofinteractiveimitationlearning,earlye˙ortssuchas[ 163 , 165 ]pointed outthatthemaindiscrepancybetweenimitationlearningandreinforcementlearningisthe violationof i.i.d. assumption. SMILe [ 163 ]and DAgger [ 165 ]areproposedtoovercomethe distributionmismatch.Theorem2.1in[ 163 ]quanti˝edtheperformancedegradationfromthe 188 TableA.1:AcomparisonofstudiesreducingRLtoSL.The Objective columndenoteswhether thegoalistomaximizelong-termreward.The Cont.Action columndenoteswhetherthe methodisapplicabletobothcontinuousanddiscreteactionspaces.The Optimality denotes whetherthealgorithmscanmodeltheoptimalpolicy. X y denotestheoptimalityachieved byERLisw.r.t.theentropyregularizeobjectiveinsteadoftheoriginalobjectiveonreturn. The O˙-Policy columndenotesifthealgorithmsenableo˙-policylearning.The NoOracle columndenotesifthealgorithmsneedtoaccesstoacertaintypeoforacle(expertpolicyor expertdemonstrations). 
Methods Objective Cont.Action Optimality O˙-Policy NoOracle EM X X X 7 X ERL 7 X X y X X IIL X X X X 7 RPG X 7 X X X expertconsideringthatthelearnedpolicyfailstoimitatetheexpertwithacertainprobability. Thetheoremseemstoresemblethelong-termperformancetheorem(Thm.5)inthischapter. However,itstudiedthescenariothatthelearningpolicyistrainedthroughastatedistribution inducedbytheexpert,insteadofstate-actiondistributionasconsideredinTheorem5.As such,Theorem2.1in[ 163 ]maybemoreapplicabletothesituationwhereaninteractive procedureisneeded,suchasqueryingtheexpertduringthetrainingprocess.Onthecontrary, theproposedworkfocusesondirectlyapplyingsupervisedlearningwithouthavingaccessto theexperttolabelthedata.Theoptimalstate-actionpairsarecollectedduringexploration andconductingsupervisedlearningonthereplaybu˙erwillprovideaperformanceguarantee intermsoflong-termexpectedreward.Concurrently,aresembleofTheorem2.1in[ 163 ]is Theorem1in[ 188 ],wheretheauthorsreducedtheapprenticeshiplearningtoclassi˝cation, undertheassumptionthattheapprenticepolicyisdeterministicandthemisclassi˝cation rateisboundedatalltimesteps.Inthiswork,weshowthatitispossibletocircumvent suchastrongassumptionandreduceRLtoitsSL.Furthermore,ourtheoreticalframework alsoleadstoanalternativeanalysisofsample-complexity.Lateron AggreVaTe [ 164 ] wasproposedtoincorporatetheinformationofactioncoststofacilitateimitationlearning, 189 anditsdi˙erentiableversion AggreVaTeD [ 184 ]wasdevelopedinsuccessionandachieved impressiveempiricalresults.Recently,hingelosswasintroducedtoregular Q -learningasa pre-trainingstepforlearningfromdemonstration[ 81 ],orasasurrogatelossforimitating optimaltrajectories[ 148 ].Inthiswork,weshowthathingelossconstructsanewtypeof policygradientmethodandcanbeusedtolearnoptimalpolicydirectly. Inconclusion,ourmethodapproachestheproblemofreducingRLtoSLfromaunique perspectivethatisdi˙erentfromallpriorwork.WithourreformulationfromRLtoSL,the samplescollectedinthereplaybu˙ersatisfythe i.i.d. assumption,sincethestate-actionpairs arenowsampledfromthedatadistributionofUNOP.Amulti-aspectcomparisonbetween theproposedmethodandrelevantpriorstudiesissummarizedinTableA.1. RankingPolicyGradientTheorem TheRankingPolicyGradientTheorem(Theorem2)formulatestheoptimizationoflong-term rewardusingarankingobjective.Theproofbelowillustratestheformulationprocess. Proof. Thefollowingproofisbasedondirectpolicydi˙erentiation[153,212].Foraconcise presentation,thesubscript t foractionvalue i ; j ,and p ij isomitted. r J ( )= r X ˝ p ( ˝ ) r ( ˝ ) (A.1) = X ˝ p ( ˝ ) r log p ( ˝ ) r ( ˝ ) = X ˝ p ( ˝ ) r log p ( s 0 T t =1 ˇ ( a t j s t ) p ( s t +1 j s t ;a t ) r ( ˝ ) = X ˝ p ( ˝ ) X T t =1 r log ˇ ( a t j s t ) r ( ˝ ) = E ˝ ˘ ˇ X T t =1 r log ˇ ( a t j s t ) r ( ˝ ) 190 = E ˝ ˘ ˇ X T t =1 r log Y m j =1 ;j 6 = i p ij r ( ˝ ) = E ˝ ˘ ˇ " X T t =1 r X m j =1 ;j 6 = i log e ij 1+ e ij ! r ( ˝ ) # = E ˝ ˘ ˇ X T t =1 r X m j =1 ;j 6 = i log 1 1+ e ji r ( ˝ ) (A.2) ˇ E ˝ ˘ ˇ X T t =1 r X m j =1 ;j 6 = i ( i j ) = 2 r ( ˝ ) ; (A.3) wherethetrajectoryisaseriesofstate-actionpairsfrom t =1 ;:::;T , i:e:˝ = s 1 ;a 1 ;s 2 ;a 2 ;:::;s T . FromEq(A.2)toEq(A.3),weusethe˝rst-orderTaylorexpansionof log (1+ e x ) j x =0 = log2+ 1 2 x + O ( x 2 ) tofurthersimplifytherankingpolicygradient. 
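For completeness, the step from Eq (A.2) to Eq (A.3) can be spelled out as follows; the only ingredient is the stated first-order expansion of $\log(1+e^x)$ around $x=0$, and the trajectory-independent constant drops out under the gradient.

```latex
% First-order Taylor expansion used between Eq (A.2) and Eq (A.3):
%   \log(1 + e^{x}) = \log 2 + \tfrac{1}{2}x + O(x^2)  around x = 0.
% Applying it with x = \lambda_{ji} = \lambda_j - \lambda_i:
\sum_{j \ne i} \log \frac{1}{1 + e^{\lambda_{ji}}}
  = -\sum_{j \ne i} \log\!\left(1 + e^{\lambda_{ji}}\right)
  \approx -\sum_{j \ne i} \left(\log 2 + \tfrac{1}{2}\lambda_{ji}\right)
  = \sum_{j \ne i} \tfrac{1}{2}\left(\lambda_i - \lambda_j\right) - (m-1)\log 2 .
% The constant (m-1)\log 2 does not depend on \theta, so it vanishes under
% \nabla_{\theta}, which yields the \sum_{j \ne i}(\lambda_i - \lambda_j)/2
% term appearing in Eq (A.3).
```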
ProbabilityDistributioninRankingPolicyGradient Inthissection,wediscusstheoutputpropertyofthepairwiserankingpolicy.Weshowin Corollary6thatthepairwiserankingpolicygivesavalidprobabilitydistributionwhenthe dimensionoftheactionspace m =2 .Forcaseswhen m> 2 andtherangeof Q -valuesatis˝es Condition2,weshowinCorollary7howtoconstructavalidprobabilitydistribution. Corollary6. ThepairwiserankingpolicyasshowninEq(4.5)constructsaprobability distributionoverthesetofactionswhentheactionspace m isequalto 2 ,givenanyaction values i ;i =1 ; 2 .Forthecaseswith m> 2 ,thisconclusiondoesnotholdingeneral. Itiseasytoverifythat ˇ ( a i j s ) > 0 , P 2 i =1 ˇ ( a i j s )=1 holdsandthesameconclusion cannotbeappliedto m> 2 byconstructingcounterexamples.However,wecanintroduce adummyaction a 0 toformaprobabilitydistributionforRPG.Duringpolicylearning,the algorithmincreasestheprobabilityofbestactionsandtheprobabilityofdummyaction decreases.Ideally,ifRPGconvergestoanoptimaldeterministicpolicy,theprobabilityof 191 takingbestactionisequalto1and ˇ ( a 0 j s )=0 .Similarly,wecanintroduceadummy trajectory ˝ 0 withthetrajectoryreward r ( ˝ 0 )=0 and p ( ˝ 0 )=1 P ˝ p ( ˝ ) .Thetrajectory probabilityformsaprobabilitydistributionsince P ˝ p ( ˝ )+ p ( ˝ 0 )=1 and p ( ˝ ) 0 8 ˝ and p ( ˝ 0 ) 0 .Theproofofavalidtrajectoryprobabilityissimilartothefollowingproof on ˇ ( a j s ) tobeavalidprobabilitydistributionwithadummyaction.Itspracticalin˛uence isnegligiblesinceourgoalistoincreasetheprobabilityof(near)-optimaltrajectories.To presentinaclearway,weavoidmentioningdummytrajectory ˝ 0 inProofAwhileitcanbe seamlesslyincluded. Condition2 (Therangeofaction-value) . Werestricttherangeofaction-valuesinRPGso thatitsatis˝es m ln ( m 1 m 1 1) ,where m = min i;j ji and m isthedimensionofthe actionspace. Thisconditioncanbeeasilysatis˝edsinceinRPGweonlyfocusontherelativerelationship of andwecanconstraintherangeofaction-valuessothat m satis˝esthecondition2. Furthermore,sincewecanseethat m 1 m 1 > 1 isdecreasingw.r.ttoactiondimension m . Thelargertheactiondimension,thelessconstraintwehaveontheactionvalues. Corollary7. GivenCondition2,weintroduceadummyaction a 0 andset ˇ ( a = a 0 j s )= 1 P i ˇ ( a = a i j s ) ,whichconstructsavalidprobabilitydistribution ( ˇ ( a j s )) overtheaction space A[ a 0 . Proof. Sincewehave ˇ ( a = a i j s ) > 0 8 i =1 ;:::;m and P i ˇ ( a = a i j s )+ ˇ ( a = a 0 j s )=1 . Toprovethatthisisavalidprobabilitydistribution,weonlyneedtoshowthat ˇ ( a = a 0 j s ) 0 ; 8 m 2 ,i.e. P i ˇ ( a = a i j s ) 1 ; 8 m 2 .Let m =min i;j ji , X i ˇ ( a = a i j s ) 192 = X i Y m j =1 ;j 6 = i p ij = X i Y m j =1 ;j 6 = i 1 1+ e ji X i Y m j =1 ;j 6 = i 1 1+ e m = m 1 1+ e m m 1 1 (Condition2) : Thisthusconcludestheproof. ConditionofPreservingOptimality ThefollowingconditiondescribeswhattypesofMDPsaredirectlyapplicabletothetrajectory rewardshaping(TRS,Def6): Condition3 (InitialStates) . The(near)-optimaltrajectorieswillcoverallinitialstatesof MDP.i.e. f s ( ˝; 1) j8 ˝ 2Tg = f s ( ˝; 1) j8 ˝ g ,where T = f ˝ j w ( ˝ )=1 g = f ˝ j r ( ˝ ) c g . TheMDPssatisfyingthisconditioncoverawiderangeoftaskssuchasDialogueSys- tem[ 111 ],Go[ 175 ],videogames[ 18 ]andallMDPswithonlyoneinitialstate.Ifwewantto preservetheoptimalitybyTRS,theoptimaltrajectoriesofaMDPneedtocoverallinitial statesorequivalently,allinitialstatesmustleadtoatleastoneoptimaltrajectory.Similarly, thenear-optimalityispreservedforallMDPsthatitsnear-optimaltrajectoriescoverall initialstates. 
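As a concrete illustration of the pairwise ranking policy in Eq (4.5) and the dummy-action construction of Corollary 7, the following sketch computes $\pi(a_i\,|\,s) = \prod_{j \ne i} 1/(1 + e^{\lambda_j - \lambda_i})$ from a vector of relative action values and reports the leftover mass assigned to the dummy action $a_0$. The function name and the example values are illustrative only.

```python
import numpy as np

# Sketch of the pairwise ranking policy (Eq 4.5): pi(a_i|s) = prod_{j != i} p_ij
# with p_ij = 1 / (1 + exp(lambda_j - lambda_i)), plus the dummy action a_0 of
# Corollary 7 that absorbs the leftover mass so the values form a distribution.

def pairwise_ranking_policy(lam):
    """lam: array of relative action values lambda(s, a_i) for one state."""
    m = len(lam)
    probs = np.ones(m)
    for i in range(m):
        for j in range(m):
            if j != i:
                probs[i] *= 1.0 / (1.0 + np.exp(lam[j] - lam[i]))
    dummy = 1.0 - probs.sum()   # pi(a_0 | s); nonnegative under Condition 2
    return probs, dummy

scores = np.array([2.0, 0.5, -1.0])
probs, dummy = pairwise_ranking_policy(scores)
print(probs, dummy)             # action probabilities and the dummy action's mass
```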
Theoretically,itispossibletotransfermoregeneralMDPstosatisfyCondition3and preservetheoptimalitywithpotential-basedrewardshaping[ 139 ].Moreconcretely,consider thedeterministicbinarytreeMDP( M 1 )withthesetofinitialstates S 1 = f s 1 ;s 0 1 g asde˝ned inFigureA.1.Thereareeightpossibletrajectoriesin M 1 .Let r ( ˝ 1 )=10= R max ;r ( ˝ 8 )= 193 FigureA.1:ThebinarytreestructureMDPwithtwoinitialstates. 3 ;r ( ˝ i )=2 ; 8 i =2 ;:::; 7 .Therefore,thisMDPdoesnotsatisfyCondition3.Wecan compensatethetrajectoryrewardofthebesttrajectorystartingfrom s 0 1 tothe R max by shapingtherewardwiththepotential-basedfunction ˚ ( s 0 7 )=7 and ˚ ( s )=0 ; 8 s 6 = s 0 7 .This rewardshapingrequiresmorepriorknowledge,whichmaynotbefeasibleinpractice.Amore realisticmethodistodesignadynamictrajectoryrewardshapingapproach.Inthebeginning, weset c ( s )= min s 2S 1 r ( ˝ j s ( ˝; 1)= s ) ; 8 s 2S 1 .Take M 1 asanexample, c ( s )=3 ; 8 s 2S 1 . Duringtheexplorationstage,wetrackthecurrentbesttrajectoryofeachinitialstateand update c ( s ) withitstrajectoryreward. Nevertheless,iftheCondition3isnotsatis˝ed,weneedmoresophisticatedpriorknowledge otherthanaprede˝nedtrajectoryrewardthreshold c toconstructthereplaybu˙er(training datasetofUNOP).Thepracticalimplementationoftrajectoryrewardshapingandrigorously theoreticalstudyforgeneralMDPsarebeyondthescopeofthiswork. 194 ProofofLong-termPerformanceTheorem5 Lemma3. Givenaspeci˝ctrajectory ˝ ,thelog-likelihoodofstate-actionpairsoverhorizon T isequaltotheweightedsumovertheentirestate-actionspace,i.e.: 1 T X T t =1 log ˇ ( a t j s t )= X s;a p ( s;a j ˝ )log ˇ ( a j s ) ; wherethesumintherighthandsideisthesummationoverallpossiblestate-actionpairs.It isworthnotingthat p ( s;a j ˝ ) isnotrelatedtoanypolicyparameters.Itistheprobabilityofa speci˝cstate-actionpair ( s;a ) inaspeci˝ctrajectory ˝ . Proof. Givenatrajectory ˝ = f ( s ( ˝; 1) ;a ( ˝; 1)) ;:::; ( s ( ˝;T ) ;a ( ˝;T )) g = f ( s 1 ;a 1 ) ;:::; ( s T ;a T ) g , denotetheuniquestate-actionpairsinthistrajectoryas U ( ˝ )= f ( s i ;a i ) g n i =1 ,where n is thenumberofuniquestate-actionpairsin ˝ and n T .Thenumberofoccurrencesof astate-actionpair ( s i ;a i ) inthetrajectory ˝ isdenotedas j ( s i ;a i ) j .Thenwehavethe following: 1 T X T t =1 log ˇ ( a t j s t ) = X n i =1 j ( s i ;a i ) j T log ˇ ( a i j s i ) = X n i =1 p ( s i ;a i j ˝ )log ˇ ( a i j s i ) = X ( s;a ) 2 U ( ˝ ) p ( s;a j ˝ )log ˇ ( a j s ) (A.4) = X ( s;a ) 2 U ( ˝ ) p ( s;a j ˝ )log ˇ ( a j s )+ X ( s;a ) = 2 U ( ˝ ) p ( s;a j ˝ )log ˇ ( a j s ) (A.5) = X ( s;a ) p ( s;a j ˝ )log ˇ ( a j s ) 195 FromEq(A.4)toEq(A.5)weusedthefact: X ( s;a ) 2 U ( ˝ ) p ( s;a j ˝ )= X n i =1 p ( s i ;a i j ˝ )= X n i =1 j ( s i ;a i ) j T =1 ; andthereforewehave p ( s;a j ˝ )=0 ; 8 ( s;a ) = 2 U ( ˝ ) .Thisthuscompletestheproof. NowwearereadytoprovetheTheorem5: Proof. 
Thefollowingproofholdsforanarbitrarysubsetoftrajectories T determinedbythe threshold c inDef8.The ˇ isassociatedwith c andthissubsetoftrajectories.Wepresent thefollowinglowerboundoftheexpectedlong-termperformance: argmax X ˝ p ( ˝ ) w ( ˝ ) * w ( ˝ )=0 ; if ˝= 2T =argmax 1 jTj X ˝ 2T p ( ˝ ) w ( ˝ ) useLemma5 * p ( ˝ ) > 0 and w ( ˝ ) > 0 ; ) X ˝ 2T p ( ˝ ) w ( ˝ ) > 0 =argmax log 1 jTj X ˝ 2T p ( ˝ ) w ( ˝ ) * log X n i =1 x i =n X n i =1 log( x i ) =n; 8 i;x i > 0 ; wehave: log 1 jTj X ˝ 2T p ( ˝ ) w ( ˝ ) X ˝ 2T 1 jTj log p ( ˝ ) w ( ˝ ) ; wherethelowerboundholdswhen p ( ˝ ) w ( ˝ )= 1 jTj ; 8 ˝ 2T .Tothisend,wemaximizethe lowerboundoftheexpectedlong-termperformance: argmax X ˝ 2T 1 jTj log p ( ˝ ) w ( ˝ ) 196 =argmax X ˝ 2T log( p ( s 1 ) Y T t =1 ( ˇ ( a t j s t ) p ( s t +1 j s t ;a t )) w ( ˝ )) =argmax X ˝ 2T log p ( s 1 ) Y T t =1 ˇ ( a t j s t ) Y T t =1 p ( s t +1 j s t ;a t ) w ( ˝ ) =argmax X ˝ 2T log p ( s 1 )+ X T t =1 log p ( s t +1 j s t ;a t )+ X T t =1 log ˇ ( a t j s t )+log w ( ˝ ) (A.6) Theaboveshowsthat w ( ˝ ) canbesetasanarbitrarypositiveconstant =argmax 1 jTj X ˝ 2T X T t =1 log Y ( a t j s t ) =argmax 1 jTj T X ˝ 2T X T t =1 log Y ( a t j s t ) (A.7) =argmax 1 jTj X ˝ 2T 1 T X T t =1 log ˇ ( a t j s t ) (theexistenceofUNOPinAssumption5) =argmax X ˝ 2T p ˇ ( ˝ ) 1 T X T t =1 log ˇ ( a t j s t ) where ˇ isaUNOP(Def8) ) p ˇ ( ˝ )=0 8 ˝= 2T (A.8) Eq(A.8)canbeestablishedbasedon X ˝ 2T p ˇ ( ˝ )= X ˝ 2T 1 = jTj =1 =argmax X ˝ p ˇ ( ˝ ) 1 T X T t =1 log ˇ ( a t j s t ) (Lemma3) =argmax X ˝ p ˇ ( ˝ ) X s;a p ( s;a j ˝ )log ˇ ( a j s ) The2ndsumisoverallpossiblestate-actionpairs.(A.9) ( s;a ) representsaspeci˝cstate-actionpair. =argmax X ˝ X s;a p ˇ ( ˝ ) p ( s;a j ˝ )log ˇ ( a j s ) =argmax X s;a X ˝ p ˇ ( ˝ ) p ( s;a j ˝ )log ˇ ( a j s ) =argmax X s;a p ˇ ( s;a )log ˇ ( a j s ) : (A.10) Inthisproofweuse s t = s ( ˝;t ) and a t = a ( ˝;t ) asabbreviations,whichdenotethe t -thstate 197 andactioninthetrajectory ˝ ,respectively. jTj denotesthenumberoftrajectoriesin T .We alsousethede˝nitionof w ( ˝ ) toonlyfocusonnear-optimaltrajectories.Weset w ( ˝ )=1 forsimplicitybutitwillnota˙ecttheconclusionifsettootherconstants. Optimality: Furthermore,theoptimalsolutionfortheobjectivefunctionEq(A.10)isa uniformly(near)-optimalpolicy ˇ . argmax X s;a p ˇ ( s;a )log ˇ ( a j s ) =argmax X s p ˇ ( s ) X a ˇ ( a j s )log ˇ ( a j s ) =argmax X s p ˇ ( s ) X a ˇ ( a j s )log ˇ ( a j s ) X s p ˇ ( s ) X a log ˇ ( a j s ) =argmax X s p ˇ ( s ) X a ˇ ( a j s )log ˇ ( a j s ) ˇ ( a j s ) =argmax X s p ˇ ( s ) X a KL ( ˇ ( a j s ) jj ˇ ( a j s ))= ˇ Therefore,theoptimalsolutionofEq(A.10)isalsothe(near)-optimalsolutionforthe originalRLproblemsince P ˝ p ˇ ( ˝ ) r ( ˝ )= P ˝ 2T 1 jTj r ( ˝ ) c = R max .Theoptimal solutionisobtainedwhenweset c = R max . Lemma4. Givenanyoptimalpolicy ˇ ofMDPsatisfyingCondition3, 8 ˝= 2T ,wehave p ˇ ( ˝ )=0 ,where T denotesthesetofallpossibleoptimaltrajectoriesinthislemma.If 9 ˝= 2T ,suchthat p ˇ ( ˝ ) > 0 ,then ˇ isnotanoptimalpolicy. Proof. 
Weprovethisbycontradiction.Weassume ˇ isanoptimalpolicy.If 9 ˝ 0 = 2T ,such that1) p ˇ ( ˝ 0 ) 6 =0 ,orequivalently: p ˇ ( ˝ 0 ) > 0 since p ˇ ( ˝ 0 ) 2 [1 ; 0] .and2) ˝ 0 = 2T .Wecan ˝ndabetterpolicy ˇ 0 bysatisfyingthefollowingthreeconditions: p ˇ 0 ( ˝ 0 )=0 and 198 p ˇ 0 ( ˝ 1 )= p ˇ ( ˝ 1 )+ p ˇ ( ˝ 0 ) ;˝ 1 2T and p ˇ 0 ( ˝ )= p ˇ ( ˝ ) ; 8 ˝= 2f ˝ 0 ;˝ 1 g Since p ˇ 0 ( ˝ ) 0 ; 8 ˝ and P ˝ p ˇ 0 ( ˝ )=1 ,therefore p ˇ 0 constructsavalidprobabilitydistribu- tion.Thentheexpectedlong-termperformanceof ˇ 0 isgreaterthanthatof ˇ : X ˝ p ˇ 0 ( ˝ ) w ( ˝ ) X ˝ p ˇ ( ˝ ) w ( ˝ ) = X ˝= 2f ˝ 0 ;˝ 1 g p ˇ 0 ( ˝ ) w ( ˝ )+ p ˇ 0 ( ˝ 1 ) w ( ˝ 1 )+ p ˇ 0 ( ˝ 0 ) w ( ˝ 0 ) X ˝= 2f ˝ 0 ;˝ 1 g p ˇ ( ˝ ) w ( ˝ )+ p ˇ ( ˝ 1 ) w ( ˝ 1 )+ p ˇ ( ˝ 0 ) w ( ˝ 0 ) = p ˇ 0 ( ˝ 1 ) w ( ˝ 1 )+ p ˇ 0 ( ˝ 0 ) w ( ˝ 0 ) ( p ˇ ( ˝ 1 ) w ( ˝ 1 )+ p ˇ ( ˝ 0 ) w ( ˝ 0 )) * ˝ 0 = 2T ; ) w ( ˝ 0 )=0 and ˝ 1 2T ; ) w ( ˝ )=1 = p ˇ 0 ( ˝ 1 ) p ˇ ( ˝ 1 ) = p ˇ ( ˝ 1 )+ p ˇ ( ˝ 0 ) p ˇ ( ˝ 1 )= p ˇ ( ˝ 0 ) > 0 : Essentially,wecan˝ndapolicy ˇ 0 thathashigherprobabilityontheoptimaltrajectory ˝ 1 andzeroprobabilityon ˝ 0 .Thisindicatesthatitisabetterpolicythan ˇ .Therefore, ˇ is notanoptimalpolicyanditcontradictsourassumption,whichprovesthatsuch ˝ 0 doesnot exist.Therefore, 8 ˝= 2T ,wehave p ˇ ( ˝ )=0 . Lemma5 (PolicyPerformance) . IfthepolicytakestheformasinEq(4.7)orEq(4.5),then wehave 8 ˝ , p ( ˝ ) > 0 .Thismeansforallpossibletrajectoriesallowedbytheenvironment, thepolicytakestheformofeitherrankingpolicyorsoftmaxwillgeneratethistrajectorywith probability p ( ˝ ) > 0 .Notethatbecauseofthisproperty, ˇ isnotanoptimalpolicyaccording toLemma4,thoughitcanbearbitrarilyclosetoanoptimalpolicy. 199 Proof. Thetrajectoryprobabilityisde˝nedas: p ( ˝ )= p ( s 1 T t =1 ( ˇ ( a t j s t ) p ( s t +1 j s t ;a t )) Thenwehave: ThepolicytakestheformasinEq(4.7)orEq(4.5) ) ˇ ( a t j s t ) > 0 : p ( s 1 ) > 0 ;p ( s t +1 j s t ;a t ) > 0 : ) p ( ˝ )=0 : p ( s t +1 j s t ;a t )=0 or p ( s 1 )=0 ; ) p ( ˝ )=0 ; whichmeans ˝ isnotapossibletrajectory. Insummary,forallpossibletrajectories, p ( ˝ ) > 0 : Thisthuscompletestheproof. ProofofCorollary1 Corollary8 (Rankingperformancepolicygradient) . Thelowerboundofexpectedlong-term performancebyrankingpolicycanbeapproximatelyoptimizedbythefollowingloss: min X s;a i p ˇ ( s;a i ) L ( s i ;a i ) (A.11) wherethepair-wiseloss L ( s i ;a i ) isde˝nedas: L ( s;a i )= X j A j j =1 ;j 6 = i max(0 ; 1+ ( s;a j ) ( s;a i )) Proof. InRPG,thepolicy ˇ ( a j s ) isde˝nedasinEq(4.5).Wethenreplacetheaction 200 probabilitydistributioninEq(4.13)withtheRPGpolicy. * ˇ ( a = a i j s )= m j =1 ;j 6 = i p ij (A.12) BecauseRPGis˝ttingadeterministicoptimalpolicy, wedenotetheoptimalactiongivensate s as a i ; thenwehave max X s;a i p ˇ ( s;a i )log ˇ ( a i j s ) (A.13) =max X s;a i p ˇ ( s;a i )log m j 6 = i;j =1 p ij ) (A.14) =max X s;a i p ˇ ( s;a i )log m j 6 = i;j =1 1 1+ e ji (A.15) =min X s;a i p ˇ ( s;a i ) m X j 6 = i;j =1 log(1+ e ji ) ˝rstorderTaylorexpansion(A.16) ˇ min X s;a i p ˇ ( s;a i ) m X j 6 = i;j =1 ji s.t. j ij j = c< 1 ; 8 i;j;s (A.17) =min X s;a i p ˇ ( s;a i ) m X j 6 = i;j =1 ( j i ) s.t. 
j i j j = c< 1 ; 8 i;j;s (A.18) ) min X s;a i p ˇ ( s;a i ) L ( s i ;a i ) (A.19) wherethepairwiseloss L ( s;a i ) isde˝nedas: L ( s;a i )= j A j X j =1 ;j 6 = i max(0 ; margin + ( s;a j ) ( s;a i )) ; (A.20) wherethemargininEq(A.19)isasmallpositiveconstant.(A.21) FromEq(A.18)toEq(A.19),weconsiderlearningadeterministicoptimalpolicy a i = ˇ ( s ) , whereweuseindex i todenotetheoptimalactionateachstate.Theoptimal -values minimizingEq(A.18)(denotedby 1 )needtosatisfy 1 i = 1 j + c; 8 j 6 = i;s .Theoptiaml - 201 valuesminimizingEq(A.19)(denotedby 2 )needtosatisfy 2 i = max j 6 = i 2 j + margin ; 8 j 6 = i;s .Inbothcases,theoptimalpoliciesfromsolvingEq(A.18)andEq(A.18)arethe same: ˇ ( s )= argmax k 1 k = argmax k 2 k = a i .Therefore,weuseEq(A.19)asasurrogate optimizationproblemofEq(A.18). Policygradientvariancereduction Corollary9 (Variancereduction) . Givenastationarypolicy,theupperboundofthevariance ofeachdimensionofpolicygradientis O ( T 2 C 2 R 2 max ) .Theupperboundofgradientvariance ofmaximizingthelowerboundoflong-termperformanceEq(4.13)is O ( C 2 ) ,where C isthe maximumnormofloggradientbasedonAssumption6.Thesupervisedlearninghasreduced theupperboundofgradientvariancebyanorderof O ( T 2 R 2 max ) ascomparedtotheregular policygradient,considering R max 1 ;T 1 ,whichisaverycommonsituationinpractice. Proof. Theregularpolicygradientofpolicy ˇ isgivenas[212]: X ˝ p ( ˝ )[ T X t =1 r log( ˇ ( a ( ˝;t ) j s ( ˝;t ))) r ( ˝ )] Theregularpolicygradientvarianceofthe i -thdimensionisdenotedasfollows: Var 0 @ T X t =1 r log( ˇ ( a ( ˝;t ) j s ( ˝;t )) i ) r ( ˝ ) 1 A Wedenote x i ( ˝ )= P T t =1 r log ( ˇ ( a ( ˝;t ) j s ( ˝;t )) i ) r ( ˝ ) forconvenience.Therefore, x i isa 202 randomvariable.Thenapply var ( x )= E p ( ˝ ) [ x 2 ] E p ( ˝ ) [ x ] 2 ,wehave: Var X T t =1 r log( ˇ ( a ( ˝;t ) j s ( ˝;t )) i ) r ( ˝ ) = Var ( x i ( ˝ )) = X ˝ p ( ˝ ) x i ( ˝ ) 2 [ X ˝ p ( ˝ ) x i ( ˝ )] 2 X ˝ p ( ˝ ) x i ( ˝ ) 2 = X ˝ p ( ˝ )[ X T t =1 r log( ˇ ( a ( ˝;t ) j s ( ˝;t )) i ) r ( ˝ )] 2 X ˝ p ( ˝ )[ X T t =1 r log( ˇ ( a ( ˝;t ) j s ( ˝;t )) i )] 2 R 2 max = R 2 max X ˝ p ( ˝ )[ X T t =1 X T k =1 r log( ˇ ( a ( ˝;t ) j s ( ˝;t )) i ) r log( ˇ ( a ( ˝;k ) j s ( ˝;k ) i )] (Assumption6) R 2 max X ˝ p ( ˝ )[ T X t =1 T X k =1 C 2 ] = R 2 max X ˝ p ( ˝ ) T 2 C 2 = T 2 C 2 R 2 max Thepolicygradientoflong-termperformance(Def7): P s;a p ˇ ( s;a ) r log ˇ ( a j s ) .The policygradientvarianceofthe i -thdimensionisdenotedas: var ( r log ˇ ( a j s ) i ) .Thenthe upperboundisgivenby var ( r log ˇ ( a j s ) i ) = X s;a p ˇ ( s;a )[ r log ˇ ( a j s ) i ] 2 [ X s;a p ˇ ( s;a ) r log ˇ ( a j s ) i ] 2 X s;a p ˇ ( s;a )[ r log ˇ ( a j s ) i ] 2 (Assumption6) X s;a p ˇ ( s;a ) C 2 203 = C 2 Thisthuscompletestheproof. DiscussionsofAssumption5 Inthissection,weshowthatUNOPexistsinarangeofMDPs.Noticethatthelemma6 showsthesu˚cientconditionsofsatisfyingAsumption5ratherthannecessaryconditions. Lemma6. ForMDPsde˝nedinSection4.2.3satisfyingthefollowingconditions: Eachinitialstateleadstooneoptimaltrajectory.Thisalsoindicates jS 1 j = jTj ,where T denotesthesetofoptimaltrajectoriesinthislemma, S 1 denotesthesetofinitial states. Deterministictransitions,i.e., p ( s 0 j s;a ) 2f 0 ; 1 g . Uniforminitialstatedistribution,i.e., p ( s 1 )= 1 jTj ; 8 s 1 2S 1 . Thenwehave: 9 ˇ ,wheres.t. p ˇ ( ˝ )= 1 jTj ; 8 ˝ 2T .Itmeansthatadeterministicuniformly optimalpolicyalwaysexistsforthisMDP. Proof. Wecanprovethisbyconstruction.Thefollowinganalysisappliesforany ˝ 2T . 
p ˇ ( ˝ )= 1 jTj () log p ˇ ( ˝ )= log jTj () log p ( s 1 )+ X T t =1 log p ( s t +1 j s t ;a t )+ X T t =1 log ˇ ( a t j s t )= log jTj () X T t =1 log ˇ ( a t j s t )= log p ( s 1 ) X T t =1 log p ( s t +1 j s t ;a t ) log jTj 204 whereweuse a t ;s t asabbreviationsof a ( ˝;t ) ;s ( ˝;t ) : Wedenote D ( ˝ )= log p ( s 1 ) X T t =1 log p ( s t +1 j s t ;a t ) > 0 () X T t =1 log ˇ ( a t j s t )= D ( ˝ ) log jTj ) wecanobtainauniformlyoptimalpolicybysolvingthenonlinearprogramming: X T t =1 log ˇ ( a ( ˝;t ) j s ( ˝;t ))= D ( ˝ ) log jTj8 ˝ 2T (A.22) log ˇ ( a ( ˝;t ) j s ( ˝;t ))=0 ; 8 ˝ 2T ;t =1 ;:::;T (A.23) X m i =1 ˇ ( a i j s ( ˝;t ))=1 ; 8 ˝ 2T ;t =1 ;:::;T (A.24) Usethecondition p ( s 1 )= 1 jTj ,thenwehave: * X T t =1 log ˇ ( a ( ˝;t ) j s ( ˝;t )) (A.25) = X T t =1 log1=0( LHSof Eq ( A: 22)) (A.26) * log p ( s 1 ) X T t =1 log p ( s t +1 j s t ;a t ) log jTj =log jTj 0 log jTj =0 (A.27) ( RHSof Eq ( A: 22)) (A.28) ) D ( ˝ ) log jTj = X T t =1 log ˇ ( a ( ˝;t ) j s ( ˝;t )) ; 8 ˝ 2T : Alsothedeterministicoptimalpolicysatis˝estheconditionsinEq(A.23A.24).Therefore, 205 (a)(b) FigureA.2:Thedirectedgraphthatdescribestheconditionalindependenceofpairwise relationshipofactions,where Q 1 denotesthereturnoftakingaction a 1 atstate s ,following policy ˇ in M ,i.e., Q ˇ M ( s;a 1 ) . I 1 ; 2 isarandomvariablethatdenotesthepairwiserelationship of Q 1 and Q 2 ,i.e., I 1 ; 2 =1 ; i : i : f :Q 1 Q 2 ; o : w :I 1 ; 2 =0 . thedeterministicoptimalpolicyisauniformlyoptimalpolicy.Thislemmadescribesone typeofMDPinwhichUOPexists.Fromtheabovereasoning,wecanseethataslong asthesystemofnon-linearequationsEq(A.22A.23A.24)hasasolution,theuniformly (near)-optimalpolicyexists. Lemma7 (Hitoptimaltrajectory) . Theprobabilitythataspeci˝coptimaltrajectorywasnot encounteredgivenanarbitrarysoftmaxpolicy ˇ isexponentiallydecreasingwithrespecttothe numberoftrainingepisodes.NomatteraMDPhasdeterministicorprobabilisticdynamics. Proof. Givenaspeci˝coptimaltrajectory ˝ = f s ( ˝;t ) ;a ( ˝;t ) g T t =1 ,andanarbitrarystationary policy ˇ ,theprobabilitythathasneverencounteredatthe n -thepisodeis [1 p ( ˝ )] n = ˘ n , basedonlemma5,wehave p ( ˝ ) > 0 ,thereforewehave ˘ 2 [0 ; 1) . DiscussionsofAssumption4 Intuitively,givenastateandastationarypolicy ˇ ,therelativerelationshipsamongactions canbeindependent,consideringa˝xedMDP M .Therelativerelationshipamongactions istherelativerelationshipofactions'return.Startingfromthesamestate,followinga 206 stationarypolicy,theactions'returnisdeterminedbyMDPpropertiessuchasenvironment dynamics,rewardfunction,etc. Moreconcretely,weconsideraMDPwiththreeactions ( a 1 ;a 2 ;a 3 ) foreachstate.The actionvalue Q ˇ M satis˝estheBellmanequationinEq(A.29).Noticethatinthissubsection, weuse Q ˇ M todenotetheactionvaluethatestimatestheabsolutevalueofreturnin M . Q ˇ M ( s;a i )= r ( s;a i )+max a E s 0 ˘ p ( s;a ) Q ˇ M ( s 0 ;a ) ; 8 i =1 ; 2 ; 3 : (A.29) AswecanseefromEq(A.29), Q ˇ M ( s;a i ) ;i =1 ; 2 ; 3 isonlyrelatedto s;ˇ ,andenvironment dynamics P .Itmeansif ˇ , M and s aregiven,theactionvaluesofthreeactionsare determined.Therefore,wecanuseadirectedgraph[ 21 ]tomodeltherelationshipofaction values,asshowninFigureA.2(a).Similarly,ifweonlyconsidertherankingofactions,this rankingisconsistentwiththerelationshipofactions'return,whichisalsodeterminedby s;ˇ , and P .Therefore,thepairwiserelationshipamongactionscanbedescribedasthedirected graphinFigureA.2(b),whichestablishestheconditionalindependenceofactions'pairwise relationship.Basedontheabovereasoning,weconcludethatAssumption4isrealistic. 
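Before turning to the proof of Theorem 6, the pairwise hinge surrogate $L(s, a_i)$ of Eq (A.20), which RPG optimizes in place of the log-likelihood, can be sketched as follows. The relative action values $\lambda(s, \cdot)$ for one state are assumed available as an array, the margin of 1 is taken from Table A.2, and the identifiers are illustrative.

```python
import numpy as np

# Sketch of the pairwise hinge surrogate in Eq (A.20):
# L(s, a_i) = sum_{j != i} max(0, margin + lambda(s, a_j) - lambda(s, a_i)),
# where a_i is the (near-)optimal action recorded for state s.

def pairwise_hinge_loss(lam, opt_action, margin=1.0):
    """lam: array of lambda(s, a_j) for one state; opt_action: index of a_i."""
    gaps = margin + lam - lam[opt_action]   # margin + lambda_j - lambda_i
    gaps[opt_action] = 0.0                  # the j = i term is excluded
    return np.maximum(0.0, gaps).sum()

lam = np.array([0.2, 1.3, -0.4])
# Loss is 0 because action 1 beats every other action by at least the margin.
print(pairwise_hinge_loss(lam, opt_action=1))
```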
TheproofofTheorem6 Proof. TheproofmainlyestablishesontheproofforlongtermperformanceTheorem5and connectsthegeneralizationboundinPACframeworktothelowerboundofreturn.We constructahybridpolicybasedonpairwiserankingpolicyEq(4.5)asfollows: 207 If ˇ ( s )=argmax a ( s;a ) , p h ( a j s )= 8 > < > : 1 ;ˇ ( s )=argmax a ( s;a ) 0 ;o:w: (A.30) If ˇ ( s ) 6 =argmax a ( s;a ) , p h ( a j s )= ˇ ( a j s )= m j 6 = i;j =1 p ij (A.31) InplainEnglish,thehybridpolicycanbedescribedasfollows:forastate s andthe policyparameter ,iftheactionchosenbyUOPhasthehighestrelativeactionvalue(i.e., ˇ ( s )= argmax a ( s;a ) ),weusethedeterministicpolicyasde˝nedinEq(A.30)forthis state.Otherwise,weusethestochasticpolicyasde˝nedinEq(A.31).Notethatthe constructionofthispolicyassumewehaveaccesstotheUOP ˇ ,whichisfeasibleinour setting.WithTRS6,wecan˝lteralluniqueoptimaltrajectoriesfollowingUOP.Therefore, whenUOPisdeterministic,foreachstate,wehavetheactionthatischosenbytheUOP. Westudythegeneralizationperformanceandsamplecomplexityofthepairwiseranking policyasfollows: log( 1 jTj X ˝ 2T p ( ˝ ) w ( ˝ )) 1 jTj X ˝ 2T log p ( ˝ ) w ( ˝ ) , X ˝ 2T p ( ˝ ) w ( ˝ ) jTj exp( 1 jTj X ˝ 2T log p ( ˝ ) w ( ˝ )) denote F = X ˝ p ( ˝ ) w ( ˝ )= X ˝ 2T p ( ˝ ) w ( ˝ ) (A.32) 208 , F jTj exp( 1 jTj X ˝ 2T log p ( ˝ ) w ( ˝ )) = jTj exp 1 jTj X ˝ 2T log p ( s 1 )+ X T t =1 log p ( s t +1 j s t ;a t )+ X T t =1 log p h ( a t j s t )+log w ( ˝ ) (A.33) * w ( ˝ )=1 ; 8 ˝ 2T ;s t = s ( ˝;t ) ;a t = a ( ˝;t ) ;t =1 ;:::;T = jTj exp 1 jTj X ˝ 2T log p ( s 1 )+ X T t =1 log p ( s t +1 j s t ;a t )+ T X t =1 log p h ( a t j s t ) !! = jTj exp 1 jTj X ˝ 2T (log p ( s 1 )+ X T t =1 log p ( s t +1 j s t ;a t )) ! exp 1 jTj X ˝ 2T ( X T t =1 log p h ( a t j s t )) (A.34) Denotethedynamicsofatrajectoryas p d ( ˝ )= p ( s 1 T t =1 p ( s t +1 j s t ;a t ) Noticethat p d ( ˝ ) isenvironmentdynamics,whichis˝xedgivenaspeci˝cMDP. , F jTj exp 1 jTj X ˝ 2T log p d ( ˝ ) exp 1 jTj X ˝ 2T ( X T t =1 log p h ( a t j s t )) = jTj ˝ 2T p d ( ˝ )) 1 jTj exp 1 jTj T X ˝ 2T ( X T t =1 log p h ( a t j s t )) T UsethesamereasoningfromEq(A.7)toEq(A.10). 
= jTj ˝ 2T p d ( ˝ )) 1 jTj exp T X s;a p ˇ ( s;a )log p h ( a j s ) = jTj ˝ 2T p d ( ˝ )) 1 jTj exp( TL ) Wedenote L = X s;a p ˇ ( s;a )log p h ( a j s ) : L istheonlytermthatisrelatedtothepolicyparameter 209 Given h = ˇ ; misclassi˝edstateactionpairsset U w = f s;a j h ( s ) 6 = a; ( s;a ) ˘ p ( s;a ) g L = X s;a 2 U w p ˇ ( s;a )log p h ( a j s )+ X s;a= 2 U w p ˇ ( s;a )log p h ( a j s ) Byde˝nitionof U w ; 8 s;a= 2 U w ;h ( s )= a; ) p h ( a j s )=1 : (A.35) = X s;a 2 U w p ˇ ( s;a )log ˇ ( a j s ) SinceweuseRPGasourpolicyparameterization,thenwith Eq (4 : 5) = X s;a 2 U w p ˇ ( s;a )log m j 6 = i;j =1 p ij ) = X s;a i 2 U w p ˇ ( s;a i ) X m j 6 = i;j =1 log 1 1+ e Q ji ByCondition1,whichcanbeeasilysatis˝edinpractice.Thenwehave: Q ij < 2 c q 1 ApplyLemma1,themisclassi˝edrateisatmost : X s;a i 2 U w p ˇ ( s;a i )( m 1)log( 1 1+ e ) X s;a i 2 U w p ˇ ( s;a i )( m 1)log(1+ e ) ( m 1)log(1+ e ) = (1 m )log(1+ e ) F jTj ˝ 2T p d ( ˝ )) 1 jTj exp( TL ) jTj ˝ 2T p d ( ˝ )) 1 jTj exp( (1 m ) T log(1+ e )) jTj ˝ 2T p d ( ˝ )) 1 jTj (1+ e ) (1 m ) T = D (1+ e ) (1 m ) T 210 Fromgeneralizationperformancetosamplecomplexity: Set1 = D (1+ e ) (1 m ) T ; where D = jTj ˝ 2T p d ( ˝ )) 1 jTj = log 1+ e D 1 ( m 1) T Withrealizableassumption11, min =0 = min 2 = 2 n 1 2 2 log 2 jHj = 2( m 1) 2 T 2 log 1+ e D 1 2 log 2 jHj Bridgethelong-termrewardandlong-termperformance: X ˝ p ( ˝ ) r ( ˝ ) InSection4.2.7, r ( ˝ ) 2 [0 ; 1] ; 8 ˝: X ˝ p ( ˝ ) w ( ˝ ) SincewefocusonUOPDef8,c=1inTSRDef6 = X ˝ 2T p ( ˝ ) w ( ˝ ) 1 Thisthusconcludestheproof. Assumption11 (Realizable) . Weassumethereexistsahypothesis h 2H thatobtainszero expectedrisk,i.e. 9 h 2H) P s;a p ˇ ( s;a ) 1 f h ( s ) 6 = a g =0 . TheAssumption11isnotnecessaryfortheproofofTheorem6.Fortheproofof Corollary4,weintroducethisassumptiontoachievemoreconciseconclusion.In˝nite 211 MDP,therealizableassumptioncanbesatis˝edifthepolicyisparameterizedbymulti-layer neuralnetwork,duetoitsperfect˝nitesampleexpressivity[ 224 ].Itisalsoadvocatedinour empiricalstudiessincetheneuralnetworkachievedoptimalperformancein Pong . TheproofofLemma2 Proof. Let e = i denotestheevent n = i j k ,i.e.thenumberofdi˙erentoptimaltrajectoriesin ˝rst k episodesisequalto i .Similarly, e i denotestheevent n i j k .Sincetheevents e = i and e = j aremutuallyexclusivewhen i 6 = j .Therefore, p ( e i )= p ( e = i ;e = i +1 ;:::;e = jTj )= P jTj j = i p ( e = j ) .Furthermore,weknowthat P T i =0 p ( e = i )=1 since f e = i ;i =0 ;:::; jTjg constructsanuniversalset.Forexample, p ( e 1 )= p ˇ r ; M ( n 1 j k )=1 p ˇ r ; M ( n =0 j k )= 1 ( N j N ) k . p ˇ r ; M ( n i j k )=1 X i 1 i 0 =0 p ˇ; M ( n = i 0 j k ) =1 X i 1 i 0 =0 C i 0 jTj P i 0 j =0 ( 1) j C j i 0 ( N jTj + i 0 j ) k N k (A.36) InEq(A.36),weusetheinclusion-exclusionprinciple[93]tohavethefollowingequality. p ˇ r ; M ( n = i 0 j k )= C i 0 jTj p ( e ˝ 1 ;˝ 2 ;:::;˝ i 0 ) = C i 0 jTj P i 0 j =0 ( 1) j C j i 0 ( N jTj + i 0 j ) k N k e ˝ 1 ;˝ 2 ;:::;˝ i 0 denotestheevent:in˝rst k episodes,acertainsetof i 0 optimaltrajectories ˝ 1 ;˝ 2 ;:::;˝ i 0 ;i 0 jTj issampled. 212 TableA.2:HyperparametersofRPGnetwork HyperparametersValue ArchitectureConv(32-8 8-4) -Conv(64-4 4-2) -Conv(64-3 3-2) -FC(512) Learningrate0.0000625 Batchsize32 Replaybu˙ersize1000000 Updateperiod4 MargininEq(4.14)1 TheproofofCorollary5 Proof. 
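The architecture listed in Table A.2 can be written out as the following PyTorch-style sketch. The input of four stacked 84x84 frames, the ReLU activations, and the final linear head producing one relative action value $\lambda(s,a)$ per action are assumptions carried over from the DQN setup [132] rather than details stated in the table.

```python
import torch.nn as nn

# Sketch of the RPG network from Table A.2: three conv layers, a 512-unit fully
# connected layer, and a linear head outputting one relative action value per
# action. Filter counts, kernel sizes, and strides follow Table A.2.

class RPGNetwork(nn.Module):
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.LazyLinear(512), nn.ReLU(),   # FC(512) from Table A.2
            nn.Linear(512, num_actions),     # relative action values lambda(s, a)
        )

    def forward(self, x):
        return self.head(self.features(x))
```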
TheCorollary5isadirectapplicationofLemma2andTheorem6.First,wereformat Theorem6asfollows: p ( A j B ) 1 whereevent A denotes P ˝ p ( ˝ ) r ( ˝ ) D (1+ e ) (1 m ) T ,event B denotesthenumberof state-actionpairs n 0 fromUOP(Def8)satisfying n 0 n ,given˝xed .WithLemma2,we have p ( B ) p ˇ r ; M ( n 0 n j k ) .Then, P ( A )= P ( A j B ) P ( B ) (1 ) p ˇ r ; M ( n 0 n j k ) . Set (1 ) p ˇ r ; M ( n 0 n j k )=1 0 wehave P ( A ) 1 0 =1 1 0 p ˇ r ; M ( n 0 n j k ) =2 r 1 2 n log 2 jHj =2 s 1 2 n log 2 jHj p ˇ r ; M ( n 0 n j k ) p ˇ r ; M ( n 0 n j k ) 1+ 0 213 Hyperparameters WepresentthetrainingdetailsofrankingpolicygradientinTableA.2.Thenetwork architectureisthesameastheconvolutionneuralnetworkusedinDQN[ 132 ].Weupdate theRPGnetworkeveryfourtimestepswithaminibatchofsize32.Thereplayratioisequal toeightforallbaselinesandRPG(exceptforACERweusethedefaultsettinginopenai baselines[48]forbetterperformance). 214 AppendixB FederatedLearning AdditionalNotations Inthissection,weintroduceadditionalnotationsthatareusedthroughouttheproof.Following commonpractice[ 178 , 110 ],wede˝netwovirtualsequences v t and w t .Forfulldevice participationand t= 2I E , v t = w t = P N k =1 p k v k t .Forpartialparticipation, t 2I E , w t 6 = v t since v t = P N k =1 p k v k t while w t = P k 2S t w k t .However,wecansetunbiasedsampling strategysuchthat E S t w t = v t . v t +1 isone-stepSGDfrom w t . v t +1 = w t t g t ; (B.1) where g t = P N k =1 p k g t;k isone-stepstochasticgradient,averagedoveralldevices. g t;k = r F k w k t ;˘ k t ; Similarly,wedenotetheexpectedone-stepgradient g t = E ˘ t [ g t ]= P N k =1 p k E ˘ k t g t;k ,where E ˘ k t g t;k = r F k w k t ; (B.2) 215 and ˘ t = f ˘ k t g N k =1 denotesrandomsamplesatalldevicesattimestep t .Sinceinthiswork, wealsoconsiderthecaseofpartialparticipation.Thesamplingstrategytoapproximate thesystemheterogeneitycanalsoa˙ecttheconvergence.Herewefollowthepriorarts[ 75 ] consideringtwotypesofsamplingschemes.ThesamplingschemeIestablishes S t +1 byi.i.d. samplingthedeviceswithreplacement,inthiscasetheupperboundofexpectedsquarenorm of w t +1 v t +1 isgivenby[110,Lemma5]: E S t +1 k w t +1 v t +1 k 2 4 K 2 t E 2 G 2 : (B.3) ThesamplingschemeIIestablishes S t +1 byuniformlysamplingalldeviceswithoutreplace- ment,inwhichwehavethe E S t +1 k w t +1 v t +1 k 2 4( N K ) K ( N 1) 2 t E 2 G 2 : (B.4) Wedenotethisupperboundasfollowsforconcisepresentation. E S t +1 k w t +1 v t +1 k 2 2 t C: (B.5) ComparisonofConvergenceRateswithRelatedWorks Inthissection,wecompareourconvergenceratewiththebest-knownresultsintheliterature (seeTableB.1).In[ 75 ],theauthorsprovide O (1 =NT ) convergencerateofnon-convex problemsunderPolyak-−ojasiewicz(PL)condition,whichmeanstheirresultscandirectly applytothestronglyconvexproblems.However,theirassumptionisbasedonbounded 216 gradientdiversity,de˝nedasfollows: w )= P k p k kr F k ( w ) k 2 2 k P k p k r F k ( w ) k 2 2 B Thisisamorerestrictiveassumptioncomparingtoassumingboundedgradientunderthe caseoftargetaccuracy ! 0 andPLcondition.Toseethis,considerthegradientdiversity attheglobaloptimal w ,i.e., w )= P k p k kr F k ( w ) k 2 2 k P k p k r F k ( w ) k 2 2 .For w ) tobebounded,it requires kr F k ( w ) k 2 2 =0 , 8 k .Thisindicates w isalsotheminimizerofeachlocalobjective, whichcontradictstothepracticalsettingofheterogeneousdata.Therefore,theirbound isnote˙ectiveforarbitrarysmall -accuracyundergeneralheterogeneousdatawhileour convergenceresultsstillholdinthiscase. 
TableB.1:Ahigh-levelsummaryoftheconvergenceresultsinthispapercomparedtoprior state-of-the-artFLalgorithms.Thistableonlyhighlightsthedependenceon T (numberof iterations), E (themaximalnumberoflocalsteps), N (thetotalnumberofdevices),and K N thenumberofparticipateddevices. istheconditionnumberofthesystemand 2 (0 ; 1) .WedenoteNesterovacceleratedFedAvgasN-FedAvginthistable. Reference Convergencerate E NonIID Participation ExtraAssumptions Setting FedAvg[110] O ( E 2 T ) O (1) 3 Partial Boundedgradient Stronglyconvex FedAvg[75] O ( 1 KT ) O ( K 1 = 3 T 2 = 3 ) y 3 zz Partial Boundedgradientdiversity Stronglyconvex x FedAvg[104] O ( 1 NT ) O ( N 1 = 2 T 1 = 2 ) 3 Full Boundedgradient Stronglyconvex FedAvg/N-FedAvg O ( 1 KT ) O ( N 1 = 2 T 1 = 2 ) z 3 Partial Boundedgradient Stronglyconvex FedAvg[98] O ( 1 p NT ) O ( N 3 = 2 T 1 = 2 ) 3 Full Boundedgradient Convex FedAvg[104] O ( 1 p NT ) O ( N 3 = 4 T 1 = 4 ) 3 Full Boundedgradient Convex FedAvg/N-FedAvg O 1 p KT O ( N 3 = 4 T 1 = 4 ) z 3 Partial Boundedgradient Convex FedAvg O exp( NT E 1 ) O ( T ) 3 Partial Boundedgradient OverparameterizedLR FedMass O exp( NT E p 1 ~ ) O ( T ) 3 Partial Boundedgradient OverparameterizedLR y This E isobtainedunderi.i.d.setting. z This E isobtainedunderfullparticipationsetting. x In[ 75 ],theconvergencerateisfornon-convexsmoothproblemswithPLcondition,which alsoappliestostronglyconvexproblems.Therefore,wecompareitwithourstronglyconvex resultshere. zz Theboundedgradientdiversityassumptionisnotapplicableforgeneralheterogeneous datawhenconvergingtoarbitrarilysmall -accuracy(seediscussionsinSecB). 217 ProofofConvergenceResultsforFedAvg StronglyConvexSmoothObjectives Tofacilitatereading,theoremsfromthemainpaperarerestatedandnumberedidentically. We˝rstsummarizesomepropertiesof L -smoothand -stronglyconvexfunctions[161]. Lemma8. Let F beaconvex L -smoothfunction.Thenwehavethefollowinginequalities: 1.Quadraticupperbound: 0 F ( w ) F ( w 0 ) hr F ( w 0 ) ; w w 0 i L 2 k w w 0 k 2 . 2.Coercivity: 1 L kr F ( w ) r F ( w 0 ) k 2 hr F ( w ) r F ( w 0 ) ; w w 0 i . 3.Lowerbound: F ( w ) F ( w 0 )+ hr F ( w 0 ) ; w w 0 i + 1 2 L kr F ( w ) r F ( w 0 ) k 2 .In particular, kr F ( w ) k 2 2 L ( F ( w ) F ( w )) . 4.Optimalitygap: F ( w ) F ( w ) F ( w ) ; w w i . Lemma9. Let F bea -stronglyconvexfunction.Then F ( w ) F ( w 0 )+ hr F ( w 0 ) ; w w 0 i + 1 2 kr F ( w ) r F ( w 0 ) k 2 F ( w ) F ( w ) 1 2 kr F ( w ) k 2 Theorem14. Let w T = P N k =1 p k w k T , max = max k Np k ,andsetdecayinglearningrates t = 1 4 ( + t ) with = max f 32 E g and = L .ThenunderAssumptions7,8,9,10withfull deviceparticipation, E F ( w T ) F = O 2 max ˙ 2 NT + 2 E 2 G 2 T 2 andwithpartialdeviceparticipationwithatmost K sampleddevicesateachcommunication 218 round, E F ( w T ) F = O 2 G 2 KT + 2 max ˙ 2 NT + 2 E 2 G 2 T 2 Proof. Theproofbuildsonideasfrom[ 110 ].The˝rststepistoobservethatthe L -smoothness of F providestheupperbound E ( F ( w t )) F = E ( F ( w t ) F ( w )) L 2 E k w t w k 2 andbound E k w t w k 2 . 
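Before the main recursion is developed, the following minimal simulation sketch may help fix the notation the proof tracks: each device runs E local SGD steps on its own objective, the parameters are synchronized by weighted averaging every E steps, and w_t denotes the (virtual) weighted average at every step. The quadratic local objectives, the noise level, and the stepsize constants below are illustrative placeholders only, not the setting of any experiment in this chapter.

import numpy as np

rng = np.random.default_rng(1)
N, E, T, d = 8, 5, 200, 4
p = np.full(N, 1.0 / N)                     # device weights p_k (uniform here)
centers = rng.standard_normal((N, d))       # toy local objectives F_k(w) = 0.5 * ||w - c_k||^2
w_star = (p[:, None] * centers).sum(0)      # minimizer of F = sum_k p_k F_k
sigma = 0.1                                 # stochastic-gradient noise level (placeholder)

w_local = np.zeros((N, d))                  # w^k_t, all initialized at w_0 = 0
gap = []
for t in range(T):
    eta = 4.0 / (t + 32)                    # decaying O(1/t) stepsize, as in Theorem 14
    for k in range(N):                      # one local SGD step per device
        g = (w_local[k] - centers[k]) + sigma * rng.standard_normal(d)
        w_local[k] -= eta * g
    if (t + 1) % E == 0:                    # communication round, full participation
        w_local[:] = (p[:, None] * w_local).sum(0)
    w_bar = (p[:, None] * w_local).sum(0)   # the virtual averaged iterate
    gap.append(np.sum((w_bar - w_star) ** 2))   # ||w_bar_t - w*||^2
print(gap[::50])                            # the squared distance to w* shrinks over time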
Ourmainstepistoprovethebound E k w t +1 w k 2 (1 t ) E k w t w k 2 + 2 t 1 N 2 max ˙ 2 +5 E 2 3 t G 2 Wehave k w t +1 w k 2 = k ( w t t g t ) w k 2 = k ( w t t g t w ) t ( g t g t ) k 2 = A 1 + A 2 + A 3 where A 1 = k w t w t g t k 2 A 2 =2 t h w t w t g t ; g t g t i 219 A 3 = 2 t k g t g t k 2 Byde˝nitionof g t and g t (seeEq(B.2)),wehave E A 2 =0 .For A 3 ,wehavethefollow upperbound: 2 t E k g t g t k 2 = 2 t E k g t E g t k 2 = 2 t N X k =1 p 2 k k g t;k E g t;k k 2 2 t N X k =1 p 2 k ˙ 2 k againbyJensen'sinequalityandusingtheindependenceof g t;k ; g t;k 0 [110,Lemma2]. Nextwebound A 1 : k w t w t g t k 2 = k w t w k 2 +2 h w t w ; t g t i + k t g t k 2 andwewillshowthatthethirdterm k t g t k 2 canbecanceledbyanupperboundofthe secondterm. Now 2 t h w t w ; g t i = 2 t N X k =1 p k h w t w ; r F k ( w k t ) i = 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i 2 t N X k =1 p k h w k t w ; r F k ( w k t ) i 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i +2 t N X k =1 p k ( F k ( w ) F k ( w k t )) t N X k =1 p k k w k t w k 2 220 2 t N X k =1 p k F k ( w k t ) F k ( w t )+ L 2 k w t w k t k 2 + F k ( w ) F k ( w k t ) t k N X k =1 p k w k t w k 2 = t L N X k =1 p k k w t w k t k 2 +2 t N X k =1 p k [ F k ( w ) F k ( w t )] t k w t w k 2 Forthesecondterm,whichisnegative,wecanignoreit,butthisyieldsasuboptimalbound thatfailstoprovidethedesiredlinearspeedup.Instead,weupperbounditusingthefollowing derivation: 2 t N X k =1 p k [ F k ( w ) F k ( w t )] 2 t [ F ( w t +1 ) F ( w t )] 2 t E hr F ( w t ) ; w t +1 w t i + t L E k w t +1 w t k 2 = 2 2 t E hr F ( w t ) ; g t i + 3 t L E k g t k 2 = 2 2 t E hr F ( w t ) ; g t i + 3 t L E k g t k 2 = 2 t h kr F ( w t ) k 2 + k g t k 2 kr F ( w t ) g t k 2 i + 3 t L E k g t k 2 = 2 t " kr F ( w t ) k 2 + k g t k 2 kr F ( w t ) X k p k r F ( w k t ) k 2 # + 3 t L E k g t k 2 2 t " kr F ( w t ) k 2 + k g t k 2 X k p k kr F ( w t ) r F ( w k t ) k 2 # + 3 t L E k g t k 2 2 t " kr F ( w t ) k 2 + k g t k 2 L 2 X k p k k w t w k t k 2 # + 3 t L E k g t k 2 2 t k g t k 2 + 2 t L 2 X k p k k w t w k t k 2 + 3 t L E k g t k 2 2 t kr F ( w t ) k 2 wherewehaveusedthesmoothnessof F twice. 221 Notethattheterm 2 t k g t k 2 exactlycancelsthe 2 t k g t k 2 intheboundfor A 1 ,sothat pluggingintheboundfor 2 t h w t w ; g t i ,wehavesofarproved E k w t +1 w k 2 E (1 t ) k w t w k 2 + t L N X k =1 p k k w t w k t k 2 + 2 t N X k =1 p 2 k ˙ 2 k + 2 t L 2 N X k =1 p k k w t w k t k 2 + 3 t L E k g t k 2 2 t kr F ( w t ) k 2 Theterm E k g t k 2 G 2 byassumption. Nowwebound E P N k =1 p k k w t w k t k 2 following[ 110 ].Sincecommunicationisdoneevery E steps,forany t 0 ,wecan˝nda t 0 t suchthat t t 0 E 1 and w k t 0 = w t 0 forall k . 
Moreover,using t isnon-increasingand t 0 2 t forany t t 0 E 1 ,wehave E N X k =1 p k k w t w k t k 2 = E N X k =1 p k k w k t w t 0 ( w t w t 0 ) k 2 E N X k =1 p k k w k t w t 0 k 2 = E N X k =1 p k k w k t w k t 0 k 2 = E N X k =1 p k k t 1 X i = t 0 i g i;k k 2 2 N X k =1 p k E t 1 X i = t 0 E 2 i k g i;k k 2 2 N X k =1 p k E 2 2 t 0 G 2 4 E 2 2 t G 2 222 Usingtheboundon E P N k =1 p k k w t w k t k 2 ,wecanconcludethat,with max := N max k p k and min := N min k p k , E k w t +1 w k 2 E (1 t ) k w t w k 2 +4 E 2 3 t G 2 +4 E 2 L 2 4 t G 2 + 2 t N X k =1 p 2 k ˙ 2 k + 3 t LG 2 = E (1 t ) k w t w k 2 +4 E 2 3 t G 2 +4 E 2 L 2 4 t G 2 + 2 t 1 N 2 N X k =1 ( p k N ) 2 ˙ 2 k + 3 t LG 2 E (1 t ) k w t w k 2 +4 E 2 3 t G 2 +4 E 2 L 2 4 t G 2 + 2 t 1 N 2 2 max N X k =1 ˙ 2 k + 3 t LG 2 E (1 t ) k w t w k 2 +6 E 2 3 t G 2 + 2 t 1 N 2 max ˙ 2 whereinthelastinequalityweuse ˙ 2 = max k ˙ 2 k ,andassume t satis˝es t 1 8 .Weshow nextthat E k w t w k 2 = O ( 1 tN + E 2 LG 2 t 2 ) . Let C 6 E 2 LG 2 and D 1 N 2 max ˙ 2 .Supposethatwehaveshown E k w t w k 2 b ( t D + 2 t C ) forsomeconstant b and t .Then E k w t +1 w k 2 b (1 t )( t D + 2 t C )+ 2 t D + 3 t C =( b (1 t )+ t ) t D +( b (1 t )+ t ) 2 t C 223 andsoitremainstochoose t and b suchthat ( b (1 t )+ t ) t t +1 and ( b (1 t )+ t ) 2 t 2 t +1 .Recallthatwerequire t 0 2 t forany t t 0 E 1 ,and t 1 8 . Ifwelet t = 4 ( t + ) where = max f E; 32 g ,thenwemaycheckthat t satis˝esboth requirements. Setting b = 4 ,wehave ( b (1 t )+ t ) t = b (1 4 t + )+ 4 ( t + ) 4 ( t + ) = b t + 4 t + + 4 ( t + ) 4 ( t + ) = b ( t + 3 t + ) 4 ( t + ) b ( t + 1 t + ) 4 ( t + ) b 4 ( t + +1) = t +1 and ( b (1 t )+ t ) 2 t = b (1 4 t + )+ 4 ( t + ) 16 2 ( t + ) 2 = b t + 4 t + + 4 ( t + ) 16 2 ( t + ) 2 = b ( t + 2 t + ) 16 2 ( t + ) 2 b 16 2 ( t + +1) 2 = 2 t +1 wherewehaveusedthefactsthat t + 1 ( t + ) 2 1 ( t + +1) 224 t + 2 ( t + ) 3 1 ( t + +1) 2 for 1 . Thuswehaveshown E k w t +1 w k 2 b ( t +1 D + 2 t +1 C ) forourchoiceof t and b .Nowtoensure k w 0 w k 2 b ( 0 D + 2 0 C ) = b ( 4 D + 16 2 2 C ) wecansimplyscale b by c k w 0 w k 2 foraconstant c largeenoughandtheinductionstep stillholds. Itfollowsthat E k w t w k 2 c k w 0 w k 2 4 ( D t + C 2 t ) forall t 0 . Finally,the L -smoothnessof F implies E ( F ( w T )) F = E ( F ( w T ) F ( w )) L 2 E k w T w k 2 L 2 c k w 0 w k 2 4 ( D T + C 2 T ) 225 =2 c k w 0 w k 2 ( D T + C 2 T ) 2 c k w 0 w k 2 4 ( T + ) 1 N 2 max ˙ 2 +6 E 2 LG 2 ( 4 ( T + ) ) 2 = O ( 1 N 2 max ˙ 2 1 T + 2 E 2 G 2 1 T 2 ) Withpartialparticipation,theupdateateachcommunicationroundisnowgivenby averagesoverasubsetofsampleddevices.When t +1 = 2I E , v t +1 = w t +1 ,whilewhen t +1 = 2I E ,wehave E w t +1 = v t +1 bydesignofthesamplingschemes,sothat E k w t +1 w k 2 = E k w t +1 v t +1 + v t +1 w k 2 = E k w t +1 v t +1 k 2 + E k v t +1 w k 2 Asbefore, E k v t +1 w k 2 E (1 t ) k w t w k 2 +6 E 2 3 t G 2 + 2 t 1 N 2 max ˙ 2 . Thekeyistobound E k w t +1 v t +1 k 2 .ForsamplingschemeIwehave E k w t +1 v t +1 k 2 = 1 K X k p k E k w k t +1 w t +1 k 2 4 K 2 t E 2 G 2 whileforsamplingschemeII E k w t +1 v t +1 k 2 = N K N 1 1 K X k p k E k w k t +1 w t +1 k 2 N K N 1 4 K 2 t E 2 G 2 226 Thesameargumentasthefullparticipationcaseimplies E F ( w T ) F = O ( 2 max ˙ 2 NT + 2 G 2 KT + 2 E 2 G 2 T 2 ) Onemayaskwhetherthedependenceon E intheterm 2 G 2 KT canberemoved,or equivalentlywhether P k p k k w k t w t k 2 = O (1 =T 2 ) canbeindependentof E .Weprovidea simplecounterexamplethatshowsthatthisisnotpossibleingeneral. Lemma10. 
Thereexistsadatasetsuchthatif E = O ( T ) forany > 0 then P k p k k w k t w t k 2 = 1 T 2 2 ) . Proof. Supposethatwehaveanevennumberofdevicesandeach F k ( w )= 1 n k P n k j =1 ( x j k w ) 2 containsdatapoints x j k = w ;k ,with n k n .Moreover,the w ;k 'scomeinpairsaround theorigin.Asaresult,theglobalobjective F isminimizedat w =0 .Moreover,ifwestart from w 0 =0 ,thenbydesignofthedatasettheupdatesinlocalstepsexactlycanceleach otherateachiteration,resultingin w t =0 forall t .Ontheotherhand,if E = T ,then startingfromany t = O ( T ) withconstantstepsize O ( 1 T ) ,after E iterationsoflocalsteps, thelocalparametersareupdatedtowards w ;k with k w k t + E k 2 = T 1 T ) 2 )= 1 T 2 2 ) . Thisimpliesthat X k p k k w k t + E w t + E k 2 = X k p k k w k t + E k 2 = 1 T 2 2 ) whichisataslowerratethan 1 T 2 forany > 0 .Thusthesamplingvariance E k w t +1 227 v t +1 k 2 = P k p k E k w k t +1 w t +1 k 2 ) decaysataslowerratethan 1 T 2 ,resultingina convergencerateslowerthan O ( 1 T ) withpartialparticipation. ConvexSmoothObjectives Theorem15. Underassumptions7,9,10andconstantlearningrate t = O ( q N T ) , min t T F ( w t ) F ( w )= O max ˙ 2 p NT + NE 2 LG 2 T withfullparticipation,andwithpartialdeviceparticipationwith K sampleddevicesateach communicationroundandlearningrate t = O ( q K T ) , min t T F ( w t ) F ( w )= O max ˙ 2 p KT + E 2 G 2 p KT + KE 2 LG 2 T Proof. Weagainstartbyboundingtheterm k w t +1 w k 2 = k ( w t t g t ) w k 2 = k ( w t t g t w ) t ( g t g t ) k 2 = A 1 + A 2 + A 3 where A 1 = k w t w t g t k 2 A 2 =2 t h w t w t g t ; g t g t i A 3 = 2 t k g t g t k 2 228 Byde˝nitionof g t and g t (seeEq(B.2)),wehave E A 2 =0 .For A 3 ,wehavethefollow upperbound: 2 t E k g t g t k 2 = 2 t E k g t E g t k 2 = 2 t N X k =1 p 2 k k g t;k E g t;k k 2 2 t N X k =1 p 2 k ˙ 2 k againbyJensen'sinequalityandusingtheindependenceof g t;k ; g t;k 0 [110,Lemma2]. Nextwebound A 1 : k w t w t g t k 2 = k w t w k 2 +2 h w t w ; t g t i + k t g t k 2 Usingtheconvexityand L -smoothnessof F k , 2 t h w t w ; g t i = 2 t N X k =1 p k h w t w ; r F k ( w k t ) i = 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i 2 t N X k =1 p k h w k t w ; r F k ( w k t ) i 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i +2 t N X k =1 p k ( F k ( w ) F k ( w k t )) 2 t N X k =1 p k F k ( w k t ) F k ( w t )+ L 2 k w t w k t k 2 + F k ( w ) F k ( w k t ) = t L N X k =1 p k k w t w k t k 2 +2 t N X k =1 p k [ F k ( w ) F k ( w t )] whichresultsin k w t +1 w k 2 k w t w k 2 + t L N X k =1 p k k w t w k t k 2 229 +2 t N X k =1 p k [ F k ( w ) F k ( w t )]+ 2 t k g t k 2 + 2 t N X k =1 p 2 k ˙ 2 k Thedi˙erenceofthisboundwiththatinthestronglyconvexcaseisthatwenolonger haveacontractionfactorinfrontof k w t w k 2 .Inthestronglyconvexcase,wewereable tocancel 2 t k g t k 2 with 2 t P N k =1 p k [ F k ( w ) F k ( w t )] andobtainonlylowerorderterms. Intheconvexcase,weuseadi˙erentstrategyandpreserve P N k =1 p k [ F k ( w ) F k ( w t )] in ordertoobtainatelescopingsum. 
Wehave k g t k 2 = k X k p k r F k ( w k t ) k 2 = k X k p k r F k ( w k t ) X k p k r F k ( w t )+ X k p k r F k ( w t ) k 2 2 k X k p k r F k ( w k t ) X k p k r F k ( w t ) k 2 +2 k X k p k r F k ( w t ) k 2 2 L 2 X k p k k w k t w t k 2 +2 k X k p k r F k ( w t ) k 2 =2 L 2 X k p k k w k t w t k 2 +2 kr F ( w t ) k 2 using r F ( w )=0 .Nowusingthe L smoothnessof F ,wehave kr F ( w t ) k 2 2 L ( F ( w t ) F ( w )) ,sothat k w t +1 w k 2 k w t w k 2 + t L N X k =1 p k k w t w k t k 2 +2 t N X k =1 p k [ F k ( w ) F k ( w t )] +2 2 t L 2 X k p k k w k t w t k 2 +4 2 t L ( F ( w t ) F ( w ))+ 2 t N X k =1 p 2 k ˙ 2 k 230 = k w t w k 2 +(2 2 t L 2 + t L ) N X k =1 p k k w t w k t k 2 + t N X k =1 p k [ F k ( w ) F k ( w t )] + 2 t N X k =1 p 2 k ˙ 2 k + t (1 4 t L )( F ( w ) F ( w t )) Since F ( w ) F ( w t ) ,aslongas 4 t L 1 ,wecanignorethelastterm,andrearrangethe inequalitytoobtain k w t +1 w k 2 + t ( F ( w t ) F ( w )) k w t w k 2 +(2 2 t L 2 + t L ) N X k =1 p k k w t w k t k 2 + 2 t N X k =1 p 2 k ˙ 2 k k w t w k 2 + 3 2 t L N X k =1 p k k w t w k t k 2 + 2 t N X k =1 p 2 k ˙ 2 k Thesameargumentasbeforeyields E P N k =1 p k k w t w k t k 2 4 E 2 2 t G 2 whichgives k w t +1 w k 2 + t ( F ( w t ) F ( w )) k w t w k 2 + 2 t N X k =1 p 2 k ˙ 2 k +6 3 t E 2 LG 2 k w t w k 2 + 2 t 1 N 2 max ˙ 2 +6 3 t E 2 LG 2 Summingtheinequalitiesfrom t =0 to t = T ,weobtain T X t =0 t ( F ( w t ) F ( w )) k w 0 w k 2 + T X t =0 2 t 1 N 2 max ˙ 2 + T X t =0 3 t 6 E 2 LG 2 sothat min t T F ( w t ) F ( w ) 1 P T t =0 t 0 @ k w 0 w k 2 + T X t =0 2 t 1 N 2 max ˙ 2 + T X t =0 3 t 6 E 2 LG 2 1 A 231 Bysettingtheconstantlearningrate t q N T ,wehave min t T F ( w t ) F ( w ) 1 p NT k w 0 w k 2 + 1 p NT T N T 1 N 2 max ˙ 2 + 1 p NT T ( r N T ) 3 6 E 2 LG 2 1 p NT k w 0 w k 2 + 1 p NT T N T 1 N 2 max ˙ 2 + N T 6 E 2 LG 2 =( k w 0 w k 2 + 2 max ˙ 2 ) 1 p NT + N T 6 E 2 LG 2 = O ( 2 max ˙ 2 p NT + NE 2 LG 2 T ) Similarly,forpartialparticipation,wehave min t T F ( w t ) F ( w ) 1 P T t =0 t 0 @ k w 0 w k 2 + T X t =0 2 t ( 1 N max ˙ 2 + C )+ T X t =0 3 t 6 E 2 LG 2 1 A where C = 4 K E 2 G 2 or N K N 1 4 K E 2 G 2 ,sothatwith t = q K T ,wehave min t T F ( w t ) F ( w )= O ( max ˙ 2 p KT + E 2 G 2 p KT + KE 2 LG 2 T ) 232 ProofofConvergenceResultsforNesterovAcceleratedFe- dAvg StronglyConvexSmoothObjectives Theorem16. Let v T = P N k =1 p k v k T andsetlearningrates t 1 = 3 14( t + )(1 6 t + )max f 1 g , t = 6 ( t + ) .ThenunderAssumptions7,8,9,10withfulldeviceparticipation, E F ( v T ) F = O max ˙ 2 NT + 2 E 2 G 2 T 2 ; andwithpartialdeviceparticipationwith K sampleddevicesateachcommunicationround, E F ( v T ) F = O max ˙ 2 NT + 2 G 2 KT + 2 E 2 G 2 T 2 : Proof. De˝nethevirtualsequences v t = P N k =1 p k v k t , w t = P N k =1 p k w k t ,and g t = P N k =1 p k E g t;k . Wehave E g t = g t and v t +1 = w t t g t ,and w t +1 = v t +1 forall t .Theproofagainuses the L -smoothnessof F tobound E ( F ( v t )) F = E ( F ( v t ) F ( w )) L 2 E k v t w k 2 Ourmainstepistoprovethebound E k v t +1 w k 2 (1 t ) E k v t w k 2 + 2 t 1 N 2 max ˙ 2 +20 E 2 3 t G 2 233 forappropriatestepsizes t ; t . 
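As a concrete reference for the sequences w_t and v_t manipulated in this proof, here is a minimal sketch of the Nesterov accelerated FedAvg update in the form restated in the experiments section of this appendix: each device takes an SGD step to y^k_{t+1} and then extrapolates with momentum beta, and the extrapolated iterates are averaged at communication rounds. The toy objective, the momentum value, and the stepsize schedule are placeholders rather than the tuned constants of Theorem 16.

import numpy as np

rng = np.random.default_rng(2)
N, E, T, d = 8, 5, 300, 4
p = np.full(N, 1.0 / N)
centers = rng.standard_normal((N, d))            # toy local objectives F_k(w) = 0.5 * ||w - c_k||^2
sigma, beta = 0.1, 0.1                            # noise level and momentum weight (placeholders)

w = np.zeros((N, d))                              # w^k_t
y_prev = np.zeros((N, d))                         # y^k_t
for t in range(T):
    eta = 4.0 / (t + 32)
    g = (w - centers) + sigma * rng.standard_normal((N, d))   # stochastic gradients g_{t,k}
    y = w - eta * g                               # y^k_{t+1} = w^k_t - eta_t * g_{t,k}
    w = y + beta * (y - y_prev)                   # Nesterov extrapolation
    if (t + 1) % E == 0:                          # communication: average the extrapolated iterates
        w[:] = (p[:, None] * w).sum(0)
    y_prev = y
w_bar = (p[:, None] * w).sum(0)
print(np.linalg.norm(w_bar - centers.mean(0)))    # distance to w*; should be small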
Wehave k v t +1 w k 2 = k ( w t t g t ) w k 2 = k ( w t t g t w ) t ( g t g t ) k 2 = A 1 + A 2 + A 3 where A 1 = k w t w t g t k 2 A 2 =2 t h w t w t g t ; g t g t i A 3 = 2 t k g t g t k 2 Byde˝nitionof g t and g t (seeEq(B.2)),wehave E A 2 =0 .For A 3 ,wehavethefollow upperbound: 2 t E k g t g t k 2 = 2 t E k g t E g t k 2 = 2 t N X k =1 p 2 k k g t;k E g t;k k 2 2 t N X k =1 p 2 k ˙ 2 k againbyJensen'sinequalityandusingtheindependenceof g t;k ; g t;k 0 [110,Lemma2]. Nextwebound A 1 : k w t w t g t k 2 = k w t w k 2 +2 h w t w ; t g t i + k t g t k 2 234 SameastheSGDcase, 2 t h w t w ; g t i + k t g t k 2 t L N X k =1 p k k w t w k t k 2 + 2 t L 2 X k p k k w t w k t k 2 + 3 t L E k g t k 2 t k w t w k 2 sothat k w t w t g t k 2 (1 t ) k w t w k 2 + t L N X k =1 p k k w t w k t k 2 + 2 t L 2 X k p k k w t w k t k 2 + 3 t L E k g t k 2 Di˙erentfromtheSGDcase,wehave k w t w k 2 = k v t + t 1 ( v t v t 1 ) w k 2 = k (1+ t 1 )( v t w ) t 1 ( v t 1 w ) k 2 =(1+ t 1 ) 2 k v t w k 2 2 t 1 (1+ t 1 ) h v t w ; v t 1 w i + 2 t 1 k ( v t 1 w ) k 2 (1+ t 1 ) 2 k v t w k 2 +2 t 1 (1+ t 1 ) k v t w kk v t 1 w k + 2 t 1 k ( v t 1 w ) k 2 whichgives k v t +1 w k 2 235 (1 t )(1+ t 1 ) 2 k v t w k 2 +2(1 t ) t 1 (1+ t 1 ) k v t w kk v t 1 w k + 2 t N X k =1 p 2 k ˙ 2 k + 2 t 1 (1 t ) k ( v t 1 w ) k 2 + t L N X k =1 p k k w t w k t k 2 + 2 t L 2 X k p k k w t w k t k 2 + 3 t LG 2 andwewillusingthisrecursiverelationtoobtainthedesiredbound. Firstwebound E P N k =1 p k k w t w k t k 2 .Sincecommunicationisdoneevery E steps,for any t 0 ,wecan˝nda t 0 t suchthat t t 0 E 1 and w k t 0 = w t 0 forall k .Moreover, using t isnon-increasing, t 0 2 t ,and t t forany t t 0 E 1 ,wehave E N X k =1 p k k w t w k t k 2 = E N X k =1 p k k w k t w t 0 ( w t w t 0 ) k 2 E N X k =1 p k k w k t w t 0 k 2 = E N X k =1 p k k w k t w k t 0 k 2 = E N X k =1 p k k t 1 X i = t 0 i ( v k i +1 v k i ) t 1 X i = t 0 i g i;k k 2 2 N X k =1 p k E t 1 X i = t 0 ( E 1) 2 i k g i;k k 2 +2 N X k =1 p k E t 1 X i = t 0 ( E 1) 2 i k ( v k i +1 v k i ) k 2 2 N X k =1 p k E t 1 X i = t 0 ( E 1) 2 i ( k g i;k k 2 + k ( v k i +1 v k i ) k 2 ) 4 N X k =1 p k E t 1 X i = t 0 ( E 1) 2 i G 2 236 4( E 1) 2 2 t 0 G 2 16( E 1) 2 2 t G 2 wherewehaveused E k v k t v k t 1 k 2 G 2 .Toseethisidentityforappropriate t ; t ,note therecursion v k t +1 v k t = w k t w k t 1 ( t g t;k t 1 g t 1 ;k ) w k t +1 w k t = t g t;k + t ( v k t +1 v k t ) sothat v k t +1 v k t = t 1 g t 1 ;k + t 1 ( v k t v k t 1 ) ( t g t;k t 1 g t 1 ;k ) = t 1 ( v k t v k t 1 ) t g t;k Sincetheidentity v k t +1 v k t = t 1 ( v k t v k t 1 ) t g t;k implies E k v k t +1 v k t k 2 2 2 t 1 E k v k t v k t 1 k 2 +2 2 t G 2 aslongas t ; t 1 satisfy 2 2 t 1 +2 2 t 1 = 2 ,wecanguaranteethat E k v k t v k t 1 k 2 G 2 forall k byinduction.ThistogetherwithJensen'sinequalityalsogives E k v t v t 1 k 2 G 2 forall t . Usingtheboundon E P N k =1 p k k w t w k t k 2 ,wecanconcludethat,with max := N max k p k , E k v t +1 w k 2 237 E (1 t )(1+ t 1 ) 2 k v t w k 2 +16 E 2 3 t G 2 +16 E 2 L 2 4 t G 2 + 3 t LG 2 +(1 t ) 2 t 1 k ( v t 1 w ) k 2 + 2 t N X k =1 p 2 k ˙ 2 k +2 t 1 (1+ t 1 )(1 t ) k v t w kk v t 1 w k E (1 t )(1+ t 1 ) 2 k v t w k 2 +20 E 2 3 t G 2 +(1 t ) 2 t 1 k ( v t 1 w ) k 2 + 2 t 1 N max ˙ 2 +2 t 1 (1+ t 1 )(1 t ) k v t w kk v t 1 w k where ˙ 2 = P k p k ˙ 2 k ,and t satis˝es t 1 5 .Weshownextthat E k v t w k 2 = O ( 1 tN + E 2 t 2 ) byinduction. 
Assumethatwehaveshown E k y t w k 2 b ( C 2 t + D t ) foralliterationsuntil t ,where C =20 E 2 LG 2 , D = 1 N 2 max ˙ 2 ,and b istobechosenlater.For stepsizeswechoose t = 6 1 t + and t 1 = 3 14( t + )(1 6 t + )max f 1 g where = max f 32 E g ,so that t 1 t and (1 t )(1+14 t 1 ) (1 6 t + )(1+ 3 ( t + )(1 6 t + ) ) =1 6 t + + 3 t + =1 3 t + =1 t 2 Recallthatwealsorequire t 0 2 t forany t t 0 E 1 , t 1 5 ,and 2 2 t 1 +2 2 t 1 = 2 , whichwecanalsochecktoholdbyde˝nitionof t and t . Moreover, E k y t w k 2 b ( C 2 t + D t ) withthechosenstepsizesalsoimplies k v t 1 238 w k 2 k v t w k .Thereforetheboundfor E k v t +1 w k 2 canbefurthersimpli˝edwith 2 t 1 (1+ t 1 )(1 t ) k v t w kk v t 1 w k 4 t 1 (1+ t 1 )(1 t ) k v t w k 2 and (1 t ) 2 t 1 k ( v t 1 w ) k 2 4(1 t ) 2 t 1 k ( v t w ) k 2 sothat E k v t +1 w k 2 (1 t )((1+ t 1 ) 2 +4 t 1 (1+ t 1 )+4 2 t 1 ) E k ( v t w ) k 2 +20 E 2 3 t G 2 + 2 t 1 N max ˙ 2 E (1 t )(1+14 t 1 ) k ( v t w ) k 2 +20 E 2 3 t G 2 + 2 t 1 N max ˙ 2 b (1 t 2 )( C 2 t + D t )+ C 3 t + D 2 t =( b (1 t 2 )+ t ) 2 t C +( b (1 t 2 )+ t ) t D andsoitremainstochoose b suchthat ( b (1 t 2 )+ t ) t t +1 ( b (1 t 2 )+ t ) 2 t 2 t +1 fromwhichwecanconclude E k v t +1 w k 2 2 t +1 C + t +1 D . 239 With b = 6 ,wehave ( b (1 t 2 )+ t ) t =( b (1 ( 3 t + )+ 6 ( t + ) ) 6 ( t + ) =( b t + 3 t + + 6 ( t + ) ) 6 ( t + ) b ( t + 1 t + ) 6 ( t + ) b 6 ( t + +1) = t +1 wherewehaveused t + 1 ( t + ) 2 1 t + +1 . Similarly ( b (1 t 2 )+ t ) 2 t =( b (1 ( 3 t + )+ 6 ( t + ) )( 6 ( t + ) ) 2 =( b t + 3 t + + 6 ( t + ) )( 6 ( t + ) ) 2 = b ( t + 2 t + )( 6 ( t + ) ) 2 b 36 2 ( t + +1) 2 = 2 t +1 wherewehaveused t + 2 ( t + ) 3 1 ( t + +1) 2 . Finally,toensure k v 0 w k 2 b ( C 2 0 + D 0 ) ,wecanrescale b by c k v 0 w k 2 forsome c: Itfollowsthat E k v t w k 2 b ( C 2 t + D t ) forall t .Thus E ( F ( w T )) F = E ( F ( w T ) F ( w )) L 2 E k w T w k 2 L 2 c k w 0 w k 2 6 ( D T + C 2 T ) =3 c k w 0 w k 2 ( D T + C 2 T ) 240 3 c k w 0 w k 2 6 ( T + ) 1 N max ˙ 2 +20 E 2 LG 2 ( 6 ( T + ) ) 2 = O ( 1 N max ˙ 2 1 T + 2 E 2 G 2 1 T 2 ) Withpartialparticipation,thesameargumentintheSGDcaseyields E F ( w T ) F = O ( max ˙ 2 NT + 2 G 2 KT + 2 E 2 G 2 T 2 ) ConvexSmoothObjectives Theorem17. Setlearningrates t = t = O ( q N T ) .ThenunderAssumptions7,9,10 NesterovacceleratedFedAvgwithfulldeviceparticipationhasrate min t T F ( w t ) F = O max ˙ 2 p NT + NE 2 LG 2 T ; andwithpartialdeviceparticipationwith K sampleddevicesateachcommunicationround, min t T F ( w t ) F = O max ˙ 2 p KT + E 2 G 2 p KT + KE 2 LG 2 T : Proof. De˝ne p t := t 1 t [ w t w t 1 + t g t 1 ] = 2 t 1 t ( v t v t 1 ) for t 1 and0for t =0 . 
Wecancheckthat w t +1 + p t +1 = w t + p t t 1 t g t 241 Nowwede˝ne z t := w t + p t and t = t 1 t forall t ,sothatwehavetherecursiverelation z t +1 = z t t g t Now k z t +1 w k 2 = k ( z t t g t ) w k 2 = k ( z t t g t w ) t ( g t g t ) k 2 = A 1 + A 2 + A 3 where A 1 = k z t w t g t k 2 A 2 =2 t h z t w t g t ; g t g t i A 3 = 2 t k g t g t k 2 whereagain E A 2 =0 and E A 3 2 t P k p 2 k ˙ 2 k .For A 1 wehave k z t w t g t k 2 = k z t w k 2 +2 h z t w ; t g t i + k t g t k 2 Usingtheconvexityand L -smoothnessof F k , 2 t h z t w ; g t i = 2 t N X k =1 p k h z t w ; r F k ( w k t ) i 242 = 2 t N X k =1 p k h z t w k t ; r F k ( w k t ) i 2 t N X k =1 p k h w k t w ; r F k ( w k t ) i = 2 t N X k =1 p k h z t w t ; r F k ( w k t ) i 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i 2 t N X k =1 p k h w k t w ; r F k ( w k t ) i 2 t N X k =1 p k h z t w t ; r F k ( w k t ) i 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i +2 t N X k =1 p k ( F k ( w ) F k ( w k t )) 2 t N X k =1 p k F k ( w k t ) F k ( w t )+ L 2 k w t w k t k 2 + F k ( w ) F k ( w k t ) 2 t N X k =1 p k h z t w t ; r F k ( w k t ) i = t L N X k =1 p k k w t w k t k 2 +2 t N X k =1 p k [ F k ( w ) F k ( w t )] 2 t N X k =1 p k h z t w t ; r F k ( w k t ) i whichresultsin E k w t +1 w k 2 E k w t w k 2 + t L N X k =1 p k k w t w k t k 2 +2 t N X k =1 p k [ F k ( w ) F k ( w t )] + 2 t k g t k 2 + 2 t N X k =1 p 2 k ˙ 2 k 2 t N X k =1 p k h z t w t ; r F k ( w k t ) i Asbefore, k g t k 2 2 L 2 P k p k k w k t w t k 2 +4 L ( F ( w t ) F ( w )) ,sothat 2 t k g t k 2 + t N X k =1 p k [ F k ( w ) F k ( w t )] 2 L 2 2 t X k p k k w k t w t k 2 + t (1 4 t L )( F ( w ) F ( w t )) 243 2 L 2 2 t X k p k k w k t w t k 2 for t 1 = 4 L .Using P N k =1 p k k w t w k t k 2 16 E 2 2 t G 2 and P N k =1 p 2 k ˙ 2 k max 1 N ˙ 2 ,it followsthat E k w t +1 w k 2 + t ( F ( w t ) F ( w )) E k w t w k 2 +( t L +2 L 2 2 t ) N X k =1 p k k w t w k t k 2 + 2 t N X k =1 p 2 k ˙ 2 k 2 t N X k =1 p k h z t w t ; r F k ( w k t ) i E k w t w k 2 +32 LE 2 2 t t G 2 + 2 t max 1 N ˙ 2 2 t N X k =1 p k h z t w t ; r F k ( w k t ) i if t 1 2 L .Itremainstobound E P N k =1 p k h z t w t ; r F k ( w k t ) i .Recallthat z t w t = t 1 t [ w t w t 1 + t g t 1 ] = 2 t 1 t ( v t v t 1 ) and E k v t v t 1 k 2 G 2 , E kr F k ( w k t ) k 2 G 2 . 
Cauchy-Schwarzgives E N X k =1 p k h z t w t ; r F k ( w k t ) i N X k =1 p k q E k z t w t k 2 q E kr F k ( w k t ) k 2 2 t 1 t G 2 Thus E k w t +1 w k 2 + t ( F ( w t ) F ( w )) 244 E k w t w k 2 +32 LE 2 2 t t G 2 + 2 t max 1 N ˙ 2 +2 t 2 t 1 t G 2 Summingtheinequalitiesfrom t =0 to t = T ,weobtain T X t =0 t ( F ( w t ) F ( w )) k w 0 w k 2 + T X t =0 2 t 1 N max ˙ 2 + T X t =0 t 2 t 32 LE 2 G 2 + T X t =0 2 t 2 t 1 t G 2 sothat min t T F ( w t ) F ( w ) 1 P T t =0 t 0 @ k w 0 w k 2 + T X t =0 2 t 1 N max ˙ 2 + T X t =0 t 2 t 32 LE 2 G 2 + T X t =0 2 t 2 t 1 t G 2 1 A Bysettingtheconstantlearningrates t q N T and t c q N T sothat t = t 1 t = q N T 1 c q N T 2 q N T ,wehave min t T F ( w t ) F ( w ) 1 2 p NT k w 0 w k 2 + 2 p NT T N T 1 N max ˙ 2 + 1 p NT T ( r N T ) 3 32 LE 2 G 2 + 2 p NT T ( r N T ) 3 G 2 =( 1 2 k w 0 w k 2 +2 max ˙ 2 ) 1 p NT + N T (32 LE 2 G 2 +2 G 2 ) = O ( max ˙ 2 p NT + NE 2 LG 2 T ) 245 Similarly,forpartialparticipation,wehave min t T F ( w t ) F ( w ) 1 P T t =0 t 0 @ k w 0 w k 2 + T X t =0 2 t ( 1 N max ˙ 2 + C )+ T X t =0 3 t 6 E 2 LG 2 1 A where C = 4 K E 2 G 2 or N K N 1 4 K E 2 G 2 ,sothatwith t q K T and t c q K T ,wehave min t T F ( w t ) F ( w )= O ( max ˙ 2 p KT + E 2 G 2 p KT + KE 2 LG 2 T ) ProofofGeometricConvergenceResultsforOverparame- terizedProblems GeometricConvergenceofFedAvgforgeneralstronglyconvexand smoothobjectives Theorem18. Fortheoverparameterizedsettingwithgeneralstronglyconvexandsmooth objectives,FedAvgwithlocalSGDupdatesandcommunicationevery E iterationswithconstant stepsize = 1 2 E N l max + L ( N min ) givestheexponentialconvergenceguarantee E F ( w t ) L 2 (1 ) t k w 0 w k 2 = O (exp( 2 E N l max + L ( N min ) t ) k w 0 w k 2 ) Proof. Toillustratethemainideasoftheproof,we˝rstpresenttheprooffor E =2 .Let 246 t 1 beacommunicationround,sothat w k t 1 = w t 1 .Weshowthat k w t +1 w k 2 (1 t )(1 t 1 ) k w t 1 w k 2 forappropriatelychosenconstantstepsizes t ; t 1 .Wehave k w t +1 w k 2 = k ( w t t g t ) w k 2 = k w t w k 2 2 t h w t w ; g t i + 2 t k g t k 2 andthecrosstermcanbeboundedasusualusing -convexityand L -smoothnessof F k : 2 t E t h w t w ; g t i = 2 t N X k =1 p k h w t w ; r F k ( w k t ) i = 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i 2 t N X k =1 p k h w k t w ; r F k ( w k t ) i 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i +2 t N X k =1 p k ( F k ( w ) F k ( w k t )) t N X k =1 p k k w k t w k 2 2 t N X k =1 p k F k ( w k t ) F k ( w t )+ L 2 k w t w k t k 2 + F k ( w ) F k ( w k t ) t k N X k =1 p k ( w k t w ) k 2 247 = t L N X k =1 p k k w t w k t k 2 +2 t N X k =1 p k [ F k ( w ) F k ( w t )] t k w t w k 2 = t L N X k =1 p k k w t w k t k 2 2 t N X k =1 p k F k ( w t ) t k w t w k 2 andso E k w t +1 w k 2 E (1 t ) k w t w k 2 2 t F ( w t )+ 2 t k g t k 2 + t L N X k =1 p k k w t w k t k 2 Applyingthisrecursiverelationto k w t w k 2 andusing k w t 1 w k t 1 k 2 0 ,wefurther obtain E k w t +1 w k 2 E (1 t ) (1 t 1 ) k w t 1 w k 2 2 t 1 F ( w t 1 )+ 2 t 1 k g t 1 k 2 2 t F ( w t )+ 2 t k g t k 2 + t L N X k =1 p k k w t w k t k 2 Nowinsteadofbounding P N k =1 p k k w t w k t k 2 usingtheargumentsinthegeneralconvexcase, wefollow[ 127 ]andusethefactthatintheoverparameterizedsetting, w isaminimizerofeach ` ( w ;x j k ) andthateach ` is l -smoothtoobtain kr F k ( w t 1 ;˘ k t 1 ) k 2 2 l ( F k ( w t 1 ;˘ k t 1 ) F k ( w ;˘ k t 1 )) ,whererecall F k ( w ;˘ k t 1 )= ` ( w ;˘ k t 1 ) ,sothat N X k =1 p k k w t w k t k 2 = N X k =1 p k k w t 1 t 1 g t 1 w k t 1 + t 1 g t 1 ;k k 2 = N X k =1 p k 2 t 1 k g t 1 g t 1 ;k k 2 
= 2 t 1 N X k =1 p k ( k g t 1 ;k k 2 k g t 1 k 2 ) 248 = 2 t 1 N X k =1 p k kr F k ( w t 1 ;˘ k t 1 ) k 2 2 t 1 k g t 1 k 2 2 t 1 N X k =1 p k 2 l ( F k ( w t 1 ;˘ k t 1 ) F k ( w ;˘ k t 1 )) 2 t 1 k g t 1 k 2 againusing w t 1 = w k t 1 .Takingexpectationwithrespectto ˘ k t 1 'sandusingthefactthat F ( w )=0 ,wehave E t 1 N X k =1 p k k w t w k t k 2 2 l 2 t 1 N X k =1 p k F k ( w t 1 ) 2 t 1 k g t 1 k 2 =2 l 2 t 1 F ( w t 1 ) 2 t 1 k g t 1 k 2 Notealsothat k g t 1 k 2 = k N X k =1 p k r F k ( w t 1 ;˘ k t 1 ) k 2 while k g t k 2 = k N X k =1 p k r F k ( w k t ;˘ k t ) k 2 2 k N X k =1 p k r F k ( w t ;˘ k t ) k 2 +2 k N X k =1 p k ( r F k ( w t ;˘ k t ) r F k ( w k t ;˘ k t )) k 2 2 k N X k =1 p k r F k ( w t ;˘ k t ) k 2 +2 N X k =1 p k l 2 k w t w k t k 2 Substitutingtheseintotheboundfor k w t +1 w k 2 ,wehave E k w t +1 w k 2 249 E (1 t )((1 t 1 ) k w t 1 w k 2 2 t 1 F ( w t 1 )+ 2 t 1 k g t 1 k 2 ) 2 t F ( w t )+2 2 t k N X k =1 p k r F k ( w t ;˘ k t ) k 2 + 2 l 2 2 t 1 2 t + t 2 t 1 L 2 lF ( w t 1 ) k g t 1 k 2 E k w t +1 w k 2 E (1 t )(1 t 1 ) k w t 1 w k 2 2 t ( F ( w t ) t k N X k =1 p k r F k ( w t ;˘ k t ) k 2 ) 2 t 1 (1 t ) 0 @ (1 l t 1 (2 l 2 2 t + t L ) 1 t ) F ( w t 1 ) t 1 2 k N X k =1 p k r F k ( w t 1 ;˘ k t 1 ) k 2 1 A fromwhichwecanconcludethat E k w t +1 w k 2 (1 t )(1 t 1 ) E k w t 1 w k 2 ifwecanchoose t ; t 1 toguarantee E ( F ( w t ) t k N X k =1 p k r F k ( w t ;˘ k t ) k 2 ) 0 E 0 @ (1 l t 1 (2 l 2 2 t + t L ) 1 t ) F ( w t 1 ) t 1 2 k N X k =1 p k r F k ( w t 1 ;˘ k t 1 ) k 2 1 A 0 250 Notethat E t k N X k =1 p k r F k ( w t ;˘ k t ) k 2 = E t h N X k =1 p k r F k ( w t ;˘ k t ) ; N X k =1 p k r F k ( w t ;˘ k t ) i = N X k =1 p 2 k E t kr F k ( w t ;˘ k t ) k 2 + N X k =1 X j 6 = k p j p k E t hr F k ( w t ;˘ k t ) ; r F j ( w t ;˘ j t ) i = N X k =1 p 2 k E t kr F k ( w t ;˘ k t ) k 2 + N X k =1 X j 6 = k p j p k hr F k ( w t ) ; r F j ( w t ) i = N X k =1 p 2 k E t kr F k ( w t ;˘ k t ) k 2 + N X k =1 N X j =1 p j p k hr F k ( w t ) ; r F j ( w t ) i N X k =1 p 2 k kr F k ( w t ) k 2 E t k N X k =1 p k r F k ( w t ;˘ k t ) k 2 N X k =1 p 2 k E t kr F k ( w t ;˘ k t ) k 2 + k X k p k r F k ( w t ) k 2 1 N min k X k p k r F k ( w t ) k 2 = N X k =1 p 2 k E t kr F k ( w t ;˘ k t ) k 2 +(1 1 N min ) kr F ( w t ) k 2 andsofollowing[ 127 ]ifwelet t = min f qN 2 l max ; 1 q 2 L (1 1 N min ) g fora q 2 [0 ; 1] tobeoptimized 251 later,wehave E t ( F ( w t ) t k N X k =1 p k r F k ( w t ;˘ k t ) k 2 ) E t N X k =1 p k F k ( w t ) t 2 4 N X k =1 p 2 k E t kr F k ( w t ;˘ k t ) k 2 +(1 1 N min ) kr F ( w t ) k 2 3 5 E t N X k =1 p k ( qF k ( w t ;˘ k t ) t 1 N max kr F k ( w t ;˘ k t ) k 2 ) +((1 q ) F ( w t ) t (1 1 N min ) kr F ( w t ) k 2 ) q E t N X k =1 p k ( F k ( w t ;˘ k t ) 1 2 l kr F k ( w t ;˘ k t ) k 2 )+(1 q )( F ( w t ) 1 2 L kr F ( w t ) k 2 ) 0 againusing w optimizes F k ( w ;˘ k t ) with F k ( w ;˘ k t )=0 . Maximizing t = min f qN 2 l max ; 1 q 2 L (1 1 N min ) g over q 2 [0 ; 1] ,weseethat q = l max l max + L ( N min ) resultsinthefastestconvergence,andthistranslatesto t = 1 2 N l max + L ( N min ) .Nextwe claimthat t 1 = c 1 2 N l max + L ( N min ) alsoguarantees E (1 l t 1 (2 l 2 2 t + t L ) 1 t ) F ( w t 1 ) t 1 2 k N X k =1 p k r F k ( w t 1 ;˘ k t 1 ) k 2 0 Notethatbyscaling t 1 byaconstant c 1 ifnecessary,wecanguarantee l t 1 (2 l 2 2 t + t L ) 1 t 1 2 ,andsotheconditionisequivalentto F ( w t 1 ) t 1 k N X k =1 p k r F k ( w t 1 ;˘ k t 1 ) k 2 0 whichwasshowntoholdwith t 1 1 2 N l max + L ( N min ) . 
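Before turning to general E, a quick numerical illustration of why interpolation yields geometric convergence with a constant stepsize may be useful: in the sketch below every device's samples are exactly fit by a common w*, so the stochastic gradient vanishes at the optimum and the error of the averaged iterate keeps contracting instead of plateauing at a variance-dependent floor. The problem sizes and the stepsize are placeholders, not the tuned constants of Theorem 18.

import numpy as np

rng = np.random.default_rng(3)
N, E, d, n_k = 8, 5, 20, 10                  # d > n_k: each device is locally overparameterized
w_star = rng.standard_normal(d)
X = [rng.standard_normal((n_k, d)) for _ in range(N)]
z = [X[k] @ w_star for k in range(N)]        # consistent labels: every local loss is zero at w*

lr = 0.01                                     # constant stepsize (placeholder)
w_local = np.zeros((N, d))
for t in range(2000):
    for k in range(N):
        j = rng.integers(n_k)                 # single-sample stochastic gradient
        xj = X[k][j]
        w_local[k] -= lr * (xj @ w_local[k] - z[k][j]) * xj
    if (t + 1) % E == 0:
        w_local[:] = w_local.mean(0)          # communication round, uniform weights p_k = 1/N
        r = (t + 1) // E
        if r % 80 == 0:
            print(f"round {r:3d}  ||w - w*|| = {np.linalg.norm(w_local[0] - w_star):.2e}")
# the printed error should keep shrinking (roughly geometrically), with no noise floor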
252 Fortheproofofgeneral E 2 ,weusethefollowingtwoidentities: k g t k 2 2 k N X k =1 p k r F k ( w t ;˘ k t ) k 2 +2 N X k =1 p k l 2 k w t w k t k 2 E N X k =1 p k k w t w k t k 2 E 2(1+2 l 2 2 t 1 ) N X k =1 p k k w t 1 w k t 1 k 2 +8 2 t 1 lF ( w t 1 ) 2 2 t 1 k g t 1 k 2 wherethe˝rstinequalityhasbeenestablishedbefore.Toestablishthesecondinequality, notethat N X k =1 p k k w t w k t k 2 = N X k =1 p k k w t 1 t 1 g t 1 w k t 1 + t 1 g t 1 ;k k 2 2 N X k =1 p k k w t 1 w k t 1 k 2 + k t 1 g t 1 t 1 g t 1 ;k k 2 and X k p k k g t 1 ;k g t 1 k 2 = X k p k ( k g t 1 ;k k 2 k g t 1 k 2 ) = X k p k kr F k ( w t 1 ;˘ k t 1 )+ r F k ( w k t 1 ;˘ k t 1 ) r F k ( w t 1 ;˘ k t 1 ) k 2 k g t 1 k 2 2 X k p k kr F k ( w t 1 ;˘ k t 1 ) k 2 + l 2 k w k t 1 w t 1 k 2 k g t 1 k 2 253 sothatusingthe l -smoothnessof ` , E N X k =1 p k k w t w k t k 2 E 2(1+2 l 2 2 t 1 ) N X k =1 p k k w t 1 w k t 1 k 2 +4 2 t 1 X k p k kr F k ( w t 1 ;˘ k t 1 ) k 2 2 2 t 1 k g t 1 k 2 E 2(1+2 l 2 2 t 1 ) N X k =1 p k k w t 1 w k t 1 k 2 +4 2 t 1 2 l X k p k ( F k ( w t 1 ;˘ k t 1 ) F k ( w ;˘ k t 1 )) 2 2 t 1 k g t 1 k 2 = E 2(1+2 l 2 2 t 1 ) N X k =1 p k k w t 1 w k t 1 k 2 +8 2 t 1 lF ( w t 1 ) 2 2 t 1 k g t 1 k 2 Usingthe˝rstinequality,wehave E k w t +1 w k 2 E (1 t ) k w t w k 2 2 t F ( w t )+2 2 t k N X k =1 p k r F k ( w t ;˘ k t ) k 2 +(2 2 t l 2 + t L ) N X k =1 p k k w t w k t k 2 andwechoose t and t 1 suchthat E ( F ( w t ) t k P N k =1 p k r F k ( w t ;˘ k t ) k 2 ) 0 and (2 2 t l 2 + t L ) (1 t )(2 2 t 1 l 2 + t 1 L ) = 3 .Thisgives E k w t +1 w k 2 E (1 t )[(1 t 1 ) k w t 1 w k 2 2 t 1 F ( w t 1 ) +2 2 t 1 k N X k =1 p k r F k ( w t 1 ;˘ k t 1 ) k 2 254 +(2 2 t 1 l 2 + t 1 L )( N X k =1 p k k w t 1 w k t 1 k 2 + N X k =1 p k k w t w k t k 2 ) = 3] Usingthesecondinequality N X k =1 p k k w t w k t k 2 E 2(1+2 l 2 2 t 1 ) N X k =1 p k k w t 1 w k t 1 k 2 +8 2 t 1 lF ( w t 1 ) 2 2 t 1 k g t 1 k 2 andthat 2(1+2 l 2 2 t 1 ) 3 , 2 2 t 1 l 2 + t 1 L 1 ,wehave E k w t +1 w k 2 E (1 t )[(1 t 1 ) k w t 1 w k 2 2 t 1 F ( w t 1 )+2 2 t 1 k N X k =1 p k r F k ( w t 1 ;˘ k t 1 ) k 2 +8 2 t 1 lF ( w t 1 ) +(2 2 t 1 l 2 + t 1 L )(2 N X k =1 p k k w t 1 w k t 1 k 2 )] andif t 1 ischosensuchthat ( F ( w t 1 ) 4 t 1 lF ( w t 1 )) t 1 k N X k =1 p k r F k ( w t 1 ;˘ k t 1 ) k 2 0 and (2 2 t 1 l 2 + t 1 L )(1 t 1 ) (2 2 t 2 l 2 + t 2 L ) = 3 255 weagainhave E k w t +1 w k 2 E (1 t )(1 t 1 )[ k w t 1 w k 2 +(2 2 t 2 l 2 + t 2 L ) (2 N X k =1 p k k w t 1 w k t 1 k 2 ) = 3] Applyingtheabovederivationiteratively ˝t E ,from 256 whichwecanconcludethat E k w t w k 2 ( t t 0 1 Y ˝ =1 (1 t ˝ )) k w t 0 w k 2 (1 c E N l max + L ( N min ) ) t t 0 k w t 0 w k 2 andapplyingthisinequalitytoiterationsbetweeneachcommunicationround, E k w t w k 2 (1 c E N l max + L ( N min ) ) t k w 0 w k 2 = O (exp( E N l max + L ( N min ) t )) k w 0 w k 2 Withpartialparticipation,wenotethat E k w t +1 w k 2 = E k w t +1 v t +1 + v t +1 w k 2 = E k w t +1 v t +1 k 2 + E k v t +1 w k 2 = 1 K X k p k E k w k t +1 w t +1 k 2 + E k v t +1 w k 2 andsotherecursiveidentitybecomes E k w t +1 w k 2 E (1 t ) (1 t ˝ +1 )[(1 t ˝ ) k w t ˝ w k 2 2 t ˝ F ( w t ˝ )+2 2 t ˝ k N X k =1 p k r F k ( w t ˝ ;˘ k t ˝ ) k 2 +8 ˝ 2 t ˝ lF ( w t ˝ ) +(2 2 t ˝ l 2 + t ˝ L + 1 K )(( ˝ +1) N X k =1 p k k w t ˝ w k t ˝ k 2 )] 257 whichrequires (2 2 t ˝ l 2 + t ˝ L + 1 K )(1 t ˝ ) (2 2 t ˝ 1 l 2 + t ˝ 1 L + 1 K ) = 3 2(1+2 l 2 2 t ˝ ) 3 2 2 t ˝ l 2 + t ˝ L + 1 K 1 ( F ( w t ˝ ) 4 ˝ t ˝ lF ( w t ˝ )) t ˝ k N X k =1 p k r F k ( w t ˝ ;˘ k t ˝ ) k 2 0 tohold.Againsetting t ˝ = c 1 ˝ +1 N l max + 
L ( N min ) forapossiblydi˙erentconstantfrom beforesatis˝estherequirements. Finally,usingthe L -smoothnessof F , F ( w T ) F ( w ) L 2 E k w T w k 2 = O ( L exp( E N l max + L ( N min ) T )) k w 0 w k 2 GeometricConvergenceofFedAvgforOverparameterizedLinearRe- gression We˝rstprovidedetailsonquantitiesusedintheproofofresultsonlinearregressionin Section6.5inthemaintext.Thelocaldeviceobjectivesarenowgivenbythesumof squares F k ( w )= 1 2 n k P n k j =1 ( w T x j k z j k ) 2 ,andthereexists w suchthat F ( w ) 0 .De˝ne thelocalHessianmatrixas H k := 1 n k P n k j =1 x j k ( x j k ) T ,andthestochasticHessianmatrix as ~ H k t := ˘ k t ( ˘ k t ) T ,where ˘ k t isthestochasticsampleonthe k thdeviceattime t .De˝ne l tobethesmallestpositivenumbersuchthat E k ˘ k t k 2 ˘ k t ( ˘ k t ) T l H k forall k .Notethat 258 l max k;j k x j k k 2 .Let L and belowerandupperboundsofnon-zeroeigenvaluesof H k . De˝ne 1 := l and := . Following[ 119 , 87 ],wede˝nethestatisticalconditionnumber ~ asthesmallestpositive realnumbersuchthat E P k p k ~ H k t H 1 ~ H k t ~ H .Theconditionnumbers 1 and ~ are importantinthecharacterizationofconvergenceratesforFedAvgalgorithms.Notethat 1 > and 1 > ~ . Let H = P k p k H k .Ingeneral H haszeroeigenvalues.However,becausethenullspaceof H andrangeof H areorthogonal,inoursubsequenceanalysisitsu˚cestoproject w t w ontotherangeof H ,thuswemayrestricttothenon-zeroeigenvalueof H . Ausefulobservationisthatwecanuse w T x j k z j k 0 torewritethelocalobjectivesas F k ( w )= 1 2 h w w ; H k ( w w ) i 1 2 k w w k 2 H k : F k ( w )= 1 2 n k n k X j =1 ( w T x k;j z k;j ( w T x k;j z k;j )) 2 = 1 2 n k n k X j =1 (( w w ) T x k;j ) 2 = 1 2 h w w ; H k ( w w ) i = 1 2 k w w k 2 H k sothat F ( w )= 1 2 k w w k 2 H . Finally,notethat E ~ H k t = 1 n k P n k j =1 x j k ( x j k ) T = H k and g t;k = ~ H k t ( w k t w ) while g t = P N k =1 p k r F k ( w k t ;˘ k t )= P N k =1 p k ~ H k t ( w k t w ) and g t = P N k =1 p k H k ( w k t w ) Theorem19. Fortheoverparamterizedlinearregressionproblem,FedAvgwithcommuni- cationevery E iterationswithconstantstepsize = O ( 1 E N l max + ( N min ) ) hasgeometric 259 convergence: E F ( w T ) O L exp( NT E ( max 1 +( N min )) ) k w 0 w k 2 : Proof. 
Weagainshowtheresult˝rstwhen E =2 and t 1 isacommunicationround.We have k w t +1 w k 2 = k ( w t t g t ) w k 2 = k w t w k 2 2 t h w t w ; g t i + 2 t k g t k 2 and 2 t E t h w t w ; g t i = 2 t N X k =1 p k h w t w ; r F k ( w k t ) i = 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i 2 t N X k =1 p k h w k t w ; r F k ( w k t ) i = 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i 2 t N X k =1 p k h w k t w ; H k ( w k t w ) i = 2 t N X k =1 p k h w t w k t ; r F k ( w k t ) i 4 t N X k =1 p k F k ( w k t ) 2 t N X k =1 p k ( F k ( w k t ) F k ( w t )+ L 2 k w t w k t k 2 ) 4 t N X k =1 p k F k ( w k t ) = t L N X k =1 p k k w t w k t k 2 2 t N X k =1 p k F k ( w t ) 2 t N X k =1 p k F k ( w k t ) = t L N X k =1 p k k w t w k t k 2 t N X k =1 p k h ( w t w ) ; H k ( w t w ) i 2 t N X k =1 p k F k ( w k t ) 260 and k g t k 2 = k N X k =1 p k ~ H k t ( w k t w ) k 2 = k N X k =1 p k ~ H k t ( w t w )+ N X k =1 p k ~ H k t ( w k t w t ) k 2 2 k N X k =1 p k ~ H k t ( w t w ) k 2 +2 k N X k =1 p k ~ H k t ( w k t w t ) k 2 whichgives E k w t +1 w k 2 E k w t w k 2 t N X k =1 p k h w t w ; H k w t w i +2 2 t k N X k =1 p k ~ H k t ( w t w ) k 2 + t L N X k =1 p k k w t w k t k 2 +2 2 t k N X k =1 p k ~ H k t ( w k t w t ) k 2 2 t N X k =1 p k F k ( w k t ) following[127]we˝rstprovethat E k w t w k 2 t N X k =1 p k h ( w t w ) ; H k ( w t w ) i +2 2 t k N X k =1 p k ~ H k t ( w t w ) k 2 (1 N 8( max 1 +( N min )) ) E k w t w k 2 withappropriatelychosen t .Comparedtotherate O ( N max 1 +( N min ) ) forgeneral stronglyconvexandsmoothobjectives,thisisanimprovementaslinearspeedupisnow availableforalargerrangeof N . 261 Wehave E t k N X k =1 p k ~ H k t ( w t w ) k 2 = E t h N X k =1 p k ~ H k t ( w t w ) ; N X k =1 p k ~ H k t ( w t w ) i = N X k =1 p 2 k E t k ~ H k t ( w t w ) k 2 + N X k =1 X j 6 = k p j p k E t h ~ H k t ( w t w ) ; ~ H j t ( w t w ) i = N X k =1 p 2 k E t k ~ H k t ( w t w ) k 2 + N X k =1 X j 6 = k p j p k E t h H k ( w t w ) ; H j ( w t w ) i = N X k =1 p 2 k E t k ~ H k t ( w t w ) k 2 + N X k =1 N X j =1 p j p k E t h H k ( w t w ) ; H j ( w t w ) i N X k =1 p 2 k k H k ( w t w ) k 2 = N X k =1 p 2 k E t k ~ H k t ( w t w ) k 2 + k X k p k H k ( w t w ) k 2 N X k =1 p 2 k k H k ( w t w ) k 2 N X k =1 p 2 k E t k ~ H k t ( w t w ) k 2 + k X k p k H k ( w t w ) k 2 1 N min k X k p k H k ( w t w ) k 2 1 N max N X k =1 p k E t k ~ H k t ( w t w ) k 2 +(1 1 N min ) k X k p k H k ( w t w ) k 2 1 N max l N X k =1 p k h ( w t w ) ; H k ( w t w ) i +(1 1 N min ) k X k p k H k ( w t w ) k 2 = 1 N max l h ( w t w ) ; H ( w t w ) i +(1 1 N min ) h w t w ; H 2 ( w t w ) i using k ~ H k t k l . Nowwehave E k w t w k 2 t N X k =1 p k h ( w t w ) ; H k ( w t w ) i +2 2 t k N X k =1 p k ~ H k t ( w t w ) k 2 = 262 h w t w ; ( I t H +2 2 t ( max l N H + N min N H 2 ))( w t w ) i anditremainstoboundthemaximumeigenvalueof ( I t H +2 2 t ( max l N H + N min N H 2 )) andweboundthisfollowing[127].Ifwechoose t < N 2( max l +( N min ) L ) ,then t H +2 2 t ( max l N H + N min N H 2 ) ˚ 0 andtheconvergencerateisgivenbythemaximumof 1 t +2 2 t ( max l N + N min N 2 ) maximizedoverthenon-zeroeigenvalues of H .Toselectthestepsize t thatgivesthe smallestupperbound,wethenminimizeover t ,resultingin min t < N 2( max l +( N min ) L ) max 0: 9 v; H v = ˆ 1 t +2 2 t ( max l N + N min N 2 ) ˙ Sincetheobjectiveisquadraticin ,themaximumisachievedateitherthelargesteigenvalue max of H orthesmallestnon-zeroeigenvalue min of H . 
When N 4 max l L min +4 min ,i.e.when N = O ( l min )= O ( 1 ) ,theoptimalobjective valueisachievedat min andtheoptimalstepsizeisgivenby t = N 4( max l +( N min ) min ) . Theoptimalconvergencerate(i.e.theoptimalobjectivevalue)isequalto 1 1 8 N min ( max l +( N min ) min ) =1 1 8 N ( max 1 +( N min )) : Thisimpliesthatwhen N = O ( 1 ) ,theoptimalconvergenceratehasalinearspeedupin N . 263 When N islarger,thisstepsizeisnolongeroptimal,butwestillhave 1 1 8 N ( max 1 +( N min )) asanupperboundontheconvergencerate. Nowwehaveproved E k w t +1 w k 2 (1 1 8 N ( max 1 +( N min )) ) E k w t w k 2 + t L N X k =1 p k k w t w k t k 2 +2 2 t k N X k =1 p k ~ H k t ( w k t w t ) k 2 2 t N X k =1 p k F k ( w k t ) Nextweboundtermsinthesecondlineusingasimilarargumentasthegeneralcase.We have 2 2 t k N X k =1 p k ~ H k t ( w k t w t ) k 2 2 2 t l 2 N X k =1 p k k w t w k t k 2 and E N X k =1 p k k w t w k t k 2 E 2(1+2 l 2 2 t 1 ) N X k =1 p k k w t 1 w k t 1 k 2 +8 2 t 1 lF ( w t 1 ) =4 2 t 1 l h w t 1 w ; H ( w t 1 w ) i andif t ; t 1 satisfy t L +2 2 t (1 1 8 N ( max 1 +( N min )) )( t 1 L +2 2 t 1 ) = 3 2(1+2 l 2 2 t 1 ) 3 t L +2 2 t 1 264 wehave E k w t +1 w k 2 (1 1 8 N ( max 1 +( N min )) )[ E k w t 1 w k 2 t h w t 1 w ; H w t 1 w i +2 2 t k N X k =1 p k ~ H k t ( w t w ) k 2 +( t 1 L +2 2 t 1 ) 2 N X k =1 p k k w t 1 w k t 1 k 2 +4 2 t 1 l h w t 1 w ; H ( w t 1 w ) i ] andagainbychoosing t 1 = c N 8( max l +( N min ) min ) forasmallconstant c ,wecan guaranteethat E k w t 1 w k 2 t 1 h w t 1 w ; H w t 1 w i +2 2 t 1 k N X k =1 p k ~ H k t 1 ( w t 1 w ) k 2 +4 2 t 1 l h w t 1 w ; H ( w t 1 w ) i (1 c N 16( max l +( N min ) min ) ) E k w t 1 w k 2 Forgeneral E ,wehavetherecursiverelation E k w t +1 w k 2 E (1 c 1 8 N ( max 1 +( N min )) ) (1 c 1 8 ˝ N ( max 1 +( N min )) )[ k w t ˝ w k 2 t ˝ h w t ˝ w ; H w t ˝ w i +2 2 t ˝ k N X k =1 p k ~ H k t ˝ ( w t ˝ w ) k 2 +4 ˝ 2 t 1 l h w t 1 w ; H ( w t 1 w ) i +(2 2 t ˝ l 2 + t ˝ L )(( ˝ +1) N X k =1 p k k w t ˝ w k t ˝ k 2 )] 265 aslongasthestepsizesarechosen t ˝ = c N 4 ˝ ( max l +( N min ) min ) suchthatthefollowing inequalitieshold (2 2 t ˝ l 2 + t ˝ L ) (1 t ˝ )(2 2 t ˝ 1 l 2 + t ˝ 1 L ) = 3 2(1+2 l 2 2 t ˝ ) 3 2 2 t ˝ l 2 + t ˝ L 1 and k w t ˝ w k 2 t ˝ h w t ˝ w ; H w t ˝ w i +2 2 t ˝ k N X k =1 p k ~ H k t ˝ ( w t ˝ w ) k 2 +4 ˝ 2 t 1 l h w t 1 w ; H ( w t 1 w ) i (1 c N 8( ˝ +1)( max 1 +( N min )) ) E k w t ˝ w k 2 whichgives E k w t w k 2 (1 c 1 8 E N ( max 1 +( N min )) ) t k w 0 w k 2 = O (exp( 1 E N ( max 1 +( N min )) t )) k w 0 w k 2 andwithpartialparticipation,thesameboundholdswithapossiblydi˙erentchoiceof c . 266 GeometricConvergenceofFedMaSSforOverparameterizedLinear Regression Theorem20. Fortheoverparamterizedlinearregressionproblem,FedMaSSwithcom- municationevery E iterationsandconstantstepsizes 1 = O ( 1 E N l max + ( N min ) ) ; 2 = 1 (1 1 ~ ) 1+ 1 p 1 ~ ; = 1 1 p 1 ~ 1+ 1 p 1 ~ hasgeometricconvergence: E F ( w T ) O L exp( NT E ( max p 1 ~ +( N min )) ) k w 0 w k 2 : Proof. Theproofisbasedonresultsin[ 119 ]whichoriginallyproposedtheMaSSalgorithm. 
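Since several symbols in the statement of Theorem 20 and in the equations that follow were lost in extraction, we also give a hedged sketch of the three-sequence local update written out in the next equations: a MaSS-style accelerated SGD step per device, with the local sequences averaged at communication rounds. The coefficients alpha, delta, and eta below are placeholders standing in for the stepsizes named in Theorem 20, the averaging of all three sequences is a simplification, and the interpolating regression problem is illustrative only.

import numpy as np

rng = np.random.default_rng(4)
N, E, d, n_k = 8, 5, 20, 10
w_star = rng.standard_normal(d)
X = [rng.standard_normal((n_k, d)) for _ in range(N)]
z = [X[k] @ w_star for k in range(N)]         # interpolation: all local losses vanish at w*

eta, delta, alpha = 0.01, 0.005, 0.1           # placeholder stepsizes (see Theorem 20 for the tuned ones)
w = np.zeros((N, d)); v = np.zeros((N, d)); u = np.zeros((N, d))
for t in range(2000):
    for k in range(N):
        j = rng.integers(n_k)
        xj = X[k][j]
        g = (xj @ u[k] - z[k][j]) * xj                      # stochastic gradient taken at u^k_t
        v[k] = (1 - alpha) * v[k] + alpha * u[k] - delta * g
        w[k] = u[k] - eta * g
        u[k] = (alpha * v[k] + w[k]) / (1 + alpha)          # coupling of the two sequences
    if (t + 1) % E == 0:                                    # communication round
        w[:] = w.mean(0); v[:] = v.mean(0); u[:] = u.mean(0)   # (all three averaged for simplicity)
        r = (t + 1) // E
        if r % 80 == 0:
            print(f"round {r:3d}  ||w - w*|| = {np.linalg.norm(w[0] - w_star):.2e}")
# the error should decrease; the geometric rate of Theorem 20 requires the tuned stepsizes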
Notethattheupdatecanequivalentlybewrittenas v k t +1 =(1 k ) v k t + k u k t k g t;k w k t +1 = 8 > > > < > > > : u k t k g t;k if t +1 = 2I E P N k =1 p k h u k t k g t;k i if t +1 2I E u k t +1 = k 1+ k v k t +1 + 1 1+ k w k t +1 wherethereisabijectionbetweentheparameters 1 k 1+ k = k ; k = k 1 ; k k k 1+ k = k 2 ,and wefurtherintroduceanauxiliaryparameter v k t ,whichisinitializedat v k 0 .Wealsonotethat when k = k k ,theupdatereducestotheNesterovacceleratedSGD.Thisversionofthe FedAvgalgorithmwithlocalMaSSupdatesisusedforanalyzingthegeometricconvergence. Asbefore,de˝nethevirtualsequences w t = P N k =1 p k w k t , v t = P N k =1 p k v k t , u t = P N k =1 p k u k t ,and g t = P N k =1 p k E g t;k .Wehave E g t = g t and w t +1 = u t t g t , v t +1 = 267 (1 k ) v t + k w t k g t ,and u t +1 = k 1+ k v t +1 + 1 1+ k w t +1 . We˝rstprovethetheoremwith E =2 and t 1 beingacommunicationround.Wehave k v t +1 w k 2 H 1 = k (1 ) v t + u t X k p k ~ H k t ( u k t w ) w k 2 H 1 = k (1 ) v t + u t w k 2 H 1 + 2 k X k p k ~ H k t ( u k t w ) k 2 H 1 2 h X k p k ~ H k t ( u k t w ) ; (1 ) v t + u t w i H 1 k (1 ) v t + u t w k 2 H 1 | {z } A +2 2 k X k p k ~ H k t ( u t w ) k 2 H 1 | {z } B +2 2 k X k p k ~ H k t ( u t u k t ) k 2 H 1 2 h X k p k ~ H k t ( u t w ) ; (1 ) v t + u t w i H 1 | {z } C 2 h X k p k ~ H k t ( u k t u t ) ; (1 ) v t + u t w i H 1 Followingtheproofin[119], E A E (1 ) k v t w k 2 H 1 + k u t w k 2 H 1 E (1 ) k v t w k 2 H 1 + k u t w k 2 usingtheconvexityofthenorm kk H 1 andthat isthesmallestnon-zeroeigenvalueof H . Now E B 2 2 ( max 1 N ~ + N min N ) k ( u t w ) k 2 H 268 usingthefolowingbound: E X k p k ~ H k t ! H 1 X k p k ~ H k t ! = E X k p 2 k ~ H k t H 1 ~ H k t + X k 6 = j p k p j ~ H k t H 1 ~ H j t max 1 N E X k p k ~ H k t H 1 ~ H k t + X k 6 = j p k p j H k H 1 H j = max 1 N E X k p k ~ H k t H 1 ~ H k t + X k;j p k p j H k H 1 H j X k p 2 k H k H 1 H k max 1 N E X k p k ~ H k t H 1 ~ H k t + H 1 N min X k p k H k H 1 H k max 1 N E X k p k ~ H k t H 1 ~ H k t + H 1 N min ( X k p k H k ) H 1 ( X k p k H k ) = max 1 N E X k p k ~ H k t H 1 ~ H k t + N min N H max 1 N ~ H + N min N H wherewehaveused E P k p k ~ H k t H 1 ~ H k t ~ H byde˝nitionof ~ andtheoperatorconvexity ofthemapping W ! W H 1 W . Finally, E C = E 2 h X k p k ~ H k t ( u t w ) ; (1 ) v t + u t w i H 1 = 2 h X k p k H k ( u t w ) ; (1 ) v t + u t w i H 1 = 2 h ( u t w ) ; (1 ) v t + u t w i = 2 h ( u t w ) ; u t w + 1 ( u t w t ) i = 2 k u t w k 2 + 1 ( k w t w k 2 k u t w k 2 k w t u t k 2 ) 269 1 k w t w k 2 1 k u t w k 2 wherewehaveused (1 ) v t + u t =(1 )((1+ ) u t w t ) + u t = 1 u t 1 w t andtheidentitythat 2 h a ; b i = k a k 2 + k b k 2 k a + b k 2 . Itfollowsthat E k v t +1 w k 2 H 1 (1 ) k v t w k 2 H 1 + 1 k w t w k 2 +( 1 ) k u t w k 2 +2 2 ( max 1 N ~ + N min N ) k ( u t w ) k 2 H +2 2 k X k p k ~ H k t ( u t u k t ) k 2 H 1 2 h X k p k ~ H k t ( u k t u t ) ; (1 ) v t + u t w i H 1 Ontheotherhand, E k w t +1 w k 2 = E k u t w X k p k ~ H k t ( u t w ) k 2 = E k u t w k 2 2 k u t w k 2 H + 2 k X k p k ~ H k t ( u t w ) k 2 E k u t w k 2 2 k u t w k 2 H + 2 ( max 1 N ` + L N min N ) k u t w k 2 270 whereweusethefollowingbound: E X k p k ~ H k t ! X k p k ~ H k t ! 
= E X k p 2 k ~ H k t ~ H k t + X k 6 = j p k p j ~ H k t ~ H j t max 1 N E X k p k ~ H k t ~ H k t + X k 6 = j p k p j H k H j = max 1 N E X k p k ~ H k t ~ H k t + X k;j p k p j H k H j X k p 2 k H k H k max 1 N E X k p k ~ H k t ~ H k t + H 2 1 N min X k p k H k H k max 1 N E X k p k ~ H k t ~ H k t + H 2 1 N min ( X k p k H k )( X k p k H k ) = max 1 N E X k p k ~ H k t ~ H k t + N min N H 2 max 1 N l H + L N min N H againusingthat W ! W 2 isoperatorconvexandthat E ~ H k t ~ H k t l H k byde˝nitionof l . Combiningtheboundsfor E k w t +1 w k 2 and E k v t +1 w k 2 H 1 , E k w t +1 w k 2 + k v t +1 w k 2 H 1 (1 ) k v t w k 2 H 1 + 1 k w t w k 2 +( ) k u t w k 2 +(2 2 ( max 1 N ~ + N min N ) 2 + 2 ( max 1 N l + L N min N ) ) k u t w k 2 +2 2 k X k p k ~ H k t ( u t u k t ) k 2 H 1 + L X k p k k ( u t u k t ) k 2 H 1 271 Following[119]ifwechoosestepsizessothat 0 2 2 ( max 1 N ~ + N min N ) 2 + 2 ( max 1 N l + L N min N ) 0 orequivalently 2 ( max 1 N ~ + N min N )+ ( ( max 1 N l + L N min N ) 2) 0 thesecondandthirdtermsarenegative.Tooptimizethestepsizes,notethatthetwo inequalitiesimply 2 (2 ( max 1 N l + L N min N )) 2( max 1 N ~ + N min N ) andmaximizingtherighthandsidewithrespectto ,whichisquadratic,weseethat 1 = ( max 1 N l + L N min N ) maximizestherighthandside,with 1 q 2( max 1 N 1 + N min N )( max 1 N ~ + N min N ) = ( max 1 N ~ + N min N ) Notethat = 1 r 2( max 1 N 1 + N min N )( max 1 N ~ + N min N ) = O ( N p 1 ~ ) when N = O (min f ~ 1 g ) . Finally,todealwiththeterms 2 2 k P k p k ~ H k t ( u t u k t ) k 2 H 1 + L P k p k k ( u t u k t ) k 2 H 1 ,we 272 canuseJensen 2 2 k X k p k ~ H k t ( u t u k t ) k 2 H 1 + L X k p k k ( u t u k t ) k 2 H 1 (2 2 l 2 + L ) X k p k k u t u k t k 2 H 1 =(2 2 l 2 + L ) X k p k k 1+ v t + 1 1+ w t ( 1+ v k t + 1 1+ w k t ) k 2 H 1 (2 2 l 2 + L )(2( 1+ ) 2 2 +2( 1 1+ ) 2 2 ) X k p k k ~ H k t 1 ( u t 1 w ) k 2 (2 2 l 2 + L )(2( 1+ ) 2 2 +2( 1 1+ ) 2 2 ) l 2 k ( u t 1 w ) k 2 whichcanbecombinedwiththetermswith k ( u t 1 w ) k 2 intherecursiveexpansionof E k w t w k 2 + k v t w k 2 H 1 : E k w t w k 2 + k v t w k 2 H 1 (1 ) k v t 1 w k 2 H 1 + 1 k w t 1 w k 2 +( ) k u t 1 w k 2 +(2 2 ( max 1 N ~ + N min N ) 2 + 2 ( max 1 N l + L N min N ) ) k u t 1 w k 2 andthestepsizescanbechosensothattheresultingcoe˚cientsarenegative.Therefore,we haveshownthat E k w t +1 w k 2 (1 ) 2 k w t 1 w k 2 where = 1 r 2( max 1 N 1 + N min N )( max 1 N ~ + N min N ) = O ( N max p 1 ~ + N min ) when N = O (min f ~ 1 g ) . 273 Forgeneral E> 1 ,choosing = c=E ( max 1 N l + L N min N ) forsomesmallconstant c resultsin = O ( 1 E r ( max 1 N 1 + N min N )( max 1 N ~ + N min N ) ) andthisguaranteesthat E k w t w k 2 (1 ) t k w 0 w k 2 forall t . DetailsonExperimentsandAdditionalResults Wedescribedthepreciseproceduretoreproducetheresultsinthischapter.Aswementioned inSection6.6,weempiricallyveri˝edthelinearspeeduponvariousconvexsettingsforboth FedAvganditsacceleratedvariants.Foralltheresults,wesetrandomseedsas 0 ; 1 ; 2 and reportthebestconvergencerateacrossthethreefolds.Foreachrun,weinitialize w 0 = 0 andmeasurethenumberofiterationtoreachthetargetaccuracy .Weusethesmall-scale datasetw8a[ 155 ],whichconsistsof n =49749 sampleswithfeaturedimension d =300 . Thelabeliseitherpositiveoneornegativeone.Thedatasethassparsebinaryfeaturesin f 0 ; 1 g .Eachsamplehas11.15non-zerofeaturevaluesoutof 300 featuresonaverage.We setthebatchsizeequaltofouracrossallexperiments.Inthenextfollowingsubsections,we introduceparametersearchingineachobjectiveseparately. 
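The measurement protocol just described can be summarized by the following sketch (not the exact experiment script): w8a is split evenly across K devices, every device runs mini-batch SGD with batch size four on the regularized logistic loss used in the next subsection, iterates are averaged every E steps, and we record the first iteration at which the full objective is within epsilon of f*. The dataset path is a placeholder, and the schedule constants in the commented example are taken from the search ranges described below.

import numpy as np
from sklearn.datasets import load_svmlight_file

# Placeholder path: assumes a local copy of the w8a dataset in LIBSVM format.
X, y = load_svmlight_file("w8a")           # X: (49749, 300) sparse, labels y in {-1, +1}
X = X.toarray()                            # small enough to densify for a sketch
n, d = X.shape
lam = 1.0 / n                              # regularization used for the strongly convex runs

def full_loss(w):
    margins = y * (X @ w)
    return np.logaddexp(0.0, -margins).mean() + 0.5 * lam * np.dot(w, w)

def run_fedavg(K, E, eta0, c, f_star, eps, T_max=200000, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.array_split(rng.permutation(n), K)      # even split of samples across K devices
    w_local = np.zeros((K, d))                       # w_0 = 0 on every device
    for t in range(T_max):
        eta = min(eta0, n * c / (1.0 + t))           # decaying stepsize schedule from this appendix
        for k in range(K):
            b = rng.choice(idx[k], size=4, replace=False)       # batch size 4
            margins = y[b] * (X[b] @ w_local[k])
            coef = -y[b] / (1.0 + np.exp(margins))              # derivative of log(1 + exp(-y x.w))
            g = X[b].T @ coef / len(b) + lam * w_local[k]
            w_local[k] -= eta * g
        if (t + 1) % E == 0:                          # communication round, full participation
            w_local[:] = w_local.mean(0)
            if full_loss(w_local[0]) - f_star <= eps:
                return t + 1                          # iterations needed to reach eps-accuracy
    return T_max

# Example call: iterations to reach eps = 0.005 with K devices, f_star taken from the text.
# print(run_fedavg(K=8, E=4, eta0=1.0, c=1/8, f_star=0.126433176216545, eps=0.005))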
StronglyConvexObjectives We˝rstconsiderthestronglyconvexobjectivefunction, whereweusearegularizedbinarylogisticregressionwithregularization =1 =n ˇ 2 e 5 . Weevenlydistributedon 1 ; 2 ; 4 ; 8 ; 16 ; 32 devicesandreportthenumberofiterations/rounds neededtoconvergeto accuracy,where =0 : 005 .Theoptimalobjectivefunctionvalue f 274 issetas f =0 : 126433176216545 .Thisisdeterminednumericallyandwefollowthesetting in[ 178 ].Thelearningrateisdecayedasthe t = min ( 0 ; nc 1+ t ) ,whereweextensivelysearch thebestlearningrate c 2f 2 1 c 0 ; 2 2 c 0 ;c 0 ; 2 c 0 ; 2 2 c 0 g .Inthiscase,wesearchtheinitial learningrate 0 2f 1 ; 32 g and c 0 =1 = 8 . ConvexSmoothObjectives Wealsousebinarylogisticregressionwithoutregu- larization.Thesettingisalmostsameasitsregularizedcounterpart.Wealsoevenly distributedallthesampleson 1 ; 2 ; 4 ; 8 ; 16 ; 32 devices.The˝gureshowsthenumberof iterationsneededtoconvergeto accuracy,where =0 : 02 .Theoptiamlobjectivefunc- tionvalueissetas f =0 : 11379089057514849 ,determinednumerically.Thelearningrate isdecayedasthe t = min ( 0 ; nc 1+ t ) ,whereweextensivelysearchthebestlearningrate c 2f 2 1 c 0 ; 2 2 c 0 ;c 0 ; 2 c 0 ; 2 2 c 0 g .Inthiscase,wesearchtheinitiallearningrate 0 2f 1 ; 32 g and c 0 =1 = 8 . Linearregression Forlinearregression,weusethesamefeaturevectorsfromw8a datasetandgenerategroundtruth [ w ;b ] fromamultivariatenormaldistributionwithzero meanandstandarddeviationone.Thenwegeneratelabelbasedon y i = x t i w + b .This procedurewillensurewesatisfytheover-parameterizedsettingasrequiredinourtheorems. Wealsoevenlydistributedallthesampleson 1 ; 2 ; 4 ; 8 ; 16 ; 32 devices.The˝gureshowsthe numberofiterationsneededtoconvergeto accuracy,where =0 : 02 .Theoptiamlobjective functionvalueis f =0 .Thelearningrateisdecayedasthe t = min ( 0 ; nc 1+ t ) ,wherewe extensivelysearchthebestlearningrate c 2f 2 1 c 0 ; 2 2 c 0 ;c 0 ; 2 c 0 ; 2 2 c 0 g .Inthiscase,we searchtheinitiallearningrate 0 2f 0 : 1 ; 0 : 12 g and c 0 =1 = 256 . PartialParticipation ToexaminethelinearspeedupofFedAvginpartialparticipation setting,weevenlydistributeddataon 4 ; 8 ; 16 ; 32 ; 64 ; 128 devicesanduniformlysample 50% deviceswithoutreplacement.Allotherhyperparametersarethesameasprevioussections. 275 (a)Stronglyconvexobjective(b)Convexsmoothobjective(c)Linearregression FigureB.1:TheconvergenceofFedAvgw.r.tthenumberoflocalsteps E . NesterovacceleratedFedAvg TheexperimentsofNesterovacceleratedFedAvg(see updateformulabelow)usesthesamesettingaspreviousthreesectionsforvaniliaFedAvg. y k t +1 = w k t t g t;k w k t +1 = 8 > > > < > > > : y k t +1 + t ( y k t +1 y k t ) if t +1 = 2I E P k 2S t +1 y k t +1 + t ( y k t +1 y k t ) if t +1 2I E Weset t =0 : 1 andsearch t inthesamewayas t inFedAvg. Theimpactof E . Inthissubsection,wefurtherexaminehowdoesthenumberoflocal steps( E )a˙ectconvergence.AsshowninFigureB.1,thenumberofiterationsincreasesas E increase,whichslowdowntheconvergenceintermsofgradientcomputation.However, itcansavecommunicationcostsasthenumberofroundsdecreasedwhenthe E increases. Thisshowcasethatweneedaproperchoiceof E totrade-o˙thecommunicationcostand convergencespeed. 276 BIBLIOGRAPHY 277 BIBLIOGRAPHY [1] AbbasAbdolmaleki,JostTobiasSpringenberg,YuvalTassa,RemiMunos,Nicolas Heess,andMartinRiedmiller.Maximumaposterioripolicyoptimisation. arXiv preprintarXiv:1806.06920 ,2018. [2] NaokiAbe,PremMelville,CezarPendus,ChandanKReddy,DavidLJensen,VinceP Thomas,JamesJBennett,GaryFAnderson,BrentRCooley,MelissaKowalczyk,etal. 