COLLABORATIVELEARNING:THEORY,ALGORITHMS,AND
APPLICATIONS
By
KaixiangLin
ADISSERTATION
Submittedto
MichiganStateUniversity
inpartialful˝llmentoftherequirements
forthedegreeof
ComputerScienceDoctorofPhilosophy
2020
ABSTRACT
COLLABORATIVELEARNING:THEORY,ALGORITHMS,ANDAPPLICATIONS
By
KaixiangLin
Humanintelligenceprosperswiththeadvantageofcollaboration.Tosolveoneoraset
ofchallengingtasks,wecane˙ectivelyinteractwithpeers,fuseknowledgefromdi˙erent
sources,continuouslyinspire,contribute,anddeveloptheexpertiseforthebene˝tofthe
sharedobjectives.Humancollaborationis˛exible,adaptive,andscalableintermsofvarious
cooperativeconstructions,collaboratingacrossinterdisciplinary,evenseeminglyunrelated
domains,andbuildinglarge-scaledisciplinedorganizationsforextremelycomplextasks.On
theotherhand,whilemachineintelligenceachievedtremendoussuccessinthepastdecade,
theabilitytocollaborativelysolvecomplicatedtasksisstilllimitedcomparedtohuman
intelligence.
Inthisdissertation,westudytheproblemofcollaborativelearning-building˛exible,
generalizable,andscalablecollaborativestrategiestofacilitatethee˚ciencyoflearningone
orasetofobjectives.Towardsachievingthisgoal,weinvestigatethefollowingconcreteand
fundamentalproblems:1.Inthecontextofmulti-tasklearning,canweenforce˛exibleforms
ofinteractionsfrommultipletasksandadaptivelyincorporatehumanexpertknowledgeto
guidethecollaboration?2.Inreinforcementlearning,canwedesigncollaborativemethods
thate˙ectivelycollaborateamongheterogeneouslearningagentstoimprovethesample-
e˚ciency?3.Inmulti-agentlearning,canwedevelopascalablecollaborativestrategyto
coordinateamassivenumberoflearningagentsaccomplishingasharedtask?4.Infederated
learning,canwehaveprovablebene˝tfromincreasingthenumberofcollaborativelearning
agents?
Thisthesisprovidesthe˝rstlineofresearchtoviewtheabovelearning˝eldsinauni˝ed
framework,whichincludesnovelalgorithmsfor˛exible,adaptivecollaboration,real-world
applicationsusingscalablecollaborativelearningsolutions,andfundamentaltheoriesfor
propellingtheunderstandingofcollaborativelearning.
ACKNOWLEDGMENTS
Firstandforemost,Iwouldliketothankmyadvisor,Dr.JiayuZhou,forhisadvice,
encouragement,inspirations,andendlesssupportformyresearchandcareer.Throughout
thepast˝veyearsatMichiganStateUniversity,Dr.Zhouhasalwaysin˛uencedmewithhis
curiosity,passion,andpersistenceofresearch.Heiswillingtodiscussthegrantpictureofthe
researchandprovideconstructivesuggestionsinthetechnicaldetails.Meanwhile,despite
beingcreativeandproductive,healsogivesmethefreedomtoworkonavarietyofproblems,
evensomearenotalignedwithhisinterest.IwouldliketothankDrs.JiliangTang,Zhaojian
Li,andAnilK.Jainforbeingonmythesiscommittee.
I'mveryhappytohavehadtheopportunitytocollaboratewiththewonderfulgroup
ofcolleagues,faculty,andresearchersthroughoutmyPh.D.Fortheworkpresentedinthis
dissertation,IenjoyedworkingwithDr.JianpengXu,Dr.InciM.Baytas,Dr.ShuiwangJi,
Dr.ShuWang,RenyuZhao,Dr.ZheXu,ZhaonanQu,Dr.ZhaojianLi,Dr.Zhengyuan
ZhouandDr.JiayuZhou.Ithankthemfortheircontributionsandforeverythingtheyhave
taughtme.Besidestheworkpresentedinthisthesis,Ialsohadthepleasureofworking
withmanyoutstandingresearchers,includingLiyangXie,Dr.FeiWang,Dr.Pang-Ning
Tan,FengyiTang,IkechukwuUchendu,BoyangLiu,DingWang,ZhuangdiZhu,andDr.Bo
Dai.IwouldliketothankallofmyamazingcolleaguesinILLIDANlab:QiWang,Dr.Inci
M.Baytas,LiyangXie,MengyingSun,FengyiTang,BoyangLiu,ZhuangdiZhu,Junyuan
Hong,XitongZhangandIkechukwuUchenduforacollaborative,friendly,andproductive
environment.
IalsowanttoexpressmysincerethankstotheamazingcolleaguesImetduringthe
internships,includingDr.PinghuaGong,WeiChen,GuojunWu,ZhengtianXu,Hongyu
iv
Zheng,JintaoKe,HuaxiuYao,DanWang,LiliCao,LingkaiYang,QiqiWang,Dr.Yaguang
Li,Dr.PengWang,Dr.JieWang,ChaoTao,Dr.JiaChen,andDr.YoujieZhou.Many
thankstoDr.PinghuaGong,Dr.PengWang,forhostingmeasaninternatDidiChuxing
in2017and2018.IamalsomostthankfultoDr.JiaChenandDr.YoujieZhoufortheir
patienceandendlesshelpduringmyinternshipatGooglein2019.
Finally,Ithankmyparents,fortheirunconditionalloveandsupport.
v
TABLEOFCONTENTS
LISTOFTABLES
....................................
ix
LISTOFFIGURES
...................................
xi
LISTOFALGORITHMS
...............................
xiv
Chapter1Introduction
...............................
1
1.1DissertationContributions............................2
1.1.1Model-drivencollaboration........................2
1.1.2Data-drivencollaboration........................4
1.1.3Large-scaleCollaborativeMulti-agentLearning............5
1.1.4TheProvableAdvantageofCollaborativeLearning..........6
1.2DissertationStructure..............................7
Chapter2Background
................................
9
2.1CollaborativeLearningProblemFormulation..................9
2.2ATaxonomyofCollaboration..........................11
2.2.1Model-DrivenCollaboration.......................11
2.2.2Data-drivenCollaboration........................12
2.2.3CollaborativeMulti-agentLearning...................13
Chapter3Model-DrivenCollaborativeLearning
...............
14
3.1Multi-TaskFeatureInteractionLearning....................15
3.1.1Introduction................................15
3.1.2RelatedWork...............................18
3.1.3Taskrelatednessinhighorderfeatureinteractions...........22
3.1.4FormulationsandalgorithmsofthetwoMTILapproaches......27
3.1.4.1Preliminary...........................28
3.1.4.2SharedInteractionApproach.................28
3.1.4.3EmbeddedInteractionApproach...............31
3.1.5Experiments................................35
3.1.6SyntheticDataset.............................35
3.1.6.1E˙ectivenessofmodelingfeatureinteractions........35
3.1.6.2E˙ectivenessofMTIL.....................37
3.1.7SchoolDataset..............................39
3.1.8ModelingAlzheimer'sDisease......................40
3.1.9Discussion.................................41
3.2Multi-TaskRelationshipLearning........................42
3.2.1Introduction................................42
3.2.2RelatedWork...............................46
3.2.3InteractiveMulti-TaskRelationshipLearning..............49
vi
3.2.3.1RevisittheMulti-taskRelationshipLearning.........49
3.2.3.2TheiMTRLFramework....................52
3.2.3.3Aknowledge-awareextensionofMTRL...........54
3.2.3.4E˚cientOptimizationforkMTRL..............56
3.2.3.5BatchModePairwiseConstraintsActivelearning......59
3.2.4Experiments................................61
3.2.4.1ImportanceofHigh-QualityTaskRelationship........61
3.2.4.2E˙ectivenessofQueryStrategy................63
3.2.4.3InteractiveSchemeforQueryStrategy............64
3.2.4.4PerformanceonRealDatasets.................65
3.2.5CaseStudy:BrainAtrophyandAlzheimer'sDisease.........67
Chapter4Data-DrivenCollaborativeLearning
................
71
4.1CollaborativeDeepReinforcementLearning..................71
4.1.1Introduction................................71
4.1.2RelatedWork...............................76
4.1.3Background................................79
4.1.3.1ReinforcementLearning....................79
4.1.3.2AsynchronousAdvantageactor-criticalgorithm(A3C)...80
4.1.3.3Knowledgedistillation.....................81
4.1.4Collaborativedeepreinforcementlearningframework.........82
4.1.5Collaborativedeepreinforcementlearning...............83
4.1.6Deepknowledgedistillation.......................85
4.1.7CollaborativeAsynchronousAdvantage
Actor-Critic................................88
4.1.8Experiments................................91
4.1.8.1TrainingandEvaluation....................91
4.1.8.2Certi˝catedHomogeneoustransfer..............91
4.1.8.3Certi˝catedHeterogeneousTransfer..............93
4.1.8.4CollaborativeDeepReinforcementLearning.........96
4.2RankingPolicyGradient.............................97
4.2.1Introduction................................97
4.2.2Relatedworks...............................98
4.2.3NotationsandProblemSetting.....................100
4.2.4RankingPolicyGradient.........................100
4.2.5O˙-policyLearningasSupervisedLearning...............106
4.2.6Analgorithmicframeworkforo˙-policylearning............113
4.2.7SampleComplexityandGeneralizationPerformance..........115
4.2.8Supervisionstage:Learninge˚ciency..................117
4.2.9Explorationstage:Exploratione˚ciency................120
4.2.10JointAnalysisCombiningExplorationandSupervision........122
4.2.11ExperimentalResults...........................123
4.2.12AblationStudy..............................125
4.2.13Conclusion.................................127
vii
Chapter5CollaborativeMulti-AgentLearning
................
128
5.1Introduction....................................128
5.2RelatedWorks...................................132
5.3ProblemStatement................................134
5.4ContextualMulti-AgentReinforcementLearning................137
5.4.1IndependentDQN............................137
5.4.2ContextualDQN.............................138
5.4.3ContextualActor-Critic.........................140
5.5E˚cientallocationwithlinearprogramming..................143
5.6SimulatorDesign.................................148
5.7Experiments....................................151
5.7.1Experimentalsettings...........................151
5.7.2Performancecomparison.........................152
5.7.3OntheE˚ciencyofReallocations....................155
5.7.4Thee˙ectivenessofaveragedrewarddesign...............158
5.7.5Ablationsonpolicycontextembedding.................159
5.7.6Ablationstudyongroupingthelocations................160
5.7.7Qualitativestudy.............................161
5.8Conclusion.....................................162
Chapter6TheProvableAdvantageofCollaborativeLearning
.......
164
6.1Introduction....................................164
6.2Setup........................................167
6.2.1TheFederatedAveraging(FedAvg)Algorithm.............168
6.2.2Assumptions................................169
6.3LinearSpeedupAnalysisofFedAvg.......................170
6.3.1StronglyConvexandSmoothObjectives................170
6.3.2ConvexSmoothObjectives........................172
6.4LinearSpeedupAnalysisofNesterovAcceleratedFedAvg...........174
6.4.1StronglyConvexandSmoothObjectives................174
6.4.2ConvexSmoothObjectives........................175
6.5GeometricConvergenceofFedAvgintheOverparameterizedSetting.....176
6.5.1GeometricConvergenceofFedAvgintheOverparameterizedSetting.177
6.5.2OverparameterizedLinearRegressionProblems............178
6.6NumericalExperiments..............................180
Chapter7Conclusion
................................
182
APPENDICES
......................................
185
AppendixARankingPolicyGradient.........................186
AppendixBFederatedLearning............................215
BIBLIOGRAPHY
....................................
277
viii
LISTOFTABLES
Table3.1:Examplesofcommonsmoothlossfunctions..................27
Table3.2:PerformancecomparisonMTILandbaselinesontheSchooldataset....40
Table3.3:PerformancecomparisonMTILandbaselinesontheADNIdataset.....41
Table3.4:
TheaverageRMSEofqueryandrandomstrategyontestingdatasetover5
randomsplittingoftrainingandvalidationsamples.............63
Table3.5:TheRMSEcomparisonofkMTRLandbaselines...............63
Table3.6:
ThenameofthebrainregionsinFigure3.8,where(C)denotescortical
parcellationand(W)denoteswhitematterparcellation..........67
Table4.1:NotationsforSection4.2............................101
Table5.1:
PerformancecomparisonofcompetingmethodsintermsofGMVandorder
responseratewithoutrepositioncost.....................155
Table5.2:
PerformancecomparisonofcompetingmethodsintermsofGMV,order
responserate(ORR),andreturnoninvest(ROI)inXianconsidering
repositioncost..................................155
Table5.3:
PerformancecomparisonofcompetingmethodsintermsofGMV,order
responserate(ORR),andreturnoninvest(ROI)inWuhanconsidering
repositioncost.................................156
Table5.4:
E˙ectivenessofcontextualmulti-agentactor-criticconsideringreposition
costs.......................................156
Table5.5:E˙ectivenessofaveragedrewarddesign...................159
Table5.6:E˙ectivenessofcontextembedding......................159
Table5.7:E˙ectivenessofgroupregularizationdesign.................161
ix
Table6.1:
ConvergenceresultsforFedAvgandacceleratedFedAvg.Throughoutthepaper,
N
isthetotalnumberoflocaldevices,and
K

N
isthemaximalnumberofdevices
thatareaccessibletothecentralserver.
T
isthetotalnumberofstochasticupdates
performedbyeachlocaldevice,
E
isthelocalstepsbetweentwoconsecutive
servercommunications(andhence
T=E
isthenumberofcommunications).
y
Inthelinearregressionsetting,wehave

=

1
forFedAvgand

=
p

1
~

for
acceleratedFedAvg,where

1
and
p

1
~

areconditionnumbersde˝nedinSection
6.5.Since

1

~

,thisimpliesaspeedupfactorof
p

1
~

foracceleratedFedAvg.
166
TableA.1:
AcomparisonofstudiesreducingRLtoSL.The
Objective
columndenotes
whetherthegoalistomaximizelong-termreward.The
Cont.Action
columndenoteswhetherthemethodisapplicabletobothcontinuousand
discreteactionspaces.The
Optimality
denoteswhetherthealgorithmscan
modeltheoptimalpolicy.
X
y
denotestheoptimalityachievedbyERLis
w.r.t.theentropyregularizeobjectiveinsteadoftheoriginalobjectiveon
return.The
O˙-Policy
columndenotesifthealgorithmsenableo˙-policy
learning.The
NoOracle
columndenotesifthealgorithmsneedtoaccessto
acertaintypeoforacle(expertpolicyorexpertdemonstrations).....189
TableA.2:HyperparametersofRPGnetwork......................213
TableB.1:
Ahigh-levelsummaryoftheconvergenceresultsinthispapercomparedto
priorstate-of-the-artFLalgorithms.Thistableonlyhighlightsthedepen-
denceon
T
(numberofiterations),
E
(themaximalnumberoflocalsteps),
N
(thetotalnumberofdevices),and
K

N
thenumberofparticipated
devices.

istheconditionnumberofthesystemand

2
(0
;
1)
.Wedenote
NesterovacceleratedFedAvgasN-FedAvginthistable...........217
x
LISTOFFIGURES
Figure3.1:
IllustrationofMTLwithfeatureinteractions.(a)thefeatureinteractions
frommultipletaskscanbecollectivelyrepresentedasatensor
Q
;group
sparsestructures(c)andlow-rankstructures(b)infeatureinteractions
canbeusedtofacilitatemulti-taskmodels.................20
Figure3.2:
RMSEcomparisonbetweenRRandSTILontwosyntheticdatasetswith
samplesizeof1kand5k,respectively....................36
Figure3.3:
Syntheticdataset(Multi-task):RootMeanSquareError(RMSE)compar-
isonsamongallthemethods.TheY-axisisRMSE,X-axisisdimensionof
features....................................37
Figure3.4:
OverviewoftheproposediMTRLframework,whichinvolveshumanexperts
intheloopofmulti-tasklearning.Theframeworkconsistsofthreephases:
(1)
Knowledge-awaremulti-tasklearning
:learningmulti-tasklearningmod-
elsfromknowledgeanddata,(2)
Solicitation
:solicitingmostinformative
knowledgefromhumanexpertsusingactivelearningbasedquerystrategy,
(3)
Encoding
:encodingthedomainknowledgetofacilitateinductivetransfer.
44
Figure3.5:
PerformanceofMTRLandeMTRLasthenumberoffeatureschanging,in
termsof(a)Frobeniusnormand(b)RMSE.MTRL[
227
]learnsbothtask
modelsandtaskrelationshipatthesametime,whileeMTRLherelearns
thetaskmodelswhilethetaskrelationship

is˝xedtogroundtruth,i.e.
encodingthecorrectdomainknowledgeaboutthetaskrelationship...63
Figure3.6:
TheaveragedRMSEofkMTRLusingdi˙erentsettingofquerystrategy.
ThekMTRL-10-100meansselecting10pairwiseconstraintsattheend
ofeachiteration,startfromzero,add10pairwiseconstraintsatatime,
until100constraints.Forall4schemes,kMTRLwithzeroconstraintsis
equivalenttoMTRL.Resultsaretheaverageover5foldrandomsplitting.65
Figure3.7:
Thedistributionofcompetenceon(a)intra-regioncovarianceand(b)
inter-regioncovariance.kMTRLperformsbetterthanMTRLwhen
competence
>
1
.Highercompetenceindicatesbetterperformanceachieved
bykMTRLascomparedtoMTRL.Weseeinamajorityofregionsthe
kMTRLoutperformstheMTRL.......................68
xi
Figure3.8:
Comparisonofsub-matricesofcovarianceamong(left)taskcovariance
using
90%
alldatapointsthatisconsideredas(middle)
thecovariancematrixlearnedviaMTRLon
20%
dataand(right)the
covariancematrixlearnedviakMTRLon
20%
datawith0.8%pair-wise
constraintsqueriedbytheproposedqueryscheme.............68
Figure4.1:IllustrationofCollaborativeDeepReinforcementLearningFramework..72
Figure4.2:
Deepknowledgedistillation.In(a),theteacher'soutputlogits
z

is
mappedthroughadeepalignmentnetworkandthealignedlogits
F

!
(
z

)
isusedasthesupervisiontotrainthestudent.In(b),theextrafully
connectedlayerfordistillationisaddedforlearningknowledgefromteacher.
Forsimplicity'ssake,timestep
t
isomittedhere..............82
Figure4.3:Performanceofonlinehomogeneousknowledgedistillation........93
Figure4.4:
Performanceofonlineknowledgedistillationfromaheterogeneoustask.(a)
distillationfroma
Pong
expertusingthepolicylayertotraina
Bowling
student(KD-policy).(b)distillationfroma
Pong
experttoa
Bowling
studentusinganextradistillationlayer(KD-distill)............94
Figure4.5:
Theactionprobabilitydistributionsofa
Pong
expert,a
Bowling
expert
andanaligned
Pong
expert.........................94
Figure4.6:
Performanceof
o˜ine
,
online
deepknowledgedistillation,andcollabora-
tivelearning..................................95
Figure4.7:O˙-policylearningframework.........................113
Figure4.8:
ThebinarytreestructureMDP(
M
1
)withoneinitialstate,similaras
discussedin[
184
].Inthissubsection,wefocusontheMDPsthathaveno
duplicatedstates.TheinitialstatedistributionoftheMDPisuniform
andtheenvironmentdynamicsisdeterministic.For
M
1
theworstcase
explorationisrandomexplorationandeachtrajectorywillbevisitedat
sameprobabilityunderrandomexploration.Notethatinthistypeof
MDP,theAssumption5issatis˝ed......................121
Figure4.9:
ThetrainingcurvesoftheproposedRPGandstate-of-the-art.Allresults
areaveragedoverrandomseedsfrom1to5.The
x
-axisrepresentsthe
numberofstepsinteractingwiththeenvironment(weupdatethemodel
everyfoursteps)andthe
y
-axisrepresentstheaveragedtrainingepisodic
return.Theerrorbarsareplottedwithacon˝denceintervalof95%...123
Figure4.10:Thetrade-o˙betweensamplee˚ciencyandoptimality..........125
xii
Figure4.11:
Expectedexploratione˚ciencyofstate-of-the-art.Theresultsareaveraged
overrandomseedsfrom1to10.......................126
Figure5.1:
Thegridworldsystemandaspatial-temporalillustrationoftheproblem
setting.....................................137
Figure5.2:
Illustrationofcontextualmulti-agentactor-critic.Theleftpartshowsthe
coordinationofdecentralizedexecutionbasedontheoutputofcentralized
valuenetwork.Therightpartillustratesembeddingcontexttopolicy
network....................................144
Figure5.3:
ThesimulatorcalibrationintermsofGMV.TheredcurvesplottheGMV
valuesofrealdataaveragedover7dayswithstandarddeviation,in10-
minutetimegranularity.Thebluecurvesaresimulatedresultsaveraged
over7episodes.................................150
Figure5.4:Simulatortimelineinonetimestep(10minutes)..............151
Figure5.5:
IllustrationofallocationsofcA2CandLP-cA2Cat18:40and19:40,resp-
sectively.....................................158
Figure5.6:
ConvergencecomparisonofcA2Canditsvariationswithoutusingcontext
embeddinginbothsettings,withandwithoutrepositioncosts.TheX-axis
isthenumberofepisodes.TheleftY-axisdenotesthenumberofcon˛icts
andtherightY-axisdenotesthenormalizedGMVinoneepisode.....159
Figure5.7:
Illustrationontherepositionsnearbytheairportat1:50amand06:40
pm.Thedarkercolordenotesthehigherstatevalueandthebluearrows
denotetherepositions.............................162
Figure5.8:Thenormalizedstatevalueanddemand-supplygapoveroneday.....163
Figure6.1:
ThelinearspeedupofFedAvginfullparticipation,partialparticipation,
andthelinearspeedupofNesterovacceleratedFedAvg,respectively...181
FigureA.1:ThebinarytreestructureMDPwithtwoinitialstates...........194
FigureA.2:
Thedirectedgraphthatdescribestheconditionalindependenceofpairwise
relationshipofactions,where
Q
1
denotesthereturnoftakingaction
a
1
atstate
s
,followingpolicy
ˇ
in
M
,i.e.,
Q
ˇ
M
(
s;a
1
)
.
I
1
;
2
isarandom
variablethatdenotesthepairwiserelationshipof
Q
1
and
Q
2
,i.e.,
I
1
;
2
=
1
;
i
:
i
:
f
:Q
1

Q
2
;
o
:
w
:I
1
;
2
=0
.......................206
FigureB.1:TheconvergenceofFedAvgw.r.tthenumberoflocalsteps
E
.......276
xiii
LISTOFALGORITHMS
Algorithm3.1knowledge-awareMulti-TaskRelationshipLearning(kMTRL)...
58
Algorithm3.2Projectionalgorithm..........................
58
Algorithm3.3QueryStrategyofPairwiseConstraints................
59
Algorithm3.4iMTRLframework...........................
59
Algorithm4.1OnlinecA3C..............................
90
Algorithm4.2O˙-PolicyLearningforRankingPolicyGradient(RPG)......
115
Algorithm5.1

-greedypolicyforcDQN.......................
141
Algorithm5.2ContextualDeepQ-learning(cDQN).................
141
Algorithm5.3ContextualMulti-agentActor-CriticPolicyforward.........
144
Algorithm5.4ContextualMulti-agentActor-CriticAlgorithmfor
N
agents....
145
xiv
Chapter1
Introduction
Humanintelligenceisremarkableatcollaboration.Besidesindependentlearning,ourlearning
processishighlyimprovedbysummarizingwhathasbeenlearned,communicatingitwith
peers,andsubsequentlyfusingknowledgefromdi˙erentsourcestoassistthecurrentlearning
goal.This
collaborativelearning
procedureensuresthattheknowledgeisshared,continuously
re˝ned,andconcludedfromdi˙erentperspectivestoconstructaincreasinglyprofound
understanding,whichcansigni˝cantimprovethethelearninge˚ciency.
Ontheotherhand,machineintelligencestillpalesincomparisontohumaninsome
aspects,despiteitsphenomenaldevelopmentinrecentyears:theyingeneraldesignedfor
onespeci˝ctask,withanisolated,dataine˚cient,andcomputationallyexpensivelearning
paradigm.
Theresearchgoalpresentedinthisdissertationistobuildanintelligentsystemwith
multiplelearningagentsthatcollaborativelyresolvesoneorasetoftasksmoree˚ciently.In
particular,wetacklethefollowingchallengesinvariousdomainsofcollaborativelearning.

Flexibleandinteractivecollaboration.
Howcanmodelsofmultiplelearning
agentsinteracttoleveragetheknowledgefromrelatedtasksina˛exible,stable,and
interactiveway?Moreconcretely,howcanweincorporatehigher-orderinteractions
intothemultiplelearningmodelsduringtraining?Howcanwecontinuouslyguidethe
learningofmultiplemodelsandselectivelysolicitthehumanexpertknowledgetoescort
1
theircollaborationinteractively?

Heterogeneouscollaboration.
Onelimitationincollaborativelearningisthat
thelearningmodelsingeneral,haveahomogeneousstructure.Howcanwedesign
collaborativestrategiesamongheterogeneouslearningagentstoimprovethesample-
e˚ciency?

Large-scalecollaboration.
Inpractice,ane˙ectiveande˚cientcollaborationamong
alargeamountoflearningagentsisdesired.Howcanwescalethecollaborationto
thousandsofagents?

Theoreticalguaranteeofcollaboration.
Besidesthepracticalalgorithmsand
applications,whatarethetheoreticaladvantagesofcollaborativelearning?Doesthe
learningbene˝tfrommorelearningagents?
1.1DissertationContributions
Toresolvetheaforementionedchallengesofcollaborativelearning,thisthesispresentshowthe
collaborationisachievedtoimprovesample-e˚ciencyinvariousscenarios.Moreconcretely,
thecontributionsofthisthesisaresummarizedinthefollowingsections.
1.1.1Model-drivencollaboration
Wediscussmodel-drivencollaborationinthecontextofmulti-tasklearning.The˝rstpart
inthisChapterdiscusseshowdowecapturethehigh-orderfeatureinteractionsamong
relatedtaskscollaboratively.Traditionalmulti-tasklearningwithlinearmodelsarewidely
usedinvariousdataminingandmachinelearningalgorithms.Onemajorlimitationof
2
suchmodelsisthelackofcapabilitytocapturepredictiveinformationfrominteractions
betweenfeatures.Whileintroducinghigh-orderfeatureinteractiontermscanovercomethis
limitation,thisapproachdramaticallyincreasesthemodelcomplexityandimposessigni˝cant
challengesinthelearningagainstover˝tting.Whentherearemultiplerelatedlearning
tasks,featureinteractionsfromthesetasksareusuallyrelatedandmodelingsuchrelatedness
isthekeytoimprovetheirgeneralization.Here,wepresentanovelMulti-Taskfeature
InteractionLearning(MTIL)frameworktoexploitthetaskrelatednessfromhigh-order
featureinteractions.Speci˝cally,wecollectivelyrepresentthefeatureinteractionsfrom
multipletasksasatensor,andpriorknowledgeoftaskrelatednesscanbeincorporatedinto
di˙erentstructuredregularizationsonthistensor.Weformulatetwoconcreteapproaches
underthisframework,namelythesharedinteractionapproachandtheembeddedinteraction
approach.Theformerassumestaskssharethesamesetofinteractions,andthelatterassumes
featureinteractionsfrommultipletasksshareacommonsubspace.Wehaveprovidede˚cient
algorithmsforsolvingthetwoformulations.
Thesecondpartinthischapterinvestigatessolicitingandincorporatingtaskrelatedness
informationfromhumanexperttothemodel,whichguidesthedirectionofthemodel-based
collaboration.InthecenterofMTLalgorithmsishowtherelatednessoftasksaremodeled
andencodedinlearningformulationstofacilitateknowledgetransfer.AmongtheMTL
algorithms,themulti-taskrelationshiplearning(MTRL)attractedmuchattentioninthe
communitybecauseitlearnstaskrelationshipfromdatatoguideknowledgetransfer,instead
ofimposingapriortaskrelatednessassumption.However,thismethodheavilydependson
thequalityoftrainingdata.Whenthereisinsu˚cienttrainingdataorthedataistoonoisy,
thealgorithmcouldlearnaninaccuratetaskrelationshipthatmisleadsthelearningtowards
suboptimalmodels.Toaddresstheaforementionedchallenge,weproposeanovelinteractive
3
multi-taskrelationshiplearning(iMTRL)frameworkthate˚cientlysolicitspartialorder
knowledgeoftaskrelationshipfromhumanexperts,e˙ectivelyincorporatestheknowledge
inaproposedknowledge-awareMTRLformulation.Weproposeane˚cientoptimization
algorithmforkMTRLandcomprehensivelystudyquerystrategiesthatidentifythecritical
pairsthataremostin˛uentialtothelearning.Wepresentextensiveempiricalstudieson
bothsyntheticandrealdatasetstodemonstratethee˙ectivenessofproposedframework.
1.1.2Data-drivencollaboration
InChapter3,wediscussdata-drivencollaborationinthecontextofreinforcementlearning
andusethedataasamediumtofacilitatecollaborationamongmultiplelearningagents,
whichcanthenlargelyimprovethesample-e˚ciency.
Inthischapter,we˝rstleveragetheknowledgedistillationtoenforcethecollaboration
amongheterogeneouslearningagents.Theideaofknowledgetransferhasledtomany
advancesinmachinelearninganddatamining,butsigni˝cantchallengesremain,especially
whenitcomestoreinforcementlearning,heterogeneousmodelstructures,anddi˙erent
learningtasks.Motivatedbyhumancollaborativelearning,weproposeacollaborative
deepreinforcementlearning(CDRL)frameworkthatperformsadaptiveknowledgetransfer
amongheterogeneouslearningagents.Speci˝cally,theproposedCDRLconductsanovel
deepknowledgedistillationmethodtoaddresstheheterogeneityamongdi˙erentlearning
taskswithadeepalignmentnetwork.Furthermore,wepresentane˚cientcollaborative
AsynchronousAdvantageActor-Critic(cA3C)algorithmtoincorporatedeepknowledge
distillationintotheonlinetrainingofagents,anddemonstratethee˙ectivenessoftheCDRL
frameworkusingextensiveempiricalevaluationonOpenAIgym.
Inadditiontoknowledgetransferamongdi˙erenttasks,wecanfurthercoordinate
4
di˙erenthomogeneouslearningagentsforthesametask,whichfurtheradvancesmorestable
optimizationandsample-e˚cientlearning.Themainideaisano˙-policylearningframework
thatdisentanglesexplorationandexploitationinreinforcementlearning,whichbuildupon
theconnectionbetweenimitationlearningandreinforcementlearning.Thestate-of-the-art
estimatestheoptimalactionvalueswhileitusuallyinvolvesanextensivesearchoverthe
state-actionspaceandunstableoptimization.Towardsthesample-e˚cientRL,wepropose
rankingpolicygradient(RPG),apolicygradientmethodthatlearnstheoptimalrankofa
setofdiscreteactions.Toacceleratethelearningofpolicygradientmethods,weestablish
theequivalencebetweenmaximizingthelowerboundofreturnandimitatinganear-optimal
policywithoutaccessinganyoracles.Theseresultsleadtoageneralo˙-policylearning
framework,whichpreservestheoptimality,reducesvariance,andimprovesthesample-
e˚ciency.Weconductextensiveexperimentsshowingthatwhenconsolidatingwiththe
o˙-policylearningframework,RPGsubstantiallyreducesthesamplecomplexity,comparing
tothestate-of-the-art.
1.1.3Large-scaleCollaborativeMulti-agentLearning
Inthischapter,weapplycollaborativemulti-agentreinforcementlearningtoareal-world˛eet
managementapplication,whichisanessentialcomponentforonlineride-sharingplatforms.
Large-scaleonlineride-sharingplatformshavesubstantiallytransformedourlivesbyreallo-
catingtransportationresourcestoalleviatetra˚ccongestionandpromotetransportation
e˚ciency.Ane˚cient˛eetmanagementstrategynotonlycansigni˝cantlyimprovethe
utilizationoftransportationresourcesbutalsoincreasetherevenueandcustomersatisfaction.
Itisachallengingtasktodesignane˙ective˛eetmanagementstrategythatcanadapttoan
environmentinvolvingcomplexdynamicsbetweendemandandsupply.Existingstudiesusu-
5
allyworkonasimpli˝edproblemsettingthatcanhardlycapturethecomplicatedstochastic
demand-supplyvariationsinhigh-dimensionalspace.Weproposetotacklethelarge-scale
˛eetmanagementproblemusingreinforcementlearning,andproposeacontextualmulti-agent
reinforcementlearningframeworkincludingtwoconcretealgorithms,namelycontextualdeep
Q
-learningandcontextualmulti-agentactor-critic,toachieveexplicitcoordinationamonga
largenumberofagentsadaptivetodi˙erentcontexts.Weshowsigni˝cantimprovementsof
theproposedframeworkoverstate-of-the-artapproachesthroughextensiveempiricalstudies.
1.1.4TheProvableAdvantageofCollaborativeLearning
Previously,weproposetheheuristiccollaborativeapproachtocoordinatealargenumber
oflearningagentstoresolveareal-worldapplication.Inaddition,wewouldliketoprovide
arigorousanswertowhetherthereisaprovablebene˝tfromincreasingthenumberof
collaborativelearningagents.Weinvestigatethisprobleminfederatedlearning,whichisa
criticalscenarioinbothindustryandacademia.Federatedlearning(FL)learnsamodeljointly
fromasetofparticipatingdeviceswithoutsharingeachother'sprivatelyhelddata.The
characteristicsofnon-
iid
dataacrossthenetwork,lowdeviceparticipation,andthemandate
thatdataremainprivatebringchallengesinunderstandingtheconvergenceofFLalgorithms,
particularlyinregardstohowconvergencescaleswiththenumberofparticipatingdevices.
Here,wefocusonFederatedAveraging(FedAmostwidelyusedande˙ectiveFL
algorithminusetodaprovideacomprehensivestudyofitsconvergencerate.Although
FedAvghasrecentlybeenstudiedbyanemerginglineofliterature,itremainsopenas
tohowFedAvg'sconvergencescaleswiththenumberofparticipatingdevicesintheFL
crucialquestionwhoseanswerwouldshedlightontheperformanceofFedAvgin
largeFLsystems.We˝llthisgapbyestablishingconvergenceguaranteesforFedAvgunder
6
threeclassesofproblems:stronglyconvexsmooth,convexsmooth,andoverparameterized
stronglyconvexsmoothproblems.WeshowthatFedAvgenjoyslinearspeedupineach
case,althoughwithdi˙erentconvergencerates.Foreachclass,wealsocharacterizethe
correspondingconvergenceratesfortheNesterovacceleratedFedAvgalgorithmintheFL
setting:tothebestofourknowledge,thesearethe˝rstlinearspeedupguaranteesforFedAvg
whenNesterovaccelerationisused.ToaccelerateFedAvg,wealsodesignanewmomentum-
basedFLalgorithmthatfurtherimprovestheconvergencerateinoverparameterizedlinear
regressionproblems.Empiricalstudiesofthealgorithmsinvarioussettingshavesupported
ourtheoreticalresults.
1.2DissertationStructure
Theremainderofthisdissertationisorganizedasfollows.Weintroducethebackgroundof
collaborativelearninginChapter2.InChapter3,westartwithlearninglinearmodelsfor
multipletaskswhileincorporating˛exibleformsofinteractionsanddevelopaninteractive
approachtosolicithumanexpertknowledgeformodelcollaborations.Thischapterwas
previouslypublishedas"Multi-taskFeatureInteractionLearning"[
115
]and"Interactive
Multi-taskRelationshipLearning"[
117
].InChapter4,wepresentdata-drivencollaboration
methodstointeractamongheterogeneouslearningagents,whichcanlargelyimprovethe
sample-e˚ciencyofreinforcementlearningalgorithms.Thematerialsinthischapterarebased
on"CollaborativeDeepReinforcementLearning"[
114
]and"RankingPolicyGradient"[
118
].
InChapter5,westudyareal-worldapplicationanddesignacoordinationstrategythatcan
scaletoalargenumberoflearningagents.Thematerialsinthischapterwerepublishedas
"E˚cientlarge-scale˛eetmanagementviamulti-agentdeepreinforcementlearning"[
116
].In
7
Chapter6,wepresentrigoroustheoriesontheimprovementofconvergencerateswithrespect
totheincreasingnumberofcollaborativelearningagents,whichadvocatetheadvantageof
collaborativelearning.Thematerialsinthischapterarebasedon"FederatedLearning's
Blessing:FedAvghasLinearSpeedup"[157].WeconcludethisdissertationinChapter7.
8
Chapter2
Background
Inthischapter,we˝rstgiveacoherentde˝nitionof
collaborativelearning
usedinthroughout
thisdissertation,thenwediscussconnectionsanddiscrepanciesamongfourspeci˝cscenarios
underthisoverarchingframework.
2.1CollaborativeLearningProblemFormulation
Indisciplinesofcognitivescience,educationandpsychology,
collaborativelearning
,asituation
inwhichagroupofpeoplelearntoachieveasetoftaskstogether,hasbeenadvocated
throughoutpreviousstudies[
50
].Motivatedbythephenomenalsuccessofhumancollaborative
learning,westudythecollaborativelearninginthedomainofarti˝cialintelligence.We˝rst
provideageneralde˝nitionofcollaborativelearninginthisthesis.
De˝nition1
(Collaborativelearning)
.
Collaborativelearningisagenerallearningparadigm
thatmultiplelearningagentscollaboratetosolveoneorasetoftasks.
Here,wewouldliketoclarifytheseveralterminologiesusedinDe˝nition1.

multiple
:incontrasttoindividuallearning,collaborativelearningherecoversawide
rangeoflearning:fromasmallscalesuchasapairoflearningagentstolarge-scale
suchasthousandsoflearningagents.
9

learningagents
:Thelearningagentreferstoamachinelearningmodelthatbehaves
di˙erentlyfromeachother.Forexample,learningagentscanbeparameterizedby
di˙erentdeepneuralnetworks.Theneuralnetworkcanhavedi˙erentdomainsor
architectures.Thecentralrequirementisthateachlearningagentcanlearnindividually
andconductdecisionmakingindependently.

collaborate
:theinteractionamongdi˙erentlearningagents.Thestrategyofthis
interactionisthecentraldesignofthecollaborativelearningalgorithm.

solveoneorasetoftasks
:Inmachinelearning,solvingoneorasetoftasksrefers
tooptimizingoneorseveralobjectivefunctionsthatgeneralizewelltotheunseen
scenarios.
Moreconcretely,weprovidetheproblemformulationofcollaborativelearningasfollows:
min
W
=
f
w
i
g
K
i
=1
K
X
i
=1
F
i
(
W
)
s.t.
w
i
2C
i
(
W
)
8
i
=1
;:::;K
(2.1)
where
F
i
;i
=1
;:::;K
referstothesetoftaskswewanttosolve.Themodelparameter
w
i
denotesthelearningagents.Itisworthnotingthat
w
i
isnotnecessarilyrepresentedbya
singleinstance,e.g.,aneuralnetwork,adecisiontree,etc.Weuse
w
i
todenoteallvariables
thatneedtobedeterminedforadecisionprocess,whichconstructsamappingfromtheinput
oftask
i
totheaction,suchasregression,classi˝cation,etc.Weusetheset
C
i
;
8
i
=1
;:::;K
denotestheinteractionsbetweenlearningagent
i
andothers,whichcanencodevarioustypes
ofcollaborationstrategiesintothelearningprocessaswewilldiscussshortly.Forsimplicity,
wedenotetheunionofmodelsofalllearningagentsas
W
=
f
w
i
g
K
i
=1
.Therationaleof
collaborativelearningisthattheproperdesignofinteractions
C
amongthelearningagents
10
facilitatestheoptimizationofobjectives.
Itisworthnotingthatthecollaborationsetisamoregeneralexpressioncomparingto
theregularization.Theregularizationhasaspeci˝cformonenforcingtheformulationwhile
thesetofcollaborationcanintegratemore˛exiblealgorithmicdesignsofinteraction.Inthis
thesis,despitedi˙erentiationsexistintermsofhowdi˙erentlearningagentsinteract,we
followthecommonpracticeandusecooperationandcollaborationinterchangeably[50].
2.2ATaxonomyofCollaboration
Inthissection,wepresentdi˙erentcategoriesofcollaborations,whichleadstoseveralsub˝elds
inthemachinelearningcommunity.Wediscusstheconnectionsanddiscrepanciesofthose
relatedsub˝eldsandexplorethepossibleadvantagesoforganizingtheminauni˝edview.
2.2.1Model-DrivenCollaboration
The˝rstcategoryofcollaborativelearningismodel-drivencollaboration,whichdirectly
enforcestheinteractionoflearningagentsintheparameterspace.Fromtheperspectiveof
transferlearning,theseapproachesimplementknowledgetransferfromintroducinginductive
biasduringthelearning.Itspeci˝callyspecifytheconditionsoflearnedsolutionneedstobe
satis˝ed,suchassparsityorlow-rankproperty.Inthiscase,thecollaborationconstrainreduces
tothevariousregularizationsandthecollaborativelearningreducestomulti-tasklearning
andfederatedlearning.Moreconcretely,weset
C
i
=
R
(
W
)
,where
R
(

)
istheregularization
addedtothe
W
.Forexample,underthesituationsthat
W
isamatrix(eachlearningagents'
modelisavector),acommonregularizationistracenorm
R
(
W
)=
f
W
jk
W
k
tr
thatcontrols
thesubspaceofmultiplemodels.
11
Multi-TaskLearning(MTL)isaprincipledlearningparadigmthatleveragesuseful
informationcontainedinmultiplerelatedtaskstohelpimprovethegeneralizationperformance
ofallthetasks[
226
].ThegoalofMTListolearn
K
functionsforthetaskssuchthat
f
k
(
x
ik
)=
y
ik
,basedontheassumptionthatalltaskfunctionsarerelatedtosomeextent,
whereeachfunction
f
k
isparameterizedby
w
k
.Thegeneralmulti-tasklearningformulation
isgivenby:
min
W
K
X
k
=1
F
k
(
w
k
)+

R
(
W
)
(2.2)
Another˝eldthatfallsintomodel-drivencollaborationisfederatedlearning.Federated
learning(FL)learnsasinglemodeljointlyfromasetoflearningagents.Ingeneral,each
learningagentcorrespondstoalocaldeviceandthetrainingisperformedsharingeachother's
privatelyhelddata.Asfornow,theprevalentcollaborationstrategyistheaggregationofall
learningagents'models.Thechallengeoffederatedlearningisthepracticalconstraintson
collaboration:toreducethecommunicationcost(thefrequencyofcollaboration),dealwith
systemheterogeneity,andunderstandthetheoreticalpropertiesofthissimplecollaborative
strategy.WewillproviderigorousanswerstothosequestionsinChapter6.
2.2.2Data-drivenCollaboration
Onelimitationforthetraditionalmodel-basedcollaborationisthatthemodelstructure
restrictedduetotheusageofinductivetransfer.Toovercomethisissue,thedata-driven
collaborationleveragesthetechniquessuchasknowledgedistillation,mimiclearning.
12
Inthiscase,thedata-drivencollaborationconstrainisgivenby
C
i
(
w
i
)=
f
argmin
w
i
`
(
w
i
;f
w
j
(
x
)
;y
)
;
8
(
x
;y
)
2
B
g
;
where
B
denotesthereplaybu˙erthatcontainasetofselecteddataaccordingtothe
task-speci˝ccriteria.Noticethattheinteractionbetweenlearningagentsnowareconducted
throughthedatacollectedin
B
.Sincetheotherlearningagent'smodellabeledthedata
in
B
,itcontainsinformationlearnedinagent
j
,whichisthendistilledtoagent
i
through
lossfunction
`
(

)
.Inthisway,wecanempowera˛exiblenetworkstructureamongdi˙erent
agents,thusachievecollaborationamongheterogeneouslearningagents.Theseapproaches
willbeintroducedinChapter4.
2.2.3CollaborativeMulti-agentLearning
Incollaborativemulti-agentlearning,themultiplelearningagentsinteractwithothersto
achieveacommontask.Eachlearningagentcanperformthelearningprocessindividually
whiletheWeemphasizethisproblemasadistincttypeofcollaborationsincetheagentscan
adapttheircollaborationsthroughtheenvironmentfeedback,thoughthistrialanderrorcan
becomputationallyintractable.Toimprovethesample-e˚ciencyinthisscenario,wecan
enforceatask-speci˝cmodel-drivenordata-drivenapproachduringthelearning.Weprovide
aconcretereal-worldapplicationtodemonstratethiscategoryinChapter5.
13
Chapter3
Model-DrivenCollaborativeLearning
Inthischapter,wediscussmodel-drivencollaborationinthecontextofmulti-tasklearning.
Morespeci˝cally,we˝rstproposedanovelMulti-TaskfeatureInteractionLearning(MTIL)
frameworktoexploitthetaskrelatednessfromhigh-orderfeatureinteractions,whichprovides
bettergeneralizationperformancebyinductivetransferamongtasksviasharedrepresentations
offeatureinteractions.Weformulatetwoconcreteapproachesunderthisframeworkand
providee˚cientalgorithms:thesharedinteractionapproachandtheembeddedinteraction
approach.Theformerassumestaskssharethesamesetofinteractions,andthelatter
assumesfeatureinteractionsfrommultipletaskscomefromasharedsubspace.Wehave
providede˚cientalgorithmsforsolvingthetwoapproaches.Secondly,theclassicalmulti-task
relationshiplearningcouldlearnaninaccuratetaskrelationshipwhenthereareinsu˚cient
trainingdataorthedataistoonoisy,andwouldmisleadthelearningtowardssuboptimal
models.Inthischapter,weproposedanovelinteractivemulti-taskrelationshiplearning
(iMTRL)frameworkthate˚cientlysolicitspartialorderknowledgeoftaskrelationshipfrom
humanexperts,e˙ectivelyincorporatestheknowledgeinaproposedknowledge-awareMTRL
formulation.Weproposede˚cientoptimizationalgorithmforkMTRLandcomprehensively
studyquerystrategiesthatidentifythecriticalpairsthataremostin˛uentialtothelearning.
14
3.1Multi-TaskFeatureInteractionLearning
3.1.1Introduction
Linearmodelsaresimpleyetpowerfulmachinelearninganddataminingmodelsthatare
widelyusedinmanyapplications.Duetotheadditivenatureofthelinearmodels,itcanfully
unleashthepoweroffeatureengineering,allowingcraftedfeaturestobeeasilyintegrated
intothelearningsystem.Thisisadesiredpropertyinmanypracticalapplications,inwhich
high-qualityfeaturesarethekeytopredictiveperformance.Moreover,e˚cientparallel
algorithmsarereadilyavailabletolearnlinearmodelsfromlarge-scaledatasets.Despiteits
attractiveproperties,oneapparentlimitationofsuchmodelsisthattheycanonlylearna
setofindividuale˙ectsoffeaturescontributingtotheresponse,duetoitslinearadditive
property.Thuswhenapartoftheresponseisderivedfrominteractionsbetweenfeatures,
suchmodelswouldnotbeabletodetectsuchnon-linearpredictiveinformation,thereby
leadingtopoorpredictiveperformance.
Inpractice,high-orderfeatureinteractionsarecommoninmanydomains.Forexample,
ingeneticsstudies,environmentale˙ectsandgenetic-environmentalinteractionarefound
tohavestrongrelationshipwiththevariabilityinadopteeaggressivity,conductdisorder
andadultantisocialbehavior[
29
].Similarly,theinteractione˙ectsbetweencontinuance
commitmentanda˙ectivecommitmentwasfoundinpredictingannexedabsences[
177
].
Also,arecentstudyofdepressionfoundthatgenotype,sex,environmentalriskandtheir
interactionhavecombinedin˛uenceondepressionsymptoms[
52
].Itisalsoreportedthatthe
interactionofbrain-derivedneurotrophicfactorandearlylifestressexposureareidenti˝ed
inpredictingsyndromaldepressionandanxiety,andassociatedalterationsincognition[
63
].
Inbiomedicalstudies,manyhumandiseasesarearesultofcomplicatedinteractionsamong
15
geneticvariantsandenvironmentalfactors[
79
].Oneintuitivesolutiontoovercomethis
limitationistoaugmentinteractiontermsintolinearmodels,explicitlymodelingthee˙ects
fromtheinteractions.However,thiswilldramaticallyincreasethemodelcomplexityandlead
topoorgeneralizationperformancewhenthereislimitedamountofdata[
35
,
39
,
124
,
158
,
216
].
Ontheotherhand,whentherearemultiplerelatedlearningtasks,themulti-tasklearning
(MTL)paradigm[
10
,
19
,
33
]haso˙eredaprincipledwaytoimprovethegeneralization
performanceofsuchlearningtasksbyleveragingtherelatednessamongtasksandperforming
inductivetransferamongthem.Thepastdecadehaswitnessedagreatamountofsuccess
inapplyingMTLtotackleproblemswherelargeamountoflabeleddataarenotavailable
orcreatingsuchdatasetsincursprohibitivecost.Suchproblemsareespeciallyprevalentin
biologicalandmedicaldomains,whereMTLhasachievedsigni˝cantsuccess,includingdata
analysisongenotypeandgeneexpression[
101
],breastcancerdiagnosis[
228
]andprogression
modelingofAlzheimer'sDisease[
68
],etc.TheMTLimprovesgeneralizationperformance
bylearningasharedrepresentationfromalltasks,whichservesastheagentforknowledge
transfer.Structuredregularizationhasprovidedane˙ectivemeansofmodelingsuchshared
representationandencodingvarioustypesofdomainknowledgeontasks[
10
,
89
,
142
,
199
].
Theattractivebene˝tsprovidedbyMTLmakeitanidealschemewhenlearningproblems
involvemultiplerelatedtaskswithfeatureinteractions,becausetasksmayberelatedwith
eachotherbysharedstructuresonfeatureinteractions.Forexample,predictingvarious
cognitivefunctionsmayinvolveasharedsetofinteractionsamongbrainregions.
However,manyexistingMTLframeworksarebasedonlinearmodels[
10
]intheoriginal
inputspace.Thustheycannotbedirectlyappliedtoexploretaskrelatednessintheform
ofhigh-orderfeatureinteractions.Ontheotherhand,althoughtraditionalnonlinearMTL
methodsbasedonneuralnetworks(e.g.,[
13
])canexploitnon-linearfeatureinteractions
16
tosomeextends,itisgenerallydi˚culttoencodepriorknowledgeontaskrelatednessto
suchmodels.Inthischapter,weproposeanovelmulti-taskfeatureinteractionlearning
framework,whichlearnsasetofrelatedtasksbyexploitingtaskrelatednessintheform
ofsharedrepresentationsinboththeoriginalinputspaceandtheinteractionspaceamong
features.Westudytwoconcreteapproachesunderthisframework,accordingtodi˙erentprior
knowledgeabouttherelatednessviafeatureinteractions.The
sharedinteractionapproach
assumesthatthereareonlyasmallnumberofinteractionsthatarerelevanttothepredictions,
andalltaskssharethesamesetofinteractions;the
embeddedinteractionapproach
assumes
that,foreachtask,thefeatureinteractionsarederivedfromalow-dimensionalsubspace
thatissharedacrossdi˙erenttasks.Wehaveprovidedformulationsande˚cientalgorithms
forbothapproaches.Weconductempiricalstudiesonbothsyntheticandrealdatasetsto
demonstratethee˙ectivenessoftheproposedframeworkonleveragingfeatureinteractions
fromtasks.Thecontributionsofthispaperarethreefolds:

OurnovelframeworkhasextendedtheMTLparadigm,forthe˝rsttime,toallowhigh-
orderrepresentationstobesharedamongtasks,byexploitingpredictiveinformation
fromfeatureinteractions.

Weproposedtwonovelapproachesunderourframeworktomodeldi˙erenttask
relatednessoverfeatureinteractions.

Ourcomprehensiveempiricalstudiesonbothsyntheticandrealdatahaveledto
practicalinsightsoftheproposedframework.
Theremainderofthispaperisorganizedasfollows:Section3.1.2reviewsrelatedworkof
MTLandmodelsinvolvingfeatureinteractions.Section3.1.3introducestheframeworkfor
17
MTIL.ThetwoapproachesunderMTILhavebeengivenin3.1.4.Section6.6presentsthe
experimentalresultsonbothsyntheticandrealdatasets.
3.1.2RelatedWork
TheproposedresearchisrelatedtoexistingworkonMTLandfeatureinteractionlearning.
Inthissection,webrie˛ysummarizethetheserelatedworkandshowhowourworkadvances
theseareas.
Multi-TaskLearning.
MTLhasbeenextensivestudiedoverthelasttwodecades.Inthe
centerofmostMTLalgorithmsishowtaskrelationshipsareassumedandencodedinto
thelearningformulations.Theconceptoflearningmultiplerelatedtasksinparallelwas
˝rstintroducedin[
33
].Itwasdemonstratedinmultiplereal-worldapplicationsthatadding
asharedrepresentationinneuralnetworktaskscanhelpothersgetbettermodels.Such
discoveryhadinspiredmanysubsequentresearche˙ortsinthecommunityandapplicationsin
diverseapplicationdomains.Amongthesestudies,theregularizedMTLframeworkhasbeen
pioneeredby[
55
].Theregularizationschemecaneasilyintegratevarioustaskrelationshipinto
existinglearningformulationstocoupleMTL,thusprovidinga˛exiblemulti-taskextension
toexistingalgorithms.ItiswelladoptedandissoongeneralizedtoarichfamilyofMTL
algorithms.
MTLviaRegularization.
AmongtheworkintheregularizationbasedMTLscheme,there
aremanydi˙erentassumptionsabouthowtasksarerelated,leadingtodi˙erentregularization
termsintheformulation.Forexample,onecommonassumptionisthatthetaskssharea
subsetoffeatures,andthetaskrelatednesscanbecapturedbyimposingagroupsparsity
penaltyonthemodelstoachievesimultaneousfeatureselectionacrosstasks[
199
,
142
].
18
Anothercommonassumptionisthatthemodelsoftaskscomefromthesamesubspace,
leadingtoalow-rankstructurewithinthemodelmatrix.Directlypenalizingtherankfunction
leadstoNP-hardproblems,andoneconvexalternativeistopenalizetheconvexenvelopof
therankfunction,i.e.,tracenorm.Thisencourageslow-rankbyintroducingsparsitytothe
singularvaluesofthemodelmatrix[
89
].In[
10
],theauthorsstudiedaMTLformulationthat
learnsacommonfeaturemappingforthetasksandassumedalltaskssharethesamefeatures
afterthemapping.Theauthorshaveshownthatthisassumptioncanalsobeequivalently
expressedbyalow-rankregularizationonthemodel.Therearemanymoreformulations
thatfallintothiscategoryofformulationtocapturetaskrelatednessbydesigningdi˙erent
sharedrepresentationandregularizationterms,suchasclusterstructures[
232
],tree/graph
structures[
101
,
38
],etc.However,tothebestofourknowledge,alloftheseformulationsdo
notconsiderfeatureinteractionsinthemodel,andextensionstoconsiderinteractionsare
notstraightforward.Inthiswork,wewillextendtheMTLframeworktoenableknowledge
transfernotonlyintheoriginalinputspace,butalsoinhigher-orderfeatureinteraction
space.
MultilinearMTL.
TheuseoftensorinMTLhasshowntobeverye˙ectiveinrepresenting
structuralinformationunderlyinginMTLproblems.In[
162
],Romera-Paredes
etal.
proposed
amultilinearmultitask(MLMTL)frameworkthatarrangesparametersoflineare˙ectsfrom
alltasksintoatensor
W
,bywhichtheyareabletorepresentthemulti-modalrelationships
amongtasks.Inadatasetcontainingmulti-modalrelationships,taskscanbereferenced
bymultipleindices.InMLMTL,theauthorsemployedaregularizeron
W
toinducea
low-rankstructuretotransferknowledgeamongtasks.Theoptimizationproblemcontains
theminimizationoftensor'srank,whichleadstosolvinganon-convexproblem.Thusthe
authorsdevelopanalternatingalgorithm,employingtheTuckerdecompositionandconvex
19
a)tensorrepresentationoffeatureinteractions
b)structuredsparsityofaninteractiontensor
c)low-rankstructureofaninteractiontensor
Figure3.1:IllustrationofMTLwithfeatureinteractions.(a)thefeatureinteractionsfrom
multipletaskscanbecollectivelyrepresentedasatensor
Q
;groupsparsestructures(c)and
low-rankstructures(b)infeatureinteractionscanbeusedtofacilitatemulti-taskmodels.
relaxationusingtensortracenorm.Althoughtheauthorsalsousedatensorrepresentation
inMTL,thelearningformulations,implications,aswellasthemeaningofsuchthetensor
isfundamentallydi˙erentfromthoseinourwork.TheproposedMTILframeworkutilizes
tensortocapturetherelatednessamongtasksandtransferknowledgethroughhigh-order
featureinteractions,whichcannotbeachievedbyanyexistingMTLformulations.Notethat
thetensorinMLMTLisindexedbymulti-modaltasks.InMTIL,thetensorisindexedby
featuresandtasks,whichisclearlydi˙erentfromtheaforementionedwork.Intheproposed
embeddedinteractionapproachforMTIL,however,wefaceasimilarchallengeinMLMTL
toseekasolutioninvolvingalow-ranktensor.
FeatureInteraction
Inmanymachinelearningtasks,weareinterestedinlearningalinearpredictivemodel.
Giventheinputfeaturevectorofasample,theresponseisgivenbyalinearcombinationof
20
thesefeatures,i.e.,aweightedsumofthefeatures.Becauseofthisreasonwecallthemlinear
e˙ects.Therearestrongevidencesfoundinmanycomplexapplicationsthat,inadditionto
thelineare˙ects,therearealsoe˙ectsfromhigh-orderinteractionsbetweensuchfeatures.As
aresult,thereareconsiderablee˙ortsfrombothacademiaandindustryaimingataddressing
thislimitationbyremovingtheadditiveassumptionandincludinginteractione˙ects.
Toovercomethedimensionalityissuesintroducedbyinteractione˙ects,twotypesof
heredityconstraintshavebeenstudied[
20
];namelystronghierarchyinwhichaninteraction
e˙ectcanbeselectedintothemodelonlyifbothofitscorrespondinglineare˙ectshavebeen
selected,andweakhierarchy,inwhichaninteractione˙ectcanbeselectedifatleastoneof
itscorrespondinglineare˙ectshasbeenselected.In[
39
],theauthorsproposedanapproach
knownasSHIMtoidentifytheimportantinteractione˙ects.SHIMextendstheclassical
Lasso[
194
]andenforcesastronghierarchy.Aniterativealgorithmwasproposedbasedon
Lasso,whichmaynotscaletoproblemswithhighdimensionalfeaturespace.Radchenko
et.
al
proposedtheVANISHmethodtoaddresstheproblem[
158
].Theydevelopedaconvex
formulationwithare˝nedpenaltythatcannotonlylearnthesparsesolution,butalsotreat
thelinearandinteractione˙ectsusingdi˙erentweights.Thisway,themaine˙ectcould
havemorein˛uenceontheprediction.In[
20
],ahierarchicallassowasproposedtosearch
forinteractionswithlargemaine˙ectsinsteadofconsideringallpossibleinteractions.The
authorsproposedanalgorithmbasedonADMMforstronghierarchylassoandageneralized
gradientdescentforweakhierarchicallasso.Morerecently,Liu
etal.
[
124
]proposedan
e˚cientalgorithmforsolvingthenon-convexweakhierarchicalLassodirectly,basedonthe
frameworkofgeneraliterativeshrinkageandthresholding(GIST)[
67
].Theauthorsproposed
aclosedformsolutionofproximaloperatorandfurtherimprovedthee˚ciencyofsolvingthe
subproblemofproximaloperatorfromquadratictolinearithmictimecomplexity.
21
Inmanyrealworkapplicationstherearemultiplerelatedtasks.Whenthosethesetasks
involveinteractione˙ects,thetaskscouldberelatedviathehighorderfeatureinteractions.
Inourpaper,weproposetoaddressthemodelcomplexityissuefrominteractione˙ectsusing
anewperspective,byleveragingsuchrelatedness.
3.1.3Taskrelatednessinhighorderfeatureinteractions
Inthissection,wepresenttheframeworkofMulti-TaskfeatureInteractionLearning(MTIL).
Forcompleteness,wegiveaself-containedintroductionofourwork.Wewillderiveconcrete
learningalgorithmsunderthisframeworkinSection3.1.4.
LinearandInteractionE˙ects.
Considerthetraditionallinearmodels.Foraninput
featurevector
x
2
R
d
andascalarresponse
y
,wehaveassumedthefollowingunderlying
lineargenerativemodel:
y
=
d
X
i
=1
x
i
w
i
+

where
w
2
R
d
istheweightvectorforlineare˙ects,and

˘N
(0
;˙
2
)
isaGaussiannoise.
Alinearmodel
f
(
x
;
w
)=
x
T
w
canbeaquitee˙ectivepredictionfunction.However,ifthe
underlyinggenerativemodelincludese˙ectsfromfeatureinteractions,i.e.,
y
=
d
X
i
=1
x
i
w
i
+
d
X
i
=1
d
X
j
=1
x
i
x
j
Q
i;j
+

where
x
i
x
j
Q
i;j
isthejointe˙ectbetweenthe
i
thfeatureandthe
j
thfeature,and
Q
i;j
isthe
weightforthisjointe˙ect.Thistypeoffeatureinteractionshavebeencommonlyfoundin
manyapplications.Ifthetrainingdatafollowthisdistributionthenthelinearmodelisnot
enoughtocapturetherelationshipbetweeninputfeaturesandoutputresponses.Oneofthe
22
approachesistointroducenon-linearfeatureinteractiontermsintothelinearmodel.Thatis,
wecandenoteitasaquadraticfunction:
f
(
x
;
w
;
Q
)=
x
T
w
+
x
T
Qx
;
(3.1)
where
w
2
R
d
and
Q
2
R
d

d
collectivelyrepresenttheparametersforlineare˙ectsand
interactione˙ects,respectively.Wenotethat
Q
istypicallysymmetricbecausethisrepresen-
tationincludestwotermsinvolvingfeature
i
and
j
:
x
i
x
j
(
Q
i;j
+
Q
j;i
)
anditalsoincludes
second-orderfeaturetransformationsoftheoriginalfeatures
x
2
i
Q
i;i
.
DiscussionsonFeatureInteractions.
Insupervisedlearning,weseekapredictive
functionthatmapsaninputvector
x
2
R
d
toacorrespondingoutput
y
2
R
.Let
(
X
;
y
)=
f
(
x
1
;y
1
)
;
(
x
2
;y
2
)
;:::
(
x
n
;y
n
)
g
beatrainingdataset,inwhicheachdatapointisdrawnfrom
certain
i.i.d.
distribution

.Thegoaloflearningisto˝ndthebestpredictor
^
f
2H
sothat
thepredictedvalue
^
y
i
fortheinputdata
x
i
isascloseaspossibletothegroundtruth
y
i
,
8
(
x
i
;y
i
)
2
(
X
;
y
)
,givenalossfunction
L
(
:;:
)
.Wehopethatthepredictor
f
learnedinthis
wayisclosetotheoptimalmodelthatminimizestheexpectedlossaccordingtothe

:
R
(
f
)=
E
(
X
;
y
)
˘

L
(
f
(
X
)
;
y
)
:
(3.2)
Suchpredictorisgivenbytheminimumoftheempiricalrisk:
^
f
=argmin
f
2H
n
X
i
=1
L
(
f
(
x
i
)
;
y
i
)
:
Theerrorcausedbylearningthebestpredictorinthetrainingdatasetiscalledtheestimation
error.Theerrorcausedbyusingarestricted
H
iscalledtheapproximationerror.Fora
23
˝xeddatasize,thesmallerthehypothesisspace
H
,thelargertheapproximationerror,and
viceversa.Thetrade-o˙betweenapproximationerrorandestimationerroriscontrolledby
selectingthesizeof
H
.Byincludingfeatureinteractionswewouldenlargethehypothesis
space,andwemaybeabletodramaticallyminimizetheapproximationerrorcompared
tothetraditionalhypothesisspaceforlinearmodels.Ontheotherhand,wenotethat
givenalimitedamountofdata,alargehypothesisspacemayresultinmodelswithpoor
generalizationperformance.Wewillneedtoeitherincreaseourtrainingdata,orprovide
e˙ectiveregularizationstonarrowdownthehypothesisspace.
Multi-taskFeatureInteractions.
Weconsiderthesettingthattherearemultiplelearning
taskswhicharerelatednotonlyintheoriginalfeaturespace,butalsointermsoffeature
interactions.Theproposeframeworksimultaneouslylearnsallrelatedtasksandprovidesan
e˙ectiveregularizationonthehypothesisspaceusingrelatednessontheinteractions.
Let
D
=
(
X
1
;
y
1
)
;:::;
(
X
T
;
y
T
)
bethetrainingdataforthe
T
learningtasks,andthe
i.i.d.
trainingsamplesfortask
t
isdrawnfrom
(

t
)
m
t
,where
m
t
isthenumberofdatapoints
availablefortask
t
.Wecollectivelydenotethedistributionas
D˘

=
Q
T
t
=1
(

t
)
m
t
.All
taskshavea
d
-dimensionalfeaturespace(i.e.,
x
i
2
R
d
).Thecorrespondingfeaturesare
homogeneousandhavethesamesemanticmeaning.Thetotaltrainingdatapointsare:
(
X
t
;
y
t
)=
f
(
x
1
t
;y
1
t
)
;
(
x
2
t
;y
2
t
)
;:::;
(
x
mt
;y
mt
)
g
;t
=1
;:::;T;
ThegoalofMTListolearn
T
functionsforthetaskssuchthat
f
t
(
x
it
)=
y
it
,basedonthe
assumptionthatalltaskfunctionsarerelatedtosomeextent.
Inordertoconsiderinteractionsforeachtask,weusethequadraticpredictivefunctionin
Eq.3.1foralltasks.Wecollectivelyrepresentthelineare˙ectsfromalltasksasamatrix
24
W
=[
w
1
;:::;
w
T
]
2
R
d

T
,
w
i
2
R
d
andtheinteractione˙ectsasatensor
Q2
R
d

d

T
,
inwhichthe
t
-thfrontalslice
Q
t
2
R
d

d
representstheinteractione˙ectsfortask
t
.We
illustratethisinteractiontensorinFigure3.1(a).
Givenspeci˝clossfunctions
^
`
forsamplesfromonetask,(e.g.,squarelossforregres-
sionandlogisticlossforclassi˝cation,seeTable3.1),thelossfunctionforeachtaskis
`
t
(
f;
w
;
Q
;
X
;
y
)=
P
m
t
i
=1
^
`
(
f
(
x
i
;
w
;
Q
)
;y
i
)
.Ourmulti-taskfeatureinteractionlossfunction
isgivenby:
L
(
W
;
Q
;
f;
X
;
Y
)=
T
X
t
=1
`
t
(
f;
w
t
;
Q
t
;
X
t
;
Y
t
)
:
(3.3)
Notethatitisnotnecessaryforalltaskstohavethesamelossfunction.InMTL,thelearning
ofeachtaskbene˝tsfromtheknowledgefromothertasks,whiche˙ectivelyreducesthe
hypothesisspaceforalltasks.Inordertoachieveknowledgetransferamongtasks,wewould
liketoimposesharedrepresentationsviadesigningregularizationtermsonboth
W
and
Q
,
whichspecifyhowtasksarerelatedintheoriginalfeaturespaceandfeaturesinteractions,
respectively.
TheMTILFramework.
TheproposedMulti-TaskfeatureInteractionLearning(MTIL)
frameworkisthengivenbythefollowinglearningobjective:
min
W
;
Q
L
(
W
;
Q
;
f;
X
;
Y
)+

R
R
F
(
W
)+

I
R
I
(
Q
)
;
(3.4)
where
R
F
(
W
)
istheregularizationprovidingtaskrelatednessintheoriginalfeaturespace,
R
I
(
Q
)
istheregularizationencodingourknowledgeabouthowfeatureinteractionsarerelated
amongtasks,

R
and

I
arethecorrespondingregularizationcoe˚cients.For

I
!1
,the
25
problemreducestotraditionalMTL,when
R
I
ischosenproperly.Inthispaper,weformulate
twoconcreteapproachestocapturethefeatureinteractionpatterns:

SharedInteractionApproach.
Inmanyapplications,eventhoughwehavea
largenumberoffeatureinteractions,onlyafewinteractionsmayberelatedtothe
response[
20
,
39
].Whenlearningwithmultipletasks,di˙erenttasksmayshareexactly
thesamesetoffeatureinteractions,butwithdi˙erente˙ects.Assuch,wecandesign
MTILformulationsthatlearnsasetofcommonfeatureinteractions,whichcould
e˙ectivelyreducethehypothesisspace.Duringthelearningprocesstheselected
featureinteractionsforonetaskwillbetask'sknowledge,contributingtotheshare
representation:asetofindicesofcommoninteractions.AnanalogyintraditionalMTL
isthejointfeaturelearningapproach[
142
,
199
],inwhichtaskssharethesamesetof
features.Onewaytoachievethisapproachisbyusingthestructuredsparsitytoinduce
thesamesparsitypatternsontheinteractione˙ects.Anillustrationofthisapproachis
giveninFigure3.1(b).

EmbeddedInteractionApproach.
Whentheresponsefromonetaskisrelatedto
complicatedfeatureinteractions,thepatternsofsuchinteractionsmaybecapturedby
alow-dimensionalspace,resultinginalow-rankinteractionmatrix.Whenthereare
multiplerelatedtasks,theycouldhaveasharedlow-dimensionalspace,i.e.,di˙erent
interactionmatricesmaysharethesamesetofrank-1basismatrices,buthavedi˙erent
weightsassociatedwiththesebasismatrices.Whencollectivelyrepresentedbyatensor,
weendupwithalow-ranktensor.Duringthelearningprocess,eachtaskcontributes
theirsubspaceinformationtofacilitatelearningofthesharelow-dimensionalsubspace,
whichinturn,improvesthefeaturespace.TheanalogyintraditionalMTListhe
26
Table3.1:Examplesofcommonsmoothlossfunctions.
LosswithInteraction
Lossfunction
L
i
Gradient|LinearE˙.
r
W
L
i
Gradient|InteractionE˙.
r
Q
t
L
i
LogisticLoss


[
log
(
g
(
x
i
))
y
ti
+(1

y
ti
)(
log
(1

g
(
x
i
)))]
(
g
(
x
i
)

y
ti
)
x
i
(
g
(
x
i
)

y
ti
)
x
i
x
T
i
SquaredLoss
1
2
jj
x
T
i
w
t
+
x
T
i
Q
t
x
i

y
ti
jj
2
2
x
i
(
x
T
i
w
t
+
x
T
i
Q
t
x
i

y
ti
)
x
i
(
x
T
i
w
t
+
x
T
i
Q
t
x
i

y
ti
)
x
T
i
SquaredHinge
y
h
(
y
ti
(
x
T
i
w
t
+
x
T
i
Q
t
x
i
))
y
ti
x
i
h
0
(
x
T
i
w
t
+
x
T
i
Q
t
x
i
)
y
ti
x
i
x
T
i
h
0
(
x
T
i
w
t
+
x
T
i
Q
t
x
i
)

g
(
x
)
isthesigmoidfunctionde˝nedas
g
(
x
i
)=1
=

1+
exp
(

(
x
T
i
w
t
+
x
T
i
Q
t
x
i
))

y
h
0
(
z
)=

1
for
z

0
;z

1
for
0
<z<
1
;
0
for
z

1
g
low-rankbasedmodels[
10
,
89
].However,therearechallengingquestionssuchas:
Howtode˝neaproperrankfunctionfortensor?Aretheretractablealgorithmsto
inducelow-rankstructureintensor?Inthenextsectionwewilldiscusstheseimportant
questionsandproposee˚cientalgorithms.WeillustratethisapproachinFigure3.1(c).
Wenotethateventhoughweonlyprovidedtwospeci˝capproachesinthispaper,the
proposedMTILframeworkcouldo˙erbroaderclassofformulations.Theproposedframework
allowsmanyotherpossiblewaystode˝netaskrelatednessonfeatureinteractions,leadingto
abrand-newresearchareaofMTL.
3.1.4FormulationsandalgorithmsofthetwoMTILapproaches
Inthissection,wewillstudyhowtheformulationsandalgorithmsofthesharedinteraction
approachandembeddedinteractionapproachundertheproposedMITLframework.We
notethatextensionofmulti-tasklearningtofeatureinteractionsisnottrivialbecauseof
theinvolvementoftensors.Westartwithformulatingthesharedinteractionapproach
byincorporatingagroupLassopenaltytointroducestructuredsparsityonthetensor,
whichwouldselectonlyasetofcommonfeatureinteractionsacrossdi˙erenttasksthatare
relevanttotheprediction.Fortheembeddedinteractionapproach,weproposebothaconvex
formulationandanon-convexformulation.Whiletheconvexformulationleadstoe˚cient
optimizationalgorithmsandglobalsolutions,thenon-convexformulationprovidesreduced
storagecomplexityforlarge-scaleproblems.
27
3.1.4.1Preliminary
Here,weusethefollowingbasicde˝nitionoftensor:
Mode-
n
˝ber
isavectorde˝nedby˝xingeveryindexbutone.Wemayseeitasthehigher
orderanalogueofmatrixrows(mode-2˝bers)andcolumns(mode-1˝bers).Forexample,in
athree-waytensor
Q2
R
n
1

n
2

n
3
,themode-3˝beris
Q
i;j;
:
2
R
n
3
.
Mode-
n
unfolding
istheprocessofreorderingtheelementsofanN-waytensor
Q2
R
n
1

n
2

;::;

n
N
intoamatrix.Themode-kunfoldingoftensor
Q
isdenotedby
Q
(
k
)
2
R
n
k

J
k
,where
J
k
=
Q
N
i
=1
;i
6
=
k
.Thematrixisarrangedbyconcatenatingallmode-k˝bers
ofthetensor.
Rank-
n
denotestherankoftensor'smode-nunfolding.It'sactuallythedimensionof
thespacespannedbythemode-n˝bersoftensor.Speci˝cally,
rank
n
(
Q
)=
rank
(
Q
(
n
)
)
.
When
Q
isamatrix(i.e.2-waytensor),thisbecomestheregularde˝nitionofrank,since
rank
1
(
Q
)=
rank
2
(
Q
)=
rank
(
Q
)
.
3.1.4.2SharedInteractionApproach
Thegoalofthesharedinteractionapproachistoidentifyasetofcommonandrelevantfeature
interactionsacrossdi˙erenttasks.Theinteractiontensor
Q
inourframeworkhasprovided
aconvenientrepresentationtoencodesuchinformation,andweareabletoincorporating
agroupLassopenalty[
61
]toinduceaspecialtypeofstructuredsparsityonthetensor,
couplingthesameinteractionsforalltasks.Recallthatthesparsityimpliesthatonlythe
signi˝cantinteractione˙ectsarecapturedinthemodel.Forthepurposeofsharedinteraction,
28
a
sparsetensornorm
isde˝nedas:
jjQjj
GL-Sym

X
d
i
=1
X
d
j

i
r
X
K
k
=1

Q
2
i;j;k
+
Q
2
j;i;k

:
(3.5)
Notethatthisnormenforcesasymmetricsparsitybyoverthetensor,sothattheonegroup
isde˝nedtoincludecoe˚cientsofoneinteractionbetweenfeature
i
andfeature
j
,fromall
tasks.Penalizingthetensorsparsenormleadstothefollowingformulation:
min
w
;
Q
L
(
W
;
Q
;
f;
X
;
Y
)+

F
R
F
(
W
)+

I
jjQjj
GL-Sym
;
(3.6)
wheretheparameter

I
controlthesparsityoftensor
Q
,alarger

I
willendupwithamore
sparse
Q
.Thesolutiontoformulationdeliversatensorsuchthatthemode-3˝bersareeither
allzerosvectorsornonzerovectors,i.e.,interactione˙ectsbetween2features
x
i
;x
j
either
existsonalltasks,orirrelevantforalltasks.Notethateventhesparsitypatternsissamefor
alltasks,theirinteractionsmayhavedi˙erentweights.Itiseasytoseethat,thisapproach
subsumesthetraditionalmulti-tasklearningasaspecialcase:when

I
!1
bysetting
regularizationparameterontensor
Q
toin˝nity,alltheelementsinof
Q
inthesolutionwill
bezeros,andthemodelonlyconsiderslineare˙ects.
Whenthelossfunction
L
chosenisconvexandcontinuouslydi˙erentiablewithLipschitz
continuousgradient[
158
],thenwecanuseproximalbasedgradientmethods,suchas˝rst
orderFISTA[
16
],SpaRSA[
214
]orsecondorderProximalNewton[
108
]tosolveite˚ciently.
Becausethatthelineare˙ectsandinteractione˙ectsaredecoupledinthepredictivefunction,
amajorclassoflossfunctionsbelongtothiscategory,andwegiveafewexamplesofcommon
lossfunctionsinTable3.1.Notethatevenwhen
L
isnon-convex,alocaloptimalsolution
29
canbee˚cientlyobtainedusingtheGISTframework[
67
].Thekeytoapplythesealgorithms
istoe˚cientlycomputetheproximaloperatorthatassociatestotheproblem(referto[
150
]
formoredetailsaboutproximal):
min
W
;
Q
1
2
(
k
W

^
W
k
2
F
+
kQ
^
Qk
2
F
)+
ˆ
1
R
F
(
W
)+
ˆ
2
jjQjj
GL-Sym
;
where
^
W
and
^
Q
areintermediatesolutionsateachstep,
ˆ
1
and
ˆ
2
areregularization
parametersaugmentedwithstepsize.NotethatwehaveextendtheForbeniusnormfrom
matrixtotensor.Weseethattheproblemisdecoupledfor
W
and
Q
.Andthetensor
proximal:
min
Q
1
2
kQ
^
Qk
2
F
+
ˆ
2
jjQjj
GL-Sym
;
canbesolvedinthesamewayasthegroupLassoproximaloperator[
222
].Moreover,we˝nd
thatwhenthegradientissymmetric,wedon'tneedtoenforceasymmetrictensorsparse
norm,andwecouldsimplyuseasimplealternative:
jjQjj
GL
=
X
i;j
r
X
K
k
=1
Q
2
i;j;k
;
andinitializethealgorithmwithasymmetrictensorasthestartingpoint.Thereason
thatsymmetryholdscanbeexplainedbytwoparts.First,thegradientof
Q
issymmetric,
thereforethegradientdescentstepwon'tchangethesymmetryoftensor
Q
.Second,the
proximaloperatorassociatedtosparsetensornormwon'tchangethesymmetryofmatrix.
Toseethis,theproximaloperationisperformedbyvectorizingthematrixintoavector
andshrinkeachelementofthevectorwithrespecttoainputvector,whichisobtainedby
30
thelastgradientdescentstep.Sincetheinputvectorrepresentsansymmetricmatrix,the
elementanditssymmetricelementwillalwaysshrinktothesamenewvalue.Therefore,the
symmetryof
Q
holds.Thesparsetensornormisequivalenttoperformthe
l
1
projectionof
vectorswhereeachelementisthe
l
2
normofmode-3˝berintensor
Q
.
3.1.4.3EmbeddedInteractionApproach
Theshareinteractionapproachhasenforcedaveryrestrictiveformofhowtasksaresupposed
torelatetoeachother.Inmanyapplications,thepredictionmaybearesultofcomplicated
featureinteractions,insteadonlyinvolvesafewinteractions.Eventhoughtheprediction
mayinvolveallfeatureinteractions,itisusuallyareasonableassumptionthatthereare
patternsamongtheseinteractions.Numerically,existenceofpatternsimplyalow-dimensional
subspace,whichisre˛ectedbyalow-rankstructureinthematrix.Whentherearemultiple
relatedlearningtasks,onewayforthesetasksrelatetoothersviaasharedlow-dimensional
subspace,whichgivesusalow-ranktensor.Assuch,wemaydesignastructuredregularization
toencouragethematrix
Q
tobealow-ranktensor.Inthispaperwedescribeoneconvex
formulationthatencourageslow-rankstructurebypenalizingatensornormandonenon-
convexformulationthatdirectlylearnsalow-rankrepresentation.
ConvexFormulation
Onewaytoobtainalow-ranktensoristoaugmentourformulationwitharankpenalty.
Oneproblemassociatestotensoristhatthereisnoconsistentwaytode˝netherankofa
tensor.Onewayistousetheaveragerankofunfoldingondi˙erentmode[62]:
1
N
N
X
n
=1
rank
n
(
Q
)=
1
N
N
X
n
=1
rank
(
Q
(
n
)
)
;
31
where
N
isthetotalnumberofmodeofthetensor(
N
=3
whenonlypair-wiseinteractions),
and
Q
(
n
)
isunfoldon
n
mode.SinceminimizingtherankfunctionisproventobeNP-hard,
wecouldpenalizethetracenorminstead,whichistheconvexenvelopeoftherankfunction.
Thetracenormisde˝nedasthesumofsingularvaluesofthematrixvariable[
89
].Wethen
obtainthefollowingconvexformulation:
min
W
;
Q
L
(
W
;
Q
;
f;
X
;
Y
)+

R
R
1
(
W
)+

I
N
3
X
n
=1
jjQ
(
n
)
jj

;
(3.7)
where
k
:
k

denotesthetracenorm.However,thisconvexformulationpenalizeseverymode
oftensor
Q
tobejointlylowrank,whichmaybetoorestrictedinpractice,whichmaylead
tosuboptimalperformance.Moreover,thepracticalwaytosolvetheformulationinEq.(3.7)
istousethealternatingdirectionmethodsofmultipliers(ADMM)[
23
],whichintroduces
auxiliaryvariablesandequalityconstraints,inordertodecouplethethreetensortracenorm
terms.However,ADMMalgorithminpracticeisshowntohaveaslowconvergencerate,and
lesspreferredwhencompositeproximalmethodssuchasFISTAcanbeapplied.
Onealternativewaytoaddresstheseissuesistousethelatenttracenorm[
195
,
196
],
whichisde˝nedasfollowingfora
N

waytensor:
jj
Q
jj
latent
=inf
Q
(1)
+
Q
(2)
+
:::
+
Q
(
N
)
=
Q
N
X
n
=1
jjQ
(
n
)
(
n
)
jj

;
where
Q
(1)
:::
Q
(
N
)
areasetoflow-rankauxiliarytensors,whichstatesthattheoriginal
tensorcanbedecomposedintothesumofasetoftensorsthatarelow-rankindi˙erent
modes.Finally,weproposedtodroptheequalityconstraintthateachauxiliarytensorequal
totheoriginalone,butwedirectlyusethemixtureoftensorstorepresenttheoriginaltensor,
32
sotheproblembecomesaunconstrainedoptimizationproblem.Thepredictivefunctionof
task
t
withsuchmixtureisgivenby:
f
mix
(
x
;
w
t
;
fQ
(
i
)
g
3
i
=1
)=
x
T
w
t
+
x
T
(
X
3
i
=1
Q
(
i
)
t
)
x
;
where
Q
(
j
)
2
R
d

d

K
;
8
j
=1
;
2
;
3
aretheauxiliarytensorsforreplacingtheoriginaltensor
Q
,matrix
Q
(
j
)
(
j
)
2
R
(
n
1
n
2
n
3
=n
j
)

n
j
isthemode
j
unfoldingoftensor
Q
(
j
)
,
Q
(
j
)
t
2
R
d

d
isthe
t
thfrontalsliceoftensor
Q
(
j
)
.Finally,ourconvexformulationunderembeddedinteraction
approachisgivenby:
min
W
;
fQ
(
i
)
g
3
i
=1
L
(
W
;
fQ
(
i
)
g
3
i
=1
;
f
mix
;
X
;
Y
)+

F
R
F
(
W
)+

I
3
X
j
=1
jjQ
(
j
)
(
j
)
jj

:
Theconvexityofthisformulationholdssinceboththelossfunctionandthepenaltyare
convex.
Wenotethatthisformulationcanbesolvedinthesamewayastheformulationin
Eq.(3.7),andthemodelismuchmore˛exibletomodelthecomplicatedinteractionsamong
thefeatures,leveragingtheadvantagesofsuchauxiliarytensors.
Non-ConvexFormulation
Althoughusingproximalgradientmethodsweareabletosecureanoptimalsolutionfor
theconvexformulation,thetimecomplexityandstoragecostareunacceptableinpracticeas
thedimensionofdataincrease.Toseethis,wenotethattheproximaloperatorassociatedto
atracenormregularizedobjectiverequiressingularprojections[
89
],whichrequirescubic-
complexitysingularvaluedecomposition.Recallineachiterationofthegradientmethods
couldinvolvemorethanonecomputationofproximaloperator[
16
],andthusthecomputation
maybeprohibitivewhendimensiongrowslarger.Ontheotherhand,wehavetomaintain
33
3densetensorsofsize
d

d

T
whichmeansthestoragecostisat
O
(
d
2
)
,where
T
isthe
numberoftasksandtypicallywehave
T
˝
d
.Alsothemixtureofthreelow-rankauxiliary
tensorsmayleadtosomedi˚cultywhenitcomestoanalyzingthepredictivemodelitself.
Tothisend,weproposetouseatensorwithaexplicitlow-rankstructure.Considerthe
interactione˙ectsmatrix
Q
2
R
d

d
foronetask,weassumethelow-rankdecomposition
Q
=
B
~
QB
T
,where
B
2
R
d

r
isabasismatrix,
~
Q
2
R
r

r
isasmallmatrix,capturingthe
informationoftheoriginaltensorunderthesetofbases(columns)in
B
.Toseethis,wecan
expand
Q
=
P
r
i;j
=1
~
Q
(
i;j
)
B
i
B
T
j
,meaningthematrix
Q
isaresultofinteractionsamong
basesin
B
andalsospannedbythecolumnsof
B
.Wethuscanuseapredictivefunction
thatexplicitlyconsidersthislow-rankstructure:
f
nvc
(
x
;
w
;
B
;
~
Q
)=
x
T
w
+
x
T
B
~
QB
T
x
:
Whentherearemultipletasks,ourassumptionforembeddedinteractionapproachisthe
sharedbasis,meaning
B
isrestrictedtobesameasallothertasks.Themulti-taskloss
functionisthusgivenby:
L
(
W
;
f
B
g
;
~
Q
;
f
nvc
;
X
;
Y
)=
T
X
t
=1
`
t
(
f
nvc
;
w
t
;
B
;
~
Q
t
;
X
t
;
Y
t
)
;
where
~
Q2
R
r

r

T
collectivedenotesthesetofmatrices
~
Q
fromalltasks.Thislossfunction
isnotconvexbecauseofthemultiplicationofvariablesin
x
T
B
~
QB
T
x
.Thislossfunction
leadstoour˝nalnon-convexformulationforembedded:
min
W
;
f
B
g
;
~
Q
L
(
W
;
f
B
g
;
~
Q
;
f
nvc
;
X
;
Y
)
34
+

F
R
F
(
W
)+

I
R
I
(
f
B
g
;
~
Q
)
;
wheretheregularization
R
I
(
f
B
g
;
~
Q
)
canbeForbeniusnormorotherstructuralinformation
(e.g.
`
1
norm).Thedimension
r
of
B
canbechosenaccordingtotheneedofspeci˝c
applicationdemands,andcanbeselectedbycross-validation.Ingeneral,wechoose
r
˝
d
.
Wenotethatthestoragecomplexityforthefeatureinteractione˙ects(e.g.,tensor
Q
)is
reducefrom
O
(
d
2
K
)
to
O
(
dr
+
r
2
K
)
,whichisdramaticallysmallerthanthefulltensor,
especiallyinthehighdimensionalsettings.Wecouldusethefamilyofblockcoordinate
descentalgorithms[
198
]toalternativelysolvethevariables
W
;
f
B
g
,and
~
Q
,togetalocal
optimalsolution.
3.1.5Experiments
Inthissection,weperformexperimentsonbothsyntheticdatasetsandtworealworlddatasets
toevaluatethee˙ectivenessofourproposedMTILframework.
3.1.6SyntheticDataset
Inordertojustifythee˙ectivenessofmodelingthefeatureinteractionsandMTILframework,
wetestourmethodsonsyntheticdatasets.
3.1.6.1E˙ectivenessofmodelingfeatureinteractions
Inthissubsection,wetestwhethertheinteractionsbetweenfeaturescanbeproperlyhandled
byaddingtheinteractionterm
Q
.Todoso,wecreateasingletasksyntheticdatasetby
assuming:
y
=
Xw
+diag(
XQX
0
)+

;
(3.8)
35
Figure3.2:RMSEcomparisonbetweenRRandSTILontwosyntheticdatasetswithsample
sizeof1kand5k,respectively.
where
X
2
R
n

d
isthefeaturematrix,
y
2
R
n

1
istheresponses,
w
2
R
d

1
istheweight
vector,
Q
2
R
d

d
isasymmetric,low-ranksparsematrix,whichrepresentsthefeature
interactionsinthedataset,and

˘N
(0
;
0
:
01
I
n
)
istheadditivenoiseterm.Wegenerate
20syntheticdatasetswithdi˙erentsizes(1000or1kand5000or5k)anddi˙erentfeature
dimensions(varyingfrom10to100,steppedby10)byrandomlyselecting
X
,
w
,and
Q
and
computing
y
accordingtoEq.(3.8).
We
usesingle
taskfeatureinteractionlearningmodel(STIL)toevaluatethee˙ectiveness
oftheinteractionterm
Q
:
min
w
;
Q
n
X
i
=1
1
2
jj
x
T
i
w
+
x
T
i
Qx
i

y
i
jj
2
2
+

2
jj
w
jj
2
2
+

jj
Q
jj
1
;
1
;
where
w
2
R
d

1
istheweightvector,
Q
2
R
d

d
isthefeatureinteractionmatrix,and
k
Q
k
1
;
1
=
P
i
P
j
j
Q
i;j
j
denotesthe
`
1
;
1
norm.
WecomparedtheRootMeanSquareError(RMSE)betweentheRidgeRegression(RR)
andSTILonbothofthesyntheticdatasets.AstheresultsshowinFigure3.2,STIL
outperformsRRonbothofthedatasets,whichshowsthee˙ectivenessofmodelingthe
36
Figure3.3:Syntheticdataset(Multi-task):RootMeanSquareError(RMSE)comparisons
amongallthemethods.TheY-axisisRMSE,X-axisisdimensionoffeatures.
featureinteractioninthedata.Besides,STIL-5k(RR-5k)performsbetterthanSTIL-1k
(RR-1k),whichdemonstratesthatthelearningmodelswillcapturetheunderliningmodels
ofthedatabetterwithlargertrainingsize.Alsonotethatwiththenumberofdimensions
increases,STILwillgraduallyover˝tthedata,becauseofthedramaticincreaseofthe
interactionsbetweenfeatures.
3.1.6.2E˙ectivenessofMTIL
Inordertotestthee˙ectivenessofMTIL,wegenerateamulti-tasksyntheticdataby
assuming:
y
t
=
X
t
w
t
+diag(
X
t
Q
t
X
T
t
)
;t
=1
;
2
;
3
;::;T;
where
X
t
2
R
n

d
isthefeaturematrixoftask
t
,
y
t
2
R
n

1
istheresponsesoftask
t
,
W
2
R
d

T
=[
w
1
;
w
2
;
w
3
;:::;
w
T
]
istheweightsfortasks.AsdescribedinSection3.1.4.3,
wegeneratefeatureinteractionmatrix
Q
t
=
Bq
t
B
T
andprojectitintoasparse,symmetric
space.
Inthisexperiment,wegenerate5datasetswithdi˙erentfeaturedimensionsfrom10to
37
50,steppedby10,byrandomlyselecting
X
t
,
w
t
,
B
and
q
t
.
Thepredictiveperformanceofthemethodsoutlinedbelowareexaminedonthesynthetic
multi-taskdatasets:

RidgeRegression(RR):Wechoosethismodelasthebaselineandmakeneitherassump-
tionsoffeatureinteractionnortherelationamongallthetasks.

STIL:WeperformSTILoneachofthetaskindependently.

MTL-L:ThisapproachreferstothetraditionalMTLmethodregularizedbythetrace
normoftheweightmatrix
W
[
10
].Itdoesnotmakeassumptionsonfeatureinteractions.

MTIL-L-S:Thisapproach,referstomulti-taskfeatureinteractionlearningregularized
bythetracenormoftheweightmatrix
W
andthetensorgrouplassonormoftensor
Q
(seesection3.1.4.2).

MTIL-S-S:ThisapproachissimilartoMTIL-L-Sexceptthattheregularizationterm
on
W
is
`
2
;
1
norm.

MTIL-L-Lc:Thisapproachreferstomulti-taskfeatureinteractionlearningregularized
bythetracenormoftheweightmatrix
W
andlatenttracenormoftensor
Q
(see
section3.1.4.3).

MTIL-S-Lc:ThisapproachissimilartoMTIL-L-Lcexceptforthattheregularization
termon
W
is
`
2
;
1
norm.

MTIL-L-Ln:Thisapproachrefertomulti-taskfeatureinteractionlearningregularized
bythelowranknormoftensor
Q
andthetracenormoftheweightmatrix
W
(see
section3.1.4.3).
38

MTIL-S-Ln:ThisapproachissimilartoMTIL-L-Lnexceptforthattheregularization
termon
W
is
`
2
;
1
norm.
Figure3.3comparestheRMSEoftheabovemethodsonthe5syntheticdatasets.We
canseethatMTIL-L-LnandMTIL-S-Lnarenotthatsensitivetothechangeoffeature
dimensions,thankstothelow-rankassumptiononthefeatureinteraction.Also,RRand
MTL-Lshareasimilarperformance,whichisconsistentwiththefactthatwedidnotassume
anylow-rankstructureinthissyntheticdataset.NotethatalthoughSTILperformsalmost
thebestonlowdimensionaldata,itsperformancedeterioratesrapidlycomparedwithother
MTILmethods,duetotheincapabilityoflearningthefeatureinteractionsacrosstasks.
3.1.7SchoolDataset
Thisdatasetcontainstheexaminationrecordsof15362studentswith28featuresfrom
139schoolsinyearsof1985,1986and1987,providedbytheInnerLondonEducation
Authority(ILEA).Inthisdataset,eachtaskistopredictexamscoresforstudentsinoneout
ofthe139schools.Weperform4setsofexperimentsbyvaryingtheamountoftrainingsize,
from20%to50%ofthetotalsamplesize.Wetesttheapproachessummarizedinsection3.1.6.2
andtunetheparameterson

R
inset
[10

1
;
10
0
;:::;
10
9
;
10
10
]
.ForMTIL-L-LnandMTIL-S-
Lnmethods,therankofmatrix
r
foreachtaskaretunedin
[2
;
3
;:::;
19
;
20]
.ForMTIL-L-S
andMTIL-L-Lc,wetunetheregularizationparameters

I
in
[10

1
;
10
0
;:::;
10
9
;
10
10
]
.
TheexperimentalresultsareshowninTable3.2.First,formostofthemethods,RMSE
willdecreasewhenthetrainingsizeincreases.Thismeansthatprovidingmoredatainthe
trainingsetwillhelpovercometheover˝ttingproblem.Also,wefoundthattheperformance
ofembeddedfeatureapproaches(i.e.MTIL-L-Lc,MTIL-L-Ln,MTIL-S-Ln)areworsethan
thesingletasklearningapproach.Thereasonbehindthisisthatembeddedfeatureapproaches
39
Table3.2:PerformancecomparisonMTILandbaselinesontheSchooldataset
Training20%
Training30%
Training40%
Training50%
RR
0.9149

0.0031
0.9025

0.0058
0.8885

0.0067
0.8722

0.0059
STIL
0.9149

0.0031
0.9025

0.0057
0.8885

0.0067
0.8721

0.0058
MTL-L
0.8998

0.0044
0.8807

0.0052
0.8657

0.0032
0.8503

0.0070
MTIL-L-S
0.8623

0.0048
0.8506

0.0038
0.8511

0.0043
0.8404

0.0067
MTIL-S-S
0.8999

0.0063
0.8907

0.0049
0.8832

0.0077
0.8686

0.0046
MTIL-L-Lc
0.9252

0.0090
0.8893

0.0037
0.8859

0.0037
0.8720

0.0044
MTIL-S-Lc
0.9353

0.0133
0.9139

0.0053
0.8941

0.0024
0.8761

0.0062
MTIL-L-Ln
1.0084

0.0180
0.9758

0.0097
0.9328

0.0267
0.9041

0.0140
MTIL-S-Ln
1.0026

0.0368
0.9585

0.0059
0.9297

0.0253
0.8965

0.0066
donothavesparseconstraintsontheinteractionterm,whichwillseverelyover˝tthedata
whenthereisnotsu˚cienttrainingsamples.Additionally,theMTL-LandMTIL-L-Sobtain
betterperformancethansingletasklearning,whichindicatesthatthelow-rankstructure
sharedbytasksaree˙ectivelycapturedbythelow-rankassumptioninthesetwomethods.
Moreover,MTIL-L-Smethodoutperformsallothermethods,whichempiricallyprovesthe
e˙ectivenessoflearningthesharedinteractionswithsparseconstraints.
3.1.8ModelingAlzheimer'sDisease
TheAlzheimer'sDiseaseNeuroimagingInitiative(ADNI)database(adni.loni.ucla.edu),which
waslaunchedin2003asa5-yearpublic-privatepartnership,isaimedtotestwhether
thepositronemissiontomography(PET),serialmagneticresonanceimaging(MRI),other
biologicalmarkers,andclinicalandneuropsychologicalassessmentscanbecombinedto
measuretheprogressionofmildcognitiveimpairment(MCI)andearlyAlzheimer'sdisease
(AD).Wefollowtheprocedureofpreprocessingmentionedin[
234
]andobtain648subjects
and305MRIfeatures.Theparametersaretunedinthesamewayaswedescribedin3.1.7.
TheRMSEcomparisonresultisshowninTable3.3.First,wefoundthatalloftheMTLs
outperformthesingletasklearningapproaches(RRandSTIL),whichdemonstratesthe
40
Table3.3:PerformancecomparisonMTILandbaselinesontheADNIdataset.
RMSE

standarddeviation
RR
0.9418

0.0023
STIL
0.9417

0.0021
MTL-L
0.9031

0.0007
MTIL-L-S
0.9030

0.0007
MTIL-S-S
0.9162

0.0017
MTIL-L-Lc
0.8941

0.0050
MTIL-S-Lc
0.8909

0.0059
MTIL-L-Ln
0.8926

0.0009
MTIL-S-Ln
0.9085

0.0028
e˙ectivenessoflearningmultipletasksjointlybyexploringtherelatednessbetweentasks,
aswellastheexistenceoftheunderlyingrelatednessbetweentasksintheADNIdataset.
Second,theRMSEresultsofMTIL-L-SandMTL-Larecomparablewitheachother,which
indicatesthatthemultipletasksinthisdatasetdonotsharethesamefeatureinteraction
structure.Finally,theresultofMTIL-S-Lcmethodoutperformsallothermethods,which
showssuperiorityofourfeatureinteractionframework.Throughamixtureof3low-rank
tensor,weareabletolearnthefeatureinteractionpatterninthisdataset.
3.1.9Discussion
Theproposedmulti-taskfeatureinteractionlearningframeworkhasprovidedusawayto
bridgerelatedtasksusinginteractione˙ects.Byemployingdi˙erenttypesofregularizations
ontheinteractione˙ectstensor,theformulationsunderthisframeworkhaveverydi˙erent
characteristics.
Forthesharedinteractionapproach:weutilizeGroupLassoontheinteractiontensorto
controlthemodelcomplexity.Theproximaloperatoradmitsaclosedformsolution,andthus
theoverallcomputationalcostisverylow.Weareabletoobtaininterpretableresultsfrom
themodel,showingwhatareimportantinteractionsthatarerelevanttothepredictiontasks.
41
Themaindrawbackisthatweassumealltaskssharethesamesetofinteractione˙ects,which
maynotbethecaseformanydatasets.Onewaytofurtherimprovetheformulationisby
extendingthestrongorweakheredityproperties[
20
,
124
]totheproposedMTILframework.
Fortheembeddedinteractionapproach:wecaneasilyobtaintheglobaloptimalforthe
convexformulation.Thoughweareabletotunetheregularizationparameteronthetrace
normstocontroltherankoftheinteractiontensor,itisusuallyveryhardtodecidethevalue
unlesscross-validationisused.Aranklargerthannecessarymayleadtoover-˝ttingwhen
trainingsamplesareinsu˚cient.Ontheotherhand,theobtainedmixtureof3tensorishard
tointerpret.Thenon-convexformulationprovidesabettermodeldecomposition,fromwhich
wecanseethecombinationofbasisfordi˙erenttasksandidentifyembeddedbasesthat
aresharedamongthesetoftasks.Thedrawbackofthisformulationisthatwemayeasily
trappedinabadlocaloptimalunlesswecarefullychoosetheinitialvalue(e.g.,usingthe
solutionfromtheconvexformulation).
Ingeneral,thisframeworkcanbe
generalized
intomanyotherpossiblerelatednesson
featureinteractionsbyincorporatingdi˙erentregularizationterms.Di˙erentapproachesof
thisframeworkshouldbecarefullychosenaccordingtotheapplicationdomain.Inthefuture
workweplantostudythestatisticalpropertiesoftheproposedmodel,whichmayleadto
deeperunderstandingoftheseinteractionmodels.
3.2Multi-TaskRelationshipLearning
3.2.1Introduction
Supervisedlearninghasbeenawellstudiedareaofmachinelearningandtherearemany
e˚cientalgorithmstolearnfromdataandgeneratepredictivemodelstoinferlabelsfor
42
unseendatapoints.Asextensivelystudiedinthestatisticallearningtheory,thequantityand
qualityofthelabeledtrainingdataisthekeytohigh-performancemodels.Unfortunately,
eveninthebigdataera,obtaininglabeledinstancesinmanyrealworlddomainssuchas
biologyandhealthcarestillincurssubstantialcost.Forexample,theNationalInstituteof
Agingfundedover$60milliontoAlzheimer'sdiseaseneuroimaginginitiativetostudythe
diseaseanddataarecollectedfromlessthan1000patients.Thelimitedsamplesizelargely
restrictedthestudyofdiseaseprogressionwithmanypossiblebiomarkers.
Interestingly,whilemachinelearningdemandsalargesetoftrainingsamplestolearn
simpleconcepts,thelearningprocessofhumanbeingsallowsuslinkalearningtaskwith
whatwehavelearnedbeforeandthusweareabletolearncomplicatedcognitiveconcepts
withmuchlesstrainingsamples.Motivatedbythishumanlearning,themulti-tasklearning
(MTL)paradigmlearnsrelatedmachinelearningtaskssimultaneouslyandperformsinductive
knowledgetransferamongthetaskstoimprovetheirthegeneralizationperformance.MTLhas
manysuccessfulapplicationsinboard˝eldssuchasdatamining,computervision,textmining,
bioinformaticsandhealthcareanalytics[
101
,
228
,
68
].Forexample,capturingtemporal
relatednessamongmultiplelearningtasksallowsresearcherstobuildhighperformancedisease
progressionmodelsforAlzheimer'sdiseasebytransferknowledgeamongtimepoints[234].
OneapproachtolearningmultipletasksisbasedontheregularizedMTLframework[
55
].
TheregularizedMTLisextensivelystudiedbecauseofits˛exibilitytoincorporatevarious
learningobjectivessuchasleastsquares,logisticregressionandhingeloss,andtoextend
themwithdi˙erentkindsofassumptionsonhowtasksarerelated.Examplesofsuchtask
relatednessregularizationsincludesharedsetsoffeaturesviasparsityinducingnorms[
120
],
sharedlow-dimensionalsubspaceviathenuclearnorm[
10
],andclusteringstructuresvia
spectral
k
-means[
232
].Thesameframeworkcanaccommodatemorecomplicatedassumptions
43
suchasdirtymodels[
88
]androbustmodels[
37
,
66
].Moreover,e˚cientimplementations
havebeendevelopedforregularizedMTL,whichcanbeeasilyextendedtonewregularization
terms[233].
ManyoftheregularizedMTLmethodsheavilydependonthepriorknowledgeoftask
relatedness.In[
54
,
96
,
171
],forexample,thepriorknowledgeoftaskrelatednessareassumed
tobeknownandisthentransferedtoregularizationtermstoguidethelearning.However,
therelationshipforalltasksmaynotalwaysbeavailable.Toaddressthisproblem,the
multi-taskrelationshiplearning(MTRL)approaches[
227
,
58
,
225
]arestudiedto
learnthe
taskrelationship
intheformofa
taskcovariancematrix
fromthedata,representinghow
similararethetwotasks.Thesemethodshavebeenshowntobemoree˙ectivethanothers
insomelearningproblems.However,recallthatinMTLthetrainingsamplesaretypically
insu˚cient,andthuswemaynotalwaysbeabletoinferreliabletaskrelationshipfromthe
trainingdata.Ifmisleadingtaskcovariancematrixislearnedfrominsu˚cientandnoisy
data,thesubsequentknowledgetransferguidedbysuchcovarianceinformationwillnotbe
performedtowardstherightdirectionasweexpected,andleadtosuboptimalmodels.
Figure3.4:OverviewoftheproposediMTRLframework,whichinvolveshumanexpertsin
theloopofmulti-tasklearning.Theframeworkconsistsofthreephases:(1)
Knowledge-
awaremulti-tasklearning
:learningmulti-tasklearningmodelsfromknowledgeanddata,(2)
Solicitation
:solicitingmostinformativeknowledgefromhumanexpertsusingactivelearning
basedquerystrategy,(3)
Encoding
:encodingthedomainknowledgetofacilitateinductive
transfer.
44
Inmanyapplicationsthehumanexpertsmayhavesomedomainknowledgeabouthow
someoftasksarerelated.Forexample,thephysiciansmayindicatethepredictivemodels
oftwodiseasemodelsshouldbeverysimilarduetothesimilarityinthetheirpathological
pathwaysordynamicsinphysiology.Inthosesituations,solicitingandincorporatingthese
domainknowledgeinthelearningcoulddramaticallyimprovethegeneralizationperformance
oflearningmodels.Unfortunately,tothebestofourknowledge,littleresearchhasbeen
doneonthisarea.Weidenti˝edafewkeyquestionsinarea:(1)Whattypeofdomain
knowledgeissuitableforguidingMTL?(2)Howthesoliciteddomainknowledgecanbe
e˙ectivelyincorporatedintotheMTLformulations;and(3)Howthedomainknowledgecan
bee˚cientlysolicited?
ToaddresstheaforementionedchallengesinMTL,thispapersystematicallyinvestigated
theabovequestionsandproposeanovelinteractivemulti-taskrelationshiplearning(iMTRL)
framework.Speci˝cally,intheiMTRLframeworkweproposetosolicitthedomainknowledge
intheformofpartialorderbetweentwopairsoftasks,whichisequivalenttoapairwise
relationshipbetweentwoelementsinthetaskcovariancematrix.Toe˙ectivelyincorporate
thepartialorderknowledge,weproposeaknowledgeawareMTRL(kMTRL)formulation,
whichlearnsataskcovariancematrixconstrainedbythepartialorderrelationshipsin
thedomainknowledge.Wedevelopane˚cientoptimizationalgorithmfortheproposed
kMTRL.Moreover,sincehumanlabelingisveryexpensiveevenforweaksupervisionliketasks
relationship,weproposeane˚cientquerystrategyforknowledgesolicitation.Weevaluate
theproposediMTRLframeworkonbothsyntheticandrealdatasetsanddemonstrateits
e˚ciencyande˙ectiveness.
Notation:
Weuselowercaseletterstodenotescalars,lowercaseboldletterstodenotevectors
(e.g.
x
),uppercaseboldletterstodenotematrices(e.g.

).Weuse
R
todenotethesetof
45
realnumbersand
R
+
(
R
++
)todenotethesubsetofnon-negative(positive)ones.If
x
2
R
d
,
the
p
-normofvector
x
isgivenby
k
x
k
p
=(
P
d
i
=1
k
x
i
k
p
)
1
p
.If
A
2
R
d

K
,weuse
a
j
2
R
d
to
denotethe
j
thcolumnof
A
and
e
a
i
2
R
T
todenotethe
i
throwof
A
.Forall
r;p>
1
,we
de˝nethe
l
p;q
normof
A
as
k
A
k
p;q
=(
P
d
i
=1
k
e
a
i
k
q
p
)
1
p
.Thesetof
K
integersisdenotedas
N
K
=[1
;:::;K
]
.Weuse
I
d
todenotea
d

d
identitymatrix,and
1
d
todenotea
d
dimension
vectorwithallelementsare1.Unlessstatedotherwise,allvectorsarecolumnvectors.
3.2.2RelatedWork
Multi-tasklearning.
MTLhasbeensuccessfullyappliedtosolvemanychallengingmachine
learningproblemsinvolvingmultiplerelatedtasks.RecentlytheregularizationbasedMTL
approachhasreceivedalotofattentionbecauseofits˛exibilityande˚cientimplementations.
OnemajorresearchdirectioninregularizedMTListoencodetherelationshipamong
tasks[
54
,
96
,
58
,
227
,
171
,
22
].TheregularizedMTLalgorithmscanberoughlyclassi˝edinto
twotypes:the˝rstinvolvesassumptionsabouttaskrelatedness,whicharethen
intoproperregularizationtermsintheregularizationtoinferasharedrepresentation,that
servesasthemediaofknowledgetransfer.Anexampleisthelow-rankMTL[
54
,
96
,
171
],
whichseeksasharedlow-dimensionalsubspaceintaskmodels,andthetasksarerelated
throughthesharedsubspace.Onepotentialissueinsuchmethodsisthatthepriorknowledge
maynotalwaysaccurateandtheassumptionmaynotbesuitableforalltasks.Lateronsome
studiesfocusoninferthetaskrelationshipfromthedataset[
227
,
58
,
22
],e.g,bylearninga
varianceovertasks.Sincethelearnedcovariancematrixgoverningtheknowledge
transferisalsolearnedfromdata,thesemethodsisheavilydependentonthequalityand
quantityofthetrainingsamplesavailable.Whenaninaccuratetaskrelationshipislearned,it
willleadtopointtheknowledgetransferinawrongdirectionandleadtosuboptimalmodels,
46
aswillbeshowninourempiricalstudies.Toalleviatetheproblemofexistingmodels,we
proposeanactivelearningframeworkwhichcaninteractivelylabelthegroundtruthoftask
relationshipintolearningmodelandguidecorrectknowledgetransfer.
ActiveLearning.
Therearetwocommoncategoriesofactivelearning:thepoolbasedand
thebatchmode.Thepoolbasedactivelearningapproachesselectthemostinformative
unlabeledinstanceiteratively,whichisthenlabeledbyuser,withthegoaloflearningabetter
modelwithlesse˙orts[
173
].Theselectionprocessisoftenreferredasa
query
.However,such
sequentialqueryselectionstrategyisine˚cientinmanycases,i.e.addingonelabeleddata
pointatatimeistypicallyinsu˚cienttosubstantiallyimprovetheperformanceofmodel,
andthusthetrainingprocedureisveryslow.Incontrast,thebatch-modeactivelearning
approachesselectasetofmostinformativequeryinstancessimultaneously.Tothebestof
ourknowledge,allpreviousactivelearningfocusonhowtoselectagroupofmostinformative
instancesortrainingsamples.Here,weinsteadproposeanovelquerystrategytoquery
anothertypeofsupervision:taskrelationship.Thissupervisionisintuitivebutcomeswitha
signi˝cantchallenge,i.e.,mostpreviousactivelearningstrategiescannotbedirectlyapplied.
Inourstudythetasksupervisionisrepresentedbypartialorderswhichleadtopair-
wiseconstraints.Thereareafewpreviousstudiesonthee˙ectivenessofthepairwise
constraints[
215
,
70
]underactivelearningframework.In[
70
],aclusteringalgorithmnamed
Active-PCCAwasproposedtoconsiderwhethertwodatapointsshouldbeassignedtothe
sameclusterornot,bywhichitbiasesthecategorizationtowardstheoneexpected.Themost
informativepairwiseconstraintsareselectedusingthedatapointsonthefrontierofthoseleast
well-de˝nedclusters.In[
215
],theauthorsstudiedasemi-supervisedclusteringalgorithmwith
aquerystrategytochoosepairwiseconstraintsbyselectingthemostinformativeinstance,as
wellasdatapointsinitsneighborhoods.ThepairwiseconstraintsareintheformofMust-link
47
andCannot-link,whichrestricttwodatapointsshouldbeinthesameclassornot.However,
thosemethodsaredevelopedforclusteringalgorithms.Howtoselectpairwiseconstraintson
taskrelationshipthataresuitablefortheMTLframeworkremainstobeanopenproblem.
Inthiswork,westudyquerystrategiesfortaskrelationshipsupervision,includingonenovel
strategybasedontheinconsistencyoflearningmodel.
InteractiveMachineLearning.
Interactivemachinelearning(IML)isasystematicwayto
includehumaninthelearningloop,observingtheresultsoflearningandprovidingfeedback
toimprovethegeneralizationperformanceoflearningmodel[
6
].Ithasprovidedanatural
waytointegratebackgroundknowledgeintothelearningprocedure[
7
,
9
,
210
,
8
].Forexample,
thesystemcallederception-based(PBC)[
9
]hasbeenpioneeredtoo˙eran
interactivewaytoconstructdecision.ThePBCisabletoconstructasmallerdecisiontree
buttheaccuracyachieveddoesn'thassigni˝cantimprovecomparedtootherdecisiontree
methodssuchasC4.5.Thedecisionconstructionhasbeenfurtherextendin[
210
].Theyalso
foundoutthatuserscanbuildgoodmodelsonlywhenthevisualizationareapparentintwo
dimension.Manualclassi˝erconstructionisnotsuccessfulforlargedatasetinvolvinghigh
dimensioninteraction.In[
7
],anend-userIMLsystem(ReGroup)areproposedtobeableto
helppeoplecreatecustomizedgroupsinsocialnetworks.In[
8
],theauthorsdevelopedan
IMLsystemnamedas(CueT)tolearnthetriagingdecisionaboutnetworkalarminahighly
dynamicenvironment.Inthispaper,iMTRLisproposedtocombinethedomainknowledge
intermsoftaskrelationshiptobuildlearningmodels.Ourworkisexploringacompletely
novelproblemcomparedtothepreviousstudiesininteractivemachinelearning.
48
3.2.3InteractiveMulti-TaskRelationshipLearning
Inthissection,we˝rstreviewthestrengthsandpotentialissuesofthemulti-taskrelationship
learninginSubsection3.2.3.1,whichmotivatetheoverarchingframeworkoftheproposed
interactivemulti-taskrelationshiplearning(iMTRL)inSubsection3.2.3.2.Subsection3.2.3.3
presentstheknowledge-awareMTRL(kMTRL)formulationandalgorithm.Subsection3.2.3.5
introducesthenovelbatchmodeknowledgequerystrategybasedonactivelearning.
3.2.3.1RevisittheMulti-taskRelationshipLearning
BeforediscussingtheiMTRLframework,werevisitthemulti-taskrelationshiplearning
(MTRL)[
227
],onepopularMTLmodelthatlearnsnotonlythepredictionmodelsbutalsotask
relationship.TheMTRLframeworkhasawellfoundedBayesianbackground.Assumewehave
K
relatedlearningtasks,andineachtaskwearegivenadatamatrixandtheircorresponding
responses.Let
d
bethenumberoffeatures.Forthetask
k
,wearegiven
m
samplesandtheir
correspondingresponses,collectivelydenotedby
X
k
=[(
x
k
1
)
T
;(
x
k
2
)
T
;
:::
;(
x
k
m
)
T
]
2
R
m

d
and
y
k
2
R
m
.Weassumethattheresponsescomefromalinearcombinationoffeatures
withaGaussiannoise,sothatforsample
j
fromtask
i
,wehave
y
i
j
=
w
T
i
x
i
j
+
b
i
+

i
,where
distributionofthenoiseisgivenby

i
˘N
(0
;
2
i
)
.Thegoalofthelearningistoestimatethe
taskparameters
W
=[
w
1
;:::;
w
K
]
andbiasterm
b
=[
b
1
;:::;b
K
]
forall
K
tasksfromdata.
Basedontheassumptionwecanwritethelikelihoodof
y
i
j
given
x
i
j
;
w
i
;b
i
,and

i
isgiven
by:
p
(
y
i
j
j
x
i
j
;
w
i
;b
i
;
i
)
˘N
(
w
T
i
x
i
j
+
b
i
;
2
i
)
;
where
N
(
m
;

representsthemultivariatedistributionwithmean
m
andcovariancematrix
49

[21].Theprioron
W
=(
w
1
;:::;
w
K
)
isgivenby:
p
(
W
j

i
)
˘
0
@
K
Y
i
=1
N
(
w
i
j
0
d
;˙
2
i
I
d
)
1
A
q
(
W
)
;
where
I
d
2
R
d

d
istheidentitymatrix.The˝rsttermistheextensionofridgepriortothe
multi-tasklearningsetting,whichcontrolsthemodelcomplexityofeachtask
w
i
.Thesecond
termreferstothetaskrelationship,inwhichMTRLtriestolearnthecovarianceof
W
using
amatrix-variatenormaldistributionfor
q
(
W
)
q
(
W
)=
MN
d

K
(
W
j
0
d

K
;
I
d


)
;
where
MN
d

K
(
M
;
A

B
)
denotesmatrix-variatenormaldistributionwithmean
M
2
R
d

K
,
rowcovariancematrix
A
2
R
d

d
andcolumncovariancematrix
B
2
R
K

K
.Accordingto
theBayes'stheorem,theposteriordistributionfor
W
isproportionaltotheproductofthe
priordistributionandthelikelihoodfunction[21]:
p
(
W
j
X
;
y
;
b
;˙;

)
/
p
(
y
j
X
;
W
;
b
;
)
p
(
W
j

;˙
)
;
(3.9)
where
X
collectivelydenotesthedatamatrixfor
K
tasksand
y
=[
y
1
;:::;
y
k
]
denoteslabels
foralldatapoints.
BytakingnegativelogarithmofEq.(3.9),themaximumaposterioriestimationof
W
andmaximumlikelihoodestimationof

isgivenby:
min
W
;

K
X
k
=1
1

2
k
k
y

X
k
w
k

b
k
1
n
k
k
2
F
+
1
˙
2
k
tr(
WW
T
)+tr(


1
W
T
)+
d
ln(

)
:
(3.10)
50
Intheaboveformulation,thelastterm
d
ln
(

)
controlsthecomplexityof

andisa
concavefunction.Inordertoobtainaconvexobjectivefunction,theMTRLproposedto
use
tr
(

)=1
insteadtocontrolthecomplexityandproject

tobeapositivesemi-de˝nite
matrix.Assuch,theobjectivefunctionofMTRLisderivedasfollows:
min
W
;

K
X
k
=1
1
n
k
k
y
k

X
k
w
k

b
k
1
n
k
k
2
F
+

1
2
tr(
WW
T
)
(3.11)
+

2
2
tr(


1
W
T
)
:
s.t.


0
;
tr(

)=1
Analternatingalgorithmisproposedin[
227
]tosolvethisformulation.Thealgorithm
iterativelysolvestwosteps:˝rstitoptimizesEq(3.11)withrespectto
W
and
b
when

is˝xed;itthenoptimizestheobjectivefunctionwithrespectiveto

,whichadmitsa
closed-formsolution:

=(
W
T
W
)
1
=
2
=
tr((
W
T
W
)
1
=
2
)
:
(3.12)
WenotethatthereisafeedbackloopinthelearningofMTRLasillustratedabove.The
MTRLachievesknowledgetransferamongtaskmodelsviathetaskrelationmatrix

,and
thetaskmodelswillbeusedtoestimate

.Ifthe

canbelearnedcorrectlyorcanclosely
representthetruetasksrelationship,itwillbene˝tlearningonthetasksparameters
W
by
guidingtheknowledgetransferinagooddirection.Inturn,thebettertasksparameterswill
helpthealgorithmtoidentifyamoreaccurateestimationof

.Thepositivefeedbackloop
isthekeytohelpbuildingagoodMTRLmodel.Onthecontrary,thetrainingprocedure
willbebiasedtowrongdirectiononcewekeepgettingmisleadingfeedbacksintheloop.To
bemorespeci˝c,oncedataiseitherlow-qualityorinsu˚cient-quantity,the

willindicate
aninaccuratedirectiontotransfertheknowledgeamongtasks,whichleadstoanegative
51
feedbackintheloop.Thiswillenduplearningamodelwithpoorgeneralizationperformance,
examplesofwhichwillbeelaboratedintheempiricalstudies.
AnotherremarkisthatinEq.(3.11),duetotherelaxation,thesolutionof

isnolonger
theextractsolutionfromthemaximumlikelihoodestimationofcolumncovariancematrix
derivedfromEq.(3.10).TheadvantagesoftheobjectivefunctioninEq.(3.11)comparedto
Eq.(3.10)havebeendiscussedindetailsin[
227
].Wewouldliketofurtherpointoutthatthe
learned

isactuallyabetterrepresentationoftasksrelationshipthanthecolumncovariance
matrix.Recallthatthecovariancesuggeststheextentthatelementsintwovectorsmove
tothesamedirection.Supposewehavetasksparameters
W
2
R
d

K
,theunbiasedsample
covariancecanbecomputedby
C
=
W
T
c
W
c
=
(
d

1)
,where
W
c
=
W

1
T
d
1
d
W
=d
isthe
centralizedtasksmodels.Thismeasureisonlymeaningfulwhenthereareenoughnumber
ofdimension
d
andthevariancecontainsintasksparameters.If
W
=[1
;

2;1
;

2]
,the
covariancematrixwillreturnanall-zeromatrixwhichwillnotindicateacorrectrelationship.
Instead,anaccurateestimationcanbeinferredbyusingEq.(3.12).Wecanobtaina
correlationmatrix
Corr
=[1

1;

1
;
1]
from

.
Theabovediscussionsleadtotwoimportantconclusions:(1)The

canindicateagenuine
taskrelationship.(2)Maintaininganaccurate

isthekeyinthislearningprocedure.
3.2.3.2TheiMTRLFramework
InMTLscenarios,thequalityandquantityoftrainingdatausuallyimposesigni˝cant
challengestothelearningalgorithms.Thetaskcovariancematrix

inferredfromthedata
maynotalwaysgiveanaccuratedescriptionofthetruetaskrelationship,whichinturn
wouldprevente˙ectiveknowledgetransfer.Fortunately,inmanyreal-worldapplications,
humanexpertspossessindispensabledomainknowledgeaboutrelatednessamongsometasks.
52
Forexample,whenbuildingmodelspredictingdi˙erentregionsofthebrainfromclinical
features,neuroscientistandmedicalresearchercanrevealimportantrelationshipamongthe
regions.Assuch,solicitfeedbackfromhumanexpertsontaskrelationshipandencodethem
assupervisionisespeciallyattractive.Toachievethisgoalweneedtoanswerthefollowing
problems:
1.
Whattypeofknowledgerepresentationcanbee˚cientlysolicitedfromhumanexperts,
andalsocanbeusedtoe˙ectivelyguidethelearningalgorithms?
2.
HowtodesignMTLalgorithmthatcombinesthedomainknowledgeanddata-driven
insights?
3.
Howtoe˙ectivelysolicitknowledge,reducingtheworkloadofthehumanexpertsby
supplyingonlythemostimportantknowledgethata˙ectsthelearningsystem?
Inthispaperweproposeaframeworkofinteractivemulti-taskMachinelearning(iMTRL),
whichprovidesanintegratedsolutiontoaddresstheabovechallengingquestions.The
frameworkisillustratedinFig3.4.TheiMTRLisaniterativelearningprocedurethat
involveshumanexpertsintheloop.Ineachiteration,thelearningprocedureinvolvesthe
following:
1.
Encoding
.Thedomainknowledgeoftaskrelationshipisrepresentedaspartialorders,
andcanbeencodedinthelearningaspairwiseconstraints.
2.
Knowledge-AwareMulti-TaskLearning
.WeproposeanovelMTLalgorithmthatinfers
modelsandtaskrelationshipfromdataandconformthesolicitedknowledge.
3.
ActiveLearningbasedKnowledgeQuery
.Tomaximizetheusefulnessofsolicited
knowledge,weproposeaknowledgequerystrategybasedonactivelearning.
53
Itisnaturalandintuitivetousepartialordersastheknowledgepresentationfortask
relationship.Queryaquestionthatwhetherthetask
i
and
j
aremorerelatedthantask
i
and
k
ismucheasierthanaskingtowhichextentthetask
i
and
j
arerelatedtoeachother.For
example,
i
thtaskand
j
thtaskshaspositiverelationshipwhilethe
i
thtaskand
k
thtaskhas
negativerelationship,thenthisrelationshipisrepresentedbyapartialorder

i;j


i;k
.The
focusofthispaperisthealgorithmdevelopmentforiMTRLandwemakeafewassumptions
toalleviatecommonissuesinusingthispresentationandsimplyourdiscussions:
Assumption1.
Thedomainknowledgeacquiredfromhumanexpertisaccurate.Theexpert
maychoosenottolabelifhe/sheisnotcon˝dent.
Assumption2.
Theacquiredpartialordersarecompatible,i.e.when

i;j
>

i;k
and

i;k
>

k;p
areestablished,the

i;j
<

k;p
cannotbeincluded.
Ifthissituationhappens,wecandiscardthelessimportantconstraintsandmakethe
remainconstraintsbecompatible.Theimportanceofconstraintscanbemeasuredbythe
InconsistencywhichwewillintroducedinDe˝nition2.
3.2.3.3Aknowledge-awareextensionofMTRL
AssumeinthecurrentiterationofiMTRL,ourdomainknowledgeisstoredinaset
T
de˝ned
by:
T
=
f

:

i
1
;j
1


i
2
;j
2
8
(
i
1
;j
1
;i
2
;j
2
)
2
S
g
;
(3.13)
whereeachpairwiseconstrainthasspeci˝edapreferredhalf-spacethatanidealsolution

shouldbelongto,andtheset
S
containstheindexesoftasksselectedbyourquerystrategy.
Thepartialorderinformationismoreimportantthanthemagnitudeof

.Thereasonisthat
ifwemultiplyeachelementin

withascalar
a
,it'sequaltosolvetheEq.(3.15)replacing
54

2
with

2
[
51
].Hence,themagnitudeofelementsin

canbeadjustedsimultaneously
withoutchangingtheresults.Buttheorderofpairsin

isamoreimportantstructureto
encode.Thesealgorithmicadvantagesreinforcedourchoiceofusingpairwiseconstraintsto
representdomainknowledge.
WenotethattheconstraintsinEq.(3.13)wouldleadtoatrivialsolutionthat

i
1
;j
1
=

i
2
;j
2
8
(
i
1
;j
1
;i
2
;j
2
)
2
S
,whichisapparentlynotthee˙ectweseek.Toovercomethis
problem,weaddapositiveparameter
c
sothatwecanassuretheelementsin

preservethe
truepairwiseorder.Hence,theconvexset
T
ischangedto:
T
=
f

:

i
1
;j
1


i
2
;j
2
+
c;
8
(
i
1
;j
1
;i
2
;j
2
)
2
S
g
:
(3.14)
Theproposedknowledge-awaremulti-taskrelationship(kMTRL)learningextendsthe
MTRLbyenforcingafeasiblespacefor

speci˝edby
T
.Tothisend,thekMTRLformulation
isgivenbythefollowingoptimizationproblem:
min
W
;
b
;

F
(
W
;
b
;

)=
K
X
k
=1
1
n
k
k
y
k

X
k
w
k

b
k
1
n
k
k
2
F
+

1
2
tr(
WW
T
)+

2
2
tr(


1
W
T
)
s.t.


0
;
tr(

)=1
;

2T
(3.15)
WenotethateventhoughtheproblemofkMTRLisconsideredtobemorechallenging
tosolvethanMTRLbecauseofadditionalconstraintsintroducedin
T
,thesolutionspace
ofkMTRLismuchsmallerbecauseeachconstraintcutsthesolutionspaceinhalf,andthe
optimizationalgorithmsmayconvergefasterinthiscase.
55
3.2.3.4E˚cientOptimizationforkMTRL
TheproposedkMTRLisaconvexoptimizationproblem,andweproposetosolveitusingan
alternatingalgorithm:
Step1:
We˝rstoptimizetheobjectivefunctionwithrespectto
W
and
b
givena˝xed

.
Thisstepcaneitherbesolvedusingthelinearsystem[
227
]oro˙-the-shelfsolverssuchas
CVX[
69
]andFISTA[
16
].Di˙erentsolverscanbeapplieddependingonthenatureofthe
data:˝rstordersolverssuchasFISTAismorescalablewhentherearemanysamples,while
solvinglinearsystemcanbemoree˚cientasfeaturedimensionishigh.
Step2:
Given
W
and
b
,theobjectivefunctionwithrespectto

isgivenbyananalytical
solutionusingEq.(3.12).
Step3:
The

isprojectedtotheconvexset:
T
=
f

j

2T
;


0
;
tr(

)=1
g
bysolvingtheEuclideanprojectionproblembelow:
min

k


^

k
2
F
;s:t:

2
T
wherethe
^

istheanalyticalsolutionweobtainedfromtheEq.(3.12).Thisobjective
functioncanbesolvede˚cientlyusingasuccessiveprojectionalgorithm[
76
]thatiteratively
projectsthesolutiontoeachconstraintintheset.
TheKKTanalysis[
35
]oftheaboveoptimizationproblemleadstothepropertysummarized
inTheorem1,andleadstoAlgorithm3.2.Tosimplifythediscussion,werequiresthetrue
pairordersareintheformof

i
1
;j
1


i
2
;j
2
.
56
Theorem1.
Supposethat
T
=
f

:

i
1
;j
1


i
2
;j
2
+
c
g
,then,forany

2
R
K

K
,the
projectionof

totheconvexset
T
isgivenby:
Proj(

)=

if

2T
;
otherwise
Proj(

)=


=
8
>
>
>
>
>
>
<
>
>
>
>
>
>
:


i
1
;j
1
=
1
2
(

i
1
;j
1
+

i
2
;j
2
+
c
)


i
2
;j
2
=
1
2
(

i
1
;j
1
+

i
2
;j
2

c
)


p;q
=

p;q
;
8
(
p;q
)
6
=(
i
1
;j
1)
and
(
i
2
;j
2)
Inpractice,theterm
W
T
W
isnotguaranteedtobeafullrankmatrix.Infact,ina
typicalMTLsetting
W
isalowrankmatrixandthusthe

calculatedbyEq.(3.12)isalso
arankde˝ciencymatrix.Moreover,recallthattheoperationthatprojects

toaconvex
sethasaveryhighchanceleadtoasingularmatrix.Thenumericalproblemsduringthe
inversionofthesingularmatrix

willleadtoameaninglessinverseoftaskrelationmatrix
andcorruptthetrainingprocedure.Therefore,weproposetosolveaperturbedversionof
ouroriginalobjectivefunctionEq.(3.15)asfollows:
min
W
;
b
;

F
(
W
;
b
;

)=
K
X
k
=1
1
n
k
k
y
k

X
k
w
k

b
k
1
n
k
k
2
F
+

1
2
tr(
WW
T
)+

2
2
tr(


1
(
WW
T
+

I
))
;
s.t.


0
;
tr(

)=1
;

2T
(3.16)
where
T
followsthede˝nitioninEq.(3.14).Asaresult,theanalyticalsolutionof

in
Step
57
Algorithm3.1:
knowledge-awareMulti-TaskRelationshipLearning(kMTRL)
Require:
Trainingdata
f
X
k
;
y
k
g
K
k
,constraintsset
S
,regularizationparameters

1
,

2
,a
positivenumber
c
.Randomlyinitialize
W
0
.

0
=
I
=d
.
1:
while
W
and

arenotconverge
do
2:
Compute
f
W
;
b
g
=argmin
W
;
b
F
(
W
;
b
;

)
3:
Compute

usingEq.(3.12)
4:

=Proj(

,
S
,
n
,
c
)
5:
endwhile
6:
return
W
,
b
,

Algorithm3.2:
Projectionalgorithm
Require:
Taskcorrelationmatrix

,constraintsset
S
,maxiteration
n
,apositivenumber
c
.
1:
for
i
=1
;:::;n
do
2:
while
8
(
i
1
;j
1
;i
2
;j
2
)
2
S
do
3:
if

i
1
;j
1
<

i
2
;j
2
then
4:

i
1
;j
1
=
1
2
(

i
1
;j
1
+

i
2
;j
2
+
c
)
5:

i
1
;j
1
=
1
2
(

i
1
;j
1
+

i
2
;j
2

c
)
6:
endif
7:
endwhile
8:
Dynamicupdate
c
=
c

0
:
9
9:
Project

tobeapositivesemi-de˝nitematrix
10:
if
Allconstraintsaresatis˝ed
then
11:
break
12:
endif
13:
endfor
14:
return

2.
isthusreplacedbythefollowing:

=(
W
T
W
+

I
)
1
=
2
=
tr((
W
T
W
+

I
)
1
=
2
:
(3.17)
ThealgorithmtosolvetheobjectivefunctionEq.(3.16)ispresentedinAlgorithm3.1.
Thisalgorithmcanbeinterpretedasalternatelyperformingsupervisedandunsupervisedsteps.
Inthesupervisedstepwelearnthetaskspeci˝cparameters(
W
and
b
).Inunsupervisedstep
wegetthetaskrelationshipmatrixfromthetaskparameters.Finally,thelastsupervised
stepweencodepriorknowledgetothetaskrelationshipmatrix

.Werepeatthesteps
iterativelyuntilconverge.
58
Algorithm3.3:
QueryStrategyofPairwiseConstraints
Require:
Thetaskcorrelationmatrix

,themodelparametermatrix
W
foralltasks,
thenumberofpairwiseconstraints
n
selectedtobequery
1:
Compute
^

=(
W
T
W
)
1
=
2
=
tr((
W
T
W
)
1
=
2
)
2:
while
8
(
i
1
;j
1
;i
2
;j
2
)
do
3:
Compute

(
i
1
;j
1
;i
2
;j
2
)
and
^

(
i
1
;j
1
;i
2
;j
2
)
4:
endwhile
5:
while
8
(
i
1
;j
1
;i
2
;j
2
)
do
6:
ComputeInc
(
i
1
;j
1
;i
2
;j
2
)
7:
endwhile
8:
Select
n
pairswithhighestscoresintotheset
T
9:
return
T
Algorithm3.4:
iMTRLframework
Require:
Trainingsets
f
X
k
;
y
k
g
K
k
,numberofselectedqueries
q
,
regularizationparameters

1
,

2
,positivenumber
c
,
T
0
=
;
1:
for
i
=1
;:::;n
do
2:
(

i
,
W
i
,
b
i
)=kMTRL(
f
X
k
;
y
k
g
K
k
,
T
i

1
,

1
,

2
,
c
)
3:
T
i
=query(
W
i
,

i
,
q
i
)
4:
T
i
=
T
i
[T
i

1
5:
endfor
6:

=

i
,
W
=
W
i
,
b
=
b
i
7:
return

,
W
,
b
3.2.3.5BatchModePairwiseConstraintsActivelearning
Therearetoomanypossiblepairsforhumanexpertstolabelthemall,andthusthee˚ciency
ofiMTRLframeworkheavilyreliesonthequalityofthepairsselectedbythesystem.In
thissubsection,wediscusstheimportantquestionofhowtoe˚cientlysolicitthedomain
knowledge.Speci˝cally,wewouldliketoselectthepairsthataremostinformativetothe
learningprocess.Weproposeane˚cientheuristicquerystrategyaselaboratedasfollows.
We˝rstdesignascorefunctionforpairwiseconstraintsbasedonthe
inconsistency
inthe
model.Toexplaintheinconsistency,wedenotetheanalyticalsolutioncalculatedby
W
as
^

=(
W
T
W
)
1
=
2
=
tr
((
W
T
W
)
1
=
2
)
andthedi˙erencebetweenelements

i
1
;j
1
and

i
2
;j
2
in
thelearned

as

(
i
1
;j
1
;i
2
;j
2
)
=

i
1
;j
1


i
2
;j
2
.Theninconsistencyinthemodelisde˝ned
asfollows:
59
De˝nition2.
Inconsistencyisde˝nedas:
Inc
(
i
1
;j
1
;i
2
;j
2
)
=
sign
(
i
1
;j
1
;i
2
;j
2
)
j

(
i
1
;j
1
;i
2
;j
2
)

^

(
i
1
;j
1
;i
2
;j
2
)
j
;
wheresign
(
i
1
;j
1
;i
2
;j
2
)=

(
i
1
;j
1
;i
2
;j
2
)
^

(
i
1
;j
1
;i
2
;j
2
)
j

(
i
1
;j
1
;i
2
;j
2
)
^

(
i
1
;j
1
;i
2
;j
2
)
j
.
TheInc
(
i
1
;j
1
;i
2
;j
2
)
representstwotypesofinconsistency:
Negativeinconsistency
:Giventhatthepairwiseordersoftworelationshipmatrices(

and
^

)arenotconsistent,i.e.

i
1
;j
1
>

i
2
;j
2
,but
^

i
1
;j
1
<
^

i
2
;j
2
orviceversa,the
Inc
(
i
1
;j
1
;i
2
;j
2
)
isalwaysnegative.Thesmallerthe
Inc
(
i
1
;j
1
;i
2
;j
2
)
is,thehigheristheheuristic
score.
Positiveinconsistency
:Giventhatthepairwiseordersoftworelationshipmatricesare
consistent,thentheinconsistencycomesfrom
k

(
i
1
;j
1
;i
2
;j
2
)

^

(
i
1
;j
1
;i
2
;j
2
)
k
.Thelargerthe
Inc
(
i
1
;j
1
;i
2
;j
2
)
is,thehigheristheheuristicscore.
Notethatthedisorderoftwopairsaremoreimportantthatthedi˙erenceoftwopairs,
andallpairswithnegativeinconsistencyhastheprioritytobeselectedoverthosewith
positiveinconsistency.Atthe˝rstiteration,beforeaddinganypairwiseconstraintsintothe
trainingprocedure,thelearned

isveryclosetotheanalyticalsolutioncalculatedfrom
W
,
i.e.

(
i
1
;j
1
;i
2
;j
2
)
=
^

(
i
1
;j
1
;i
2
;j
2
)
,exceptforthedisturbofnumericalterm

I
.Therefore,the
inconsistencyiscausedbysomenumericalissuesinthe˝rstround.Thereforeatthe˝rst
trainingiteration,thereisnonegativeinconsistency.Asthenumberofconstraintsaddedinto
themodel,theinconsistencywillappearandthequerystrategywillbecomemoree˙ectivein
thissituation.TheAlgorithm3.3describesthequerystrategy.
Finally,wesummarizeallproceduresofiMTRLinAlgorithm3.4.Theline1meansthere
are
n
iterationslearningproceduresneedtobeconducted.Theline2correspondstothe
60
knowledge-awareMTLstepinouriMTRLframework.Theline3istosolicitthedomain
knowledgeandline4istoanswerthequeryandencodingtheknowledgeintothemodel.
3.2.4Experiments
3.2.4.1ImportanceofHigh-QualityTaskRelationship
Inthissubsection,weconductexperimentstoshowthatencodinganaccuratetaskrelationship
willsigni˝cantlyenhancetheperformanceofMTRL.Thee˙ectivenessofMTRLhasalready
beendemonstratedin[227],inwhichtheauthorsshowedthatMTRLcaninferanaccurate
taskrelationshipfromarelativelycleandatasetwithsu˚cienttrainingsamples.Herewe
useatoyexampletoshowthatMTRLwouldinferamisleadingrelationshipwhennoise
presentsandthereareinsu˚cienttrainingsamples.Thetoydatasetisgeneratedasfollows.
Therearethreetaskswithdatasampledfrom
y
=3
x
+10
,
y
=

2
x
+5
and
y
=10
x
+1
,
respectively.Foreachtaskswegenerate
5
samplesfromauniformlydistributionin
[0
;
10]
.
ThefunctionoutputsforthreetasksarecorruptedbyaGaussiannoisewithzeromeanand
standardvarianceequalto30,10and10,respectively.Accordingtothegenerativeregression
functions,weexpectthatthecorrelationbetweenthe˝rsttaskandthirdtaskiscloseto1
andfortherestofpairsiscloseto-1.WeusethelinearkernelofMTRLwith

1
=0
:
01
and

2
=0
:
05
.Thelearned

givesacorrelationmatrixasfollows:
2
6
4
10
:
9999

0
:
9999
0
:
99991

1

0
:
9999

11
3
7
5
Fromtheabovematrixweseethatthelearnedrelationshipfortask1isoppositetothe
supposedrelationship,becauseofthehighlynoiseddata.Thiswillleadstosuboptimalsolution
61
for
W
=[

3
:
7283
;

2
:
6605
;
3
:
0105]
,ascomparedtothegroundtruth
W
=[3
;

2
;
10]
.On
theotherhand,ifweencodethetruetasksrelationshipby˝xingthe

tobetheground
truthduringthelearningprocess,withtheexactlysameparameterssettingasabove.We
canthenlearnamodel
W
=[0
:
6850
;

0
:
3878
;
2
:
5840]
thatisclosertothegroundtruthin
termsof
l
2
normandkeepsthecorrecttasksrelationship.Thisprocedureisdenotedastruth-
encodedmulti-taskrelationshiplearning(eMTRL)inthissubsection.
Thisobservationmotivatesustofurtherexplorethee˙ectivenessofeMTRL.Wecreated
syntheticdatasetbygenerating
K
=10
tasksparameters
w
i
and
b
i
fromauniformdistribution
between0and1.Eachtaskcontains25samplesdrawnfromaGaussiandistributionwithzero
meansandthevarianceequalsto10.ThefunctionresponseisalsocorruptedbyaGaussian
noisewithzeromeanandhasavarianceof5.Wesplitthissyntheticdatasettotraining,
validationandtestingset.Outofthe25samplesforeachtasks,20%arefortraining,30%
forvalidationand50%fortesting.We˝xthenumberofsamplesandthenumberoftasks,
varythenumberoffeaturesfrom20to100.Theparameters

1
and

2
havebeentuned
in
[
1

10

3
;
1

10

2
;
1

10

1
]
and
[0
;
1

10

3
;
1

10

2
;
1

10

1
;
1
;
10
;
1

10
2
;
1

10
3
]
,
respectively.
TheperformancehasbeenevaluatedusingRootMeanSquareError(RMSE)andFrobenius
normbetweenlearnedtaskmodelandthegroundtruthtaskmodel.Theresultsshownin
Figure3.5indicatethatencodingtheknowledgeabouttaskrelationshipwillsigni˝cantly
bene˝ttheprediction.EventhougheMTRLisnotapracticalmodelbecausewecan
neverknowthetruetaskrelationship,theexperimentalresultscon˝rmthatthereisahuge
potentialtoimprovepredictiveperformanceifwecantakeadvantageofdomainknowledge.
Theexperimentalresultsinnextsectionwillshowhowtoe˚cientlysolicitandincorporate
thedomainknowledgeabouttasksrelationshipintothelearning.
62
Table3.4:TheaverageRMSEofqueryandrandomstrategyontestingdatasetover5random
splittingoftrainingandvalidationsamples.

numberofconstraints
0
5
10
15
20
25
30
35
40
QueryStrategy
1.1387
1.1267
1.1224
1.1117
1.1125
1.1101
1.1102
1.1137
1.1168
RandomSelection
1.1387
1.1255
1.1390
1.1284
1.1165
1.1285
1.1379
1.1382
1.1364
Table3.5:TheRMSEcomparisonofkMTRLandbaselines.
School
RR
MTL-L
MTL-l21
MTRL
kMTRL-20
kMTRL-40
kMTRL-60
kMTRL-80
5%
1.1737

0.0041
1.1799

0.0047
1.176

0.0043
1.0615

0.0167
1.0584

0.0128
1.0553

0.0155
1.0551

0.0158
1.0551

0.0159
10%
1.1428

0.0306
1.1485

0.0293
1.1477

0.0282
0.9872

0.0057
0.9823

0.0030
0.9805

0.0014
0.9803

0.0018
0.9803

0.0018
15%
1.0665

0.0395
1.0699

0.0405
1.0700

0.0399
0.9491

0.0060
0.9334

0.0057
0.9321

0.0081
0.9322

0.0083
0.9323

0.0082
20%
0.9756

0.0157
0.9774

0.0153
0.9776

0.0149
0.9047

0.0031
0.8966

0.0123
0.8906

0.0123
0.8844

0.0022
0.8843

0.0019
MMSE
RR
MTL-L
MTL-l21
MTRL
kMTRL-5
kMTRL-10
kMTRL-15
kMTRL-20
2%
0.9503

0.1467
0.9319

0.1497
0.9314

0.1693
0.9106

0.0976
0.9113

0.0982
0.9058

0.0926
0.9058

0.0926
0.9058

0.0926
(a)(b)
Figure3.5:PerformanceofMTRLandeMTRLasthenumberoffeatureschanging,in
termsof(a)Frobeniusnormand(b)RMSE.MTRL[
227
]learnsbothtaskmodelsandtask
relationshipatthesametime,whileeMTRLherelearnsthetaskmodelswhilethetask
relationship

is˝xedtogroundtruth,i.e.encodingthecorrectdomainknowledgeabout
thetaskrelationship.
3.2.4.2E˙ectivenessofQueryStrategy
Inthissubsection,weconducttheexperimentstoshowthatencodingthedomainknowledgein
theformofpartialorderisuseful.Wefollowthesamesyntheticdatasetwith20featuredimen-
siongeneratedabove.Thesamesettingofsplittingtraining,testingandvalidationdataset,
and5foldrandomsplitvalidationareapplied.Theparameters

1
and

2
havebeentuned
in
[
1

10

3
;
1

10

2
;
1

10

1
]
and
[0
;
1

10

3
;
1

10

2
;
1

10

1
;
1
;
10
;
1

10
2
;
1

10
3
]
,
respectively.Afterthelearningalgorithmconverges,wecomparethethepairwiseconstraints
63
arechosenbytheproposedquerystrategyandtherandomlyselectedstrategy.Theresultsof
twostrategiesarereportedinTable3.4.Weseethetrendthatbothoftheproposedquery
strategyandtherandomselectionreachbettergeneralizationperformanceasthenumberof
incorporatedpairwiseconstraintsincreases.Tobemorespeci˝c,theresultsin˝rstcolumnis
worsethanalltheresultsusingquerystrategyandmostoftheresultsusingrandomselection.
Thisshowthatsolicitthedomainknowledgeintermsofpairwiseconstraintsise˙ective.
Ontheotherhand,whencomparingtheresultsoftheproposedquerystrategyandrandom
selection,weseethatourquerystrategyselectsimportantpairwiseconstraints,leadingto
abettermodelthantherandomquery.Whenthenumberofpairwiseconstraintsislarger
than5,theproposedquerystrategyworksconsistentlybetterthanrandomselection.
3.2.4.3InteractiveSchemeforQueryStrategy
Tofurtheranalysisourquerystrategy,wealsoexploredi˙erentinteractiveschemesinour
querystrategy.Therearemultiplewaystoqueryacertainamountofpartialorders.We
caneitherquerymanytimesandeachtimewithlesslabelinge˙orts,orviceversa.Weuse
kMTRL-a-b
todenoteatotal
b
constraintsandeachtimewequery
a
constraints(thehuman
expertneedstointeractwiththesystem
b=a
times).Thedi˙erentinteractiveschemewill
highlyimpacttheuserexperience.Forexample,kMTRL-10-100needstoqueryexperts
10
timesandexpertsneedtolabel
10
constraintsateachtime.Also,ittakes
10
trainingiterations
whichismuchmoreexpensivethanotherschemes.Incontrast,kMTRL-100-100onlyneedsto
queryexpertsonce,whichisthemoste˚cientscheme.However,thisschemecannotbene˝t
fromtheiterativeprocessofiMTRL.Thepairwiseconstraintsaddedinpreviousiterations
willa˙ectthemodelandwon'tbeselectedagain.Thiswillrevealotherimportantconstraints.
Takingaoneiterationschemecannotutilizethisinformation.Theresultsaresummarized
64
Figure3.6:TheaveragedRMSEofkMTRLusingdi˙erentsettingofquerystrategy.The
kMTRL-10-100meansselecting10pairwiseconstraintsattheendofeachiteration,start
fromzero,add10pairwiseconstraintsatatime,until100constraints.Forall4schemes,
kMTRLwithzeroconstraintsisequivalenttoMTRL.Resultsaretheaverageover5fold
randomsplitting.
inFigure3.6.WeseethatkMTRL-50-100achievesthebestperformance.Therefore,the
bestschemeindicatethatourquerystrategyismostlye˙ectivewhenwebalancethetwo
parameters,andthusitdoesnotrequireintensivelyinteractionwithexpertsandmeanwhile
utilizesthepreviousinformatione˙ectively
1
.
3.2.4.4PerformanceonRealDatasets
Theschooldatasetisawidelyusedbenchmarkdatasetformulti-taskregressionproblem.It
contains15372studentswith28featuresfrom139secondaryschoolsintheyearof1985,1986
and1987,providedbytheInnerLondonEducationAuthority(ILEA).Thetaskistopredict
thescoreforstudentsin139schools.Theexperimentalsettingsareexplainedasfollows.We
˝rstsplitthedatasetintotraining,validationandtestingdatasets.Thepercentageoftesting
samplesvariesfrom
10%
to
25%
ofallsampleseachtasksinoriginaldataset.Takingthe
10%
testingdatasetasanexample,weperform3-foldrandomsplitontherest
90%
data.
1
Codeispubliclyavailableat
https://github.com/illidanlab/iMTL
65
Eachfoldhas
20%
samplesfortrainingand
70%
fortesting.Thesamerandomsplittingare
appliedtothethreedatasets.
AnotherrealdatasetweusedhereisAlzheimer'sDiseaseNeuroimagingInitiative(ADNI)
database
2
.Theexperimentalsetupissameasdescribedinthepaper[
235
].Thegoalis
topredictthesuccessivecognitionstatusofpatientsbasedonthemeasurementsatthe
screeningorthebaselinevisit.Weuse
2%
samplesfortraining,
10%
fortestingandthe
restforvalidation.Wealsoperform3-foldrandomsplitonthisdataset.Thepredictive
performanceofthecompetingmethodslistedbelowarereportedontherealdatasets:

RR:Thisapproachreferstoridgeregression.

MTL-L:Thisapproachreferstothelow-rankmulti-tasklearningwithtracenorm
regularization[10].

MTL-L21:Thisapproachreferstomulti-taskjointfeaturelearningusing
l
2
;
1
norm
thatselectsasubsetoffeaturessharedbyalltasks[121].

MTRL:Thisapproachreferstothemulti-taskrelationshiplearningaswedescribedin
Section3.2.3[227].

kMTRL-
N
:ThisapproachreferstotheproposedkMTRLmethodwith
N
pairwise
encodedintothemodel.
Wetunetheregularizationparameterson
W
in
[
1

10

3
;
1

10

2
;
1

10

1
]
forRR,
MTL-LandMTL-L21.Theregularizationparameters

1
and

2
inEq.(3.16)aretunedin
[
1

10

3
,
1

10

2
,
1

10

1
]and[
0
,
1

10

3
,
1

10

2
,
1

10

1
,
1
,
10
,
1

10
2
,
1

10
3
]
respectively.Thebestparametersareselectedbasedontheperformanceonthevalidation
2
Dataispubliclyavailableat
http://adni.loni.usc.edu/
66
Table3.6:ThenameofthebrainregionsinFigure3.8,where(C)denotescorticalparcellation
and(W)denoteswhitematterparcellation.
#
Intra-region
Inter-region(Row)
Inter-region(Column)
1
(C)RightCaudalMiddleFrontal
(W)RightPutamen
(C)RightInferiorTemporal
2
(C)RightPericalcarine
(W)LeftCerebralCortex
(C)LeftRostralMiddleFrontal
3
(W)CorpusCallosumMidAnterior
(W)RightVentralDiencephalon
(C)RightParsTriangularis
4
(W)RightCerebellumCortex
(C)RightCaudalAnteriorCingulate
(C)RightPrecentral
5
(W)CorpusCallosumCentral
(C)LeftTemporalPole
(C)RightMedialOrbitofrontal
6
(C)LeftBankssts
(C)RightPostcentral
(C)LeftParsTriangularis
7
(C)RightParsOpercularis
(C)RightPrecentral
(C)RightSuperiorParietal
8
(C)LeftIsthmusCingulate
(W)RightCerebralCortex
(C)RightInferiorParietal
9
(C)LeftSupramarginal
(C)LeftIsthmusCingulate
(C)LeftParsOrbitalis
10
(C)RightInferiorTemporal
(C)LeftSuperiorFrontal
(W)CorpusCallosumCentral
set.TheperformanceoflearnedmodelsaremeasuredbyRMSEonthetestingdataset.The
experimentalresultsareshowninTable3.5,fromwhichweseethatkMTRLachievesthe
bestresults.Inthisexperiment,weadopttheschemekMTRL-20-80forschooldatasetand
kMTRL-5-20forMMSEdatasetasdescribedinprevioussubsection.
3.2.5CaseStudy:BrainAtrophyandAlzheimer'sDisease
InthissectionweapplytheproposediMTRLframeworktostudythebrainatrophypatterns
andhowthechangesinthebrainisassociatedtodi˙erentclinicaldementiascoresand
symptomsthatarerelatedtoAlzheimer'sdisease(AD).Itisestimatedthatthereare
currently5millionAmericanshaveAD,andADhasbecomeoneoftheleadingcausesof
deathintheUnitedStates.SinceADischaracterizedbystructuralatrophyinthebrain,there
isapressingdemandofunderstandinghowthebrainatrophyisrelatedtotheprogressionof
thedisease.
Inthisworkwestudyhowthestructuralfeaturesofbrainregionscanberelatedto51
cognitivemarkerssuchas,Alzheimer'sDiseaseAssessmentScale(ADAS),clinicaldementia
rating(CDR),GlobalDeteriorationScale(GDS),Hachinski,NeuropsychologicalBattery,
WMS-RLogic,andotherneuropsychologicalassessmentscores.Weareinterestedinpredicting
thevolumeofbrainareasextractedfromthestructuralmagneticresonanceimaging(MRI).
67
(a)(b)
Figure3.7:Thedistributionofcompetenceon(a)intra-regioncovarianceand(b)inter-region
covariance.kMTRLperformsbetterthanMTRLwhencompetence
>
1
.Highercompetence
indicatesbetterperformanceachievedbykMTRLascomparedtoMTRL.Weseeinamajority
ofregionsthekMTRLoutperformstheMTRL.
(a)Intra-regioncovariance
(b)Inter-regioncovariance
Figure3.8:Comparisonofsub-matricesofcovarianceamong(left)taskcovarianceusing
90%
alldatapointsthatisconsideredasround(middle)thecovariancematrixlearned
viaMTRLon
20%
dataand(right)thecovariancematrixlearnedviakMTRLon
20%
data
with0.8%pair-wiseconstraintsqueriedbytheproposedqueryscheme.
WeusetheADNIcohortconsisting648subjectswhosebaselineMRIimagespassedquality
control.WeusedtheFreeSurfertooltoextractthe99brainvolumesfromregionsofinterest
(ROIs)ofthebaselineMRIimages.ConsideringthepredictionofthevolumeofeachROI
asalearningtask,wethushaveacollectionof99learningtasks,witheachtaskhaving648
samplesand51features.Sincethebrainregionsarerelatedduringtheagingprocessand
Alzheimer'sprogression,theMTLapproachcanbeusedtoimprovetheperformanceby
consideringsuchrelatednessamongbrainregions.
Weadoptthesameexperimentalsettingasinthepreviousexperiments,wherewecompare
theMTRLwiththeproposedkMTRLbyqueryingandaddingpair-wiseexpertknowledge
68
andinspectingthee˙ectivenessofthequeriedtaskrelationshipsupervision.Weshowthe
di˙erencesamongthe(1)taskcovarianceusing
90%
alldatapointsthatisconsideredas
(2)thecovariancematrixlearnedviaMTRLon
10%
dataand(3)the
covariancematrixlearnedviakMTRLon
10%
datawith0.8%pair-wiseconstraintsqueried
bytheproposedqueryscheme.Sincethecomplete
99

99
covariancematricesarehardto
visualize,wechooseinvestigatetwotypesofsubregionsofthecovariancematrices:(a)a
randomintraregionofthecovarianceofthesize
10

10
(rowregionsandcolumnregionsare
thesame)and(b)arandominterregionofthecovarianceofthesize
10

10
(rowregionsand
columnregionsaredi˙erent).Wede˝nethe
competence
metrictoquantifyhowthequality
ofthesub-covariance:
k

MTRL


real
k
F
=
k

kMTRL


real
k
F
;
(3.18)
wherethekMTRLperformsbetterthanMTRLwhencompetence
>
1
,andthehigherthe
better.Werepeatedlychooserandomsub-covariancesandthedistributionofthecompetence
isshownintheFigure3.7,indicatingthatinamajorityofcasesknowledgecanimprove
relationshipestimation.
Wevisualizetwosub-covariancematricesinFigure3.8,whoseregionsareshownin
Table3.6.InFigure3.8(a),weseethatthecovariancesfromboththegroundtruthandthe
kMTRLdiscouragethepositiveknowledgetransferfrom
RightCerebellumCortex
,which
agreeswiththepathologicalcharacteristicsofAD[
182
],wherecerebellumdoesnotcorrelate
withtheprogressionofAD.Alsothepositivecorrelationbetween
CorpusCallosumMid
Anterior
and
CorpusCallosumCentral
isidenti˝edinboththegroundtruthandthekMTRL,
andignoredbyMTRL.Thesigni˝cantreducedcorpuscallosumsizewaspreviouslyreported
69
inADstudies[
192
],andtheprogressionpatternsofthetworegionscanbesimilarbecauseof
thephysicaldistancebetweenthetworegions.Figure3.8(b),weseethattheunsubstantiated
strongcorrelationbetween
RightPrecentral
and
LeftParsTriangularis
asfoundinMTRLhas
beenlargelysuppressedbythedomainknowledge.However,sinceweonlyspeci˝edpartial
orderrelationship,therearechancestheproposedkMTRLalgorithmmayvthe
supervision,aswenoticethatsomeunsubstantiatedpositivecorrelationsinvolving
Right
VentralDiencephalon
areintroducedtothecovariance.Weplantofurtherelaboratethe
˝ndingsandclinicalinsightsofADanddementiainthejournalextensionofthispaper.
70
Chapter4
Data-DrivenCollaborativeLearning
Inthischapter,wediscussdata-drivencollaborationinreinforcementlearning.Morespeci˝-
cally,we˝rstproposeacollaborativedeepreinforcementlearningframeworkthatcanaddress
theknowledgetransferamongheterogeneoustasks.Underthisframework,weproposedeep
knowledgedistillationtoadaptivelyalignthedomainofdi˙erenttaskswiththeutilizationof
deepalignmentnetwork.Secondly,wefurtherconstructheterogeneouslearningagentsin
thesametasktoimproveitssample-e˚ciency.Thecentralideaistodisentangleexploration
andexploitationagentsandthenconductdata-driventransferthroughimitationlearning,
whichleadstoano˙-policylearningframeworklargelyfacilitatesthelearninge˚ciency.The
o˙-policylearningframeworkusesgeneralizedpolicyiterationforexplorationandexploits
thestablenessofsupervisedlearningforderivingpolicy,whichaccomplishestheunbiasedness,
variancereduction,o˙-policylearning,andsamplee˚ciencyatthesametime.
4.1CollaborativeDeepReinforcementLearning
4.1.1Introduction
Ontheotherhand,thestudyofhumanlearninghaslargelyadvancedthedesignofmachine
learninganddataminingalgorithms,especiallyinreinforcementlearningandtransferlearning.
Therecentsuccessofdeepreinforcementlearning(DRL)hasattractedincreasingattention
71
Figure4.1:IllustrationofCollaborativeDeepReinforcementLearningFramework.
fromthecommunity,asDRLcandiscoververycompetitivestrategiesbyhavinglearning
agentsinteractingwithagivenenvironmentandusingrewardsfromtheenvironmentas
thesupervision(e.g.,[
132
,
86
,
107
,
174
]).EventhoughmostofcurrentresearchonDRL
hasfocusedonlearningfromgames,itpossessesgreattransformativepowertoimpact
manyindustrieswithdataminingandmachinelearningtechniquessuchasclinicaldecision
support[
193
],marketing[
3
],˝nance[
2
],visualnavigation[
236
],andautonomousdriving[
32
].
Althoughtherearemanyexistinge˙ortstowardse˙ectivealgorithmsforDRL[
131
,
137
],
thecomputationalcoststillimposessigni˝cantchallengesastrainingDRLforevenasimple
gamesuchas
Pong
[
24
]remainsveryexpensive.Theunderlyingreasonsfortheobstacle
ofe˚cienttrainingmainlylieintwoaspects:First,thesupervision(rewards)fromthe
environmentisverysparseandimplicitduringtraining.Itmaytakeanagenthundredsor
eventhousandsactionstogetasinglereward,andwhichactionsthatactuallyleadtothis
rewardareambiguous.Besidestheinsu˚cientsupervision,trainingdeepneuralnetwork
itselftakeslotsofcomputationalresources.
72
Duetotheaforementioneddi˚culties,performingknowledgetransferfromotherrelated
tasksorwell-traineddeepmodelstofacilitatetraininghasdrawnlotsofattentioninthe
community[
159
,
191
,
151
,
86
,
166
].Existingtransferlearningcanbecategorizedintotwo
classesaccordingtothemeansthatknowledgeistransferred:
datatransfer
[
82
,
151
,
166
]and
modeltransfer
[
53
,
227
,
229
,
151
].Modeltransfermethodsimplementknowledgetransfer
fromintroducinginductivebiasduringthelearning,andhasbeenextensivelystudiedin
bothtransferlearning/multi-tasklearning(MTL)communityanddeeplearningcommunity.
Forexample,intheregularizedMTLmodelssuchas[
55
,
233
],taskswiththesamefeature
spacearerelatedthroughsomestructuredregularization.Anotherexampleisthemulti-task
deepneuralnetwork,wheredi˙erenttaskssharepartsofthenetworkstructures[
229
].One
obviousdisadvantageofmodeltransferisthelackof˛exibility:usuallythefeasibilityof
inductivetransferhaslargelyrestrictedthemodelstructureoflearningtask,whichmakes
itnotpracticalinDRLbecausefordi˙erenttaskstheoptimalmodelstructuresmaybe
radicallydi˙erent.Ontheotherhand,therecentlydevelopeddatatransfer(alsoknownas
knowledgedistillationormimiclearning)[
82
,
166
,
151
]embedsthesourcemodelknowledge
intodatapoints.Thentheyareusedasknowledgebridgetotraintargetmodels,whichcan
havedi˙erentstructuresascomparedtothesourcemodel[
82
,
25
].Becauseofthestructural
˛exibility,thedatatransferisespeciallysuitabletodealwithstructurevariantmodels.
TherearetwosituationsthattransferlearningmethodsareessentialinDRL:
Certi˝catedheterogeneoustransfer.
TrainingaDRLagentiscomputationalexpensive.
Ifwehaveawell-trainedmodel,itwillbebene˝cialtoassistthelearningofothertasksby
transferringknowledgefromthismodel.Thereforeweconsiderfollowingresearchquestion:
Givenone
certi˝cated
task(i.e.themodeliswell-designed,extensivelytrainedandperforms
verywell),howcanwemaximizetheinformationthatcanbeusedinthetrainingofother
73
relatedtasks?Somemodeltransferapproachesdirectlyusetheweightsfromthetrainedmodel
toinitializethenewtask[
151
],whichcanonlybedonewhenthemodelstructuresarethesame.
Thus,thisstrictrequirementhaslargelylimiteditsgeneralapplicabilityonDRL.Ontheother
hand,theinitializationmaynotworkwellifthetasksaresigni˝cantlydi˙erentfromeach
otherinnature[
151
].Thischallengecouldbepartiallysolvedbygeneratinganintermediate
dataset(logits)fromtheexistingmodeltohelplearningthenewtask.However,newproblems
wouldarisewhenwearetransferringknowledgebetween
heterogeneoustasks
.Notonlythe
actionspacesaredi˙erentindimension,theintrinsicactionprobabilitydistributionsand
semanticmeaningsoftwotaskscoulddi˙eralot.Speci˝cally,oneactionin
Pong
may
refertomovethepaddleupwardswhilethesameactionindexin
Riverraid
[
24
]would
correspondto˝re.Therefore,thedistilleddatasetgeneratedfromthetrainedsourcetask
cannotbedirectlyusedtotraintheheterogeneoustargettask.Inthisscenario,the˝rstkey
challengeweidenti˝edinthisworkisthathowtoconductdatatransferamongheterogeneous
taskssothatwecanmaximallyutilizetheinformationfromacerti˝catedmodelwhilestill
maintainthe˛exibilityofmodeldesignfornewtasks.Duringthetransfer,thetransferred
knowledgefromothertasksmaycontradicttotheknowledgethatagentslearnedfromits
environment.Onerecentlywork[
159
]useanattentionnetworkselectiveeliminatetransferif
thecontradictionpresents,whichisnotsuitableinthissettingsincewearegivenacerti˝cated
tasktotransfer.Hence,thesecondchallengeishowtoresolvethecon˛ictandperforma
meaningfultransfer.
Lackofexpertise.
AmoregeneraldesiredbutalsomorechallengingscenarioisthatDRL
agentsaretrainedformultipleheterogeneoustaskswithoutanypre-trainedmodelsavailable.
Onefeasiblewaytoconducttransferunderthisscenarioisthatagentsofmultipletasksshare
partoftheirnetworkparameters[
229
,
166
].However,aninevitabledrawbackis,multiple
74
modelslosetheirtask-speci˝cdesignssincethesharedpartneedstobethesame.Another
solutionistolearnadomaininvariantfeaturespacesharedbyalltasks[
4
].However,some
task-speci˝cinformationisoftenlostwhileconvertingtheoriginalstatetoanewfeature
subspace.Inthiscase,anintriguingquestionsisthat:canwedesignaframeworkthat
fullyutilizestheoriginalenvironmentinformationandmeanwhileleveragestheknowledge
transferredfromothertasks?
Thispaperinvestigatestheaforementionedproblemssystematicallyandproposesanovel
CollaborativeDeepReinforcementLearning(CDRL)framework(illustratedinFigure4.1)to
resolvethem.Ourmajorcontributionisthreefold:

First,inordertotransferknowledgeamongheterogeneoustaskswhileremainingthe
task-speci˝cdesignofmodelstructure,anoveldeepknowledgedistillationisproposed
toaddresstheheterogeneityamongtasks,withtheutilizationofdeepalignmentnetwork
designedforthedomainadaptation.

Second,inordertoincorporatethetransferredknowledgefromheterogeneoustasksinto
theonlinetrainingofcurrentlearningagents,similartohumancollaborativelearning,an
e˚cientcollaborativeasynchronouslyadvantageactor-criticlearning(cA3C)algorithm
isdevelopedundertheCDRLframework.IncA3C,thetargetagentsareabletolearn
fromenvironmentsanditspeerssimultaneously,whichalsoensuretheinformationfrom
originalenvironmentissu˚cientlyutilized.Further,theknowledgecon˛ictamong
di˙erenttasksisresolvedbyaddinganextradistillationlayertothepolicynetwork
underCDRLframework,aswell.

LastbutnotleastwepresentextensiveempiricalstudiesonOpenAIgymtoevaluate
theproposedCDRLframeworkanddemonstrateitse˙ectivenessbyachievingmore
75
than10%performanceimprovementcomparedtothecurrentstate-of-the-art.
Notations:
Inthispaper,weuseteachernetwork/sourcetaskdenotesthenetwork/task
containedtheknowledgetobetransferredtoothers.Similarly,thestudentnetwork/target
taskisreferredtothosetasksutilizingtheknowledgetransferredfromotherstofacilitateits
owntraining.Theexpertnetworkdenotesthenetworkthathasalreadyreachedarelative
highaveragedrewardinitsownenvironment.InDRL,anagentisrepresentedbyapolicy
networkandavaluenetworkthatshareasetofparameters.Homogeneousagentsdenotes
agentsthatperformandlearnunderindependentcopiesofsameenvironment.Heterogeneous
agentsrefertothoseagentsthataretrainedindi˙erentenvironments.
4.1.2RelatedWork
Multi-agentlearning.
Onecloselyrelatedareatoourworkismulti-agentreinforcement
learning.Amulti-agentsystemincludesasetofagentsinteractinginoneenvironment.
Meanwhiletheycouldpotentiallyinteractwitheachother[
28
,
103
,
73
,
190
].Incollaborative
multi-agentreinforcementlearning,agentsworktogethertomaximizeasharedrewardmea-
surement[
103
,
73
].ThereisacleardistinctionbetweentheproposedCDRLframeworkand
multi-agentreinforcementlearning.InCDRL,eachagentinteractswithitsownenvironment
copyandthegoalistomaximizetherewardofthetargetagents.Theformalde˝nitionof
theproposedframeworkisgiveninSection4.1.5.
Transferlearning.
Anotherrelevantresearchtopicisdomainadaptioninthe˝eldof
transferlearning[
149
,
183
,
200
].Theauthorsin[
183
]proposedatwo-stagedomainadaptation
frameworkthatconsidersthedi˙erencesamongmarginalprobabilitydistributionsofdomains,
aswellasconditionalprobabilitydistributionsoftasks.Themethod˝rstre-weightsthedata
76
fromthesourcedomainusingMaximumMeanDiscrepancyandthenre-weightsthepredictive
functioninthesourcedomaintoreducethedi˙erenceonconditionalprobabilities.In[
200
],
themarginaldistributionsofthesourceandthetargetdomainarealignedbytraining
anetwork,whichmapsinputsintoadomaininvariantrepresentation.Also,knowledge
distillationwasdirectlyutilizedtoalignthesourceandtargetclassdistribution.Oneclear
limitationhereisthatthesourcedomainandthetargetdomainarerequiredtohavethe
samedimensionality(i.e.numberofclasses)withsamesemanticsmeanings,whichisnotthe
caseinourdeepknowledgedistillation.
In[
4
],aninvariantfeaturespaceislearnedtotransferskillsbetweentwoagents.However,
projectingthestateintoafeaturespacewouldloseinformationcontainedintheoriginalstate.
Thereisatrade-o˙betweenlearningthecommonfeaturespaceandpreservingthemaximum
informationfromtheoriginalstate.Inourwork,weusedatageneratedbyintermediate
outputsintheknowledgetransferinsteadofasharedspace.Ourapproachthusretains
completeinformationfromtheenvironmentandensureshighqualitytransfer.Therecently
proposedA2Tapproach[
159
]canavoidnegativetransferamongdi˙erenttasks.However,
itispossiblethatsomenegativetransfercasesmaybecauseoftheinappropriatedesignof
transferalgorithms.Inourwork,weshowthatwecanperformsuccessfultransferamong
tasksthatseeminglycausenegativetransfer.
Knowledgetransferindeeplearning.
Sincethetrainingofeachagentinanenvironment
canbeconsideredasalearningtask,andtheknowledgetransferamongmultipletasksbelongs
tothestudyofmulti-tasklearning.Themulti-taskdeepneuralnetwork(MTDNN)[
229
]
transfersknowledgeamongtasksbysharingparametersofseverallow-levellayers.Since
thelow-levellayerscanbeconsideredtoperformrepresentationlearning,theMTDNN
islearningasharedrepresentationforinputs,whichisthenusedbyhigh-levellayersin
77
thenetwork.Di˙erentlearningtasksarerelatedtoeachotherviathissharedfeature
representation.IntheproposedCDRL,wedonotusethesharerepresentationduetothe
inevitableinformationlosswhenweprojecttheinputsintoasharedrepresentation.We
insteadperformexplicitlyknowledgetransferamongtasksbydistillingknowledgethatare
independentofmodelstructures.In[
82
],theauthorsproposedtocompresscumbersome
models(teachers)tomoresimplemodels(students),wherethesimplemodelsaretrainedby
adataset(knowledge)distilledfromtheteachers.However,thisapproachcannothandlethe
transferamongheterogeneoustasks,whichisonekeychallengeweaddressedinthispaper.
Knowledgetransferindeepreinforcementlearning.
Knowledgetransferisalsostudied
indeepreinforcementlearning.[
131
]proposedmulti-threadedasynchronousvariantsofseveral
mostadvanceddeepreinforcementlearningmethodsincludingSarsa,Q-learning,Q-learning
andadvantageactor-critic.Amongallthosemethods,asynchronousadvantageactor-critic
(A3C)achievesthebestperformance.Insteadofusingexperiencereplayasinpreviouswork,
A3Cstabilizesthetrainingprocedurebytrainingdi˙erentagentsinparallelusingdi˙erent
explorationstrategies.Thiswasshowntoconvergemuchfasterthanpreviousmethodsand
uselesscomputationalresources.WeshowinSection4.1.5thattheA3Cissubsumedto
theproposedCDRLasaspecialcase.In[
151
],asinglemulti-taskpolicynetworkistrained
byutilizingasetofexpertDeepQ-Network(DQN)ofsourcegames.Atthisstage,the
goalistoobtainapolicynetworkthatcanplaysourcegamesasclosetoexpertsaspossible.
Thesecondstepistotransfertheknowledgefromsourcetaskstoanewbutrelatedtarget
task.TheknowledgeistransferredbyusingtheDQNinlaststepastheinitializationof
theDQNforthenewtask.Assuch,thetrainingtimeofthenewtaskcanbesigni˝cantly
reduced.Di˙erentfromtheirapproach,theproposedtransferstrategyisnottodirectly
mimicexperts'actionsorinitializebyapre-trainedmodel.In[
166
],knowledgedistillation
78
wasadoptedtotrainamulti-taskmodelthatoutperformssingletaskmodelsofsometasks.
Theexpertsforalltasksare˝rstlyacquiredbysingletasklearning.Theintermediateoutputs
fromeachexpertarethendistilledtoasimilarmulti-tasknetworkwithanextracontroller
layertocoordinatedi˙erentactionsets.Oneclearlimitationisthatmajorcomponentsof
themodelareexactlythesamefordi˙erenttasks,whichmayleadtodegradedperformance
onsometasks.Inourwork,transfercanhappenevenwhentherearenoexpertsavailable.
Also,ourmethodalloweachtasktohavetheirownmodelstructures.Furthermore,even
themodelstructuresarethesameformultipletasks,thetasksarenottrainedtoimprove
theperformanceofothertasks(i.e.itdoesnotmimicexpertsfromothertasksdirectly).
Thereforeourmodelcanfocusonmaximizingitsownreward,insteadofbeingdistractedby
others.
4.1.3Background
4.1.3.1ReinforcementLearning
Inthiswork,weconsiderthestandardreinforcementlearningsettingwhereeachagent
interactswithit'sownenvironmentoveranumberofdiscretetimesteps.Giventhecurrent
state
s
t
2S
atstep
t
,agent
g
i
selectsanaction
a
t
2A
accordingtoitspolicy
ˇ
(
a
t
j
s
t
)
,
andreceivesareward
r
t
+1
fromtheenvironment.Thegoaloftheagentistochoosean
action
a
t
atstep
t
thatmaximizethesumoffuturerewards
f
r
t
g
inadecayingmanner:
R
t
=
P
1
i
=0

i
r
t
+
i
,wherescalar

2
(0
;
1]
isadiscountrate.Basedonthepolicy
ˇ
ofthis
agent,wecanfurtherde˝neastatevaluefunction
V
(
s
t
)=
E
[
R
t
j
s
=
s
t
]
,whichestimatesthe
expecteddiscountedreturnstartingfromstate
s
t
,takingactionsfollowingpolicy
ˇ
untilthe
gameends.Thegoalinreinforcementlearningalgorithmistomaximizetheexpectedreturn.
79
Sincewearemainlydiscussingonespeci˝cagent'sdesignandbehaviorthroughoutthepaper,
weleaveoutthenotationoftheagentindexforconciseness.
4.1.3.2AsynchronousAdvantageactor-criticalgorithm(A3C)
Theasynchronousadvantageactor-critic(A3C)algorithm[
131
]launchesmultipleagentsin
parallelandasynchronouslyupdatesaglobalsharedtargetpolicynetwork
ˇ
(
a
j
s;
p
)
aswell
asavaluenetwork
V
(
s;
v
)
.parametrizedby

p
and

v
,respectively.Eachagentinteracts
withtheenvironment,independently.Ateachstep
t
theagenttakesanactionbasedon
theprobabilitydistributiongeneratedbypolicynetwork.Afterplayingan-steprolloutor
reachingtheterminalstate,therewardsareusedtocomputetheadvantagewiththeoutput
ofvaluefunction.Theupdatesofpolicynetworkisconductedbyapplyingthegradient:
r

p
log
ˇ
(
a
t
j
s
t
;

p
)
A
(
s
t
;a
t
;

v
)
;
wheretheadvantagefunction
A
(
s
t
;a
t
;

v
)
isgivenby:
X
T

t

1
i
=0

i
r
t
+
i
+

T

t
V
(
s
T
;

v
)

V
(
s
t
;

v
)
:
Term
T
representsthestepnumberforthelaststepofthisrollout,itiseitherthemax
numberofrolloutstepsorthenumberofstepsfrom
t
totheterminalstate.Theupdateof
valuenetworkistominimizethesquareddi˙erencebetweentheenvironmentrewardsand
valuefunctionoutputs,i.e.,
min

v
(
X
T

t

1
i
=0

i
r
t
+
i
+

T

t
V
(
s
T
;

v
)

V
(
s
t
;

v
))
2
:
80
Thepolicynetworkandthevaluenetworksharethesamelayersexceptforthelastoutput
layer.Anentropyregularizationofpolicy
ˇ
isaddedtoimproveexploration,aswell.
4.1.3.3Knowledgedistillation
Knowledgedistillation[
82
]isatransferlearningapproachthatdistillstheknowledgefroma
teachernetworktoastudentnetworkusingatemperatureparameterized"softtargets"(i.e.
aprobabilitydistributionoverasetofclasses).Ithasbeenshownthatitcanacceleratethe
trainingwithlessdatasincethegradientfrom"softtargets"containsmuchmoreinformation
thanthegradientobtainedfrom"hardtargets"(e.g.0,1supervision).
Tobemorespeci˝c,logitsvector
z
2R
d
for
d
actionscanbeconvertedtoaprobability
distribution
h
2
(0
;
1)
d
byasoftmaxfunction,raisedwithtemperature
˝
:
h
(
i
)=
softmax
(
z
=˝
)
i
=
exp
(
z
(
i
)
=˝
)
P
j
exp
(
z
(
j
)
=˝
)
;
(4.1)
where
h
(
i
)
and
z
(
i
)
denotesthe
i
-thentryof
h
and
z
,respectively.
ThentheknowledgedistillationcanbecompletedbyoptimizethefollowingKullback-
Leiblerdivergence(KL)withtemperature
˝
[166,82].
L
KL
(
D;

p
)=
X
t
=1
softmax
(
z

t
=˝
)ln
softmax
(
z

t
=˝
)
softmax
(
z

t
)
(4.2)
where
z

t
isthelogitsvectorfromteachernetwork(notation

representsteacher)atstep
t
,while
z

t
isthelogitsvectorfromstudentnetwork(notation

representsstudent)ofthis
step.


p
denotestheparametersofthestudentpolicynetwork.
D
isasetoflogitsfrom
teachernetwork.
81
4.1.4Collaborativedeepreinforcementlearningframework
Inthissection,weintroducetheproposedcollaborativedeepreinforcementlearning(CDRL)
framework.Underthisframework,acollaborativeAsynchronousAdvantageActor-Critic
(cA3C)algorithmisproposedtocon˝rmthee˙ectivenessofthecollaborativeapproach.
Beforeweintroduceourmethodindetails,oneunderlyingassumptionweusedisasfollows:
Assumption3.
Ifthereisauniversethatcontainsallthetasks
E
=
f
e
1
;e
2
;:::;e
1
g
and
k
i
representsthecorrespondingknowledgetomastereachtask
e
i
,then
8
i;j;k
i
\
k
j
6
=
;
.
Thisisaformaldescriptionofourcommonsensethatanypairoftasksarenotabsolutely
isolatedfromeachother,whichhasbeenimplicitlyusedasafundamentalassumptionby
mostpriortransferlearningstudies[
151
,
166
,
55
,
37
,
235
].Therefore,wefocusonminingthe
sharedknowledgeacrossmultipletasksinsteadofprovidingstrategyselectingtasksthatshare
knowledgeasmuchaspossible,whichremainstobeunsolvedandmayleadtoourfuture
work.Thegoalhereistoutilizetheexistingknowledgeaswellaspossible.Forexample,
wemayonlyhaveawell-trainedexpertonplayingPonggame,andwewanttoutilizeits
expertisetohelpusperformbetteronothergames.Thisisoneofthesituationsthatcanbe
solvedbyourcollaborativedeepreinforcementlearningframework.
(a)Distillationprocedure(b)Studentnetworkstructure.
Figure4.2:Deepknowledgedistillation.In(a),theteacher'soutputlogits
z

ismapped
throughadeepalignmentnetworkandthealignedlogits
F

!
(
z

)
isusedasthesupervision
totrainthestudent.In(b),theextrafullyconnectedlayerfordistillationisaddedfor
learningknowledgefromteacher.Forsimplicity'ssake,timestep
t
isomittedhere.
82
4.1.5Collaborativedeepreinforcementlearning
Indeepreinforcementlearning,sincethetrainingofagentsarecomputationalexpensive,the
well-trainedagentsshouldbefurtherutilizedassourceagents(agentswherewetransferred
knowledgefrom)tofacilitatethetrainingoftargetagents(agentsthatareprovidedwith
theextraknowledgefromsource).Inordertoincorporatethistypeofcollaborationtothe
trainingofDRLagents,weformallyde˝nethecollaborativedeepreinforcementlearning
(CDRL)frameworkasfollows:
De˝nition3.
Given
m
independentenvironments
f
"
1
;"
2
;:::;"
m
g
of
m
tasks
f
e
1
;e
2
;:::;e
m
g
,thecorresponding
m
agents
f
g
1
;g
2
;:::;g
m
g
arecollaborativelytrainedinparalleltomaximize
therewards(mastereachtask)withrespecttotargetagents.

Environments.Thereisnorestrictionontheenvironments:The
m
environmentscan
betotallydi˙erentorwithsomeduplications.

Inparallel.Eachenvironment
"
i
onlyinteractswiththeonecorrespondingagent
g
i
,i.e.,
theaction
a
j
t
fromagent
g
j
atstep
t
hasnoin˛uenceonthestate
s
i
t
+1
in
"
i
;
8
i
6
=
j
.

Collaboratively.Thetrainingprocedureofagent
g
i
consistsofinteractingwithenvi-
ronment
"
i
andinteractingwithotheragentsaswell.Theagent
g
i
isnotnecessary
tobeatsamelevelas"collaborative"de˝nedincognitivescience[
50
].E.g.,
g
1
canbe
anexpertfortask
e
1
(environment
"
1
)whileheishelpingagent
g
2
whichisastudent
agentintask
e
2
.

Targetagents.ThegoalofCDRLcanbesetasmaximizingtherewardsthatagent
g
i
obtainsinenvironment
"
i
withthehelpofinteractingwithotheragents,similar
toinductivetransferlearningwhere
g
i
isthetargetagentfortargettaskandothers
83
aresourcetasks.Theknowledgeistransferedfromsourcetotarget
g
i
byinteraction.
Whenwesetthegoaltomaximizetherewardsofmultipleagentsjointly,itissimilarto
multi-tasklearningwherealltasksaresourcetasksandtargettasksatthesametime.
Noticethatourde˝nitionisverydi˙erentfromthepreviouslyde˝nedcollaborative
multiagentMarkowDecisionProcess(collaborativemultiagentMDP)[
103
,
73
]whereaset
ofagentsselectaglobaljointactiontomaximizethesumoftheirindividualrewardsand
theenvironmentistransittedtoanewstatebasedonthatjointaction.First,MDPis
notarequirementinCDRLframework.Second,inCDRL,eachagenthasitsowncopy
ofenvironmentandmaximizesitsowncumulativerewards.Thegoalofcollaborationis
toimprovetheperformanceofcollaborativeagents,comparedwithisolatedones,whichis
di˙erentfrommaximizingthesumofglobalrewardsincollaborativemultiagentMDP.Third,
CDRLfocusesonhowagentscollaborateamongheterogeneousenvironments,insteadofhow
jointactiona˙ectstherewards.InCDRL,di˙erentagentsareactinginparallel,theactions
takenbyotheragentswon'tdirectlyin˛uencecurrentagent'srewards.Whileincollaborative
multiagentMDP,theagentsmustcoordinatetheiractionchoicessincetherewardswillbe
directlya˙ectedbytheactionchoicesofotheragents.
Furthermore,CDRLincludesdi˙erenttypesofinteraction,whichmakesthisageneral
framework.Forexample,thecurrentstate-of-the-artisA3C[
131
]canbecategorizedasone
homogeneousCDRLmethodwithadvantageactor-criticinteraction.Speci˝cally,multiple
agentsinA3Caretrainedinparallelwiththesameenvironment.Allagents˝rstsynchronize
parametersfromaglobalnetwork,andthenupdatetheglobalnetworkwiththeirindividual
gradients.Thisprocedurecanbeseenaseachagentmaintainsitsownmodel(adi˙erent
versionofglobalnetwork)andinteractswithotheragentsbysendingandreceivinggradients.
Inthispaper,weproposeanovelinteractionmethodnameddeepknowledgedistillation
84
undertheCDRLframework.ItisworthnotingthattheinteractioninA3Conlydealswith
thehomogeneoustasks,i.e.allagentshavethesameenvironmentandthesamemodel
structuresothattheirgradientscanbeaccumulatedandinteracted.Bydeepknowledge
distillation,theinteractioncanbeconductedamongheterogeneoustasks.
4.1.6Deepknowledgedistillation
Asweintroducedbefore,knowledgedistillation[
82
]istryingtotrainastudentnetwork
thatcanbehavesimilarlytotheteachernetworkbyutilizingthelogitsfromtheteacheras
supervision.However,transferringtheknowledgeamongheterogeneoustasksfacesseveral
di˚culties.First,theactionspacesofdi˙erenttasksmayhavedi˙erentdimensions.Second,
evenifthedimensionalityofactionspaceissameamongtasks,theactionprobability
distributionsfordi˙erenttaskscouldvaryalot,asweillustratedinFigure4.5(a)and
(b).Thus,theactionpatternsrepresentedbythelogitsofdi˙erentpolicynetworksare
usuallydi˙erentfromtasktotask.Ifwedirectlyforceastudentnetworktomimictheaction
patternofateachernetworkforadi˙erenttask,itcouldbetrainedinawrongdirection,
and˝nallyendsupwithworseperformancethanisolatedtraining.Infact,thissuspecthas
beenempiricallyveri˝edinourexperiments.
Basedontheaboveobservation,weproposedeepknowledgedistillationtotransfer
knowledgebetweenheterogeneoustasks.AsillustratedinFigure4.2(a),theapproachfor
deepknowledgedistillationisstraightforward.Weuseadeepalignmentnetworktomapthe
logitsoftheteachernetworkfromaheterogeneoussourcetask
e

(environment
"

),thenthe
logitsisusedasoursupervisiontoupdatethestudentnetworkoftargettask
e

(environment
"

).Thisprocedureisperformedbyminimizingfollowingobjectivefunctionoverstudent
85
policynetworkparameters


p
0
:
L
KL
(
D;

p
0
;˝
)=
X
t
l
KL
(
F

!
(
z

t
)
;
z

t
0
;˝
)
;
(4.3)
where
l
KL
(
F

!
(
z

t
)
;
z

t
0
;˝
)=
softmax
(
F

!
(
z

t
)
=˝
)ln
softmax
(
F

!
(
z

t
)
=˝
)
softmax
(
z

t
0
)
:
Here

!
denotestheparametersofthedeepalignmentnetwork,whichtransfersthelogits
z

t
fromtheteacherpolicynetworkforknowledgedistillationbyfunction
F

!
(
z

t
)
atstep
t
.As
weshowinFigure4.2(b),


p
isthestudentpolicynetworkparameters(includingparameters
ofCNN,LSTMandpolicylayer)fortask
e

,while


p
0
denotesstudentnetworkparameters
ofCNN,LSTManddistillationlayer.Itisclearthatthedistillationlogits
z

t
0
fromthe
studentnetworkdoesnotdeterminetheactionprobabilitydistributiondirectly,whichis
establishedbythepolicylogits
z

t
,asillustratedinFigure4.2(b).Weaddanotherfully
connecteddistillationlayertodealwiththemismatchofactionspacedimensionalityandthe
contradictionofthetransferredknowledgefromsourcedomainandthelearnedknowledge
fromtargetdomain.Theinputtobothoftheteacherandthestudentnetworkisthestate
ofenvironment
"

oftargettask
e

.Itmeansthatwewanttotransfertheexpertisefrom
anexpertoftask
e

towardsthecurrentstate.Symbol
D
isasetoflogitsfromtheteacher
networkinonebatchand
˝
isthetemperaturesameasdescribedinEq(4.1).Inatrivial
casethattheteachernetworkandthestudentnetworkaretrainedforsametask(
e

equals
e

),thenthedeepalignmentnetwork
F

!
wouldreducetoanidentitymapping,andthe
problemisalsoreducedtoasingletaskpolicydistillation,whichhasbeenprovedtobe
86
e˙ectivein[
166
].Beforewecanapplythedeepknowledgedistillation,weneedto˝rsttrain
agooddeepalignmentnetwork.Inthiswork,weprovidetwotypesoftrainingprotocolsfor
di˙erentsituations:
O˜inetraining
:Thisprotocol˝rsttrainstwoteachernetworksinbothenvironment
"

and
"

.Thenweusethelogitsofbothtwoteachernetworkstotrainadeepalignment
network
F

!
.Afteracquiringapre-trained
F

!
,wetrainastudentnetworkoftask
e

fromscratch,inthemeanwhiletheteachernetworkoftask
e

and
F

!
areusedfordeep
knowledgedistillation.
Onlinetraining
:Supposeweonlyhaveateachernetworkoftask
e

,andwewanttouse
theknowledgefromtask
e

totrainthestudentnetworkfortask
e

togethigherperformance
fromscratch.Thepipelineofthismethodisthat,we˝rstlytrainthestudentnetworkby
interactingwiththeenvironment
"

foracertainamountofsteps
T
1
,andthenstartto
trainthealignmentnetwork
F

!
,usingthelogitsfromtheteachernetworkandthestudent
network.Afterwards,atstep
T
2
,westartperformingdeepknowledgedistillation.Obviously
T
2
islargerthan
T
1
,andthevalueofthemaretask-speci˝c,whichisdecidedempiricallyin
thiswork.
Theo˜inetrainingcouldbeusefulifwehavealreadyhadareasonablygoodmodelfor
task
e

,whilewewanttofurtherimprovetheperformanceusingtheknowledgefromtask
e

.
Theonlinetrainingmethodisusedwhenweneedtolearnthestudentnetworkfromscratch.
Bothtypesoftrainingprotocolcanbeextendedtomultipleheterogeneoustasks.
87
4.1.7CollaborativeAsynchronousAdvantage
Actor-Critic
Inthissection,weintroducetheproposedcollaborativeasynchronousadvantageactor-critic
(cA3C)algorithm.Aswedescribedinsection4.1.5,theagentsarerunninginparallel.Each
agentgoesthroughthesametrainingprocedureasdescribedinAlgorithm4.1.Asitshows,
thetrainingofagent
g
1
canbeseparatedintotwoparts:The˝rstpartistointeractwith
theenvironment,gettherewardandcomputethegradientstominimizethevaluelossand
policylossbasedonGeneralizedAdvantageEstimation(GAE)[169].Thesecondpartisto
interactwithsourceagent
g
2
sothatthelogitsdistilledfromagent
g
2
canbetransferredby
thedeepalignmentnetworkandusedassupervisiontobiasthetrainingofagent
g
1
.
Tobemoreconcrete,thepseudocodeinAlgorithm4.1isanenvolvedversionofA3C
basedononlinetrainingofdeepknowledgedistillation.At
T
-thiteration,theagentinteracts
withtheenvironmentfor
t
max
stepsoruntiltheterminalstateisreached(Line6toLine
15).ThentheupdatingofvaluenetworkandpolicynetworkisconductedbyGAE.This
variationofA3Cis˝rstlyimplementedinOpenAIuniversestarteragent[
147
].Sincethemain
asynchronousframeworkisthesameasA3C,westillusetheA3Ctodenotethisalgorithm
althoughtheupdatingisthenotthesameasadvantageactor-criticalgorithmusedinoriginal
A3Cpaper[131].
TheonlinetrainingofdeepknowledgedistillationismainlycompletedfromLine25to
Line32inAlgorithm4.1.Thetrainingofthedeepalignmentnetworkstartsfrom
T
1
steps
(Line25-28).After
T
1
steps,thestudentnetworkisabletogeneratearepresentativeaction
probabilitydistribution,andwehavesuitablesupervisiontotrainthedeepalignmentnetwork
aswell,parameterizedby

!
.After
T
2
steps,

!
willgraduallyconvergetoalocaloptimal,
88
andwestartthedeepknowledgedistillation.AsillustratedinFigure4.2(b),weusesymbol


p
0
torepresenttheparametersofCNN,LSTMandthefullyconnecteddistillationlayer,
sincewedon'twantthelogitsfromheterogeneousdirectlya˙ecttheactionpatternoftarget
task.Tosimplifythediscussion,theabovealgorithmisdescribedbasedoninteractingwith
asingleagentfromaheterogeneoustask.Inalgorithm4.1,logits
z

t
canbeacquiredfrom
multipleteachernetworksofdi˙erenttasks,eachtaskwilltrainitsowndeepalignment
network

!
anddistillthealignedlogitstothestudentnetwork.
Aswedescribedinprevioussection4.1.5,therearetwotypesofinteractionsinthis
algorithm:1).GAEinteractionusesthegradientssharedbyallhomogeneousagents.2)
Distillationinteractionisthedeepknowledgedistillationfromteachernetwork.TheGAE
interactionisperformedonlyamonghomogeneoustasks.Bysynchronizingtheparameters
fromaglobalstudentnetworkinAlgorithm4.1(line3),thecurrentagentreceivestheGAE
updatesfromalltheotheragentswhointeracteswiththesameenvironment.Inline21
and22,thecurrentagentsendshisgradientstotheglobalstudentnetwork,whichwillbe
synchronizedwithotherhomogeneousagents.Thedistillationinteractionisthenconducted
inline31,wherewehavethealignedlogits
F

!
(
z

t
)
andthedistillationlogits
z

t
0
tocompute
thegradientsforminimizingthedistillationloss.Thegradientsofdistillationarealsosentto
theglobalstudentnetwork.Theroleofglobalstudentnetworkcanberegardedasaparameter
serverthathelpssendinginteractionsamongthehomogeneousagents.Fromadi˙erentangle,
eachhomogeneousagentmaintainsaninstinctversionofglobalstudentnetwork.Therefore,
bothtwotypesofinteractionsa˙ectallhomogeneousagents,whichmeansthatthedistillation
interactionsfromagent
g
2
andagent
g
1
woulda˙ectallhomogeneousagentsofagent
g
1
.
89
Algorithm4.1:
OnlinecA3C
Require:
Globalsharedparametervectors

p
and

v
andglobalsharedcounter
T
=0
;
Agent-speci˝cparametervectors

0
p
and

0
v
,GAE[169]parameters

and

.Timestepto
starttrainingdeepalignmentnetworkanddeepknowledgedistillation
T
1
;T
2
.
1:
while
T<T
max
do
2:
Resetgradients:

p
=0
and

v
=0
3:
Synchronizeagent-speci˝cparameters

0
p
=

p
and

0
v
=

v
4:
t
start
=
t
,Getstate
s
t
5:
Receivereward
r
t
andnewstate
s
t
+1
6:
repeat
7:
Perform
a
t
accordingtopolicy
8:
Receivereward
r
t
andnewstate
s
t
+1
9:
Computevalueofstate
v
t
=
V
(
s
t
;

0
v
)
10:
if
T

T
1
then
11:
Computethelogits
z

t
fromteachernetwork.
12:
Computethepolicylogits
z

t
anddistillationlogits
z

t
0
fromstudentnetwork.
13:
endif
14:
t
=
t
+1
;T
=
T
+1
15:
until
terminal
s
t
or
t

t
start
>
=
t
max
16:
R
=
v
t
=
(
0
forterminal
s
t
V
(
s
t
;
0
v
)
fornon-terminal
s
t
17:
for
i
2f
t

1
;:::;t
start
g
do
18:

i
=
r
i
+
v
i
+1

v
i
19:
A
=

i
+(

)
A
20:
R
=
r
i
+
R
21:

p
 

p
+
r
log
ˇ
(
a
i
j
s
i
;

0
)
A
22:

v
 

v
+
@
(
R

v
i
)
2
=@
0
v
23:
endfor
24:
Performasynchronousupdateof

p
using

p
andof

v
using

v
.
25:
if
T

T
1
then
26:
//Trainingdeepalignmentnetwork.
27:
min

!
P
t
l
KL
(
z

t
;
z

t
;˝
)
,
l
KL
isde˝nedinEq(4.3).
28:
endif
29:
if
T

T
2
then
30:
//onlinedeepknowledgedistillation.
31:
min


p
0
P
t
l
KL
(
F

!
(
z

t
)
;
z

t
0
)
32:
endif
33:
endwhile
90
4.1.8Experiments
4.1.8.1TrainingandEvaluation
Inthiswork,trainingandevaluationareconductedinOpenAIGym[
24
],atoolkitthat
includesacollectionofbenchmarkproblemssuchasclassicAtarigamesusingArcadeLearning
Environment(ALE)[
18
],classiccontrolgames,etc.SameasthestandardRLsetting,anagent
isstimulatedinanenvironment,takinganactionandreceivingrewardsandobservationsat
eachtimestep.Thetrainingoftheagentisdividedintoepisodes,andthegoalistomaximize
theexpectationofthetotalrewardperepisodeortoreachhigherperformanceusingasfew
episodesaspossible.
4.1.8.2Certi˝catedHomogeneoustransfer
Inthissubsection,weverifythee˙ectivenessofknowledgedistillationasatypeofinteraction
incollaborativedeepreinforcementlearningforhomogeneoustasks.Thisisalsotoverifythe
e˙ectivenessofthesimplestcasefordeepknowledgedistillation.Althoughthee˙ectiveness
ofpolicydistillationindeepreinforcementlearninghasbeenveri˝edin[
166
]basedonDQN,
thereisnopriorstudiesonasynchronousonlinedistillation.Therefore,our˝rstexperiment
istodemonstratethattheknowledgedistilledfromacerti˝catedtaskcanbeusedtotraina
decentstudentnetworkforahomogeneoustask.Otherwise,theevenmorechallengingtask
oftransferringamongheterogeneoussourcesmaynotwork.Wenotethatinthiscase,the
Assumption3isfullysatis˝edgiven
k
1
=
k
2
,where
k
1
and
k
2
aretheknowledgeneededto
mastertask
e
1
and
e
2
,respectively.Inthisexperiment,weconductexperimentsinagym
environmentnamed
Pong
.ItisaclassicAtarigamethatanagentcontrolsapaddleto
bounceaballpassanotherplayeragent.Themaximumrewardthateachepisodecanreach
is21.
91
First,wetrainateachernetworkthatlearnsfromitsownenvironmentbyasynchronously
performingGAEupdates.Wethentrainastudentnetworkusingonlyonlineknowledge
distillationfromtheteachernetwork.Forfaircomparisons,weuse8agentsforallenvironments
intheexperiments.Speci˝cally,boththestudentandtheteacheraretrainingin
Pong
with8agents.The8agentsoftheteachernetworkaretrainedusingtheA3Calgorithm
(equivalenttoCDRLwithGAEupdatesinonetask).The8agentsofstudentnetworkare
trainedusingnormalpolicydistillation,whichusesthelogitsgeneratedfromtheteacher
networkassupervisiontotrainthepolicynetworkdirectly.FromtheresultsinFigure4.3(a)
weseethatthestudentnetworkcanachieveaverycompetitiveperformancethatisisalmost
sameasthestate-of-arts,usingonlineknowledgedistillationfromahomogeneoustask.It
alsosuggeststhattheteacherdoesn'tnecessarilyneedtobeanexpert,beforeitcanguide
thetrainingofastudentinthehomogeneouscase.Before2millionsteps,theteacheritselfis
stilllearningfromtheenvironment,whiletheknowledgedistilledfromteachercanalready
beusedtotrainareasonablestudentnetwork.Moreover,weseethatthehybridoftwotypes
ofinteractionsinCDRLhasapositivee˙ectonthetraining,insteadofcausingperformance
deterioration.
Inthesecondexperiment,thestudentnetworkislearningfromboththeonlineknowledge
distillationandtheGAEupdatesfromtheenvironment.We˝ndthattheconvergenceismuch
fasterthanthestate-of-art,asshowninFigure4.3(b).Inthisexperiment,theknowledge
isdistilledfromtheteachertostudentinthe˝rstonemillionstepsandthedistillationis
stoppedafterthat.WenotethatinhomogeneousCDRL,knowledgedistillationisused
directlywithpolicylogitsotherthandistillationlogits.Theknowledgetransfersettinginthis
experimentisnotapracticalonebecausewealreadyhaveawell-trainedmodelof
Pong
,but
itshowsthatwhenknowledgeiscorrectlytransferred,thecombinationofonlineknowledge
92
(a)onlineKDonly(b)onlineKDwithGAE
Figure4.3:Performanceofonlinehomogeneousknowledgedistillation.
distillationandtheGAEupdatesisane˙ectivetrainingprocedure.
4.1.8.3Certi˝catedHeterogeneousTransfer
Inthissubsection,wedesignexperimentstoillustratethee˙ectivenessofCDRLincerti˝cated
heterogeneoustransfer,withtheproposeddeepknowledgedistillation.Givenacerti˝cated
task
Pong
,wewanttoutilizetheexistingexpertiseandapplyittofacilitatethetraining
ofanewtask
Bowling
.Inthefollowingexperiments,wedonottuneanymodel-speci˝c
parameterssuchasnumberoflayers,sizeof˝lterornetworkstructurefor
Bowling
.We
˝rstdirectlyperformtransferlearningfrom
Pong
to
Bowling
byknowledgedistillation.
Sincethetwotaskshasdi˙erentactionpatternsandactionprobabilitydistributions,directly
knowledgedistillationwithapolicylayerisnotsuccessful,asshowninFigure4.4(a).Infact,
theknowledgedistilledfrom
Pong
contradictstotheknowledgelearnedfrom
Bowling
,
whichleadstothemuchworseperformancethanthebaseline.WeshowinFigure4.5(a)and
(b)thattheactiondistributionsof
Pong
and
Bowling
areverydi˙erent.Toresolvethis,
wedistilltheknowledgethroughanextradistillationlayerasillustratedinFigure4.2(b).
Assuch,theknowledgedistilledfromthecerti˝catedheterogeneoustaskcanbesuccessfully
transferredtothestudentnetworkwithimprovedperformanceafterthelearningiscomplete.
93
(a)KDwithpolicylayer(b)KDwithdistillationlayer
Figure4.4:Performanceofonlineknowledgedistillationfromaheterogeneoustask.(a)
distillationfroma
Pong
expertusingthepolicylayertotraina
Bowling
student(KD-
policy).(b)distillationfroma
Pong
experttoa
Bowling
studentusinganextradistillation
layer(KD-distill).
(a)
Pong
(b)
Bowling
(c)aligned
Pong
Figure4.5:Theactionprobabilitydistributionsofa
Pong
expert,a
Bowling
expertand
analigned
Pong
expert.
However,thisleadstoamuchslowerconvergencethanthebaselineasshowninFigure4.4(b),
becausethatittakestimetolearnagooddistillationlayertoaligntheknowledgedistilled
from
Pong
tothecurrentlearningtask.Aninterestingquestionisthat,isitpossibletohave
bothimprovedperformanceandfasterconvergence?
DeepknowledgedistillationO˜inetraining.
Tohandletheheterogeneitybetween
Pong
and
Bowling
,we˝rstverifythee˙ectivenessofdeepknowledgedistillationwithan
o˜inetrainingprocedure.Theo˜inetrainingissplitintotwostages.Inthe˝rststage,we
trainadeepalignmentnetworkwithfourfullyconnectedlayersusingtheReluactivation
function.Thetrainingdataarelogitsgeneratedfromanexpert
Pong
networkand
Bowling
94
(a)O˜ine(b)OnlineStrategy1(c)OnlineStrategy2
(d)Collaborative
Figure4.6:Performanceof
o˜ine
,
online
deepknowledgedistillation,andcollaborative
learning.
network.Therewardsofthenetworksatconvergenceare20and60respectively.Instage2,
withthe
Pong
teachernetworkandtraineddeepalignmentnetwork,wetraina
Bowling
studentnetworkfromscratch.ThestudentnetworkistrainedwithbothGAEinteractions
withitsenvironment,andthedistillationinteractionsfromtheteachernetworkandthedeep
alignmentnetwork.TheresultsinFigure4.6(a)showthatdeepknowledgedistillationcan
transferknowledgefrom
Pong
to
Bowling
bothe˚cientlyande˙ectively.
DeepknowledgedistillationOnlinetraining.
AmorepracticalsettingofCDRLis
theonlinetraining,wherewesimultaneouslytraindeepalignmentnetworkandconductthe
onlinedeepknowledgedistillation.Weusetwoonlinetrainingstrategies:1)Thetrainingof
deepalignmentnetworkstartsafter4millionsteps,whenthestudent
Bowling
network
canperformreasonablywell,andtheknowledgedistillationstartsafter6millionsteps.2)
Thetrainingofdeepalignmentnetworkstartsafter0.1millionsteps,andtheknowledge
distillationstartsafter1millionsteps.ResultsareshowninFigure4.6(b)and(c)respectively.
Theresultsshowthatbothstrategiesreachhigherperformancethanthebaseline.Moreover,
theresultssuggestthatwedonothavetowaituntilthestudentnetworkreachesareasonable
performancebeforewestarttotrainthedeepalignmentnetwork.Thisisbecausethedeep
alignmentnetworkistraintoaligntwodistributionsof
Pong
and
Bowling
,insteadof
transferringtheactualknowledge.Recallthattheactionprobabilitydistributionof
Pong
95
and
Bowling
arequitedi˙erentasshowninFigure4.5(a)and(b).Afterweprojecting
thelogitsof
Pong
usingthedeepalignmentnetwork,thedistributionisverysimilarto
Bowling
,asshowninFigure4.5(c).
4.1.8.4CollaborativeDeepReinforcementLearning
Inpreviousexperiments,weassumethatthereisawell-trained
Pong
expert,andwetransfer
knowledgefromthe
Pong
experttothe
Bowling
studentviadeepknowledgedistillation.
Amorechallengingsettingsthatbothof
Bowling
and
Pong
aretrainedfromscratch.In
thisexperiment,weweshowthattheCDRLframeworkcanstillbee˙ectiveinthissetting.
Inthisexperiment,wetraina
Bowling
networkanda
Pong
networkfromscratchusing
theproposedcA3Calgorithm.The
Pong
agentsaretrainedwithGAEinteractionsonly,and
thetarget
Bowling
receivesupervisionfrombothGAEinteractionsanddistilledknowledge
from
Pong
viaadeepalignmentnetwork.Westarttotrainthedeepalignmentnetworkafter
3millionsteps,andperformdeepknowledgedistillationafter4millionsteps,wherethe
Pong
agentsarestillupdatingfromtheenvironment.Wenotethatinthissetting,theteacher
networkisconstantlybeingupdated,asknowledgeisdistilledfromtheteacheruntil15
millionsteps.ResultsinFigure4.6(d)showthattheproposedcA3Cisabletoconvergetoa
higherperformancethanthecurrentstate-of-art.Therewardoflastonehundredepisodesof
A3Cis
61
:
48

1
:
48
,whilecA3Cachieves
68
:
35

1
:
32
,withasigni˝cantrewardimprovement
of
11
:
2%
.
96
4.2RankingPolicyGradient
4.2.1Introduction
Toutilizethecollaborativestrategyforimprovingthesample-e˚ciencyinsingleagent
reinforcementlearning,wedisentangletheexplorationandexploitationintotwoseparate
agentsandconductdata-drivencollaborationthroughimitationlearning,whichleadstoa
moresample-e˚ciento˙-policylearningframework.We˝rstapproachthesample-e˚cient
reinforcementlearningfromarankingperspective.Insteadofestimatingtheoptimalaction
valuefunction,weconcentrateonlearningoptimalrankofactions.Therankofactions
dependsonthe
relativeactionvalues
.Aslongastherelativeactionvaluespreservethesame
rankofactionsastheoptimalactionvalues(
Q
-values),wechoosethesameoptimalaction.
Tolearnoptimalrelativeactionvalues,weproposethe
rankingpolicygradient(RPG)
that
optimizestheactions'rankwithrespecttothelong-termrewardbylearningthepairwise
relationshipamongactions.
RankingPolicyGradient(RPG)thatdirectlyoptimizesrelativeactionvaluestomaximize
thereturnisapolicygradientmethod.Thetrackofo˙-policyactor-criticmethods[
46
,
72
,
208
]
havemadesubstantialprogressonimprovingthesample-e˚ciencyofpolicygradient.How-
ever,thefundamentaldi˚cultyoflearningstabilityassociatedwiththebias-variancetrade-o˙
remains[
136
].Inthiswork,we˝rstexploittheequivalencebetweenRLoptimizingthe
lowerboundofreturnandsupervisedlearningthatimitatesaspeci˝coptimalpolicy.Build
uponthistheoreticalfoundation,weproposeageneralo˙-policylearningframeworkthat
equipsthegeneralizedpolicyiteration[
187
,Chap.4]withanexternalstepofsupervised
learning.Theproposedo˙-policylearningnotonlyenjoysthepropertyofoptimalitypre-
serving(unbiasedness),butalsolargelyreducesthevarianceofpolicygradientbecauseofits
97
independenceofthehorizonandrewardscale.Furthermore,thislearningparadigmleads
toasamplecomplexityanalysisoflarge-scaleMDP,inanon-tabularsettingwithoutthe
lineardependenceonthestatespace.Basedonoursample-complexityanalysis,wede˝nethe
exploratione˚ciencythatquantitativelyevaluatesdi˙erentexplorationmethods.Besides,
weempiricallyshowthatthereisatrade-o˙betweenoptimalityandsample-e˚ciency,which
iswellalignedwithourtheoreticalindication.Lastbutnotleast,wedemonstratethatthe
proposedapproach,consolidatingtheRPGwitho˙-policylearning,signi˝cantlyoutperforms
thestate-of-the-art[80,17,42,132].
4.2.2Relatedworks
SampleE˚ciency.
Thesamplee˚cientreinforcementlearningcanberoughlydividedinto
twocategories.The˝rstcategoryincludesvariantsof
Q
-learning[
132
,
167
,
203
,
80
].The
mainadvantageof
Q
-learningmethodsistheuseofo˙-policylearning,whichisessential
towardssamplee˚ciency.TherepresentativeDQN[
132
]introduceddeepneuralnetwork
in
Q
-learning,whichfurtherinspriedatrackofsuccessfulDQNvariantssuchasDouble
DQN[
203
],Duelingnetworks[
209
],prioritizedexperiencereplay[
167
],and
Rainbow
[
80
].
Thesecondcategoryistheactor-criticapproaches.Mostofrecentworks[
46
,
208
,
71
]in
thiscategoryleveragedimportancesamplingbyre-weightingthesamplestocorrectthe
estimationbiasandreducevariance.Themainadvantageisinthewall-clocktimesdueto
thedistributedframework,˝rstlypresentedin[
131
],insteadofthesample-e˚ciency.Asof
thetimeofwriting,thevariantsofDQN[
80
,
42
,
17
,
167
,
203
]areamongthealgorithmsof
mostsamplee˚ciency,whichareadoptedasourbaselinesforcomparison.
RLasSupervisedLearning.
Manye˙ortshavefocusedondevelopingtheconnections
betweenRLandsupervisedlearning,suchasExpectation-Maximizationalgorithms[
45
,
152
,
98
102
,
1
],Entropy-RegularizedRL[
145
,
74
],andInteractiveImitationLearning(IIL)[
44
,
188
,
163
,
165
,
184
,
81
,
148
].EM-basedapproachesapplytheprobabilisticframeworktoformulate
theRLproblemmaximizingalowerboundofthereturnasare-weightedregressionproblem,
whileitrequireson-policyestimationontheexpectationstep.Entropy-RegularizedRL
optimizingentropyaugmentedobjectivescanleadtoo˙-policylearningwithouttheusageof
importancesamplingwhileitconvergestosoftoptimality[74].
Ofthethreetracksinpriorworks,theIILismostcloselyrelatedtoourwork.The
IILworks˝rstlypointedouttheconnectionbetweenimitationlearningandreinforcement
learning[
163
,
188
,
165
]andexploretheideaoffacilitatingreinforcementlearningbyimitating
experts.However,mostofimitationlearningalgorithmsassumetheaccesstotheexpertpolicy
ordemonstrations.Theo˙-policylearningframeworkproposedinthisthesiscanbeinterpreted
asanonlineimitationlearningapproachthatconstructsexpertdemonstrationsduringthe
explorationwithoutsolicitingexperts,andconductssupervisedlearningtomaximizereturn
atthesametime.Inshort,ourapproachisdi˙erentfrompriorartsintermsofatleastone
ofthefollowingaspects:objectives,oracleassumptions,theoptimalityoflearnedpolicy,and
on-policyrequirement.Moreconcretely,theproposedmethodisabletolearnoptimalpolicy
intermsoflong-termreward,withoutaccesstotheoracle(suchasexpertpolicyorexpert
demonstration)anditcanbetrainedbothempiricallyandtheoreticallyinano˙-policy
fashion.AmoredetaileddiscussionoftherelatedworkonreducingRLtosupervisedlearning
isprovidedinAppendixA.
PACAnalysisofRL.
Mostexistingstudiesonsamplecomplexityanalysis[
95
,
180
,
97
,
179
,
105
,
91
,
90
,
223
]areestablishedonthevaluefunctionestimation.Theproposedapproach
leveragestheprobablyapproximatelycorrectframework[
202
]inadi˙erentwaysuchthat
itdoesnotrelyonthevaluefunction.Suchindependencedirectlyleadstoapractically
99
sample-e˚cientalgorithmforlarge-scaleMDP,aswedemonstratedintheexperiments.
4.2.3NotationsandProblemSetting
Here,weconsidera˝nitehorizon
T
,discretetimeMarkovDecisionProcess(MDP)with
a˝nitediscretestatespace
S
andforeachstate
s
2S
,theactionspace
A
s
is˝nite.The
environmentdynamicsisdenotedas
P
=
f
p
(
s
0
j
s;a
)
;
8
s;s
0
2S
;a
2A
s
g
.Wenotethatthe
dimensionofactionspacecanvarygivendi˙erentstates.Weuse
m
=
max
s
kA
s
k
todenote
themaximalactiondimensionamongallpossiblestates.Ourgoalistomaximizetheexpected
sumofpositiverewards,orreturn
J
(

)=
E
˝;ˇ

[
P
T
t
=1
r
(
s
t
;a
t
)]
,where
0
<r
(
s;a
)
<
1
;
8
s;a
.
Inthiscase,theoptimaldeterministicMarkovianpolicyalwaysexists[
156
][Proposition
4.4.3].Theupperboundoftrajectoryreward(
r
(
˝
)
)isdenotedas
R
max
=
max
˝
r
(
˝
)
.A
comprehensivelistofnotationsiselaboratedinTable4.1.
4.2.4RankingPolicyGradient
ValuefunctionestimationiswidelyusedinadvancedRLalgorithms[
132
,
131
,
170
,
71
,
80
,
42
]
tofacilitatethelearningprocess.Inpractice,theon-policyrequirementofvaluefunction
estimationsinactor-criticmethodshaslargelyincreasedthedi˚cultyofsample-e˚cient
learning[
46
,
71
].Withtheadvantageofo˙-policylearning,theDQN[
132
]variantsare
currentlyamongthemostsample-e˚cientalgorithms[
80
,
42
,
17
].Forcomplicatedtasks,the
valuefunctioncanalignwiththerelativerelationshipofaction'sreturn,buttheabsolute
valuesarehardlyaccurate[132,85].
TheaboveobservationsmotivateustolookatthedecisionphaseofRLfromadi˙erent
prospect:Givenastate,thedecisionmakingistoperforma
relativecomparison
overavailable
100
Table4.1:NotationsforSection4.2.
Notations
De˝nition

ij
Thediscrepancyoftherelativeactionvalueofaction
i
andaction
j
.

ij
=

i


j
,where

i
=

(
s;a
i
)
:
Noticethatthevaluehereisnot
theestimationofreturn,itrepresentswhichactionwillhave
relativelyhigherreturniffollowed.
Q
ˇ
(
s;a
)
Theactionvaluefunctionorequivalentlytheestimationofreturn
takingaction
a
atstate
s
,followingpolicy
ˇ
.
p
ij
p
ij
=
P
(

i
>
j
)
denotestheprobabilitythat
i
-thactionistobe
rankedhigherthan
j
-thaction.Noticethat
p
ij
iscontrolledby

through

i
;
j
˝
Atrajectory
˝
=
f
s
(
˝;t
)
;a
(
˝;t
)
g
T
t
=1
collectedfromthe
environment.Itisworthnotingthatthistrajectoryisnot
associatedwithanypolicy.Itonlyrepresentsaseriesofstate-action
pairs.Wealsousetheabbreviation
s
t
=
s
(
˝;t
)
,
a
t
=
a
(
˝;t
)
.
r
(
˝
)
Thetrajectoryreward
r
(
˝
)=
P
T
t
=1
r
(
s
t
;a
t
)
isthesumofreward
alongonetrajectory.
R
max
R
max
isthemaximalpossibletrajectoryreward,i.e.,
R
max
=max
˝
r
(
˝
)
.SincewefocusonMDPswith˝nitehorizon
andimmediatereward,thereforethetrajectoryrewardisbounded.
P
˝
Thesummationoverallpossibletrajectories
˝
.
p
(
˝
)
Theprobabilityofaspeci˝ctrajectoryiscollectedfromthe
environmentgivenpolicy
ˇ

.
p

(
˝
)=
p
(
s
0

T
t
=1
ˇ

(
a
t
j
s
t
)
p
(
s
t
+1
j
s
t
;a
t
)
T
Thesetofallpossiblenear-optimaltrajectories.
jTj
denotesthe
numberofnear-optimaltrajectoriesin
T
.
n
Thenumberoftrainingsamplesorequivalentlystateactionpairs
sampledfromuniformly(near)-optimalpolicy.
m
Thenumberofdiscreteactions.
actionsandthenchoosethebestaction,whichcanleadtorelativelyhigherreturnthan
others.Therefore,analternativesolutionistolearntheoptimalrankoftheactions,instead
ofderivingpolicyfromtheactionvalues.Inthissection,weshowhowtooptimizetherank
ofactionstomaximizethereturn,andthusavoidthenecessityofaccurateestimationfor
optimalactionvaluefunction.Tolearntherankofactions,wefocusonlearning
relative
101
actionvalue
(

-values),de˝nedasfollows:
De˝nition4
(Relativeactionvalue(

-values))
.
Forastate
s
,therelativeactionvalues
of
m
actions(

(
s;a
k
)
;k
=1
;:::;m
)isalistofscoresthatdenotestherankofactions.If

(
s;a
i
)
>
(
s;a
j
)
,thenaction
a
i
isrankedhigherthanaction
a
j
.
Theoptimalrelativeactionvaluesshouldpreservethesameoptimalactionastheoptimal
actionvalues:
argmax
a

(
s;a
)=argmax
a
Q
ˇ

(
s;a
)
where
Q
ˇ

(
s;a
i
)
and

(
s;a
i
)
representtheoptimalactionvalueandtherelativeactionvalue
ofaction
a
i
,respectively.Weomitthemodelparameter

in


(
s;a
i
)
forconcisepresentation.
Remark1.
The

-valuesaredi˙erentfromtheadvantagefunction
A
ˇ
(
s;a
)=
Q
ˇ
(
s;a
)

V
ˇ
(
s
)
.Theadvantagefunctionsquantitativelyshowthedi˙erenceofreturntakingdi˙erent
actionsfollowingthecurrentpolicy
ˇ
.The

-valuesonlydeterminetherelativeorderof
actionsanditsmagnitudesarenottheestimationsofreturns.
Tolearnthe

-values,wecanconstructaprobabilisticmodelof

-valuessuchthatthe
bestactionhasthehighestprobabilitytobeselectedthanothers.Inspiredbylearningto
rank[
26
],weconsiderthepairwiserelationshipamongallactions,bymodelingtheprobability
(denotedas
p
ij
)ofanaction
a
i
toberankedhigherthananyaction
a
j
asfollows:
p
ij
=
exp(

(
s;a
i
)


(
s;a
j
))
1+exp(

(
s;a
i
)


(
s;a
j
))
;
(4.4)
where
p
ij
=0
:
5
meanstherelativeactionvalueof
a
i
issameasthatoftheaction
a
j
,
p
ij
>
0
:
5
indicatesthattheaction
a
i
isrankedhigherthan
a
j
.GiventheindependentAssumption4,
wecanrepresenttheprobabilityofselectingoneactionasthemultiplicationofasetof
102
pairwiseprobabilitiesinEq(4.4).Formally,wede˝nethepairwiserankingpolicyinEq(4.5).
PleaserefertoSectionAintheAppendixforthediscussionsonfeasibilityofAssumption4.
De˝nition5.
Thepairwiserankingpolicyisde˝nedas:
ˇ
(
a
=
a
i
j
s
)=
m
j
=1
;j
6
=
i
p
ij
;
(4.5)
wherethe
p
ij
isde˝nedinEq(4.4).Theprobabilitydependsontherelativeactionvalues
q
=[

1
;:::;
m
]
.Thehighestrelativeactionvalueleadstothehighestprobabilitytobeselected.
Assumption4.
Forastate
s
,thesetofevents
E
=
f
e
ij
j8
i
6
=
j
g
areconditionallyinde-
pendent,where
e
ij
denotestheeventthataction
a
i
isrankedhigherthanaction
a
j
.The
independenceoftheeventsisconditionedonaMDPandastationarypolicy.
Ourultimategoalistomaximizethelong-termrewardthroughoptimizingthepairwise
rankingpolicyorequivalentlyoptimizingpairwiserelationshipamongtheactionpairs.Ideally,
wewouldlikethepairwiserankingpolicyselectsthebestactionwiththehighestprobability
andthehighest

-value.Toachievethisgoal,weresorttothepolicygradientmethod.
Formally,weproposetherankingpolicygradientmethod(RPG),asshowninTheorem2.
Theorem2
(RankingPolicyGradientTheorem)
.
ForanyMDP,thegradientoftheexpected
long-termreward
J
(

)=
P
˝
p

(
˝
)
r
(
˝
)
w.r.t.theparameter

ofapairwiserankingpolicy
(Def5)canbeapproximatedby:
r

J
(

)
ˇ
E
˝
˘
ˇ


X
T
t
=1
r


X
m
j
=1
;j
6
=
i
(

i


j
)
=
2

r
(
˝
)

;
(4.6)
andthedeterministicpairwiserankingpolicy
ˇ

is:
a
=
argmax
i

i
;i
=1
;:::;m
,where
103

i
denotestherelativeactionvalueofaction
a
i
(


(
s
t
;a
t
)
,
a
i
=
a
t
),
s
t
and
a
t
denotesthe
t
-thstate-actionpairintrajectory
˝
,

j
;
8
j
6
=
i
denotetherelativeactionvaluesofallother
actionsthatwerenottakengivenstate
s
t
intrajectory
˝
,i.e.,


(
s
t
;a
j
)
,
8
a
j
6
=
a
t
.
TheproofofTheorem2isprovidedinAppendixA.Theorem2statesthatoptimizingthe
discrepancybetweentheactionvaluesofthebestactionandallotheractions,isoptimizing
thepairwiserelationshipsthatmaximizethereturn.OnelimitationofRPGisthatitisnot
convenientforthetaskswhereonlyoptimalstochasticpoliciesexistsincethepairwiseranking
policytakesextrae˙ortstoconstructaprobabilitydistribution[seeAppendixA].Inorder
tolearnthestochasticpolicy,weintroduceListwisePolicyGradient(LPG)thatoptimizes
theprobabilityofrankingaspeci˝cactiononthetopofasetofactions,withrespecttothe
return.InthecontextofRL,thistoponeprobabilityistheprobabilityofaction
a
i
tobe
chosen,whichisequaltothesumofprobabilityallpossiblepermutationsthatmapaction
a
i
atthetop.Thisprobabilityiscomputationallyprohibitivesinceweneedtoconsiderthe
probabilityof
m
!
permutations.Inspiredbylistwiselearningtorankapproach[31],thetop
oneprobabilitycanbemodeledbythesoftmaxfunction(seeTheorem3).Therefore,LPGis
equivalenttothe
Reinforce
[
212
]algorithmwithasoftmaxlayer.LPGprovidesanother
interpretationof
Reinforce
algorithmfromtheperspectiveoflearningtheoptimalranking
andenablesthelearningofbothdeterministicpolicyandstochasticpolicy(seeTheorem4).
Theorem3
([
31
],Theorem6)
.
Giventheactionvalues
q
=[

1
;:::;
m
]
,theprobabilityof
action
i
tobechosen(i.e.toberankedonthetopofthelist)is:
ˇ
(
a
t
=
a
i
j
s
t
)=
˚
(

i
)
P
m
j
=1
˚
(

j
)
;
(4.7)
where
˚
(

)
isanyincreasing,strictlypositivefunction.Acommonchoiceof
˚
isthe
104
exponentialfunction.
Theorem4
(ListwisePolicyGradientTheorem)
.
ForanyMDP,thegradientofthelong-
termreward
J
(

)=
P
˝
p

(
˝
)
r
(
˝
)
w.r.t.theparameter

oflistwiserankingpolicytakesthe
followingform:
r

J
(

)=
E
˝
˘
ˇ

2
4
T
X
t
=1
r

0
@
log
e

i
P
m
j
=1
e

j
1
A
r
(
˝
)
3
5
;
(4.8)
wherethelistwiserankingpolicy
ˇ

parameterizedby

isgivenbyEq(4.9)fortaskswith
deterministicoptimalpolicies:
a
=argmax
i

i
;i
=1
;:::;m
(4.9)
orEq(4.10)forstochasticoptimalpolicies:
a
˘
ˇ
(

s
)
;i
=1
;:::;m
(4.10)
wherethepolicytakestheformasinEq(4.11)
ˇ
(
a
=
a
i
j
s
t
)=
e

i
P
m
j
=1
e

j
(4.11)
istheprobabilitythataction
i
beingrankedhighest,giventhecurrentstateandalltherelative
actionvalues

1
:::
m
.
TheproofofTheorem4exactlyfollowsthedirectpolicydi˙erentiation[
153
,
212
]by
replacingthepolicytotheformoftheSoftmaxfunction.Theactionprobability
ˇ
(
a
i
j
s
)
;
8
i
=
1
;:::;m
formsaprobabilitydistributionoverthesetofdiscreteactions[
31
,Lemma7].
105
Theorem4statesthatthevanillapolicygradient[
212
]parameterizedbySoftmaxlayeris
optimizingtheprobabilityofeachactiontoberankedhighest,withrespecttothelong-term
reward.Furthermore,itenableslearningbothofthedeterministicpolicyandstochastic
policy.
Tothisend,seekingsample-e˚ciencymotivatesustolearntherelativerelationship(RPG
(Theorem2)andLPG(Theorem4))ofactions,insteadofderivingpolicybasedonaction
valueestimations.However,bothoftheRPGandLPGbelongtopolicygradientmethods,
whichsu˙ersfromlargevarianceandtheon-policylearningrequirement[
187
].Therefore,
theintuitiveimplementationsofRPGorLPGarestillfarfromsample-e˚cient.Inthenext
section,wewilldescribeageneralo˙-policylearningframeworkempoweredbysupervised
learning,whichprovidesanalternativewaytoacceleratelearning,preserveoptimality,and
reducevariance.
4.2.5O˙-policyLearningasSupervisedLearning
Inthissection,wediscusstheconnectionsanddiscrepanciesbetweenRLandsupervised
learning,andourresultsleadtoasample-e˚ciento˙-policylearningparadigmforRL.The
mainresultinthissectionisTheorem5,whichcaststheproblemofmaximizingthelower
boundofreturnintoasupervisedlearningproblem,givenonerelativelymildAssumption5
andpracticalassumptions4,6.Itcanbeshownthattheseassumptionsarevalidinarange
ofcommonRLtasks,asdiscussedinLemma6inAppendixA.Thecentralideaistocollect
onlythenear-optimaltrajectorieswhenthelearningagentinteractswiththeenvironment,
andimitatethenear-optimalpolicybymaximizingtheloglikelihoodofthestate-actionpairs
fromthesenear-optimaltrajectories.Withtheroadmapinmind,wethenbegintointroduce
ourapproachasfollows.
106
InadiscreteactionMDPwith˝nitestatesandhorizon,giventhenear-optimalpolicy
ˇ

,thestationarystatedistributionisgivenby:
p
ˇ

(
s
)=
P
˝
p
(
s
j
˝
)
p
ˇ

(
˝
)
,where
p
(
s
j
˝
)
is
theprobabilityofacertainstategivenaspeci˝ctrajectory
˝
andisnotassociatedwithany
policies,andonly
p
ˇ

(
˝
)
isrelatedtothepolicyparameters.Thestationarydistributionof
state-actionpairsisthus:
p
ˇ

(
s;a
)=
p
ˇ

(
s
)
ˇ

(
a
j
s
)
.Inthissection,weconsidertheMDP
thateachinitialstatewillleadtoatleastone(near)-optimaltrajectory.Foramoregeneral
case,pleaserefertothediscussioninAppendixA.Inordertoconnectsupervisedlearning
(i.e.,imitatinganear-optimalpolicy)withRLandenablesample-e˚ciento˙-policylearning,
we˝rstintroducethetrajectoryrewardshaping(TRS),de˝nedasfollows:
De˝nition6
(TrajectoryRewardShaping,TRS)
.
Givena˝xedtrajectory
˝
,itstrajectory
rewardisshapedasfollows:
w
(
˝
)=
8
>
<
>
:
1
;
if
r
(
˝
)

c
0
;o:w:
where
c
=
R
max


isaproblem-dependentnear-optimaltrajectoryrewardthresholdthat
indicatestheleastrewardofnear-optimaltrajectory,


0
and

˝
R
max
.Wedenotetheset
ofallpossiblenear-optimaltrajectoriesas
T
=
f
˝
j
w
(
˝
)=1
g
,i.e.,
w
(
˝
)=1
;
8
˝
2T
.
Remark2.
Thethreshold
c
indicatesatrade-o˙betweenthesample-e˚ciencyandthe
optimality.Thehigherthethreshold,thelessfrequentlyitwillhitthenear-optimaltrajectories
duringexploration,whichmeansithashighersamplecomplexity,whilethe˝nalperformance
isbetter(seeFigure4.10).
Remark3.
Thetrajectoryrewardcanbereshapedtoanypositivefunctionsthatarenot
relatedtopolicyparameter

.Forexample,ifweset
w
(
˝
)=
r
(
˝
)
,theconclusionsinthis
sectionstillhold(seeEq(A.6)inAppendixA).Forthesakeofsimplicity,weset
w
(
˝
)=1
.
107
Di˙erentfromtherewardshapingwork[
139
],whereshapinghappensateachstepon
r
(
s
t
;a
t
)
,theproposedapproachdirectlyshapesthetrajectoryreward
r
(
˝
)
,whichfacilitates
thesmoothtransformfromRLtoSL.Aftershapingthetrajectoryreward,wecantransfer
thegoalofRLfrommaximizingthereturntomaximizethelong-termperformance(Def7).
De˝nition7
(Long-termPerformance)
.
Thelong-termperformanceisde˝nedbytheexpected
shapedtrajectoryreward:
X
˝
p

(
˝
)
w
(
˝
)
:
(4.12)
AccordingtoDef6,theexpectationoveralltrajectoriesistheequaltothatoverthenear-
optimaltrajectoriesin
T
,i.e.,
P
˝
p

(
˝
)
w
(
˝
)=
P
˝
2T
p

(
˝
)
w
(
˝
)
.
Theoptimalityispreservedaftertrajectoryrewardshaping(

=0
;c
=
R
max
)sincethe
optimalpolicy
ˇ

maximizinglong-termperformanceisalsoanoptimalpolicyfortheoriginal
MDP,i.e.,
P
˝
p
ˇ

(
˝
)
r
(
˝
)=
P
˝
2T
p
ˇ

(
˝
)
r
(
˝
)=
R
max
,where
ˇ

=
argmax
ˇ

P
˝
p
ˇ

(
˝
)
w
(
˝
)
and
p
ˇ

(
˝
)=0
;
8
˝=
2T
(seeLemma4inAppendixA).Similarly,when
>
0
,theoptimal
policyaftertrajectoryrewardshapingisanear-optimalpolicyfororiginalMDP.Notethat
mostpolicygradientmethodsusethesoftmaxfunction,inwhichwehave
9
˝=
2T
;p
ˇ

(
˝
)
>
0
(seeLemma5inAppendixA).Thereforewhensoftmaxisusedtomodelapolicy,itwill
notconvergetoanexactoptimalpolicy.Ontheotherhand,ideally,thediscrepancyofthe
performancebetweenthemcanbearbitrarilysmallbasedontheuniversalapproximation[
83
]
withgeneralconditionsontheactivationfunctionandTheorem1in[188].
Essentially,weuseTRSto˝lteroutnear-optimaltrajectoriesandthenwemaximize
theprobabilitiesofnear-optimaltrajectoriestomaximizethelong-termperformance.This
procedurecanbeapproximatedbymaximizingthelog-likelihoodofnear-optimalstate-action
108
pairs,whichisasupervisedlearningproblem.Beforewestateourmainresults,we˝rst
introducethede˝nitionofuniformlynear-optimalpolicy(Def8)andaprerequisite(Asm.5)
specifyingtheapplicabilityoftheresults.
De˝nition8
(UniformlyNear-OptimalPolicy,UNOP)
.
TheUniformlyNear-OptimalPolicy
ˇ

isthepolicywhoseprobabilitydistributionovernear-optimaltrajectories(
T
)isauniform
distribution.i.e.
p
ˇ

(
˝
)=
1
jTj
;
8
˝
2T
,where
jTj
isthenumberofnear-optimaltrajectories.
Whenweset
c
=
R
max
,itisanoptimalpolicyintermsofbothmaximizingreturnand
long-termperformance.Inthecaseof
c
=
R
max
,thecorrespondinguniformpolicyisan
optimalpolicy,wedenotethistypeofoptimalpolicyasuniformlyoptimalpolicy(UOP).
Assumption5
(ExistenceofUniformlyNear-OptimalPolicy)
.
Weassumetheexistenceof
UniformlyNear-OptimalPolicy(Def.8).
BasedonLemma6inAppendixA,Assumption5issatis˝edforcertainMDPsthat
havedeterministicdynamics.OtherthanAssumption5,allotherassumptionsinthiswork
(Assumptions4,6)canalmostalwaysbesatis˝edinpractice,basedonempiricalobservations.
Withtheserelativelymildassumptions,wepresentthefollowinglong-termperformance
theorem,whichshowsthecloseconnectionbetweensupervisedlearningandRL.
Theorem5
(Long-termPerformanceTheorem)
.
Maximizingthelowerboundofexpected
long-termperformanceinEq(4.12)ismaximizingthelog-likelihoodofstate-actionpairs
sampledfromauniformly(near)-optimalpolicy
ˇ

,whichisasupervisedlearningproblem:
argmax

X
s
2S
X
a
2A
s
p
ˇ

(
s;a
)log
ˇ

(
a
j
s
)
(4.13)
Theoptimalpolicyofmaximizingthelowerboundisalsotheoptimalpolicyofmaximizing
109
thelong-termperformanceandthereturn.
Remark4.
ItisworthnotingthatTheorem5doesnotrequireauniformlynear-optimalpolicy
ˇ

tobedeterministic.Theonlyrequirementistheexistenceofauniformlynear-optimal
policy.
Remark5.
Maximizingthelowerboundoflong-termperformanceismaximizingthelower
boundoflong-termrewardsincewecanset
w
(
˝
)=
r
(
˝
)
and
P
˝
p

(
˝
)
r
(
˝
)

P
T
p

(
˝
)
w
(
˝
)
.
Anoptimalpolicythatmaximizesthislowerboundisalsoanoptimalpolicymaximizingthe
long-termperformancewhen
c
=
R
max
,thusmaximizingthereturn.
TheproofofTheorem5canbefoundinAppendixA.Theorem5indicatesthatwebreakthe
dependencybetweencurrentpolicy
ˇ

andtheenvironmentdynamics,whichmeanso˙-policy
learningisabletobeconductedbytheabovesupervisedlearningapproach.Furthermore,we
pointoutthatthereisapotentialdiscrepancybetweenimitatingUNOPbymaximizinglog
likelihood(evenwhentheoptimalpolicy'ssamplesaregiven)andthereinforcementlearning
sincewearemaximizingalowerboundofexpectedlong-termperformance(orequivalently
thereturnoverthenear-optimaltrajectoriesonly)insteadofreturnoveralltrajectories.In
practice,thestate-actionpairsfromanoptimalpolicyishardtoconstructwhiletheuniform
characteristicofUNOPcanalleviatethisissue(seeSec4.2.6).Towardssample-e˚cientRL,
weapplyTheorem5toRPG,whichreducestherankingpolicygradienttoaclassi˝cation
problembyCorollary1.
Corollary1
(Rankingperformancepolicygradient)
.
Thelowerboundofexpectedlong-
termperformance(de˝nedinEq(4.12))usingpairwiserankingpolicy(Eq(4.5))canbe
110
approximatelyoptimizedbythefollowingloss:
min

X
s;a
i
p
ˇ

(
s;a
i
)

X
m
j
=1
;j
6
=
i
max(0
;
1+

(
s;a
j
)


(
s;a
i
))

:
(4.14)
Corollary2
(Listwiseperformancepolicygradient)
.
Optimizingthelowerboundofexpected
long-termperformancebythelistwiserankingpolicy(Eq(4.11))isequivalentto:
max

X
s
p
ˇ

(
s
)
X
m
i
=1
ˇ

(
a
i
j
s
)log
e

i
P
m
j
=1
e

j
(4.15)
TheproofofthisCorollaryisadirectapplicationoftheorem5byreplacingpolicywiththe
softmaxfunction.
TheproofofCorollary1canbefoundinAppendixA.Similarly,wecanreduceLPGto
aclassi˝cationproblem(seeCorollary2).OneadvantageofcastingRLtoSLisvariance
reduction.Withtheproposedo˙-policysupervisedlearning,wecanreducetheupper
boundofthepolicygradientvariance,asshownintheCorollary3.Beforeintroducingthe
variancereductionresults,we˝rstmakethecommonassumptionsontheMDPregularity
(Assumption6)similarto[
43
,
46
,A1].Furthermore,theAssumption6isguaranteedfor
boundedcontinuouslydi˙erentiablepolicysuchassoftmaxfunction.
Assumption6.
weassumetheexistenceofmaximumnormofloggradientoverallpossible
state-actionpairs,i.e.
C
=max
s;a
kr

log
ˇ

(
a
j
s
)
k
1
111
Corollary3
(Policygradientvariancereduction)
.
Givenastationarypolicy,theupperbound
ofthevarianceofeachdimensionofpolicygradientis
O
(
T
2
C
2
R
2
max
)
.Theupperbound
ofgradientvarianceofmaximizingthelowerboundoflong-termperformanceEq(4.13)is
O
(
C
2
)
,where
C
isthemaximumnormofloggradientbasedonAssumption6.Thesupervised
learninghasreducedtheupperboundofgradientvariancebyanorderof
O
(
T
2
R
2
max
)
as
comparedtotheregularpolicygradient,considering
R
max

1
;T

1
,whichisaverycommon
situationinpractice.
TheproofofCorollary3canbefoundinAppendixA.Thiscorollaryshowsthatthe
varianceofregularpolicygradientisupper-boundedbythesquareoftimehorizonandthe
maximumtrajectoryreward.Itisalignedwithourintuitionandempiricalobservation:the
longerthehorizontheharderthelearning.Also,thecommonrewardshapingtrickssuch
astruncatingtherewardto
[

1
;
1]
[
34
]canhelpthelearningsinceitreducesvarianceby
decreasing
R
max
.Withsupervisedlearning,weconcentratethedi˚cultyoflong-timehorizon
intotheexplorationphase,whichisaninevitableissueforallRLalgorithms,andwedrop
thedependenceon
T
and
R
max
forpolicyvariance.Thus,itismorestableande˚cientto
trainthepolicyusingsupervisedlearning.Onepotentiallimitationofthismethodisthat
thetrajectoryrewardthreshold
c
istask-speci˝c,whichiscrucialtothe˝nalperformance
andsample-e˚ciency.InmanyapplicationssuchasDialoguesystem[
111
],recommender
system[
130
],etc.,wedesigntherewardfunctiontoguidethelearningprocess,inwhich
c
isnaturallyknown.Forthecasesthatwehavenopriorknowledgeontherewardfunction
ofMDP,wetreat
c
asatuningparametertobalancetheoptimalityande˚ciency,aswe
empiricallyveri˝edinFigure4.10.Themajortheoreticaluncertaintyongeneraltasksisthe
existenceofauniformlynear-optimalpolicy,whichisnegligibletotheempiricalperformance.
Therigoroustheoreticalanalysisofthisproblemisbeyondthescopeofthiswork.
112
Figure4.7:O˙-policylearningframework.
4.2.6Analgorithmicframeworkforo˙-policylearning
BasedonthediscussionsinSection4.2.5,weexploittheadvantageofreducingRLinto
supervisedlearningviaaproposedtwo-stageso˙-policylearningframework.Asweillustrated
inFigure4.7,theproposedframeworkcontainsthefollowingtwostages:
GeneralizedPolicyIterationforExploration.
Thegoaloftheexplorationstageis
tocollectdi˙erentnear-optimaltrajectoriesasfrequentlyaspossible.Undertheo˙-policy
framework,theexplorationagentandthelearningagentcanbeseparated.Therefore,any
existingRLalgorithmcanbeusedduringtheexploration.Theprincipleofthisframework
isusingthemostadvancedRLagentsasanexplorationstrategyinordertocollectmore
near-optimaltrajectoriesandleavethepolicylearningtothesupervisionstage.
Supervision.
Inthisstage,weimitatetheuniformlynear-optimalpolicy,UNOP(Def8).
AlthoughwehavenoaccesstotheUNOP,wecanapproximatethestate-actiondistribution
fromUNOPbycollectingthenear-optimaltrajectoriesonly.Thenear-optimalsamples
113
areconstructedonlineandwearenotgivenanyexpertdemonstrationorexpertpolicy
beforehand.Thisstepprovidesasample-e˚cientapproachtoconductexploitation,which
enjoysthesuperiorityofstability(Figure4.9),variancereduction(Corollary3),andoptimality
preserving(Theorem5).
Thetwo-stagealgorithmicframeworkcanbedirectlyincorporatedinRPGandLPG
toimprovesamplee˚ciency.TheimplementationofRPGisgiveninAlgorithm4.2,and
LPGfollowsthesameprocedureexceptforthedi˙erenceinthelossfunction.Themain
requirementofAlg.4.2isontheexploratione˚ciencyandtheMDPstructure.Duringthe
explorationstage,asu˚cientamountofthedi˙erentnear-optimaltrajectoriesneedtobe
collectedforconstructingarepresentativesupervisedlearningtrainingdataset.Theoretically,
thisrequirementalwaysholds[seeAppendixSectionA,Lemma7],whilethenumberof
episodesexploredcouldbeprohibitivelylarge,whichmakesthisalgorithmsample-ine˚cient.
Thiscouldbeapracticalconcernoftheproposedalgorithm.However,accordingtoour
extensiveempiricalobservations,wenoticethatlongbeforethevaluefunctionbasedstate-of-
the-artconvergestonear-optimalperformance,enoughamountofnear-optimaltrajectories
arealreadyexplored.
Therefore,wepointoutthatinsteadofestimatingoptimalactionvaluefunctionsand
thenchoosingactiongreedily,usingvaluefunctiontofacilitatetheexplorationandimitating
UNOPisamoresample-e˚cientapproach.AsillustratedinFigure4.7,valuebasedmethods
witho˙-policylearning,bootstrapping,andfunctionapproximationcouldleadtoadivergent
optimization[
187
,Chap.11].Incontrasttoresolvingtheinstability,wecircumventthis
issueviaconstructingastationarytargetusingthesamplesfrom(near)-optimaltrajectories,
andperformimitationlearning.Thistwo-stageapproachcanavoidtheextensiveexploration
ofthesuboptimalstate-actionspaceandreducethesubstantialnumberofsamplesneeded
114
forestimatingoptimalactionvalues.IntheMDPwherewehaveahighprobabilityofhitting
thenear-optimaltrajectories(suchas
Pong
),thesupervisionstagecanfurtherfacilitatethe
exploration.Itshouldbeemphasizedthatourworkfocusesonimprovingthesample-e˚ciency
throughmoree˙ectiveexploitation,ratherthandevelopingnovelexplorationmethod.
Algorithm4.2:
O˙-PolicyLearningforRankingPolicyGradient(RPG)
Require:
Thenear-optimaltrajectoryrewardthreshold
c
,thenumberofmaximaltraining
episodes
N
max
.Maximumnumberoftimestepsineachepisode
T
,andbatchsize
b
.
1:
while
episode
<N
max
do
2:
repeat
3:
Retrievestate
s
t
andsampleaction
a
t
bythespeci˝edexplorationagent(random,

-greedy,oranyRLalgorithms).
4:
Collecttheexperience
e
t
=(
s
t
;a
t
;r
t
;s
t
+1
)
andstoretothereplaybu˙er.
5:
t
=
t
+1
6:
if
t%updatestep==0
then
7:
Sampleabatchofexperience
f
e
j
g
b
j
=1
fromthenear-optimalreplaybu˙er.
8:
Update
ˇ

basedonthehingelossEq(4.14)forRPG.
9:
Updatetheexplorationagentusingsamplesfromtheregularreplaybu˙er(In
simpleMDPssuchas
Pong
wherenear-optimaltrajectoriesareencountered
frequently,near-optimalreplaybu˙ercanbeusedtoupdate
theexplorationagent).
10:
endif
11:
until
terminal
s
t
or
t

t
start
>
=
T
12:
if
return
P
T
t
=1
r
t

c
then
13:
Takethenear-optimaltrajectory
e
t
;t
=1
;:::;T
inthelatestepisodefromtheregular
replaybu˙er,andinsertthetrajectoryintothenear-optimalreplaybu˙er.
14:
endif
15:
if
t
%evaluationstep==0
then
16:
EvaluatetheRPGagentbygreedilychoosingtheaction.Ifthebestperformanceis
reached,thenstoptraining.
17:
endif
18:
endwhile
4.2.7SampleComplexityandGeneralizationPerformance
Inthissection,wepresentatheoreticalanalysisonthesamplecomplexityofRPGwith
o˙-policylearningframeworkinSection4.2.6.Theanalysisleveragestheresultsfromthe
ProbablyApproximatelyCorrect(PAC)framework,andprovidesanalternativeapproach
115
toquantifysamplecomplexityofRLfromtheperspectiveoftheconnectionbetweenRL
andSL(seeTheorem5),whichissigni˝cantlydi˙erentfromtheexistingapproachesthat
usevaluefunctionestimations[
95
,
180
,
97
,
179
,
105
,
91
,
90
,
223
].Weshowthatthesample
complexityofRPG(Theorem6)dependsonthepropertiesofMDPsuchashorizon,action
space,dynamics,andthegeneralizationperformanceofsupervisedlearning.Itisworth
mentioningthatthesamplecomplexityofRPGhasnolineardependenceonthestate-space,
whichmakesitsuitableforlarge-scaleMDPs.Moreover,wealsoprovideaformalquantitative
de˝nition(Def9)ontheexploratione˚ciencyofRL.
Correspondingtothetwo-stageframeworkinSection4.2.6,thesamplecomplexityof
RPGalsosplitsintotwoproblems:

Learninge˚ciency:
Howmanystate-actionpairsfromtheuniformlyoptimalpolicy
doweneedtocollect,inordertoachievegoodgeneralizationperformanceinRL?

Exploratione˚ciency:
ForacertaintypeofMDPs,whatistheprobabilityof
collecting
n
trainingsamples(state-actionpairsfromtheuniformlynear-optimalpolicy)
inthe˝rst
k
episodesintheworstcase?Thisquestionleadstoaquantitativeevaluation
metricofdi˙erentexplorationmethods.
The˝rststageisresolvedbyTheorem6,whichconnectsthelowerboundofthegeneralization
performanceofRLtothesupervisedlearninggeneralizationperformance.Thenwediscuss
theexploratione˚ciencyoftheworstcaseperformanceforabinarytreeMDPinLemma2.
Jointly,weshowhowtolinkthetwostagestogiveageneraltheoremthatstudieshowmany
samplesweneedtocollectinordertoachievecertainperformanceinRL.
Inthissection,werestrictourdiscussionontheMDPswitha˝xedactionspaceand
assumetheexistenceofdeterministicoptimalpolicy.Thepolicy
ˇ
=
^
h
=
argmin
h
2H
^

(
h
)
116
correspondstotheempiricalriskminimizer(ERM)inthelearningtheoryliterature,whichis
thepolicyweobtainedthroughlearningonthetrainingsamples.
H
denotesthehypothesis
classfromwhereweareselectingthepolicy.Givenahypothesis(policy)
h
,theempiricalrisk
isgivenby
^

(
h
)=
P
n
i
=1
1
n
1
f
h
(
s
i
)
6
=
a
i
g
.Withoutlossofgenerosity,wecannormalizethe
rewardfunctiontosettheupperboundoftrajectoryrewardequalstoone(
i:e:;R
max
=1
),
similartotheassumptionin[
90
].Itisworthnotingthatthetrainingsamplesaregenerated
i.i.d.
fromanunknowndistribution,whichisperhapsthemostimportantassumptioninthe
statisticallearningtheory.
i.i.d.
issatis˝edinthiscasesincethestateactionpairs(training
samples)arecollectedby˝lteringthesamplesduringthelearningstage,andwecanmanually
manipulatethesamplestofollowthedistributionofUOP(Def8)byonlystoringtheunique
near-optimaltrajectories.
4.2.8Supervisionstage:Learninge˚ciency
Tosimplifythepresentation,werestrictourdiscussiononthe˝nitehypothesisclass(i.e.
jHj
<
1
)sincethisdependenceisnotgermanetoourdiscussion.However,wenotethatthe
theoreticalframeworkinthissectionisnotlimitedtothe˝nitehypothesisclass.Forexample,
wecansimplyusetheVCdimension[
204
]ortheRademachercomplexity[
15
]togeneralize
ourdiscussiontothein˝nitehypothesisclass,suchasneuralnetworks.Forcompleteness,we
˝rstrevisitthesamplecomplexityresultfromthePAClearninginthecontextofRL.
Lemma1
(SupervisedLearningSampleComplexity[
133
])
.
Let
jHj
<
1
,andlet
;
be
˝xed,theinequality

(
^
h
)

(
min
h
2H

(
h
))+2

=

holdswithprobabilityatleast
1


,when
117
thetrainingsetsize
n
satis˝es:
n

1
2

2
log
2
jHj

;
(4.16)
wherethegeneralizationerror(expectedrisk)ofahypothesis
^
h
isde˝nedas:

(
^
h
)=
X
s;a
p
ˇ

(
s;a
)
1
n
^
h
(
s
)
6
=
a
o
:
Condition1
(Actionvalues)
.
WerestricttheactionvaluesofRPGincertainrange,i.e.,

i
2
[0
;c
q
]
,where
c
q
isapositiveconstant.
Thisconditioncanbeeasilysatis˝ed,forexample,wecanuseasigmoidtocasttheaction
valuesinto
[0
;
1]
.WecanimposethisconstraintsinceinRPGweonlyfocusontherelative
relationshipofactionvalues.Giventhemildconditionandestablishedonthepriorwork
instatisticallearningtheory,weintroducethefollowingresultsthatconnectthesupervised
learningandreinforcementlearning.
Theorem6
(GeneralizationPerformance)
.
GivenaMDPwheretheUOP(Def8)isdeter-
ministic,let
jHj
denotethesizeofhypothesisspace,and
;n
be˝xed,thefollowinginequality
holdswithprobabilityatleast
1


:
X
˝
p

(
˝
)
r
(
˝
)

D
(1+
e
)

(1

m
)
T
;
where
D
=
jTj

˝
2T
p
d
(
˝
))
1
jTj
,
p
d
(
˝
)=
p
(
s
1

T
t
=1
p
(
s
t
+1
j
s
t
;a
t
)
denotestheenvironment
dynamics.

istheupperboundofsupervisedlearninggeneralizationperformance,de˝nedas

=(min
h
2H

(
h
))+2
q
1
2
n
log
2
jHj

=2
q
1
2
n
log
2
jHj

.
118
Corollary4
(SampleComplexity)
.
GivenaMDPwheretheUOP(Def8)isdeterministic,
let
jHj
denotesthesizeofhypothesisspace,andlet

be˝xed.Thenforthefollowinginequality
toholdwithprobabilityatleast
1


:
X
˝
p

(
˝
)
r
(
˝
)

1


itsu˚cesthatthenumberofstateactionpairs(trainingsamplesize
n
)fromtheuniformly
optimalpolicysatis˝es:
n

2(
m

1)
2
T
2
(log
1+
e
D
1


)
2
log
2
jHj

=
O
0
B
@
m
2
T
2

log
D
1


2
log
jHj

1
C
A
:
TheproofsofTheorem6andCorollary4areprovidedinAppendixA.Theorem6
establishestheconnectionbetweenthegeneralizationperformanceofRLandthesample
complexityofsupervisedlearning.Thelowerboundofgeneralizationperformancedecreases
exponentiallywithrespecttothehorizon
T
andactionspacedimension
m
.Thisisaligned
withourempiricalobservationthatitismoredi˚culttolearntheMDPswithalonger
horizonand/oralargeractionspace.Furthermore,thegeneralizationperformancehasa
lineardependenceon
D
,thetransitionprobabilityofoptimaltrajectories.Therefore,
T
,
m
,and
D
jointlydeterminesthedi˚cultyoflearningofthegivenMDP.Aspointedout
byCorollary4,thesmallerthe
D
is,thehigherthesamplecomplexity.Notethat
T
,
m
,
and
D
allcharacterizeintrinsicpropertiesofMDPs,whichcannotbeimprovedbyour
learningalgorithms.OneadvantageofRPGisthatitssamplecomplexityhasnodependence
onthestatespace,whichenablestheRPGtoresolvelarge-scalecomplicatedMDPs,as
119
demonstratedinourexperiments.Inthesupervisionstage,ourgoalisthesameasinthe
traditionalsupervisedlearning:toachievebettergeneralizationperformance

.
4.2.9Explorationstage:Exploratione˚ciency
Theexploratione˚ciencyishighlyrelatedtotheMDPpropertiesandtheexploration
strategy.ToprovideinterpretationonhowtheMDPproperties(statespacedimension,action
spacedimension,horizon)a˙ectthesamplecomplexitythroughexploratione˚ciency,we
characterizeasimpli˝edMDPasin[
184
],inwhichweexplicitlycomputetheexploration
e˚ciencyofastationarypolicy(randomexploration),asshowninFigure4.8.
De˝nition9
(ExplorationE˚ciency)
.
Wede˝netheexploratione˚ciencyofacertain
explorationalgorithm(
A
)withinaMDP(
M
)astheprobabilityofsampling
i
distinctoptimal
trajectoriesinthe˝rst
k
episodes.Wedenotetheexploratione˚ciencyas
p
A;
M
(
n
traj

i
j
k
)
.
When
M
,
k
,
i
andoptimalitythreshold
c
are˝xed,thehigherthe
p
A;
M
(
n
traj

i
j
k
)
,thebetter
theexploratione˚ciency.Weuse
n
traj
todenotethenumberofnear-optimaltrajectoriesin
thissubsection.Iftheexplorationalgorithmderivesaseriesoflearningpolicies,thenwehave
p
A;
M
(
n
traj

i
j
k
)=
p
f
ˇ
i
g
t
i
=0
;
M
(
n
traj

i
j
k
)
,where
t
isthenumberofstepsthealgorithm
A
updatedthepolicy.Ifwewouldliketostudytheexploratione˚ciencyofastationary
policy,thenwehave
p
A;
M
(
n
traj

i
j
k
)=
p
ˇ;
M
(
n
traj

i
j
k
)
.
De˝nition10
(ExpectedExplorationE˚ciency)
.
Theexpectedexploratione˚ciencyofa
certainexplorationalgorithm(
A
)withinaMDP(
M
)isde˝nedas:
E
A;k;
M
=
X
k
i
=0
p
A;
M
(
n
traj
=
i
j
k
)
i:
120
Figure4.8:ThebinarytreestructureMDP(
M
1
)withoneinitialstate,similarasdiscussed
in[
184
].Inthissubsection,wefocusontheMDPsthathavenoduplicatedstates.Theinitial
statedistributionoftheMDPisuniformandtheenvironmentdynamicsisdeterministic.For
M
1
theworstcaseexplorationisrandomexplorationandeachtrajectorywillbevisitedat
sameprobabilityunderrandomexploration.NotethatinthistypeofMDP,theAssumption5
issatis˝ed.
Thede˝nitionsprovideaquantitativemetrictoevaluatethequalityofexploration.
Intuitively,thequalityofexplorationshouldbedeterminedbyhowfrequentlyitwillhit
di˙erentgoodtrajectories.WeuseDef9fortheoreticalanalysisandDef10forpractical
evaluation.
Lemma2
(TheExplorationE˚ciencyofRandomPolicy)
.
TheExplorationE˚ciencyof
randomexplorationpolicyinabinarytreeMDP(
M
1
)isgivenas:
p
ˇ
r
;
M
(
n
traj

i
j
k
)=1

X
i

1
i
0
=0
C
i
0
jTj
P
i
0
j
=0
(

1)
j
C
j
i
0
(
N
jTj
+
i
0

j
)
k
N
k
;
where
N
denotesthetotalnumberofdi˙erenttrajectoriesintheMDP.InbinarytreeMDP
M
1
,
N
=
jS
0
jjAj
T
,wherethe
jS
0
j
denotesthenumberofdistinctinitialstates.
jTj
denotes
thenumberofoptimaltrajectories.
ˇ
r
denotestherandomexplorationpolicy,whichmeans
theprobabilityofhittingeachtrajectoryin
M
1
isequal.
TheproofofLemma2isavailableinAppendixA.
121
4.2.10JointAnalysisCombiningExplorationandSupervision
Inthissection,wejointlyconsiderthelearninge˚ciencyandexploratione˚ciencytostudy
thegeneralizationperformance.Concretely,wewouldliketostudyifweinteractwiththe
environmentacertainnumberofepisodes,whatistheworstgeneralizationperformancewe
canexpectwithcertainprobability,ifRPGisapplied.
Corollary5
(RLGeneralizationPerformance)
.
GivenaMDPwheretheUOP(Def8)is
deterministic,let
jHj
bethesizeofthehypothesisspace,andlet
;n;k
be˝xed,thefollowing
inequalityholdswithprobabilityatleast
1


0
:
X
˝
p

(
˝
)
r
(
˝
)

D
(1+
e
)

(1

m
)
T
;
where
k
isthenumberofepisodeswehaveexploredintheMDP,
n
isthenumberofdistinct
optimalstate-actionpairsweneededfromtheUOP(i.e.,sizeoftrainingdata.).
n
0
denotes
thenumberofdistinctoptimalstate-actionpairscollectedbytherandomexploration.

=
2
s
1
2
n
log
2
jHj
p
ˇ
r
;
M
(
n
0

n
j
k
)
p
ˇ
r
;
M
(
n
0

n
j
k
)

1+

0
.
TheproofofCorollary5isprovidedinAppendixA.Corollary5statesthattheprobability
ofsamplingoptimaltrajectoriesisthemainbottleneckofexplorationandgeneralization,
insteadofstatespacedimension.Ingeneral,theoptimalexplorationstrategydependsonthe
propertiesofMDPs.Inthiswork,wefocusonimprovinglearninge˚ciency,i.e.,learning
optimalrankinginsteadofestimatingvaluefunctions.Thediscussionofoptimalexploration
isbeyondthescopeofthiswork.
122
Figure4.9:ThetrainingcurvesoftheproposedRPGandstate-of-the-art.Allresultsare
averagedoverrandomseedsfrom1to5.The
x
-axisrepresentsthenumberofstepsinteracting
withtheenvironment(weupdatethemodeleveryfoursteps)andthe
y
-axisrepresentsthe
averagedtrainingepisodicreturn.Theerrorbarsareplottedwithacon˝denceintervalof
95%.
4.2.11ExperimentalResults
Toevaluatethesample-e˚ciencyofRankingPolicyGradient(RPG),wefocusonAtari
2600gamesinOpenAIgym[
18
,
24
],withoutrandomlyrepeatingthepreviousaction.We
compareourmethodwiththestate-of-the-artbaselinesincludingDQN[
132
],C51[
17
],
IQN[
42
],
Rainbow
[
80
],andself-imitationlearning(SIL)[
145
].Forreproducibility,weuse
theimplementationprovidedinDopamineframework
1
[
34
]forallbaselinesandproposed
methods,exceptforSILusingtheo˚cialimplementation.
2
.Followthestandardpractice[
145
,
1
https://github.com/google/dopamine
2
https://github.com/junhyukoh/self-imitation-learning
123
80
,
42
,
17
],wereportthetrainingperformanceofallbaselinesastheincreaseofinteractions
withtheenvironment,orproportionallythenumberoftrainingiterations.Werunthe
algorithmswith˝verandomseedsandreporttheaveragerewardswith
95
%con˝dence
intervals.TheimplementationdetailsoftheproposedRPGanditsvariantsaregivenas
follows
3
:
EPG
:EPGisthestochasticlistwisepolicygradient(seeEq(4.10))incorporatedwith
theproposedo˙-policylearning.Moreconcretely,weapplytrajectoryrewardshaping(TRS,
Def6)toalltrajectoriesencounteredduringexplorationandtrainvanillapolicygradient
usingtheo˙-policysamples.Thisisequivalenttominimizingthecross-entropyloss(see
AppendixEq(4.15))overthenear-optimaltrajectories.
LPG
:LPGisthedeterministiclistwisepolicygradientwiththeproposedo˙-policylearn-
ing.Theonlydi˙erencebetweenEPGandLPGisthatLPGchoosesactiondeterministically
(seeAppendixEq(4.9))duringevaluation.
RPG
:RPGexplorestheenvironmentusingaseparateEPGagentin
Pong
andIQNin
othergames.ThenRPGconductssupervisedlearningbyminimizingthehingelossEq(4.14).
Itisworthnotingthattheexplorationagent(EPGorIQN)canbereplacedbyanyexisting
explorationmethod.InourRPGimplementation,wecollectalltrajectorieswiththetrajectory
rewardnolessthanthethreshold
c
withouteliminatingtheduplicatedtrajectoriesandwe
empiricallyfounditisareasonablesimpli˝cation.
Sample-e˚ciency.
AstheresultsshowninFigure4.9,ourapproach,RPG,signi˝cantly
outperformsthestate-of-the-artbaselinesintermsofsample-e˚ciencyatalltasks.Further-
more,RPGnotonlyachievedthemostsample-e˚cientresults,butalsoreachedthehighest
˝nalperformanceat
Robotank
,
DoubleDunk
,
Pitfall
,and
Pong
,comparingtoany
3
Codeisavailableathttps://github.com/illidanlab/rpg.
124
Figure4.10:Thetrade-o˙betweensamplee˚ciencyandoptimality.
model-freestate-of-the-art.Inreinforcementlearning,thestabilityofalgorithmshouldbe
emphasizedasanimportantissue.Aswecanseefromtheresults,theperformanceofbaselines
variesfromtasktotask.Thereisnosinglebaselineconsistentlyoutperformsothers.In
contrast,duetothereductionfromRLtosupervisedlearning,RPGisconsistentlystableand
e˙ectiveacrossdi˙erentenvironments.Inadditiontothestabilityande˚ciency,RPGenjoys
simplicityatthesametime.Intheenvironment
Pong
,itissurprisingthatRPGwithout
anycomplicatedexplorationmethodlargelysurpassedthesophisticatedvalue-functionbased
approaches.MoredetailsofhyperparametersareprovidedintheAppendixSectionA.
4.2.12AblationStudy
Thee˙ectivenessofpairwiserankingpolicyando˙-policylearningassupervised
learning.
TogetabetterunderstandingoftheunderlyingreasonsthatRPGismoresample-
e˚cientthanDQNvariants,weperformedablationstudiesinthe
Pong
environmentby
varyingthecombinationofpolicyfunctionswiththeproposedo˙-policylearning.Theresults
ofEPG,LPG,andRPGareshowninthebottomright,Figure4.9.RecallthatEPGand
LPGuselistwisepolicygradient(vanillapolicygradientusingsoftmaxaspolicyfunction)to
conductexploration,theo˙-policylearningminimizesthecross-entropylossEq(4.15).In
contrast,RPGsharesthesameexplorationmethodasEPGandLPGwhileusespairwise
125
Figure4.11:Expectedexploratione˚ciencyofstate-of-the-art.Theresultsareaveragedover
randomseedsfrom1to10.
rankingpolicyEq(4.5)ino˙-policylearningthatminimizeshingelossEq(4.14).Wecansee
thatRPGismoresample-e˚cientthanEPG/LPGinlearningdeterministicoptimalpolicy.
Wealsocomparedtheadvancedon-policymethodProximalPolicyOptimization(PPO)[
170
]
withEPG,LPG,andRPG.Theproposedo˙-policylearninglargelysurpassedthebest
on-policymethod.Therefore,weconcludethato˙-policyassupervisedlearningcontributes
tothesample-e˚ciencysubstantially,whilethepairwiserankingpolicycanfurtheraccelerate
thelearning.Inaddition,wecompareRPGtorepresentativeo˙-policypolicygradient
approach:ACER[
208
].Astheresultsshown,theproposedo˙-policylearningframeworkis
moresample-e˚cientthanthestate-of-the-arto˙-policypolicygradientapproaches.
OntheTrade-o˙betweenSample-E˚ciencyandOptimality.
ResultsinFigure4.10
showthatthereisatrade-o˙betweensamplee˚ciencyandoptimality,whichiscontrolledby
thetrajectoryrewardthreshold
c
.Recallthat
c
determineshowcloseisthelearnedUNOP
tooptimalpolicies.Ahighervalueof
c
leadstoalessfrequencyofnear-optimaltrajectories
beingcollectedandandthusalowersamplee˚ciency,andhoweverthealgorithmisexpected
toconvergetoastrategyofbetterperformance.Wenotethat
c
istheonlyparameterwe
tunedacrossallexperiments.
126
ExplorationE˚ciency.
WeempiricallyevaluatetheExpectedExplorationE˚ciency
(Def9)ofthestate-of-the-arton
Pong
.ItisworthnotingthattheRLgeneralization
performanceisdeterminedbybothoflearninge˚ciencyandexploratione˚ciency.Therefore,
higherexploratione˚ciencydoesnotnecessarilyleadtomoresamplee˚cientalgorithmdue
tothelearningine˚ciency,asdemonstratedby
RainBow
and
DQN
(seeFigure4.11).Also,
theImplicitQuantileachievesthebestperformanceamongbaselines,sinceitsexploration
e˚ciencylargelysurpassesotherbaselines.
4.2.13Conclusion
Inthiswork,weintroducedrankingpolicygradientmethodsthat,forthe˝rsttime,approach
theRLproblemfromarankingperspective.Furthermore,towardsthesample-e˚cientRL,
weproposeano˙-policylearningframework,whichtrainsRLagentsinasupervisedlearning
mannerandthuslargelyfacilitatesthelearninge˚ciency.Theo˙-policylearningframework
usesgeneralizedpolicyiterationforexplorationandexploitsthestablenessofsupervised
learningforderivingpolicy,whichaccomplishestheunbiasedness,variancereduction,o˙-
policylearning,andsamplee˚ciencyatthesametime.Besides,weprovideanalternative
approachtoanalyzethesamplecomplexityofRL,andshowthatthesamplecomplexityof
RPGhasnodependencyonthestatespacedimension.Lastbutnotleast,empiricalresults
showthatRPGachievessuperiorperformanceascomparedtothestate-of-the-art.
127
Chapter5
CollaborativeMulti-AgentLearning
Inthischapter,weinvestigatethescalabilityofcollaborativelearninginthecontextof
multi-agentlearningforareal-world˛eetmanagementapplication.Weproposetotransfer
thecoordinationofalargenumberoflearningagentsintoalinearprogrammingproblem,
withproperdomainknowledgetoguidetheoptimization.Weshowthesuperiorityofthis
globalcollaborationcomparedtoindividuallearningthroughextensiveevaluationonthe
real-worldtra˚cdata.
5.1Introduction
Large-scaleonlineride-sharingplatformssuchasUber[
201
],Lift[
126
],andDidiChuxing[
40
]
havetransformedthewaypeopletravel,liveandsocialize.Byleveragingtheadvances
inandwideadoptionofinformationtechnologiessuchascellularnetworksandglobal
positioningsystems,theride-sharingplatformsredistributeunderutilizedvehiclesonthe
roadstopassengersinneedoftransportation.Theoptimizationoftransportationresources
greatlyalleviatedtra˚ccongestionandcalibratedtheoncesigni˝cantgapbetweentransport
demandandsupply[112].
Onekeychallengeinride-sharingplatformsistobalancethedemandsandsupplies,i.e.,
ordersofthepassengersanddriversavailableforpickinguporders.Inlargecities,although
millionsofride-sharingordersareservedeveryday,anenormousnumberofpassengersrequests
128
remainunservicedduetothelackofavailabledriversnearby.Ontheotherhand,there
areplentyofavailabledriverslookingforordersinotherlocations.Iftheavailabledrivers
weredirectedtolocationswithhighdemand,itwillsigni˝cantlyincreasethenumberof
ordersbeingserved,andthussimultaneouslybene˝tallaspectsofthesociety:utilityof
transportationcapacitywillbeimproved,incomeofdriversandsatisfactionofpassengers
willbeincreased,andmarketshareandrevenueofthecompanywillbeexpanded.
˛eet
management
isakeytechnicalcomponenttobalancethedi˙erencesbetweendemandand
supply,byreallocatingavailablevehiclesaheadoftime,toachievehighe˚ciencyinserving
futuredemand.
Eventhoughrichhistoricaldemandandsupplydataareavailable,usingthedatato
seekanoptimalallocationpolicyisnotaneasytask.Onemajorissueisthatchanges
inanallocationpolicywillimpactfuturedemand-supply,anditishardforsupervised
learningapproachestocaptureandmodelthesereal-timechanges.Ontheotherhand,the
reinforcementlearning(RL)[
186
],whichlearnsapolicybyinteractingwithacomplicated
environment,hasbeennaturallyadoptedtotacklethe˛eetmanagementproblem[
64
,
65
,
211
].
However,thehigh-dimensionalandcomplicateddynamicsbetweendemandandsupplycan
hardlybemodeledaccuratelybytraditionalRLapproaches.
Recentyearswitnessedtremendoussuccessindeepreinforcementlearning(DRL)in
modelingintellectualchallengingdecision-makingproblems[
132
,
174
,
175
]thatwerepreviously
intractable.Inthelightofsuchadvances,inthischapterweproposeanovelDRLapproachto
learnhighlye˚cientallocationpoliciesfor˛eetmanagement.Therearesigni˝canttechnical
challengeswhenmodeling˛eetmanagementusingDRL:
1)
Feasibilityofproblemsetting.
TheRLframeworkisreward-driven,meaningthatasequence
of
actions
fromthepolicyisevaluatedsolelybythe
reward
signalfromenvironment[
11
].
129
Thede˝nitionsofagent,rewardandactionspaceareessentialforRL.Ifwemodelthe
allocationpolicyusingacentralizedagent,theactionspacecanbeprohibitivelylargesince
anactionneedstodecidethenumberofavailablevehiclestorepositionfromeachlocationto
itsnearbylocations.Also,thepolicyissubjecttoafeasibilityconstraintenforcingthatthe
numberofrepositionedvehiclesneedstobenolargerthanthecurrentnumberofavailable
vehicles.Tothebestofourknowledge,thishigh-dimensionalexact-constrainsatisfaction
policyoptimizationisnotcomputationallytractableinDRL:applyingitinaverysmall-scale
problemcouldalreadyincurhighcomputationalcosts[154].
2)
Large-scaleAgents.
Onealternativeapproachistoinsteaduseamulti-agentDRLsetting,
whereeachavailablevehicleisconsideredasanagent.Themulti-agentrecipeindeedalleviates
thecurseofdimensionalityofactionspace.However,suchsettingcreatesthousandsofagents
interactingwiththeenvironmentateachtime.TrainingalargenumberofagentsusingDRL
isagainchallenging:theenvironmentforeachagentisnon-stationarysinceotheragentsare
learninganda˙ectingtheenvironmentatsamethetime.Mostofexistingstudies[
125
,
60
,
189
]
allowcoordinationamongonlyasmallsetofagentsduetohighcomputationalcosts.
3)
CoordinationsandContextDependenceofActionspace
Facilitatingcoordinationamong
large-scaleagentsremainsachallengingtask.Sinceeachagenttypicallylearnsitsownpolicy
oraction-valuefunctionthatarechangingovertime,itisdi˚culttocoordinateagentsfor
alargenumberofagents.Moreover,theactionspaceisdynamicchangingovertimesince
agentsarenavigatingtodi˙erentlocationsandthenumberoffeasibleactionsdependson
thegeographiccontextofthelocation.
Inthispaper,weproposeacontextualmulti-agentDRLframeworktoresolvetheafore-
mentionedchallenges.Ourmajorcontributionsarelistedasfollows:
130

Weproposeane˚cientmulti-agentDRLsettingforlarge-scale˛eetmanagement
problembyaproperdesignofagent,rewardandstate.

Weproposecontextualmulti-agentreinforcementlearningframeworkinwhichthree
concretealgorithms:
contextualmulti-agentactor-critic
(cA2C),
contextualdeepQ-
learning
(cDQN),and
Contextualmulti-agentactor-criticwithlinearprogramming
(LP-cA2C)aredeveloped.Forthe˝rsttimeinmulti-agentDRL,thecontextual
algorithmscannotonlyachievee˚cientcoordinationamongthousandsoflearning
agentsateachtime,butalsoadapttodynamicallychangingactionspaces.

InordertotrainandevaluatetheRLalgorithm,wedevelopedasimulatorthatsimulates
real-worldtra˚cactivitiesperfectlyaftercalibratingthesimulatorusingrealhistorical
dataprovidedbyDidiChuxing[40].

Lastbutnotleast,theproposedcontextualalgorithmssigni˝cantlyoutperformthe
state-of-the-artmethodsinmulti-agentDRLwithamuchlessnumberofrepositions
needed.
Therestofthischapterisorganizedasfollows.We˝rstgivealiteraturereviewon
therelatedworkinSec5.2.ThentheproblemstatementiselaboratedinSec5.3andthe
simulationplatformwebuiltfortrainingandevaluationareintroducedinSec5.6.The
methodologyisdescribedinSec5.4.Quantitativeandqualitativeresultsarepresentedin
Sec6.6.Finally,weconcludeourworkinSec5.8.
131
5.2RelatedWorks
IntelligentTransportationSystem.
Advancesinmachinelearningandtra˚cdata
analyticsleadtowidespreadapplicationsofmachinelearningtechniquestotacklechallenging
tra˚cproblems.Onetrendingdirectionistoincorporatereinforcementlearningalgorithms
incomplicatedtra˚cmanagementproblems.Therearemanypreviousstudiesthathave
demonstratedthepossibilityandbene˝tsofreinforcementlearning.Ourworkhasclose
connectionstothesestudiesintermsofproblemsetting,methodologyandevaluation.Among
thetra˚capplicationsthatarecloselyrelatedtoourwork,suchastaxidispatchsystemsor
tra˚clightcontrolalgorithms,multi-agentRLhasbeenexploredtomodeltheintricatenature
ofthesetra˚cactivities[
14
,
172
,
128
].Thepromisingresultsmotivatedustousemulti-agent
modelinginthe˛eetmanagementproblem.In[
64
],anadaptivedynamicprogramming
approachwasproposedtomodelstochasticdynamicresourceallocation.Itestimatesthe
returnsoffuturestatesusingapiecewiselinearfunctionanddeliversactions(assigningorders
tovehicles,reallocateavailablevehicles)givenstatesandonestepfuturestatesvalues,by
solvinganintegerprogrammingproblem.In[
65
],theauthorsfurtherextendedtheapproach
tothesituationsthatanactioncanspanacrossmultipletimeperiods.Thesemethodsare
hardtobedirectlyutilizedinthereal-worldsettingwhereorderscanbeservedthroughthe
vehicleslocatedinmultiplenearbylocations.
Multi-agentreinforcementlearning.
Anotherrelevantresearchtopicismulti-agent
reinforcementlearning[
27
]whereagroupofagentssharethesameenvironment,inwhich
theyreceiverewardsandtakeactions.[
190
]comparedandcontrastedindependent
Q
-learning
andacooperativecounterpartindi˙erentsettings,andempiricallyshowedthatthelearning
speedcanbene˝tfromthecooperationamongagents.Independent
Q
-learningisextended
132
intoDRLin[
189
],wheretwoagentsarecooperatingorcompetingwitheachotheronly
throughthereward.In[
60
],theauthorsproposedacounterfactualmulti-agentpolicy
gradientmethodthatusesacentralizedadvantagetoestimatewhethertheactionofone
agentwouldimprovetheglobalreward,anddecentralizedactorstooptimizetheagentpolicy.
Ryan
etal.
alsoutilizedtheframeworkofdecentralizedexecutionandcentralizedtrainingto
developmulti-agentmulti-agentactor-criticalgorithmthatcancoordinateagentsinmixed
cooperative-competitiveenvironments[
125
].However,noneofthesemethodswereapplied
whentherearealargenumberofagentsduetothecommunicationcostamongagents.
Recently,fewworks[
230
,
217
]scaledDRLmethodstoalargenumberofagents,whileitis
notapplicabletoapplythesemethodstocomplexrealapplicationssuchas˛eetmanagement.
In[
140
,
141
],theauthorsstudiedlarge-scalemulti-agentplanningfor˛eetmanagementwith
explicitlymodelingtheexpectedcountsofagents.
Deepreinforcementlearning.
DRLutilizesneuralnetworkfunctionapproximationsand
areshowntohavelargelyimprovedtheperformanceoverchallengingapplications[
175
,
132
].
ManysophisticatedDRLalgorithmssuchasDQN[
132
],A3C[
131
]weredemonstratedto
bee˙ectiveinthetasksinwhichwehaveaclearunderstandingofrulesandhaveeasy
accesstomillionsofsamples,suchasvideogames[
24
,
18
].However,DRLapproachesare
rarelyseentobeappliedincomplicatedreal-worldapplications,especiallyinthosewith
high-dimensionalandnon-stationaryactionspace,lackofwell-de˝nedrewardfunction,andin
needofcoordinationamongalargenumberofagents.Inthischapter,weshowthatthrough
carefulreformulation,theDRLcanbeappliedtotacklethe˛eetmanagementproblem.
133
5.3ProblemStatement
Inthischapter,weconsidertheproblemofmanagingalargesetofavailablehomogeneous
vehiclesforonlineride-sharingplatforms.Thegoalofthemanagementistomaximizethe
grossmerchandisevolume(GMV:thevalueofalltheordersserved)oftheplatformby
repositioningavailablevehiclestothelocationswithlargerdemand-supplygapthanthe
currentone.Thisproblembelongstoavariantoftheclassical˛eetmanagementproblem[
47
].
Aspatial-temporalillustrationoftheproblemisavailableinFigure5.1.Inthisexample,we
use
hexagonal-gridworld
torepresentthemapandsplitthedurationofonedayinto
T
=144
timeintervals(onefor10minutes).Ateachtimeinterval,theordersemergestochasticallyin
eachgridandareservedbytheavailablevehiclesinthesamegridorsixnearbygrids.The
goalof˛eetmanagementhereistodecidehowmanyavailablevehiclestorelocatefromeach
gridtoitsneighborsinaheadoftime,sothatmostorderscanbeserved.
Totacklethisproblem,weproposetoformulatetheproblemusing
multi-agentreinforce-
mentlearning
[
27
].Inthisformulation,weuseasetofhomogeneousagentswithsmallaction
spaces,andsplittheglobalrewardintoeachgrid.Thiswillleadtoamuchmoree˚cient
learningprocedurethanthesingleagentsetting,duetothesimpli˝edactiondimensionand
theexplicitcreditassignmentbasedonsplitreward.Formally,wemodelthe˛eetmanagement
problemasaMarkovgame
G
for
N
agents,whichisde˝nedbyatuple
G
=(
N;
S
;
A
;
P
;
R
;
)
,
where
N;
S
;
A
;
P
;
R
;
arethenumberofagents,setsofstates,jointactionspace,transition
probabilityfunctions,rewardfunctions,andadiscountfactorrespectively.Thede˝nitions
aregivenasfollows:

Agent
:Weconsideranavailablevehicle(orequivalentlyanidledriver)asanagent,
andthevehiclesinthesamespatial-temporalnodearehomogeneous,i.e.,thevehicles
134
locatedatthesameregionatthesametimeintervalareconsideredassameagents
(whereagentshavethesamepolicy).Althoughthenumberofuniqueheterogeneous
agentsisalways
N
,thenumberofagents
N
t
ischangingovertime.

State
s
t
2S
:Wemaintainaglobalstate
s
t
ateachtime
t
,consideringthespatial
distributionsofavailablevehiclesandorders(i.e.thenumberofavailablevehiclesand
ordersineachgrid)andcurrenttime
t
(usingone-hotencoding).Thestateofanagent
i
,
s
i
t
,isde˝nedastheidenti˝cationofthegriditlocatedandthesharedglobalstate
i.e.
s
i
t
=[
s
t
;
g
j
]
2
R
N

3+
T
,where
g
j
istheone-hotencodingofthegridID.Wenote
thatagentslocatedatsamegridhavethesamestate
s
i
t
.

Action
a
t
2A
=
A
1

:::
A
N
t
:a
jointaction
a
t
=
f
a
i
t
g
N
t
1
instructingtheallocation
strategyofallavailablevehiclesattime
t
.Theactionspace
A
i
ofanindividualagent
speci˝eswheretheagentisabletoarriveatthenexttime,whichgivesasetofseven
discreteactionsdenotedby
f
k
g
7
k
=1
.The˝rstsixdiscreteactionsindicateallocatingthe
agenttooneofitssixneighboringgrids,respectively.Thelastdiscreteaction
a
i
t
=7
meansstayinginthecurrentgrid.Forexample,theaction
a
1
0
=2
meanstorelocate
the
1
stagentfromthecurrentgridtothesecondnearbygridattime
0
,asshownin
Figure5.1.Foraconcisepresentation,wealsouse
a
i
t
,
[
g
0
;
g
1
]
torepresentagent
i
movingfromgrid
g
0
to
g
1
.Furthermore,theactionspaceofagentsdependsontheir
locations.Theagentslocatedatcornergridshaveasmalleractionspace.Wealso
assumethattheactionisdeterministic:if
a
i
t
,
[
g
0
;
g
1
]
,thenagent
i
willarriveatthe
grid
g
1
attime
t
+1
.

Rewardfunction
R
i
2R
=
SA!
R
:Eachagentisassociatedwithareward
function
R
i
andallagentsinthesamelocationhavethesamerewardfunction.The
135
i
-thagentattemptstomaximizeitsownexpecteddiscountedreturn:
E
h
P
1
k
=0

k
r
i
t
+
k
i
.
Theindividualreward
r
i
t
forthe
i
-thagentassociatedwiththeaction
a
i
t
isde˝nedasthe
averagedrevenueofallagentsarrivingatthesamegridasthe
i
-thagentattime
t
+1
.
Sincetheindividualrewardsatsametimeandthesamelocationaresame,wedenote
thisrewardofagentsattime
t
andgrid
g
j
as
r
t
(
g
j
)
.Suchdesignofrewardsaimsat
avoidinggreedyactionsthatsendtoomanyagentstothelocationwithhighvalueof
orders,andaligningthemaximizationofeachagent'sreturnwiththemaximizationof
GMV(valueofallservedordersinoneday).Itse˙ectivenessisempiricallyveri˝edin
Sec6.6.

Statetransitionprobability
p
(
s
t
+1
j
s
t
;a
t
):
SAS!
[0
;
1]
:Itgivestheproba-
bilityoftransitingto
s
t
+1
givenajointaction
a
t
istakeninthecurrentstate
s
t
.Notice
thatalthoughtheactionisdeterministic,newvehiclesandorderswillbeavailable
atdi˙erentgridseachtime,andexistingvehicleswillbecomeo˙-lineviaarandom
process.
Tobemoreconcrete,wegiveanexamplebasedontheaboveproblemsettinginFigure5.1.
Attime
t
=0
,agent
1
isrepositionedfrom
g
0
to
g
2
byaction
a
1
0
,andagent
2
isalso
repositionedfrom
g
1
to
g
2
byaction
a
2
0
.Attime
t
=1
,twoagentsarriveat
g
2
,andanew
orderwithvalue
10
alsoemergesatsamegrid.Therefore,thereward
r
1
forboth
a
1
0
and
a
2
0
istheaveragedvaluereceivedbyagentsat
g
2
,whichis
10
=
2=5
.
It'sworthtonotethatthisrewarddesignmaynotleadtotheoptimalreallocationstrategy
thoughitempiricallyleadstogoodreallocationpolicy.Wegiveasimpleexampletoillustrate
thisproblem.WeusethegridworldmapasshowinFigure5.1.Attime
t
=1
,thereisan
orderwithvalue100emergedin
g
1
andanotherorderwithvalue10emergedin
g
0
.Suppose
136
Figure5.1:Thegridworldsystemandaspatial-temporalillustrationoftheproblemsetting.
wehavetwoagentsthatareavailableingrid
g
0
attime
t
=0
.Theoptimalreallocation
strategyinthiscaseistoaskoneagentstayin
g
0
andanothergoto
g
1
,bywhichwecan
receivethetotalreward110.However,inthecurrentsetting,eachagenttrystomaximize
itsownreward.Asaresult,bothofthemwillgoto
g
1
andreceive50rewardandnoneof
themwillgoto
g
1
sincetherewardtheycanreceiveislessthan50.However,weshowthat
therearefewwaystoapproximatethisglobaloptimalallocationstrategyusingtheindividual
actionfunctionofeachagent.
5.4ContextualMulti-AgentReinforcementLearning
Inthissection,wepresenttwonovelcontextualmulti-agentRLapproaches:contextual
multi-agentactor-critic(cA2C)andcontextualDQN(cDQN)algorithm.We˝rstbrie˛y
introducethebasicmulti-agentRLmethod.
5.4.1IndependentDQN
IndependentDQN[
189
]combinesindependent
Q
-learning[
190
]andDQN[
132
].Astraightfor-
wardextensionofindependentDQNfromsmallscaletoalargenumberofagents,istoshare
137
networkparametersanddistinguishagentswiththeirIDs[
230
].Thenetworkparameters
canbeupdatedbyminimizingthefollowinglossfunction,withrespecttothetransitions
collectedfromallagents:
E
2
4
Q
(
s
i
t
;a
i
t
;

)

0
@
r
i
t
+1
+

max
a
i
t
+1
Q
(
s
i
t
+1
;a
i
t
+1
;

0
)
1
A
3
5
2
;
(5.1)
where

0
includesparametersofthetarget
Q
networkupdatedperiodically,and

includes
parametersofbehavior
Q
networkoutputtingtheactionvaluefor

-greedypolicy,sameas
thealgorithmdescribedin[
132
].Thismethodcouldworkreasonablywellafterextensive
tunningbutitsu˙ersfromhighvarianceinperformance,anditalsorepositionstoomany
vehicles.Moreover,coordinationamongmassiveagentsishardtoachievesinceeachunique
agentexecutesitsactionindependentlybasedonitsactionvalues.
5.4.2ContextualDQN
Sinceweassumethatthelocationtransitionofanagentaftertheallocationactionis
deterministic,theactionsthatleadtheagentstothesamegridshouldhavethesameaction
value.Inthiscase,thenumberofuniqueaction-valuesforallagentsshouldbeequalto
thenumberofgrids
N
.Formally,foranyagent
i
where
s
i
t
=[
s
t
;
g
i
]
,
a
i
t
,
[
g
i
;
g
d
]
and
g
i
2
Ner
(
g
d
)
,thefollowingholds:
Q
(
s
i
t
;a
i
t
)=
Q
(
s
t
;
g
d
)
(5.2)
138
Hence,ateachtimestep,weonlyneed
N
uniqueaction-values(
Q
(
s
t
;
g
j
)
;
8
j
=1
;:::;N
)and
theoptimizationofEq(5.1)canbereplacedbyminimizingthefollowingmean-squaredloss:
"
Q
(
s
t
;
g
d
;

)

 
r
t
+1
(
g
d
)+

max
g
p
2
Ner
(
g
d
)
Q
(
s
t
+1
;
g
p
;

0
)
!#
2
:
(5.3)
Thisacceleratesthelearningproceduresincetheoutputdimensionoftheactionvaluefunction
isreducedfrom
R
j
s
t
j
!
R
7
to
R
j
s
t
j
!
R
.Furthermore,wecanbuildacentralizedaction-
valuetableateachtimeforallagents,whichcanserveasthefoundationforcoordinatingthe
actionsofagents.
Geographiccontext.
Inhexagonalgridssystems,bordergridsandgridssurroundedby
infeasiblegrids(e.g.,alake)havereducedactiondimensions.Toaccommodatethis,foreach
gridwecomputea
geographiccontext
G
g
j
2
R
7
,whichisabinaryvectorthat˝ltersout
invalidactionsforagentsingrid
g
j
.The
k
thelementofvector
G
g
j
representsthevalidity
ofmovingtoward
k
thdirectionfromthegrid
g
j
.Denote
g
d
asthegridcorrespondstothe
k
thdirectionofgrid
g
j
,thevalueofthe
k
thelementof
G
g
j
isgivenby:
[
G
t;
g
j
]
k
=
8
>
<
>
:
1
;
if
g
d
isvalidgrid
;
0
;
otherwise
;
(5.4)
where
k
=0
;:::;
6
andlastdimensionofthevectorrepresentsdirectionstayinginsamegrid,
whichisalways1.
Collaborativecontext.
Toavoidthesituationthatagentsaremovingincon˛ictdirections
(i.e.,agentsarerepositionedfromgrid
g
1
to
g
2
and
g
2
to
g
1
atthesametime.),weprovide
a
collaborativecontext
C
t;
g
j
2
R
7
foreachgrid
g
j
ateachtime.Basedonthecentralized
actionvalues
Q
(
s
t
;
g
j
)
,werestrictthevalidactionssuchthatagentsatthegrid
g
j
are
139
navigatingtotheneighboringgridswithhigheractionvaluesorstayingunmoved.Therefore,
thebinaryvector
C
t;
g
j
eliminatesactionstogridswithloweractionvaluesthantheaction
stayingunmoved.Formally,the
k
thelementofvector
C
t;
g
j
thatcorrespondstoactionvalue
Q
(
s
t
;
g
i
)
isde˝nedasfollows:
[
C
t;
g
j
]
k
=
8
>
<
>
:
1
;
if
Q
(
s
t
;
g
i
)
>
=
Q
(
s
t
;
g
j
)
;
0
;
otherwise
:
(5.5)
Aftercomputingbothcollaborativeandgeographiccontext,the

-greedypolicyisthen
performedbasedontheactionvaluessurvivedfromthetwocontexts.Supposetheoriginal
actionvaluesofagent
i
attime
t
is
Q
(
s
i
t
)
2
R
7

0
,givenstate
s
i
t
,thevalidactionvaluesafter
applyingcontextsisasfollows:
q
(
s
i
t
)=
Q
(
s
i
t
)

C
t;
g
j

G
t;
g
j
:
(5.6)
Thecoordinationisenabledbecausethattheactionvaluesofdi˙erentagentsleadtothe
samelocationarerestrictedtobesamesothattheycanbecompared,whichisimpossiblein
independentDQN.Thismethodrequiresthatactionvaluesarealwaysnon-negative,which
willalwaysholdbecausethatagentsalwaysreceivenonnegativerewards.Thealgorithmof
cDQNiselaboratedinAlg5.2.
5.4.3ContextualActor-Critic
Wenowpresentthecontextualmulti-agentactor-critic(cA2C)algorithm,whichisamulti-
agentpolicygradientalgorithmthattailorsitspolicytoadapttothedynamicallychanging
actionspace.Meanwhile,itachievesnotonlyamorestableperformancebutalsoamuch
140
Algorithm5.1:

-greedypolicyforcDQN
Require:
Globalstate
s
t
1:
Computecentralizedactionvalue
Q
(
s
t
;
g
j
)
;
8
j
=1
;:::;N
2:
for
i
=1
to
N
t
do
3:
Computeactionvalues
Q
i
byEq(5.2),where
(
Q
i
)
k
=
Q
(
s
i
t
;a
i
t
=
k
)
.
4:
Computecontexts
C
t;
g
j
and
G
t;
g
j
foragent
i
.
5:
Computevalidactionvalues
q
i
t
=
Q
i
t

C
t;
g
j

G
t;
g
j
.
6:
a
i
t
=
argmax
k
q
i
t
withprobability
1


otherwisechooseanactionrandomlyfromthe
validactions.
7:
endfor
8:
return
Jointaction
a
t
=
f
a
i
t
g
N
t
1
.
Algorithm5.2:
ContextualDeepQ-learning(cDQN)
1:
Initializereplaymemory
D
tocapacity
M
2:
Initializeaction-valuefunctionwithrandomweights

orpre-trainedparameters.
3:
for
m
=1
to
max-iterations
do
4:
Resettheenvironmentandreachtheinitialstate
s
0
.
5:
for
t
=0
to
T
do
6:
Samplejointaction
a
t
usingAlg.5.1,given
s
t
.
7:
Execute
a
t
insimulatorandobservereward
r
t
andnextstate
s
t
+1
8:
Storethetransitionsofallagents(
s
i
t
;a
i
t
;r
i
t
;
s
i
t
+1
;
8
i
=1
;:::;N
t
)in
D
.
9:
endfor
10:
for
k
=1
to
M
1
do
11:
Sampleabatchoftransitions(
s
i
t
;a
i
t
;r
i
t
;
s
i
t
+1
)from
D
,
12:
Computetarget
y
i
t
=
r
i
t
+


max
a
i
t
+1
Q
(
s
i
t
+1
;a
i
t
+1
;

0
)
.
13:
Update
Q
-networkas

 
+
r

(
y
i
t

Q
(
s
i
t
;a
i
t
;

))
2
,
14:
endfor
15:
endfor
moree˚cientlearningprocedureinanon-stationaryenvironment.Therearetwomainideas
inthedesignofcA2C:1)Acentralizedvaluefunctionsharedbyallagentswithanexpected
update;2)Policycontextembeddingthatestablishesexplicitcoordinationamongagents,
enablesfastertrainingandenjoysthe˛exibilityofregulatingpolicytodi˙erentactionspaces.
Thecentralizedstate-valuefunctionislearnedbyminimizingthefollowinglossfunction
141
derivedfromBellmanequation:
L
(

v
)=(
V

v
(
s
i
t
)

V
target
(
s
t
+1
;

0
v
;ˇ
))
2
;
(5.7)
V
target
(
s
t
+1
;

0
v
;ˇ
)=
X
a
i
t
ˇ
(
a
i
t
j
s
i
t
)(
r
i
t
+1
+
V

0
v
(
s
i
t
+1
))
:
(5.8)
whereweuse

v
todenotetheparametersofthevaluenetworkand

0
v
todenotethetarget
valuenetwork.Sinceagentsstayingunmovedatthesametimearetreatedhomogeneousand
sharethesameinternalstate,thereare
N
uniqueagentstates,andthus
N
uniquestate-values
(
V
(
s
t
;
g
j
)
;
8
j
=1
;:::;N
)ateachtime.Thestate-valueoutputisdenotedby
v
t
2
R
N
,where
eachelement
(
v
t
)
j
=
V
(
s
t
;
g
j
)
istheexpectedreturnreceivedbyagentarrivingatgrid
g
j
ontime
t
.Inordertostabilizelearningofthevaluefunction,we˝xatargetvaluenetwork
parameterizedby

0
v
,whichisupdatedattheendofeachepisode.Notethattheexpected
updateinEq(5.7)andtrainingactor/criticinano˜inefashionaredi˙erentfromtheupdates
in
n
-stepactor-criticonlinetrainingusingTDerror[
131
],whereastheexpectedupdates
andtrainingparadigmarefoundtobemorestableandsample-e˚cient.Thisisalsoin
linewithpriorworkinapplyingactor-critictorealapplications[
12
].Furthermore,e˚cient
coordinationamongmultipleagentscanbeestablisheduponthiscentralizedvaluenetwork.
PolicyContextEmbedding.
Coordinationisachievedbymaskingavailableactionspace
basedonthecontext.Ateachtimestep,thegeographiccontextisgivenbyEq(5.4)andthe
collaborativecontextiscomputedaccordingtothevaluenetworkoutput:
[
C
t;
g
j
]
k
=
8
>
<
>
:
1
;ifV
(
s
t
;
g
i
)
>
=
V
(
s
t
;
g
j
)
;
0
;
otherwise
;
(5.9)
wherethe
k
thelementofvector
C
t;
g
j
correspondstotheprobabilityofthe
k
thaction
142
ˇ
(
a
i
t
=
k
j
s
i
t
)
.Let
P
(
s
i
t
)
2
R
7
>
0
denotetheoriginallogitsfromthepolicynetworkoutputfor
the
i
thagentconditionedonstate
s
i
t
.Let
q
valid
(
s
i
t
)=
P
(
s
i
t
)

C
t;
g
j

G
g
j
denotethevalid
logitsconsideringbothgeographicandcollaborativecontextforagent
i
atgrid
g
j
,where

denotesanelement-wisemultiplication.Inordertoachievee˙ectivemasking,werestrictthe
outputlogits
P
(
s
i
t
)
tobepositive.Theprobabilityofvalidactionsforallagentsinthegrid
g
j
aregivenby:
ˇ

p
(
a
i
t
=
k
j
s
i
t
)=[
q
valid
(
s
i
t
)]
k
=
[
q
valid
(
s
i
t
)]
k
k
q
valid
(
s
i
t
)
k
1
:
(5.10)
Thegradientofpolicycanthenbewrittenas:
r

p
J
(

p
)=
r

p
log
ˇ

p
(
a
i
t
j
s
i
t
)
A
(
s
i
t
;a
i
t
)
;
(5.11)
where

p
denotestheparametersofpolicynetworkandtheadvantage
A
(
s
i
t
;a
i
t
)
iscomputed
asfollows:
A
(
s
i
t
;a
i
t
)=
r
i
t
+1
+
V

0
v
(
s
i
t
+1
)

V

v
(
s
i
t
)
:
(5.12)
ThedetaileddescriptionofcA2CissummarizedinAlg5.4.
5.5E˚cientallocationwithlinearprogramming
Inthissection,wepresenttheproposedLP-cA2Cthatutilizesthestatevaluefunctions
learnedbycA2Candcomputethereallocationsinacentralizedview,whichachievesthebest
performancewithhighere˚ciency.
143
Algorithm5.3:
ContextualMulti-agentActor-CriticPolicyforward
Require:
Theglobalstate
s
t
.
1:
Computecentralizedstate-value
v
t
2:
for
i=1to
N
t
do
3:
Computecontexts
C
t;
g
j
and
G
t;
g
j
foragent
i
.
4:
Computeactionprobabilitydistribution
q
valid
(
s
i
t
)
foragent
i
ingrid
g
j
(Eq(5.10)).
5:
Sampleactionforagent
i
ingrid
g
j
basedonactionprobability
p
i
.
6:
endfor
7:
return
Jointaction
a
t
=
f
a
i
t
g
N
t
1
.
Figure5.2:Illustrationofcontextualmulti-agentactor-critic.Theleftpartshowsthe
coordinationofdecentralizedexecutionbasedontheoutputofcentralizedvaluenetwork.
Therightpartillustratesembeddingcontexttopolicynetwork.
Fromanotherperspective,ifweformulatethisproblemasaMDPwherewehavea
meta-agentthatcontrolsthedecisionsofalldrivers,ourgoalistomaximizethelongterm
rewardoftheplatform:
Q
c
(
s
;
a
)=
E
[
1
X
t
=1

t

1
r
t
(
s
t
;
a
t
)
j
s
0
=
s
;
a
0
=
a
;ˇ

]
:
The
ˇ

inaboveformulationdenotestheoptimalglobalreallocationstrategy.Althoughthe
sumofimmediaterewardreceivedbyallagentsisequaltothetotalrewardoftheplatform,
maximizingthelongtermrewardofeachagentisnotequaltomaximizethelongterm
rewardoftheplatform,i.e.
P
i
max
a
i
Q
(
s
i
;a
i
)
6
=max
a
Q
c
(
s
;
a
)
.Incooperativemulti-agent
144
Algorithm5.4:
ContextualMulti-agentActor-CriticAlgorithmfor
N
agents
1:
Initialization:
2:
Initializethevaluenetworkwith˝xedvaluetable.
3:
for
m
=1
tomax-iterations
do
4:
Resetenvironment,getinitialstate
s
0
.
5:
Stage1:Collectingexperience
6:
for
t
=0
to
T
do
7:
Sampleactions
a
t
accordingtoAlg5.3,given
s
t
.
8:
Execute
a
t
insimulatorandobservereward
r
t
andnextstate
s
t
+1
.
9:
ComputevaluenetworktargetasEq(5.8)andadvantageasEq(5.12)
forpolicynetworkandstorethetransitions.
10:
endfor
11:
Stage2:Updatingparameters
12:
for
m
1
=1
to
M
1
do
13:
Sampleabatchofexperience:
s
i
t
;V
target
(
s
i
t
;

0
v
;ˇ
)
14:
UpdatevaluenetworkbyminimizingthevaluelossEq(5.7)overthebatch.
15:
endfor
16:
for
m
2
=1
to
M
2
do
17:
Sampleabatchofexperience:
s
i
t
;a
i
t
;A
(
s
i
t
;a
i
t
)
;
C
t;g
j
;
G
g
j
.
18:
Updatepolicynetworkas

p
 

p
+
r

p
J
(

p
)
.
19:
endfor
20:
endfor
reinforcementlearning,thesumofrewardsofmultipleagentsistheglobalrewardwewant
tomaximize.Inthiscase,givenacentralizedpolicy(
ˇ

)forallagents,thesummationof
longtermrewardshouldbeequaltothegloballongtermreward.
N
X
i
=1
Q
i
(
s
i
;a
i
)=
N
X
i
=1
E
ˇ

"
1
X
t
=1

t

1
r
i
t


s
i
0
=
s
i
;a
i
0
=
a
i
#
=
E
ˇ

2
4
1
X
t
=1

t

1
N
X
i
=1
r
i
t


s
0
=
s
;
a
0
=
a
3
5
=
E
ˇ

"
1
X
t
=1

t

1
r
t


s
0
=
s
;
a
0
=
a
#
=
Q
c
(
s
;
a
)
However,inthiswork,thissimplerelationshipdoesnotholdmainlysincethenumberof
agents(
N
t
)isnotstatic.AsshowninEq(5.13),theglobalrewardattime
t
+1
oftheplatform
145
isnotequaltothesumofallcurrentagents'reward(i.e.
P
N
t
i
=1
r
i
t
+1
6
=
P
N
t
+1
i
=1
r
i
t
+1
=
r
t
+1
)
evengivenacentralizedpolicy
ˇ

.
N
t
X
i
=1
Q
(
s
i
t
;a
i
t
)=
N
t
X
i
=1
E
ˇ

[
r
i
t
+1
+

max
a
i
t
+1
Q
(
s
i
t
+1
;a
i
t
+1
)]
(5.13)
Ideally,wewouldliketodirectlylearnthecentralizedactionvaluefunction
Q
c
whileit's
computationalintractabletoexploreandoptimizethe
Q
c
inthecasewehavesubstantially
largeactionspace.Therefore,weneedtoleveragetheaveragedlongtermrewardofeach
agenttoapproximatethemaximizationofthecentralizedaction-valuefunction
Q
c
.IncDQN,
weapproximatethisallocationbyavoidingthegreedyallocationwith


greedystrategyeven
duringtheevaluationstage.IncA2C,thepolicywillallocatetheagentsinthesamelocation
toitsnearbylocationswithcertainprobabilityaccordingtothestate-values.Infact,we
usesthisempiricalstrategytobetteralignthejointactionsofeachindividualagentwiththe
actionfromoptimalreallocation.However,bothofthecA2CandcDQNtrytocoordinate
agentsfromalocalizedview,inwhicheachagentonlyconsideritsnearbysituationwhen
theyarecoordinating.Therefore,theredundantreallocationstillexistsinthosetwomethods.
Othermethodsthatcanapproximatethecentralizedaction-valuefunctionsuchasVDN[
185
]
andQMIX[160]arenotabletoscaletolargenumberofagents.
Inthiswork,weproposetoapproximatethecentralizedpolicybyformulatingthe
reallocationasalinearprogrammingproblem.
max
y
(
s
t
)

v
(
s
t
)
T
A
t

c
T
t

y
(
s
t
)


k
D
(
o
t
+1

A
t
y
(
s
t
))
k
2
2
(5.14)
s.t.
y
(
s
t
)

0
B
t
y
(
s
t
)=
d
t
146
wherethevector
y
(
s
t
)
2
R
N
r
(
t
)

1
denotesthefeasiblerepositionsforallagentsatcurrent
timestep
t
.Eachelementin
y
(
s
t
)
representsonerepositionfromcurrentgridtoitsnearbygrid.
N
r
(
t
)
isthetotalnumberoffeasiblerepositiondirection.Thenumberoffeasiblerepositions
dependsonthecurrentstatevaluesineachgridsincewereallocateagentsfromlocationwith
lowerstatevaluetothegridwithhigherstatevalue.
A
2
R
N

N
r
(
t
)
isaindicatormatrix
thatdenotestheallocationsthatdispatchdriversintothegrid,i.e.
A
i;j
2f
0
;
1
g
.
A
i;j
=1
meansthe
j
-threpositionreallocatesagentsintothe
i
-thgrid.Similarly,
B
2
R
N

N
r
(
t
)
istheindicatormatrixthatdenotestheallocationsthatdispatchdriversoutofthegrid.
D
2f
0
;
1
g
N

N
istheadjacencymatrixdenotestheconnectivityofthegridworld.
o
t
+1
denotestheestimatednumberofordersineachgridatnexttimestep.
c
t
2
R
N
r
(
t
)

1
denotes
thecostassociatedwitheachrepositionand
s
(
s
t
)
2
R
N

1
denotesthestatevalueforeach
gridintimestep
t
.
The˝rstterminEq(5.14)approximatesourgoalthatwewanttomaximizethelong
termrewardoftheplatform.Sincethestatevaluecanbeinterpretedastheaveragedlong
termrewardoneagentwillreceiveifitappearsincertaingrid,the˝rsttermrepresentsthe
totalrewardminusthetotalcostassociatedwiththerepositions.However,optimizingthe
˝rsttermwillleadtoagreedysolutionthatreallocatesalltheagentstothenearbygrid
withhigheststatevalueminusthecost.Toalleviatethisgreedyreallocation,weaddthe
secondtermtoregularizethenumberofagentsreallocatedtoeachgrid.Sincetheagentin
currentgridcanpickuptheordersemergedinnearbygrids,weutilizetheadjacencymatrix
toregularizethenumberofagentsreallocatedintoagroupofnearbygridsshouldbecloseto
thenumberofordersemergedinagroupofnearbygrids.Fromanotherpointofview,the
secondtermmorefocusontheimmediaterewardsinceitpreferthesolutionthatallocates
rightamountofagentstopick-uptheorderswithoutconsiderthefutureincomethatan
147
agentcanreceivebythatreposition.Theregularizationparameter

isusedtobalancethe
longtermrewardandtheimmediatereward.Thetwo˛owconservationconstrainsrequires
thenumberofrepositionsshouldbepositiveandthenumberofrepositionsfromcurrentgrid
shouldbeequaltothenumberofavailableagentsincurrentgrids.
Ideally,weneedtosolveaintegerprogrammingproblemwhereoursolutionsatis˝es
y
(
s
t
)
2Z
N
r
.However,solvingintegerprogrammingisNP-hardinworstcasewhilesolving
itslinearprogrammingrelaxationisinP.Inpractice,wesolvethelinearprogramming
relaxationandroundthesolutionintointegers[49].
5.6SimulatorDesign
AfundamentalchallengeofapplyingRLalgorithminrealityisthelearningenvironment.
Unlikethestandardsupervisedlearningproblemswherethedataisstationarytothelearning
algorithmsandcanbeevaluatedbythetraining-testingparadigm,theinteractivenature
ofRLintroducesintricatedi˚cultiesontrainingandevaluation.Onecommonsolutionin
tra˚cstudiesistobuildsimulatorsfortheenvironment[
211
,
172
,
128
].Inthissection,we
introduceasimulatordesignthatmodelsthegenerationoforders,procedureofassigning
ordersandkeydriverbehaviorssuchasdistributionsacrossthecity,on-line/o˙-linestatus
controlintherealworld.ThesimulatorservesasthetrainingenvironmentforRLalgorithms,
aswellastheirevaluation.Moreimportantly,oursimulatorallowsustocalibratethekey
performanceindexwiththehistoricaldatacollectedfroma˛eetmanagementsystem,and
thusthepolicieslearnedarewellalignedwithreal-worldtra˚cs.
TheDataDescription
ThedataprovidedbyDidiChuxingincludesordersandtrajectories
ofvehiclesintwocitiesincludingChengduandWuhan.Chengduiscoveredbyahexagonal
148
gridsworldconsistingof504grids.Wuhancontainsmorethanonethousandsgrids.The
orderinformationincludesorderprice,origin,destinationandduration.Thetrajectories
containthepositions(latitudeandlongitude)andstatus(on-line,o˙-line,on-service)ofall
vehicleseveryfewseconds.
TimelineDesign.
Inonetimeinterval(10minutes),themainactivitiesareconducted
sequentially,alsoillustratedinFigure5.4.

Vehiclestatusupdates:
Vehicleswillbestochasticallyseto˜ine(i.e.,o˙fromservice)
oronline(i.e.,startworking)followingaspatiotemporaldistributionlearnedfromreal
datausingthemaximumlikelihoodestimation(MLE).Othertypesofvehiclestatus
updatesinclude˝nishingcurrentserviceorallocation.Inotherwords,ifavehicle
isaboutto˝nishitsserviceatthecurrenttimestep,orarrivingatthedispatched
grid,thevehiclesareavailablefortakingnewordersorbeingrepositionedtoanew
destination.

Ordergeneration:
Thenewordersgeneratedatthecurrenttimesteparebootstrapped
fromrealordersoccurredinthesametimeinterval.Sincetheorderwillnaturally
repositionvehiclesinawiderange,thisprocedurekeepstherepositionfromorders
similartotherealdata.

Interactwithagents:
Thisstepcomputesstateasinputto˛eetmanagementalgorithm
andappliestheallocationsforagents.

Orderassignments:
Allavailableordersareassignedthroughatwo-stageprocedure.
Inthe˝rststage,theordersinonegridareassignedtothevehiclesinthesamegrid.
Inthesecondstage,theremainingun˝lledordersareassignedtothevehiclesinits
neighboringgrids.Inreality,theplatformdispatchesordertoanearbyvehiclewithin
149
Figure5.3:ThesimulatorcalibrationintermsofGMV.TheredcurvesplottheGMVvalues
ofrealdataaveragedover7dayswithstandarddeviation,in10-minutetimegranularity.
Thebluecurvesaresimulatedresultsaveragedover7episodes.
acertaindistance,whichisapproximatelytherangecoveredbythecurrentgridand
itsadjacentgrids.Therefore,theabovetwo-stageprocedureisessentialtostimulate
thesereal-worldactivitiesandthefollowingcalibration.Thissettingdi˙erentiates
ourproblemfromtheprevious˛eetmanagementproblemsetting(i.e.,demandsare
servedbythoseresourcesatthesamelocationonly.)andmakeitimpossibletodirectly
applytheclassicmethodssuchasadaptivedynamicprogrammingapproachesproposed
in[64,65].
Calibration.
Thee˙ectivenessofthesimulatorisguaranteedbycalibrationagainstthereal
dataregardingthemostimportantperformancemeasurement:thegrossmerchandisevolume
(GMV).AsshowninFigure5.3,afterthecalibrationprocedure,theGMVinthesimulatoris
verysimilartothatfromtheride-sharingplatform.The
r
2
betweensimulatedGMVand
realGMVis
0
:
9331
andthePearsoncorrelationis
0
:
9853
with
p
-value
p<
0
:
00001
.
150
Figure5.4:Simulatortimelineinonetimestep(10minutes).
5.7Experiments
Inthissection,weconductextensiveexperimentstoevaluatethee˙ectivenessofourproposed
method.
5.7.1Experimentalsettings
Inthefollowingexperiments,bothoftrainingandevaluationareconductedonthesimulator
introducedinSec5.6.Forallthecompetingmethods,weprescribetwosetsofrandomseed
thatcontrolthedynamicsofthesimulatorfortrainingandevaluation,respectively.Examples
ofdynamicsinsimulatorincludeordergenerations,andstochasticallystatusupdateofall
vehicles.Inthissetting,wecantestthegeneralizationperformanceofalgorithmswhenit
encountersunseendynamicsasinrealscenarios.TheperformanceismeasuredbyGMV(the
totalvalueofordersservedinthesimulator)gainedbytheplatformoveroneepisode(144
timestepsinthesimulator),andorderresponserate(ORR),whichistheaveragednumber
ofordersserveddividedbythenumberofordersgenerated.Weusethe˝rst15episodesfor
trainingandconductevaluationonthefollowingtenepisodesforalllearningmethods.The
numberofavailablevehiclesateachtimeindi˙erentlocationsiscountedbyapre-dispatch
procedure.Thisprocedurerunsavirtualtwo-stageorderdispatchingprocesstocomputethe
remainingavailablevehiclesineachlocation.Onaverage,thesimulatorhas5356agentsper
timestepwaitingformanagement.Allthequantitativeresultsoflearningmethodspresented
inthissectionareaveragedoverthreeruns.
151
5.7.2Performancecomparison
Inthissubsection,theperformanceoffollowingmethodsareextensivelyevaluatedbythe
simulation.

Simulation:
Thisbaselinesimulatestherealscenariowithoutany˛eetmanagement.
ThesimulatedresultsarecalibratedwithrealdatainSec5.6.

Di˙usion:
Thismethoddi˙usesavailablevehiclestoneighboringgridsrandomly.

Rule-based:
Thisbaselinecomputesa
T

N
valuetable
V
rule
,whereeachelement
V
rule
(
t;j
)
representstheaveragedrewardofanagentstayingingrid
g
j
attimestep
t
.Therewardsareaveragedovertenepisodescontrolledbyrandomseedsthatare
di˙erentwithtestingepisodes.Withthevaluetable,theagentsamplesitsactionbased
ontheprobabilitymassfunctionnormalizedfromthevaluesofneighboringgridsat
thenexttimestep.Forexample,ifanagentlocatedin
g
1
attime
t
andthecurrent
validactionsare
[
g
1
;
g
2
]
and
[
g
1
;
g
1
]
,therule-basedmethodsampleitsactionsfrom
p
(
a
i
t
,
[
g
1
;
g
j
])=
V
rule
(
t
+1
;j
)
=
(
V
rule
(
t
+1
;
2)+
V
rule
(
t
+1
;
1))
;
8
j
=1
;
2
.

Value-Iter:
Itdynamicallyupdatesthevaluetablebasedonpolicyevaluation[
186
].
Theallocationpolicyiscomputedbasedonthenewvaluetable,thesameusedinthe
rule-basedmethod,whilethecollaborativecontextisconsidered.

T-
Q
learning
:Thestandardindependenttabular
Q
-learning[
186
]learnsatable
q
tabular
2
R
T

N

7
with

-greedypolicy.Inthiscasethestatereducestotimeandthe
locationoftheagent.

T-SARSA
:TheindependenttabularSARSA[
186
]learnsatable
q
sarsa
2
R
T

N

7
withsamesettingofstatesasT-
Q
learning.
152

DQN
:TheindependentDQNiscurrentlythestate-of-the-artasweintroducedin
Sec5.4.1.Our
Q
networkisparameterizedbyathree-layerELUs[
41
]andweadopt
the

-greedypolicyastheagentpolicy.The

isannealedlinearlyfrom0.5to0.1across
the˝rst15trainingepisodesand˝xedas

=0
:
1
duringthetesting.

cDQN
:ThecontextualDQNasweintroducedinSec5.4.2.The

isannealedthesame
asinDQN.Attheendofeachepisode,the
Q
-networkisupdatedover4000batches,
i.e.
M
1
=4000
inAlg5.2.Toensureavalidcontextmasking,theactivationfunction
oftheoutputlayerofthe
Q
-networkisReLU+1.

cA2C
:Thecontextualmulti-agentactor-criticasweintroducedinSec5.4.3.Atthe
endofeachepisode,boththepolicynetworkandthevaluenetworkareupdatedover
4000batches,i.e.
M
1
=
M
2
=4000
inAlg5.2.SimilartocDQN,Theoutputlayerof
thepolicynetworkusesReLU+1astheactivationfunctiontoensurethatallelements
intheoriginallogits
P
(
s
i
t
)
arepositive.

LP-cA2C
:Thecontextualmulti-agentactor-criticwithlinearprogrammingasintro-
ducedinSec5.5.Duringthetrainingstate,weusecA2Ctoexploretheenvironment
andlearnthestatevaluefunction.Duringtheevaluation,weconductthepolicygiven
bylinearprogramming.
Exceptforthe˝rstbaseline,thegeographiccontextisconsideredinallmethodssothat
theagentswillnotnavigatetotheinvalidgrid.Unlessotherspeci˝ed,thevaluefunction
approximationsandpolicynetworkincontextualalgorithmsareparameterizedbyathree-
layerReLU[
78
]withnodesizesof128,64and32,fromthe˝rstlayertothethirdlayer.The
batchsizeofalldeeplearningmethodsis˝xedas3000,andweuse
AdamOptimizer
witha
learningrateof
1
e

3
.SinceperformanceofDQNvariesalotwhentherearealargenumber
153
ofagents,the˝rstcolumnintheTable5.1forDQNisaveragedoverthebestthreerunsout
ofsixruns,andtheresultsforallothermethodsareaveragedoverthreeruns.Also,the
centralizedcriticsofcDQNandcA2Careinitializedfromapre-trainedvaluenetworkusing
thehistoricalmeanofordervaluescomputedfromtenepisodessimulation,withdi˙erent
randomseedsfrombothtrainingandevaluation.
Totesttherobustnessofproposedmethod,weevaluateallcompetingmethodsunder
di˙erentnumbersofinitialvehiclesaccrossdi˙erentcities.Theresultsaresummarizedin
Table5.1,5.2,5.3.Theresultsof
Di˙usion
improvedtheperformancealotinTable5.1,
possiblybecausethatthemethodsometimesencouragestheavailablevehiclestoleavethe
gridwithhighdensityofavailablevehicles,andthustheimbalancedsituationisalleviated.
However,inamorerealisticsettingthatweconsiderrepositioncost,thismethodcanlead
tonegativee˙ectiveduetothehighlyine˚cientreallocations.The
Rule-based
methodthat
repositionsvehiclestothegridswithahigherdemandvalue,improvestheperformance
ofrandomrepositions.The
Value-Iter
dynamicallyupdatesthevaluetableaccordingto
thecurrentpolicyappliedsothatitfurtherpromotestheperformanceupon
Rule-based
.
Comparingtheresultsof
Value-Iter
,
T-Qlearning
and
T-SARSA
,the˝rstmethodconsistently
outperformsthelattertwo,possiblybecausethattheusageofacentralizedvaluetableenables
coordinations,whichhelpstoavoidcon˛ictrepositions.Theabovemethodssimplifythestate
representationintoaspatial-temporalvaluerepresentation,whereastheDRLmethodsaccount
bothcomplexdynamicsofsupplyanddemandusingneuralnetworkfunctionapproximations.
AstheresultsshowninlastthreerowsofTable5.1,5.2,5.3,themethodswithdeeplearning
outperformsthepreviousone.Furthermore,thecontextualalgorithmslargelyoutperform
theindependentDQN(DQN),whichisthestate-of-the-artamonglarge-scalemulti-agent
DRLmethodandallothercompetingmethods.Lastbutnotleast,thelp-cA2Cacheivethe
154
Table5.1:PerformancecomparisonofcompetingmethodsintermsofGMVandorder
responseratewithoutrepositioncost.
100%
initialvehicles
90%
initialvehicles
10%
initialvehicles
NormalizedGMVORR
NormalizedGMVORR
NormalizedGMVORR
Simulation
100
:
00

0
:
6081
:
80%

0
:
37%
98
:
81

0
:
5080
:
64%

0
:
37%
92
:
78

0
:
7970
:
29%

0
:
64%
Di˙usion
105
:
68

0
:
6486
:
48%

0
:
54%
104
:
44

0
:
5784
:
93%

0
:
49%
99
:
00

0
:
5174
:
51%

0
:
28%
Rule-based
108
:
49

0
:
4090
:
19%

0
:
33%
107
:
38

0
:
5588
:
70%

0
:
48%
100
:
08

0
:
5075
:
58%

0
:
36%
Value-Iter
110
:
29

0
:
7090
:
14%

0
:
62%
109
:
50

0
:
6889
:
59%

0
:
69%
102
:
60

0
:
6177
:
17%

0
:
53%
T-Qlearning
108
:
78

0
:
5190
:
06%

0
:
38%
107
:
71

0
:
4289
:
11%

0
:
42%
100
:
07

0
:
5575
:
57%

0
:
40%
T-SARSA
109
:
12

0
:
4990
:
18%

0
:
38%
107
:
69

0
:
4988
:
68%

0
:
42%
99
:
83

0
:
5075
:
40%

0
:
44%
DQN
114
:
06

0
:
6693
:
01%

0
:
20%
113
:
19

0
:
6091
:
99%

0
:
30%
103
:
80

0
:
9677
:
03%

0
:
23%
cDQN
115
:
19

0
:
4694
:
77%

0
:
32%
114.29

0
:
66
94.00%

0
:
53%
105
:
29

0
:
7079
:
28%

0
:
58%
cA2C
115.27

0
:
70
94.99%

0
:
48%
113
:
85

0
:
6993
:
99%

0
:
47%
105.62

0
:
66
79.57%

0
:
51%
Table5.2:PerformancecomparisonofcompetingmethodsintermsofGMV,orderresponse
rate(ORR),andreturnoninvest(ROI)inXianconsideringrepositioncost.
100%
initialvehicles
90%
initialvehicles
10%
initialvehicles
NormalizedGMVORRROI
NormalizedGMVORRROI
NormalizedGMVORRROI
Simulation
100
:
00

0
:
6081
:
80%

0
:
37%
-
98
:
81

0
:
5080
:
64%

0
:
37%
-
92
:
78

0
:
7970
:
29%

0
:
64%
-
Di˙usion
103
:
02

0
:
4186
:
49%

0
:
42%
0.5890
102
:
35

0
:
5185
:
00%

0
:
47%
0.7856
97
:
41

0
:
5574
:
51%

0
:
46%
1.5600
Rule-based
106
:
21

0
:
4390
:
00%

0
:
43%
1.4868
105
:
30

0
:
4288
:
58%

0
:
37%
1.7983
99
:
37

0
:
3675
:
83%

0
:
48%
3.2829
Value-Iter
108
:
26

0
:
6590
:
28%

0
:
50%
2.0092
107
:
69

0
:
8289
:
53%

0
:
56%
2.5776
101
:
56

0
:
6577
:
11%

0
:
44%
4.5251
T-Qlearning
107
:
55

0
:
5890
:
12%

0
:
52%
2.9201
106
:
60

0
:
5289
:
17%

0
:
41%
4.2052
99
:
99

1
:
2875
:
97%

0
:
91%
5.2527
T-SARSA
107
:
73

0
:
4689
:
93%

0
:
34%
3.3881
106
:
88

0
:
4588
:
82%

0
:
37%
5.1559
99
:
11

0
:
4075
:
23%

0
:
35%
6.8805
DQN
110
:
81

0
:
6892
:
50%

0
:
50%
1.7811
110
:
16

0
:
6091
:
79%

0
:
29%
2.3790
103
:
40

0
:
5177
:
14%

0
:
26%
4.3770
cDQN
112
:
49

0
:
4294
:
88%

0
:
33%
2.2207
112
:
12

0
:
4094
:
17%

0
:
36%
2.7708
104
:
25

0
:
5579
:
41%

0
:
48%
4.8340
cA2C
112
:
70

0
:
6494
:
74%

0
:
57%
3.1062
112
:
05

0
:
4593
:
97%

0
:
37%
3.8085
104
:
19

0
:
7079
:
25%

0
:
68%
5.2124
LP-cA2C
113.60

0
:
56
95.27%

0
:
36%
4.4633
112.75

0
:
65
94.62%

0
:
47%
5.2719
105.37

0
:
58
80.15%

0
:
46%
7.2949
bestperformanceintermsofreturnoninvestment(thegmvgainperreallocation),GMV,
andorderresponserate.
5.7.3OntheE˚ciencyofReallocations
Inreality,eachrepositioncomeswithacost.Inthissubsection,weconsidersuchreposition
costsandestimatedthembyfuelcosts.Sincethetraveldistancefromonegridtoanother
isapproximately1.2kmandthefuelcostisaround0.5RMB/km,wesetthecostofeach
repositionas
c
=0
:
6
.Inthissetting,thede˝nitionofagent,state,actionandtransition
probabilityissameaswestatedinSec5.3.Theonlydi˙erenceisthattherepositioningcost
isincludedintherewardwhentheagentisrepositionedtodi˙erentlocations.Therefore,the
GMVofoneepisodeisthesumofallservedordervaluesubstractedbythetotalofreposition
costinoneepisode.Forexample,theobjectivefunctionforDQNnowincludesthereposition
155
Table5.3:PerformancecomparisonofcompetingmethodsintermsofGMV,orderresponse
rate(ORR),andreturnoninvest(ROI)inWuhanconsideringrepositioncost.
NormalizedGMVORRROI
Simulation
100
:
00

0
:
4876
:
56%

0
:
45%
-
Di˙usion
98
:
84

0
:
4480
:
07%

0
:
24%
-0.2181
Rule-based
103
:
84

0
:
6384
:
91%

0
:
25%
0.5980
Value-Iter
107
:
13

0
:
7085
:
06%

0
:
45%
1.6156
T-Qlearning
107
:
10

0
:
6185
:
28%

0
:
28%
1.8302
T-SARSA
107
:
14

0
:
6484
:
99%

0
:
28%
2.0993
DQN
108
:
45

0
:
6286
:
67%

0
:
33%
1.0747
cDQN
108
:
93

0
:
5789
:
03%

0
:
26%
1.1001
cA2C
113
:
31

0
:
5488
:
57%

0
:
45%
4.4163
LP-cA2C
114.92

0
:
65
89.29%

0
:
39%
6.1417
Table5.4:E˙ectivenessofcontextualmulti-agentactor-criticconsideringrepositioncosts.
NormalizedGMV
ORR
Repositions
DQN
110
:
81

0
:
68
92
:
50%

0
:
50%
606932
cDQN
112
:
49

0
:
42
94
:
88%

0
:
33%
562427
cA2C
112
:
70

0
:
64
94
:
74%

0
:
57%
408859
LP-cA2C
113
:
60

0
:
56
95
:
27%

0
:
36%
304752
costasfollows:
E

Q
(
s
i
t
;a
i
t
;

)


r
i
t
+1

c
+

max
a
i
t
+1
Q
(
s
i
t
+1
;a
i
t
+1
;

0

2
;
(5.15)
where
a
i
t
,
[
g
o
;
g
d
]
,andif
g
d
=
g
o
then
c
=0
,otherwise
c
=0
:
6
.Similarly,wecanconsider
thecostsincA2C.However,itishardtoapplythemtocDQNbecausethattheassumption,
thatdi˙erentactionsthatleadtothesamelocationshouldsharethesameactionvalue,
whichisnotheldinthissetting.Therefore,insteadofconsideringtherepositioncostinthe
objectivefunction,weonlyincorporatetherepositioncostwhenweactuallyconductour
policybasedoncDQN.Underthissetting,thelearningobjectiveofactionvalueofcDQNis
156
sameasinEq(5.3)whilethecontextembeddingischangedfromEq(5.4)tothefollowing:
[
C
t;
g
j
]
k
=
8
>
<
>
:
1
;
if
Q
(
s
t
;
g
i
)
>
=
Q
(
s
t
;
g
j
)+
c;
0
;
otherwise
:
(5.16)
ForLP-cA2C,thecoste˙ectisnaturallyincorporatedintheobjectivefunctionas
inEq(5.14).AstheresultsshowninTable5.4,theDQNtendstorepositionmoreagents
whilethecontextualalgorithmsachievebetterperformanceintermsofbothGMVand
orderresponserate,withlowercost.Moreimportantly,theLP-cA2Coutperformsother
methodsinbothoftheperformanceande˚ciency.Thereasonisthatthismethodformulate
thecoordinationamongagentsintoanoptimizationproblem,whichapproximatesthe
maximizationoftheplatform'slongtermrewardinacentralizedversion.Thecentralized
optimizationproblemcanavoidlotsofredundantreallocationscomparedtopreviousmethods.
Thetrainingproceduresandthenetworkarchitecturearethesameasdescribedintheprevious
section.
Tobemoreconcrete,wegiveaspeci˝cscenariotodemonstratethatthee˚ciencyof
LP-cA2C.Imagingwewouldliketoaskdriverstomovefromgrid
A
tonearbygrid
B
while
thereisagrid
C
thatisadjacenttobothgrid
A
and
B
.Inthepreviousalgorithms,since
theallocationisjointlygivenbyeachagent,it'sverylikelythatwereallocateagentsby
theshortpath
A
!
B
andlongerpath
A
!
C
!
B
whentherearesu˚cientamountof
agentscanarriveat
B
from
A
.Theseine˚cientreallocationscanbeavoidedbyLP-cA2C
naturallysincethelongerpathonlyincursahighercostwhichwillbethesuboptimalsolution
toourobjectivefunctioncomparedtothesolutiononlycontainsthe˝rstpath.Asshown
inFigure5.5(a),theallocationcomputedbycA2Ccontainsmany
triangle
repositionsas
denotedbytheblackcircle,whilewedidn'tobservetheseine˚cientallocationsinFigure5.5
157
(a)cA2C(b)LP-cA2C
Figure5.5:IllustrationofallocationsofcA2CandLP-cA2Cat18:40and19:40,respsectively.
(b).Therefore,theallocationpolicydeliveredbyLP-cA2Cismoree˚cientthanthosegiven
bypreviousalgorithms.
5.7.4Thee˙ectivenessofaveragedrewarddesign
Inmulti-agentRL,therewarddesignforeachagentisessentialforthesuccessoflearning.In
fullycooperativemulti-agentRL,therewardforallagentsisasingleglobalreward[
27
],while
itsu˙ersfromthecreditassignmentproblemforeachagent'saction.Splittingtherewardto
eachagentwillalleviatethisproblem.Inthissubsection,wecomparetwodi˙erentdesigns
fortherewardofeachagent:theaveragedrewardofagridasstatedinSec5.3andthetotal
rewardofagridthatdoesnotaverageonthenumberofavailablevehiclesatthattime.As
shownintable5.5,themethodswithaveragedreward(cA2C,cDQN)largelyoutperform
thoseusingtotalreward,sincethisdesignnaturallyencouragesthecoordinationsamong
agents.Usingtotalreward,ontheotherhand,islikelytorepositionanexcessivenumberof
agentstothelocationwithhighdemand.
158
Table5.5:E˙ectivenessofaveragedrewarddesign.
Proposedmethods
RawReward
NormalizedGMV/ORR
NormalizedGMV/ORR
cA2C
115
:
27

0
:
70
/
94
:
99%

0
:
48%
105
:
75

1
:
17
/
88
:
09%

0
:
74%
cDQN
115
:
19

0
:
46
/
94
:
77%

0
:
32%
108
:
00

0
:
35
/
89
:
53%

0
:
31%
(a)Withoutrepositioncost(b)Withrepositioncost
Figure5.6:ConvergencecomparisonofcA2Canditsvariationswithoutusingcontext
embeddinginbothsettings,withandwithoutrepositioncosts.TheX-axisisthenumberof
episodes.TheleftY-axisdenotesthenumberofcon˛ictsandtherightY-axisdenotesthe
normalizedGMVinoneepisode.
Table5.6:E˙ectivenessofcontextembedding.
NormalizedGMV/ORR
Repositions
Withoutrepositioncost
cA2C
115
:
27

0
:
70
/
94
:
99%

0
:
48%
460586
cA2C-v1
114
:
78

0
:
67
/
94
:
52%

0
:
49%
704568
cA2C-v2
111
:
39

1
:
65
/
92
:
12%

1
:
03%
846880
Withrepositioncost
cA2C
112
:
70

0
:
64
/
94
:
74%

0
:
57%
408859
cA2C-v3
110
:
43

1
:
16
/
93
:
79%

0
:
75%
593796
5.7.5Ablationsonpolicycontextembedding
Inthissubsection,weevaluatethee˙ectivenessofcontextembedding,includingexplicitly
coordinatingtheactionsofdi˙erentagentsthroughthecollaborativecontext,andeliminating
theinvalidactionswithgeographiccontext.Thefollowingvariationsofproposedmethods
areinvestigatedindi˙erentsettings.
159

cA2C-v1:ThisvariationdropscollaborativecontextofcA2Cinthesettingthatdoes
notconsiderrepositioncost.

cA2C-v2:ThisvariationdropsbothgeographicandcollaborativecontextofcA2Cin
thesettingthatdoesnotconsiderrepositioncost.

cA2C-v3:ThisvariationdropscollaborativecontextofcA2Cinthesettingthat
considersrepositioncost.
TheresultsofabovevariationsaresummarizedinTable5.6andFigure5.6.Asseenin
the˝rsttworowsofTable5.6andthered/bluecurvesinFigure5.6(a),inthesettingof
zerorepositioncost,cA2Cachievesthebestperformancewithmuchlessrepositions(
65
:
37%
)
comparingwithcA2C-v1.Furthermore,collaborativecontextembeddingachievessigni˝cant
advantageswhentherepositioncostisconsidered,asshowninthelasttworowsinTable5.6
andFigure5.6(b).Itnotonlygreatlyimprovestheperformancebutalsoacceleratesthe
convergence.Sincethecollaborativecontextlargelynarrowsdowntheactionspaceand
leadstoabetterpolicysolutioninthesenseofbothe˙ectivenessande˚ciency,wecan
concludethatcoordinationbasedoncollaborativecontextise˙ective.Also,comparingthe
performancesofcA2CandcA2C-v2(red/greencurvesinFigure5.6(a)),apparentlythe
policycontextembedding(consideringbothgeographicandcollaborativecontext)isessential
toperformance,whichgreatlyreducestheredundantpolicysearch.
5.7.6Ablationstudyongroupingthelocations
Thissectionstudiesthee˙ectivenessofourregularizationdesignforLP-cA2C.Onekey
di˙erencebetweenourworkandtraditional˛eetmanagementworks[
64
,
65
]isthatwedidn't
assumethedriversinonelocationcanonlypickuptheordersinthesamelocation.On
160
thecontrary,oneagentcanalsoservetheordersemergedinthenearbylocations,whichis
amorerealisticandcomplicatedsetting.Inthiscase,weregularizethenumberofagents
repositionedintoasetofnearbygridsclosetothenumberofestimatedordersatnexttime
step.ThisgroupingregularizationinEq(5.14)ismoree˚cientthantheregularizationin
Eq(5.17)requiringthenumberofagentsrepositionedintoeachgridisclosetothenumber
ofestimatedordersatthatgirdsincelotsofrepositioninsidethesamegroupcanbeavoided.
AstheresultsshowninTable5.7,usingthegroupregularizationinEq(5.14)reallocatesless
agentswhileachievessamebestperformanceastheoneinEq(5.17)(LP-cA2C').
max
y
(
s
t
)
(
v
(
s
t
)
T
A
t

c
T
t
)
y
(
s
t
)


(
o
t

A
t
y
(
s
t
))
2
(5.17)
Table5.7:E˙ectivenessofgroupregularizationdesign
NormalizedGMV
ORR
Repositions
ROI
LP-cA2C
113
:
56

0
:
61
95
:
24%

0
:
40%
341774
3
:
9663
LP-cA2C'
113
:
60

0
:
56
95
:
27%

0
:
36%
304752
4
:
4633
5.7.7Qualitativestudy
Inthissection,weanalyzewhetherthelearnedvaluefunctioncancapturethedemand-supply
relationaheadoftime,andtherationalityofallocations.Toseethis,wepresentacasestudy
ontheregionnearbytheairport.ThestatevalueandallocationpolicyisacquiredfromcA2C
thatwastrainedfortenepisodes.Wethenrunthewell-trainedcA2Cononetestingepisode,
andqualitativelyexamthestatevalueandallocationsundertheunseendynamics.Thesum
ofstatevaluesanddemand-supplygap(de˝nedasthenumberofordersminusthenumberof
vehicles)ofsevengridsthatcovertheCTUairportisvisualized.AsseeninFigure5.8,the
statevaluecancapturethefuturedramaticchangesofdemand-supplygap.Furthermore,the
161
spatialdistributionofstatevaluescanbeseeninFigure5.7.Afterthemidnight,theairport
hasalargenumberoforders,andlessavailablevehicles,andthereforethestatevaluesof
airportarehigherthanotherlocations.Duringthedaytime,morevehiclesareavailableat
theairportsothateachwillreceivelessrewardandthestatevaluesarelowerthanother
regions,asshowninFigure5.7(b).InFigure5.7andFigure5.8,wecanconcludethatthe
valuefunctioncanestimatetherelativeshiftofdemand-supplygapfrombothspatialand
temporalperspectives.ItiscrucialtotheperformanceofcA2Csincethecoordinationis
builtuponthestatevalues.Moreover,asillustratedbybluearrowsinFigure5.7,wesee
thattheallocationpolicygivesconsecutiveallocationsfromlowervaluegridstohighervalue
grids,whichcanthus˝llthefuturedemand-supplygapandincreasetheGMV.
(a)At01:50am.(b)At06:40pm.
Figure5.7:Illustrationontherepositionsnearbytheairportat1:50amand06:40pm.The
darkercolordenotesthehigherstatevalueandthebluearrowsdenotetherepositions.
5.8Conclusion
Inthischapter,we˝rstformulatethelarge-scale˛eetmanagementproblemintoafeasible
settingfordeepreinforcementlearning.Giventhissetting,weproposecontextualmulti-agent
reinforcementlearningframework,inwhichtwocontextualalgorithmscDQNandcA2Care
162
Figure5.8:Thenormalizedstatevalueanddemand-supplygapoveroneday.
developedandbothofthemachievethelargescaleagents'coordinationin˛eetmanagement
problem.cA2Cenjoysboth˛exibilityande˚ciencybycapitalizingacentralizedvalue
networkanddecentralizedpolicyexecutionembeddedwithcontextualinformation.Itis
abletoadapttodi˙erentactionspaceinanend-to-endtrainingparadigm.Asimulatoris
developedandcalibratedwiththerealdataprovidedbyDidiChuxing,whichservedasour
trainingandevaluationplatform.Extensiveempiricalstudiesunderdi˙erentsettingsin
simulatorhavedemonstratedthee˙ectivenessoftheproposedframework.
163
Chapter6
TheProvableAdvantageof
CollaborativeLearning
6.1Introduction
Federatedlearning(FL)isamachinelearningsettingwheremanyclients(e.g.,mobiledevices
ororganizations)collaborativelytrainamodelundertheorchestrationofacentralserver(e.g.,
serviceprovider),whilekeepingthetrainingdatadecentralized[
176
,
94
].Inrecentyears,FL
hasswiftlyemergedasanimportantlearningparadigm[
129
,
109
thatenjoyswidespread
successinapplicationssuchaspersonalizedrecommendation[
36
],virtualassistant[
106
],and
keyboardprediction[
77
],tonameaatleasttworeasons:First,therapidproliferation
ofsmartdevicesthatareequippedwithbothcomputingpoweranddata-capturingcapabilities
providedtheinfrastructurecoreforFL.Second,therisingawarenessofprivacyandthe
exponentialgrowthofcomputationalpower(blessedbyMoore'slaw)inmobiledeviceshave
madeitincreasinglyattractivetopushthecomputationtotheedge.
Despiteitspromiseandbroadapplicabilityinourcurrentera,thepotentialvalueFL
deliversiscoupledwiththeuniquechallengesitbringsforth.Inparticular,whenFLlearnsa
singlestatisticalmodelusingdatafromacrossallthedeviceswhilekeepingeachindividual
device'sdataisolated(andhenceprotectsprivacy)[
94
],itfacestwochallengesthatareabsent
164
incentralizedoptimizationanddistributed(stochastic)optimization[
231
,
178
,
99
,
113
,
205
,
213,206,92,219,218,98,104]:
1)
Dataheterogeneity:
datadistributionsindevicesaredi˙erent(anddatacan'tbe
shared);
2)
Systemheterogeneity:
onlyasubsetofdevicesmayaccessthecentralserverateach
timebothbecausethecommunicationsbandwidthpro˝lesvaryacrossdevicesandbecause
thereisnocentralserverthathascontroloverwhenadeviceisactive.
Toaddressthesechallenges,FederatedAveraging(FedAvg)[
129
]wasproposedasa
particularlye˙ectiveheuristic,whichhasenjoyedgreatempiricalsuccess[
77
].Thissuccess
hassincemotivatedagrowinglineofresearche˙ortsintounderstandingitstheoretical
convergenceguaranteesinvarioussettings.Forinstance,[
75
]analyzedFedAvg(fornon-
convexsmoothproblemssatisfyingPLconditions)undertheassumptionthateachlocal
device'sminimizeristhesameastheminimizerofthejointproblem(ifalldevices'datais
aggregatedtogether),anoverlyrestrictiveassumption.Veryrecently,[
110
]furtheredthe
progressandestablishedan
O
(
1
T
)
convergencerateforFedAvgforstronglyconvexsmooth
problems.Atthesametime,[
84
]studiedtheNesterovacceleratedFedAvgfornon-convex
smoothproblemsandestablishedan
O
(
1
p
T
)
convergenceratetostationarypoints.
However,despitetheseveryrecentfruitfulpioneeringe˙ortsintounderstandingthe
theoreticalconvergencepropertiesofFedAvg,itremainsopenastohowthenumberof
thenumberofdevicesthatparticipateinthethe
convergencespeed.Inparticular,dowegetlinearspeedupofFedAvg?Whataboutwhen
FedAvgisaccelerated?TheseaspectsarecurrentlyunexploredinFL.We˝llinthegapshere
byprovidinga˚rmativeanswers.
OurContributions
WeprovideacomprehensiveconvergenceanalysisofFedAvgandits
165
Table6.1:
ConvergenceresultsforFedAvgandacceleratedFedAvg.Throughoutthepaper,
N
is
thetotalnumberoflocaldevices,and
K

N
isthemaximalnumberofdevicesthatareaccessible
tothecentralserver.
T
isthetotalnumberofstochasticupdatesperformedbyeachlocaldevice,
E
isthelocalstepsbetweentwoconsecutiveservercommunications(andhence
T=E
isthenumberof
communications).
y
Inthelinearregressionsetting,wehave

=

1
forFedAvgand

=
p

1
~

for
acceleratedFedAvg,where

1
and
p

1
~

areconditionnumbersde˝nedinSection6.5.Since

1

~

,
thisimpliesaspeedupfactorof
p

1
~

foracceleratedFedAvg.
h
h
h
h
h
h
h
h
h
h
h
h
h
Participation
Objectivefunction
StronglyConvex
Convex
Overparameterized
Overparameterized
generalcase
linearregression
Full
O
(
1
NT
+
E
2
T
2
)
O

1
p
NT
+
NE
2
T

O
(exp(

NT
E
1
))
O
(exp(

NT
E
))
y
Partial
O

E
2
KT
+
E
2
T
2

O

E
2
p
KT
+
KE
2
T

O
(exp(

KT
E
1
))
O
(exp(

KT
E
))
y
acceleratedvariantsinthepresenceofbothdataandsystemheterogeneity.Ourcontributions
arethreefold.
First,weestablishan
O
(1
=KT
)
convergencerateunderFedAvgforstronglyconvexand
smoothproblemsandan
O
(1
=
p
KT
)
convergencerateforconvexandsmoothproblems
(where
K
isthenumberofparticipatingdevices),therebyestablishingthatFedAvgenjoys
thedesirablelinearspeeduppropertyintheFLsetup.Priortoourworkhere,thebest
andthemostrelatedconvergenceanalysisisgivenby[
110
],whichestablishedan
O
(
1
T
)
convergencerateforstronglyconvexsmoothproblemsunderFedAvg.Ourratematchesthe
same(andoptimal)dependenceon
T
,butalsocompletesthepicturebyestablishingthe
lineardependenceon
K
.
Second,weestablishthesameconvergence
O
(1
=KT
)
forstronglyconvexandsmooth
problemsand
O
(1
=
p
KT
)
forconvexandsmoothproblemsforNesterovacceleratedFedAvg.
WeanalyzetheacceleratedversionofFedAvgherebecauseempiricallyittendstoperform
better;yet,itstheoreticalconvergenceguaranteeisunknown.Tothebestofourknowledge,
thesearethe˝rstresultsthatprovidealinearspeedupcharacterizationofNesterovaccelerated
FedAvginthosetwoproblemclasses(thatFedAvgandNesterovacceleratedFedAvgshare
thesameconvergencerateistobeexpected:thisisthecaseevenforcentralizedstochastic
166
optimization).
Third,westudyasubclassofstronglyconvexsmoothproblemswheretheobjectiveis
over-parameterizedandestablishafaster
O
(
exp
(

KT

))
convergencerateforFedAvg.Within
thisclass,wefurtherconsiderthelinearregressionproblemandestablishanevensharper
rateunderFedAvg.Inaddition,weproposeanewvariantofacceleratedFedA
acceleratedFedAestablishafasterconvergencerate(comparedtoifnoacceleration
isused).Thisstandsincontrasttogeneric(strongly)convexstochasticproblemswhere
theoreticallynorateimprovementisobtainedwhenoneacceleratesFedAvg.Thedetailed
convergenceresultsaresummarizedinTable6.1.
6.2Setup
Inthischapter,westudythefollowingfederatedlearningproblem:
min
w
ˆ
F
(
w
)
,
X
N
k
=1
p
k
F
k
(
w
)
˙
;
(6.1)
where
N
isthenumberoflocaldevices(users/nodes/workers)and
p
k
isthe
k
-thdevice's
weightsatisfying
p
k

0
and
P
N
k
=1
p
k
=1
.Inthe
k
-thlocaldevice,thereare
n
k
datapoints:
x
1
k
;
x
2
k
;:::;
x
n
k
k
.Thelocalobjective
F
k
(

)
isde˝nedas:
F
k
(
w
)
,
1
n
k
P
n
k
j
=1
`

w
;
x
j
k

,where
`
denotesauser-speci˝edlossfunction.Eachdeviceonlyhasaccesstoitslocaldata,which
givesrisetoitsownlocalobjective
F
k
.Notethatwedonotmakeanyassumptionsonthe
datadistributionsofeachlocaldevice.Thelocalminimum
F

k
=
min
w
2
R
d
F
k
(
w
)
canbefar
fromtheglobalminimumofEq(6.1).
167
6.2.1TheFederatedAveraging(FedAvg)Algorithm
We˝rstintroducethestandardFederatedAveraging(FedAvg)algorithm[
129
].FedAvg
updatesthemodelineachdevicebylocalStochasticGradientDescent(SGD)andsends
thelatestmodeltothecentralserverevery
E
steps.Thecentralserverconductsaweighted
averageoverthemodelparametersreceivedfromactivedevicesandbroadcaststhelatest
averagedmodeltoalldevices.Formally,theupdatesofFedAvgatround
t
isdescribedas
follows:
v
k
t
+1
=
w
k
t


t
g
t;k
;
w
k
t
+1
=
8
>
<
>
:
v
k
t
+1
if
t
+1
=
2I
E
;
P
k
2S
t
+1
v
k
t
+1
if
t
+1
2I
E
;
where
w
k
t
isthelocalmodelparametermaintainedinthe
k
-thdeviceatthe
t
-thiteration,
g
t;k
:=
r
F
k
(
w
k
t
;˘
k
t
)
isthestochasticgradientbasedon
˘
k
t
,thedatasampledfrom
k
-th
device'slocaldatauniformlyatrandom.
I
E
=
f
E;
2
E;:::
g
isthesetofglobalcommunication
steps.Weuse
S
t
+1
torepresentthesetofactivedevicesat
t
+1
.
Sincefederatedlearningusuallyinvolvesanenormousamountoflocaldevices,itisoften
morerealistictoassumeonlyasubsetoflocaldevicesisactiveateachcommunicationround
(systemheterogeneity).Inthiswork,weconsiderboththecaseof
fullparticipation
where
themodelisaveragingoveralldevicesatthecommunicationround,i.e.,
w
k
t
+1
=
P
N
k
=1
p
k
v
k
t
+1
,
andthecaseof
partialparticipation
where
jS
t
+1
j
<N
.Withpartialparticipation,
S
t
+1
isobtainedbytwotypesofsamplingschemestosimulatepracticalscenarios[
110
].For
example,oneschemeestablishes
S
t
+1
by
i.i.d.
samplingthedeviceswithprobability
p
k
withreplacement.BothschemesguaranteethatgradientupdatesinFedAvgareunbiased
stochasticversionsofupdatesinFedAvgwithfullparticipation.Formoredetailsonthe
168
notationsandsetup,pleaserefertoSectionBintheappendix.
6.2.2Assumptions
Wemakethefollowingstandardassumptionsontheobjectivefunction
F
1
;:::;F
N
.Assump-
tions7and8arecommonlysatis˝edbyarangeofpopularobjectivefunctions,suchas
`
2
-regularizedlogisticregressionandcross-entropylossfunctions.
Assumption7
(L-smooth)
.
F
1
;

;F
N
areall
L
-smooth:forall
v
and
w
,
F
k
(
v
)

F
k
(
w
)+(
v

w
)
T
r
F
k
(
w
)+
L
2
k
v

w
k
2
2
.
Assumption8
(Strongly-convex)
.
F
1
;

;F
N
areall

-stronglyconvex:forallvand
w
;F
k
(
v
)

F
k
(
w
)+(
v

w
)
T
r
F
k
(
w
)+

2
k
v

w
k
2
2
Assumption9
(Boundedlocalvariance)
.
Let
˘
k
t
besampledfromthe
k
-thdevice'slocal
datauniformlyatrandom.Thevarianceofstochasticgradientsineachdeviceisbounded:
E


r
F
k

w
k
t
;˘
k
t

r
F
k

w
k
t


2

˙
2
k
,for
k
=1
;

;N
andany
w
k
t
.Let
˙
2
=
P
N
k
=1
p
k
˙
2
k
.
Assumption10
(Boundedlocalgradient)
.
Theexpectedsquarednormofstochasticgradients
isuniformlybounded.i.e.,
E


r
F
k

w
k
t
;˘
k
t


2

G
2
,forall
k
=1
;:::;N
and
t
=0
;:::;T

1
.
Assumptions9and10havebeenmadeinmanypreviousworksinfederatedlearning,
e.g.[
219
,
110
,
178
].Weprovidefurtherjusti˝cationfortheirgenerality.Asmodelaveragepa-
rametersbecomecloserto
w

,the
L
-smoothnesspropertyimpliesthat
E
kr
F
k
(
w
k
t
;˘
k
t
)
k
2
and
E
kr
F
k
(
w
k
t
;˘
k
t
)
r
F
k
(
w
k
t
)
k
2
approach
E
kr
F
k
(
w

;˘
k
t
)
k
2
and
E
kr
F
k
(
w

;˘
k
t
)
r
F
k
(
w

)
k
2
.
Therefore,thereisnosubstantialdi˙erencebetweentheseassumptionsandassumingthe
169
boundsat
w

only.Furthermore,comparedtoassuming
boundedgradientdiversity
asin
relatedwork[
75
,
109
],Assumption10ismuchlessrestrictive.Whentheoptimalitygap
convergestozero,boundedgradientdiversityrestrictslocalobjectivestohavethesame
minimizerastheglobalobjective,contradictingtheheterogeneousdatasetting.Fordetailed
discussionsofourassumptions,pleaserefertoAppendixSectionB.
6.3LinearSpeedupAnalysisofFedAvg
Inthissection,weprovideconvergenceanalysesofFedAvgforconvexobjectivesinthegeneral
settingwithbothheterogeneousdataandpartialparticipation.Weshowthatforstrongly
convexandsmoothobjectives,theconvergenceoftheoptimalitygapofaveragedparameters
acrossdevicesis
O
(1
=NT
)
,whileforconvexandsmoothobjectives,therateis
O
(1
=
p
NT
)
.
DetailedproofsaredeferredtoAppendixSectionB.
6.3.1StronglyConvexandSmoothObjectives
We˝rstshowthatFedAvghasan
O
(1
=NT
)
convergenceratefor

-stronglyconvexand
L
-smoothobjectives.Theresultimprovesonthe
O
(1
=T
)
rateof[
110
]withalinearspeedup
inthenumberofdevices
N
.Moreover,itimpliesadistinctionincommunicatione˚ciency
thatguaranteesthislinearspeedupforFedAvgwithfullandpartialdeviceparticipation.
Withfullparticipation,
E
canbechosenaslargeas
O
(
p
T=N
)
withoutdegradingthelinear
speedupinthenumberofworkers.Ontheotherhand,withpartialparticipation,
E
mustbe
O
(1)
toguarantee
O
(1
=NT
)
convergence.
Theorem7.
Let
w
T
=
P
N
k
=1
p
k
w
k
T
,

max
=
max
k
Np
k
,andsetdecayinglearningrates

t
=
1
4

(

+
t
)
with

=
max
f
32
E
g
and

=
L

.ThenunderAssumptions7to10withfull
170
deviceparticipation,
E
F
(
w
T
)

F

=
O


2
max
˙
2

NT
+

2
E
2
G
2

T
2

;
andwithpartialdeviceparticipationwithatmost
K
sampleddevicesateachcommunication
round,
E
F
(
w
T
)

F

=
O


2
G
2

KT
+

2
max
˙
2

NT
+

2
E
2
G
2

T
2

:
Linearspeedup.
We˝rstcompareourboundwiththatin[
110
],whichis
O
(
1
NT
+
E
2
KT
+
E
2
G
2
T
)
.Becausetheterm
E
2
G
2
T
isalso
O
(1
=T
)
withoutadependenceon
N
,foranychoice
of
E
theirboundcannotachievelinearspeedup.Theimprovementofourboundcomesfrom
theterm

2
E
2
G
2

T
2
,whichnowis
O
(
E
2
=T
2
)
.Asaresult,allleadingtermsscalewith
1
=N
inthefulldeviceparticipationsetting,andwith
1
=K
inthepartialparticipationsetting.
Thisimpliesthatinbothsettings,thereisa
linearspeedup
inthenumberofactiveworkers
duringacommunicationround.Wealsoemphasizethatthereasononecannotrecoverthe
fullparticipationboundbysetting
K
=
N
inthepartialparticipationboundisduetothe
variancegeneratedbysamplingwhichdependson
E
.
CommunicationComplexity.
Ourboundimpliesadistinctioninthechoiceof
E
between
thefullandpartialparticipationsettings.Withfullparticipationthereislinearspeedup
O
(1
=NT
)
aslongas
E
=
O
(
p
T=N
)
sincethen
O
(
E
2
=T
2
)=
O
(1
=NT
)
matchestheleading
term.Thiscorrespondstoacommunicationcomplexityof
T=E
=
O
(
p
NT
)
.Incontrast,
theboundin[
110
]doesnotallow
E
toscalewith
p
T
topreserve
O
(1
=T
)
rate,evenforfull
participation.Ontheotherhand,withpartialparticipation,

2
G
2

KT
isalsoaleadingterm,
171
andso
E
mustbe
O
(1)
.Inthiscase,ourboundstillyieldsalinearspeedupin
K
,which
isalsocon˝rmedbyexperiments.Therequirement
E
=
O
(1)
inpartialparticipationlikely
cannotberemovedforoursamplingschemes,asthesamplingvarianceis

E
2
=T
2
)
andthe
dependenceon
E
istight.
Comparisonwithrelatedwork.
Tobetterunderstandthesigni˝canceoftheobtained
bound,wecompareourratestothebest-knownresultsinrelatedsettings.[
75
]provesa
linearspeedup
O
(1
=NT
)
resultforstronglyconvexandsmoothobjectives,with
O
(
N
1
=
3
T
2
=
3
)
communicationcomplexitywith
i.i.d.
dataandpartialparticipation.However,theirresults
buildontheboundedgradientdiversityassumption,whichimpliestheexistenceof
w

that
minimizesalllocalobjectives(seediscussionsinSection6.2.2),e˙ectivelyremovingsystem
heterogeneity.Theboundin[
104
]matchesourboundinthefullparticipationcase,buttheir
frameworkexcludespartialparticipation[104,Proposition1].
6.3.2ConvexSmoothObjectives
NextweprovidelinearspeedupanalysesofFedAvgwithconvexandsmoothobjectivesand
showthattheoptimalitygapis
O
(1
=
p
NT
)
.Thisresultcomplementsthestronglyconvex
caseinthepreviouspart,aswellasthenon-convexsmoothsettingin[
92
,
219
,
75
],where
O
(1
=
p
NT
)
resultsaregivenintermsofaveragedgradientnorm.
Theorem8.
Underassumptions7,9,10andconstantlearningrate

t
=
O
(
q
N
T
)
,
min
t

T
F
(
w
t
)

F
(
w

)=
O


2
max
˙
2
p
NT
+
NE
2
LG
2
T

withfullparticipation,andwithpartialdeviceparticipationwith
K
sampleddevicesateach
172
communicationroundandlearningrate

t
=
O
(
q
K
T
)
,
min
t

T
F
(
w
t
)

F
(
w

)=
O


2
max
˙
2
p
KT
+
E
2
G
2
p
KT
+
KE
2
LG
2
T

:
Choiceof
E
andlinearspeedup.
Withfullparticipation,aslongas
E
=
O
(
T
1
=
4
=N
3
=
4
)
,
theconvergencerateis
O
(1
=
p
NT
)
with
O
(
N
3
=
4
T
3
=
4
)
communicationrounds.Inthepartial
participationsetting,
E
mustbe
O
(1)
inordertoachievelinearspeedupof
O
(1
=
p
KT
)
.Our
resultagaindemonstratesthedi˙erenceincommunicationcomplexitiesbetweenfulland
partialparticipation,andistoourknowledgethe˝rstresultonlinearspeedupinthegeneral
federatedlearningsettingwithbothheterogeneousdataandpartialparticipationforconvex
objectives.
Thevalidrangeof
N
or
K
forlinearspeedup.
Givenspeci˝cvaluesof
E;T
,and
otherconstantsinthebound,wecansolveanoptimal
N
or
K
,whichcanserveasavalid
rangeofthenumberofdevicesforlinearspeedup.Forexample,withpartialparticipation,the
optimalnumberofparticipateddevicesis
K
opt
=
O

p
T
(

max

2
+
G
2
)
=G
2
L

2
=
3
,neglecting
otherconstantcoe˚cients.Sinceincreasingthevalueof
N
/
K
largerthan
K
opt
willnotbring
anybene˝tfortheconvergence,fromtheperspectiveofincreasingthenumberofdevicesto
improveconvergence,thevalidrangeoflinearspeedupis
[1
;K
opt
]
.Ontheotherhand,as
longasthenumberofdevicessatis˝es
K
=
O
(
T
1
=
3
)
,thelinearspeedupisguaranteed.
173
6.4LinearSpeedupAnalysisofNesterovAcceleratedFe-
dAvg
AnaturalextensionoftheFedAvgalgorithmistousemomentum-basedlocalupdatesinstead
oflocalSGDupdates.Toourknowledge,theonlyconvergenceanalysesofFedAvgwith
momentum-basedstochasticupdatesfocusonthenon-convexsmoothcase[
84
,
218
,
109
].In
thissection,wecompletethepicturewith
O
(1
=NT
)
and
O
(1
=
p
NT
)
convergenceresultsfor
Nesterov-acceleratedFedAvgforconvexobjectivesthatmatchtheratesfromtheprevious
section.Asweknowfromstochasticoptimization,Nesterovandothermomentumupdates
mayfailtoaccelerateoverSGD[
119
,
100
,
122
,
221
].ThereforeinSection6.5wewillspecialize
tooverparameterizedproblemswherewedemonstratethataparticularFedAvgvariantwith
momentumupdatesisabletoaccelerateovertheoriginalFedAvgalgorithm.Detailedproofs
ofconvergenceresultsinthissectionaredeferredtoAppendixSectionB.
6.4.1StronglyConvexandSmoothObjectives
TheNesterovAcceleratedFedAvgalgorithmfollowstheupdates:
v
k
t
+1
=
w
k
t


t
g
t;k
;
w
k
t
+1
=
8
>
>
>
<
>
>
>
:
v
k
t
+1
+

t
(
v
k
t
+1

v
k
t
)
if
t
+1
=
2I
E
;
P
k
2S
t
+1
h
v
k
t
+1
+

t
(
v
k
t
+1

v
k
t
)
i
if
t
+1
2I
E
;
where
g
t;k
:=
r
F
k
(
w
k
t
;˘
k
t
)
isthestochasticgradientsampledonthe
k
-thdeviceattime
t
.
Theorem9.
Let
v
T
=
P
N
k
=1
p
k
v
k
T
andsetlearningrates

t

1
=
3
14(
t
+

)(1

6
t
+

)max
f

1
g
,
174

t
=
6

1
t
+

.ThenunderAssumptions7,8,9,10withfulldeviceparticipation,
E
F
(
v
T
)

F

=
O


2
max
˙
2

NT
+

2
E
2
G
2

T
2

;
andwithpartialdeviceparticipationwith
K
sampleddevicesateachcommunicationround,
E
F
(
v
T
)

F

=
O


2
max
˙
2

NT
+

2
G
2

KT
+

2
E
2
G
2

T
2

:
Toourknowledge,thisisthe˝rstconvergenceresultforNesterovacceleratedFedAvg
inthestronglyconvexandsmoothsetting.Thesamediscussionaboutlinearspeedupof
FedAvgappliestotheNesterovacceleratedvariant.Inparticular,toachieve
O
(1
=NT
)
linear
speedup,
T
iterationsofthealgorithmrequireonly
O
(
p
NT
)
communicationroundswith
fullparticipation.
6.4.2ConvexSmoothObjectives
WenowshowthattheoptimalitygapofNesterovAcceleratedFedAvghas
O
(1
=
p
NT
)
rate.
Thisresultcomplementsthestronglyconvexcaseinthepreviouspart,aswellasthenon-
convexsmoothsettingin[
84
,
218
,
109
],whereasimilar
O
(1
=
p
NT
)
rateisgivenintermsof
averagedgradientnorm.
Theorem10.
Setlearningrates

t
=

t
=
O
(
q
N
T
)
.ThenunderAssumptions7,9,10
NesterovacceleratedFedAvgwithfulldeviceparticipationhasrate
min
t

T
F
(
v
t
)

F

=
O


2
max
˙
2
p
NT
+
NE
2
LG
2
T

;
175
andwithpartialdeviceparticipationwith
K
sampleddevicesateachcommunicationround,
min
t

T
F
(
v
t
)

F

=
O


2
max
˙
2
p
KT
+
E
2
G
2
p
KT
+
KE
2
LG
2
T

:
ItispossibletoextendtheresultsinthissectiontoacceleratedFedAvgalgorithmswith
othermomentum-basedupdates.However,inthestochasticoptimizationsetting,noneof
thesemethodscanachieveabetterratethantheoriginalFedAvgwithSGDupdatesfor
generalproblems[
100
].Forthisreason,wewillinsteadturntotheoverparameterizedsetting
[
127
,
119
,
30
]inthenextsectionwhereweshowthatFedAvgenjoysgeometricconvergence
anditispossibletoimproveitsconvergenceratewithmomentum-basedupdates.
6.5GeometricConvergenceofFedAvgintheOverparam-
eterizedSetting
Overparameterizationisaprevalentmachinelearningsettingwherethestatisticalmodelhas
muchmoreparametersthanthenumberoftrainingsamplesandtheexistenceofparameter
choiceswithzerotraininglossisensured[
5
,
224
].Duetothepropertyof
automaticvariance
reduction
inoverparameterization,alineofrecentworksprovedthatSGDandaccelerated
methodsachievegeometricconvergence[
127
,
134
,
138
,
168
,
181
].Anaturalquestionis
whethersucharesultstillholdsinthefederatedlearningsetting.Inthissection,weprovide
the˝rstgeometricconvergencerateofFedAvgfortheoverparameterizedstronglyconvex
andsmoothproblems,andshowthatitpreserveslinearspeedupatthesametime.Wethen
sharpenthisresultinthespecialcaseoflinearregression.Inspiredbyrecentadvancesin
acceleratingSGD[
123
,
87
],wefurtherproposeanovelmomentum-basedFedAvgalgorithm,
176
whichenjoysanimprovedconvergencerateoverFedAvg.Detailedproofsaredeferredto
AppendixSectionB.Inparticular,wedonotneedAssumptions9and10andusemodi˝ed
versionsofAssumptions7and8detailedinthissection.
6.5.1GeometricConvergenceofFedAvgintheOverparameterized
Setting
RecalltheFLproblem
min
w
P
N
k
=1
p
k
F
k
(
w
)
with
F
k
(
w
)=
1
n
k
P
n
k
j
=1
`
(
w
;
x
j
k
)
.Inthissection,
weconsiderthestandardEmpiricalRiskMinimization(ERM)settingwhere
`
isnon-negative,
l
-smooth,andconvex,andasbefore,each
F
k
(
w
)
is
L
-smoothand

-stronglyconvex.Notethat
l

L
.Thissetupincludesmanyimportantproblemsinpractice.Intheoverparameterized
setting,thereexists
w

2
argmin
w
P
N
k
=1
p
k
F
k
(
w
)
suchthat
`
(
w

;
x
j
k
)=0
forall
x
j
k
.We
˝rstshowthatFedAvgachievesgeometricconvergencewithlinearspeedupinthenumberof
workers.
Theorem11.
Intheoverparameterizedsetting,FedAvgwithcommunicationevery
E
itera-
tionsandconstantstepsize

=
O
(
1
E
N
l
max
+
L
(
N


min
)
)
hasgeometricconvergence:
E
F
(
w
T
)

L
2
(1


)
T
k
w
0

w

k
2
=
O

L
exp


E
NT
l
max
+
L
(
N


min
)

k
w
0

w

k
2

:
LinearspeedupandCommunicationComplexity
Thelinearspeedupfactorison
theorderof
O
(
N=E
)
for
N
O
(
l
L
)
,i.e.FedAvgwith
N
workersandcommunicationevery
E
iterationsprovidesageometricconvergencespeedupfactorof
O
(
N=E
)
,for
N
O
(
l
L
)
.
When
N
isabovethisthreshold,however,thespeedupisalmostconstantinthenumberof
workers.Thismatchesthe˝ndingsin[
127
].Ourresultalsoillustratesthat
E
canbetaken
O
(
T

)
forany
<
1
toachievegeometricconvergence,achievingbettercommunication
177
e˚ciencythanthestandardFLsetting.
6.5.2OverparameterizedLinearRegressionProblems
WenowturntoquadraticproblemsandshowthattheboundinTheorem11canbeimproved
to
O
(
exp
(

N
E
1
t
))
foralargerrangeof
N
.WethenproposeavariantofFedAvgthathas
provableaccelerationoverFedAvgwithSGDupdates.Thelocaldeviceobjectivesarenow
givenbythesumofsquares
F
k
(
w
)=
1
2
n
k
P
n
k
j
=1
(
w
T
x
j
k

z
j
k
)
2
,andthereexists
w

suchthat
F
(
w

)

0
.Twonotionsofconditionnumberareimportantinourresults:

1
whichisbased
onlocalHessians,and
~

,whichistermedthestatisticalconditionnumber[
119
,
87
].Fortheir
detailedde˝nitions,pleaserefertoAppendixSectionB.Hereweusethefact
~


1
.Recall

max
=max
k
p
k
N
and

min
=min
k
p
k
N
.
Theorem12.
Fortheoverparamterizedlinearregressionproblem,FedAvgwithcommuni-
cationevery
E
iterationswithconstantstepsize

=
O
(
1
E
N
l
max
+

(
N


min
)
)
hasgeometric
convergence:
E
F
(
w
T
)
O

L
exp(

NT
E
(

max

1
+(
N


min
))
)
k
w
0

w

k
2

:
When
N
=
O
(

1
)
,theconvergencerateis
O
((1

N
E
1
)
T
)=
O
(
exp
(

NT
E
1
))
,which
exhibitslinearspeedupinthenumberofworkers,aswellasa
1

1
dependenceonthe
conditionnumber

1
.Inspiredby[
119
],weproposethe
MaSSacceleratedFedAvg
algorithm
(FedMaSS):
w
k
t
+1
=
8
>
>
>
<
>
>
>
:
u
k
t


k
1
g
t;k
if
t
+1
=
2I
E
;
P
k
2S
t
+1
h
u
k
t


k
1
g
t;k
i
if
t
+1
2I
E
;
178
u
k
t
+1
=
w
k
t
+1
+

k
(
w
k
t
+1

w
k
t
)+

k
2
g
t;k
:
When

k
2

0
,thisalgorithmreducestotheNesterovacceleratedFedAvgalgorithm.Inthe
nexttheorem,wedemonstratethatFedMaSSimprovestheconvergenceto
O
(
exp
(

NT
E
p

1
~

))
.
Toourknowledge,thisisthe˝rstaccelerationresultofFedAvgwithmomentumupdates
overSGDupdates.
Theorem13.
Fortheoverparamterizedlinearregressionproblem,FedMaSSwithcom-
municationevery
E
iterationsandconstantstepsizes

1
=
O
(
1
E
N
l
max
+

(
N


min
)
)
;

2
=

1
(1

1
~

)
1+
1
p

1
~

;

=
1

1
p

1
~

1+
1
p

1
~

hasgeometricconvergence:
E
F
(
w
T
)
O

L
exp(

NT
E
(

max
p

1
~

+(
N


min
))
)
k
w
0

w

k
2

:
SpeedupofFedMaSSoverFedAvg
Tobetterunderstandthesigni˝canceofthe
aboveresult,webrie˛ydiscussrelatedworksonacceleratingSGD.NesterovandHeavyBall
updatesareknowntofailtoaccelerateoverSGDinboththeoverparameterizedandconvex
settings[
119
,
100
,
122
,
221
].Thusingeneralonecannothopetoobtainaccelerationresults
fortheFedAvgalgorithmwithNesterovandHeavyBallupdates.Luckily,recentworks
inSGD[
87
,
119
]introducedanadditionalcompensationtermtotheNesterovupdatesto
addressthenon-accelerationissue.Surprisingly,weshowthesameapproachcane˙ectively
improvetherateofFedAvg.ComparingtheconvergencerateofFedMass(Theorem13)and
FedAvg(Theorem12),when
N
=
O
(
p

1
~

)
,theconvergencerateis
O
((1

N
E
p

1
~

)
T
)=
O
(
exp
(

NT
E
p

1
~

))
asopposedto
O
(
exp
(

NT
E
1
))
.Since

1

~

,thisimpliesaspeedupfactor
of
q

1
~

forFedMaSS.Ontheotherhand,thesamelinearspeedupinthenumberofworkers
179
holdsfor
N
inasmallerrangeofvalues.
6.6NumericalExperiments
Inthissection,weempiricallyexaminethelinearspeedupconvergenceofFedAvgandNesterov
acceleratedFedAvginvarioussettings,includingstronglyconvexfunction,convexsmooth
function,andoverparameterizedobjectives,asanalyzedinprevioussections.
Setup.
Followingtheexperimentalsettingin[
178
],weconductexperimentsonboth
syntheticdatasetsandreal-worlddatasetw8a[
155
]
(
d
=300
;n
=49749)
.Weconsiderthe
distributedobjectives
F
(
w
)=
P
N
k
=1
p
k
F
k
(
w
)
,andtheobjectivefunctiononthe
k
-thlocal
deviceincludesthreecases:1)
Stronglyconvexobjective
:theregularizedbinarylogistic
regressionproblem,
F
k
(
w
)=
1
N
k
P
N
k
i
=1
log
(1+
exp
(

y
k
i
w
T
x
k
i
)+

2
k
w
k
2
.Theregularization
parameterissetto

=1
=n
ˇ
2
e

5
.2)
Convexsmoothobjective
:thebinarylogistic
regressionproblemwithoutregularization.3)
Overparameterizedsetting
:thelinear
regressionproblemwithoutaddingnoisetothelabel,
F
k
(
w
)=
1
N
k
P
N
k
i
=1
(
w
T
x
k
i
+
b

y
k
i
)
2
.
LinearspeedupofFedAvgandNesterovacceleratedFedAvg.
Toverifythelinear
speedupconvergenceasshowninTheorems78910,weevaluatethenumberofiterations
neededtoreach

-accuracyinthreeobjectives.Weinitializeallrunswith
w
0
=
0
d
and
measurethenumberofiterationstoreachthetargetaccuracy

.Foreachcon˝guration
(
E;K
)
,weextensivelysearchthelearningratefrom
min
(

0
;
nc
1+
t
)
,where

0
2f
0
:
1
;
0
:
12
;
1
;
32
g
accordingtodi˙erentproblemsand
c
cantakethevalues
c
=2
i
8
i
2
Z
.Astheresultsshown
inFigure6.1,thenumberofiterationsdecreasesasthenumberof(active)workersincreasing,
whichisconsistentforFedAvgandNesterovacceleratedFedAvgacrossallscenarios.For
additionalexperimentsontheimpactof
E
,detailedexperimentalsetup,andhyperparameter
180
(a)Stronglyconvexobjective(b)Convexsmoothobjective(c)Linearregression
Figure6.1:ThelinearspeedupofFedAvginfullparticipation,partialparticipation,andthe
linearspeedupofNesterovacceleratedFedAvg,respectively.
setting,pleaserefertotheAppendixSectionB.
181
Chapter7
Conclusion
Inthisdissertation,weconsideredtheproblemofcollaborativelearning,aimingto˝nd
e˙ectivewaystoleverageknowledgefrompeersfore˚cientlearningandbettergeneralization.
Tostart,weformallyde˝nedthecollaborativelearningproblemanddiscussedseveral
challengesweneedtoresolveunderthissystematicframework.The˝rstchallengewefocus
onisthe˛exibilityandinteractivemodel-drivencollaboration.Wepresentalgorithmsthat
capturehigh-orderinteractionsandinteractivelyincorporatethehumanexpertknowledge
toguidethecollaboration.Thentogeneralizethecollaborationtoheterogeneouslearning
agentsandheterogeneoustasks,weproposedata-drivencollaborativealgorithms,wherethe
learningagentstransferknowledgefromaselectiveanddynamicdataset.Inadditionto
thevariousformofcollaborations,wealsostudythescalabilityofcollaboration,wherewe
proposelinearprogrammingbasedcollaborativemulti-agentlearningalgorithminthecontext
ofalarge-scale˛eetmanagementapplication.Lastbutnotleast,theempiricalsuccessof
collaborativelearningmotivatesustodigintothereasonwhycollaborativelearningcanbe
bene˝cial.Weproviderigoroustheoreticalanalysisontheconvergenceimprovementwith
respecttotheincreasingnumberoflearningagents.
Therearevariousdomainsthatcanbene˝tfromcollaborativelearning,includingbut
notlimitedtomulti-taskmeta-learning,transferlearning,federatedlearning,multi-agent
reinforcementlearning,etc.Theresearchinthecommunityhasbeendevotedtopushing
182
thefrontierofeachdomainin-depth,whileseldomstudytheirintrinsicconnections,which
canbeessentialtowardsbuildingcollaborativemachineintelligence.Thereareemerging
researchestorevealtherelationsacrossdi˙erent˝eldssuchastheconnectionbetween
federatedlearningandmulti-tasklearning[
176
],federatedlearningandmeta-learning[
57
],
etc,whichcouldserveastheinitialsteptowardsbridgingthegapsacrossmultiple˝elds.One
centralmotivationofthisdissertationthatviewsthosedomainsasanintegratedframework
isthatthecollaborationshouldbeemphasizedasasigni˝cantlearningobjectiveinstead
ofanauxiliaryproduct,alongwithaccomplishingothergoals.Ourvisionisthattowards
thebuildingthemachineintelligencethatiscomparabletohumanintelligence,therigorous
understandingofcollaborativelearningisinevitable.
Moreconcretely,theremanyfuturedirectionsunderthegrantpictureofcollaborative
learning.Firstandforemost,onefundamentalquestioniswhattypeoftaskscanbelearned
collaboratively,Orwhencanweexpectcollaborativelearningbene˝ttheperformance
comparingtolearningindividually?Thisiscloselyrelatedtothenegativetransfer[
207
]and
andtaskinterference[
220
]inmulti-tasklearning.InChapter6,wequantifyasimpli˝ed
settinginsupervisedlearningwherethegradientvarianceacrossheterogeneoustasksare
bounded,whilethisisfarfromdesire.Inpractice,whatisthee˚cientandtestingstandard
beforeconsideringcollaboration?Anotherperspectiveofthinkingthisproblemisthatis
therealwaysexistsacollaborationstrategythatworksbetterthanindividuallearning?
Despitethelong-termgoalofcollaborativelearning,apromisingdirectionwouldbe
learningtocollaborate.Thecurrentcollaborationstrategiesaremostlyprede˝ned.We
manuallysetuptherulesofthecollaborationaccordingtocertaindomainknowledge.
Canweparameterizethecollaborationandlearntheintrinsicprincipleofcollaborative
learningthatisgeneralizable?Recently,wenoticeatrendofmeta-learningandAI-generating
183
algorithms[
146
,
59
],whilesimilare˙ortshaven'tbeenfoundincollaborativelearning.Human
caneasilygeneralizethestructureoforganizations,communicationprotocol,interaction
patternstosolvedi˙erenttasks.Todevelopcollaborativelearningsolutionsalongthis
directionwouldbeincrediblyvaluableforgeneratinghuman-likeintelligence.
184
APPENDICES
185
AppendixA
RankingPolicyGradient
DiscussionofExistingE˙ortsonConnectingReinforce-
mentLearningtoSupervisedLearning.
Therearetwomaindistinctionsbetweensupervisedlearningandreinforcementlearning.In
supervisedlearning,thedatadistribution
D
isstaticandtrainingsamplesareassumedtobe
sampled
i.i.d.
from
D
.Onthecontrary,thedatadistributionisdynamicinreinforcement
learningandthesamplingprocedureisnotindependent.First,sincethedatadistribution
inRLisdeterminedbybothenvironmentdynamicsandthelearningpolicy,andthepolicy
keepsbeingupdatedduringthelearningprocess.Thisupdatedpolicyresultsindynamicdata
distributioninreinforcementlearning.Second,policylearningdependsonpreviouslycollected
samples,whichinturndeterminesthesamplingprobabilityofincomingdata.Therefore,the
trainingsampleswecollectedarenotindependentlydistributed.Theseintrinsicdi˚culties
ofreinforcementlearningdirectlycausethesample-ine˚cientandunstableperformanceof
currentalgorithms.
Ontheotherhand,moststate-of-the-artreinforcementlearningalgorithmscanbeshown
tohaveasupervisedlearningequivalent.Toseethis,recallthatmostreinforcementlearning
algorithmseventuallyacquirethepolicyeitherexplicitlyorimplicitly,whichisamapping
fromastatetoanactionoraprobabilitydistributionovertheactionspace.Theuseofsuch
186
amappingimpliesthatultimatelythereexistsasupervisedlearningequivalenttotheoriginal
reinforcementlearningproblem,ifoptimalpoliciesexist.Theparadoxisthatitisalmost
impossibletoconstructthissupervisedlearningequivalentonthe˛y,withoutknowingany
optimalpolicy.
Althoughthequestionofhowtoconstructandapplypropersupervisionisstillanopen
probleminthecommunity,therearemanyexistinge˙ortsprovidinginsightfulapproachesto
reducereinforcementlearningintoitssupervisedlearningcounterpartoverthepastseveral
decades.Roughly,wecanclassifytheexistinge˙ortsintothefollowingcategories:

Expectation-Maximization(EM)
:[45,152,102,1],etc.

Entropy-RegularizedRL(ERL)
:[144,145,74],etc.

InteractiveImitationLearning(IIL)
:[44,188,163,165,184],etc.
TheearlyapproachesintheEMtrackappliedJensen'sinequalityandapproximation
techniquestotransformthereinforcementlearningobjective.Algorithmsarethenderived
fromthetransformedobjective,whichresembletheExpectation-Maximizationprocedure
andprovidepolicyimprovementguarantee[
45
].Theseapproachestypicallyfocusona
simpli˝edRLsetting,suchasassumingthattherewardfunctionisnotassociatedwiththe
state[
45
],approximatingthegoaltomaximizetheexpectedimmediaterewardandthestate
distributionisassumedtobe˝xed[
153
].Lateronin[
102
],theauthorsextendedtheEM
frameworkfromtargetingimmediaterewardintoepisodicreturn.Recently,[
1
]usedthe
EM-frameworkonarelativeentropyobjective,whichaddsaparameterpriorasregularization.
Ithasbeenfoundthattheestimationstepusing
Retrace
[
135
]canbeunstableevenwitha
linearfunctionapproximation[
197
].Ingeneral,theestimationstepinEM-basedalgorithms
involveson-policyevaluation,whichisonechallengesharedamongpolicygradientmethods.
187
Ontheotherhand,o˙-policylearningusuallyleadstoamuchbettersamplee˚ciency,and
isonemainmotivationthatwewanttoreformulateRLintoasupervisedlearningtask.
Toachieveo˙-policylearning,PGQ[
144
]connectedtheentropy-regularizedpolicygradient
withQ-learningundertheconstraintofsmallregularization.Inthesimilarframework,Soft
Actor-Critic[
74
]wasproposedtoenablesample-e˚cientandfasterconvergenceunder
theframeworkofentropy-regularizedRL.Itisabletoconvergetotheoptimalpolicythat
optimizesthelong-termrewardalongwithpolicyentropy.Itisane˚cientwaytomodel
thesuboptimalbehaviorandempiricallyitisabletolearnareasonablepolicy.Although
recentlythediscrepancybetweentheentropy-regularizedobjectiveandoriginallong-term
rewardhasbeendiscussedin[
143
,
56
],theyfocusonlearningstochasticpolicywhilethe
proposedframeworkisfeasibleforbothlearningdeterministicoptimalpolicy(Corollary1)
andstochasticoptimalpolicy(Corollary2).In[
145
],thisworksharessimilaritytoour
workintermsofthemethodwecollectingthesamples.Theycollectgoodsamplesbased
onthepastexperienceandthenconducttheimitationlearningw.r.tthosegoodsamples.
However,wedi˙erentiateathowdowelookattheproblemtheoretically.Thisself-imitation
learningprocedurewaseventuallyconnectedtolower-bound-soft-Q-learning,whichbelongs
toentropy-regularizedreinforcementlearning.Wecommentthatthereisatrade-o˙between
sample-e˚ciencyandmodelingsuboptimalbehaviors.Themorestrictrequirementwehave
onthesamplescollectedwehavelesschancetohitthesampleswhilewearemorecloseto
imitatingtheoptimalbehavior.
Fromthetrackofinteractiveimitationlearning,earlye˙ortssuchas[
163
,
165
]pointed
outthatthemaindiscrepancybetweenimitationlearningandreinforcementlearningisthe
violationof
i.i.d.
assumption.
SMILe
[
163
]and
DAgger
[
165
]areproposedtoovercomethe
distributionmismatch.Theorem2.1in[
163
]quanti˝edtheperformancedegradationfromthe
188
TableA.1:AcomparisonofstudiesreducingRLtoSL.The
Objective
columndenoteswhether
thegoalistomaximizelong-termreward.The
Cont.Action
columndenoteswhetherthe
methodisapplicabletobothcontinuousanddiscreteactionspaces.The
Optimality
denotes
whetherthealgorithmscanmodeltheoptimalpolicy.
X
y
denotestheoptimalityachieved
byERLisw.r.t.theentropyregularizeobjectiveinsteadoftheoriginalobjectiveonreturn.
The
O˙-Policy
columndenotesifthealgorithmsenableo˙-policylearning.The
NoOracle
columndenotesifthealgorithmsneedtoaccesstoacertaintypeoforacle(expertpolicyor
expertdemonstrations).
Methods
Objective
Cont.Action
Optimality
O˙-Policy
NoOracle
EM
X
X
X
7
X
ERL
7
X
X
y
X
X
IIL
X
X
X
X
7
RPG
X
7
X
X
X
expertconsideringthatthelearnedpolicyfailstoimitatetheexpertwithacertainprobability.
Thetheoremseemstoresemblethelong-termperformancetheorem(Thm.5)inthischapter.
However,itstudiedthescenariothatthelearningpolicyistrainedthroughastatedistribution
inducedbytheexpert,insteadofstate-actiondistributionasconsideredinTheorem5.As
such,Theorem2.1in[
163
]maybemoreapplicabletothesituationwhereaninteractive
procedureisneeded,suchasqueryingtheexpertduringthetrainingprocess.Onthecontrary,
theproposedworkfocusesondirectlyapplyingsupervisedlearningwithouthavingaccessto
theexperttolabelthedata.Theoptimalstate-actionpairsarecollectedduringexploration
andconductingsupervisedlearningonthereplaybu˙erwillprovideaperformanceguarantee
intermsoflong-termexpectedreward.Concurrently,aresembleofTheorem2.1in[
163
]is
Theorem1in[
188
],wheretheauthorsreducedtheapprenticeshiplearningtoclassi˝cation,
undertheassumptionthattheapprenticepolicyisdeterministicandthemisclassi˝cation
rateisboundedatalltimesteps.Inthiswork,weshowthatitispossibletocircumvent
suchastrongassumptionandreduceRLtoitsSL.Furthermore,ourtheoreticalframework
alsoleadstoanalternativeanalysisofsample-complexity.Lateron
AggreVaTe
[
164
]
wasproposedtoincorporatetheinformationofactioncoststofacilitateimitationlearning,
189
anditsdi˙erentiableversion
AggreVaTeD
[
184
]wasdevelopedinsuccessionandachieved
impressiveempiricalresults.Recently,hingelosswasintroducedtoregular
Q
-learningasa
pre-trainingstepforlearningfromdemonstration[
81
],orasasurrogatelossforimitating
optimaltrajectories[
148
].Inthiswork,weshowthathingelossconstructsanewtypeof
policygradientmethodandcanbeusedtolearnoptimalpolicydirectly.
Inconclusion,ourmethodapproachestheproblemofreducingRLtoSLfromaunique
perspectivethatisdi˙erentfromallpriorwork.WithourreformulationfromRLtoSL,the
samplescollectedinthereplaybu˙ersatisfythe
i.i.d.
assumption,sincethestate-actionpairs
arenowsampledfromthedatadistributionofUNOP.Amulti-aspectcomparisonbetween
theproposedmethodandrelevantpriorstudiesissummarizedinTableA.1.
RankingPolicyGradientTheorem
TheRankingPolicyGradientTheorem(Theorem2)formulatestheoptimizationoflong-term
rewardusingarankingobjective.Theproofbelowillustratestheformulationprocess.
Proof.
Thefollowingproofisbasedondirectpolicydi˙erentiation[153,212].Foraconcise
presentation,thesubscript
t
foractionvalue

i
;
j
,and
p
ij
isomitted.
r

J
(

)=
r

X
˝
p

(
˝
)
r
(
˝
)
(A.1)
=
X
˝
p

(
˝
)
r

log
p

(
˝
)
r
(
˝
)
=
X
˝
p

(
˝
)
r

log

p
(
s
0

T
t
=1
ˇ

(
a
t
j
s
t
)
p
(
s
t
+1
j
s
t
;a
t
)

r
(
˝
)
=
X
˝
p

(
˝
)
X
T
t
=1
r

log
ˇ

(
a
t
j
s
t
)
r
(
˝
)
=
E
˝
˘
ˇ


X
T
t
=1
r

log
ˇ

(
a
t
j
s
t
)
r
(
˝
)

190
=
E
˝
˘
ˇ


X
T
t
=1
r

log

Y
m
j
=1
;j
6
=
i
p
ij

r
(
˝
)

=
E
˝
˘
ˇ

"
X
T
t
=1
r

X
m
j
=1
;j
6
=
i
log
 
e

ij
1+
e

ij
!
r
(
˝
)
#
=
E
˝
˘
ˇ


X
T
t
=1
r

X
m
j
=1
;j
6
=
i
log

1
1+
e

ji

r
(
˝
)

(A.2)
ˇ
E
˝
˘
ˇ


X
T
t
=1
r


X
m
j
=1
;j
6
=
i
(

i


j
)
=
2

r
(
˝
)

;
(A.3)
wherethetrajectoryisaseriesofstate-actionpairsfrom
t
=1
;:::;T
,
i:e:˝
=
s
1
;a
1
;s
2
;a
2
;:::;s
T
.
FromEq(A.2)toEq(A.3),weusethe˝rst-orderTaylorexpansionof
log
(1+
e
x
)
j
x
=0
=
log2+
1
2
x
+
O
(
x
2
)
tofurthersimplifytherankingpolicygradient.
ProbabilityDistributioninRankingPolicyGradient
Inthissection,wediscusstheoutputpropertyofthepairwiserankingpolicy.Weshowin
Corollary6thatthepairwiserankingpolicygivesavalidprobabilitydistributionwhenthe
dimensionoftheactionspace
m
=2
.Forcaseswhen
m>
2
andtherangeof
Q
-valuesatis˝es
Condition2,weshowinCorollary7howtoconstructavalidprobabilitydistribution.
Corollary6.
ThepairwiserankingpolicyasshowninEq(4.5)constructsaprobability
distributionoverthesetofactionswhentheactionspace
m
isequalto
2
,givenanyaction
values

i
;i
=1
;
2
.Forthecaseswith
m>
2
,thisconclusiondoesnotholdingeneral.
Itiseasytoverifythat
ˇ
(
a
i
j
s
)
>
0
,
P
2
i
=1
ˇ
(
a
i
j
s
)=1
holdsandthesameconclusion
cannotbeappliedto
m>
2
byconstructingcounterexamples.However,wecanintroduce
adummyaction
a
0
toformaprobabilitydistributionforRPG.Duringpolicylearning,the
algorithmincreasestheprobabilityofbestactionsandtheprobabilityofdummyaction
decreases.Ideally,ifRPGconvergestoanoptimaldeterministicpolicy,theprobabilityof
191
takingbestactionisequalto1and
ˇ
(
a
0
j
s
)=0
.Similarly,wecanintroduceadummy
trajectory
˝
0
withthetrajectoryreward
r
(
˝
0
)=0
and
p

(
˝
0
)=1

P
˝
p

(
˝
)
.Thetrajectory
probabilityformsaprobabilitydistributionsince
P
˝
p

(
˝
)+
p

(
˝
0
)=1
and
p

(
˝
)

0
8
˝
and
p

(
˝
0
)

0
.Theproofofavalidtrajectoryprobabilityissimilartothefollowingproof
on
ˇ
(
a
j
s
)
tobeavalidprobabilitydistributionwithadummyaction.Itspracticalin˛uence
isnegligiblesinceourgoalistoincreasetheprobabilityof(near)-optimaltrajectories.To
presentinaclearway,weavoidmentioningdummytrajectory
˝
0
inProofAwhileitcanbe
seamlesslyincluded.
Condition2
(Therangeofaction-value)
.
Werestricttherangeofaction-valuesinRPGso
thatitsatis˝es

m

ln
(
m
1
m

1

1)
,where

m
=
min
i;j

ji
and
m
isthedimensionofthe
actionspace.
Thisconditioncanbeeasilysatis˝edsinceinRPGweonlyfocusontherelativerelationship
of

andwecanconstraintherangeofaction-valuessothat

m
satis˝esthecondition2.
Furthermore,sincewecanseethat
m
1
m

1
>
1
isdecreasingw.r.ttoactiondimension
m
.
Thelargertheactiondimension,thelessconstraintwehaveontheactionvalues.
Corollary7.
GivenCondition2,weintroduceadummyaction
a
0
andset
ˇ
(
a
=
a
0
j
s
)=
1

P
i
ˇ
(
a
=
a
i
j
s
)
,whichconstructsavalidprobabilitydistribution
(
ˇ
(
a
j
s
))
overtheaction
space
A[
a
0
.
Proof.
Sincewehave
ˇ
(
a
=
a
i
j
s
)
>
0
8
i
=1
;:::;m
and
P
i
ˇ
(
a
=
a
i
j
s
)+
ˇ
(
a
=
a
0
j
s
)=1
.
Toprovethatthisisavalidprobabilitydistribution,weonlyneedtoshowthat
ˇ
(
a
=
a
0
j
s
)

0
;
8
m

2
,i.e.
P
i
ˇ
(
a
=
a
i
j
s
)

1
;
8
m

2
.Let

m
=min
i;j

ji
,
X
i
ˇ
(
a
=
a
i
j
s
)
192
=
X
i
Y
m
j
=1
;j
6
=
i
p
ij
=
X
i
Y
m
j
=1
;j
6
=
i
1
1+
e

ji

X
i
Y
m
j
=1
;j
6
=
i
1
1+
e

m
=
m

1
1+
e

m

m

1

1
(Condition2)
:
Thisthusconcludestheproof.
ConditionofPreservingOptimality
ThefollowingconditiondescribeswhattypesofMDPsaredirectlyapplicabletothetrajectory
rewardshaping(TRS,Def6):
Condition3
(InitialStates)
.
The(near)-optimaltrajectorieswillcoverallinitialstatesof
MDP.i.e.
f
s
(
˝;
1)
j8
˝
2Tg
=
f
s
(
˝;
1)
j8
˝
g
,where
T
=
f
˝
j
w
(
˝
)=1
g
=
f
˝
j
r
(
˝
)

c
g
.
TheMDPssatisfyingthisconditioncoverawiderangeoftaskssuchasDialogueSys-
tem[
111
],Go[
175
],videogames[
18
]andallMDPswithonlyoneinitialstate.Ifwewantto
preservetheoptimalitybyTRS,theoptimaltrajectoriesofaMDPneedtocoverallinitial
statesorequivalently,allinitialstatesmustleadtoatleastoneoptimaltrajectory.Similarly,
thenear-optimalityispreservedforallMDPsthatitsnear-optimaltrajectoriescoverall
initialstates.
Theoretically,itispossibletotransfermoregeneralMDPstosatisfyCondition3and
preservetheoptimalitywithpotential-basedrewardshaping[
139
].Moreconcretely,consider
thedeterministicbinarytreeMDP(
M
1
)withthesetofinitialstates
S
1
=
f
s
1
;s
0
1
g
asde˝ned
inFigureA.1.Thereareeightpossibletrajectoriesin
M
1
.Let
r
(
˝
1
)=10=
R
max
;r
(
˝
8
)=
193
FigureA.1:ThebinarytreestructureMDPwithtwoinitialstates.
3
;r
(
˝
i
)=2
;
8
i
=2
;:::;
7
.Therefore,thisMDPdoesnotsatisfyCondition3.Wecan
compensatethetrajectoryrewardofthebesttrajectorystartingfrom
s
0
1
tothe
R
max
by
shapingtherewardwiththepotential-basedfunction
˚
(
s
0
7
)=7
and
˚
(
s
)=0
;
8
s
6
=
s
0
7
.This
rewardshapingrequiresmorepriorknowledge,whichmaynotbefeasibleinpractice.Amore
realisticmethodistodesignadynamictrajectoryrewardshapingapproach.Inthebeginning,
weset
c
(
s
)=
min
s
2S
1
r
(
˝
j
s
(
˝;
1)=
s
)
;
8
s
2S
1
.Take
M
1
asanexample,
c
(
s
)=3
;
8
s
2S
1
.
Duringtheexplorationstage,wetrackthecurrentbesttrajectoryofeachinitialstateand
update
c
(
s
)
withitstrajectoryreward.
Nevertheless,iftheCondition3isnotsatis˝ed,weneedmoresophisticatedpriorknowledge
otherthanaprede˝nedtrajectoryrewardthreshold
c
toconstructthereplaybu˙er(training
datasetofUNOP).Thepracticalimplementationoftrajectoryrewardshapingandrigorously
theoreticalstudyforgeneralMDPsarebeyondthescopeofthiswork.
194
ProofofLong-termPerformanceTheorem5
Lemma3.
Givenaspeci˝ctrajectory
˝
,thelog-likelihoodofstate-actionpairsoverhorizon
T
isequaltotheweightedsumovertheentirestate-actionspace,i.e.:
1
T
X
T
t
=1
log
ˇ

(
a
t
j
s
t
)=
X
s;a
p
(
s;a
j
˝
)log
ˇ

(
a
j
s
)
;
wherethesumintherighthandsideisthesummationoverallpossiblestate-actionpairs.It
isworthnotingthat
p
(
s;a
j
˝
)
isnotrelatedtoanypolicyparameters.Itistheprobabilityofa
speci˝cstate-actionpair
(
s;a
)
inaspeci˝ctrajectory
˝
.
Proof.
Givenatrajectory
˝
=
f
(
s
(
˝;
1)
;a
(
˝;
1))
;:::;
(
s
(
˝;T
)
;a
(
˝;T
))
g
=
f
(
s
1
;a
1
)
;:::;
(
s
T
;a
T
)
g
,
denotetheuniquestate-actionpairsinthistrajectoryas
U
(
˝
)=
f
(
s
i
;a
i
)
g
n
i
=1
,where
n
is
thenumberofuniquestate-actionpairsin
˝
and
n

T
.Thenumberofoccurrencesof
astate-actionpair
(
s
i
;a
i
)
inthetrajectory
˝
isdenotedas
j
(
s
i
;a
i
)
j
.Thenwehavethe
following:
1
T
X
T
t
=1
log
ˇ

(
a
t
j
s
t
)
=
X
n
i
=1
j
(
s
i
;a
i
)
j
T
log
ˇ

(
a
i
j
s
i
)
=
X
n
i
=1
p
(
s
i
;a
i
j
˝
)log
ˇ

(
a
i
j
s
i
)
=
X
(
s;a
)
2
U
(
˝
)
p
(
s;a
j
˝
)log
ˇ

(
a
j
s
)
(A.4)
=
X
(
s;a
)
2
U
(
˝
)
p
(
s;a
j
˝
)log
ˇ

(
a
j
s
)+
X
(
s;a
)
=
2
U
(
˝
)
p
(
s;a
j
˝
)log
ˇ

(
a
j
s
)
(A.5)
=
X
(
s;a
)
p
(
s;a
j
˝
)log
ˇ

(
a
j
s
)
195
FromEq(A.4)toEq(A.5)weusedthefact:
X
(
s;a
)
2
U
(
˝
)
p
(
s;a
j
˝
)=
X
n
i
=1
p
(
s
i
;a
i
j
˝
)=
X
n
i
=1
j
(
s
i
;a
i
)
j
T
=1
;
andthereforewehave
p
(
s;a
j
˝
)=0
;
8
(
s;a
)
=
2
U
(
˝
)
.Thisthuscompletestheproof.
NowwearereadytoprovetheTheorem5:
Proof.
Thefollowingproofholdsforanarbitrarysubsetoftrajectories
T
determinedbythe
threshold
c
inDef8.The
ˇ

isassociatedwith
c
andthissubsetoftrajectories.Wepresent
thefollowinglowerboundoftheexpectedlong-termperformance:
argmax

X
˝
p

(
˝
)
w
(
˝
)
*
w
(
˝
)=0
;
if
˝=
2T
=argmax

1
jTj
X
˝
2T
p

(
˝
)
w
(
˝
)
useLemma5
*
p

(
˝
)
>
0
and
w
(
˝
)
>
0
;
)
X
˝
2T
p

(
˝
)
w
(
˝
)
>
0
=argmax

log

1
jTj
X
˝
2T
p

(
˝
)
w
(
˝
)

*
log

X
n
i
=1
x
i
=n


X
n
i
=1
log(
x
i
)
=n;
8
i;x
i
>
0
;
wehave:
log

1
jTj
X
˝
2T
p

(
˝
)
w
(
˝
)


X
˝
2T
1
jTj
log
p

(
˝
)
w
(
˝
)
;
wherethelowerboundholdswhen
p

(
˝
)
w
(
˝
)=
1
jTj
;
8
˝
2T
.Tothisend,wemaximizethe
lowerboundoftheexpectedlong-termperformance:
argmax

X
˝
2T
1
jTj
log
p

(
˝
)
w
(
˝
)
196
=argmax

X
˝
2T
log(
p
(
s
1
)
Y
T
t
=1
(
ˇ

(
a
t
j
s
t
)
p
(
s
t
+1
j
s
t
;a
t
))
w
(
˝
))
=argmax

X
˝
2T
log

p
(
s
1
)
Y
T
t
=1
ˇ

(
a
t
j
s
t
)
Y
T
t
=1
p
(
s
t
+1
j
s
t
;a
t
)
w
(
˝
)

=argmax

X
˝
2T

log
p
(
s
1
)+
X
T
t
=1
log
p
(
s
t
+1
j
s
t
;a
t
)+
X
T
t
=1
log
ˇ

(
a
t
j
s
t
)+log
w
(
˝
)

(A.6)
Theaboveshowsthat
w
(
˝
)
canbesetasanarbitrarypositiveconstant
=argmax

1
jTj
X
˝
2T
X
T
t
=1
log
Y

(
a
t
j
s
t
)
=argmax

1
jTj
T
X
˝
2T
X
T
t
=1
log
Y

(
a
t
j
s
t
)
(A.7)
=argmax

1
jTj
X
˝
2T
1
T
X
T
t
=1
log
ˇ

(
a
t
j
s
t
)
(theexistenceofUNOPinAssumption5)
=argmax

X
˝
2T
p
ˇ

(
˝
)
1
T

X
T
t
=1
log
ˇ

(
a
t
j
s
t
)

where
ˇ

isaUNOP(Def8)
)
p
ˇ

(
˝
)=0
8
˝=
2T
(A.8)
Eq(A.8)canbeestablishedbasedon
X
˝
2T
p
ˇ

(
˝
)=
X
˝
2T
1
=
jTj
=1
=argmax

X
˝
p
ˇ

(
˝
)
1
T

X
T
t
=1
log
ˇ

(
a
t
j
s
t
)

(Lemma3)
=argmax

X
˝
p
ˇ

(
˝
)
X
s;a
p
(
s;a
j
˝
)log
ˇ

(
a
j
s
)
The2ndsumisoverallpossiblestate-actionpairs.(A.9)
(
s;a
)
representsaspeci˝cstate-actionpair.
=argmax

X
˝
X
s;a
p
ˇ

(
˝
)
p
(
s;a
j
˝
)log
ˇ

(
a
j
s
)
=argmax

X
s;a
X
˝
p
ˇ

(
˝
)
p
(
s;a
j
˝
)log
ˇ

(
a
j
s
)
=argmax

X
s;a
p
ˇ

(
s;a
)log
ˇ

(
a
j
s
)
:
(A.10)
Inthisproofweuse
s
t
=
s
(
˝;t
)
and
a
t
=
a
(
˝;t
)
asabbreviations,whichdenotethe
t
-thstate
197
andactioninthetrajectory
˝
,respectively.
jTj
denotesthenumberoftrajectoriesin
T
.We
alsousethede˝nitionof
w
(
˝
)
toonlyfocusonnear-optimaltrajectories.Weset
w
(
˝
)=1
forsimplicitybutitwillnota˙ecttheconclusionifsettootherconstants.
Optimality:
Furthermore,theoptimalsolutionfortheobjectivefunctionEq(A.10)isa
uniformly(near)-optimalpolicy
ˇ

.
argmax

X
s;a
p
ˇ

(
s;a
)log
ˇ

(
a
j
s
)
=argmax

X
s
p
ˇ

(
s
)
X
a
ˇ

(
a
j
s
)log
ˇ

(
a
j
s
)
=argmax

X
s
p
ˇ

(
s
)
X
a
ˇ

(
a
j
s
)log
ˇ

(
a
j
s
)

X
s
p
ˇ

(
s
)
X
a
log
ˇ

(
a
j
s
)
=argmax

X
s
p
ˇ

(
s
)
X
a
ˇ

(
a
j
s
)log
ˇ

(
a
j
s
)
ˇ

(
a
j
s
)
=argmax

X
s
p
ˇ

(
s
)
X
a

KL
(
ˇ

(
a
j
s
)
jj
ˇ

(
a
j
s
))=
ˇ

Therefore,theoptimalsolutionofEq(A.10)isalsothe(near)-optimalsolutionforthe
originalRLproblemsince
P
˝
p
ˇ

(
˝
)
r
(
˝
)=
P
˝
2T
1
jTj
r
(
˝
)

c
=
R
max


.Theoptimal
solutionisobtainedwhenweset
c
=
R
max
.
Lemma4.
Givenanyoptimalpolicy
ˇ
ofMDPsatisfyingCondition3,
8
˝=
2T
,wehave
p
ˇ
(
˝
)=0
,where
T
denotesthesetofallpossibleoptimaltrajectoriesinthislemma.If
9
˝=
2T
,suchthat
p
ˇ
(
˝
)
>
0
,then
ˇ
isnotanoptimalpolicy.
Proof.
Weprovethisbycontradiction.Weassume
ˇ
isanoptimalpolicy.If
9
˝
0
=
2T
,such
that1)
p
ˇ
(
˝
0
)
6
=0
,orequivalently:
p
ˇ
(
˝
0
)
>
0
since
p
ˇ
(
˝
0
)
2
[1
;
0]
.and2)
˝
0
=
2T
.Wecan
˝ndabetterpolicy
ˇ
0
bysatisfyingthefollowingthreeconditions:
p
ˇ
0
(
˝
0
)=0
and
198
p
ˇ
0
(
˝
1
)=
p
ˇ
(
˝
1
)+
p
ˇ
(
˝
0
)
;˝
1
2T
and
p
ˇ
0
(
˝
)=
p
ˇ
(
˝
)
;
8
˝=
2f
˝
0
;˝
1
g
Since
p
ˇ
0
(
˝
)

0
;
8
˝
and
P
˝
p
ˇ
0
(
˝
)=1
,therefore
p
ˇ
0
constructsavalidprobabilitydistribu-
tion.Thentheexpectedlong-termperformanceof
ˇ
0
isgreaterthanthatof
ˇ
:
X
˝
p
ˇ
0
(
˝
)
w
(
˝
)

X
˝
p
ˇ
(
˝
)
w
(
˝
)
=
X
˝=
2f
˝
0
;˝
1
g
p
ˇ
0
(
˝
)
w
(
˝
)+
p
ˇ
0
(
˝
1
)
w
(
˝
1
)+
p
ˇ
0
(
˝
0
)
w
(
˝
0
)


X
˝=
2f
˝
0
;˝
1
g
p
ˇ
(
˝
)
w
(
˝
)+
p
ˇ
(
˝
1
)
w
(
˝
1
)+
p
ˇ
(
˝
0
)
w
(
˝
0
)

=
p
ˇ
0
(
˝
1
)
w
(
˝
1
)+
p
ˇ
0
(
˝
0
)
w
(
˝
0
)

(
p
ˇ
(
˝
1
)
w
(
˝
1
)+
p
ˇ
(
˝
0
)
w
(
˝
0
))
*
˝
0
=
2T
;
)
w
(
˝
0
)=0
and
˝
1
2T
;
)
w
(
˝
)=1
=
p
ˇ
0
(
˝
1
)

p
ˇ
(
˝
1
)
=
p
ˇ
(
˝
1
)+
p
ˇ
(
˝
0
)

p
ˇ
(
˝
1
)=
p
ˇ
(
˝
0
)
>
0
:
Essentially,wecan˝ndapolicy
ˇ
0
thathashigherprobabilityontheoptimaltrajectory
˝
1
andzeroprobabilityon
˝
0
.Thisindicatesthatitisabetterpolicythan
ˇ
.Therefore,
ˇ
is
notanoptimalpolicyanditcontradictsourassumption,whichprovesthatsuch
˝
0
doesnot
exist.Therefore,
8
˝=
2T
,wehave
p
ˇ
(
˝
)=0
.
Lemma5
(PolicyPerformance)
.
IfthepolicytakestheformasinEq(4.7)orEq(4.5),then
wehave
8
˝
,
p

(
˝
)
>
0
.Thismeansforallpossibletrajectoriesallowedbytheenvironment,
thepolicytakestheformofeitherrankingpolicyorsoftmaxwillgeneratethistrajectorywith
probability
p

(
˝
)
>
0
.Notethatbecauseofthisproperty,
ˇ

isnotanoptimalpolicyaccording
toLemma4,thoughitcanbearbitrarilyclosetoanoptimalpolicy.
199
Proof.
Thetrajectoryprobabilityisde˝nedas:
p
(
˝
)=
p
(
s
1

T
t
=1
(
ˇ

(
a
t
j
s
t
)
p
(
s
t
+1
j
s
t
;a
t
))
Thenwehave:
ThepolicytakestheformasinEq(4.7)orEq(4.5)
)
ˇ

(
a
t
j
s
t
)
>
0
:
p
(
s
1
)
>
0
;p
(
s
t
+1
j
s
t
;a
t
)
>
0
:
)
p

(
˝
)=0
:
p
(
s
t
+1
j
s
t
;a
t
)=0
or
p
(
s
1
)=0
;
)
p

(
˝
)=0
;
whichmeans
˝
isnotapossibletrajectory.
Insummary,forallpossibletrajectories,
p

(
˝
)
>
0
:
Thisthuscompletestheproof.
ProofofCorollary1
Corollary8
(Rankingperformancepolicygradient)
.
Thelowerboundofexpectedlong-term
performancebyrankingpolicycanbeapproximatelyoptimizedbythefollowingloss:
min

X
s;a
i
p
ˇ

(
s;a
i
)
L
(
s
i
;a
i
)
(A.11)
wherethepair-wiseloss
L
(
s
i
;a
i
)
isde˝nedas:
L
(
s;a
i
)=
X
j
A
j
j
=1
;j
6
=
i
max(0
;
1+

(
s;a
j
)


(
s;a
i
))
Proof.
InRPG,thepolicy
ˇ

(
a
j
s
)
isde˝nedasinEq(4.5).Wethenreplacetheaction
200
probabilitydistributioninEq(4.13)withtheRPGpolicy.
*
ˇ
(
a
=
a
i
j
s
)=
m
j
=1
;j
6
=
i
p
ij
(A.12)
BecauseRPGis˝ttingadeterministicoptimalpolicy,
wedenotetheoptimalactiongivensate
s
as
a
i
;
thenwehave
max

X
s;a
i
p
ˇ

(
s;a
i
)log
ˇ
(
a
i
j
s
)
(A.13)
=max

X
s;a
i
p
ˇ

(
s;a
i
)log
m
j
6
=
i;j
=1
p
ij
)
(A.14)
=max

X
s;a
i
p
ˇ

(
s;a
i
)log
m
j
6
=
i;j
=1
1
1+
e

ji
(A.15)
=min

X
s;a
i
p
ˇ

(
s;a
i
)
m
X
j
6
=
i;j
=1
log(1+
e

ji
)
˝rstorderTaylorexpansion(A.16)
ˇ
min

X
s;a
i
p
ˇ

(
s;a
i
)
m
X
j
6
=
i;j
=1

ji
s.t.
j

ij
j
=
c<
1
;
8
i;j;s
(A.17)
=min

X
s;a
i
p
ˇ

(
s;a
i
)
m
X
j
6
=
i;j
=1
(

j


i
)
s.t.
j

i


j
j
=
c<
1
;
8
i;j;s
(A.18)
)
min

X
s;a
i
p
ˇ

(
s;a
i
)
L
(
s
i
;a
i
)
(A.19)
wherethepairwiseloss
L
(
s;a
i
)
isde˝nedas:
L
(
s;a
i
)=
j
A
j
X
j
=1
;j
6
=
i
max(0
;
margin
+

(
s;a
j
)


(
s;a
i
))
;
(A.20)
wherethemargininEq(A.19)isasmallpositiveconstant.(A.21)
FromEq(A.18)toEq(A.19),weconsiderlearningadeterministicoptimalpolicy
a
i
=
ˇ

(
s
)
,
whereweuseindex
i
todenotetheoptimalactionateachstate.Theoptimal

-values
minimizingEq(A.18)(denotedby

1
)needtosatisfy

1
i
=

1
j
+
c;
8
j
6
=
i;s
.Theoptiaml

-
201
valuesminimizingEq(A.19)(denotedby

2
)needtosatisfy

2
i
=
max
j
6
=
i

2
j
+
margin
;
8
j
6
=
i;s
.Inbothcases,theoptimalpoliciesfromsolvingEq(A.18)andEq(A.18)arethe
same:
ˇ
(
s
)=
argmax
k

1
k
=
argmax
k

2
k
=
a
i
.Therefore,weuseEq(A.19)asasurrogate
optimizationproblemofEq(A.18).
Policygradientvariancereduction
Corollary9
(Variancereduction)
.
Givenastationarypolicy,theupperboundofthevariance
ofeachdimensionofpolicygradientis
O
(
T
2
C
2
R
2
max
)
.Theupperboundofgradientvariance
ofmaximizingthelowerboundoflong-termperformanceEq(4.13)is
O
(
C
2
)
,where
C
isthe
maximumnormofloggradientbasedonAssumption6.Thesupervisedlearninghasreduced
theupperboundofgradientvariancebyanorderof
O
(
T
2
R
2
max
)
ascomparedtotheregular
policygradient,considering
R
max

1
;T

1
,whichisaverycommonsituationinpractice.
Proof.
Theregularpolicygradientofpolicy
ˇ

isgivenas[212]:
X
˝
p

(
˝
)[
T
X
t
=1
r

log(
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
)))
r
(
˝
)]
Theregularpolicygradientvarianceofthe
i
-thdimensionisdenotedasfollows:
Var
0
@
T
X
t
=1
r

log(
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
))
i
)
r
(
˝
)
1
A
Wedenote
x
i
(
˝
)=
P
T
t
=1
r

log
(
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
))
i
)
r
(
˝
)
forconvenience.Therefore,
x
i
isa
202
randomvariable.Thenapply
var
(
x
)=
E
p

(
˝
)
[
x
2
]

E
p

(
˝
)
[
x
]
2
,wehave:
Var

X
T
t
=1
r

log(
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
))
i
)
r
(
˝
)

=
Var
(
x
i
(
˝
))
=
X
˝
p

(
˝
)
x
i
(
˝
)
2

[
X
˝
p

(
˝
)
x
i
(
˝
)]
2

X
˝
p

(
˝
)
x
i
(
˝
)
2
=
X
˝
p

(
˝
)[
X
T
t
=1
r

log(
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
))
i
)
r
(
˝
)]
2

X
˝
p

(
˝
)[
X
T
t
=1
r

log(
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
))
i
)]
2
R
2
max
=
R
2
max
X
˝
p

(
˝
)[
X
T
t
=1
X
T
k
=1
r

log(
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
))
i
)
r

log(
ˇ

(
a
(
˝;k
)
j
s
(
˝;k
)
i
)]
(Assumption6)

R
2
max
X
˝
p

(
˝
)[
T
X
t
=1
T
X
k
=1
C
2
]
=
R
2
max
X
˝
p

(
˝
)
T
2
C
2
=
T
2
C
2
R
2
max
Thepolicygradientoflong-termperformance(Def7):
P
s;a
p
ˇ

(
s;a
)
r

log
ˇ

(
a
j
s
)
.The
policygradientvarianceofthe
i
-thdimensionisdenotedas:
var
(
r

log
ˇ

(
a
j
s
)
i
)
.Thenthe
upperboundisgivenby
var
(
r

log
ˇ

(
a
j
s
)
i
)
=
X
s;a
p
ˇ

(
s;a
)[
r

log
ˇ

(
a
j
s
)
i
]
2

[
X
s;a
p
ˇ

(
s;a
)
r

log
ˇ

(
a
j
s
)
i
]
2

X
s;a
p
ˇ

(
s;a
)[
r

log
ˇ

(
a
j
s
)
i
]
2
(Assumption6)

X
s;a
p
ˇ

(
s;a
)
C
2
203
=
C
2
Thisthuscompletestheproof.
DiscussionsofAssumption5
Inthissection,weshowthatUNOPexistsinarangeofMDPs.Noticethatthelemma6
showsthesu˚cientconditionsofsatisfyingAsumption5ratherthannecessaryconditions.
Lemma6.
ForMDPsde˝nedinSection4.2.3satisfyingthefollowingconditions:

Eachinitialstateleadstooneoptimaltrajectory.Thisalsoindicates
jS
1
j
=
jTj
,where
T
denotesthesetofoptimaltrajectoriesinthislemma,
S
1
denotesthesetofinitial
states.

Deterministictransitions,i.e.,
p
(
s
0
j
s;a
)
2f
0
;
1
g
.

Uniforminitialstatedistribution,i.e.,
p
(
s
1
)=
1
jTj
;
8
s
1
2S
1
.
Thenwehave:
9
ˇ

,wheres.t.
p
ˇ

(
˝
)=
1
jTj
;
8
˝
2T
.Itmeansthatadeterministicuniformly
optimalpolicyalwaysexistsforthisMDP.
Proof.
Wecanprovethisbyconstruction.Thefollowinganalysisappliesforany
˝
2T
.
p
ˇ

(
˝
)=
1
jTj
()
log
p
ˇ

(
˝
)=

log
jTj
()
log
p
(
s
1
)+
X
T
t
=1
log
p
(
s
t
+1
j
s
t
;a
t
)+
X
T
t
=1
log
ˇ

(
a
t
j
s
t
)=

log
jTj
()
X
T
t
=1
log
ˇ

(
a
t
j
s
t
)=

log
p
(
s
1
)

X
T
t
=1
log
p
(
s
t
+1
j
s
t
;a
t
)

log
jTj
204
whereweuse
a
t
;s
t
asabbreviationsof
a
(
˝;t
)
;s
(
˝;t
)
:
Wedenote
D
(
˝
)=

log
p
(
s
1
)

X
T
t
=1
log
p
(
s
t
+1
j
s
t
;a
t
)
>
0
()
X
T
t
=1
log
ˇ

(
a
t
j
s
t
)=
D
(
˝
)

log
jTj
)
wecanobtainauniformlyoptimalpolicybysolvingthenonlinearprogramming:
X
T
t
=1
log
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
))=
D
(
˝
)

log
jTj8
˝
2T
(A.22)
log
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
))=0
;
8
˝
2T
;t
=1
;:::;T
(A.23)
X
m
i
=1
ˇ

(
a
i
j
s
(
˝;t
))=1
;
8
˝
2T
;t
=1
;:::;T
(A.24)
Usethecondition
p
(
s
1
)=
1
jTj
,thenwehave:
*
X
T
t
=1
log
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
))
(A.25)
=
X
T
t
=1
log1=0(
LHSof
Eq
(
A:
22))
(A.26)
*

log
p
(
s
1
)

X
T
t
=1
log
p
(
s
t
+1
j
s
t
;a
t
)

log
jTj
=log
jTj
0

log
jTj
=0
(A.27)
(
RHSof
Eq
(
A:
22))
(A.28)
)
D
(
˝
)

log
jTj
=
X
T
t
=1
log
ˇ

(
a
(
˝;t
)
j
s
(
˝;t
))
;
8
˝
2T
:
Alsothedeterministicoptimalpolicysatis˝estheconditionsinEq(A.23A.24).Therefore,
205
(a)(b)
FigureA.2:Thedirectedgraphthatdescribestheconditionalindependenceofpairwise
relationshipofactions,where
Q
1
denotesthereturnoftakingaction
a
1
atstate
s
,following
policy
ˇ
in
M
,i.e.,
Q
ˇ
M
(
s;a
1
)
.
I
1
;
2
isarandomvariablethatdenotesthepairwiserelationship
of
Q
1
and
Q
2
,i.e.,
I
1
;
2
=1
;
i
:
i
:
f
:Q
1

Q
2
;
o
:
w
:I
1
;
2
=0
.
thedeterministicoptimalpolicyisauniformlyoptimalpolicy.Thislemmadescribesone
typeofMDPinwhichUOPexists.Fromtheabovereasoning,wecanseethataslong
asthesystemofnon-linearequationsEq(A.22A.23A.24)hasasolution,theuniformly
(near)-optimalpolicyexists.
Lemma7
(Hitoptimaltrajectory)
.
Theprobabilitythataspeci˝coptimaltrajectorywasnot
encounteredgivenanarbitrarysoftmaxpolicy
ˇ

isexponentiallydecreasingwithrespecttothe
numberoftrainingepisodes.NomatteraMDPhasdeterministicorprobabilisticdynamics.
Proof.
Givenaspeci˝coptimaltrajectory
˝
=
f
s
(
˝;t
)
;a
(
˝;t
)
g
T
t
=1
,andanarbitrarystationary
policy
ˇ

,theprobabilitythathasneverencounteredatthe
n
-thepisodeis
[1

p

(
˝
)]
n
=
˘
n
,
basedonlemma5,wehave
p

(
˝
)
>
0
,thereforewehave
˘
2
[0
;
1)
.
DiscussionsofAssumption4
Intuitively,givenastateandastationarypolicy
ˇ
,therelativerelationshipsamongactions
canbeindependent,consideringa˝xedMDP
M
.Therelativerelationshipamongactions
istherelativerelationshipofactions'return.Startingfromthesamestate,followinga
206
stationarypolicy,theactions'returnisdeterminedbyMDPpropertiessuchasenvironment
dynamics,rewardfunction,etc.
Moreconcretely,weconsideraMDPwiththreeactions
(
a
1
;a
2
;a
3
)
foreachstate.The
actionvalue
Q
ˇ
M
satis˝estheBellmanequationinEq(A.29).Noticethatinthissubsection,
weuse
Q
ˇ
M
todenotetheactionvaluethatestimatestheabsolutevalueofreturnin
M
.
Q
ˇ
M
(
s;a
i
)=
r
(
s;a
i
)+max
a
E
s
0
˘
p
(

s;a
)
Q
ˇ
M
(
s
0
;a
)
;
8
i
=1
;
2
;
3
:
(A.29)
AswecanseefromEq(A.29),
Q
ˇ
M
(
s;a
i
)
;i
=1
;
2
;
3
isonlyrelatedto
s;ˇ
,andenvironment
dynamics
P
.Itmeansif
ˇ
,
M
and
s
aregiven,theactionvaluesofthreeactionsare
determined.Therefore,wecanuseadirectedgraph[
21
]tomodeltherelationshipofaction
values,asshowninFigureA.2(a).Similarly,ifweonlyconsidertherankingofactions,this
rankingisconsistentwiththerelationshipofactions'return,whichisalsodeterminedby
s;ˇ
,
and
P
.Therefore,thepairwiserelationshipamongactionscanbedescribedasthedirected
graphinFigureA.2(b),whichestablishestheconditionalindependenceofactions'pairwise
relationship.Basedontheabovereasoning,weconcludethatAssumption4isrealistic.
TheproofofTheorem6
Proof.
TheproofmainlyestablishesontheproofforlongtermperformanceTheorem5and
connectsthegeneralizationboundinPACframeworktothelowerboundofreturn.We
constructahybridpolicybasedonpairwiserankingpolicyEq(4.5)asfollows:
207
If
ˇ

(
s
)=argmax
a


(
s;a
)
,
p
h
(
a
j
s
)=
8
>
<
>
:
1
;ˇ

(
s
)=argmax
a


(
s;a
)
0
;o:w:
(A.30)
If
ˇ

(
s
)
6
=argmax
a


(
s;a
)
,
p
h
(
a
j
s
)=
ˇ

(
a
j
s
)=
m
j
6
=
i;j
=1
p
ij
(A.31)
InplainEnglish,thehybridpolicycanbedescribedasfollows:forastate
s
andthe
policyparameter

,iftheactionchosenbyUOPhasthehighestrelativeactionvalue(i.e.,
ˇ

(
s
)=
argmax
a


(
s;a
)
),weusethedeterministicpolicyasde˝nedinEq(A.30)forthis
state.Otherwise,weusethestochasticpolicyasde˝nedinEq(A.31).Notethatthe
constructionofthispolicyassumewehaveaccesstotheUOP
ˇ

,whichisfeasibleinour
setting.WithTRS6,wecan˝lteralluniqueoptimaltrajectoriesfollowingUOP.Therefore,
whenUOPisdeterministic,foreachstate,wehavetheactionthatischosenbytheUOP.
Westudythegeneralizationperformanceandsamplecomplexityofthepairwiseranking
policyasfollows:
log(
1
jTj
X
˝
2T
p

(
˝
)
w
(
˝
))

1
jTj
X
˝
2T
log
p

(
˝
)
w
(
˝
)
,
X
˝
2T
p

(
˝
)
w
(
˝
)
jTj
exp(
1
jTj
X
˝
2T
log
p

(
˝
)
w
(
˝
))
denote
F
=
X
˝
p

(
˝
)
w
(
˝
)=
X
˝
2T
p

(
˝
)
w
(
˝
)
(A.32)
208
,
F
jTj
exp(
1
jTj
X
˝
2T
log
p

(
˝
)
w
(
˝
))
=
jTj
exp

1
jTj
X
˝
2T

log
p
(
s
1
)+
X
T
t
=1
log
p
(
s
t
+1
j
s
t
;a
t
)+
X
T
t
=1
log
p
h
(
a
t
j
s
t
)+log
w
(
˝
)

(A.33)
*
w
(
˝
)=1
;
8
˝
2T
;s
t
=
s
(
˝;t
)
;a
t
=
a
(
˝;t
)
;t
=1
;:::;T
=
jTj
exp
 
1
jTj
X
˝
2T
 
log
p
(
s
1
)+
X
T
t
=1
log
p
(
s
t
+1
j
s
t
;a
t
)+
T
X
t
=1
log
p
h
(
a
t
j
s
t
)
!!
=
jTj
exp
 
1
jTj
X
˝
2T
(log
p
(
s
1
)+
X
T
t
=1
log
p
(
s
t
+1
j
s
t
;a
t
))
!
exp

1
jTj
X
˝
2T
(
X
T
t
=1
log
p
h
(
a
t
j
s
t
))

(A.34)
Denotethedynamicsofatrajectoryas
p
d
(
˝
)=
p
(
s
1

T
t
=1
p
(
s
t
+1
j
s
t
;a
t
)
Noticethat
p
d
(
˝
)
isenvironmentdynamics,whichis˝xedgivenaspeci˝cMDP.
,
F
jTj
exp

1
jTj
X
˝
2T
log
p
d
(
˝
)

exp

1
jTj
X
˝
2T
(
X
T
t
=1
log
p
h
(
a
t
j
s
t
))

=
jTj

˝
2T
p
d
(
˝
))
1
jTj
exp

1
jTj
T
X
˝
2T
(
X
T
t
=1
log
p
h
(
a
t
j
s
t
))
T

UsethesamereasoningfromEq(A.7)toEq(A.10).
=
jTj

˝
2T
p
d
(
˝
))
1
jTj
exp

T
X
s;a
p
ˇ

(
s;a
)log
p
h
(
a
j
s
)

=
jTj

˝
2T
p
d
(
˝
))
1
jTj
exp(
TL
)
Wedenote
L
=
X
s;a
p
ˇ

(
s;a
)log
p
h
(
a
j
s
)
:
L
istheonlytermthatisrelatedtothepolicyparameter

209
Given
h
=
ˇ

;
misclassi˝edstateactionpairsset
U
w
=
f
s;a
j
h
(
s
)
6
=
a;
(
s;a
)
˘
p

(
s;a
)
g
L
=
X
s;a
2
U
w
p
ˇ

(
s;a
)log
p
h
(
a
j
s
)+
X
s;a=
2
U
w
p
ˇ

(
s;a
)log
p
h
(
a
j
s
)
Byde˝nitionof
U
w
;
8
s;a=
2
U
w
;h
(
s
)=
a;
)
p
h
(
a
j
s
)=1
:
(A.35)
=
X
s;a
2
U
w
p
ˇ

(
s;a
)log
ˇ

(
a
j
s
)
SinceweuseRPGasourpolicyparameterization,thenwith
Eq
(4
:
5)
=
X
s;a
2
U
w
p
ˇ

(
s;a
)log
m
j
6
=
i;j
=1
p
ij
)
=
X
s;a
i
2
U
w
p
ˇ

(
s;a
i
)
X
m
j
6
=
i;j
=1
log
1
1+
e
Q
ji
ByCondition1,whichcanbeeasilysatis˝edinpractice.Thenwehave:
Q
ij
<
2
c
q

1
ApplyLemma1,themisclassi˝edrateisatmost
:

X
s;a
i
2
U
w
p
ˇ

(
s;a
i
)(
m

1)log(
1
1+
e
)

X
s;a
i
2
U
w
p
ˇ

(
s;a
i
)(
m

1)log(1+
e
)


(
m

1)log(1+
e
)
=

(1

m
)log(1+
e
)
F
jTj

˝
2T
p
d
(
˝
))
1
jTj
exp(
TL
)
jTj

˝
2T
p
d
(
˝
))
1
jTj
exp(

(1

m
)
T
log(1+
e
))
jTj

˝
2T
p
d
(
˝
))
1
jTj
(1+
e
)

(1

m
)
T
=
D
(1+
e
)

(1

m
)
T
210
Fromgeneralizationperformancetosamplecomplexity:
Set1


=
D
(1+
e
)

(1

m
)
T
;
where
D
=
jTj

˝
2T
p
d
(
˝
))
1
jTj

=
log
1+
e
D
1


(
m

1)
T
Withrealizableassumption11,

min
=0

=


min
2
=

2
n

1
2

2
log
2
jHj

=
2(
m

1)
2
T
2

log
1+
e
D
1


2
log
2
jHj

Bridgethelong-termrewardandlong-termperformance:
X
˝
p

(
˝
)
r
(
˝
)
InSection4.2.7,
r
(
˝
)
2
[0
;
1]
;
8
˝:

X
˝
p

(
˝
)
w
(
˝
)
SincewefocusonUOPDef8,c=1inTSRDef6
=
X
˝
2T
p

(
˝
)
w
(
˝
)

1


Thisthusconcludestheproof.
Assumption11
(Realizable)
.
Weassumethereexistsahypothesis
h

2H
thatobtainszero
expectedrisk,i.e.
9
h

2H)
P
s;a
p
ˇ

(
s;a
)
1
f
h

(
s
)
6
=
a
g
=0
.
TheAssumption11isnotnecessaryfortheproofofTheorem6.Fortheproofof
Corollary4,weintroducethisassumptiontoachievemoreconciseconclusion.In˝nite
211
MDP,therealizableassumptioncanbesatis˝edifthepolicyisparameterizedbymulti-layer
neuralnetwork,duetoitsperfect˝nitesampleexpressivity[
224
].Itisalsoadvocatedinour
empiricalstudiessincetheneuralnetworkachievedoptimalperformancein
Pong
.
TheproofofLemma2
Proof.
Let
e
=
i
denotestheevent
n
=
i
j
k
,i.e.thenumberofdi˙erentoptimaltrajectoriesin
˝rst
k
episodesisequalto
i
.Similarly,
e

i
denotestheevent
n

i
j
k
.Sincetheevents
e
=
i
and
e
=
j
aremutuallyexclusivewhen
i
6
=
j
.Therefore,
p
(
e

i
)=
p
(
e
=
i
;e
=
i
+1
;:::;e
=
jTj
)=
P
jTj
j
=
i
p
(
e
=
j
)
.Furthermore,weknowthat
P
T
i
=0
p
(
e
=
i
)=1
since
f
e
=
i
;i
=0
;:::;
jTjg
constructsanuniversalset.Forexample,
p
(
e

1
)=
p
ˇ
r
;
M
(
n

1
j
k
)=1

p
ˇ
r
;
M
(
n
=0
j
k
)=
1

(
N
j
N
)
k
.
p
ˇ
r
;
M
(
n

i
j
k
)=1

X
i

1
i
0
=0
p
ˇ;
M
(
n
=
i
0
j
k
)
=1

X
i

1
i
0
=0
C
i
0
jTj
P
i
0
j
=0
(

1)
j
C
j
i
0
(
N
jTj
+
i
0

j
)
k
N
k
(A.36)
InEq(A.36),weusetheinclusion-exclusionprinciple[93]tohavethefollowingequality.
p
ˇ
r
;
M
(
n
=
i
0
j
k
)=
C
i
0
jTj
p
(
e
˝
1
;˝
2
;:::;˝
i
0
)
=
C
i
0
jTj
P
i
0
j
=0
(

1)
j
C
j
i
0
(
N
jTj
+
i
0

j
)
k
N
k
e
˝
1
;˝
2
;:::;˝
i
0
denotestheevent:in˝rst
k
episodes,acertainsetof
i
0
optimaltrajectories
˝
1
;˝
2
;:::;˝
i
0
;i
0
jTj
issampled.
212
TableA.2:HyperparametersofRPGnetwork
HyperparametersValue
ArchitectureConv(32-8

8-4)
-Conv(64-4

4-2)
-Conv(64-3

3-2)
-FC(512)
Learningrate0.0000625
Batchsize32
Replaybu˙ersize1000000
Updateperiod4
MargininEq(4.14)1
TheproofofCorollary5
Proof.
TheCorollary5isadirectapplicationofLemma2andTheorem6.First,wereformat
Theorem6asfollows:
p
(
A
j
B
)

1


whereevent
A
denotes
P
˝
p

(
˝
)
r
(
˝
)

D
(1+
e
)

(1

m
)
T
,event
B
denotesthenumberof
state-actionpairs
n
0
fromUOP(Def8)satisfying
n
0

n
,given˝xed

.WithLemma2,we
have
p
(
B
)

p
ˇ
r
;
M
(
n
0

n
j
k
)
.Then,
P
(
A
)=
P
(
A
j
B
)
P
(
B
)

(1


)
p
ˇ
r
;
M
(
n
0

n
j
k
)
.
Set
(1


)
p
ˇ
r
;
M
(
n
0

n
j
k
)=1


0
wehave
P
(
A
)

1


0

=1

1


0
p
ˇ
r
;
M
(
n
0

n
j
k
)

=2
r
1
2
n
log
2
jHj

=2
s
1
2
n
log
2
jHj
p
ˇ
r
;
M
(
n
0

n
j
k
)
p
ˇ
r
;
M
(
n
0

n
j
k
)

1+

0
213
Hyperparameters
WepresentthetrainingdetailsofrankingpolicygradientinTableA.2.Thenetwork
architectureisthesameastheconvolutionneuralnetworkusedinDQN[
132
].Weupdate
theRPGnetworkeveryfourtimestepswithaminibatchofsize32.Thereplayratioisequal
toeightforallbaselinesandRPG(exceptforACERweusethedefaultsettinginopenai
baselines[48]forbetterperformance).
214
AppendixB
FederatedLearning
AdditionalNotations
Inthissection,weintroduceadditionalnotationsthatareusedthroughouttheproof.Following
commonpractice[
178
,
110
],wede˝netwovirtualsequences
v
t
and
w
t
.Forfulldevice
participationand
t=
2I
E
,
v
t
=
w
t
=
P
N
k
=1
p
k
v
k
t
.Forpartialparticipation,
t
2I
E
,
w
t
6
=
v
t
since
v
t
=
P
N
k
=1
p
k
v
k
t
while
w
t
=
P
k
2S
t
w
k
t
.However,wecansetunbiasedsampling
strategysuchthat
E
S
t
w
t
=
v
t
.
v
t
+1
isone-stepSGDfrom
w
t
.
v
t
+1
=
w
t


t
g
t
;
(B.1)
where
g
t
=
P
N
k
=1
p
k
g
t;k
isone-stepstochasticgradient,averagedoveralldevices.
g
t;k
=
r
F
k

w
k
t
;˘
k
t

;
Similarly,wedenotetheexpectedone-stepgradient
g
t
=
E
˘
t
[
g
t
]=
P
N
k
=1
p
k
E
˘
k
t
g
t;k
,where
E
˘
k
t
g
t;k
=
r
F
k

w
k
t

;
(B.2)
215
and
˘
t
=
f
˘
k
t
g
N
k
=1
denotesrandomsamplesatalldevicesattimestep
t
.Sinceinthiswork,
wealsoconsiderthecaseofpartialparticipation.Thesamplingstrategytoapproximate
thesystemheterogeneitycanalsoa˙ecttheconvergence.Herewefollowthepriorarts[
75
]
consideringtwotypesofsamplingschemes.ThesamplingschemeIestablishes
S
t
+1
byi.i.d.
samplingthedeviceswithreplacement,inthiscasetheupperboundofexpectedsquarenorm
of
w
t
+1

v
t
+1
isgivenby[110,Lemma5]:
E
S
t
+1
k
w
t
+1

v
t
+1
k
2

4
K

2
t
E
2
G
2
:
(B.3)
ThesamplingschemeIIestablishes
S
t
+1
byuniformlysamplingalldeviceswithoutreplace-
ment,inwhichwehavethe
E
S
t
+1
k
w
t
+1

v
t
+1
k
2

4(
N

K
)
K
(
N

1)

2
t
E
2
G
2
:
(B.4)
Wedenotethisupperboundasfollowsforconcisepresentation.
E
S
t
+1
k
w
t
+1

v
t
+1
k
2


2
t
C:
(B.5)
ComparisonofConvergenceRateswithRelatedWorks
Inthissection,wecompareourconvergenceratewiththebest-knownresultsintheliterature
(seeTableB.1).In[
75
],theauthorsprovide
O
(1
=NT
)
convergencerateofnon-convex
problemsunderPolyak-−ojasiewicz(PL)condition,whichmeanstheirresultscandirectly
applytothestronglyconvexproblems.However,theirassumptionisbasedonbounded
216
gradientdiversity,de˝nedasfollows:

w
)=
P
k
p
k
kr
F
k
(
w
)
k
2
2
k
P
k
p
k
r
F
k
(
w
)
k
2
2

B
Thisisamorerestrictiveassumptioncomparingtoassumingboundedgradientunderthe
caseoftargetaccuracy

!
0
andPLcondition.Toseethis,considerthegradientdiversity
attheglobaloptimal
w

,i.e.,

w

)=
P
k
p
k
kr
F
k
(
w
)
k
2
2
k
P
k
p
k
r
F
k
(
w
)
k
2
2
.For

w

)
tobebounded,it
requires
kr
F
k
(
w

)
k
2
2
=0
,
8
k
.Thisindicates
w

isalsotheminimizerofeachlocalobjective,
whichcontradictstothepracticalsettingofheterogeneousdata.Therefore,theirbound
isnote˙ectiveforarbitrarysmall

-accuracyundergeneralheterogeneousdatawhileour
convergenceresultsstillholdinthiscase.
TableB.1:Ahigh-levelsummaryoftheconvergenceresultsinthispapercomparedtoprior
state-of-the-artFLalgorithms.Thistableonlyhighlightsthedependenceon
T
(numberof
iterations),
E
(themaximalnumberoflocalsteps),
N
(thetotalnumberofdevices),and
K

N
thenumberofparticipateddevices.

istheconditionnumberofthesystemand

2
(0
;
1)
.WedenoteNesterovacceleratedFedAvgasN-FedAvginthistable.
Reference
Convergencerate
E
NonIID
Participation
ExtraAssumptions
Setting
FedAvg[110]
O
(
E
2
T
)
O
(1)
3
Partial
Boundedgradient
Stronglyconvex
FedAvg[75]
O
(
1
KT
)
O
(
K

1
=
3
T
2
=
3
)
y
3
zz
Partial
Boundedgradientdiversity
Stronglyconvex
x
FedAvg[104]
O
(
1
NT
)
O
(
N

1
=
2
T
1
=
2
)
3
Full
Boundedgradient
Stronglyconvex
FedAvg/N-FedAvg
O
(
1
KT
)
O
(
N

1
=
2
T
1
=
2
)
z
3
Partial
Boundedgradient
Stronglyconvex
FedAvg[98]
O
(
1
p
NT
)
O
(
N

3
=
2
T
1
=
2
)
3
Full
Boundedgradient
Convex
FedAvg[104]
O
(
1
p
NT
)
O
(
N

3
=
4
T
1
=
4
)
3
Full
Boundedgradient
Convex
FedAvg/N-FedAvg
O

1
p
KT

O
(
N

3
=
4
T
1
=
4
)
z
3
Partial
Boundedgradient
Convex
FedAvg
O

exp(

NT
E
1
)

O
(
T

)
3
Partial
Boundedgradient
OverparameterizedLR
FedMass
O

exp(

NT
E
p

1
~

)

O
(
T

)
3
Partial
Boundedgradient
OverparameterizedLR
y
This
E
isobtainedunderi.i.d.setting.
z
This
E
isobtainedunderfullparticipationsetting.
x
In[
75
],theconvergencerateisfornon-convexsmoothproblemswithPLcondition,which
alsoappliestostronglyconvexproblems.Therefore,wecompareitwithourstronglyconvex
resultshere.
zz
Theboundedgradientdiversityassumptionisnotapplicableforgeneralheterogeneous
datawhenconvergingtoarbitrarilysmall

-accuracy(seediscussionsinSecB).
217
ProofofConvergenceResultsforFedAvg
StronglyConvexSmoothObjectives
Tofacilitatereading,theoremsfromthemainpaperarerestatedandnumberedidentically.
We˝rstsummarizesomepropertiesof
L
-smoothand

-stronglyconvexfunctions[161].
Lemma8.
Let
F
beaconvex
L
-smoothfunction.Thenwehavethefollowinginequalities:
1.Quadraticupperbound:
0

F
(
w
)

F
(
w
0
)
hr
F
(
w
0
)
;
w

w
0
i
L
2
k
w

w
0
k
2
.
2.Coercivity:
1
L
kr
F
(
w
)
r
F
(
w
0
)
k
2
hr
F
(
w
)
r
F
(
w
0
)
;
w

w
0
i
.
3.Lowerbound:
F
(
w
)

F
(
w
0
)+
hr
F
(
w
0
)
;
w

w
0
i
+
1
2
L
kr
F
(
w
)
r
F
(
w
0
)
k
2
.In
particular,
kr
F
(
w
)
k
2

2
L
(
F
(
w
)

F
(
w

))
.
4.Optimalitygap:
F
(
w
)

F
(
w

)

F
(
w
)
;
w

w

i
.
Lemma9.
Let
F
bea

-stronglyconvexfunction.Then
F
(
w
)

F
(
w
0
)+
hr
F
(
w
0
)
;
w

w
0
i
+
1
2

kr
F
(
w
)
r
F
(
w
0
)
k
2
F
(
w
)

F
(
w

)

1
2

kr
F
(
w
)
k
2
Theorem14.
Let
w
T
=
P
N
k
=1
p
k
w
k
T
,

max
=
max
k
Np
k
,andsetdecayinglearningrates

t
=
1
4

(

+
t
)
with

=
max
f
32
E
g
and

=
L

.ThenunderAssumptions7,8,9,10withfull
deviceparticipation,
E
F
(
w
T
)

F

=
O


2
max
˙
2

NT
+

2
E
2
G
2

T
2

andwithpartialdeviceparticipationwithatmost
K
sampleddevicesateachcommunication
218
round,
E
F
(
w
T
)

F

=
O


2
G
2

KT
+

2
max
˙
2

NT
+

2
E
2
G
2

T
2

Proof.
Theproofbuildsonideasfrom[
110
].The˝rststepistoobservethatthe
L
-smoothness
of
F
providestheupperbound
E
(
F
(
w
t
))

F

=
E
(
F
(
w
t
)

F
(
w

))

L
2
E
k
w
t

w

k
2
andbound
E
k
w
t

w

k
2
.
Ourmainstepistoprovethebound
E
k
w
t
+1

w

k
2

(1


t
)
E
k
w
t

w

k
2
+

2
t
1
N

2
max
˙
2
+5
E
2

3
t
G
2
Wehave
k
w
t
+1

w

k
2
=
k
(
w
t


t
g
t
)

w

k
2
=
k
(
w
t


t
g
t

w

)


t
(
g
t

g
t
)
k
2
=
A
1
+
A
2
+
A
3
where
A
1
=
k
w
t

w


t
g
t
k
2
A
2
=2

t
h
w
t

w


t
g
t
;
g
t

g
t
i
219
A
3
=

2
t
k
g
t

g
t
k
2
Byde˝nitionof
g
t
and
g
t
(seeEq(B.2)),wehave
E
A
2
=0
.For
A
3
,wehavethefollow
upperbound:

2
t
E
k
g
t

g
t
k
2
=

2
t
E
k
g
t

E
g
t
k
2
=

2
t
N
X
k
=1
p
2
k
k
g
t;k

E
g
t;k
k
2


2
t
N
X
k
=1
p
2
k
˙
2
k
againbyJensen'sinequalityandusingtheindependenceof
g
t;k
;
g
t;k
0
[110,Lemma2].
Nextwebound
A
1
:
k
w
t

w


t
g
t
k
2
=
k
w
t

w

k
2
+2
h
w
t

w

;


t
g
t
i
+
k

t
g
t
k
2
andwewillshowthatthethirdterm
k

t
g
t
k
2
canbecanceledbyanupperboundofthe
secondterm.
Now

2

t
h
w
t

w

;
g
t
i
=

2

t
N
X
k
=1
p
k
h
w
t

w

;
r
F
k
(
w
k
t
)
i
=

2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i
2

t
N
X
k
=1
p
k
h
w
k
t

w

;
r
F
k
(
w
k
t
)
i

2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i
+2

t
N
X
k
=1
p
k
(
F
k
(
w

)

F
k
(
w
k
t
))


t

N
X
k
=1
p
k
k
w
k
t

w

k
2
220

2

t
N
X
k
=1
p
k

F
k
(
w
k
t
)

F
k
(
w
t
)+
L
2
k
w
t

w
k
t
k
2
+
F
k
(
w

)

F
k
(
w
k
t
)


t

k
N
X
k
=1
p
k
w
k
t

w

k
2
=

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+2

t
N
X
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]


t

k
w
t

w

k
2
Forthesecondterm,whichisnegative,wecanignoreit,butthisyieldsasuboptimalbound
thatfailstoprovidethedesiredlinearspeedup.Instead,weupperbounditusingthefollowing
derivation:
2

t
N
X
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]

2

t
[
F
(
w
t
+1
)

F
(
w
t
)]

2

t
E
hr
F
(
w
t
)
;
w
t
+1

w
t
i
+

t
L
E
k
w
t
+1

w
t
k
2
=

2

2
t
E
hr
F
(
w
t
)
;
g
t
i
+

3
t
L
E
k
g
t
k
2
=

2

2
t
E
hr
F
(
w
t
)
;
g
t
i
+

3
t
L
E
k
g
t
k
2
=


2
t
h
kr
F
(
w
t
)
k
2
+
k
g
t
k
2
kr
F
(
w
t
)

g
t
k
2
i
+

3
t
L
E
k
g
t
k
2
=


2
t
"
kr
F
(
w
t
)
k
2
+
k
g
t
k
2
kr
F
(
w
t
)

X
k
p
k
r
F
(
w
k
t
)
k
2
#
+

3
t
L
E
k
g
t
k
2


2
t
"
kr
F
(
w
t
)
k
2
+
k
g
t
k
2

X
k
p
k
kr
F
(
w
t
)
r
F
(
w
k
t
)
k
2
#
+

3
t
L
E
k
g
t
k
2


2
t
"
kr
F
(
w
t
)
k
2
+
k
g
t
k
2

L
2
X
k
p
k
k
w
t

w
k
t
k
2
#
+

3
t
L
E
k
g
t
k
2


2
t
k
g
t
k
2
+

2
t
L
2
X
k
p
k
k
w
t

w
k
t
k
2
+

3
t
L
E
k
g
t
k
2


2
t
kr
F
(
w
t
)
k
2
wherewehaveusedthesmoothnessof
F
twice.
221
Notethattheterm


2
t
k
g
t
k
2
exactlycancelsthe

2
t
k
g
t
k
2
intheboundfor
A
1
,sothat
pluggingintheboundfor

2

t
h
w
t

w

;
g
t
i
,wehavesofarproved
E
k
w
t
+1

w

k
2

E
(1


t
)
k
w
t

w

k
2
+

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+

2
t
N
X
k
=1
p
2
k
˙
2
k
+

2
t
L
2
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+

3
t
L
E
k
g
t
k
2


2
t
kr
F
(
w
t
)
k
2
Theterm
E
k
g
t
k
2

G
2
byassumption.
Nowwebound
E
P
N
k
=1
p
k
k
w
t

w
k
t
k
2
following[
110
].Sincecommunicationisdoneevery
E
steps,forany
t

0
,wecan˝nda
t
0

t
suchthat
t

t
0

E

1
and
w
k
t
0
=
w
t
0
forall
k
.
Moreover,using

t
isnon-increasingand

t
0

2

t
forany
t

t
0

E

1
,wehave
E
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
=
E
N
X
k
=1
p
k
k
w
k
t

w
t
0

(
w
t

w
t
0
)
k
2

E
N
X
k
=1
p
k
k
w
k
t

w
t
0
k
2
=
E
N
X
k
=1
p
k
k
w
k
t

w
k
t
0
k
2
=
E
N
X
k
=1
p
k
k
t

1
X
i
=
t
0

i
g
i;k
k
2

2
N
X
k
=1
p
k
E
t

1
X
i
=
t
0
E
2
i
k
g
i;k
k
2

2
N
X
k
=1
p
k
E
2

2
t
0
G
2

4
E
2

2
t
G
2
222
Usingtheboundon
E
P
N
k
=1
p
k
k
w
t

w
k
t
k
2
,wecanconcludethat,with

max
:=
N

max
k
p
k
and

min
:=
N

min
k
p
k
,
E
k
w
t
+1

w

k
2

E
(1


t
)
k
w
t

w

k
2
+4
E
2

3
t
G
2
+4
E
2
L
2

4
t
G
2
+

2
t
N
X
k
=1
p
2
k
˙
2
k
+

3
t
LG
2
=
E
(1


t
)
k
w
t

w

k
2
+4
E
2

3
t
G
2
+4
E
2
L
2

4
t
G
2
+

2
t
1
N
2
N
X
k
=1
(
p
k
N
)
2
˙
2
k
+

3
t
LG
2

E
(1


t
)
k
w
t

w

k
2
+4
E
2

3
t
G
2
+4
E
2
L
2

4
t
G
2
+

2
t
1
N
2

2
max
N
X
k
=1
˙
2
k
+

3
t
LG
2

E
(1


t
)
k
w
t

w

k
2
+6
E
2

3
t
G
2
+

2
t
1
N

2
max
˙
2
whereinthelastinequalityweuse
˙
2
=
max
k
˙
2
k
,andassume

t
satis˝es

t

1
8
.Weshow
nextthat
E
k
w
t

w

k
2
=
O
(
1
tN
+
E
2
LG
2
t
2
)
.
Let
C

6
E
2
LG
2
and
D

1
N

2
max
˙
2
.Supposethatwehaveshown
E
k
w
t

w

k
2

b

(

t
D
+

2
t
C
)
forsomeconstant
b
and

t
.Then
E
k
w
t
+1

w

k
2

b
(1


t
)(

t
D
+

2
t
C
)+

2
t
D
+

3
t
C
=(
b
(1


t
)+

t
)

t
D
+(
b
(1


t
)+

t
)

2
t
C
223
andsoitremainstochoose

t
and
b
suchthat
(
b
(1


t
)+

t
)

t


t
+1
and
(
b
(1


t
)+

t
)

2
t


2
t
+1
.Recallthatwerequire

t
0

2

t
forany
t

t
0

E

1
,and

t

1
8
.
Ifwelet

t
=
4

(
t
+

)
where

=
max
f
E;
32

g
,thenwemaycheckthat

t
satis˝esboth
requirements.
Setting
b
=
4

,wehave
(
b
(1


t
)+

t
)

t
=

b
(1

4
t
+

)+
4

(
t
+

)

4

(
t
+

)
=

b
t
+


4
t
+

+
4

(
t
+

)

4

(
t
+

)
=
b
(
t
+


3
t
+

)
4

(
t
+

)

b
(
t
+


1
t
+

)
4

(
t
+

)

b
4

(
t
+

+1)
=

t
+1
and
(
b
(1


t
)+

t
)

2
t
=

b
(1

4
t
+

)+
4

(
t
+

)

16

2
(
t
+

)
2
=

b
t
+


4
t
+

+
4

(
t
+

)

16

2
(
t
+

)
2
=
b
(
t
+


2
t
+

)
16

2
(
t
+

)
2

b
16

2
(
t
+

+1)
2
=

2
t
+1
wherewehaveusedthefactsthat
t
+


1
(
t
+

)
2

1
(
t
+

+1)
224
t
+


2
(
t
+

)
3

1
(
t
+

+1)
2
for


1
.
Thuswehaveshown
E
k
w
t
+1

w

k
2

b

(

t
+1
D
+

2
t
+1
C
)
forourchoiceof

t
and
b
.Nowtoensure
k
w
0

w

k
2

b

(

0
D
+

2
0
C
)
=
b

(
4

D
+
16

2

2
C
)
wecansimplyscale
b
by
c
k
w
0

w

k
2
foraconstant
c
largeenoughandtheinductionstep
stillholds.
Itfollowsthat
E
k
w
t

w

k
2

c
k
w
0

w

k
2
4

(
D
t
+
C
2
t
)
forall
t

0
.
Finally,the
L
-smoothnessof
F
implies
E
(
F
(
w
T
))

F

=
E
(
F
(
w
T
)

F
(
w

))

L
2
E
k
w
T

w

k
2

L
2
c
k
w
0

w

k
2
4

(
D
T
+
C
2
T
)
225
=2
c
k
w
0

w

k
2

(
D
T
+
C
2
T
)

2
c
k
w
0

w

k
2


4

(
T
+

)

1
N

2
max
˙
2
+6
E
2
LG
2

(
4

(
T
+

)
)
2

=
O
(


1
N

2
max
˙
2

1
T
+

2

E
2
G
2

1
T
2
)
Withpartialparticipation,theupdateateachcommunicationroundisnowgivenby
averagesoverasubsetofsampleddevices.When
t
+1
=
2I
E
,
v
t
+1
=
w
t
+1
,whilewhen
t
+1
=
2I
E
,wehave
E
w
t
+1
=
v
t
+1
bydesignofthesamplingschemes,sothat
E
k
w
t
+1

w

k
2
=
E
k
w
t
+1

v
t
+1
+
v
t
+1

w

k
2
=
E
k
w
t
+1

v
t
+1
k
2
+
E
k
v
t
+1

w

k
2
Asbefore,
E
k
v
t
+1

w

k
2

E
(1


t
)
k
w
t

w

k
2
+6
E
2

3
t
G
2
+

2
t
1
N

2
max
˙
2
.
Thekeyistobound
E
k
w
t
+1

v
t
+1
k
2
.ForsamplingschemeIwehave
E
k
w
t
+1

v
t
+1
k
2
=
1
K
X
k
p
k
E
k
w
k
t
+1

w
t
+1
k
2

4
K

2
t
E
2
G
2
whileforsamplingschemeII
E
k
w
t
+1

v
t
+1
k
2
=
N

K
N

1
1
K
X
k
p
k
E
k
w
k
t
+1

w
t
+1
k
2

N

K
N

1
4
K

2
t
E
2
G
2
226
Thesameargumentasthefullparticipationcaseimplies
E
F
(
w
T
)

F

=
O
(

2
max
˙
2

NT
+

2
G
2

KT
+

2
E
2
G
2

T
2
)
Onemayaskwhetherthedependenceon
E
intheterm

2
G
2

KT
canberemoved,or
equivalentlywhether
P
k
p
k
k
w
k
t

w
t
k
2
=
O
(1
=T
2
)
canbeindependentof
E
.Weprovidea
simplecounterexamplethatshowsthatthisisnotpossibleingeneral.
Lemma10.
Thereexistsadatasetsuchthatif
E
=
O
(
T

)
forany
>
0
then
P
k
p
k
k
w
k
t

w
t
k
2
=
1
T
2

2

)
.
Proof.
Supposethatwehaveanevennumberofdevicesandeach
F
k
(
w
)=
1
n
k
P
n
k
j
=1
(
x
j
k

w
)
2
containsdatapoints
x
j
k
=
w

;k
,with
n
k

n
.Moreover,the
w

;k
'scomeinpairsaround
theorigin.Asaresult,theglobalobjective
F
isminimizedat
w

=0
.Moreover,ifwestart
from
w
0
=0
,thenbydesignofthedatasettheupdatesinlocalstepsexactlycanceleach
otherateachiteration,resultingin
w
t
=0
forall
t
.Ontheotherhand,if
E
=
T

,then
startingfromany
t
=
O
(
T
)
withconstantstepsize
O
(
1
T
)
,after
E
iterationsoflocalsteps,
thelocalparametersareupdatedtowards
w

;k
with
k
w
k
t
+
E
k
2
=
T


1
T
)
2
)=
1
T
2

2

)
.
Thisimpliesthat
X
k
p
k
k
w
k
t
+
E

w
t
+
E
k
2
=
X
k
p
k
k
w
k
t
+
E
k
2
=
1
T
2

2

)
whichisataslowerratethan
1
T
2
forany
>
0
.Thusthesamplingvariance
E
k
w
t
+1

227
v
t
+1
k
2
=
P
k
p
k
E
k
w
k
t
+1

w
t
+1
k
2
)
decaysataslowerratethan
1
T
2
,resultingina
convergencerateslowerthan
O
(
1
T
)
withpartialparticipation.
ConvexSmoothObjectives
Theorem15.
Underassumptions7,9,10andconstantlearningrate

t
=
O
(
q
N
T
)
,
min
t

T
F
(
w
t
)

F
(
w

)=
O


max
˙
2
p
NT
+
NE
2
LG
2
T

withfullparticipation,andwithpartialdeviceparticipationwith
K
sampleddevicesateach
communicationroundandlearningrate

t
=
O
(
q
K
T
)
,
min
t

T
F
(
w
t
)

F
(
w

)=
O


max
˙
2
p
KT
+
E
2
G
2
p
KT
+
KE
2
LG
2
T

Proof.
Weagainstartbyboundingtheterm
k
w
t
+1

w

k
2
=
k
(
w
t


t
g
t
)

w

k
2
=
k
(
w
t


t
g
t

w

)


t
(
g
t

g
t
)
k
2
=
A
1
+
A
2
+
A
3
where
A
1
=
k
w
t

w


t
g
t
k
2
A
2
=2

t
h
w
t

w


t
g
t
;
g
t

g
t
i
A
3
=

2
t
k
g
t

g
t
k
2
228
Byde˝nitionof
g
t
and
g
t
(seeEq(B.2)),wehave
E
A
2
=0
.For
A
3
,wehavethefollow
upperbound:

2
t
E
k
g
t

g
t
k
2
=

2
t
E
k
g
t

E
g
t
k
2
=

2
t
N
X
k
=1
p
2
k
k
g
t;k

E
g
t;k
k
2


2
t
N
X
k
=1
p
2
k
˙
2
k
againbyJensen'sinequalityandusingtheindependenceof
g
t;k
;
g
t;k
0
[110,Lemma2].
Nextwebound
A
1
:
k
w
t

w


t
g
t
k
2
=
k
w
t

w

k
2
+2
h
w
t

w

;


t
g
t
i
+
k

t
g
t
k
2
Usingtheconvexityand
L
-smoothnessof
F
k
,

2

t
h
w
t

w

;
g
t
i
=

2

t
N
X
k
=1
p
k
h
w
t

w

;
r
F
k
(
w
k
t
)
i
=

2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i
2

t
N
X
k
=1
p
k
h
w
k
t

w

;
r
F
k
(
w
k
t
)
i

2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i
+2

t
N
X
k
=1
p
k
(
F
k
(
w

)

F
k
(
w
k
t
))

2

t
N
X
k
=1
p
k

F
k
(
w
k
t
)

F
k
(
w
t
)+
L
2
k
w
t

w
k
t
k
2
+
F
k
(
w

)

F
k
(
w
k
t
)

=

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+2

t
N
X
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]
whichresultsin
k
w
t
+1

w

k
2
k
w
t

w

k
2
+

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
229
+2

t
N
X
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]+

2
t
k
g
t
k
2
+

2
t
N
X
k
=1
p
2
k
˙
2
k
Thedi˙erenceofthisboundwiththatinthestronglyconvexcaseisthatwenolonger
haveacontractionfactorinfrontof
k
w
t

w

k
2
.Inthestronglyconvexcase,wewereable
tocancel

2
t
k
g
t
k
2
with
2

t
P
N
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]
andobtainonlylowerorderterms.
Intheconvexcase,weuseadi˙erentstrategyandpreserve
P
N
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]
in
ordertoobtainatelescopingsum.
Wehave
k
g
t
k
2
=
k
X
k
p
k
r
F
k
(
w
k
t
)
k
2
=
k
X
k
p
k
r
F
k
(
w
k
t
)

X
k
p
k
r
F
k
(
w
t
)+
X
k
p
k
r
F
k
(
w
t
)
k
2

2
k
X
k
p
k
r
F
k
(
w
k
t
)

X
k
p
k
r
F
k
(
w
t
)
k
2
+2
k
X
k
p
k
r
F
k
(
w
t
)
k
2

2
L
2
X
k
p
k
k
w
k
t

w
t
k
2
+2
k
X
k
p
k
r
F
k
(
w
t
)
k
2
=2
L
2
X
k
p
k
k
w
k
t

w
t
k
2
+2
kr
F
(
w
t
)
k
2
using
r
F
(
w

)=0
.Nowusingthe
L
smoothnessof
F
,wehave
kr
F
(
w
t
)
k
2

2
L
(
F
(
w
t
)

F
(
w

))
,sothat
k
w
t
+1

w

k
2
k
w
t

w

k
2
+

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+2

t
N
X
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]
+2

2
t
L
2
X
k
p
k
k
w
k
t

w
t
k
2
+4

2
t
L
(
F
(
w
t
)

F
(
w

))+

2
t
N
X
k
=1
p
2
k
˙
2
k
230
=
k
w
t

w

k
2
+(2

2
t
L
2
+

t
L
)
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+

t
N
X
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]
+

2
t
N
X
k
=1
p
2
k
˙
2
k
+

t
(1

4

t
L
)(
F
(
w

)

F
(
w
t
))
Since
F
(
w

)

F
(
w
t
)
,aslongas
4

t
L

1
,wecanignorethelastterm,andrearrangethe
inequalitytoobtain
k
w
t
+1

w

k
2
+

t
(
F
(
w
t
)

F
(
w

))
k
w
t

w

k
2
+(2

2
t
L
2
+

t
L
)
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+

2
t
N
X
k
=1
p
2
k
˙
2
k
k
w
t

w

k
2
+
3
2

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+

2
t
N
X
k
=1
p
2
k
˙
2
k
Thesameargumentasbeforeyields
E
P
N
k
=1
p
k
k
w
t

w
k
t
k
2

4
E
2

2
t
G
2
whichgives
k
w
t
+1

w

k
2
+

t
(
F
(
w
t
)

F
(
w

))
k
w
t

w

k
2
+

2
t
N
X
k
=1
p
2
k
˙
2
k
+6

3
t
E
2
LG
2
k
w
t

w

k
2
+

2
t
1
N

2
max
˙
2
+6

3
t
E
2
LG
2
Summingtheinequalitiesfrom
t
=0
to
t
=
T
,weobtain
T
X
t
=0

t
(
F
(
w
t
)

F
(
w

))
k
w
0

w

k
2
+
T
X
t
=0

2
t

1
N

2
max
˙
2
+
T
X
t
=0

3
t

6
E
2
LG
2
sothat
min
t

T
F
(
w
t
)

F
(
w

)

1
P
T
t
=0

t
0
@
k
w
0

w

k
2
+
T
X
t
=0

2
t

1
N

2
max
˙
2
+
T
X
t
=0

3
t

6
E
2
LG
2
1
A
231
Bysettingtheconstantlearningrate

t

q
N
T
,wehave
min
t

T
F
(
w
t
)

F
(
w

)

1
p
NT
k
w
0

w

k
2
+
1
p
NT
T

N
T

1
N

2
max
˙
2
+
1
p
NT
T
(
r
N
T
)
3
6
E
2
LG
2

1
p
NT
k
w
0

w

k
2
+
1
p
NT
T

N
T

1
N

2
max
˙
2
+
N
T
6
E
2
LG
2
=(
k
w
0

w

k
2
+

2
max
˙
2
)
1
p
NT
+
N
T
6
E
2
LG
2
=
O
(

2
max
˙
2
p
NT
+
NE
2
LG
2
T
)
Similarly,forpartialparticipation,wehave
min
t

T
F
(
w
t
)

F
(
w

)

1
P
T
t
=0

t
0
@
k
w
0

w

k
2
+
T
X
t
=0

2
t

(
1
N

max
˙
2
+
C
)+
T
X
t
=0

3
t

6
E
2
LG
2
1
A
where
C
=
4
K
E
2
G
2
or
N

K
N

1
4
K
E
2
G
2
,sothatwith

t
=
q
K
T
,wehave
min
t

T
F
(
w
t
)

F
(
w

)=
O
(

max
˙
2
p
KT
+
E
2
G
2
p
KT
+
KE
2
LG
2
T
)
232
ProofofConvergenceResultsforNesterovAcceleratedFe-
dAvg
StronglyConvexSmoothObjectives
Theorem16.
Let
v
T
=
P
N
k
=1
p
k
v
k
T
andsetlearningrates

t

1
=
3
14(
t
+

)(1

6
t
+

)max
f

1
g
,

t
=
6

(
t
+

)
.ThenunderAssumptions7,8,9,10withfulldeviceparticipation,
E
F
(
v
T
)

F

=
O


max
˙
2

NT
+

2
E
2
G
2

T
2

;
andwithpartialdeviceparticipationwith
K
sampleddevicesateachcommunicationround,
E
F
(
v
T
)

F

=
O


max
˙
2

NT
+

2
G
2

KT
+

2
E
2
G
2

T
2

:
Proof.
De˝nethevirtualsequences
v
t
=
P
N
k
=1
p
k
v
k
t
,
w
t
=
P
N
k
=1
p
k
w
k
t
,and
g
t
=
P
N
k
=1
p
k
E
g
t;k
.
Wehave
E
g
t
=
g
t
and
v
t
+1
=
w
t


t
g
t
,and
w
t
+1
=
v
t
+1
forall
t
.Theproofagainuses
the
L
-smoothnessof
F
tobound
E
(
F
(
v
t
))

F

=
E
(
F
(
v
t
)

F
(
w

))

L
2
E
k
v
t

w

k
2
Ourmainstepistoprovethebound
E
k
v
t
+1

w

k
2

(1


t
)
E
k
v
t

w

k
2
+

2
t
1
N

2
max
˙
2
+20
E
2

3
t
G
2
233
forappropriatestepsizes

t
;
t
.
Wehave
k
v
t
+1

w

k
2
=
k
(
w
t


t
g
t
)

w

k
2
=
k
(
w
t


t
g
t

w

)


t
(
g
t

g
t
)
k
2
=
A
1
+
A
2
+
A
3
where
A
1
=
k
w
t

w


t
g
t
k
2
A
2
=2

t
h
w
t

w


t
g
t
;
g
t

g
t
i
A
3
=

2
t
k
g
t

g
t
k
2
Byde˝nitionof
g
t
and
g
t
(seeEq(B.2)),wehave
E
A
2
=0
.For
A
3
,wehavethefollow
upperbound:

2
t
E
k
g
t

g
t
k
2
=

2
t
E
k
g
t

E
g
t
k
2
=

2
t
N
X
k
=1
p
2
k
k
g
t;k

E
g
t;k
k
2


2
t
N
X
k
=1
p
2
k
˙
2
k
againbyJensen'sinequalityandusingtheindependenceof
g
t;k
;
g
t;k
0
[110,Lemma2].
Nextwebound
A
1
:
k
w
t

w


t
g
t
k
2
=
k
w
t

w

k
2
+2
h
w
t

w

;


t
g
t
i
+
k

t
g
t
k
2
234
SameastheSGDcase,

2

t
h
w
t

w

;
g
t
i
+
k

t
g
t
k
2


t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+

2
t
L
2
X
k
p
k
k
w
t

w
k
t
k
2
+

3
t
L
E
k
g
t
k
2


t

k
w
t

w

k
2
sothat
k
w
t

w


t
g
t
k
2

(1


t

)
k
w
t

w

k
2
+

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+

2
t
L
2
X
k
p
k
k
w
t

w
k
t
k
2
+

3
t
L
E
k
g
t
k
2
Di˙erentfromtheSGDcase,wehave
k
w
t

w

k
2
=
k
v
t
+

t

1
(
v
t

v
t

1
)

w

k
2
=
k
(1+

t

1
)(
v
t

w

)


t

1
(
v
t

1

w

)
k
2
=(1+

t

1
)
2
k
v
t

w

k
2

2

t

1
(1+

t

1
)
h
v
t

w

;
v
t

1

w

i
+

2
t

1
k
(
v
t

1

w

)
k
2

(1+

t

1
)
2
k
v
t

w

k
2
+2

t

1
(1+

t

1
)
k
v
t

w

kk
v
t

1

w

k
+

2
t

1
k
(
v
t

1

w

)
k
2
whichgives
k
v
t
+1

w

k
2
235

(1


t

)(1+

t

1
)
2
k
v
t

w

k
2
+2(1


t

)

t

1
(1+

t

1
)
k
v
t

w

kk
v
t

1

w

k
+

2
t
N
X
k
=1
p
2
k
˙
2
k
+

2
t

1
(1


t

)
k
(
v
t

1

w

)
k
2
+

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+

2
t
L
2
X
k
p
k
k
w
t

w
k
t
k
2
+

3
t
LG
2
andwewillusingthisrecursiverelationtoobtainthedesiredbound.
Firstwebound
E
P
N
k
=1
p
k
k
w
t

w
k
t
k
2
.Sincecommunicationisdoneevery
E
steps,for
any
t

0
,wecan˝nda
t
0

t
suchthat
t

t
0

E

1
and
w
k
t
0
=
w
t
0
forall
k
.Moreover,
using

t
isnon-increasing,

t
0

2

t
,and

t


t
forany
t

t
0

E

1
,wehave
E
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
=
E
N
X
k
=1
p
k
k
w
k
t

w
t
0

(
w
t

w
t
0
)
k
2

E
N
X
k
=1
p
k
k
w
k
t

w
t
0
k
2
=
E
N
X
k
=1
p
k
k
w
k
t

w
k
t
0
k
2
=
E
N
X
k
=1
p
k
k
t

1
X
i
=
t
0

i
(
v
k
i
+1

v
k
i
)

t

1
X
i
=
t
0

i
g
i;k
k
2

2
N
X
k
=1
p
k
E
t

1
X
i
=
t
0
(
E

1)

2
i
k
g
i;k
k
2
+2
N
X
k
=1
p
k
E
t

1
X
i
=
t
0
(
E

1)

2
i
k
(
v
k
i
+1

v
k
i
)
k
2

2
N
X
k
=1
p
k
E
t

1
X
i
=
t
0
(
E

1)

2
i
(
k
g
i;k
k
2
+
k
(
v
k
i
+1

v
k
i
)
k
2
)

4
N
X
k
=1
p
k
E
t

1
X
i
=
t
0
(
E

1)

2
i
G
2
236

4(
E

1)
2

2
t
0
G
2

16(
E

1)
2

2
t
G
2
wherewehaveused
E
k
v
k
t

v
k
t

1
k
2

G
2
.Toseethisidentityforappropriate

t
;
t
,note
therecursion
v
k
t
+1

v
k
t
=
w
k
t

w
k
t

1

(

t
g
t;k


t

1
g
t

1
;k
)
w
k
t
+1

w
k
t
=


t
g
t;k
+

t
(
v
k
t
+1

v
k
t
)
sothat
v
k
t
+1

v
k
t
=


t

1
g
t

1
;k
+

t

1
(
v
k
t

v
k
t

1
)

(

t
g
t;k


t

1
g
t

1
;k
)
=

t

1
(
v
k
t

v
k
t

1
)


t
g
t;k
Sincetheidentity
v
k
t
+1

v
k
t
=

t

1
(
v
k
t

v
k
t

1
)


t
g
t;k
implies
E
k
v
k
t
+1

v
k
t
k
2

2

2
t

1
E
k
v
k
t

v
k
t

1
k
2
+2

2
t
G
2
aslongas

t
;
t

1
satisfy
2

2
t

1
+2

2
t

1
=
2
,wecanguaranteethat
E
k
v
k
t

v
k
t

1
k
2

G
2
forall
k
byinduction.ThistogetherwithJensen'sinequalityalsogives
E
k
v
t

v
t

1
k
2

G
2
forall
t
.
Usingtheboundon
E
P
N
k
=1
p
k
k
w
t

w
k
t
k
2
,wecanconcludethat,with

max
:=
N

max
k
p
k
,
E
k
v
t
+1

w

k
2
237

E
(1


t
)(1+

t

1
)
2
k
v
t

w

k
2
+16
E
2

3
t
G
2
+16
E
2
L
2

4
t
G
2
+

3
t
LG
2
+(1


t

)

2
t

1
k
(
v
t

1

w

)
k
2
+

2
t
N
X
k
=1
p
2
k
˙
2
k
+2

t

1
(1+

t

1
)(1


t

)
k
v
t

w

kk
v
t

1

w

k

E
(1


t
)(1+

t

1
)
2
k
v
t

w

k
2
+20
E
2

3
t
G
2
+(1


t

)

2
t

1
k
(
v
t

1

w

)
k
2
+

2
t
1
N

max
˙
2
+2

t

1
(1+

t

1
)(1


t

)
k
v
t

w

kk
v
t

1

w

k
where
˙
2
=
P
k
p
k
˙
2
k
,and

t
satis˝es

t

1
5
.Weshownextthat
E
k
v
t

w

k
2
=
O
(
1
tN
+
E
2
t
2
)
byinduction.
Assumethatwehaveshown
E
k
y
t

w

k
2

b
(
C
2
t
+
D
t
)
foralliterationsuntil
t
,where
C
=20
E
2
LG
2
,
D
=
1
N

2
max
˙
2
,and
b
istobechosenlater.For
stepsizeswechoose

t
=
6

1
t
+

and

t

1
=
3
14(
t
+

)(1

6
t
+

)max
f

1
g
where

=
max
f
32
E
g
,so
that

t

1


t
and
(1


t
)(1+14

t

1
)

(1

6
t
+

)(1+
3
(
t
+

)(1

6
t
+

)
)
=1

6
t
+

+
3
t
+

=1

3
t
+

=1


t
2
Recallthatwealsorequire

t
0

2

t
forany
t

t
0

E

1
,

t

1
5
,and
2

2
t

1
+2

2
t

1
=
2
,
whichwecanalsochecktoholdbyde˝nitionof

t
and

t
.
Moreover,
E
k
y
t

w

k
2

b
(
C
2
t
+
D
t
)
withthechosenstepsizesalsoimplies
k
v
t

1

238
w

k
2
k
v
t

w

k
.Thereforetheboundfor
E
k
v
t
+1

w

k
2
canbefurthersimpli˝edwith
2

t

1
(1+

t

1
)(1


t

)
k
v
t

w

kk
v
t

1

w

k
4

t

1
(1+

t

1
)(1


t

)
k
v
t

w

k
2
and
(1


t

)

2
t

1
k
(
v
t

1

w

)
k
2

4(1


t

)

2
t

1
k
(
v
t

w

)
k
2
sothat
E
k
v
t
+1

w

k
2

(1


t
)((1+

t

1
)
2
+4

t

1
(1+

t

1
)+4

2
t

1
)
E
k
(
v
t

w

)
k
2
+20
E
2

3
t
G
2
+

2
t
1
N

max
˙
2

E
(1


t
)(1+14

t

1
)
k
(
v
t

w

)
k
2
+20
E
2

3
t
G
2
+

2
t
1
N

max
˙
2

b
(1


t
2
)(
C
2
t
+
D
t
)+
C
3
t
+
D
2
t
=(
b
(1


t
2
)+

t
)

2
t
C
+(
b
(1


t
2
)+

t
)

t
D
andsoitremainstochoose
b
suchthat
(
b
(1


t
2
)+

t
)

t


t
+1
(
b
(1


t
2
)+

t
)

2
t


2
t
+1
fromwhichwecanconclude
E
k
v
t
+1

w

k
2


2
t
+1
C
+

t
+1
D
.
239
With
b
=
6

,wehave
(
b
(1


t
2
)+

t
)

t
=(
b
(1

(
3
t
+

)+
6

(
t
+

)
)
6

(
t
+

)
=(
b
t
+


3
t
+

+
6

(
t
+

)
)
6

(
t
+

)

b
(
t
+


1
t
+

)
6

(
t
+

)

b
6

(
t
+

+1)
=

t
+1
wherewehaveused
t
+


1
(
t
+

)
2

1
t
+

+1
.
Similarly
(
b
(1


t
2
)+

t
)

2
t
=(
b
(1

(
3
t
+

)+
6

(
t
+

)
)(
6

(
t
+

)
)
2
=(
b
t
+


3
t
+

+
6

(
t
+

)
)(
6

(
t
+

)
)
2
=
b
(
t
+


2
t
+

)(
6

(
t
+

)
)
2

b
36

2
(
t
+

+1)
2
=

2
t
+1
wherewehaveused
t
+


2
(
t
+

)
3

1
(
t
+

+1)
2
.
Finally,toensure
k
v
0

w

k
2

b
(
C
2
0
+
D
0
)
,wecanrescale
b
by
c
k
v
0

w

k
2
forsome
c:
Itfollowsthat
E
k
v
t

w

k
2

b
(
C
2
t
+
D
t
)
forall
t
.Thus
E
(
F
(
w
T
))

F

=
E
(
F
(
w
T
)

F
(
w

))

L
2
E
k
w
T

w

k
2

L
2
c
k
w
0

w

k
2
6

(
D
T
+
C
2
T
)
=3
c
k
w
0

w

k
2

(
D
T
+
C
2
T
)
240

3
c
k
w
0

w

k
2


6

(
T
+

)

1
N

max
˙
2
+20
E
2
LG
2

(
6

(
T
+

)
)
2

=
O
(


1
N

max
˙
2

1
T
+

2

E
2
G
2

1
T
2
)
Withpartialparticipation,thesameargumentintheSGDcaseyields
E
F
(
w
T
)

F

=
O
(

max
˙
2

NT
+

2
G
2

KT
+

2
E
2
G
2

T
2
)
ConvexSmoothObjectives
Theorem17.
Setlearningrates

t
=

t
=
O
(
q
N
T
)
.ThenunderAssumptions7,9,10
NesterovacceleratedFedAvgwithfulldeviceparticipationhasrate
min
t

T
F
(
w
t
)

F

=
O


max
˙
2
p
NT
+
NE
2
LG
2
T

;
andwithpartialdeviceparticipationwith
K
sampleddevicesateachcommunicationround,
min
t

T
F
(
w
t
)

F

=
O


max
˙
2
p
KT
+
E
2
G
2
p
KT
+
KE
2
LG
2
T

:
Proof.
De˝ne
p
t
:=

t
1


t
[
w
t

w
t

1
+

t
g
t

1
]
=

2
t
1


t
(
v
t

v
t

1
)
for
t

1
and0for
t
=0
.
Wecancheckthat
w
t
+1
+
p
t
+1
=
w
t
+
p
t


t
1


t
g
t
241
Nowwede˝ne
z
t
:=
w
t
+
p
t
and

t
=

t
1


t
forall
t
,sothatwehavetherecursiverelation
z
t
+1
=
z
t


t
g
t
Now
k
z
t
+1

w

k
2
=
k
(
z
t


t
g
t
)

w

k
2
=
k
(
z
t


t
g
t

w

)


t
(
g
t

g
t
)
k
2
=
A
1
+
A
2
+
A
3
where
A
1
=
k
z
t

w


t
g
t
k
2
A
2
=2

t
h
z
t

w


t
g
t
;
g
t

g
t
i
A
3
=

2
t
k
g
t

g
t
k
2
whereagain
E
A
2
=0
and
E
A
3


2
t
P
k
p
2
k
˙
2
k
.For
A
1
wehave
k
z
t

w


t
g
t
k
2
=
k
z
t

w

k
2
+2
h
z
t

w

;


t
g
t
i
+
k

t
g
t
k
2
Usingtheconvexityand
L
-smoothnessof
F
k
,

2

t
h
z
t

w

;
g
t
i
=

2

t
N
X
k
=1
p
k
h
z
t

w

;
r
F
k
(
w
k
t
)
i
242
=

2

t
N
X
k
=1
p
k
h
z
t

w
k
t
;
r
F
k
(
w
k
t
)
i
2

t
N
X
k
=1
p
k
h
w
k
t

w

;
r
F
k
(
w
k
t
)
i
=

2

t
N
X
k
=1
p
k
h
z
t

w
t
;
r
F
k
(
w
k
t
)
i
2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i

2

t
N
X
k
=1
p
k
h
w
k
t

w

;
r
F
k
(
w
k
t
)
i

2

t
N
X
k
=1
p
k
h
z
t

w
t
;
r
F
k
(
w
k
t
)
i
2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i
+2

t
N
X
k
=1
p
k
(
F
k
(
w

)

F
k
(
w
k
t
))

2

t
N
X
k
=1
p
k

F
k
(
w
k
t
)

F
k
(
w
t
)+
L
2
k
w
t

w
k
t
k
2
+
F
k
(
w

)

F
k
(
w
k
t
)


2

t
N
X
k
=1
p
k
h
z
t

w
t
;
r
F
k
(
w
k
t
)
i
=

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+2

t
N
X
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]

2

t
N
X
k
=1
p
k
h
z
t

w
t
;
r
F
k
(
w
k
t
)
i
whichresultsin
E
k
w
t
+1

w

k
2

E
k
w
t

w

k
2
+

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+2

t
N
X
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]
+

2
t
k
g
t
k
2
+

2
t
N
X
k
=1
p
2
k
˙
2
k

2

t
N
X
k
=1
p
k
h
z
t

w
t
;
r
F
k
(
w
k
t
)
i
Asbefore,
k
g
t
k
2

2
L
2
P
k
p
k
k
w
k
t

w
t
k
2
+4
L
(
F
(
w
t
)

F
(
w

))
,sothat

2
t
k
g
t
k
2
+

t
N
X
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]

2
L
2

2
t
X
k
p
k
k
w
k
t

w
t
k
2
+

t
(1

4

t
L
)(
F
(
w

)

F
(
w
t
))
243

2
L
2

2
t
X
k
p
k
k
w
k
t

w
t
k
2
for

t

1
=
4
L
.Using
P
N
k
=1
p
k
k
w
t

w
k
t
k
2

16
E
2

2
t
G
2
and
P
N
k
=1
p
2
k
˙
2
k


max
1
N
˙
2
,it
followsthat
E
k
w
t
+1

w

k
2
+

t
(
F
(
w
t
)

F
(
w

))

E
k
w
t

w

k
2
+(

t
L
+2
L
2

2
t
)
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+

2
t
N
X
k
=1
p
2
k
˙
2
k

2

t
N
X
k
=1
p
k
h
z
t

w
t
;
r
F
k
(
w
k
t
)
i

E
k
w
t

w

k
2
+32
LE
2

2
t

t
G
2
+

2
t

max
1
N
˙
2

2

t
N
X
k
=1
p
k
h
z
t

w
t
;
r
F
k
(
w
k
t
)
i
if

t

1
2
L
.Itremainstobound
E
P
N
k
=1
p
k
h
z
t

w
t
;
r
F
k
(
w
k
t
)
i
.Recallthat
z
t

w
t
=

t
1


t
[
w
t

w
t

1
+

t
g
t

1
]
=

2
t
1


t
(
v
t

v
t

1
)
and
E
k
v
t

v
t

1
k
2

G
2
,
E
kr
F
k
(
w
k
t
)
k
2

G
2
.
Cauchy-Schwarzgives
E
N
X
k
=1
p
k
h
z
t

w
t
;
r
F
k
(
w
k
t
)
i
N
X
k
=1
p
k
q
E
k
z
t

w
t
k
2

q
E
kr
F
k
(
w
k
t
)
k
2


2
t
1


t
G
2
Thus
E
k
w
t
+1

w

k
2
+

t
(
F
(
w
t
)

F
(
w

))
244

E
k
w
t

w

k
2
+32
LE
2

2
t

t
G
2
+

2
t

max
1
N
˙
2
+2

t

2
t
1


t
G
2
Summingtheinequalitiesfrom
t
=0
to
t
=
T
,weobtain
T
X
t
=0

t
(
F
(
w
t
)

F
(
w

))
k
w
0

w

k
2
+
T
X
t
=0

2
t

1
N

max
˙
2
+
T
X
t
=0

t

2
t

32
LE
2
G
2
+
T
X
t
=0
2

t

2
t
1


t
G
2
sothat
min
t

T
F
(
w
t
)

F
(
w

)

1
P
T
t
=0

t
0
@
k
w
0

w

k
2
+
T
X
t
=0

2
t

1
N

max
˙
2
+
T
X
t
=0

t

2
t

32
LE
2
G
2
+
T
X
t
=0
2

t

2
t
1


t
G
2
1
A
Bysettingtheconstantlearningrates

t

q
N
T
and

t

c
q
N
T
sothat

t
=

t
1


t
=
q
N
T
1

c
q
N
T

2
q
N
T
,wehave
min
t

T
F
(
w
t
)

F
(
w

)

1
2
p
NT
k
w
0

w

k
2
+
2
p
NT
T

N
T

1
N

max
˙
2
+
1
p
NT
T
(
r
N
T
)
3
32
LE
2
G
2
+
2
p
NT
T
(
r
N
T
)
3
G
2
=(
1
2
k
w
0

w

k
2
+2

max
˙
2
)
1
p
NT
+
N
T
(32
LE
2
G
2
+2
G
2
)
=
O
(

max
˙
2
p
NT
+
NE
2
LG
2
T
)
245
Similarly,forpartialparticipation,wehave
min
t

T
F
(
w
t
)

F
(
w

)

1
P
T
t
=0

t
0
@
k
w
0

w

k
2
+
T
X
t
=0

2
t

(
1
N

max
˙
2
+
C
)+
T
X
t
=0

3
t

6
E
2
LG
2
1
A
where
C
=
4
K
E
2
G
2
or
N

K
N

1
4
K
E
2
G
2
,sothatwith

t

q
K
T
and

t

c
q
K
T
,wehave
min
t

T
F
(
w
t
)

F
(
w

)=
O
(

max
˙
2
p
KT
+
E
2
G
2
p
KT
+
KE
2
LG
2
T
)
ProofofGeometricConvergenceResultsforOverparame-
terizedProblems
GeometricConvergenceofFedAvgforgeneralstronglyconvexand
smoothobjectives
Theorem18.
Fortheoverparameterizedsettingwithgeneralstronglyconvexandsmooth
objectives,FedAvgwithlocalSGDupdatesandcommunicationevery
E
iterationswithconstant
stepsize

=
1
2
E
N
l
max
+
L
(
N


min
)
givestheexponentialconvergenceguarantee
E
F
(
w
t
)

L
2
(1


)
t
k
w
0

w

k
2
=
O
(exp(


2
E
N
l
max
+
L
(
N


min
)
t
)
k
w
0

w

k
2
)
Proof.
Toillustratethemainideasoftheproof,we˝rstpresenttheprooffor
E
=2
.Let
246
t

1
beacommunicationround,sothat
w
k
t

1
=
w
t

1
.Weshowthat
k
w
t
+1

w

k
2

(1


t

)(1


t

1

)
k
w
t

1

w

k
2
forappropriatelychosenconstantstepsizes

t
;
t

1
.Wehave
k
w
t
+1

w

k
2
=
k
(
w
t


t
g
t
)

w

k
2
=
k
w
t

w

k
2

2

t
h
w
t

w

;
g
t
i
+

2
t
k
g
t
k
2
andthecrosstermcanbeboundedasusualusing

-convexityand
L
-smoothnessof
F
k
:

2

t
E
t
h
w
t

w

;
g
t
i
=

2

t
N
X
k
=1
p
k
h
w
t

w

;
r
F
k
(
w
k
t
)
i
=

2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i
2

t
N
X
k
=1
p
k
h
w
k
t

w

;
r
F
k
(
w
k
t
)
i

2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i
+2

t
N
X
k
=1
p
k
(
F
k
(
w

)

F
k
(
w
k
t
))


t

N
X
k
=1
p
k
k
w
k
t

w

k
2

2

t
N
X
k
=1
p
k

F
k
(
w
k
t
)

F
k
(
w
t
)+
L
2
k
w
t

w
k
t
k
2
+
F
k
(
w

)

F
k
(
w
k
t
)


t

k
N
X
k
=1
p
k
(
w
k
t

w

)
k
2
247
=

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+2

t
N
X
k
=1
p
k
[
F
k
(
w

)

F
k
(
w
t
)]


t

k
w
t

w

k
2
=

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2

2

t
N
X
k
=1
p
k
F
k
(
w
t
)


t

k
w
t

w

k
2
andso
E
k
w
t
+1

w

k
2

E
(1


t

)
k
w
t

w

k
2

2

t
F
(
w
t
)+

2
t
k
g
t
k
2
+

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
Applyingthisrecursiverelationto
k
w
t

w

k
2
andusing
k
w
t

1

w
k
t

1
k
2

0
,wefurther
obtain
E
k
w
t
+1

w

k
2

E
(1


t

)

(1


t

1

)
k
w
t

1

w

k
2

2

t

1
F
(
w
t

1
)+

2
t

1
k
g
t

1
k
2


2

t
F
(
w
t
)+

2
t
k
g
t
k
2
+

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
Nowinsteadofbounding
P
N
k
=1
p
k
k
w
t

w
k
t
k
2
usingtheargumentsinthegeneralconvexcase,
wefollow[
127
]andusethefactthatintheoverparameterizedsetting,
w

isaminimizerofeach
`
(
w
;x
j
k
)
andthateach
`
is
l
-smoothtoobtain
kr
F
k
(
w
t

1
;˘
k
t

1
)
k
2

2
l
(
F
k
(
w
t

1
;˘
k
t

1
)

F
k
(
w

;˘
k
t

1
))
,whererecall
F
k
(
w
;˘
k
t

1
)=
`
(
w
;˘
k
t

1
)
,sothat
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
=
N
X
k
=1
p
k
k
w
t

1


t

1
g
t

1

w
k
t

1
+

t

1
g
t

1
;k
k
2
=
N
X
k
=1
p
k

2
t

1
k
g
t

1

g
t

1
;k
k
2
=

2
t

1
N
X
k
=1
p
k
(
k
g
t

1
;k
k
2
k
g
t

1
k
2
)
248
=

2
t

1
N
X
k
=1
p
k
kr
F
k
(
w
t

1
;˘
k
t

1
)
k
2


2
t

1
k
g
t

1
k
2


2
t

1
N
X
k
=1
p
k
2
l
(
F
k
(
w
t

1
;˘
k
t

1
)

F
k
(
w

;˘
k
t

1
))


2
t

1
k
g
t

1
k
2
againusing
w
t

1
=
w
k
t

1
.Takingexpectationwithrespectto
˘
k
t

1
'sandusingthefactthat
F
(
w

)=0
,wehave
E
t

1
N
X
k
=1
p
k
k
w
t

w
k
t
k
2

2
l
2
t

1
N
X
k
=1
p
k
F
k
(
w
t

1
)


2
t

1
k
g
t

1
k
2
=2
l
2
t

1
F
(
w
t

1
)


2
t

1
k
g
t

1
k
2
Notealsothat
k
g
t

1
k
2
=
k
N
X
k
=1
p
k
r
F
k
(
w
t

1
;˘
k
t

1
)
k
2
while
k
g
t
k
2
=
k
N
X
k
=1
p
k
r
F
k
(
w
k
t
;˘
k
t
)
k
2

2
k
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2
+2
k
N
X
k
=1
p
k
(
r
F
k
(
w
t
;˘
k
t
)
r
F
k
(
w
k
t
;˘
k
t
))
k
2

2
k
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2
+2
N
X
k
=1
p
k
l
2
k
w
t

w
k
t
k
2
Substitutingtheseintotheboundfor
k
w
t
+1

w

k
2
,wehave
E
k
w
t
+1

w

k
2
249

E
(1


t

)((1


t

1

)
k
w
t

1

w

k
2

2

t

1
F
(
w
t

1
)+

2
t

1
k
g
t

1
k
2
)

2

t
F
(
w
t
)+2

2
t
k
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2
+

2
l
2

2
t

1

2
t
+

t

2
t

1
L

2
lF
(
w
t

1
)
k
g
t

1
k
2

E
k
w
t
+1

w

k
2

E
(1


t

)(1


t

1

)
k
w
t

1

w

k
2

2

t
(
F
(
w
t
)


t
k
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2
)

2

t

1
(1


t

)
0
@
(1

l
t

1
(2
l
2

2
t
+

t
L
)
1


t

)
F
(
w
t

1
)


t

1
2
k
N
X
k
=1
p
k
r
F
k
(
w
t

1
;˘
k
t

1
)
k
2
1
A
fromwhichwecanconcludethat
E
k
w
t
+1

w

k
2

(1


t

)(1


t

1

)
E
k
w
t

1

w

k
2
ifwecanchoose

t
;
t

1
toguarantee
E
(
F
(
w
t
)


t
k
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2
)

0
E
0
@
(1

l
t

1
(2
l
2

2
t
+

t
L
)
1


t

)
F
(
w
t

1
)


t

1
2
k
N
X
k
=1
p
k
r
F
k
(
w
t

1
;˘
k
t

1
)
k
2
1
A

0
250
Notethat
E
t
k
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2
=
E
t
h
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
;
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
i
=
N
X
k
=1
p
2
k
E
t
kr
F
k
(
w
t
;˘
k
t
)
k
2
+
N
X
k
=1
X
j
6
=
k
p
j
p
k
E
t
hr
F
k
(
w
t
;˘
k
t
)
;
r
F
j
(
w
t
;˘
j
t
)
i
=
N
X
k
=1
p
2
k
E
t
kr
F
k
(
w
t
;˘
k
t
)
k
2
+
N
X
k
=1
X
j
6
=
k
p
j
p
k
hr
F
k
(
w
t
)
;
r
F
j
(
w
t
)
i
=
N
X
k
=1
p
2
k
E
t
kr
F
k
(
w
t
;˘
k
t
)
k
2
+
N
X
k
=1
N
X
j
=1
p
j
p
k
hr
F
k
(
w
t
)
;
r
F
j
(
w
t
)
i

N
X
k
=1
p
2
k
kr
F
k
(
w
t
)
k
2
E
t
k
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2

N
X
k
=1
p
2
k
E
t
kr
F
k
(
w
t
;˘
k
t
)
k
2
+
k
X
k
p
k
r
F
k
(
w
t
)
k
2

1
N

min
k
X
k
p
k
r
F
k
(
w
t
)
k
2
=
N
X
k
=1
p
2
k
E
t
kr
F
k
(
w
t
;˘
k
t
)
k
2
+(1

1
N

min
)
kr
F
(
w
t
)
k
2
andsofollowing[
127
]ifwelet

t
=
min
f
qN
2
l
max
;
1

q
2
L
(1

1
N

min
)
g
fora
q
2
[0
;
1]
tobeoptimized
251
later,wehave
E
t
(
F
(
w
t
)


t
k
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2
)

E
t
N
X
k
=1
p
k
F
k
(
w
t
)


t
2
4
N
X
k
=1
p
2
k
E
t
kr
F
k
(
w
t
;˘
k
t
)
k
2
+(1

1
N

min
)
kr
F
(
w
t
)
k
2
3
5

E
t
N
X
k
=1
p
k
(
qF
k
(
w
t
;˘
k
t
)


t
1
N

max
kr
F
k
(
w
t
;˘
k
t
)
k
2
)
+((1

q
)
F
(
w
t
)


t
(1

1
N

min
)
kr
F
(
w
t
)
k
2
)

q
E
t
N
X
k
=1
p
k
(
F
k
(
w
t
;˘
k
t
)

1
2
l
kr
F
k
(
w
t
;˘
k
t
)
k
2
)+(1

q
)(
F
(
w
t
)

1
2
L
kr
F
(
w
t
)
k
2
)

0
againusing
w

optimizes
F
k
(
w
;˘
k
t
)
with
F
k
(
w

;˘
k
t
)=0
.
Maximizing

t
=
min
f
qN
2
l
max
;
1

q
2
L
(1

1
N

min
)
g
over
q
2
[0
;
1]
,weseethat
q
=
l
max
l
max
+
L
(
N


min
)
resultsinthefastestconvergence,andthistranslatesto

t
=
1
2
N
l
max
+
L
(
N


min
)
.Nextwe
claimthat

t

1
=
c
1
2
N
l
max
+
L
(
N


min
)
alsoguarantees
E
(1

l
t

1
(2
l
2

2
t
+

t
L
)
1


t

)
F
(
w
t

1
)


t

1
2
k
N
X
k
=1
p
k
r
F
k
(
w
t

1
;˘
k
t

1
)
k
2

0
Notethatbyscaling

t

1
byaconstant
c

1
ifnecessary,wecanguarantee
l
t

1
(2
l
2

2
t
+

t
L
)
1


t


1
2
,andsotheconditionisequivalentto
F
(
w
t

1
)


t

1
k
N
X
k
=1
p
k
r
F
k
(
w
t

1
;˘
k
t

1
)
k
2

0
whichwasshowntoholdwith

t

1

1
2
N
l
max
+
L
(
N


min
)
.
252
Fortheproofofgeneral
E

2
,weusethefollowingtwoidentities:
k
g
t
k
2

2
k
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2
+2
N
X
k
=1
p
k
l
2
k
w
t

w
k
t
k
2
E
N
X
k
=1
p
k
k
w
t

w
k
t
k
2

E
2(1+2
l
2

2
t

1
)
N
X
k
=1
p
k
k
w
t

1

w
k
t

1
k
2
+8

2
t

1
lF
(
w
t

1
)

2

2
t

1
k
g
t

1
k
2
wherethe˝rstinequalityhasbeenestablishedbefore.Toestablishthesecondinequality,
notethat
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
=
N
X
k
=1
p
k
k
w
t

1


t

1
g
t

1

w
k
t

1
+

t

1
g
t

1
;k
k
2

2
N
X
k
=1
p
k

k
w
t

1

w
k
t

1
k
2
+
k

t

1
g
t

1


t

1
g
t

1
;k
k
2

and
X
k
p
k
k
g
t

1
;k

g
t

1
k
2
=
X
k
p
k
(
k
g
t

1
;k
k
2
k
g
t

1
k
2
)
=
X
k
p
k
kr
F
k
(
w
t

1
;˘
k
t

1
)+
r
F
k
(
w
k
t

1
;˘
k
t

1
)
r
F
k
(
w
t

1
;˘
k
t

1
)
k
2
k
g
t

1
k
2

2
X
k
p
k

kr
F
k
(
w
t

1
;˘
k
t

1
)
k
2
+
l
2
k
w
k
t

1

w
t

1
k
2

k
g
t

1
k
2
253
sothatusingthe
l
-smoothnessof
`
,
E
N
X
k
=1
p
k
k
w
t

w
k
t
k
2

E
2(1+2
l
2

2
t

1
)
N
X
k
=1
p
k
k
w
t

1

w
k
t

1
k
2
+4

2
t

1
X
k
p
k
kr
F
k
(
w
t

1
;˘
k
t

1
)
k
2

2

2
t

1
k
g
t

1
k
2

E
2(1+2
l
2

2
t

1
)
N
X
k
=1
p
k
k
w
t

1

w
k
t

1
k
2
+4

2
t

1
2
l
X
k
p
k
(
F
k
(
w
t

1
;˘
k
t

1
)

F
k
(
w

;˘
k
t

1
))

2

2
t

1
k
g
t

1
k
2
=
E
2(1+2
l
2

2
t

1
)
N
X
k
=1
p
k
k
w
t

1

w
k
t

1
k
2
+8

2
t

1
lF
(
w
t

1
)

2

2
t

1
k
g
t

1
k
2
Usingthe˝rstinequality,wehave
E
k
w
t
+1

w

k
2

E
(1


t

)
k
w
t

w

k
2

2

t
F
(
w
t
)+2

2
t
k
N
X
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2
+(2

2
t
l
2
+

t
L
)
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
andwechoose

t
and

t

1
suchthat
E
(
F
(
w
t
)


t
k
P
N
k
=1
p
k
r
F
k
(
w
t
;˘
k
t
)
k
2
)

0
and
(2

2
t
l
2
+

t
L
)

(1


t

)(2

2
t

1
l
2
+

t

1
L
)
=
3
.Thisgives
E
k
w
t
+1

w

k
2

E
(1


t

)[(1


t

1

)
k
w
t

1

w

k
2

2

t

1
F
(
w
t

1
)
+2

2
t

1
k
N
X
k
=1
p
k
r
F
k
(
w
t

1
;˘
k
t

1
)
k
2
254
+(2

2
t

1
l
2
+

t

1
L
)(
N
X
k
=1
p
k
k
w
t

1

w
k
t

1
k
2
+
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
)
=
3]
Usingthesecondinequality
N
X
k
=1
p
k
k
w
t

w
k
t
k
2

E
2(1+2
l
2

2
t

1
)
N
X
k
=1
p
k
k
w
t

1

w
k
t

1
k
2
+8

2
t

1
lF
(
w
t

1
)

2

2
t

1
k
g
t

1
k
2
andthat
2(1+2
l
2

2
t

1
)

3
,
2

2
t

1
l
2
+

t

1
L

1
,wehave
E
k
w
t
+1

w

k
2

E
(1


t

)[(1


t

1

)
k
w
t

1

w

k
2

2

t

1
F
(
w
t

1
)+2

2
t

1
k
N
X
k
=1
p
k
r
F
k
(
w
t

1
;˘
k
t

1
)
k
2
+8

2
t

1
lF
(
w
t

1
)
+(2

2
t

1
l
2
+

t

1
L
)(2
N
X
k
=1
p
k
k
w
t

1

w
k
t

1
k
2
)]
andif

t

1
ischosensuchthat
(
F
(
w
t

1
)

4

t

1
lF
(
w
t

1
))


t

1
k
N
X
k
=1
p
k
r
F
k
(
w
t

1
;˘
k
t

1
)
k
2

0
and
(2

2
t

1
l
2
+

t

1
L
)(1


t

1

)

(2

2
t

2
l
2
+

t

2
L
)
=
3
255
weagainhave
E
k
w
t
+1

w

k
2

E
(1


t

)(1


t

1

)[
k
w
t

1

w

k
2
+(2

2
t

2
l
2
+

t

2
L
)

(2
N
X
k
=1
p
k
k
w
t

1

w
k
t

1
k
2
)
=
3]
Applyingtheabovederivationiteratively
˝<E
times,wehave
E
k
w
t
+1

w

k
2

E
(1


t

)

(1


t

˝
+1

)[(1


t

˝

)
k
w
t

˝

w

k
2

2

t

˝
F
(
w
t

˝
)+2

2
t

˝
k
N
X
k
=1
p
k
r
F
k
(
w
t

˝
;˘
k
t

˝
)
k
2
+8
˝
2
t

˝
lF
(
w
t

˝
)
+(2

2
t

˝
l
2
+

t

˝
L
)((
˝
+1)
N
X
k
=1
p
k
k
w
t

˝

w
k
t

˝
k
2
)]
aslongasthestepsizes

t

˝
arechosensuchthatthefollowinginequalitieshold
(2

2
t

˝
l
2
+

t

˝
L
)(1


t

˝

)

(2

2
t

˝

1
l
2
+

t

˝

1
L
)
=
3
2(1+2
l
2

2
t

˝
)

3
2

2
t

˝
l
2
+

t

˝
L

1
(
F
(
w
t

˝
)

4
˝
t

˝
lF
(
w
t

˝
))


t

˝
k
N
X
k
=1
p
k
r
F
k
(
w
t

˝
;˘
k
t

˝
)
k
2

0
Wecancheckthatsetting

t

˝
=
c
1
˝
+1
N
l
max
+
L
(
N


min
)
forsomesmallconstant
c
satis˝es
therequirements.
Sincecommunicationisdoneevery
E
iterations,
w
t
0
=
w
k
t
0
forsome
t
0
>t

E
,from
256
whichwecanconcludethat
E
k
w
t

w

k
2

(
t

t
0

1
Y
˝
=1
(1


t

˝
))
k
w
t
0

w

k
2

(1

c

E
N
l
max
+
L
(
N


min
)
)
t

t
0
k
w
t
0

w

k
2
andapplyingthisinequalitytoiterationsbetweeneachcommunicationround,
E
k
w
t

w

k
2

(1

c

E
N
l
max
+
L
(
N


min
)
)
t
k
w
0

w

k
2
=
O
(exp(

E
N
l
max
+
L
(
N


min
)
t
))
k
w
0

w

k
2
Withpartialparticipation,wenotethat
E
k
w
t
+1

w

k
2
=
E
k
w
t
+1

v
t
+1
+
v
t
+1

w

k
2
=
E
k
w
t
+1

v
t
+1
k
2
+
E
k
v
t
+1

w

k
2
=
1
K
X
k
p
k
E
k
w
k
t
+1

w
t
+1
k
2
+
E
k
v
t
+1

w

k
2
andsotherecursiveidentitybecomes
E
k
w
t
+1

w

k
2

E
(1


t

)

(1


t

˝
+1

)[(1


t

˝

)
k
w
t

˝

w

k
2

2

t

˝
F
(
w
t

˝
)+2

2
t

˝
k
N
X
k
=1
p
k
r
F
k
(
w
t

˝
;˘
k
t

˝
)
k
2
+8
˝
2
t

˝
lF
(
w
t

˝
)
+(2

2
t

˝
l
2
+

t

˝
L
+
1
K
)((
˝
+1)
N
X
k
=1
p
k
k
w
t

˝

w
k
t

˝
k
2
)]
257
whichrequires
(2

2
t

˝
l
2
+

t

˝
L
+
1
K
)(1


t

˝

)

(2

2
t

˝

1
l
2
+

t

˝

1
L
+
1
K
)
=
3
2(1+2
l
2

2
t

˝
)

3
2

2
t

˝
l
2
+

t

˝
L
+
1
K

1
(
F
(
w
t

˝
)

4
˝
t

˝
lF
(
w
t

˝
))


t

˝
k
N
X
k
=1
p
k
r
F
k
(
w
t

˝
;˘
k
t

˝
)
k
2

0
tohold.Againsetting

t

˝
=
c
1
˝
+1
N
l
max
+
L
(
N


min
)
forapossiblydi˙erentconstantfrom
beforesatis˝estherequirements.
Finally,usingthe
L
-smoothnessof
F
,
F
(
w
T
)

F
(
w

)

L
2
E
k
w
T

w

k
2
=
O
(
L
exp(


E
N
l
max
+
L
(
N


min
)
T
))
k
w
0

w

k
2
GeometricConvergenceofFedAvgforOverparameterizedLinearRe-
gression
We˝rstprovidedetailsonquantitiesusedintheproofofresultsonlinearregressionin
Section6.5inthemaintext.Thelocaldeviceobjectivesarenowgivenbythesumof
squares
F
k
(
w
)=
1
2
n
k
P
n
k
j
=1
(
w
T
x
j
k

z
j
k
)
2
,andthereexists
w

suchthat
F
(
w

)

0
.De˝ne
thelocalHessianmatrixas
H
k
:=
1
n
k
P
n
k
j
=1
x
j
k
(
x
j
k
)
T
,andthestochasticHessianmatrix
as
~
H
k
t
:=
˘
k
t
(
˘
k
t
)
T
,where
˘
k
t
isthestochasticsampleonthe
k
thdeviceattime
t
.De˝ne
l
tobethesmallestpositivenumbersuchthat
E
k
˘
k
t
k
2
˘
k
t
(
˘
k
t
)
T

l
H
k
forall
k
.Notethat
258
l

max
k;j
k
x
j
k
k
2
.Let
L
and

belowerandupperboundsofnon-zeroeigenvaluesof
H
k
.
De˝ne

1
:=
l
and

:=

.
Following[
119
,
87
],wede˝nethestatisticalconditionnumber
~

asthesmallestpositive
realnumbersuchthat
E
P
k
p
k
~
H
k
t
H

1
~
H
k
t

~

H
.Theconditionnumbers

1
and
~

are
importantinthecharacterizationofconvergenceratesforFedAvgalgorithms.Notethat

1
>
and

1
>
~

.
Let
H
=
P
k
p
k
H
k
.Ingeneral
H
haszeroeigenvalues.However,becausethenullspaceof
H
andrangeof
H
areorthogonal,inoursubsequenceanalysisitsu˚cestoproject
w
t

w

ontotherangeof
H
,thuswemayrestricttothenon-zeroeigenvalueof
H
.
Ausefulobservationisthatwecanuse
w

T
x
j
k

z
j
k

0
torewritethelocalobjectivesas
F
k
(
w
)=
1
2
h
w

w

;
H
k
(
w

w

)
i
1
2
k
w

w

k
2
H
k
:
F
k
(
w
)=
1
2
n
k
n
k
X
j
=1
(
w
T
x
k;j

z
k;j

(
w

T
x
k;j

z
k;j
))
2
=
1
2
n
k
n
k
X
j
=1
((
w

w

)
T
x
k;j
)
2
=
1
2
h
w

w

;
H
k
(
w

w

)
i
=
1
2
k
w

w

k
2
H
k
sothat
F
(
w
)=
1
2
k
w

w

k
2
H
.
Finally,notethat
E
~
H
k
t
=
1
n
k
P
n
k
j
=1
x
j
k
(
x
j
k
)
T
=
H
k
and
g
t;k
=
~
H
k
t
(
w
k
t

w

)
while
g
t
=
P
N
k
=1
p
k
r
F
k
(
w
k
t
;˘
k
t
)=
P
N
k
=1
p
k
~
H
k
t
(
w
k
t

w

)
and
g
t
=
P
N
k
=1
p
k
H
k
(
w
k
t

w

)
Theorem19.
Fortheoverparamterizedlinearregressionproblem,FedAvgwithcommuni-
cationevery
E
iterationswithconstantstepsize

=
O
(
1
E
N
l
max
+

(
N


min
)
)
hasgeometric
259
convergence:
E
F
(
w
T
)
O

L
exp(

NT
E
(

max

1
+(
N


min
))
)
k
w
0

w

k
2

:
Proof.
Weagainshowtheresult˝rstwhen
E
=2
and
t

1
isacommunicationround.We
have
k
w
t
+1

w

k
2
=
k
(
w
t


t
g
t
)

w

k
2
=
k
w
t

w

k
2

2

t
h
w
t

w

;
g
t
i
+

2
t
k
g
t
k
2
and

2

t
E
t
h
w
t

w

;
g
t
i
=

2

t
N
X
k
=1
p
k
h
w
t

w

;
r
F
k
(
w
k
t
)
i
=

2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i
2

t
N
X
k
=1
p
k
h
w
k
t

w

;
r
F
k
(
w
k
t
)
i
=

2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i
2

t
N
X
k
=1
p
k
h
w
k
t

w

;
H
k
(
w
k
t

w

)
i
=

2

t
N
X
k
=1
p
k
h
w
t

w
k
t
;
r
F
k
(
w
k
t
)
i
4

t
N
X
k
=1
p
k
F
k
(
w
k
t
)

2

t
N
X
k
=1
p
k
(
F
k
(
w
k
t
)

F
k
(
w
t
)+
L
2
k
w
t

w
k
t
k
2
)

4

t
N
X
k
=1
p
k
F
k
(
w
k
t
)
=

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2

2

t
N
X
k
=1
p
k
F
k
(
w
t
)

2

t
N
X
k
=1
p
k
F
k
(
w
k
t
)
=

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2


t
N
X
k
=1
p
k
h
(
w
t

w

)
;
H
k
(
w
t

w

)
i
2

t
N
X
k
=1
p
k
F
k
(
w
k
t
)
260
and
k
g
t
k
2
=
k
N
X
k
=1
p
k
~
H
k
t
(
w
k
t

w

)
k
2
=
k
N
X
k
=1
p
k
~
H
k
t
(
w
t

w

)+
N
X
k
=1
p
k
~
H
k
t
(
w
k
t

w
t
)
k
2

2
k
N
X
k
=1
p
k
~
H
k
t
(
w
t

w

)
k
2
+2
k
N
X
k
=1
p
k
~
H
k
t
(
w
k
t

w
t
)
k
2
whichgives
E
k
w
t
+1

w

k
2

E
k
w
t

w

k
2


t
N
X
k
=1
p
k
h
w
t

w

;
H
k
w
t

w

i
+2

2
t
k
N
X
k
=1
p
k
~
H
k
t
(
w
t

w

)
k
2
+

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+2

2
t
k
N
X
k
=1
p
k
~
H
k
t
(
w
k
t

w
t
)
k
2

2

t
N
X
k
=1
p
k
F
k
(
w
k
t
)
following[127]we˝rstprovethat
E
k
w
t

w

k
2


t
N
X
k
=1
p
k
h
(
w
t

w

)
;
H
k
(
w
t

w

)
i
+2

2
t
k
N
X
k
=1
p
k
~
H
k
t
(
w
t

w

)
k
2

(1

N
8(

max

1
+(
N


min
))
)
E
k
w
t

w

k
2
withappropriatelychosen

t
.Comparedtotherate
O
(
N

max

1
+(
N


min
)

)
forgeneral
stronglyconvexandsmoothobjectives,thisisanimprovementaslinearspeedupisnow
availableforalargerrangeof
N
.
261
Wehave
E
t
k
N
X
k
=1
p
k
~
H
k
t
(
w
t

w

)
k
2
=
E
t
h
N
X
k
=1
p
k
~
H
k
t
(
w
t

w

)
;
N
X
k
=1
p
k
~
H
k
t
(
w
t

w

)
i
=
N
X
k
=1
p
2
k
E
t
k
~
H
k
t
(
w
t

w

)
k
2
+
N
X
k
=1
X
j
6
=
k
p
j
p
k
E
t
h
~
H
k
t
(
w
t

w

)
;
~
H
j
t
(
w
t

w

)
i
=
N
X
k
=1
p
2
k
E
t
k
~
H
k
t
(
w
t

w

)
k
2
+
N
X
k
=1
X
j
6
=
k
p
j
p
k
E
t
h
H
k
(
w
t

w

)
;
H
j
(
w
t

w

)
i
=
N
X
k
=1
p
2
k
E
t
k
~
H
k
t
(
w
t

w

)
k
2
+
N
X
k
=1
N
X
j
=1
p
j
p
k
E
t
h
H
k
(
w
t

w

)
;
H
j
(
w
t

w

)
i

N
X
k
=1
p
2
k
k
H
k
(
w
t

w

)
k
2
=
N
X
k
=1
p
2
k
E
t
k
~
H
k
t
(
w
t

w

)
k
2
+
k
X
k
p
k
H
k
(
w
t

w

)
k
2

N
X
k
=1
p
2
k
k
H
k
(
w
t

w

)
k
2

N
X
k
=1
p
2
k
E
t
k
~
H
k
t
(
w
t

w

)
k
2
+
k
X
k
p
k
H
k
(
w
t

w

)
k
2

1
N

min
k
X
k
p
k
H
k
(
w
t

w

)
k
2

1
N

max
N
X
k
=1
p
k
E
t
k
~
H
k
t
(
w
t

w

)
k
2
+(1

1
N

min
)
k
X
k
p
k
H
k
(
w
t

w

)
k
2

1
N

max
l
N
X
k
=1
p
k
h
(
w
t

w

)
;
H
k
(
w
t

w

)
i
+(1

1
N

min
)
k
X
k
p
k
H
k
(
w
t

w

)
k
2
=
1
N

max
l
h
(
w
t

w

)
;
H
(
w
t

w

)
i
+(1

1
N

min
)
h
w
t

w

;
H
2
(
w
t

w

)
i
using
k
~
H
k
t
k
l
.
Nowwehave
E
k
w
t

w

k
2


t
N
X
k
=1
p
k
h
(
w
t

w

)
;
H
k
(
w
t

w

)
i
+2

2
t
k
N
X
k
=1
p
k
~
H
k
t
(
w
t

w

)
k
2
=
262
h
w
t

w

;
(
I


t
H
+2

2
t
(

max
l
N
H
+
N


min
N
H
2
))(
w
t

w

)
i
anditremainstoboundthemaximumeigenvalueof
(
I


t
H
+2

2
t
(

max
l
N
H
+
N


min
N
H
2
))
andweboundthisfollowing[127].Ifwechoose

t
<
N
2(

max
l
+(
N


min
)
L
)
,then


t
H
+2

2
t
(

max
l
N
H
+
N


min
N
H
2
)
˚
0
andtheconvergencerateisgivenbythemaximumof
1


t

+2

2
t
(

max
l
N

+
N


min
N

2
)
maximizedoverthenon-zeroeigenvalues

of
H
.Toselectthestepsize

t
thatgivesthe
smallestupperbound,wethenminimizeover

t
,resultingin
min

t
<
N
2(

max
l
+(
N


min
)
L
)
max

0:
9
v;
H
v
=

ˆ
1


t

+2

2
t
(

max
l
N

+
N


min
N

2
)
˙
Sincetheobjectiveisquadraticin

,themaximumisachievedateitherthelargesteigenvalue

max
of
H
orthesmallestnon-zeroeigenvalue

min
of
H
.
When
N

4

max
l
L


min
+4

min
,i.e.when
N
=
O
(
l
min
)=
O
(

1
)
,theoptimalobjective
valueisachievedat

min
andtheoptimalstepsizeisgivenby

t
=
N
4(

max
l
+(
N


min
)

min
)
.
Theoptimalconvergencerate(i.e.theoptimalobjectivevalue)isequalto
1

1
8
N
min
(

max
l
+(
N


min
)

min
)
=1

1
8
N
(

max

1
+(
N


min
))
:
Thisimpliesthatwhen
N
=
O
(

1
)
,theoptimalconvergenceratehasalinearspeedupin
N
.
263
When
N
islarger,thisstepsizeisnolongeroptimal,butwestillhave
1

1
8
N
(

max

1
+(
N


min
))
asanupperboundontheconvergencerate.
Nowwehaveproved
E
k
w
t
+1

w

k
2

(1

1
8
N
(

max

1
+(
N


min
))
)
E
k
w
t

w

k
2
+

t
L
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
+2

2
t
k
N
X
k
=1
p
k
~
H
k
t
(
w
k
t

w
t
)
k
2

2

t
N
X
k
=1
p
k
F
k
(
w
k
t
)
Nextweboundtermsinthesecondlineusingasimilarargumentasthegeneralcase.We
have
2

2
t
k
N
X
k
=1
p
k
~
H
k
t
(
w
k
t

w
t
)
k
2

2

2
t
l
2
N
X
k
=1
p
k
k
w
t

w
k
t
k
2
and
E
N
X
k
=1
p
k
k
w
t

w
k
t
k
2

E
2(1+2
l
2

2
t

1
)
N
X
k
=1
p
k
k
w
t

1

w
k
t

1
k
2
+8

2
t

1
lF
(
w
t

1
)
=4

2
t

1
l
h
w
t

1

w

;
H
(
w
t

1

w

)
i
andif

t
;
t

1
satisfy

t
L
+2

2
t

(1

1
8
N
(

max

1
+(
N


min
))
)(

t

1
L
+2

2
t

1
)
=
3
2(1+2
l
2

2
t

1
)

3

t
L
+2

2
t

1
264
wehave
E
k
w
t
+1

w

k
2

(1

1
8
N
(

max

1
+(
N


min
))
)[
E
k
w
t

1

w

k
2


t
h
w
t

1

w

;
H
w
t

1

w

i
+2

2
t
k
N
X
k
=1
p
k
~
H
k
t
(
w
t

w

)
k
2
+(

t

1
L
+2

2
t

1
)

2
N
X
k
=1
p
k
k
w
t

1

w
k
t

1
k
2
+4

2
t

1
l
h
w
t

1

w

;
H
(
w
t

1

w

)
i
]
andagainbychoosing

t

1
=
c
N
8(

max
l
+(
N


min
)

min
)
forasmallconstant
c
,wecan
guaranteethat
E
k
w
t

1

w

k
2


t

1
h
w
t

1

w

;
H
w
t

1

w

i
+2

2
t

1
k
N
X
k
=1
p
k
~
H
k
t

1
(
w
t

1

w

)
k
2
+4

2
t

1
l
h
w
t

1

w

;
H
(
w
t

1

w

)
i

(1

c
N
16(

max
l
+(
N


min
)

min
)
)
E
k
w
t

1

w

k
2
Forgeneral
E
,wehavetherecursiverelation
E
k
w
t
+1

w

k
2

E
(1

c
1
8
N
(

max

1
+(
N


min
))
)

(1

c
1
8
˝
N
(

max

1
+(
N


min
))
)[
k
w
t

˝

w

k
2


t

˝
h
w
t

˝

w

;
H
w
t

˝

w

i
+2

2
t

˝
k
N
X
k
=1
p
k
~
H
k
t

˝
(
w
t

˝

w

)
k
2
+4
˝
2
t

1
l
h
w
t

1

w

;
H
(
w
t

1

w

)
i
+(2

2
t

˝
l
2
+

t

˝
L
)((
˝
+1)
N
X
k
=1
p
k
k
w
t

˝

w
k
t

˝
k
2
)]
265
aslongasthestepsizesarechosen

t

˝
=
c
N
4
˝
(

max
l
+(
N


min
)

min
)
suchthatthefollowing
inequalitieshold
(2

2
t

˝
l
2
+

t

˝
L
)

(1


t

˝

)(2

2
t

˝

1
l
2
+

t

˝

1
L
)
=
3
2(1+2
l
2

2
t

˝
)

3
2

2
t

˝
l
2
+

t

˝
L

1
and
k
w
t

˝

w

k
2


t

˝
h
w
t

˝

w

;
H
w
t

˝

w

i
+2

2
t

˝
k
N
X
k
=1
p
k
~
H
k
t

˝
(
w
t

˝

w

)
k
2
+4
˝
2
t

1
l
h
w
t

1

w

;
H
(
w
t

1

w

)
i

(1

c
N
8(
˝
+1)(

max

1
+(
N


min
))
)
E
k
w
t

˝

w

k
2
whichgives
E
k
w
t

w

k
2

(1

c
1
8
E
N
(

max

1
+(
N


min
))
)
t
k
w
0

w

k
2
=
O
(exp(

1
E
N
(

max

1
+(
N


min
))
t
))
k
w
0

w

k
2
andwithpartialparticipation,thesameboundholdswithapossiblydi˙erentchoiceof
c
.
266
GeometricConvergenceofFedMaSSforOverparameterizedLinear
Regression
Theorem20.
Fortheoverparamterizedlinearregressionproblem,FedMaSSwithcom-
municationevery
E
iterationsandconstantstepsizes

1
=
O
(
1
E
N
l
max
+

(
N


min
)
)
;

2
=

1
(1

1
~

)
1+
1
p

1
~

;

=
1

1
p

1
~

1+
1
p

1
~

hasgeometricconvergence:
E
F
(
w
T
)
O

L
exp(

NT
E
(

max
p

1
~

+(
N


min
))
)
k
w
0

w

k
2

:
Proof.
Theproofisbasedonresultsin[
119
]whichoriginallyproposedtheMaSSalgorithm.
Notethattheupdatecanequivalentlybewrittenas
v
k
t
+1
=(1


k
)
v
k
t
+

k
u
k
t


k
g
t;k
w
k
t
+1
=
8
>
>
>
<
>
>
>
:
u
k
t


k
g
t;k
if
t
+1
=
2I
E
P
N
k
=1
p
k
h
u
k
t


k
g
t;k
i
if
t
+1
2I
E
u
k
t
+1
=

k
1+

k
v
k
t
+1
+
1
1+

k
w
k
t
+1
wherethereisabijectionbetweentheparameters
1


k
1+

k
=

k
;
k
=

k
1
;

k


k

k
1+

k
=

k
2
,and
wefurtherintroduceanauxiliaryparameter
v
k
t
,whichisinitializedat
v
k
0
.Wealsonotethat
when

k
=

k

k
,theupdatereducestotheNesterovacceleratedSGD.Thisversionofthe
FedAvgalgorithmwithlocalMaSSupdatesisusedforanalyzingthegeometricconvergence.
Asbefore,de˝nethevirtualsequences
w
t
=
P
N
k
=1
p
k
w
k
t
,
v
t
=
P
N
k
=1
p
k
v
k
t
,
u
t
=
P
N
k
=1
p
k
u
k
t
,and
g
t
=
P
N
k
=1
p
k
E
g
t;k
.Wehave
E
g
t
=
g
t
and
w
t
+1
=
u
t


t
g
t
,
v
t
+1
=
267
(1


k
)
v
t
+

k
w
t


k
g
t
,and
u
t
+1
=

k
1+

k
v
t
+1
+
1
1+

k
w
t
+1
.
We˝rstprovethetheoremwith
E
=2
and
t

1
beingacommunicationround.Wehave
k
v
t
+1

w

k
2
H

1
=
k
(1


)
v
t
+

u
t


X
k
p
k
~
H
k
t
(
u
k
t

w

)

w

k
2
H

1
=
k
(1


)
v
t
+

u
t

w

k
2
H

1
+

2
k
X
k
p
k
~
H
k
t
(
u
k
t

w

)
k
2
H

1

2

h
X
k
p
k
~
H
k
t
(
u
k
t

w

)
;
(1


)
v
t
+

u
t

w

i
H

1
k
(1


)
v
t
+

u
t

w

k
2
H

1
|
{z
}
A
+2

2
k
X
k
p
k
~
H
k
t
(
u
t

w

)
k
2
H

1
|
{z
}
B
+2

2
k
X
k
p
k
~
H
k
t
(
u
t

u
k
t
)
k
2
H

1

2

h
X
k
p
k
~
H
k
t
(
u
t

w

)
;
(1


)
v
t
+

u
t

w

i
H

1
|
{z
}
C

2

h
X
k
p
k
~
H
k
t
(
u
k
t

u
t
)
;
(1


)
v
t
+

u
t

w

i
H

1
Followingtheproofin[119],
E
A

E
(1


)
k
v
t

w

k
2
H

1
+

k
u
t

w

k
2
H

1

E
(1


)
k
v
t

w

k
2
H

1
+


k
u
t

w

k
2
usingtheconvexityofthenorm
kk
H

1
andthat

isthesmallestnon-zeroeigenvalueof
H
.
Now
E
B

2

2
(

max
1
N
~

+
N


min
N
)
k
(
u
t

w

)
k
2
H
268
usingthefolowingbound:
E
 
X
k
p
k
~
H
k
t
!
H

1
 
X
k
p
k
~
H
k
t
!
=
E
X
k
p
2
k
~
H
k
t
H

1
~
H
k
t
+
X
k
6
=
j
p
k
p
j
~
H
k
t
H

1
~
H
j
t


max
1
N
E
X
k
p
k
~
H
k
t
H

1
~
H
k
t
+
X
k
6
=
j
p
k
p
j
H
k
H

1
H
j
=

max
1
N
E
X
k
p
k
~
H
k
t
H

1
~
H
k
t
+
X
k;j
p
k
p
j
H
k
H

1
H
j

X
k
p
2
k
H
k
H

1
H
k


max
1
N
E
X
k
p
k
~
H
k
t
H

1
~
H
k
t
+
H

1
N

min
X
k
p
k
H
k
H

1
H
k


max
1
N
E
X
k
p
k
~
H
k
t
H

1
~
H
k
t
+
H

1
N

min
(
X
k
p
k
H
k
)
H

1
(
X
k
p
k
H
k
)
=

max
1
N
E
X
k
p
k
~
H
k
t
H

1
~
H
k
t
+
N


min
N
H


max
1
N
~

H
+
N


min
N
H
wherewehaveused
E
P
k
p
k
~
H
k
t
H

1
~
H
k
t

~

H
byde˝nitionof
~

andtheoperatorconvexity
ofthemapping
W
!
W
H

1
W
.
Finally,
E
C
=

E
2

h
X
k
p
k
~
H
k
t
(
u
t

w

)
;
(1


)
v
t
+

u
t

w

i
H

1
=

2

h
X
k
p
k
H
k
(
u
t

w

)
;
(1


)
v
t
+

u
t

w

i
H

1
=

2

h
(
u
t

w

)
;
(1


)
v
t
+

u
t

w

i
=

2

h
(
u
t

w

)
;
u
t

w

+
1


(
u
t

w
t
)
i
=

2

k
u
t

w

k
2
+
1


(
k
w
t

w

k
2
k
u
t

w

k
2
k
w
t

u
t
k
2
)
269

1


k
w
t

w

k
2

1


k
u
t

w

k
2
wherewehaveused
(1


)
v
t
+

u
t
=(1


)((1+

)
u
t

w
t
)

+

u
t
=
1

u
t

1


w
t
andtheidentitythat

2
h
a
;
b
i
=
k
a
k
2
+
k
b
k
2
k
a
+
b
k
2
.
Itfollowsthat
E
k
v
t
+1

w

k
2
H

1

(1


)
k
v
t

w

k
2
H

1
+
1


k
w
t

w

k
2
+(


1


)
k
u
t

w

k
2
+2

2
(

max
1
N
~

+
N


min
N
)
k
(
u
t

w

)
k
2
H
+2

2
k
X
k
p
k
~
H
k
t
(
u
t

u
k
t
)
k
2
H

1

2

h
X
k
p
k
~
H
k
t
(
u
k
t

u
t
)
;
(1


)
v
t
+

u
t

w

i
H

1
Ontheotherhand,
E
k
w
t
+1

w

k
2
=
E
k
u
t

w


X
k
p
k
~
H
k
t
(
u
t

w

)
k
2
=
E
k
u
t

w

k
2

2

k
u
t

w

k
2
H
+

2
k
X
k
p
k
~
H
k
t
(
u
t

w

)
k
2

E
k
u
t

w

k
2

2

k
u
t

w

k
2
H
+

2
(

max
1
N
`
+
L
N


min
N
)
k
u
t

w

k
2
270
whereweusethefollowingbound:
E
 
X
k
p
k
~
H
k
t
! 
X
k
p
k
~
H
k
t
!
=
E
X
k
p
2
k
~
H
k
t
~
H
k
t
+
X
k
6
=
j
p
k
p
j
~
H
k
t
~
H
j
t


max
1
N
E
X
k
p
k
~
H
k
t
~
H
k
t
+
X
k
6
=
j
p
k
p
j
H
k
H
j
=

max
1
N
E
X
k
p
k
~
H
k
t
~
H
k
t
+
X
k;j
p
k
p
j
H
k
H
j

X
k
p
2
k
H
k
H
k


max
1
N
E
X
k
p
k
~
H
k
t
~
H
k
t
+
H
2

1
N

min
X
k
p
k
H
k
H
k


max
1
N
E
X
k
p
k
~
H
k
t
~
H
k
t
+
H
2

1
N

min
(
X
k
p
k
H
k
)(
X
k
p
k
H
k
)
=

max
1
N
E
X
k
p
k
~
H
k
t
~
H
k
t
+
N


min
N
H
2


max
1
N
l
H
+
L
N


min
N
H
againusingthat
W
!
W
2
isoperatorconvexandthat
E
~
H
k
t
~
H
k
t

l
H
k
byde˝nitionof
l
.
Combiningtheboundsfor
E
k
w
t
+1

w

k
2
and
E
k
v
t
+1

w

k
2
H

1
,
E


k
w
t
+1

w

k
2
+
k
v
t
+1

w

k
2
H

1

(1


)
k
v
t

w

k
2
H

1
+
1


k
w
t

w

k
2
+(


)
k
u
t

w

k
2
+(2

2
(

max
1
N
~

+
N


min
N
)

2

+

2

(

max
1
N
l
+
L
N


min
N
)

)
k
u
t

w

k
2
+2

2
k
X
k
p
k
~
H
k
t
(
u
t

u
k
t
)
k
2
H

1
+
L
X
k
p
k
k
(
u
t

u
k
t
)
k
2
H

1
271
Following[119]ifwechoosestepsizessothat


0
2

2
(

max
1
N
~

+
N


min
N
)

2

+

2

(

max
1
N
l
+
L
N


min
N
)


0
orequivalently


2

(

max
1
N
~

+
N


min
N
)+

(

(

max
1
N
l
+
L
N


min
N
)

2)

0
thesecondandthirdtermsarenegative.Tooptimizethestepsizes,notethatthetwo
inequalitiesimply

2


(2


(

max
1
N
l
+
L
N


min
N
))

2(

max
1
N
~

+
N


min
N
)
andmaximizingtherighthandsidewithrespectto

,whichisquadratic,weseethat


1
=
(

max
1
N
l
+
L
N


min
N
)
maximizestherighthandside,with


1
q
2(

max
1
N

1
+

N


min
N
)(

max
1
N
~

+
N


min
N
)


=


(

max
1
N
~

+
N


min
N
)
Notethat

=
1
r
2(

max
1
N

1
+

N


min
N
)(

max
1
N
~

+
N


min
N
)
=
O
(
N
p

1
~

)
when
N
=
O
(min
f
~

1

g
)
.
Finally,todealwiththeterms
2

2
k
P
k
p
k
~
H
k
t
(
u
t

u
k
t
)
k
2
H

1
+
L
P
k
p
k
k
(
u
t

u
k
t
)
k
2
H

1
,we
272
canuseJensen
2

2
k
X
k
p
k
~
H
k
t
(
u
t

u
k
t
)
k
2
H

1
+
L
X
k
p
k
k
(
u
t

u
k
t
)
k
2
H

1

(2

2
l
2
+
L
)
X
k
p
k
k
u
t

u
k
t
k
2
H

1
=(2

2
l
2
+
L
)
X
k
p
k
k

1+

v
t
+
1
1+

w
t

(

1+

v
k
t
+
1
1+

w
k
t
)
k
2
H

1

(2

2
l
2
+
L
)(2(

1+

)
2

2
+2(
1
1+

)
2

2
)
X
k
p
k
k
~
H
k
t

1
(
u
t

1

w

)
k
2

(2

2
l
2
+
L
)(2(

1+

)
2

2
+2(
1
1+

)
2

2
)
l
2
k
(
u
t

1

w

)
k
2
whichcanbecombinedwiththetermswith
k
(
u
t

1

w

)
k
2
intherecursiveexpansionof
E


k
w
t

w

k
2
+
k
v
t

w

k
2
H

1
:
E


k
w
t

w

k
2
+
k
v
t

w

k
2
H

1

(1


)
k
v
t

1

w

k
2
H

1
+
1


k
w
t

1

w

k
2
+(


)
k
u
t

1

w

k
2
+(2

2
(

max
1
N
~

+
N


min
N
)

2

+

2

(

max
1
N
l
+
L
N


min
N
)

)
k
u
t

1

w

k
2
andthestepsizescanbechosensothattheresultingcoe˚cientsarenegative.Therefore,we
haveshownthat
E
k
w
t
+1

w

k
2

(1


)
2
k
w
t

1

w

k
2
where

=
1
r
2(

max
1
N

1
+

N


min
N
)(

max
1
N
~

+
N


min
N
)
=
O
(
N

max
p

1
~

+
N


min
)
when
N
=
O
(min
f
~

1

g
)
.
273
Forgeneral
E>
1
,choosing

=
c=E
(

max
1
N
l
+
L
N


min
N
)
forsomesmallconstant
c
resultsin

=
O
(
1
E
r
(

max
1
N

1
+

N


min
N
)(

max
1
N
~

+
N


min
N
)
)
andthisguaranteesthat
E
k
w
t

w

k
2

(1


)
t
k
w
0

w

k
2
forall
t
.
DetailsonExperimentsandAdditionalResults
Wedescribedthepreciseproceduretoreproducetheresultsinthischapter.Aswementioned
inSection6.6,weempiricallyveri˝edthelinearspeeduponvariousconvexsettingsforboth
FedAvganditsacceleratedvariants.Foralltheresults,wesetrandomseedsas
0
;
1
;
2
and
reportthebestconvergencerateacrossthethreefolds.Foreachrun,weinitialize
w
0
=
0
andmeasurethenumberofiterationtoreachthetargetaccuracy

.Weusethesmall-scale
datasetw8a[
155
],whichconsistsof
n
=49749
sampleswithfeaturedimension
d
=300
.
Thelabeliseitherpositiveoneornegativeone.Thedatasethassparsebinaryfeaturesin
f
0
;
1
g
.Eachsamplehas11.15non-zerofeaturevaluesoutof
300
featuresonaverage.We
setthebatchsizeequaltofouracrossallexperiments.Inthenextfollowingsubsections,we
introduceparametersearchingineachobjectiveseparately.
StronglyConvexObjectives
We˝rstconsiderthestronglyconvexobjectivefunction,
whereweusearegularizedbinarylogisticregressionwithregularization

=1
=n
ˇ
2
e

5
.
Weevenlydistributedon
1
;
2
;
4
;
8
;
16
;
32
devicesandreportthenumberofiterations/rounds
neededtoconvergeto


accuracy,where

=0
:
005
.Theoptimalobjectivefunctionvalue
f

274
issetas
f

=0
:
126433176216545
.Thisisdeterminednumericallyandwefollowthesetting
in[
178
].Thelearningrateisdecayedasthe

t
=
min
(

0
;
nc
1+
t
)
,whereweextensivelysearch
thebestlearningrate
c
2f
2

1
c
0
;
2

2
c
0
;c
0
;
2
c
0
;
2
2
c
0
g
.Inthiscase,wesearchtheinitial
learningrate

0
2f
1
;
32
g
and
c
0
=1
=
8
.
ConvexSmoothObjectives
Wealsousebinarylogisticregressionwithoutregu-
larization.Thesettingisalmostsameasitsregularizedcounterpart.Wealsoevenly
distributedallthesampleson
1
;
2
;
4
;
8
;
16
;
32
devices.The˝gureshowsthenumberof
iterationsneededtoconvergeto


accuracy,where

=0
:
02
.Theoptiamlobjectivefunc-
tionvalueissetas
f

=0
:
11379089057514849
,determinednumerically.Thelearningrate
isdecayedasthe

t
=
min
(

0
;
nc
1+
t
)
,whereweextensivelysearchthebestlearningrate
c
2f
2

1
c
0
;
2

2
c
0
;c
0
;
2
c
0
;
2
2
c
0
g
.Inthiscase,wesearchtheinitiallearningrate

0
2f
1
;
32
g
and
c
0
=1
=
8
.
Linearregression
Forlinearregression,weusethesamefeaturevectorsfromw8a
datasetandgenerategroundtruth
[
w

;b

]
fromamultivariatenormaldistributionwithzero
meanandstandarddeviationone.Thenwegeneratelabelbasedon
y
i
=
x
t
i
w

+
b

.This
procedurewillensurewesatisfytheover-parameterizedsettingasrequiredinourtheorems.
Wealsoevenlydistributedallthesampleson
1
;
2
;
4
;
8
;
16
;
32
devices.The˝gureshowsthe
numberofiterationsneededtoconvergeto


accuracy,where

=0
:
02
.Theoptiamlobjective
functionvalueis
f

=0
.Thelearningrateisdecayedasthe

t
=
min
(

0
;
nc
1+
t
)
,wherewe
extensivelysearchthebestlearningrate
c
2f
2

1
c
0
;
2

2
c
0
;c
0
;
2
c
0
;
2
2
c
0
g
.Inthiscase,we
searchtheinitiallearningrate

0
2f
0
:
1
;
0
:
12
g
and
c
0
=1
=
256
.
PartialParticipation
ToexaminethelinearspeedupofFedAvginpartialparticipation
setting,weevenlydistributeddataon
4
;
8
;
16
;
32
;
64
;
128
devicesanduniformlysample
50%
deviceswithoutreplacement.Allotherhyperparametersarethesameasprevioussections.
275
(a)Stronglyconvexobjective(b)Convexsmoothobjective(c)Linearregression
FigureB.1:TheconvergenceofFedAvgw.r.tthenumberoflocalsteps
E
.
NesterovacceleratedFedAvg
TheexperimentsofNesterovacceleratedFedAvg(see
updateformulabelow)usesthesamesettingaspreviousthreesectionsforvaniliaFedAvg.
y
k
t
+1
=
w
k
t


t
g
t;k
w
k
t
+1
=
8
>
>
>
<
>
>
>
:
y
k
t
+1
+

t
(
y
k
t
+1

y
k
t
)
if
t
+1
=
2I
E
P
k
2S
t
+1

y
k
t
+1
+

t
(
y
k
t
+1

y
k
t
)

if
t
+1
2I
E
Weset

t
=0
:
1
andsearch

t
inthesamewayas

t
inFedAvg.
Theimpactof
E
.
Inthissubsection,wefurtherexaminehowdoesthenumberoflocal
steps(
E
)a˙ectconvergence.AsshowninFigureB.1,thenumberofiterationsincreasesas
E
increase,whichslowdowntheconvergenceintermsofgradientcomputation.However,
itcansavecommunicationcostsasthenumberofroundsdecreasedwhenthe
E
increases.
Thisshowcasethatweneedaproperchoiceof
E
totrade-o˙thecommunicationcostand
convergencespeed.
276
BIBLIOGRAPHY
277
BIBLIOGRAPHY
[1]
AbbasAbdolmaleki,JostTobiasSpringenberg,YuvalTassa,RemiMunos,Nicolas
Heess,andMartinRiedmiller.Maximumaposterioripolicyoptimisation.
arXiv
preprintarXiv:1806.06920
,2018.
[2]
NaokiAbe,PremMelville,CezarPendus,ChandanKReddy,DavidLJensen,VinceP
Thomas,JamesJBennett,GaryFAnderson,BrentRCooley,MelissaKowalczyk,etal.
Optimizingdebtcollectionsusingconstrainedreinforcementlearning.In
SIGKDD
,
pagesACM,2010.
[3]
NaokiAbe,NavalVerma,ChidApte,andRobertSchroko.Crosschanneloptimized
marketingbyreinforcementlearning.In
SIGKDD
,pages767ACM,2004.
[4]
YuXuanLiuPieterAbbeel
yz
SergeyLevineAbhishekGupta
y
,ColineDevin
y
.Learning
invariantfeaturespacestotransferskillswithreinforcementlearning.In
Underreview
asaconferencepaperatICLR2017
,2017.
[5]
ZeyuanAllen-Zhu,YuanzhiLi,andZhaoSong.Aconvergencetheoryfordeeplearning
viaover-parameterization.
arXivpreprintarXiv:1811.03962
,2018.
[6]
SaleemaAmershi,MayaCakmak,WBradleyKnox,andToddKulesza.Powertothe
people:Theroleofhumansininteractivemachinelearning.AAAI,2014.
[7]
SaleemaAmershi,JamesFogarty,andDanielWeld.Regroup:Interactivemachine
learningforon-demandgroupcreationinsocialnetworks.In
ProceedingsoftheSIGCHI
,
pagesACM,2012.
[8]
SaleemaAmershi,BongshinLee,AshishKapoor,RatulMahajan,andBlaineChristian.
Cuet:human-guidedfastandaccuratenetworkalarmtriage.In
Proceedingsofthe
SIGCHI
,pages157ACM,2011.
[9]
MihaelAnkerst,ChristianElsen,MartinEster,andHans-PeterKriegel.Visualclas-
si˝cation:aninteractiveapproachtodecisiontreeconstruction.In
SIGKDD
,pages
ACM,1999.
[10]
A.Argyriou,T.Evgeniou,andM.Pontil.Convexmulti-taskfeaturelearning.
Machine
Learning
,2008.
[11]
KaiArulkumaran,MarcPeterDeisenroth,MilesBrundage,andAnilAnthonyBharath.
Abriefsurveyofdeepreinforcementlearning.
arXivpreprintarXiv:1708.05866
,2017.
278
[12]
DzmitryBahdanau,PhilemonBrakel,KelvinXu,AnirudhGoyal,RyanLowe,Joelle
Pineau,AaronCourville,andYoshuaBengio.Anactor-criticalgorithmforsequence
prediction.
arXivpreprintarXiv:1607.07086
,2016.
[13]
BartBakkerandTomHeskes.Taskclusteringandgatingforbayesianmultitask
learning.
TheJournalofMachineLearningResearch
,2003.
[14]
BramBakker,ShimonWhiteson,LeonKester,andFransCAGroen.Tra˚clight
controlbymultiagentreinforcementlearningsystems.In
InteractiveCollaborative
InformationSystems
,pages510.Springer,2010.
[15]
PeterLBartlettandShaharMendelson.Rademacherandgaussiancomplexities:Risk
boundsandstructuralresults.
JournalofMachineLearningResearch
,3(Nov
2002.
[16]
AmirBeckandMarcTeboulle.Afastiterativeshrinkage-thresholdingalgorithmfor
linearinverseproblems.
SIAMjournalonimagingsciences
,2009.
[17]
MarcGBellemare,WillDabney,andRémiMunos.Adistributionalperspectiveon
reinforcementlearning.
arXivpreprintarXiv:1707.06887
,2017.
[18]
MarcGBellemare,YavarNaddaf,JoelVeness,andMichaelBowling.Thearcade
learningenvironment:Anevaluationplatformforgeneralagents.
J.Artif.Intell.
Res.(JAIR)
,2013.
[19]
ShaiBen-DavidandRebaSchuller.Exploitingtaskrelatednessformultipletask
learning.In
LearningTheoryandKernelMachines
,pages580.Springer,2003.
[20]
JacobBien,JonathanTaylor,andRobertTibshirani.Alassoforhierarchicalinteractions.
Annalsofstatistics
,41(3):1111,2013.
[21]
ChristopherMBishop.
Patternrecognitionandmachinelearning
.springer,2006.
[22]
EdwinVBonilla,KianMChai,andChristopherWilliams.Multi-taskgaussianprocess
prediction.In
NIPS
,pages160,2007.
[23]
StephenBoyd,NealParikh,EricChu,BorjaPeleato,andJonathanEckstein.Dis-
tributedoptimizationandstatisticallearningviathealternatingdirectionmethodof
multipliers.
FoundationsandTrends
R

inMachineLearning
,2011.
[24]
GregBrockman,VickiCheung,LudwigPettersson,JonasSchneider,JohnSchulman,
JieTang,andWojciechZaremba.Openaigym.
arXivpreprintarXiv:1606.01540
,2016.
[25]
CristianBucilu,RichCaruana,andAlexandruNiculescu-Mizil.Modelcompression.In
SIGKDD
,pages41.ACM,2006.
279
[26]
ChrisBurges,TalShaked,ErinRenshaw,AriLazier,MattDeeds,NicoleHamilton,
andGregHullender.Learningtorankusinggradientdescent.In
Proceedingsofthe
22ndinternationalconferenceonMachinelearning
,pages6.ACM,2005.
[27]
LucianBusoniu,RobertBabuska,andBartDeSchutter.Acomprehensivesurveyof
multiagentreinforcementlearning.
IEEETrans.Systems,Man,andCybernetics,Part
C
,2008.
[28]
LucianBu³oniu,RobertBabu²ka,andBartDeSchutter.Multi-agentreinforcement
learning:Anoverview.In
Innovationsinmulti-agentsystemsandapplications-1
,pages
Springer,2010.
[29]
RemiJCadoret,WilliamRYates,GeorgeWoodworth,andMarkAStewart.Genetic-
environmentalinteractioninthegenesisofaggressivityandconductdisorders.
Archives
ofGeneralPsychiatry
,1995.
[30]
AlfredoCanziani,AdamPaszke,andEugenioCulurciello.Ananalysisofdeepneural
networkmodelsforpracticalapplications.
arXivpreprintarXiv:1605.07678
,2016.
[31]
ZheCao,TaoQin,Tie-YanLiu,Ming-FengTsai,andHangLi.Learningtorank:from
pairwiseapproachtolistwiseapproach.In
ICML
,pages36.ACM,2007.
[32]
MarcCarreras,JunkuYuh,JoanBatlle,andPereRidao.Abehavior-basedscheme
usingreinforcementlearningforautonomousunderwatervehicles.
IEEEJournalof
OceanicEngineering
,2005.
[33]
RichCaruana.Multitasklearning.
Machinelearning
,1997.
[34]
PabloSamuelCastro,SubhodeepMoitra,CarlesGelada,SaurabhKumar,andMarcG.
Bellemare.Dopamine:Aresearchframeworkfordeepreinforcementlearning.
CoRR
,
abs/1812.06110,2018.
[35]
ShiyuChang,Guo-JunQi,CharuCAggarwal,JiayuZhou,MengWang,andThomasS
Huang.Factorizedsimilaritylearninginnetworks.In
ICDM
,pagesIEEE,2014.
[36]
FeiChen,ZhenhuaDong,ZhenguoLi,andXiuqiangHe.Federatedmeta-learningfor
recommendation.
arXivpreprintarXiv:1802.07876
,2018.
[37]
JianhuiChen,JiayuZhou,andJiepingYe.Integratinglow-rankandgroup-sparse
structuresforrobustmulti-tasklearning.In
SIGKDD
,pages0.ACM,2011.
[38]
XiaohuiChen,XinghuaShi,XingXu,ZhiyongWang,RyanMills,CharlesLee,and
JinboXu.Atwo-graphguidedmulti-tasklassoapproachforeqtlmapping.In
AISTATS
,
pages2012.
280
[39]
NamHeeChoi,WilliamLi,andJiZhu.Variableselectionwiththestrongheredity
constraintanditsoracleproperty.
JASA
,2010.
[40]
DidiChuxing.
[41]
Djork-ArnéClevert,ThomasUnterthiner,andSeppHochreiter.Fastandaccuratedeep
networklearningbyexponentiallinearunits(elus).
arXivpreprintarXiv:1511.07289
,
2015.
[42]
WillDabney,GeorgOstrovski,DavidSilver,andRémiMunos.Implicitquantile
networksfordistributionalreinforcementlearning.
arXivpreprintarXiv:1806.06923
,
2018.
[43]
BoDai,AlbertShaw,LihongLi,LinXiao,NiaoHe,ZhenLiu,JianshuChen,and
LeSong.Sbeed:Convergentreinforcementlearningwithnonlinearfunctionapproxima-
tion.
arXivpreprintarXiv:1712.10285
,2017.
[44]
HalDaumé,JohnLangford,andDanielMarcu.Search-basedstructuredprediction.
Machinelearning
,2009.
[45]
PeterDayanandGeo˙reyEHinton.Usingexpectation-maximizationforreinforcement
learning.
NeuralComputation
,1997.
[46]
ThomasDegris,MarthaWhite,andRichardSSutton.O˙-policyactor-critic.
arXiv
preprintarXiv:1205.4839
,2012.
[47]
PierreJDejaxandTeodorGabrielCrainic.Surveypapreviewofempty˛owsand
˛eetmanagementmodelsinfreighttransportation.
Transportationscience
,
248,1987.
[48]
PrafullaDhariwal,ChristopherHesse,OlegKlimov,AlexNichol,MatthiasPlappert,
AlecRadford,JohnSchulman,SzymonSidor,YuhuaiWu,andPeterZhokhov.Openai
baselines.
https://github.com/openai/baselines
,2017.
[49]
JillesSteeveDibangoyeandOlivierBu˙et.
LearningtoActinDecentralizedPartially
ObservableMDPs
.PhDthesis,INRIAGrenoble-Rhone-Alpes-CHROMATeam;INRIA
Nancy,équipeLARSEN,2018.
[50]
PierreDillenbourg.
CollaborativeLearning:CognitiveandComputationalApproaches.
AdvancesinLearningandInstructionSeries.
ERIC,1999.
[51]
PierreDutilleul.Themlealgorithmforthematrixnormaldistribution.
JSTAT
COMPUTSIM
,1999.
281
[52]
ThaliaCEley,KarenSugden,AlejandroCorsico,AliceMGregory,PakSham,Peter
McGu˚n,RobertPlomin,andIanWCraig.vironmentinteractionanalysisof
serotoninsystemmarkerswithadolescentdepression.
Molecularpsychiatry
,
915,2004.
[53]
AEvgeniouandMassimilianoPontil.Multi-taskfeaturelearning.
NIPS
,19:41,2007.
[54]
TheodorosEvgeniou,CharlesAMicchelli,andMassimilianoPontil.Learningmultiple
taskswithkernelmethods.In
JMLR
,pages6152005.
[55]
TheodorosEvgeniouandMassimilianoPontil.Regularizedmlearning.In
SIGKDD
,pages17.ACM,2004.
[56]
BenjaminEysenbachandSergeyLevine.Ifmaxentrlistheanswer,whatisthequestion?
arXivpreprintarXiv:1910.01913
,2019.
[57]
AlirezaFallah,AryanMokhtari,andAsumanOzdaglar.Personalizedfederatedlearning:
Ameta-learningapproach.
arXivpreprintarXiv:2002.07948
,2020.
[58]
HongliangFeiandJunHuan.Structuredfeatureselectionandtaskrelationshipinference
formulti-tasklearning.
Knowledgeandinformationsystems
,2013.
[59]
ChelseaFinn.
Learningtolearnwithgradients
.PhDthesis,UCBerkeley,2018.
[60]
JakobFoerster,GregoryFarquhar,TriantafyllosAfouras,NantasNardelli,andShimon
Whiteson.Counterfactualmulti-agentpolicygradients.
arXivpreprintarXiv:1705.08926
,
2017.
[61]
JeromeFriedman,TrevorHastie,andRobertTibshirani.Anoteonthegrouplasso
andasparsegrouplasso.
arXivpreprintarXiv:1001.0736
,2010.
[62]
SilviaGandy,BenjaminRecht,andIsaoYamada.Tensorcompletionandlow-n-rank
tensorrecoveryviaconvexoptimization.
InverseProblems
,27(2):025010,2011.
[63]
JMGatt,CBNemero˙,CDobson-Stone,RHPaul,RABryant,PRScho˝eld,EGordon,
AHKemp,andLMWilliams.Interactionsbetweenbdnfval66metpolymorphismand
earlylifestresspredictbrainandarousalpathwaystosyndromaldepressionandanxiety.
Molecularpsychiatry
,2009.
[64]
GregoryAGodfreyandWarrenBPowell.Anadaptivedynamicprogrammingalgorithm
fordynamic˛eetmanagement,i:Singleperiodtraveltimes.
TransportationScience
,
2002.
[65]
GregoryAGodfreyandWarrenBPowell.Anadaptivedynamicprogrammingalgorithm
fordynamic˛eetmanagement,ii:Multiperiodtraveltimes.
TransportationScience
,
2002.
282
[66]
PinghuaGong,JiepingYe,andChangshuiZhang.Robustmulti-taskfeaturelearning.
In
SIGKDD
,pages903.ACM,2012.
[67]
PinghuaGong,ChangshuiZhang,ZhaosongLu,JianhuaZHuang,andJiepingYe.
Ageneraliterativeshrinkageandthresholdingalgorithmfornon-convexregularized
optimizationproblems.In
ICML
,volume28,page37,2013.
[68]
PinghuaGong,JiayuZhou,WeiFan,andJiepingYe.E˚cientmulti-taskfeature
learningwithcalibration.In
SIGKDD
,pages770.ACM,2014.
[69]
MichaelGrant,StephenBoyd,andYinyuYe.Cvx:Matlabsoftwarefordisciplined
convexprogramming,2008.
[70]
NizarGrira,MichelCrucianu,andNozhaBoujemaa.Activesemi-supervisedfuzzy
clusteringforimagedatabasecategorization.In
Proceedingsofthe7thACMSIGMM
internationalworkshoponMultimediainformationretrieval
,pagesACM,2005.
[71]
AudrunasGruslys,WillDabney,MohammadGheshlaghiAzar,BilalPiot,MarcBelle-
mare,andRemiMunos.Thereactor:Afastandsample-e˚cientactor-criticagentfor
reinforcementlearning.2018.
[72]
ShixiangGu,TimothyLillicrap,ZoubinGhahramani,RichardETurner,andSergey
Levine.Q-prop:Sample-e˚cientpolicygradientwithano˙-policycritic.
arXivpreprint
arXiv:1611.02247
,2016.
[73]
CarlosGuestrin,MichailLagoudakis,andRonaldParr.Coordinatedreinforcement
learning.In
ICML
,volume2,pages2002.
[74]
TuomasHaarnoja,AurickZhou,PieterAbbeel,andSergeyLevine.Softactor-critic:
O˙-policymaximumentropydeepreinforcementlearningwithastochasticactor.In
InternationalConferenceonMachineLearning
,pages1852018.
[75]
FarzinHaddadpourandMehrdadMahdavi.Ontheconvergenceoflocaldescent
methodsinfederatedlearning.
arXivpreprintarXiv:1910.14425
,2019.
[76]
Shih-PingHan.Asuccessiveprojectionmethod.
MathematicalProgramming
,40
14,1988.
[77]
AndrewHard,ChloéMKiddon,DanielRamage,FrancoiseBeaufays,HubertEichner,
KanishkaRao,RajivMathews,andSeanAugenstein.Federatedlearningformobile
keyboardprediction,2018.
[78]
KaimingHe,XiangyuZhang,ShaoqingRen,andJianSun.Identitymappingsindeep
residualnetworks.In
ECCV
,pages630Springer,2016.
283
[79]
KariHemminki,JustoLorenzoBermejo,andAstaFörsti.Thebalancebetweenheritable
andenvironmentalaetiologyofhumandisease.
NatureReviewsGenetics
,
2006.
[80]
MatteoHessel,JosephModayil,HadoVanHasselt,TomSchaul,GeorgOstrovski,Will
Dabney,DanHorgan,BilalPiot,MohammadAzar,andDavidSilver.Rainbow:Com-
biningimprovementsindeepreinforcementlearning.
arXivpreprintarXiv:1710.02298
,
2017.
[81]
ToddHester,MatejVecerik,OlivierPietquin,MarcLanctot,TomSchaul,BilalPiot,
DanHorgan,JohnQuan,AndrewSendonaris,IanOsband,etal.Deepq-learningfrom
demonstrations.In
Thirty-SecondAAAIConferenceonArti˝cialIntelligence
,2018.
[82]
Geo˙reyHinton,OriolVinyals,andJe˙Dean.Distillingtheknowledgeinaneural
network.
arXivpreprintarXiv:1503.02531
,2015.
[83]
KurtHornik,MaxwellStinchcombe,andHalbertWhite.Multilayerfeedforward
networksareuniversalapproximators.
Neuralnetworks
,1989.
[84]
ZhouyuanHuo,QianYang,BinGu,LawrenceCarinHuang,etal.Fasteron-device
trainingusingnewfederatedmomentumalgorithm.
arXivpreprintarXiv:2002.02090
,
2020.
[85]
AndrewIlyas,LoganEngstrom,ShibaniSanturkar,DimitrisTsipras,FirdausJanoos,
LarryRudolph,andAleksanderMadry.Aredeeppolicygradientalgorithmstruly
policygradientalgorithms?
arXivpreprintarXiv:1811.02553
,2018.
[86]
MaxJaderberg,VolodymyrMnih,WojciechMarianCzarnecki,TomSchaul,JoelZ
Leibo,DavidSilver,andKorayKavukcuoglu.Reinforcementlearningwithunsupervised
auxiliarytasks.
arXivpreprintarXiv:1611.05397
,2016.
[87]
PrateekJain,ShamMKakade,RahulKidambi,PraneethNetrapalli,andAaronSidford.
Acceleratingstochasticgradientdescent.In
Proc.STAT
,volume1050,page26,2017.
[88]
AliJalali,SujaySanghavi,ChaoRuan,andPradeepKRavikumar.Adirtymodelfor
multi-tasklearning.In
NIPS
.
[89]
ShuiwangJiandJiepingYe.Anacceleratedgradientmethodfortracenormminimiza-
tion.In
ICML
,pages64.ACM,2009.
[90]
NanJiangandAlekhAgarwal.Openproblem:Thedependenceofsamplecomplexity
lowerboundsonplanninghorizon.In
ConferenceOnLearningTheory
,pages3
2018.
284
[91]
NanJiang,AkshayKrishnamurthy,AlekhAgarwal,JohnLangford,andRobertE
Schapire.Contextualdecisionprocesseswithlowbellmanrankarepac-learnable.In
Proceedingsofthe34thInternationalConferenceonMachineLearning-Volume70
,
pagesJMLR.org,2017.
[92]
PengJiangandGaganAgrawal.Alinearspeedupanalysisofdistributeddeeplearning
withsparseandquantizedcommunication.In
AdvancesinNeuralInformationProcessing
Systems
,pages2522018.
[93]
Je˙Kahn,NathanLinial,andAlexSamorodnitsky.Inclusion-exclusion:Exactand
approximate.
Combinatorica
,1996.
[94]
PeterKairouz,HBrendanMcMahan,BrendanAvent,AurélienBellet,MehdiBennis,
ArjunNitinBhagoji,KeithBonawitz,ZacharyCharles,GrahamCormode,Rachel
Cummings,etal.Advancesandopenproblemsinfederatedlearning.
arXivpreprint
arXiv:1912.04977
,2019.
[95]
ShamMachandranathKakadeetal.
Onthesamplecomplexityofreinforcementlearning
.
PhDthesis,UniversityofLondonLondon,England,2003.
[96]
TsuyoshiKato,HisashiKashima,MasashiSugiyama,andKiyoshiAsai.Multi-task
learningviaconicprogramming.In
NIPS
,pages7372008.
[97]
MichaelJKearns,YishayMansour,andAndrewYNg.Approximateplanningin
largepomdpsviareusabletrajectories.In
AdvancesinNeuralInformationProcessing
Systems
,pages1002000.
[98]
AKhaled,KMishchenko,andPRichtárik.Tightertheoryforlocalsgdonidentical
andheterogeneousdata.In
The23rdInternationalConferenceonArti˝cialIntelligence
andStatistics(AISTATS2020)
,2020.
[99]
AhmedKhaled,KonstantinMishchenko,andPeterRichtárik.Firstanalysisoflocalgd
onheterogeneousdata.
NeurIPSWorkshoponFederatedLearningforDataPrivacy
andCon˝dentiality
,2019.
[100]
RahulKidambi,PraneethNetrapalli,PrateekJain,andShamKakade.Ontheinsu˚-
ciencyofexistingmomentumschemesforstochasticoptimization.In
2018Information
TheoryandApplicationsWorkshop(ITA)
,pagesIEEE,2018.
[101]
SeyoungKimandEricPXing.Tree-guidedgrouplassoformulti-taskregressionwith
structuredsparsity.
ICML
,2010.
[102]
JensKoberandJanRPeters.Policysearchformotorprimitivesinrobotics.In
Advancesinneuralinformationprocessingsystems
,pages56,2009.
285
[103]
JelleRKokandNikosVlassis.Collaborativemultiagentreinforcementlearningby
payo˙propagation.
JMLR
,2006.
[104]
AnastasiaKoloskova,NicolasLoizou,SadraBoreiri,MartinJaggi,andSebastianU
Stich.Auni˝edtheoryofdecentralizedsgdwithchangingtopologyandlocalupdates.
arXivpreprintarXiv:2003.10422
,2020.
[105]
AkshayKrishnamurthy,AlekhAgarwal,andJohnLangford.Pacreinforcementlearning
withrichobservations.In
AdvancesinNeuralInformationProcessingSystems
,pages
2016.
[106]
MonicaSLam.Autonomyandprivacywithopenfederatedvirtualassistants.
[107]
GuillaumeLampleandDevendraSinghChaplot.Playingfpsgameswithdeeprein-
forcementlearning.
arXivpreprintarXiv:1609.05521
,2016.
[108]
JasonLee,YuekaiSun,andMichaelSaunders.Proximalnewton-typemethodsfor
convexoptimization.In
NIPS
,pages844,2012.
[109]
TianLi,AnitKumarSahu,ManzilZaheer,MaziarSanjabi,AmeetTalwalkar,and
VirginiaSmith.Federatedoptimizationinheterogeneousnetworks.
MLSys
,2020.
[110]
XiangLi,KaixuanHuang,WenhaoYang,ShusenWang,andZhihuaZhang.Onthe
convergenceoffedavgonnon-iiddata.
ICLR
,2020.
[111]
XiujunLi,Yun-NungChen,LihongLi,JianfengGao,andAsliCelikyilmaz.End-to-end
task-completionneuraldialoguesystems.
arXivpreprintarXiv:1703.01008
,2017.
[112]
ZLi,YHong,andZZhang.Doon-demandride-sharingservicesa˙ecttra˚cconges-
tion?evidencefromuberentry.Technicalreport,Workingpaper,availableatSSRN:
https://ssrn.com/abstract=2838043,2016.
[113]
XianfengLiang,ShuhengShen,JingchangLiu,ZhenPan,EnhongChen,andYifei
Cheng.Variancereducedlocalsgdwithlowercommunicationcomplexity.
arXivpreprint
arXiv:1912.12844
,2019.
[114]
KaixiangLin,ShuWang,andJiayuZhou.Collaborativedeepreinforcementlearning.
arXivpreprintarXiv:1702.05796
,2017.
[115]
KaixiangLin,JianpengXu,InciMBaytas,ShuiwangJi,andJiayuZhou.Multi-task
featureinteractionlearning.In
Proceedingsofthe22NdACMSIGKDDInternational
ConferenceonKnowledgeDiscoveryandDataMining
,pages17352016.
[116]
KaixiangLin,RenyuZhao,ZheXu,andJiayuZhou.E˚cientlarge-scale˛eetman-
agementviamulti-agentdeepreinforcementlearning.
Proceedingsofthe24thACM
SIGKDDInternationalConferenceonKnowledge
,2018.
286
[117]
KaixiangLinandJiayuZhou.Interactivemulti-taskrelationshiplearning.In
2016
IEEE16thInternationalConferenceonDataMining(ICDM)
,pagesIEEE,
2016.
[118]
KaixiangLinandJiayuZhou.Rankingpolicygradient.In
InternationalConference
onLearningRepresentations
,2020.
[119]
ChaoyueLiuandMikhailBelkin.Acceleratingsgdwithmomentumforover-
parameterizedlearning.
ICLR
,2020.
[120]
JunLiu,ShuiwangJi,andJiepingYe.Multi-taskfeaturelearningviae˚cient
`
2
;
1
-norm
minimization.In
Proceedingsofthe25thconferenceonUAI
,pagesAUAI
Press,2009.
[121]
JunLiuandJiepingYe.E˚cientl1/lqnormregularization.
arXiv:1009.4766
,2010.
[122]
TianyiLiu,ZhehuiChen,EnluZhou,andTuoZhao.Towarddeeperunderstanding
ofnonconvexstochasticoptimizationwithmomentumusingdi˙usionapproximations.
arXivpreprintarXiv:1802.05155
,2018.
[123]
WeiLiu,LiChen,YunfeiChen,andWenyiZhang.Acceleratingfederatedlearningvia
momentumgradientdescent.
IEEETransactionsonParallelandDistributedSystems
,
2020.
[124]
YashuLiu,JieWang,andJiepingYe.Ane˚cientalgorithmforweakhierarchicallasso.
In
SIGKDD
,pages292.ACM,2014.
[125]
RyanLowe,YiWu,AvivTamar,JeanHarb,PieterAbbeel,andIgorMordatch.Multi-
agentactor-criticformixedcooperative-competitiveenvironments.
arXivpreprint
arXiv:1706.02275
,2017.
[126]
Lyft.
[127]
SiyuanMa,RaefBassily,andMikhailBelkin.Thepowerofinterpolation:Understanding
thee˙ectivenessofsgdinmodernover-parametrizedlearning.
ICML
,2018.
[128]
MichaªMaciejewskiandKaiNagel.Thein˛uenceofmulti-agentcooperationonthe
e˚ciencyoftaxidispatching.In
PPAM
,pages60.Springer,2013.
[129]
HBrendanMcMahan,EiderMoore,DanielRamage,SethHampson,etal.
Communication-e˚cientlearningofdeepnetworksfromdecentralizeddata.
Pro-
ceedingsofthe20thInternationalConferenceonArti˝cialIntelligenceandStatistics
(AISTATS)
,2017.
[130]
PremMelvilleandVikasSindhwani.Recommendersystems.In
Encyclopediaofmachine
learning
,pages829Springer,2011.
287
[131]
VolodymyrMnih,AdriaPuigdomenechBadia,MehdiMirza,AlexGraves,TimothyP
Lillicrap,TimHarley,DavidSilver,andKorayKavukcuoglu.Asynchronousmethods
fordeepreinforcementlearning.
arXivpreprintarXiv:1602.01783
,2016.
[132]
VolodymyrMnih,KorayKavukcuoglu,DavidSilver,AndreiARusu,JoelVeness,
MarcGBellemare,AlexGraves,MartinRiedmiller,AndreasKFidjeland,Georg
Ostrovski,etal.Human-levelcontrolthroughdeepreinforcementlearning.
Nature
,
3,2015.
[133]
MehryarMohri,AfshinRostamizadeh,andAmeetTalwalkar.
Foundationsofmachine
learning
.MITpress,2018.
[134]
EricMoulinesandFrancisRBach.Non-asymptoticanalysisofstochasticapproximation
algorithmsformachinelearning.In
AdvancesinNeuralInformationProcessingSystems
,
pages2011.
[135]
RémiMunos,TomStepleton,AnnaHarutyunyan,andMarcBellemare.Safeande˚cient
o˙-policyreinforcementlearning.In
AdvancesinNeuralInformationProcessingSystems
,
pages2016.
[136]
O˝rNachum,MohammadNorouzi,KelvinXu,andDaleSchuurmans.Bridgingthe
gapbetweenvalueandpolicybasedreinforcementlearning.In
AdvancesinNeural
InformationProcessingSystems
,pages27752017.
[137]
ArunNair,PraveenSrinivasan,SamBlackwell,CagdasAlcicek,RoryFearon,Alessandro
DeMaria,VedavyasPanneershelvam,MustafaSuleyman,CharlesBeattie,StigPetersen,
etal.Massivelyparallelmethodsfordeepreinforcementlearning.
arXivpreprint
arXiv:1507.04296
,2015.
[138]
DeannaNeedell,RachelWard,andNatiSrebro.Stochasticgradientdescent,weighted
sampling,andtherandomizedkaczmarzalgorithm.In
Advancesinneuralinformation
processingsystems
,pages10172014.
[139]
AndrewYNg,DaishiHarada,andStuartRussell.Policyinvarianceunderreward
transformations:Theoryandapplicationtorewardshaping.In
ICML
,volume99,
pages1999.
[140]
DucThienNguyen,AkshatKumar,andHoongChuinLau.Collectivemultiagent
sequentialdecisionmakingunderuncertainty.
AAAI
,2017.
[141]
DucThienNguyen,AkshatKumar,andHoongChuinLau.Policygradientwithvalue
functionapproximationforcollectivemultiagentplanning.
NIPS
,2017.
288
[142]
GuillaumeObozinski,BenTaskar,andMichaelIJordan.Jointcovariateselectionand
jointsubspaceselectionformultipleclassi˝cationproblems.
StatisticsandComputing
,
2010.
[143]
BrendanO'Donoghue.Variationalbayesianreinforcementlearningwithregretbounds.
arXivpreprintarXiv:1807.09647
,2018.
[144]
BrendanO'Donoghue,RemiMunos,KorayKavukcuoglu,andVolodymyrMnih.Com-
biningpolicygradientandq-learning.
arXivpreprintarXiv:1611.01626
,2016.
[145]
JunhyukOh,YijieGuo,SatinderSingh,andHonglakLee.Self-imitationlearning.
arXivpreprintarXiv:1806.05635
,2018.
[146]
JunhyukOh,MatteoHessel,WojciechMCzarnecki,ZhongwenXu,HadovanHasselt,
SatinderSingh,andDavidSilver.Discoveringreinforcementlearningalgorithms.
arXiv
preprintarXiv:2007.08794
,2020.
[147]
OpenAI.Openaiuniverse-starter-agent.
https://github.com/openai/
universe-starter-agent
,2017.Accessed:2017-0201.
[148]
TakayukiOsa,JoniPajarinen,GerhardNeumann,JAndrewBagnell,PieterAbbeel,
JanPeters,etal.Analgorithmicperspectiveonimitationlearning.
Foundationsand
Trends
R

inRobotics
,2018.
[149]
SinnoJialinPanandQiangYang.Asurveyontransferlearning.
TKDE
,
1359,2010.
[150]
NealParikhandStephenPBoyd.Proximalalgorithms.
FoundationsandTrendsin
optimization
,2014.
[151]
EmilioParisotto,JimmyLeiBa,andRuslanSalakhutdinov.Actor-mimic:Deep
multitaskandtransferreinforcementlearning.
arXivpreprintarXiv:1511.06342
,2015.
[152]
JanPetersandStefanSchaal.Reinforcementlearningbyreward-weightedregression
foroperationalspacecontrol.In
Proceedingsofthe24thinternationalconferenceon
Machinelearning
,pages745ACM,2007.
[153]
JanPetersandStefanSchaal.Reinforcementlearningofmotorskillswithpolicy
gradients.
Neuralnetworks
,2008.
[154]
Tu-HoaPham,GiovanniDeMagistris,andRyukiTachibana.Optlayer-practical
constrainedoptimizationfordeepreinforcementlearningintherealworld.
arXiv
preprintarXiv:1709.07643
,2017.
289
[155]
JPlatt.Fasttrainingofsupportvectormachinesusingsequentialminimaloptimization,
in,b.scholkopf,c.burges,a.smola,(eds.):Advancesinkernelmethods-supportvector
learning,1998.
[156]
MartinLPuterman.
Markovdecisionprocesses:discretestochasticdynamicprogram-
ming
.JohnWiley&Sons,2014.
[157]
ZhaonanQu*,KaixiangLin*,JayantKalagnanam,ZhaojianLi,JiayuZhou,and
ZhengyuanZhou.Federatedlearning'sblessing:Fedavghaslinearspeedup.
arXiv
preprintarXiv:2007.05690
,2020,*denotesequalcontribution.
[158]
PeterRadchenkoandGarethMJames.Variableselectionusingadaptivenonlinearin-
teractionstructuresinhighdimensions.
JournaloftheAmericanStatisticalAssociation
,
,2010.
[159]
JanarthananRajendran,AravindLakshminarayanan,MiteshMKhapra,Balaraman
Ravindran,etal.A2t:Attend,adaptandtransfer:Attentivedeeparchitecturefor
adaptivetransferfrommultiplesources.
arXivpreprintarXiv:1510.02879
,2015.
[160]
TabishRashid,MikayelSamvelyan,ChristianSchroederdeWitt,GregoryFarquhar,
JakobFoerster,andShimonWhiteson.Qmix:Monotonicvaluefunctionfactorisation
fordeepmulti-agentreinforcementlearning.
arXivpreprintarXiv:1803.11485
,2018.
[161]
RTyrrellRockafellar.
Convexanalysis
.Number28.Princetonuniversitypress,1970.
[162]
BernardinoRomera-Paredes,HaneAung,NadiaBianchi-Berthouze,andMassimiliano
Pontil.Multilinearmultitasklearning.In
ICML
,pages1452,2013.
[163]
StéphaneRossandDrewBagnell.E˚cientreductionsforimitationlearning.In
Proceedingsofthethirteenthinternationalconferenceonarti˝cialintelligenceand
statistics
,pages668,2010.
[164]
StephaneRossandJAndrewBagnell.Reinforcementandimitationlearningvia
interactiveno-regretlearning.
arXivpreprintarXiv:1406.5979
,2014.
[165]
StéphaneRoss,Geo˙reyGordon,andDrewBagnell.Areductionofimitationlearning
andstructuredpredictiontono-regretonlinelearning.In
Proceedingsofthefourteenth
internationalconferenceonarti˝cialintelligenceandstatistics
,pages635,2011.
[166]
AndreiARusu,SergioGomezColmenarejo,CaglarGulcehre,GuillaumeDesjardins,
JamesKirkpatrick,RazvanPascanu,VolodymyrMnih,KorayKavukcuoglu,andRaia
Hadsell.Policydistillation.
arXivpreprintarXiv:1511.06295
,2015.
[167]
TomSchaul,JohnQuan,IoannisAntonoglou,andDavidSilver.Prioritizedexperience
replay.
arXivpreprintarXiv:1511.05952
,2015.
290
[168]
MarkSchmidtandNicolasLeRoux.Fastconvergenceofstochasticgradientdescent
underastronggrowthcondition.
arXivpreprintarXiv:1308.6370
,2013.
[169]
JohnSchulman,PhilippMoritz,SergeyLevine,MichaelJordan,andPieterAbbeel.
High-dimensionalcontinuouscontrolusinggeneralizedadvantageestimation.
arXiv
preprintarXiv:1506.02438
,2015.
[170]
JohnSchulman,FilipWolski,PrafullaDhariwal,AlecRadford,andOlegKlimov.
Proximalpolicyoptimizationalgorithms.
arXivpreprintarXiv:1707.06347
,2017.
[171]
AntonSchwaighofer,VolkerTresp,andKaiYu.Learninggaussianprocesskernelsvia
hierarchicalbayes.In
NIPS
,pages1202004.
[172]
KiamTianSeow,NamHaiDang,andDer-HorngLee.Acollaborativemultiagent
taxi-dispatchsystem.
IEEET-ASE
,2010.
[173]
BurrSettles.Activelearningliteraturesurvey.
UniversityofWisconsin,Madison
,
52(55-66):11,2010.
[174]
DavidSilver,AjaHuang,ChrisJMaddison,ArthurGuez,LaurentSifre,GeorgeVan
DenDriessche,JulianSchrittwieser,IoannisAntonoglou,VedaPanneershelvam,Marc
Lanctot,etal.Masteringthegameofgowithdeepneuralnetworksandtreesearch.
Nature
,2016.
[175]
DavidSilver,JulianSchrittwieser,KarenSimonyan,IoannisAntonoglou,AjaHuang,
ArthurGuez,ThomasHubert,LucasBaker,MatthewLai,AdrianBolton,etal.
Masteringthegameofgowithouthumanknowledge.
Nature
,550(7676):354,2017.
[176]
VirginiaSmith,Chao-KaiChiang,MaziarSanjabi,andAmeetSTalwalkar.Federated
multi-tasklearning.In
AdvancesinNeuralInformationProcessingSystems
,pages
2017.
[177]
MarkJohnSomers.Organizationalcommitment,turnoverandabsenteeism:An
examinationofdirectandinteractione˙ects.
JournalofOrganizationalBehavior
,
1995.
[178]
SebastianUStich.Localsgdconvergesfastandcommunicateslittle.
ICLR
,2019.
[179]
AlexanderLStrehl,LihongLi,andMichaelLLittman.Reinforcementlearningin˝nite
mdps:Pacanalysis.
JournalofMachineLearningResearch
,10(No,2009.
[180]
AlexanderLStrehl,LihongLi,EricWiewiora,JohnLangford,andMichaelLLittman.
Pacmodel-freereinforcementlearning.In
Proceedingsofthe23rdinternationalconfer-
enceonMachinelearning
,pages888.ACM,2006.
291
[181]
ThomasStrohmerandRomanVershynin.Arandomizedkaczmarzalgorithmwith
exponentialconvergence.
JournalofFourierAnalysisandApplications
,15(2):262,2009.
[182]
RukhsanaSultana,DebraBoyd-Kimball,HFaiPoon,JianCai,WilliamMPierce,
JonBKlein,MichaelMerchant,WilliamRMarkesbery,andDAllanButter˝eld.Redox
proteomicsidenti˝cationofoxidizedproteinsinalzheimer'sdiseasehippocampusand
cerebellum:anapproachtounderstandpathologicalandbiochemicalalterationsinad.
Neurobiologyofaging
,2006.
[183]
QianSun,RitaChattopadhyay,SethuramanPanchanathan,andJiepingYe.Atwo-
stageweightingframeworkformulti-sourcedomainadaptation.In
NIPS
,pages
2011.
[184]
WenSun,ArunVenkatraman,Geo˙reyJGordon,ByronBoots,andJAndrewBagnell.
Deeplyaggrevated:Di˙erentiableimitationlearningforsequentialprediction.In
Proceedingsofthe34thInternationalConferenceonMachineLearning-Volume70
,
pagesJMLR.org,2017.
[185]
PeterSunehag,GuyLever,AudrunasGruslys,WojciechMarianCzarnecki,Vinicius
Zambaldi,MaxJaderberg,MarcLanctot,NicolasSonnerat,JoelZLeibo,KarlTuyls,
etal.Value-decompositionnetworksforcooperativemulti-agentlearning.
arXivpreprint
arXiv:1706.05296
,2017.
[186]
RichardSSuttonandAndrewGBarto.
Reinforcementlearning:Anintroduction
,
volume1.MITpressCambridge,1998.
[187]
RichardSSuttonandAndrewGBarto.
Reinforcementlearning:Anintroduction
.MIT
press,2018.
[188]
UmarSyedandRobertESchapire.Areductionfromapprenticeshiplearningto
classi˝cation.In
AdvancesinNeuralInformationProcessingSystems
,pages
2010.
[189]
ArdiTampuu,TambetMatiisen,DorianKodelja,IlyaKuzovkin,KristjanKorjus,Juhan
Aru,JaanAru,andRaulVicente.Multiagentcooperationandcompetitionwithdeep
reinforcementlearning.
PloSone
,12(4):e0172395,2017.
[190]
MingTan.Multi-agentreinforcementlearning:Independentvs.cooperativeagents.In
ICML
,pages337,1993.
[191]
MatthewETaylorandPeterStone.Transferlearningforreinforcementlearning
domains:Asurvey.
JMLR
,2009.
[192]
StefanJTeipel,WolframBayer,GeneEAlexander,YorkZebuhr,DianeTeichberg,
LukaKulic,MarcBSchapiro,Hans-JürgenMöller,StanleyIRapoport,andHarald
292
Hampel.Progressionofcorpuscallosumatrophyinalzheimerdisease.
Archivesof
Neurology
,2002.
[193]
DevinderThapa,In-SungJung,andGi-NamWang.Agentbaseddecisionsupport
systemusingreinforcementlearningunderemergencycircumstances.In
International
ConferenceonNaturalComputation
,pages892.Springer,2005.
[194]
RobertTibshirani.Regressionshrinkageandselectionviathelasso.
Journalofthe
RoyalStatisticalSociety.SeriesB(Methodological)
,pages88,1996.
[195]
RyotaTomioka,KoheiHayashi,andHisashiKashima.Estimationoflow-ranktensors
viaconvexoptimization.
arXivpreprintarXiv:1010.0789
,2010.
[196]
RyotaTomiokaandTaijiSuzuki.Convextensordecompositionviastructuredschatten
normregularization.In
NIPS
,pages1332013.
[197]
AhmedTouati,Pierre-LucBacon,DoinaPrecup,andPascalVincent.Convergent
tree-backupandretracewithfunctionapproximation.
arXivpreprintarXiv:1705.09322
,
2017.
[198]
PaulTseng.Convergenceofablockcoordinatedescentmethodfornondi˙erentiable
minimization.
Journalofoptimizationtheoryandapplications
,2001.
[199]
BerwinATurlach,WilliamNVenables,andStephenJWright.Simultaneousvariable
selection.
Technometrics
,2005.
[200]
EricTzeng,JudyHo˙man,TrevorDarrell,andKateSaenko.Simultaneousdeeptransfer
acrossdomainsandtasks.In
ICCV
,pages4076,2015.
[201]
Uber.
[202]
LeslieGValiant.Atheoryofthelearnable.In
Proceedingsofthesixteenthannual
ACMsymposiumonTheoryofcomputing
,pages45.ACM,1984.
[203]
HadoVanHasselt,ArthurGuez,andDavidSilver.Deepreinforcementlearningwith
doubleq-learning.In
AAAI
,volume2,page5.Phoenix,AZ,2016.
[204]
VladimirVapnik.
Estimationofdependencesbasedonempiricaldata
.SpringerScience
&BusinessMedia,2006.
[205]
JianyuWangandGauriJoshi.Cooperativesgd:Auni˝edframeworkforthedesignand
analysisofcommunication-e˚cientsgdalgorithms.
arXivpreprintarXiv:1808.07576
,
2018.
293
[206]
ShiqiangWang,Ti˙anyTuor,TheodorosSalonidis,KinKLeung,ChristianMakaya,
TingHe,andKevinChan.Adaptivefederatedlearninginresourceconstrainededge
computingsystems.
IEEEJournalonSelectedAreasinCommunications
,
1221,2019.
[207]
ZiruiWang,ZihangDai,BarnabásPóczos,andJaimeCarbonell.Characterizingand
avoidingnegativetransfer.In
ProceedingsoftheIEEEConferenceonComputerVision
andPatternRecognition
,pages1122019.
[208]
ZiyuWang,VictorBapst,NicolasHeess,VolodymyrMnih,RemiMunos,Koray
Kavukcuoglu,andNandodeFreitas.Samplee˚cientactor-criticwithexperiencereplay.
arXivpreprintarXiv:1611.01224
,2016.
[209]
ZiyuWang,TomSchaul,MatteoHessel,HadoVanHasselt,MarcLanctot,andNando
DeFreitas.Duelingnetworkarchitecturesfordeepreinforcementlearning.
arXiv
preprintarXiv:1511.06581
,2015.
[210]
MalcolmWare,EibeFrank,Geo˙reyHolmes,MarkHall,andIanHWitten.Interactive
machinelearning:lettingusersbuildclassi˝ers.
INTJHUM-COMPUTST
,
292,2001.
[211]
ChongWei,YinhuWang,XuedongYan,andChunfuShao.Look-aheadinsertionpolicy
forashared-taxisystembasedonreinforcementlearning.
IEEEAccess
,2017.
[212]
RonaldJWilliams.Simplestatisticalgradient-followingalgorithmsforconnectionist
reinforcementlearning.
Machinelearning
,1992.
[213]
BlakeEWoodworth,JialeiWang,AdamSmith,BrendanMcMahan,andNatiSrebro.
Graphoraclemodels,lowerbounds,andgapsforparallelstochasticoptimization.In
Advancesinneuralinformationprocessingsystems
,pages8506,2018.
[214]
StephenJWright,RobertDNowak,andMárioATFigueiredo.Sparsereconstruction
byseparableapproximation.
SignalProcessing,IEEETransactionson
,
2009.
[215]
SichengXiong,JavadAzimi,andXiaoliZFern.Activelearningofconstraintsfor
semi-supervisedclustering.
TKDE
,2014.
[216]
JianpengXu,Pang-NingTan,andLifengLuo.Orion:Onlineregularizedmulti-task
regressionanditsapplicationtoensembleforecasting.In
ICDM
,pages
IEEE,2014.
[217]
YaodongYang,RuiLuo,MinneLi,MingZhou,WeinanZhang,andJunWang.Mean
˝eldmulti-agentreinforcementlearning.
ICML
,2018.
294
[218]
HaoYu,RongJin,andSenYang.Onthelinearspeedupanalysisofcommunication
e˚cientmomentumsgdfordistributednon-convexoptimization.
ICML
,2019.
[219]
HaoYu,SenYang,andShenghuoZhu.Parallelrestartedsgdwithfasterconvergence
andlesscommunication:Demystifyingwhymodelaveragingworksfordeeplearning.
In
ProceedingsoftheAAAIConferenceonArti˝cialIntelligence
,volume33,pages
2019.
[220]
TianheYu,SaurabhKumar,AbhishekGupta,SergeyLevine,KarolHausman,and
ChelseaFinn.Gradientsurgeryformulti-tasklearning.
arXivpreprintarXiv:2001.06782
,
2020.
[221]
KunYuan,BichengYing,andAliHSayed.Onthein˛uenceofmomentumacceleration
ononlinelearning.
TheJournalofMachineLearningResearch
,2016.
[222]
LeiYuan,JunLiu,andJiepingYe.E˚cientmethodsforoverlappinggrouplasso.In
NIPS
,pages360,2011.
[223]
AndreaZanetteandEmmaBrunskill.Tighterproblem-dependentregretboundsin
reinforcementlearningwithoutdomainknowledgeusingvaluefunctionbounds.
arXiv
preprintarXiv:1901.00210
,2019.
[224]
ChiyuanZhang,SamyBengio,MoritzHardt,BenjaminRecht,andOriolVinyals.
Understandingdeeplearningrequiresrethinkinggeneralization.
arXivpreprint
arXiv:1611.03530
,2016.
[225]
YiZhangandJe˙GSchneider.Learningmultipletaskswithasparsematrix-normal
penalty.In
NIPS
,pages25502010.
[226]
YuZhangandQiangYang.Asurveyonmulti-tasklearning.
arXivpreprint
arXiv:1707.08114
,2017.
[227]
YuZhangandDit-YanYeung.Aconvexformulationforlearningtaskrelationshipsin
multi-tasklearning.
arXivpreprintarXiv:1203.3536
,2012.
[228]
YuZhang,Dit-YanYeung,andQianXu.Probabilisticmulti-taskfeatureselection.In
NIPS
,pages25592010.
[229]
ZhanpengZhang,PingLuo,ChenChangeLoy,andXiaoouTang.Faciallandmark
detectionbydeepmulti-tasklearning.In
ECCV
,pages08.Springer,2014.
[230]
LianminZheng,JiachengYang,HanCai,WeinanZhang,JunWang,andYongYu.Ma-
gent:Amany-agentreinforcementlearningplatformforarti˝cialcollectiveintelligence.
arXivpreprintarXiv:1712.00600
,2017.
295
[231]
FanZhouandGuojingCong.Ontheconvergencepropertiesofa
k
-stepaveraging
stochasticgradientdescentalgorithmfornonconvexoptimization.
IJCAI
,2018.
[232]
JiayuZhou,JianhuiChen,andJiepingYe.Clusteredmulti-tasklearningviaalternating
structureoptimization.In
NIPS
,pages10,2011.
[233]
JiayuZhou,JianhuiChen,andJiepingYe.Malsar:Multi-tasklearningviastructural
regularization.
ArizonaStateUniversity
,2011.
[234]
JiayuZhou,JunLiu,VaibhavANarayan,JiepingYe,Alzheimer'sDiseaseNeuroimaging
Initiative,etal.Modelingdiseaseprogressionviamulti-tasklearning.
NeuroImage
,
2013.
[235]
JiayuZhou,LeiYuan,JunLiu,andJiepingYe.Amulti-tasklearningformulationfor
predictingdiseaseprogression.In
SIGKDD
,pages822.ACM,2011.
[236]
YukeZhu,RoozbehMottaghi,EricKolve,JosephJLim,AbhinavGupta,LiFei-Fei,and
AliFarhadi.Target-drivenvisualnavigationinindoorscenesusingdeepreinforcement
learning.
arXivpreprintarXiv:1609.05143
,2016.
296