EVOLUTIONARY MULTI-OBJECTIVE BI-LEVEL OPTIMIZATION FOR EFFICIENT DEEP NEURAL NETWORK ARCHITECTURE DESIGN

By

Zhichao Lu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering - Doctor of Philosophy

2020

ABSTRACT

EVOLUTIONARY MULTI-OBJECTIVE BI-LEVEL OPTIMIZATION FOR EFFICIENT DEEP NEURAL NETWORK ARCHITECTURE DESIGN

By

Zhichao Lu

Deep convolutional neural networks (CNNs) are the backbones of deep learning (DL) paradigms for numerous vision tasks, including object recognition, detection, segmentation, etc. Early advancements in CNN architectures were primarily driven by human expertise and elaborate design. Recently, neural architecture search (NAS) was proposed with the aim of automating the network design process and generating task-dependent architectures. While existing approaches have achieved competitive performance, they are still impractical for real-world deployment for three reasons: (1) the generated architectures are solely optimized for predictive performance, resulting in inefficiency in utilizing hardware resources, i.e., energy consumption, latency, memory size, etc.; (2) the search processes require vast computational resources in most approaches; (3) most existing approaches require one complete search for each deployment specification of hardware or objective requirement.

In this dissertation, we propose an efficient evolutionary NAS algorithm to address the aforementioned limitations. In particular, we first introduce Pareto-optimization to NAS, leading to a diverse set of architectures, trading off multiple objectives, being obtained simultaneously in one run. We then improve the algorithm's search efficiency through surrogate models. We further integrate a transfer learning scheme into the algorithm that allows a new task to leverage previous search efforts, which improves both the performance of the obtained architectures and the search efficiency. Therefore, the proposed algorithm enables an automated and streamlined process to efficiently generate task-specific custom neural network models that are competitive under multiple objectives.

TABLE OF CONTENTS

LIST OF TABLES ........ vi
LIST OF FIGURES ........ viii
Chapter 1 Introduction ........ 1
  1.1 Motivation ........ 1
  1.2 Challenges ........ 3
  1.3 Dissertation Overview ........ 5
Chapter 2 Background ........ 6
  2.1 Neural Architecture Search ........ 6
  2.2 Evolutionary Bi-level Optimization ........ 8
  2.3 Evolutionary Multi-objective Optimization ........ 11
  2.4 Evolutionary Surrogate-Assisted Optimization ........ 14
  2.5 Related Work ........ 15
Chapter 3 Multi-objective Evolution of Deep Neural Network Architectures ........ 19
  3.1 Proposed Approach ........ 20
    3.1.1 Search Space and Encoding ........ 22
    3.1.2 Algorithm Procedures ........ 24
  3.2 Experiments and Results ........ 30
    3.2.1 Baselines ........ 30
    3.2.2 Implementation Details ........ 31
    3.2.3 Effectiveness ........ 32
    3.2.4 Efficiency ........ 36
    3.2.5 Observations on Evolved Architectures ........ 38
  3.3 Ablation Study ........ 40
    3.3.1 Dataset for Search ........ 40
    3.3.2 Proxy Models ........ 41
  3.4 Conclusion ........ 42
Chapter 4 Surrogate Assisted Evolution of Deep Neural Network Architectures ........ 44
  4.1 Proposed Approach ........ 46
    4.1.1 Overall Algorithm Description ........ 46
    4.1.2 Speeding Up Upper Level Optimization ........ 49
    4.1.3 Speeding Up Lower Level Optimization ........ 50
  4.2 Experiments and Results ........ 50
    4.2.1 Performance of the Surrogate Predictors ........ 51
    4.2.2 Search Efficiency ........ 52
    4.2.3 Results on Standard Datasets ........ 53
  4.3 Scalability of MSuNAS ........ 56
    4.3.1 Types of Datasets ........ 56
    4.3.2 Number of Objectives ........ 57
  4.4 Conclusion ........ 59
Chapter 5 Transfer Learning Assisted Evolution of Deep Neural Network Architectures ........ 60
  5.1 Proposed Approach ........ 62
    5.1.1 Search Space and Encoding ........ 63
    5.1.2 Accuracy Predictor ........ 65
    5.1.3 Many-Objective Evolutionary Search ........ 68
    5.1.4 Many-Objective Selection ........ 71
    5.1.5 Supernet Adaptation ........ 73
    5.1.6 Choosing Best Trade-off Solution ........ 75
  5.2 Experiments and Results ........ 76
    5.2.1 Datasets ........ 76
    5.2.2 Supernet Preparation ........ 77
    5.2.3 ImageNet ........ 78
    5.2.4 Architecture Transfer ........ 79
    5.2.5 Scalability to Objectives ........ 84
      5.2.5.1 Three Objectives ........ 85
      5.2.5.2 Twelve Objectives ........ 86
  5.3 Ablation Study ........ 88
    5.3.1 Accuracy Predictor Performance ........ 88
    5.3.2 Search Efficiency ........ 90
    5.3.3 Analysis of Crossover ........ 91
    5.3.4 Analysis of Mutation Hyperparameters ........ 92
    5.3.5 Effectiveness of Supernet Adaptation ........ 93
  5.4 Conclusion ........ 95
Chapter 6 Applications of Evolutionary Neural Architecture Search ........ 96
  6.1 Medical Imaging ........ 96
  6.2 Robustness Against Adversarial Attack ........ 100
  6.3 Multi-view Car Alignment ........ 103
  6.4 Non-learnable Layer ........ 106
Chapter 7 Conclusion ........ 108
  7.1 Contributions ........ 108
  7.2 Discussion and Future Work ........ 110
  7.3 Concluding Remarks ........ 111
BIBLIOGRAPHY ........ 112

LIST OF TABLES

Table 3.1: Summary of Hyper-parameter Settings ........ 32
Table 3.2: Comparison between NSGANet and other baseline methods. NSGANet architectures are obtained by searching on CIFAR-100. NSGANet results on CIFAR-10 and ImageNet are obtained by re-training the weights with images from their respective datasets. Ratio-to-NSGANet indicates the resulting savings on #Params and #FLOPs. The search cost is compared in GPU-days, calculated by multiplying the number of GPU cards deployed with the execution time in days ........ 35
Table 4.1: Comparing the relative search efficiency of MSuNAS to other single-objective methods: "#Model" is the total number of architectures evaluated during search, "#Epochs" is the number of epochs used to train each architecture during search.
† and ‡ denote training epochs with and without a supernet to warm-start the weights, respectively ........ 52
Table 4.2: ImageNet [1]: MSuNAS comparison with manual and automated design of efficient networks. Models are grouped into sections for better visualization. Our results are underlined and the best result in each section is in bold. CPU latency (batch size = 1) is measured on Intel i7-8700K and GPU latency (batch size = 64) is measured on 1080Ti. † The search cost excludes the supernet training time cost. ‡ Estimated based on the claim that PNAS is 8x faster than NASNet from [2] ........ 54
Table 4.3: Non-standard Datasets for MSuNAS ........ 56
Table 5.1: Hyperparameter Settings ........ 77
Table 5.2: Benchmark Datasets for Evaluation ........ 77
Table 5.3: ImageNet-1K [1]: NATNets comparison with manual and automated design of efficient convolutional neural networks. Models are grouped into sections for better visualization. Our results are underlined and the best result in each section is in bold. CPU latency (batch size = 1) is measured on Intel i7-8700K and GPU latency (batch size = 64) is measured on 1080Ti. "WS" stands for weight sharing. All methods are under single crop and single model condition, without any additional data ........ 80
Table 5.4: NAT model performance corresponding to Figure 5.9 ........ 84
Table 5.5: Comparing the relative search efficiency of NAT to other methods. "-" denotes not applicable, "WS" stands for weight sharing and "SMBO" stands for sequential model-based optimization [3]. † is taken from [4], ‡ estimated based on the # of models evaluated during search (20K in [5], 1.2K in [2], 27K in [6]). * denotes a re-ranking stage where the top 100-250 models undergo extended training and evaluation for 300 epochs before selecting the final model ........ 90
Table 5.6: Effect of fine-tuning for 75 epochs on ImageNet ........ 93
Table 6.1: AUROC on ChestX-Ray14 testing set ........ 98
Table 6.2: Preliminary results on the CMU-Car alignment [7]. Notably, our proposed algorithm is able to find architectures with competitive performance while having 2x less parameters when compared to a human-designed architecture [8] ........ 105

LIST OF FIGURES

Figure 1.1: A visualization of the neural network architectures that have been proposed in the deep learning community [9]. The diversity in neural networks implies the importance of architectural choices ........ 2
Figure 2.1: General sketch of a bi-level problem taken from [10] ........ 9
Figure 2.2: A principle of Pareto-dominance based multi-objective optimization procedure is illustrated ........ 12
Figure 2.3: An illustration of hypervolume under bi-objective optimization ........ 13
Figure 3.1: Overview: Given a dataset and objectives, NSGANet designs a set of custom architectures spanning the trade-off front. NSGANet estimates the performance of an architecture through its proxy model, optimized by Stochastic Gradient Descent (SGD) in the lower level. The search proceeds in exploration via genetic operations, followed by exploitation via distribution estimation. See Algorithm 1 for pseudocode; colors are in correspondence ........ 20
Figure 3.2: Schematic of the NSGANet search space motivated from [5]: (a) An architecture is composed of stacked blocks. (b) The number of channels in each block is gradually increased with depth of the network. (c) Each block is composed of five nodes, where each node is a two-branched computation applied to outputs from either previous blocks or previous nodes within the same block. (d) A graphical visualization of (c) ........ 24
Figure 3.3: Illustration of node-level crossover ........ 26
Figure 3.4: Input and Operation Mutation: Dashed line boxes with red color highlight the mutation. h_{i-2}, h_{i-1} are outputs from the previous-previous and previous blocks, respectively.
h_i^(3) indicates the output from node 3 of the current block ........ 27
Figure 3.5: Illustrative example of the BN-based exploitation step in NSGANet: given past successful architectures, we construct a BN relating the dependencies between the four nodes inside the Normal block. A new architecture is then sampled from this BN and proceeds forward for performance estimation ........ 28
Figure 3.6: Examples from CIFAR-10, CIFAR-100 and ImageNet datasets. Images in each row belong to the same class with label names shown to the left ........ 31
Figure 3.7: Visualization of block-level structures for different architectures. The Normal and Reduction blocks are shown in the first and second rows, respectively, for NSGANet architectures. Examples of blocks that are designed manually by experts [11, 12] and from other peer methods [5] are also included in (d)-(f) for comparison ........ 33
Figure 3.8: (a) Accuracy vs. FLOPs of all architectures generated by NSGANet during the course of evolution on CIFAR-100. A subset of non-dominated architectures (see text), named NSGANet-A0 to A4, are re-trained thoroughly and compared with other peer methods in (b) ........ 34
Figure 3.9: Transferability of the NSGANet architectures to (a) CIFAR-10, and (b) ImageNet. We compare Top-1 Accuracy vs. Computational Complexity. Architectures joined by dashed lines are from multi-objective algorithms ........ 36
Figure 3.10: Search efficiency comparison between NSGANet and other baselines in terms of (a) HV, and (b) the required compute time in GPU-Days. The search cost is measured on CIFAR-10 for most methods, except NSGANet and Block-QNN [13], where the CIFAR-100 dataset is used for search ........ 37
Figure 3.11: Post Search Analysis: (a) Frequency of each operation selected during the search. (b) Effect of the number of input channels that are concatenated on the validation accuracy. More channels result in more reliable performance improvement but at the cost of computational efficiency ........ 39
Figure 3.12: (a) Mean accuracy distribution of randomly generated architectures and architectures from peer NAS methods on four datasets. Correlation in performance (red lines) vs. savings in gradient descent wall time (blue boxes) by reducing (b) the number of channels in layers, (c) the number of layers, and (d) the number of epochs to train.
Note that (b), (c) and (d) have two y-axis labels corresponding to the color of the lines ........ 41
Figure 4.1: Overview: Given a dataset and objectives, MSuNAS obtains a task-specific set of models that are competitive in all objectives with high search efficiency. It comprises two surrogates, one at the upper level to improve sample efficiency and one at the lower level, through a supernet, to improve weight learning efficiency ........ 45
Figure 4.2: Search Space: A candidate architecture comprises five computational blocks. Parameters we search for include image resolution, the number of layers (L) in each block, and the expansion rate (e) and the kernel size (k) in each layer ........ 47
Figure 4.3: A sample run of MSuNAS on ImageNet: In each iteration, accuracy-prediction surrogate models S_f are constructed from an archive of previously evaluated architectures (a). New candidate architectures (brown boxes in (b)) are obtained by solving the auxiliary single-level multi-objective problem F~ = {S_f, C} (line 10 in Algo 3). A subset of the candidate architectures is chosen to diversify the Pareto front (c)-(d). The selected candidate architectures are then evaluated and added to the archive (e). At the conclusion of the search, we report the non-dominated architectures from the archive. The x-axis in all subfigures is #MAdds ........ 48
Figure 4.4: Performance of the four surrogate model predictors considered. Top row compares the Spearman rank-order correlation coefficient as the number of training samples increases. Bottom row visualizes the true vs. predicted accuracy under 500 training samples. The root mean square error (RMSE) of the Adaptive Switching (AS) strategy is also provided ........ 51
Figure 4.5: (a) A sketch visualizing the hypervolume metric [14]. In the case of two objectives, it measures the dominated area achieved by a multi-objective algorithm. A larger hypervolume indicates a better Pareto front achieved. (b) Comparing the relative search efficiency of MSuNAS, NSGANet [15] and random search under a bi-objective setup. We plot the mean hypervolume (across 5 runs) on ImageNet, CIFAR-10 and CIFAR-100 against the number of architectures evaluated. Colored regions denote the standard deviation of the mean ........ 53
Figure 4.6: Accuracy vs. Efficiency: Top row compares predictive accuracy vs. GPU latency on a batch of 64 images. Bottom row compares predictive accuracy vs. the number of multi-adds in millions. Models from multi-objective approaches are joined with lines. Our models are obtained by directly searching on the respective datasets. In most problems, MSuNAS finds more accurate solutions with fewer parameters ........ 55
Figure 4.7: Performance of the set of task-specific models obtained by MSuNAS on three different types of non-standard datasets, comparing to SOTA from transfer learning [16, 17] and semi-/un-supervised learning [18, 19] ........ 57
Figure 4.8: Scalability of MSuNAS to different numbers and types of objectives: optimizing (a) a scalarized single-objective on ImageNet; (b) five objectives including accuracy, Params, MAdds, CPU and GPU latency, simultaneously. (c) Post-optimal analysis on the architectures that are non-dominated according to the different objectives ........ 58
Figure 5.1: Overview: Given a dataset and objectives to optimize, NAT designs custom architectures spanning the objective trade-off front. NAT comprises two main components, supernet adaptation and evolutionary search, that are iteratively executed. NAT also uses an online accuracy predictor model to improve its computational efficiency ........ 61
Figure 5.2: The architectures in our search space are variants of the MobileNet-v2 family of models [12, 17, 20, 21]. (a) Each network consists of five stages. Each stage has two to four layers. Each layer is an inverted residual bottleneck block. The search space includes input image resolution (R), width multiplier (W), the number of layers in each stage, the # of output channels (expansion ratio E) of the first 1x1 convolution, and the kernel size (K) of the depth-wise separable convolution in each layer. (b) Networks are represented as 22-integer strings, where the first two correspond to resolution and width multiplier, and the rest correspond to the layers. Each value indicates a choice, e.g., the third integer (L1) taking a value of "1" corresponds to using an expansion ratio of 3 and a kernel size of 3 in layer 1 of stage 1 ........ 64
Figure 5.3: Top Path: A typical process of evaluating an architecture in NAS algorithms. Bottom Path: The accuracy predictor aims to bypass the time-consuming components for evaluating a network's performance by directly regressing its accuracy f from a (the architecture in the encoded space) ........ 66
Figure 5.4: Accuracy predictor performance as a function of training samples. For each model, we show the mean and standard deviation of the Spearman rank correlation on 11 datasets (Table 5.2). The size of the RBF ensemble is 500 ........ 67
Figure 5.5: (a) Crossover Operator: new offspring architectures are created by recombining integers from two parent architectures. The probability of choosing from either one of the parents is equal. (b) Mutation Operator: histograms showing the probabilities of mutated values with the current value at 5 under different hyperparameter η_m settings ........ 69
Figure 5.6: (a) An example (assuming minimization of all objectives) of the selection process in Algo 7: We first create reference directions Z by joining reference points with the ideal solution (origin). Then, through non_dominated_sort, three non-dominated solutions are identified and associated with reference directions Z^(1), Z^(3) and Z^(5). We then select the remaining solutions by the orthogonal distances to the reference directions with no associated solutions, i.e., Z^(2) and Z^(4). This selection is scalable to a larger # of objectives. A tri-objective example is shown in (b) ........ 72
Figure 5.7: ImageNet Architectures from Trade-off Front ........ 80
Figure 5.8: MAdds vs. ImageNet Accuracy. NATNets outperform other models in both objectives. In particular, NAT-M4 achieves a new state-of-the-art top-1 accuracy of 80.5% under mobile setting (≤ 600M MAdds). NAT-M1 improves MobileNet-v3 top-1 accuracy by 2.3% with similar #MAdds ........ 81
Figure 5.9: MAdds vs. Accuracy trade-off curves comparing NAT with existing architectures on a diverse set of datasets. The datasets are arranged in ascending order of training set size. Methods shown in the legend pre-train on ImageNet and fine-tune the weights on the target dataset.
Methods in blue train from scratch or use external training data ........ 82
Figure 5.10: Transfer architectures (350M MAdds) on ten datasets ........ 83
Figure 5.11: Non-dominated architectures to {top-1 accuracy, #MAdds} obtained by NAT on different datasets ........ 84
Figure 5.12: Top row: NATNets obtained from tri-objective search to maximize ImageNet top-1 accuracy, minimize model size (#Params), and minimize {#MAdds, CPU latency, GPU latency}, from left to right. Bottom row: 2D projections from the above 3D scatter, showing top-1 accuracy vs. each of the four efficiency-related measurements. To better visualize (the comparison with MobileNet-v3 [21] and MUXNet [22]), partial solutions from the non-dominated frontiers are shown. All top-1 accuracies shown are without fine-tuning ........ 85
Figure 5.13: Left: Parallel Coordinate Plot (PCP) where each vertical bar is an objective and each line is a non-dominated architecture achieved by NAT from a 12-objective optimization of minimizing #MAdds and maximizing accuracy on the 11 datasets in Table 5.2. The model with the best trade-off (see Section 5.1.6 for details) is highlighted in dark blue. Right: 1-on-1 comparison between the selected NATNet (top-ranked in trade-off) and representative peer models on top-1 accuracy on various datasets. A method with larger area is better ........ 87
Figure 5.14: Trade-off values of all 45 architectures achieved from the 12-objective search presented in Figure 5.13 ........ 88
Figure 5.15: Top row: Spearman rank correlation between predicted accuracy and true accuracy of different surrogate models across many datasets. Each accuracy predictor is constructed from 250 samples (trained architectures). Error bars show mean and standard deviation over ten runs. Bottom row: Goodness of fit visualization of the RBF ensemble, the best accuracy predictor ........ 89
Figure 5.16: Ablation study on the crossover operator: (a) the median performance from eleven runs of our evolutionary algorithm with and without the crossover operator. (b) the median performance deteriorates as the crossover probability reduces from 0.9 to 0.2 ........ 91
Figure 5.17: Hyperparameter study on (a) mutation probability p_m and (b) mutation index parameter η_m. For each study, we run NAT eleven times on four datasets to maximize top-1 accuracy and report the median performance ........ 93
Figure 5.18: Comparing the performance of adapting supernet, adapting subnet and additional fine-tuning under a bi-objective search setup on four datasets. Details are provided in Section 5.3.5 ........ 93
Figure 6.1: Visualization of the ChestX-Ray14 dataset. Examples showing eight common thoracic diseases are from [23] ........ 97
Figure 6.2: NSGANet-X multi-label classification performance on ChestX-Ray14 (a) and the class-wise mean test AUROC comparison with peer methods (b) ........ 98
Figure 6.3: Examples of the class activation map [24] of NSGANet-X, highlighting the class-specific discriminative regions. The ground truth bounding boxes are plotted over the heatmaps ........ 99
Figure 6.4: Robustness Objective: We define a robustness objective under the FGSM [25] attack as follows: 1) obtain classification accuracy on adversarial images generated by FGSM as we vary ε; 2) compute the area under the curve (blue line), approximated by the area of the green region; 3) normalize the robustness value to the rectangular area formed by the Ideal point and Nadir point, where the ideal point is defined at 100% classification accuracy at the maximum ε value, and the nadir point is defined as the accuracy of random guessing at ε = 0 (clean images) ........ 101
Figure 6.5: Trade-off frontier of the robustness experiments. Color indicates the generation (iteration) at which a network architecture is eliminated from the surviving parent population. The size of each point is proportional to the network architecture's number of trainable parameters. We note that networks from latter generations form the Pareto front (dark blue points) ........ 102
Figure 6.6: (a) Examples of the computational blocks discovered with high classification accuracy. For these networks, the mean accuracy and robustness objectives are 0.8543 and 0.0535, respectively; (b) Examples of the computational blocks discovered with high robustness against FGSM attack; the mean accuracy and robustness objectives are 0.8415 and 0.1036, respectively; (c) Examples of the computational blocks discovered along the Pareto front that provide an efficient trade-off between classification accuracy and adversarial robustness. They are arranged in the order of descending accuracy and ascending robustness ........ 103
Figure 6.7: Parallel coordinate plots of the 1,200 network architectures sampled by NSGANet. Each line represents a network architecture; each vertical line is an attribute associated with the network. (a) Networks that have the skip connection bit inactive; we can see that none of them have good measurement on robustness against adversarial attacks. (b) Networks that have the skip connection bit active. This skip connection bit refers to the connection that goes past all computation within a phase, as a normal residual connection would. When the skip connection is active, the networks cover the full range of adversarial robustness ........ 104
Figure 6.8: Spatial-resolution-change path of the NSGANet-C0 architecture. Each circle encodes a residual module [11]. Circles colored in white are always executed at the beginning. The arrows and blue circles are parts of the search decisions. The total number of layers (L) is set to be nine ........ 105
Figure 6.9: Preliminary experiment on constructing DNN architectures using layers with non-learnable weights. Each layer is composed of several non-learnable operations in parallel. We manually constructed a handful of such layers and evaluate them on CIFAR-10 (a). An example configuration, chosen based on the trade-off between accuracy and the number of parameters, is shown in (b). We validate the effectiveness of non-learnable layers by replacing the original 3x3 convolution in different ResNet models with the chosen configuration on both CIFAR-100 and ImageNet (c). Evidently, a layer with non-learnable weights is capable of yielding competitive classification performance while being computationally efficient as opposed to conventional learnable convolutions ........ 107

Chapter 1

Introduction

The process of computers performing tasks, or automation, has been at the leading forefront of technological advances. Automation in engineering design, manufacturing, and logistics has dramatically transformed how goods are made and improved productivity by orders of magnitude in the recent past. However, applications of machine learning (ML) still largely rely on human expertise. Given the recent improvements in computing hardware (e.g., graphics processing units (GPUs)) and services (e.g., cloud/edge computing), it is clear that the availability of skilled researchers and practitioners is the bottleneck, rather than computational power. Automating the creation of ML applications is a plausible avenue to alleviate this bottleneck and significantly accelerate the progress of ML technologies. Through establishing an algorithmic foundation to automate ML applications, this dissertation hopes to contribute to this vision.

1.1 Motivation

Recent years have witnessed an explosive growth in ML applications, driven by both advancements in computing hardware and the amount of high-quality (annotated) data. In particular, a special form of ML, known as deep learning (DL), has emerged. DL refers to the use of deep (in terms of number of layers) neural networks (DNNs) to automatically extract discriminative feature representations from data in an end-to-end fashion. Within a few years of existence, DNNs have surpassed the state-of-the-art in a wide range of benchmarks that were previously dominated by other machine learning algorithms. These benchmarks spread from the computer vision [26], speech recognition [27], to natural language processing [28] domains.
Figure 1.1: A visualization of the neural network architectures that have been proposed in the deep learning community [9]. The diversity in neural networks implies the importance of architectural choices.

One of the key driving forces behind this success is the introduction of many task-specific DNN architectures (see Figure 1.1 for a visualization). In the domain of computer vision (the main focus of this dissertation), the first wave of architectural development focused solely on improving the predictive performance of convolutional neural networks (CNNs), including VGG [29], GoogLeNet [30], ResNet [11], etc. These models progressively increase depth and the number of parallel branches. The second wave focuses on improving networks' computational efficiency while trading off predictive performance to a small extent. Models in this category include DenseNet [31], ShuffleNet [32], MobileNet-v2 [12], etc. These models use sophisticated layer connections (e.g., dense connection [31], inverted residual connection [12]) and layer arrangements (e.g., replacing a conventional convolution layer with a depth-wise separable convolution followed by a point-wise convolution [12]) to squeeze the model size and maintain performance at the same time. Due to a lack of theoretical understanding of neural network architectures, these developments cost years of painstaking efforts and elaborate design from empirically testing network models on benchmark tasks. Furthermore, it is becoming apparent that such a manual process of adjusting network architectures is not only tedious, but also sub-optimal. An algorithmic approach to automate the design of network architectures is essential to sustain the steady growth in applications of CNNs.

1.2 Challenges

Towards designing an automatic method to search architectures that can scale with the complexity of the search space and the number of objectives required by vast deployment scenarios, there are a few key challenges that need to be addressed.

Non-differentiable: One major issue is that the objective landscape with respect to the architecture is not continuous; thereby, there is no gradient information for such a search. This is unlike the training process for DNNs, in which the gradient of the loss function (e.g., cross-entropy loss in the case of classification) can be computed with respect to the network parameters (weights) in order to minimize the loss [33]. This computation is possible because the network primitives (e.g., convolution, activation functions, etc.) are differentiable; hence, the loss can be mathematically formulated as a function of the weights in a neural network, i.e., it is differentiable. However, no such relationship exists between the network's architecture topology and the loss function. Without any gradient information, the search for a suitable DNN becomes a black-box optimization problem where only the value of the objective function (i.e., the performance of a DNN on a task) can be computed. Furthermore, the search space for network architectures is discrete (including both ordinal and nominal variables), and the arrangement of layers in a DNN can take on any arbitrary graph structure. Any practical automated method for optimizing network structure must be able to handle such a large and complex search space efficiently.
Multiple objectives: In addition to the need for high-performance results from DNNs, real-world applications of these models demand network architectures with different complexities for different deployment scenarios, e.g., IoT systems, mobile devices, embedded devices, automotives, cloud servers, etc. These computing devices are often constrained by a variety of different hardware resources, such as power consumption, available memory, and latency constraints, to name a few. These hardware- and efficiency-related objectives are oftentimes in conflict with the predictive performance of DNNs, leading to the need for multi-objective optimization, as ideal solutions that optimize all objectives simultaneously do not exist. Automated methods under such a scenario need to balance the trade-offs among these competing objectives and identify a diverse set of architectures to represent the trade-off frontier.

Expensive evaluation: In order for an automated architecture search algorithm to optimize DNNs, it must search through thousands, if not hundreds of thousands, of network topologies. Unfortunately, DNNs are computationally very intensive and require specialized hardware such as graphics processing units (GPUs) to train. For example, it takes one month to evaluate 27K DNN architectures with one hundred GPUs on the CIFAR-10 [34] dataset. Unsurprisingly, the total number of GPU hours/days required is extremely high (up to tens of thousands) for a single experimental run. Attempting to train thousands of networks equally and thoroughly is thus impractical even with cloud computing (e.g., Amazon Web Services).

Vast deployment scenarios: Given the diverse hardware platforms and efficiency constraints, designing specialized DNNs for every scenario is engineer-expensive and computationally expensive, either with human-based methods or automated systems, since such methods need to repeat the network design process and retrain the designed network from scratch for each case. Their total cost grows linearly as the number of deployment scenarios increases, which will result in excessive energy consumption.

This dissertation will address the challenges listed above by using a surrogate-assisted multi-objective evolutionary approach. Evolutionary algorithms (EAs) are black-box optimization algorithms that can efficiently search through high-dimensional, large, and complex search spaces [35]. Since there already exist EAs designed for efficient exploration of multiple objectives, expensive objective functions, and discrete variables, separately, EAs present a promising starting point for improving neural architecture search.

1.3 Dissertation Overview

The rest of the dissertation is structured as follows. In Chapter 2, we provide background and a literature survey on the related topics and methods. In Chapter 3, we incorporate multi-objective Pareto-optimization into neural architecture search and present the initial algorithm framework. In Chapter 4, we enhance the search efficiency of the proposed algorithm through surrogate modelling. We then further improve the method's scalability to diverse deployment scenarios with adaptive transfer learning in Chapter 5. Various applications of the proposed algorithm are presented in Chapter 6. Finally, we summarize the contributions and provide interesting directions for future work in Chapter 7.

Chapter 2

Background

The work in this dissertation, evolutionary neural architecture search, builds upon two major areas of research: deep learning and evolutionary algorithms. Thus, this chapter reviews foundations and related work from these two areas of research that are relevant to the topic of this dissertation. First, an introduction to deep neural networks is given. Second, relevant topics within evolutionary optimization are reviewed.

2.1 Neural Architecture Search

Neural architecture search (NAS) presents a promising path to alleviate the painful process of manual adjustment by posing the design of CNN architectures as an optimization problem.
By altering the architectural components in an algorithmic fashion, novel CNNs can be discovered that exhibit improved performance metrics on representative datasets. The huge surge in research and applications of NAS indicates the tremendous academic and industrial interest NAS has attracted, as teams seek to stake out some of this territory. It is now well recognized that designing bespoke neural network architectures for various tasks is one of the most challenging and practically beneficial components of the entire Deep Neural Network (DNN) development process, and is a fundamental step towards automated machine learning.

In general, the problem of designing CNN architectures for a target dataset D = {D_trn, D_vld, D_tst} can be viewed as a bi-level optimization problem [10]. It can be mathematically formulated as

\min_{\alpha \in \mathcal{A}} \; F\big(\alpha, w^{*}(\alpha)\big), \quad \text{subject to} \quad w^{*}(\alpha) \in \arg\min_{w \in \Omega} f(w; \alpha), \tag{2.1}

where the upper-level variable \alpha defines a CNN architecture, and the lower-level variable w(\alpha) defines its associated weights. The objective functions F at the upper level and f at the lower level denote the loss on the validation data D_vld and the training data D_trn, respectively.

The problem in Eq. (2.1) poses several challenges to conventional optimization methods. First, evaluating F for a given \alpha involves another prolonged optimization process (the lower-level optimization), as the relationship between an \alpha and its corresponding optimal w^{*}(\alpha) is not analytically defined due to the non-linear nature of modern CNNs. Such a lower-level optimization process needs to be repeated for every evaluation of F at the upper level. Secondly, the upper-level problem is not differentiable, as the gradient of F cannot be reliably computed or even approximated due to the categorical nature of the upper-level variable \alpha, which involves the choice of layer operations, activation functions, etc.

Early methods for NAS relied on Reinforcement Learning (RL) to navigate and search for architectures with high performance. A major limitation of these approaches [36, 37] is the steep computational requirement for the search process itself, often requiring weeks of wall clock time on hundreds of Graphics Processing Unit (GPU) cards. Recent relaxation-based methods [38, 39] seek to improve the computational efficiency of NAS approaches by approximating the connectivity between different layers in the CNN architectures by real-valued variables that are learned (optimized) through gradient descent together with the weights. However, such relaxation-based NAS methods suffer from excessive GPU memory requirements during search, resulting in constraints on the size of the search space (e.g., reduced layer operation choices).

In addition to the need for high-performance results from NAS models, real-world applications of these models demand finding network architectures with different complexities for different deployment scenarios, e.g., IoT systems, mobile devices, automotives, cloud servers, etc. These computing devices are often constrained by a variety of hardware resources, such as power consumption, available memory, and latency constraints, to name a few. Even though there are methods with more advanced techniques like weight sharing [40] to improve RL's search efficiency, and binary gating [41] to reduce the GPU memory footprint of relaxation-based NAS methods, most existing RL and relaxation-based methods are not readily applicable for multi-objective NAS.

Despite the many different NAS methods being continually proposed, evolutionary algorithms (EAs) are getting a plethora of attention, due to their population-based nature and flexibility in encoding. They offer a viable alternative to conventional machine learning (ML)-oriented approaches, especially under the scope of multi-objective NAS. An EA, in general, is an iterative process in which individuals in a population are made gradually better by applying variations to selected individuals and/or recombining parts of multiple individuals. Despite the ease of extending them to handle multiple objectives, most existing EA-based NAS methods [42-44] are still single-criterion driven.
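To make the nested structure of Eq. (2.1) concrete, the following is a minimal, self-contained sketch on a toy regression problem, not the dissertation's actual NAS pipeline: the upper-level variable is a discrete model choice (a polynomial degree standing in for an architecture alpha), the lower level fits the weights w by gradient descent on training data, and the upper-level objective F is available only as a black-box value computed on validation data after the inner optimization has finished. The synthetic data, the polynomial model family, and the optimizer settings are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "task": in real NAS, D_trn / D_vld would be image datasets.
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(200)
x_trn, y_trn, x_vld, y_vld = x[:150], y[:150], x[150:], y[150:]

def features(x, alpha):
    # alpha (the polynomial degree) plays the role of the architecture:
    # it fixes the model family before any weights exist.
    return np.vander(x, alpha + 1, increasing=True)

def lower_level(alpha, steps=2000, lr=0.1):
    """Inner problem of Eq. (2.1): given alpha, minimize the training loss
    f(w; alpha) over the weights w by plain gradient descent."""
    X = features(x_trn, alpha)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y_trn) / len(y_trn)
        w -= lr * grad
    return w

def upper_level(alpha):
    """Outer objective F: validation loss of the *trained* model. Every
    evaluation of F hides a complete run of the inner optimization."""
    w_star = lower_level(alpha)
    return np.mean((features(x_vld, alpha) @ w_star - y_vld) ** 2)

# The upper level is discrete and non-differentiable, so it is treated as a
# black box: here we simply enumerate the candidate degrees.
scores = {alpha: upper_level(alpha) for alpha in range(1, 7)}
print(min(scores, key=scores.get), scores)
```

Even in this toy setting, each upper-level evaluation hides an entire inner optimization run; this is exactly the cost that the surrogate models and weight-sharing schemes developed in the later chapters aim to amortize.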
2.2 Evolutionary Bi-level Optimization

Bi-level optimization problems are characterised by a hierarchical two-level structure, in which there is a nested inner optimization problem acting as a constraint to an outer optimization problem. The outer optimization problem is often referred to as the upper level task or the leader task, and the inner optimization problem is often referred to as the lower level task or the follower task. Bi-level optimization problems are typically challenging to handle because the feasibility of a solution (upper level) can only be determined after another optimization process (lower level). Figure 2.1 illustrates a general bi-level problem solving structure involving interlinked optimization and decision-making tasks at both levels.

Figure 2.1: General sketch of a bi-level problem taken from [10].

Practical applications that are hierarchical in nature span all areas of sciences and engineering. For instance, such problems arise in transportation and network design [45, 46], environmental economics [47], chemical engineering [48, 49], optimal design [50, 51], defense applications [52, 53], facility location [54, 55], machine learning [56], etc. In addition, many problems [57] in practice can be decomposed into multiple but interconnected levels so that each level of the problem has fewer variables and parameters, and is made up of objectives and constraints of its own, for easier understanding and handling.

Most of the evolutionary algorithms for bi-level optimization are nested in nature, where one optimization algorithm is used within the other. The outer algorithm handles the upper level problem and the inner algorithm handles the lower level problem. Such a structure necessitates that the inner algorithm is called for every upper level point generated by the outer algorithm. Therefore, nested approaches can be quite computationally demanding, and can only be applied to small scale problems. One can find studies with an evolutionary algorithm being used for the upper level problem and a classical approach being used for the lower level problem. If the lower level problem is complex, researchers have used evolutionary algorithms at both levels. Below we provide a review of evolutionary bi-level optimization algorithms from the past.

Mathieu et al. [58] was one of the first to propose a bi-level algorithm using evolutionary algorithms. He used a genetic algorithm to handle the upper level problem and linear programming to solve the lower level problem for every upper level member generated using genetic operations. This study was followed by nesting the Frank-Wolfe algorithm (reduced gradient method) within a genetic algorithm in Yin [59]. Other authors utilized similar nested schemes in [60, 61]. Studies involving evolutionary algorithms at both levels include [62, 63], where the authors used differential evolution at both levels in the first study, and differential evolution within ant colony optimization in the second study.

Replacing the lower level problem in bi-level optimization with its Karush-Kuhn-Tucker (KKT) conditions is a common approach for solving the problem both in the classical as well as the evolutionary computation literature. However, a KKT based reduction can only be applied to problems where the lower level is convex and adheres to certain regularity conditions [64]. Some of the past evolutionary studies that utilize this idea include [65, 66]. The approach has been popular, and even recently researchers are relying on reducing the bi-level problem into a single level problem using KKT and solving the reduced problem using an evolutionary algorithm; for example, see [67-70].

While KKT conditions can only be applied to problems where the lower level adheres to certain mathematically simplifying assumptions, researchers are exploring techniques that can solve more general instances of bi-level optimization problems. Some of the approaches are based on meta-modeling the mappings within bi-level optimization, while others may be based on meta-modeling the entire bi-level problem itself. Studies in this direction include [71].
2.3 Evolutionary Multi-objective Optimization

Evolutionary multi-objective optimization (EMO) methods based on evolutionary algorithms (EAs) have become popular and been widely used in the past few decades [72, 73]. EMO methods start with a population of solutions, which is made gradually better by applying genetic operators (e.g., crossover and mutation). This dissertation builds upon an evolutionary algorithm framework because of its flexibility and wide applicability. In particular, evolutionary methods typically do not require or assume any convexity or differentiability of the objective or constraint functions, and their population-based approach provides natural advantages in handling problems with locally optimal solutions.

Instead of finding a single Pareto-optimal solution at a time (through objective scalarization), Pareto-dominance based multi-objective optimization methods attempt to find a set of Pareto-optimal solutions in a single simulation run [72]. The goals of an EMO are:

1. Find a well-converged set of solutions, and
2. Find a well-diversified set of solutions across the entire trade-off front.

The population approach of an EMO and the flexibility in its operators allow both of the above goals to be achieved for multi-objective problems. EMO algorithms, such as NSGA-II [74] and SPEA2 [75], are popular methods for handling two- and three-objective problems. Recent extensions, such as NSGA-III [76] and MOEA/D [77], are able to handle four or more objectives (even more than ten objectives). The topic is so important today that handling such large numbers of objectives is given a separate name, evolutionary many-objective optimization. Figure 2.2 illustrates the principle of Pareto-dominance based multi-objective optimization.

Figure 2.2: A principle of Pareto-dominance based multi-objective optimization procedure is illustrated.

In each iteration, a set of points gets operated on by an EMO procedure and a new set of points, hopefully closer to the Pareto front and with more diversity, is generated. This process is expected to continue with generations before converging on the true Pareto front covering its entire range. When a Pareto-dominance based multi-objective optimization algorithm transitions from one set of non-dominated points to a hopefully better set of non-dominated points, there is a need to evaluate the convergence and diversity of the new set of solutions vis-a-vis the old solutions. This is because, if we can measure both aspects well through one or more performance metrics, they can also be used to determine the termination of the optimization algorithm.

Hypervolume (HV) is a widely-used metric for comparing algorithms under multiple objectives. HV offers numerous advantages, including Pareto compliance [78], and it does not require the true Pareto front beforehand. In the bi-objective case, hypervolume can be interpreted as the dominated area with respect to a reference point. One hypothetical example is provided in Figure 2.3, where the circled points are the solutions found by multi-objective algorithms and the corresponding hypervolume is shown in shade. The higher the hypervolume measure, the better are the solutions found in terms of both objectives. An estimate of the nadir point, defined as a vector comprised of the worst objective values of all Pareto-optimal solutions, is a common choice for setting the reference point.

Figure 2.3: An illustration of hypervolume under bi-objective optimization.

Despite several advantages, EAs have one major limitation: they can consume many function evaluations to find an approximated set of Pareto-optimal (PO) solutions [79]. This issue becomes more prominent when the problem to be solved involves computationally expensive functions, which is another challenge in solving industrial optimization problems. To obtain solutions for expensive problems in a limited number of expensive function evaluations, surrogates have been used in the literature as an alternative to expensive evaluations.
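Since the hypervolume metric recurs throughout the experiments in this dissertation, a minimal sketch of how the bi-objective case of Figure 2.3 can be computed is given below. It assumes both objectives are to be minimized (e.g., classification error and a normalized complexity measure) and that the reference point is an estimate of the nadir point; the numerical values in the example are made up purely for illustration.

```python
import numpy as np

def hypervolume_2d(points, reference):
    """Dominated area of a bi-objective minimization front with respect to a
    reference point (e.g., an estimated nadir point), as in Figure 2.3."""
    pts = np.asarray(points, dtype=float)
    ref = np.asarray(reference, dtype=float)
    pts = pts[np.all(pts < ref, axis=1)]          # ignore points beyond the reference
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(pts[:, 0])]              # sweep along the first objective
    hv, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < best_f2:                          # point is non-dominated in the sweep
            hv += (ref[0] - f1) * (best_f2 - f2)  # add the newly dominated rectangle
            best_f2 = f2
    return hv

# Example: three trade-off solutions given as (error, normalized complexity) pairs.
front = [(0.10, 0.9), (0.20, 0.5), (0.40, 0.2)]
print(hypervolume_2d(front, reference=(1.0, 1.0)))
```

A larger value indicates a front that is both better converged and better spread, which is also why hypervolume is a convenient termination criterion for an EMO run.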
2.4 Evolutionary Surrogate-Assisted Optimization

In most practical (real-world) optimization problems (including NAS), the evaluation of a solution typically involves computationally expensive simulations and procedures, which remains one of the main bottlenecks of completing an optimization run within a reasonable amount of time. However, since solutions created at the beginning of an optimization simulation are expected to be far from the optimal region, an exact and precise evaluation of the solution may not be necessary. As the solutions progress towards the optimal region, a more precise evaluation may be invoked.

The recent developments of optimization methods have led to an increasing interest in approximation models or surrogate models [80, 81]. The use of metamodels (or surrogate models) to approximate the functional form of exact objectives and constraints by using a few expensive solution evaluations is a common approach [82]. Among various methods, the Kriging (also known as Gaussian Process) method is one of the widely used surrogate models, which can provide an estimated function value and also simultaneously provide an error estimate of the approximation [83]. The optimization process is then continued using the surrogate model until the model is deemed to be accurate enough in the current region of focus. Due to this reason, surrogate models are effective in reducing the overall computational time required in real-world design optimization.

2.5 Related Work

Recent years have witnessed growing interest in neural architecture search (NAS). The promise of being able to automatically and efficiently search for task-dependent network architectures is particularly appealing as deep neural networks are widely deployed in diverse applications and computational environments. Early methods [84, 85] made efforts to simultaneously evolve the topology of neural networks along with weights and hyperparameters. These methods perform competitively with hand-crafted networks on control tasks with shallow fully connected networks.

Evolutionary Algorithm (EA): Designing neural networks through evolution has been a topic of interest for a long time. Recent evolutionary approaches focus on evolving solely the topology while leaving the learning of weights to gradient descent algorithms, and using hyper-parameter settings that are manually tuned. Xie and Yuille's work of Genetic CNN [42] is one of the early studies that shows the promise of using EAs for NAS. Real et al. [43] introduce perhaps the first truly large scale application of a simple EA to NAS. The extension of this method presented in [6], called AmoebaNet, provides the first large scale comparison of EA and RL methods. Their EA, using an age-based selection similar to [86], has demonstrated faster convergence to an accurate network when compared to RL and random search. Concurrently, Liu et al. [44] evolve a hierarchical representation that allows non-modular layer structures to emerge. Despite the impressive improvements achieved on various datasets, these EA methods are extremely computationally inefficient, e.g., one run of the regularized evolution method [6] takes one week on 450 GPU cards.

Concurrently, another streamlining of EA methods for use in budgeted NAS has emerged. Suganuma et al. [87] use Cartesian genetic programming to assemble an architecture from existing modular blocks (e.g., Residual blocks). Sun et al. in [88] use a random forest as an efficient surrogate model to predict the performance of architectures, partially eliminating the lower-level optimization via gradient descent. The reported results yield 3x savings in wall clock time with similar classification performance when compared to their previous works [89, 90]. However, results reported from these budgeted EA methods are far from state-of-the-art and are only demonstrated on small-scale datasets, i.e., CIFAR-10 and CIFAR-100.
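The predictors used by such budgeted methods are instances of the surrogate models reviewed in Section 2.4. As a concrete illustration (and not the specific predictor developed later in this dissertation), the sketch below fits a simple radial-basis-function surrogate that maps integer architecture encodings to predicted accuracies, so that most candidates can be screened without ever being trained; the 22-integer encoding and the synthetic accuracy values are assumptions made only for this example.

```python
import numpy as np

class RBFSurrogate:
    """Minimal radial-basis-function surrogate: fit on a handful of
    (architecture encoding, measured accuracy) pairs, then predict the
    accuracy of unseen encodings."""

    def __init__(self, length_scale=2.0, ridge=1e-8):
        self.length_scale, self.ridge = length_scale, ridge

    def _kernel(self, A, B):
        sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dist / (2.0 * self.length_scale ** 2))

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        K = self._kernel(self.X, self.X) + self.ridge * np.eye(len(self.X))
        self.coef = np.linalg.solve(K, np.asarray(y, dtype=float))
        return self

    def predict(self, X):
        return self._kernel(np.asarray(X, dtype=float), self.X) @ self.coef

# Illustration: 30 "evaluated" architectures with a made-up accuracy function.
rng = np.random.default_rng(1)
archs = rng.integers(0, 4, size=(30, 22))       # hypothetical 22-integer encodings
accs = 0.70 + 0.08 * archs.mean(axis=1) / 3 + 0.005 * rng.standard_normal(30)
surrogate = RBFSurrogate().fit(archs, accs)
print(surrogate.predict(rng.integers(0, 4, size=(5, 22))))
```

In practice such a model would be re-fitted periodically on the archive of architectures that have actually been trained, and its predictions used only to pre-screen candidates, which is the role surrogates play in Chapters 4 and 5 of this dissertation.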
Reinforcement Learning (RL): Q-learning [91] is a very popular value iteration method used for RL. The MetaQNN method [92] employs an epsilon-greedy Q-learning strategy with experience replay to search connections between convolution, pooling, and fully connected layers, and the operations carried out inside the layers. Zhong et al. [13] extended this idea with the BlockQNN method. BlockQNN searches the design of a computational block with the same Q-learning approach. The block is then repeated to construct a network, resulting in a much more general network that achieves better results than its predecessor on CIFAR-10 [34].

A policy gradient method seeks to approximate some non-differentiable reward function to train a model that requires parameter gradients, like a neural network architecture. Zoph and Le [36] apply this method in architecture search to train a recurrent neural network controller that constructs networks. The original method in [36] uses the controller to generate the entire network at once. This contrasts with its successor, NASNet [5], which designs a convolutional and pooling block that is repeated to construct a network. NASNet outperforms its predecessor and produces a network achieving state-of-the-art performance on CIFAR-10 and ImageNet. Hsu et al. [93] extend the NASNet approach to a multi-objective domain to optimize multiple linear combinations of accuracy and energy consumption criteria using different scalarization parameters.

Relaxation-based Approaches and Others: Approximating the connectivity between different layers in CNN architectures by real-valued variables weighting the importance of each layer is the common principle of relaxation-based NAS methods. Liu et al. implement this idea in the DARTS algorithm [38]. DARTS seeks to improve search efficiency by fine-tuning the weights while updating the architectures, showing convergence on both CIFAR-10 and Penn Treebank [94] within one day in wall clock time on a single GPU card. Subsequent approaches in this line of research include [95-98]. The search efficiency of these approaches stems from weight sharing during the search process. This idea is complementary to our approach and can be incorporated into an evolutionary framework as well. However, it is beyond the scope of this dissertation and is a topic of future study.

Methods not covered by the EA-, RL- or relaxation-based paradigms have also shown success in architecture search. Liu et al. [2] proposed a method that progressively expands networks from simple cells and only trains the best K networks that are predicted to be promising by an RNN meta-model of the encoding space. Li and Talwalkar [99] show that an augmented random search approach is an effective alternative to NAS. Kandasamy et al. [100] present a Gaussian-process-based approach to optimize network architectures, viewing the process through a Bayesian optimization lens.

Multi-objective NAS: In this work, the term multi-objective NAS refers to methods that simultaneously approximate the entire set of efficient architectures in one run [101]. Kim et al. [102] presented NEMO, one of the earliest evolutionary multi-objective approaches to evolve CNN architectures. NEMO uses NSGA-II [74] to maximize classification performance and inference time of a network and searches over the space of the number of output channels from each layer within a restricted space of seven different architectures. DPP-Net [103], an extension from [2], progressively expands networks from simple structures and only trains the top-K (based on Pareto-optimality) networks that are predicted to be promising by a surrogate model. Elsken et al. [104] present the LEMONADE method, which is formulated to develop networks with high predictive performance and lower resource constraints. LEMONADE reduces compute requirements through custom-designed approximate network morphisms [105], which allow newly generated networks to share parameters with their forerunners, obviating the need to train new networks from scratch. However, LEMONADE still requires close to 100 GPU-days to search on the CIFAR datasets [34].
Multi-objective NAS through Scalarization: A portfolio of works that aim to design hardware-specific network architectures has emerged. These include ProxylessNAS [41], MnasNet [20], FBNet [97], and MobileNetV3 [21], which use a scalarized objective that encourages high accuracy and penalizes compute at the same time, e.g., maximize Acc \times (Latency/Target)^{-0.07}. These methods require a preference weighting of the importance of the different objectives before the search, which in itself requires a number of trials.

Chapter 3

Multi-objective Evolution of Deep Neural Network Architectures

Practical applications of NAS can rarely be considered from the point of view of a single objective of maximizing performance; rather, they must be evaluated from at least one additional, conflicting objective that is specific to the deployment scenario. In this work, we approach the problem of designing high-performance architectures with diverse complexities for different deployment scenarios as a multi-objective bi-level optimization problem. We mathematically formulate the problem as

\min_{\alpha \in \mathcal{A}} \; \big\{ F\big(\alpha, w^{*}(\alpha)\big), \; C(\alpha) \big\}, \quad \text{subject to} \quad w^{*}(\alpha) \in \arg\min_{w \in \Omega} f(w; \alpha), \tag{3.1}

where C(\alpha) measures the complexity of architecture \alpha. The upper-level variable \alpha defines a CNN architecture, and the lower-level variable w(\alpha) defines its associated weights. The objective functions F at the upper level and f at the lower level denote the loss on the validation data D_vld and the training data D_trn, respectively.

Figure 3.1: Overview: Given a dataset and objectives, NSGANet designs a set of custom architectures spanning the trade-off front. NSGANet estimates the performance of an architecture through its proxy model, optimized by Stochastic Gradient Descent (SGD) in the lower level. The search proceeds in exploration via genetic operations, followed by exploitation via distribution estimation. See Algorithm 1 for pseudocode; colors are in correspondence.

3.1 Proposed Approach

Our proposed algorithm, NSGANet, is an iterative process in which initial architectures are made gradually better as a group, called a population. In every iteration, a group of offspring (i.e., new architectures) is created by applying variations through crossover and mutation to the more promising of the architectures already found, also known as parents, from the population. Every member in the population (including both parents and offspring) competes for survival and reproduction (becoming a parent) in each iteration. The initial population may be generated randomly or guided by prior knowledge, i.e., seeding past successful architectures directly into the initial population. Subsequent to initialization, NSGANet proceeds the search in two sequential stages: (i) exploration, with the goal of discovering diverse ways to construct architectures, and (ii) exploitation, which reinforces the emerging patterns commonly shared among the architectures successful during exploration. A set of architectures representing efficient trade-offs between network performance and complexity
8 [ F 1 ;F 2 ;::: ] NondominatedSort ( f; ~ f ( P )) 9 dist CrowdingDistance ( F 1 ;F 2 ;::: ) 10 while g˝ then update ˆ accordingtoEq.(3.2).; 33 end 34 Return parentpopulation P . 21 3.1.1SearchSpaceandEncoding Thesearchforoptimalnetworkarchitecturescanbeperformedovermanytsearch spaces.Thegeneralityofthechosensearchspacehasamajoronthequality ofresultsthatareevenpossible.MostexistingevolutionaryNASapproaches[42,87,88, 104]searchonlyoneaspectofthearchitecturespace|e.g.,theconnectionsand/orhyper- parameters.Incontrast,NSGANetsearchesoverbothoperationsandconnections|the searchspaceisthusmorecomprehensive,includingmostoftheprevioussuccessfularchitec- turesdesignedbothbyhumanexpertsandalgorithmically. ModernCNNarchitecturesareoftencomposedofanouterstructure( network-level ) designwherethewidth(i.e.,#ofchannels),thedepth(i.e.,#oflayers)andthespatial resolutionchanges(i.e.,locationsofpoolinglayers)aredecided;andaninnerstructure ( block-level )designwherethelayer-wiseconnectionsandcomputationsarespe.g., Inceptionblock[30],ResNetblock[11],andDenseNetblock[31],etc.AsseenintheCNN literature,thenetwork-leveldecisionsaremostlyhand-tunedbasedonmeta-heuristicsfrom priorknowledgeandthetaskathand,asisthecaseinthiswork.Forblock-leveldesign,we adopttheoneusedin[2,5,6,38]tobeconsistentwithpreviouswork. A block isasmallconvolutionalmodule,typicallyrepeatedmultipletimestoformthe entireneuralnetwork.Toconstructscalablearchitecturesforimagesoftresolutions, weusetwotypesofblockstoprocessintermediateinformation:(1)the Normal block,a blocktypethatreturnsinformationofthesamespatialresolution;and(2)the Reduction block,anotherblocktypethatreturnsinformationwithspatialresolutionhalvedbyusinga strideoftwo.SeeFig.3.2forapictorialillustration. Weusedirectedacyclicgraphs(DAGs)consistingofenodestoconstructbothtypes 22 ofblocks(aReductionblockusesastrideoftwo).Each node isatwo-branchedstructure, mappingtwoinputstooneoutput.Foreachnodeinblock i ,weneedtopicktwoinputs fromamongtheoutputofthepreviousblock h i 1 ,theoutputoftheprevious-previous block h i 2 ,andthesetofhiddenstatescreatedinanypreviousnodesofblock i .Forpairs ofinputschosen,wechooseacomputationoperationfromamongthefollowingoptions, collectedbasedontheirprevalenceintheCNNliterature: identity 3x3maxpooling 3x3averagepooling squeeze-and-excitation[106] 3x3localbinaryconv[107] 5x5localbinaryconv[107] 3x3dilatedconvolution 5x5dilatedconvolution 3x3depthwise-separableconv 5x5depthwise-separableconv 7x7depthwise-separableconv 1x7then7x1convolution Theresultscomputedfrombothbranchesarethenaddedtogethertocreateanewhidden state,whichisavailableforsubsequentnodesinthesameblock.SeeFig.3.2(b)-3.2(d)for pictorialillustrations.Thesearchspaceweconsiderinthispaperisanexpandedversionof the microsearchspace usedinourpreviouswork[15].Sp,thecurrentsearchspace (i)searchforafactorthatgraduallyincrementsthechannelsizeofeachblockwithdepth (seeFig.3.2(a))asopposedtosharplydoublingthechannelsizewhendown-sampling.(ii) considersanexpandedsetofprimitiveoperationstoincludebothmorerecentadvancedlayer primitivessuchassqueeze-and-excitation[106]andmoreparametetlayerprimitives likelocalbinaryconvolution[107]. 
Withtheabove-mentionedsearchspace,thereareintotal20decisionstoconstitutea blockstructure,i.e.choosetwopairsofinputandoperationforeachnode,andrepeatfor 23 (a) (b) (c) (d) Figure3.2:SchematicoftheNSGANetsearchspacemotivatedfrom[5]:(a)Anarchitecture iscomposedofstackedblocks.(b)Thenumberofchannelsineachblockisgradually increasedwithdepthofthenetwork.(c)Eachblockiscomposedofenodes,where eachnodeisatwo-branchedcomputationappliedtooutputsfromeitherpreviousblocksor previousnodeswithinthesameblock.(d)Agraphicalvisualizationof(c). enodes.Theresultingnumberofcombinationsforablockstructureis: B = ( n +1)! 2 ( n ops ) 2 n where n denotesthenumberofnodes, n ops denotesthenumberofconsideredoperations. Therefore,withoneNormalblockandoneReductionblockwithenodesineach,the overallsizeoftheencodedsearchspaceisapproximately10 33 . 3.1.2AlgorithmProcedures PerformanceEstimationStrategy: ToguideNSGANettowardsmoreaccurate andtarchitectures,weconsidertwometricsasobjectives,namely,ac- curacyandarchitecturecomplexity.Assessingtheaccuracyofanarchitecture duringsearchrequiresanotheroptimizationtoidentifytheoptimalvaluesoftheasso- ciatedweights(seeAlgorithm2).Eventhoughthereexistwell-establishedgradientdescent algorithmstotlysolvethisoptimization,repeatedlyexecutingsuchanalgorithmfor 24 everycandidatearchitecturerenderstheoverallprocesscomputationallyveryprohibitive. Therefore,toovercomethiscomputationalbottleneck,wecarefully(usingaseriesofabla- tionstudies)down-scalethearchitecturestocreatetheirproxymodels[5,6],whichcanbe optimizedtlyinthelower-levelthroughSGD(Algo.2)Theirperformancesbecome surrogatemeasurementstoselectarchitecturesduringsearch. Algorithm2: PerformanceEvaluationofaCNN Input: Thearchitecture ,trainingdata D trn , validationdata D vld ,number T ofepochs, weightdecay ,initiallearningrate max . 1 ! Randomlyinitializetheweightsin ; 2 t 0; 3 while t > < > > : a i +((2 u ) 1 = (1+ m ) 1)( a i a ( L ) i ) ; for u 0 : 5 ; a i +(1 2(1 u ) 1 = (1+ m ) )( a ( U ) i a i ) ; for u> 0 : 5 (5.1) where u isauniformrandomnumberintheinterval[0 ; 1]. a ( L ) i and a ( U ) i arethelower andupperboundsof a i ,respectively.Eachmutatedvalueinaningisroundedto thenearestinteger.ThePMoperatorinheritsthe parent-centric convention,inwhichthe areintentionallycreatedaroundtheparents.Thecentricityiscontrolledviaan indexhyperparameter m .Inparticular,high-valuesof m tendtocreatemutated aroundtheparent,andlow-valuesencouragemutatedtobefurtherawayfromthe parentarchitecture.SeeFig.5.5(b)foravisualizationoftheof m .Itistheworth notingthatthePMoperatorwasoriginallyproposedforcontinuousoptimizationwheredis- tancesbetweenvariablevaluesarenaturallyIncontrast,incontextofourencoding, ourvariablesarecategoricalinnature,indicatingaparticularlayerhyperparameter.Sowe sortthesearchedsubnetsinascendingorderof#MAdds,suchthat m nowcontrolsthe in#MAddsbetweentheparentandthemutated WeapplyPMtoeverymemberintheingpopulation(createdfromcrossover). Wethenmergethemutatedpopulationwiththeparentpopulationandselectthe tophalfusingmany-objectiveselectionoperatordescribedinAlgorithm7.Thisprocedure createstheparentpopulationforthenextgeneration.Werepeatthisoverallprocessfora 70 pre-spnumberofgenerationsandoutputtheparentpopulationattheconclusionof theevolution. 
5.1.4 Many-Objective Selection

In addition to high predictive accuracy, real-world applications demand NAS algorithms to simultaneously balance a few other objectives that are specific to the deployment scenarios. For instance, mobile or embedded devices often have restrictions in terms of model size, multiply-adds, latency, power consumption, and memory footprint. With no prior assumption on the correlation among these objectives, a selection that is scalable to the number of objectives is required to drive the search towards the high dimensional Pareto front. In this work, we adopt the reference point guided selection originally proposed in NSGA-III [76], which has been shown to be effective in handling problems with as many as 15 objectives. In the remainder of this section, we provide an overview of the NSGA-III procedure and refer readers to the original publication for more details.

Algorithm 7: Reference Point Based Selection
Input: A set of archs R, their objectives F, number of archs to select N, reference directions Z.
1: // put archs into different fronts (rank levels) based on domination.
2: (F_1, F_2, ...) <- non_dominated_sort(F)
3: S <- {}, i <- 1
4: while |S| + |F_i| ...

...

Avg. Gain(i, j) = sum_{k=1}^{M} max(0, f_k(i) - f_k(j)) / sum_{k=1}^{M} 1{f_k(i) > f_k(j)}

Thereafter, the solution having the highest trade-off value indicates that it causes the largest average loss in some objectives to make a unit average gain in other objectives when choosing any of its neighbors. If this highest trade-off value is statistically much larger than that of other solutions, then the highest trade-off solution is the preferred choice, in case of no preferences provided by users.

5.2 Experiments and Results

In this section we present experimental results to evaluate the effectiveness of Neural Architecture Transfer on multiple image classification tasks. In addition we also investigate the scalability of our approach to more than two objectives. For all the experiments in this section, we use the same set of hyperparameters (see Table 5.1) for the different components of NAT. These choices were guided by the ablation studies described in Section 5.3.

5.2.1 Datasets

We consider eleven image classification datasets for evaluation with sample size varying from 2,040 to 180,000 images (20 to 18,000 images per class; Table 5.2). These datasets span a wide variety of image classification tasks, including superordinate-level recognition (ImageNet [1], CIFAR-10 [34], CIFAR-100 [34], CINIC-10 [127], STL-10 [128]); fine-grained recognition (Food-101 [134], Stanford Cars [135], FGVC Aircraft [136], Oxford-IIIT Pets [138], Oxford Flowers102 [129]); and texture classification (DTD [137]). We use the ImageNet dataset for training the supernet, and use the other ten datasets for architecture transfer.

Table 5.1: Hyperparameter Settings

Category              Parameter                              Setting
global                archive size                           300
                      number of iterations                   30
accuracy predictor    train size                             100
                      ensemble size                          500
evolutionary search   population size                        100
                      number of generations per iteration    100
                      crossover probability                  0.9
                      mutation probability                   0.1
                      mutation index eta_m                   1.0
supernet              number of epochs per iteration         5

Table 5.2: Benchmark Datasets for Evaluation

Dataset                      Type           Train Size    Test Size    # Classes
ImageNet [1]                 multi-class    1,281,167     50,000       1,000
CINIC-10 [127]               multi-class    180,000       9,000        10
CIFAR-10 [34]                multi-class    50,000        10,000       10
CIFAR-100 [34]               multi-class    50,000        10,000       100
STL-10 [128]                 multi-class    5,000         8,000        10
Food-101 [134]               fine-grained   75,750        25,250       101
Stanford Cars [135]          fine-grained   8,144         8,041        196
FGVC Aircraft [136]          fine-grained   6,667         3,333        100
DTD [137]                    texture        3,760         1,880        47
Oxford-IIIT Pets [138]       fine-grained   3,680         3,369        37
Oxford Flowers102 [129]      fine-grained   2,040         6,149        102

5.2.2 Supernet Preparation

Our supernet is constructed by setting the architecture encoding at the maximum value, i.e. four layers in each stage and every layer uses an expand ratio of six and a kernel size of seven.
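Concretely, a minimal sketch of this maximal encoding is shown below; only the four-layers-per-stage, expand-ratio-six, kernel-size-seven settings come from the text, while the number of stages is an assumption for illustration.

# Maximal architecture encoding used to instantiate the supernet.
NUM_STAGES = 5  # assumed for illustration; not specified in this excerpt

supernet_encoding = [
    {"num_layers": 4, "expand_ratio": 6, "kernel_size": 7}  # every stage at its maximum
    for _ in range(NUM_STAGES)
]
print(supernet_encoding)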
Adapting subnets of a supernet with randomly initialized weights leads to training instability and large variance in its performance. Therefore, we warm-up the supernet weights on ImageNet following the progressive shrinking algorithm [124], where the supernet is first trained at full scale, with subnets corresponding to different options (expand ratio, kernel size, # of layers) being gradually activated during the training process. This procedure, which takes about 6 days on a server with eight V100 GPUs, is optimized with only the cross-entropy loss, i.e., a single objective. We note that the supernet preparation expense is a one-time cost that amortizes over any subsequent transfer to different datasets and objective combinations we show in the following subsections. (Many supernets, pre-trained on ImageNet, are already available if one does not want to train them from scratch; see https://github.com/mit-han-lab/once-for-all, https://github.com/meijieru/AtomNAS and https://github.com/JiahuiYu/slimmable_networks.)

5.2.3 ImageNet

Before we evaluate our approach for architecture transfer to other datasets, we first validate its effectiveness on the ImageNet-1K dataset. This experiment evaluates the effectiveness of NAT in adapting and searching for efficient architectures that span the trade-off between two objectives. For this experiment, we consider accuracy and #MAdds as the two objectives of interest. We run NAT for 30 iterations, and from the archive of architectures we select four models ranging from 200M MAdds to 600M MAdds (for high-end mobile devices). Following [124] we further fine-tune each model for 75 epochs. (Section 5.3.5 studies the impact of ablating this fine-tuning step.) Our training largely follows [20]: RMSProp optimizer with decay 0.9 and momentum 0.9; batch normalization momentum 0.99; weight decay 1e-5. We use a batch size of 512 and an initial learning rate of 0.012 that gradually reduces to zero following the cosine annealing schedule. Our regularization settings are similar to those in [17]: we use the augmentation policy of [139], drop connect ratio 0.2, and dropout ratio 0.2.

Table 5.3 shows the performance of NAT models obtained through bi-objective optimization of maximizing accuracy and minimizing #MAdds. Our models, referred to as NAT-M{1,2,3,4}, are in ascending order of #MAdds (Fig. 5.7). Figure 5.8 shows the full #MAdds-accuracy trade-off curve comparison between NAT and existing NAS methods.

Results indicate that NATNets completely dominate (i.e. are better in both #MAdds and accuracy) all existing designs, both manual and from other NAS algorithms, under mobile settings (<= 600M MAdds). Compared to manually designed networks, NAT is noticeably more efficient. NAT-M1 is 2.3% and 1.5% more accurate than MobileNet-v3 [21] and FBNetV2-F4 [4] respectively, while being equivalent in efficiency (i.e. #MAdds, CPU and GPU latency). Furthermore, NATNets are consistently 6% more accurate than MobileNet-v2 [12] scaled by width multiplier from 200M to 600M #MAdds. Our largest model, NAT-M4, achieves a new state-of-the-art ImageNet top-1 accuracy of 80.5% under mobile settings (<= 600M #MAdds). Interestingly, even though this experiment did not explicitly optimize for CPU or GPU latency, NATNets are faster than those (MobileNet-V3 [21], MnasNet [20]) that explicitly do optimize for latency.

5.2.4 Architecture Transfer

Existing NAS approaches are rarely applied to tasks beyond standard datasets (i.e. CIFAR-10 [34] and ImageNet [1]), where the classification task is at superordinate level and the # of training images is sufficiently large. Instead, they adopt a conventional transfer learning setup [144], in which the architectures found by searching on standard benchmark datasets are transferred as is, with weights fine-tuned to new datasets. We argue that such a process is conceptually contradictory to the goal of NAS. The architectures transferred from standard datasets are sub-optimal either with respect to accuracy, efficiency, or both. On the other hand, by transferring both architecture and weights NAT can indeed design bespoke models for each task.

Figure 5.7: ImageNet Architectures from Trade-off Front.
Table 5.3: ImageNet-1K [1]: NATNets comparison with manual and automated design of efficient convolutional neural networks. Models are grouped into sections for better visualization. Our results are underlined and the best result in each section is in bold. CPU latency (batch size = 1) is measured on an Intel i7-8700K and GPU latency (batch size = 64) is measured on a 1080Ti. "WS" stands for weight sharing. All methods are under single crop and single model condition, without any additional data.

Model                        Method          #Params   #Multi-Adds   CPU Lat (ms)   GPU Lat (ms)   Top-1 Acc (%)   Top-5 Acc (%)
NAT-M1                       WS+EA           6.0M      225M          9.1            30             77.5            93.5
MobileNet-v2 [12]            manual          3.5M      300M          8.3            23             72.0            91.0
ProxylessNAS [41]            RL/gradient     4.0M      465M          8.5            27             75.1            92.5
MnasNet-A1 [20]              RL              3.9M      312M          9.3            31             75.2            92.5
MobileNet-v3 [21]            RL/NetAdapt     5.4M      219M          10.6           33             75.2            -
MUXNet-m [22]                EA              3.4M      218M          14.7           42             75.3            92.5
FBNetV2-F4 [4]               gradient        7.0M      238M          15.6           44             76.0            -
EfficientNet-B0 [17]         RL/scaling      5.3M      390M          14.4           46             76.3            93.2
NAT-M2                       WS+EA           7.7M      312M          11.4           37             78.6            94.3
MUXNet-l [22]                EA              4.0M      318M          19.2           74             76.6            93.2
AtomNAS-C+ [126]             WS+shrinkage    5.9M      363M          -              -              77.6            93.5
AutoNL-L [140]               gradient        5.6M      353M          -              -              77.7            93.7
DNA-c [141]                  gradient        5.3M      466M          14.5           67             77.8            93.7
ResNet-152 [11]              manual          60M       11.3B         66.7           176            77.8            93.8
NAT-M3                       WS+EA           9.1M      490M          16.1           62             79.9            94.9
EfficientNet-B1 [17]         RL/scaling      7.8M      700M          19.5           67             78.8            94.4
MixNet-L [125]               RL              7.3M      565M          29.4           105            78.9            94.2
BigNASModel-L [142]          WS              6.4M      586M          -              -              79.5            -
OnceForAll [124]             WS+EA           9.1M      595M          16.5           72             80.0            94.9
NAT-M4                       WS+EA           9.1M      0.6B          17.3           78             80.5            95.2
Inception-v4 [143]           manual          48M       13B           84.6           206            80.0            95.0
Inception-ResNet-v2 [143]    manual          56M       13B           99.1           289            80.1            95.1

Figure 5.8: MAdds vs. ImageNet Accuracy. NATNets outperform other models in both objectives. In particular, NAT-M4 achieves a new state-of-the-art top-1 accuracy of 80.5% under mobile settings (<= 600M MAdds). NAT-M1 improves MobileNet-v3 top-1 accuracy by 2.3% with similar #MAdds.

We evaluated NAT on ten image classification datasets (see Table 5.2) that present different challenges in terms of diversity in classes (superordinate vs. fine-grained) and size of training set (large vs. small). For each dataset, we run NAT with two objectives: maximize top-1 accuracy on validation data (20% randomly separated from the training set) and minimize #MAdds. We start from the supernet trained on ImageNet (which is created once before all experiments; see Section 5.2.2) and transfer it to the new task. During this procedure the last linear layer is reset depending on the number of categories in the new task. NAT is then applied for a total of 30 iterations. In each iteration the supernet is adapted for 5 epochs using SGD with a momentum of 0.9. The learning rate is initialized at 0.01 and annealed to zero in 150 epochs (30 iterations with five epochs in each). All hyperparameters are set at default values from Table 5.1. For each dataset, the overall NAT process takes slightly under a day on a server with eight 2080Ti GPUs.

Figure 5.9: MAdds vs. Accuracy trade-off curves comparing NAT to existing architectures on a diverse set of datasets. The datasets are arranged in ascending order of training set size. Methods shown in the legend pre-train on ImageNet and fine-tune the weights on the target dataset. Methods in blue train from scratch or use external training data. (Actual values of each NATNet model are provided in Table 5.4.)
Figure 5.9 shows the accuracy and #MAdds for each dataset across a wide range of models, including NATNets and conventional transfer learning (fine-tuned weights) of existing NAS and hand-designed models. Across all datasets NATNets consistently achieve better accuracy while being an order of magnitude more efficient (#MAdds) than existing models. Under mobile settings (<= 600M MAdds), NATNets achieve the state-of-the-art on these datasets, and a new state-of-the-art accuracy on both STL-10 [128] and CINIC-10 [127] (according to [19] for STL-10, and [16] for CINIC-10). Surprisingly, on small scale datasets, e.g. Oxford Flowers102 [129], Oxford-IIIT Pets [138], DTD [137] and STL-10 [128], we observe that NATNets are significantly more effective than conventional fine-tuning. Even on fine-grained datasets such as Stanford Cars and FGVC Aircraft, where conventional fine-tuning did not improve upon training from scratch, NATNets improve accuracy while also being significantly more efficient.

Figure 5.10: Transferred architectures (350M MAdds) on ten datasets.

Figure 5.10 shows a visualization of architectures with 350M MAdds for each dataset. The lack of similarity in the networks suggests that NAT was able to generate networks customized for each task. Additional visualization of architectures searched on all datasets is provided in Figure 5.11.

Table 5.4: NAT model performance corresponding to Figure 5.9.

Dataset                      #Params   #MAdds   Top-1 Acc (%)
Oxford Flowers102 [129]      3.3M      152M     97.5
                             3.4M      195M     97.9
                             3.7M      250M     98.1
                             4.2M      400M     98.3
Oxford-IIIT Pets [138]       4.0M      160M     91.8
                             5.5M      306M     93.5
                             5.7M      471M     94.1
                             8.5M      744M     94.3
DTD [137]                    2.2M      136M     76.1
                             4.0M      297M     77.6
                             4.1M      347M     78.4
                             6.3M      560M     79.1
STL-10 [128]                 4.4M      240M     96.7
                             5.1M      303M     97.2
                             7.5M      436M     97.8
                             7.5M      573M     97.9
FGVC Aircraft [136]          3.2M      175M     87.0
                             3.4M      235M     89.0
                             5.1M      388M     90.1
                             5.3M      581M     90.8
Stanford Cars [135]          2.4M      165M     90.9
                             2.7M      222M     92.2
                             3.5M      289M     92.6
                             3.7M      369M     92.9
CIFAR-100 [34]               3.8M      261M     86.0
                             6.4M      398M     87.5
                             7.8M      492M     87.7
                             9.0M      796M     88.3
CIFAR-10 [34]                4.3M      232M     97.4
                             4.6M      291M     97.9
                             6.2M      392M     98.2
                             6.9M      468M     98.4
Food-101 [134]               3.1M      198M     87.4
                             4.1M      266M     88.5
                             3.9M      299M     89.0
                             4.5M      361M     89.4
CINIC-10 [127]               4.6M      317M     93.4
                             6.2M      411M     94.1
                             8.1M      501M     94.3
                             9.1M      710M     94.8

Figure 5.11: Non-dominated architectures with respect to {top-1 accuracy, #MAdds} obtained by NAT on different datasets (Flowers102, Pets, DTD, STL-10, Aircraft, Stanford Cars, CIFAR-10, Food-101, ImageNet).

5.2.5 Scalability to Objectives

Practical applications of NAS can rarely be considered from the point of view of a single objective, and most often, they must be evaluated from many different, possibly competing, objectives. We demonstrate the scalability of NAT to more than two objectives and evaluate its effectiveness.

Figure 5.12: Top row: NATNets obtained from tri-objective search to maximize ImageNet top-1 accuracy, minimize model size (#Params), and minimize {#MAdds, CPU latency, GPU latency} from left to right. Bottom row: 2D projections from the above 3D scatter, showing top-1 accuracy vs. each of the four efficiency related measurements. To better visualize (the comparison with MobileNet-v3 [21] and MUXNet [22]), partial solutions from the non-dominated frontiers are shown. All top-1 accuracies shown are without fine-tuning.

5.2.5.1 Three Objectives

We use NAT to simultaneously optimize for three objectives, namely, model accuracy on ImageNet, model size (#params), and model computational efficiency. We consider three different metrics to quantify computational efficiency: #MAdds, CPU latency, and GPU latency. In total, we run three instances of three-objective search, i.e. maximize accuracy, minimize #params, and minimize one of #MAdds, CPU latency or GPU latency. We follow the settings from the ImageNet experiment in Section 5.2.3, except the fine-tuning step.
After obtaining the non-dominated (trade-off) solutions, we visualize the objectives in Fig. 5.12. We observe that Pareto surfaces emerge at the higher model complexity regime (i.e. high #params, #MAdds, etc.), shown in the 3D scatter plot in the top row, suggesting that trade-offs exist between model size (#params) and model efficiency (#MAdds and latency). In other words, #params and {#MAdds, CPU, GPU latency} are not completely correlated; e.g. a model with fewer #params is not necessarily more efficient in #MAdds or latency than another model with more #params. This is one of the advantages of using a many-objective optimization algorithm compared to optimizing a single scalarized objective (such as a weighted sum of objectives).

Figure 5.12 visualizes, in 2D, the top-1 accuracy as a trade-off with each one of the four considered efficiency metrics in the bottom row. The 2D projection is obtained by ignoring the third objective. For better visualization we only show the architectures that are close to the performance of MobileNet-v3 [21]. NATNets obtained directly from the three-objective search, i.e., before any fine-tuning of their weights, consistently outperform MobileNet-v3 on ImageNet along all the objectives (top-1 accuracy, #params, #MAdds, CPU and GPU latency). Additionally, we compare to MUXNets [22], which are also obtained from a three-objective NAS optimizing {top-1 accuracy, #params, and #MAdds}. However, MUXNets adopt a search space that is specifically tailored for reducing model size. Therefore, in comparison to MUXNets, we observe that NATNets perform favourably on all the remaining three metrics, except for #params.

5.2.5.2 Twelve Objectives

To further validate the scalability of NAT to a large number of objectives, we consider the top-1 accuracy on each of the 11 datasets shown in Table 5.2 along with #MAdds, as separate objectives, resulting in a 12-objective optimization problem. Not only is such a large-scale optimization plausible with NAT, it also reveals important information, which a low-dimensional optimization may not. During search, the accuracy on each dataset is computed by inheriting weights from the dataset-specific supernets generated from previous experiments (Section 5.2.4). Since the supernets are already adapted to each dataset, we exclude the supernet adaptation step in NAT for this experiment.

Figure 5.13: Left: Parallel Coordinate Plot (PCP) where each vertical bar is an objective and each line is a non-dominated architecture achieved by NAT from a 12-objective optimization of minimizing #MAdds and maximizing accuracy on the 11 datasets in Table 5.2. The model with the best trade-off (see Section 5.1.6 for details) is highlighted in dark blue. Right: 1-on-1 comparison between the selected NATNet (top-ranked in trade-off) and representative peer models on top-1 accuracy on various datasets. The method with larger area is better.

Figure 5.13 (Left) shows the 12 objective values for all 45 non-dominated architectures obtained by NAT in a parallel coordinate plot (PCP), where each vertical bar is an objective and each line connecting all 12 vertical bars is an architecture. We now apply the trade-off decision analysis presented in Section 5.1.6 and observe that the highest trade-off solution is more than (mean + 3 sigma) trade-off away from the rest of the 44 solutions (see Figure 5.14). This solution is highlighted in dark blue in Fig. 5.13 (Left). Its intermediate performance in all objectives indicates that this best trade-off solution makes a good compromise on all 12 objectives among all 45 obtained solutions. In Fig. 5.13 (Right), we compare this solution with different baseline models that are fine-tuned to each dataset separately. Notably, our NATNet achieves better accuracy on all datasets with similar or fewer #MAdds than EfficientNet-B0 [17], MobileNet-v2 [12], NASNet-A [5], and ResNet-50 [11], making our highest trade-off solution a preferred one.

Figure 5.14: Trade-off values of all 45 architectures achieved from the 12-objective search presented in Figure 5.13.
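As a concrete illustration of this selection mechanism, below is a minimal Python sketch of the Avg. Gain quantity from Section 5.1.6 and the loss-per-unit-gain ratio between a pair of solutions. How these pairwise ratios are aggregated over a solution's neighbors into a single trade-off value follows Section 5.1.6 and is not reproduced here; all names and the toy objective vectors are ours.

import numpy as np

def avg_gain(f_i, f_j):
    """Avg. Gain(i, j): average per-objective improvement of i over j
    (all objectives are assumed to be maximized here):
        sum_k max(0, f_k(i) - f_k(j)) / sum_k 1{f_k(i) > f_k(j)}."""
    diff = np.asarray(f_i, float) - np.asarray(f_j, float)
    n_better = int(np.sum(diff > 0))
    return float(np.clip(diff, 0, None).sum()) / n_better if n_better else 0.0

def loss_per_unit_gain(F, i, j):
    """Average loss incurred per unit of average gain when moving from solution i
    to solution j; a solution-level trade-off value aggregates these ratios over
    the solution's neighbors."""
    gain = avg_gain(F[j], F[i])
    loss = avg_gain(F[i], F[j])
    return float("inf") if gain == 0 else loss / gain

# Toy example: three non-dominated solutions described by
# (accuracy on dataset A, accuracy on dataset B, -#MAdds), all maximized.
F = np.array([[0.90, 0.80, -300.0],
              [0.92, 0.78, -350.0],
              [0.95, 0.70, -500.0]])
ratios = [[loss_per_unit_gain(F, i, j) if i != j else 0.0 for j in range(3)]
          for i in range(3)]
print(np.round(ratios, 2))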
The above analysis alludes to a computational mechanism for choosing a single preferred solution from the Pareto solutions obtained by a many-objective optimization algorithm. If such an overwhelmingly high trade-off solution exists in the Pareto front, it becomes one of the best choices and can outperform solutions found by a single-objective optimization algorithm. Without resorting to a many-objective optimization to find multiple trade-off solutions, identification of such a high trade-off solution is very challenging.

5.3 Ablation Study

5.3.1 Accuracy Predictor Performance

In this subsection we assess the effectiveness of different accuracy predictor models. We uniformly sampled 350 architectures from our search space and trained them using SGD for 150 epochs on ImageNet. Each one of them is fine-tuned for 50 epochs on the other ten datasets (Table 5.2). From the 350 pairs of architectures and top-1 accuracy computed on each dataset, we reserved a subset (randomly chosen) of 50 pairs for testing, and the remaining 300 pairs are then available for training the predictor models.

Figure 5.15: Top row: Spearman rank correlation between predicted accuracy and true accuracy of different surrogate models across many datasets. Each accuracy predictor is constructed from 250 samples (trained architectures). Error bars show mean and standard deviation over ten runs. Bottom row: Goodness-of-fit visualization of the RBF ensemble, the best accuracy predictor.

Figure 5.4 compares the mean (over 11 datasets) Spearman rank correlation between the predicted and true accuracy for each accuracy predictor as the training sample size is varied from 50 to 300. Empirically, we observe that the radial basis function (RBF) has a higher Spearman rank correlation compared to the other three models. And, the proposed RBF ensemble model further improves performance over the standalone RBF model across all training sample size regimes. Figure 5.15 shows a visualization of the comparative performance of predictor models on different datasets. From the perspective of minimizing the number of training examples (which reduces the overall computational cost) and maximizing Spearman rank correlation in prediction (which improves the accuracy in ranking architectures during search), we chose the RBF ensemble as our accuracy predictor model and a training size of 100 for all our experiments.
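To illustrate how such predictors are evaluated, here is a small, self-contained Python sketch that fits a bootstrap ensemble of RBF interpolators on synthetic (encoding, accuracy) pairs and reports the Spearman rank correlation on a held-out set. The synthetic data, ensemble construction, and hyperparameters are illustrative assumptions, not the dissertation's exact predictor.

import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder data standing in for (architecture encoding, measured top-1 accuracy)
# pairs; in practice these come from the sampled and trained architectures above.
X = rng.random((300, 22))
y = 75 + 5 * X[:, :4].sum(1) + rng.normal(0, 0.3, 300)   # synthetic accuracies (%)

X_trn, y_trn, X_tst, y_tst = X[:250], y[:250], X[250:], y[250:]

# A small bootstrap ensemble of RBF interpolators, as a stand-in for the RBF
# ensemble predictor; the exact construction in the dissertation may differ.
member_preds = []
for _ in range(30):
    idx = rng.choice(len(X_trn), size=100, replace=False)
    rbf = RBFInterpolator(X_trn[idx], y_trn[idx], smoothing=1e-3)
    member_preds.append(rbf(X_tst))
pred = np.mean(member_preds, axis=0)

rho, _ = spearmanr(pred, y_tst)
print(f"Spearman rank correlation on 50 held-out architectures: {rho:.3f}")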
Table 5.5: Comparing the relative search cost of NAT to other methods. "-" denotes not applicable, "WS" stands for weight sharing and "SMBO" stands for sequential model-based optimization [3]. † is taken from [4], ‡ is an estimate based on the # of models evaluated during search (20K in [5], 1.2K in [2], 27K in [6]). The re-ranking stage refers to the top 100-250 models undergoing extended training and evaluation for 300 epochs before selecting the model.

                                         Top-1     #MAdds   Estimated Search Cost (GPU hours)
Method             Type                  Acc. (%)  (M)      Prior-search   During-search   Post-search   Total
ImageNet
MnasNet [20]       gradient              75.2      312      -              -               -             91k†
OnceForAll [124]   WS+EA                 76.9      230      1,200          40              75            1.3k
NAT (ours)         WS+EA                 77.5      225      1,200          150             75            1.4k
CIFAR-10
NASNet [5]         RL                    97.4      569      -              10,000‡         6000          16k
PNASNet [2]        SMBO                  96.6      588      -              600‡            36            0.6k
DARTS [38]         WS+gradient           97.3      595      -              96              36            0.1k
AmoebaNet [6]      EA                    97.5      555      -              13,500‡         2400          16k
NAT (ours)         transfer+EA           98.4      468      -              24              -             0.1k

5.3.2 Search Efficiency

The overall computation cost consumed by a NAS algorithm can be factored into three phases: (1) Prior-search: cost incurred prior to the architecture search, e.g. training the supernet in case of one-shot approaches [122, 124] or constructing the accuracy predictor [121], etc.; (2) During-search: cost associated with measuring the performance of candidate architectures sampled during search through inference. It also includes the cost of training the supernet in case it is coupled with the search, like in [38] and NAT; (3) Post-search: cost associated with choosing a final architecture, and/or fine-tuning or re-training the final architectures from scratch. For comparison, we select representative NAS algorithms, including those based on reinforcement learning (RL), gradients, evolutionary algorithms (EA), and weight sharing (WS). Table 5.5 shows results for ImageNet and CIFAR-10. The former is the dataset on which the supernet is trained and the latter is a proxy for transfer learning to a non-standard dataset. NAT consistently achieves better performance, both in terms of top-1 accuracy and model efficiency (e.g. #MAdds), compared to the baselines, while the computational cost is similar or lower. The primary computational cost of NAT is the prior-search training of the supernet for 1,200 hours. We emphasize, again, that it is a one-time cost that is amortized across all subsequent deployment scenarios (e.g. 10 additional datasets in Section 5.2.4).

5.3.3 Analysis of Crossover

Figure 5.16: Ablation study on the crossover operator: (a) Effect of crossover: the median performance from eleven runs of our evolutionary algorithm with and without the crossover operator. (b) Effect of crossover probability: the median performance deteriorates as the crossover probability reduces from 0.9 to 0.2.

Crossover is a standard operator in evolutionary algorithms, but has largely been avoided by existing EA-based NAS methods [6, 43, 44]. But as we demonstrate here, a carefully designed crossover operation can significantly improve search efficiency. We run the evolutionary search of NAT with and without the crossover operator on four datasets: ImageNet [1], CIFAR-10 [34], Oxford Flowers102 [129], and Stanford Cars [135]. The hyperparameter settings that we compare are:

1. w/ crx: crossover probability of 0.9; mutation probability of 0.1; mutation index eta_m of 3.
2. w/o crx: crossover probability of 0.0; mutation probability of 0.2; mutation index eta_m of 3.

We double the mutation probability when crossover is not used to compensate for the reduced exploration ability of the search. On each dataset, we run each setting to maximize the top-1 accuracy 11 times and report the median performance as a function of the number of architectures sampled in Fig. 5.16(a). On all four datasets, the crossover operator significantly improves the efficiency of the evolutionary search algorithm. To further validate, we sweep over the probability of crossover while maintaining the rest of the settings. The median performance (over 11 runs) deteriorates as the crossover probability is reduced from 0.9 to 0.2 (see Fig. 5.16(b)).
5.3.4 Analysis of Mutation Hyperparameters

The mutation operator used in NAT is controlled via two hyperparameters, namely, the mutation probability p_m and mutation index eta_m. To identify the optimal hyperparameter values, we conduct the following parameter sweep experiments. Setting the rest of the hyperparameters to their default values (see Table 5.1), we sweep the value of p_m from 0.1 to 0.8, and eta_m from 1.0 to 20. And for each setting, we run NAT eleven times on four datasets (same as the crossover experiment) to maximize the top-1 accuracy. Figures 5.17(a) and 5.17(b) show the effect of mutation probability p_m and mutation index eta_m, respectively. We observe that increasing the mutation probability has an adverse effect on performance. Similarly, low values of eta_m, which encourage the mutated offspring to be further away from the parent architectures, improve the performance. Based on these observations, we set the mutation probability p_m and mutation index eta_m parameters to 0.1 and 1.0, respectively, for all our experiments in Section 5.2.

Figure 5.17: Hyperparameter study on (a) mutation probability p_m and (b) mutation index parameter eta_m. For each study, we run NAT eleven times on four datasets to maximize top-1 accuracy and report the median performance.

5.3.5 Effectiveness of Supernet Adaptation

Figure 5.18: Comparing the performance of adapting supernet, adapting subnet and additional fine-tuning under a bi-objective search setup on four datasets. Details are provided in Section 5.3.5.

Table 5.6: Effect of fine-tuning for 75 epochs on ImageNet.

                   NAT-M1    NAT-M2    NAT-M3    NAT-M4
w/o fine-tuning    75.9      77.4      78.9      79.4
w/ fine-tuning     77.5      78.6      79.9      80.5
                   (+1.6)    (+1.2)    (+1.0)    (+1.1)

Recall that NAT adopts any supernet trained on a large-scale dataset, e.g. ImageNet, and seeks to efficiently transfer it to a task-specific supernet on a given dataset. Here we compare this procedure to a more conventional approach of adapting every subnet (candidate architecture in the search) directly. Specifically, we consider the following:

1. Supernet Adaptation: fine-tune the supernet for 5 epochs in each iteration and use the accuracy from inherited weights (without further training) to select architectures during search.
2. Subnet Adaptation: fine-tune each subnet for 5 epochs from the inherited weights, then measure the accuracy.

We apply these two approaches to a bi-objective search of maximizing top-1 accuracy and minimizing #MAdds on four datasets, including CIFAR-10, CIFAR-100, Oxford Flowers102, and STL-10. Figure 5.18 compares the Pareto fronts. Adapting the supernet yields significantly better performance than adapting individual subnets. Furthermore, we select a subset of searched subnets after subnet adaptation and fine-tune their weights for an additional 150 epochs. We refer to this as additional fine-tuning in Fig. 5.18. Empirically, we observe that further fine-tuning can match the performance of supernet adaptation on datasets with larger training samples per class (e.g. 4,000 in CIFAR-10). On datasets with fewer samples per class (e.g. 20 in Flowers102), there is still a large performance gap between supernet adaptation and additional fine-tuning. Overall the results suggest that supernet adaptation is more effective on tasks with limited training samples. Finally, we report the effect of fine-tuning the trade-off architectures on ImageNet in Section 5.2.3 for an additional 75 epochs in Table 5.6.
5.4 Conclusion

This section considered the problem of designing custom neural network architectures that trade off multiple objectives for a given image classification task. We introduced Neural Architecture Transfer (NAT), a practical and effective approach for this purpose. We described our approach to harness the concept of a supernet and an evolutionary search algorithm for designing task-specific neural networks trading off accuracy and computational complexity. We also showed how to use an online regressor as a surrogate model to predict the accuracy of subnets in the supernet. Experimental evaluation on eleven benchmark image classification datasets, ranging from large-scale multi-class to small-scale fine-grained tasks, showed that networks obtained by NAT outperform conventional fine-tuning based transfer learning while being orders of magnitude more efficient under mobile settings (<= 600M Multiply-Adds). NAT was especially effective for small-scale fine-grained tasks where fine-tuning pre-trained ImageNet models is ineffective. Finally, we also demonstrated the utility of NAT in optimizing up to twelve objectives with a subsequent trade-off analysis procedure for identifying a single preferred solution. Overall, NAT is the first large scale demonstration of many-objective neural architecture search for designing custom task-specific models on diverse image datasets.

Chapter 6

Applications of Evolutionary Neural Architecture Search

6.1 Medical Imaging

The ChestX-Ray14 benchmark was recently introduced in [23]. The dataset contains 112,120 high resolution frontal-view chest X-ray images from 30,805 patients, and each image is labeled with one or multiple common thorax diseases, or "Normal" otherwise. Visualization of example X-ray images from the database is provided in Fig. 6.1. The dataset is publicly available from NIH at https://nihcc.app.box.com/v/ChestXray-NIHCC. We follow the train_val_list.txt and test_list.txt provided along with the X-ray images to split the database for training, validation and testing.

To cope with the multi-label nature of the dataset, we use a multitask learning setup in which each disease is treated as an individual binary classification problem. We define a 14-dimensional label vector of binary values indicating the presence of one or more diseases, resulting in a regression loss as opposed to the standard cross-entropy used for single-label problems. Past approaches [23, 145, 146] typically extend from existing architectures, and the current state-of-the-art method [146] uses a variant of the DenseNet [31] architecture, which is designed manually by human experts.

The search is set up along the same lines as that of CIFAR-100 described in Chapter 3, except that the classification accuracy objective is replaced by the average area under the ROC curve (AUROC) metric to guide the selection of architectures. We split the dataset into three parts with 70% for training, 10% for validation, and the remaining 20% for testing. All images are pre-processed to 224 x 224 pixels using ImageNet data augmentation techniques (e.g., normalization, random horizontal flipping, etc.).

Figure 6.1: Visualization of the ChestX-Ray14 dataset. Examples showing eight common thoracic diseases are from [23].

Once the non-dominated architectures are obtained after the convergence of evolution, we select the architecture that maximizes the ratio between the gain in AUROC over the increase in FLOPs. We call this architecture NSGANet-X, and we re-train the weights thoroughly from scratch with an extended number of epochs. The learning rate is gradually reduced when the AUROC on the validation set plateaus.

Table 6.1 compares the performance of NSGANet-X with peer methods that are extended from existing manually designed architectures. This includes architectures used by the authors who originally introduced the ChestX-Ray14 dataset [23], and CheXNet [146], which is the current state-of-the-art on this dataset. We also include results from commercial AutoML systems, i.e. Google AutoML [147] and LEAF [148], as comparisons with NAS based methods. The setup details of these two AutoML systems are available in [148].
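Before turning to the results in Table 6.1, the multi-label setup described above can be sketched as follows in PyTorch; the placeholder backbone, the exact form of the regression loss, and the toy labels are our assumptions for illustration.

import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

NUM_DISEASES = 14  # one binary indicator per thorax disease ("Normal" = all zeros)

# Any backbone with a 14-way output head can be plugged in here; this tiny linear
# model is only a placeholder so the snippet runs end-to-end.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, NUM_DISEASES))

criterion = nn.MSELoss()   # regression-style loss over the multi-hot target
# criterion = nn.BCELoss() # a common alternative for multi-label problems

images = torch.rand(4, 3, 224, 224)
targets = torch.zeros(4, NUM_DISEASES)
targets[0, 0::2] = 1.0   # toy multi-hot labels; each disease is present in some
targets[1, 1::2] = 1.0   # samples and absent in others so that the per-disease
targets[2, :7] = 1.0     # AUROC below is well defined
targets[3, 7:] = 1.0

logits = model(images)
loss = criterion(torch.sigmoid(logits), targets)
loss.backward()

# Architecture selection during search is guided by the average per-disease AUROC.
with torch.no_grad():
    scores = torch.sigmoid(model(images)).numpy()
print(float(loss), roc_auc_score(targets.numpy(), scores, average="macro"))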
Table 6.1: AUROC on the ChestX-Ray14 testing set.

Method                         Type      #Params   Test AUROC (%)
Wang et al. (2017) [23]        manual    -         73.8
Yao et al. (2017) [145]        manual    -         79.8
CheXNet (2017) [146]           manual    7.0M      84.4
Google AutoML (2018) [147]     RL        -         79.7
LEAF (2019) [148]              EA        -         84.3
NSGANet-A3                     EA        5.0M      84.7
NSGANet-X                      EA        2.2M      84.6

Note: the Google AutoML result is taken from [148]; NSGANet-A3 represents results under the standard transfer learning setup.

Noticeably, the performance of NSGANet-X exceeds Google AutoML's by a large margin of nearly 4 AUROC points. In addition, NSGANet-X matches the state-of-the-art results from the human engineered CheXNet, while using 3.2x fewer parameters. For completeness, we also include the result from NSGANet-A3, which is evolved on CIFAR-100, to demonstrate the transfer learning capabilities of NSGANet.

Figure 6.2: NSGANet-X multi-label performance on ChestX-Ray14 (a) and the class-wise mean test AUROC comparison with peer methods (b).

More detailed results showing the disease-wise ROC curve of NSGANet-X and the disease-wise AUROC comparison with other peer methods are provided in Figs. 6.2(a) and 6.2(b), respectively.

Figure 6.3: Examples of class activation maps [24] of NSGANet-X, highlighting the class-specific discriminative regions (Atelectasis, Cardiomegaly, Pneumonia, Pneumothorax). The ground truth bounding boxes are plotted over the heatmaps.

To understand the pattern behind the disease decisions of NSGANet-X, we visualize the class activation map (CAM) [24], which is commonly adopted for localizing the discriminative regions for image classification. In the examples shown in Fig. 6.3, stronger CAM areas are covered with warmer colors. We also outline the bounding boxes provided by the ChestX-Ray14 dataset [23] as references.

These results further validate the ability of our proposed algorithm to generate task-dependent architectures automatically. Conventional approaches, e.g., transfer learning from existing architectures, can be effective in yielding similar performance; however, as demonstrated by NSGANet, simultaneously considering complexity along with performance in an algorithmic fashion allows architectures to be practically deployed in resource-constrained environments.

6.2 Against Adversarial Attack

Based on our analysis in Section V-B, years of architectural advancements have translated to minuscule improvements in robustness against adversarial examples. A simple one-iteration attack strategy like FGSM [25] is enough to construct examples that turn many modern DNNs to random guessing. In this section, we make an effort to improve adversarial robustness from the architectural perspective. The search space used in the main paper searches over both layer operations and layer connections (see Section III-A). To isolate the effect of these two aspects on adversarial robustness, we fix the layer operation to the basic residual module [11] and search over the connections among these modules to improve both accuracy on clean images and robustness against adversarial examples.
Designing a measure/objective for robustness against adversarial attacks is an area of active research (e.g., [149]). For our purposes, we present a possible measure here, illustrated in Fig. 6.4. Using the FGSM attack presented by [25], this robustness objective progressively increases the noise produced by FGSM. The epsilon axis in Fig. 6.4 refers to the hyper-parameter in the FGSM equation,

x' = x + epsilon * sign(grad_x L(x, y_true)),

where x is the original image, x' is the adversarial image, y_true is the true class label, and L is the cross-entropy loss. Therefore, for this experiment, we seek to maximize two objectives, namely, classification accuracy and the robustness objective above.

Figure 6.4: Robustness Objective: We define a robustness objective under the FGSM [25] attack as follows: 1) obtain classification accuracy on adversarial images generated by FGSM as we vary epsilon; 2) compute the area under the curve (blue line), approximated by the area of the green region; 3) normalize the robustness value to the rectangular area formed by the Ideal point and Nadir point, where the ideal point is at 100% classification accuracy at the maximum epsilon value, and the nadir point is defined as the accuracy of random guessing at epsilon = 0 (clean images).

The setup for the robustness experiment is as follows. For training we use 40,000 CIFAR-10 images from the official CIFAR-10 training data, 10,000 of which are reserved for validation. Each network is encoded with three blocks using the macro space encoding from our previous work [15]. In each phase a maximum of six nodes may be active, where the computation at each node is a 3x3 convolution followed by ReLU and batch normalization. Each network is trained for 20 epochs with SGD on a cosine annealed learning rate schedule. The epsilon values used in the FGSM robustness calculation are [0.0, 0.01, 0.03, 0.05, 0.07, 0.1, 0.15]. As before, NSGANet initiates the search with 40 randomly created network architectures, and 40 new network architectures are created at each generation (iteration) via genetic operations (see the main paper for details). The search is terminated at 30 generations.

Figure 6.5: Trade-off frontier of the robustness experiments. Color indicates the generation (iteration) at which a network architecture is eliminated from the surviving parent population. The size of each point is proportional to the network architecture's number of trainable parameters. We note that networks from latter generations form the Pareto front (dark blue points).

Empirically, we observe a clear trade-off between accuracy and robustness, as shown in Fig. 6.5. Visualizations of the non-dominated architectures are provided in Fig. 6.6(c). In our opinion, NSGANet is useful in capturing patterns that differentiate architectures that are good for competing objectives. We find that the "wide" networks (like ResNeXt [112] or Inception blocks [30]) appear to provide good accuracy on standard benchmark images, but are fragile to the FGSM attack. On the other hand, "deep" networks (akin to ResNet [11] or VGG [29]) are more robust to the FGSM attack, while having less accuracy. This phenomenon is illustrated with examples in Figs. 6.6(a) and 6.6(b), respectively. Furthermore, the skip connection that skips the entire block's computation appears to be critical in obtaining a network that is robust to adversarial attacks; see Figs. 6.7(a) and 6.7(b).

Figure 6.6: (a) Examples of the computational blocks discovered with high classification accuracy. For these networks, the mean accuracy and robustness objectives are 0.8543 and 0.0535, respectively; (b) Examples of the computational blocks discovered with high robustness against the FGSM attack; the mean accuracy and robustness objectives are 0.8415 and 0.1036, respectively; (c) Examples of the computational blocks discovered along the Pareto front that provide an efficient trade-off between classification accuracy and adversarial robustness. They are arranged in the order of descending accuracy and ascending robustness.
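For completeness, here is a minimal PyTorch sketch of the FGSM attack and the robustness objective used in this section; the helper names, the tiny stand-in model, and the exact normalization of the area (measured above the random-guess level) are our assumptions based on Fig. 6.4.

import numpy as np
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """FGSM adversarial examples: x' = x + eps * sign(grad_x L(x, y_true))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

def robustness_objective(model, x, y,
                         eps_list=(0.0, 0.01, 0.03, 0.05, 0.07, 0.1, 0.15),
                         random_guess=0.1):
    """Area under the adversarial accuracy-vs-epsilon curve, normalized by the
    rectangle spanned by the ideal point (100% accuracy at the largest epsilon)
    and the nadir point (random-guess accuracy at epsilon = 0)."""
    eps = np.array(eps_list)
    accs = np.array([accuracy(model, x if e == 0.0 else fgsm(model, x, y, e), y)
                     for e in eps])
    height = np.clip(accs - random_guess, 0.0, None)        # accuracy above random guess
    area = float(np.sum(0.5 * (height[1:] + height[:-1]) * np.diff(eps)))
    return area / ((eps[-1] - eps[0]) * (1.0 - random_guess))

# Example with a tiny stand-in classifier on random CIFAR-10-shaped data.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(robustness_objective(model, x, y))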
Figure 6.7: Parallel coordinate plots of the 1,200 network architectures sampled by NSGANet. Each line represents a network architecture, and each vertical line is an attribute associated with the network. (a) Networks that have the skip connection bit inactive; we can see that none of them have a good measurement on robustness against adversarial attacks. (b) Networks that have the skip connection bit active. This skip connection bit refers to the connection that goes past all computation within a phase, as a normal residual connection would. When the skip connection is active, the networks cover the full range of adversarial robustness.

6.3 Multi-view Car Alignment

In addition to object classification, dense image prediction (e.g. object alignment, human body pose estimation, semantic segmentation, etc.) is another class of problems that is of great importance to computer vision. Dense image prediction assigns a class label to each pixel in the query images, as opposed to one label for the entire image in the case of classification. In this section, we apply NSGANet to the problem of multi-view car key-point alignment. We use the CMU-Car dataset originally introduced in [7]. The dataset contains around 10,000 car images in different orientations, environments, and occlusion situations. In this case, we search for the path of image resolution changes, similar to [98]. The node-level structure is kept fixed using the basic residual unit [11]. The performance of architectures in this case is calculated using the root mean square (RMS) error between the predicted heatmap and the ground truth for each key-point; more details are available in [7]. We use FLOPs as the second objective for architecture complexity measurement. The obtained architectures are named NSGANet-C0 and -C1. The obtained results are provided in Table 6.2 and the visualization of the architectures is provided in Fig. 6.8.

Table 6.2: Preliminary results on the CMU-Car alignment [7]. Notably, our proposed algorithm is able to find architectures with competitive performance while having 2x fewer parameters when compared to the human-designed architecture [8].

Architecture     Params (M)   FLOPs (M)   Regression Error (%)
Hourglass [8]    3.38         3613        7.80
NSGANet-C0       1.53         2584        8.66
NSGANet-C1       1.61         2663        8.64

Figure 6.8: Spatial-resolution-change path of the NSGANet-C0 architecture. Each circle encodes a residual module [11]. Circles colored in white are always executed at the beginning. The arrows and blue circles are parts of the search decisions. The total number of layers (L) is set to be nine.

6.4 Non-learnable Layer

Our post-optimization analysis on the evolved architectures, shown in Chapter 3 under Section 3.2.5, has revealed some interesting findings, one of which is the effectiveness of non-parametric operations, e.g. identity mapping, average/max pooling, etc., in efficiently trading performance for architectural complexity. To further validate this observation, we consider an expanded range of operations including both non-parametric and weight-free operations, which we name non-learnable operations in this paper. We manually construct such layers by concatenating multiple non-learnable operations in parallel. The obtained results are shown in Figure 6.9.

Our preliminary results on manual construction of non-learnable layers are very encouraging. In addition to the comparable performance to regular fully learned layers, non-learnable layers offer unique advantages in terms of re-usable weights for multi-tasking network architectures, as the weights are task-agnostic (not specifically learned on a particular task). We believe designing a dedicated search algorithm to shape the construction of these non-learnable layers is a promising direction for NAS towards the automatic design of multi-tasking architectures.

Figure 6.9: Preliminary experiment on constructing DNN architectures using layers with non-learnable weights. Each layer is composed of several non-learnable operations in parallel. We manually constructed a handful of such layers and evaluated them on CIFAR-10 (a).
An example based on the trade-off between accuracy and the number of parameters is shown in (b). We validate the effectiveness of non-learnable layers by replacing the original 3x3 convolutions in different ResNet models with the chosen non-learnable layers on both CIFAR-100 and ImageNet (c). Evidently, layers with non-learnable weights are capable of yielding competitive performance while being computationally more efficient as opposed to conventional learnable convolutions.

Chapter 7

Conclusion

In this section, we list the different contributions of this dissertation, followed by the future directions and concluding remarks.

7.1 Contributions

Multi-objective: In this dissertation, we show that through an efficient evolutionary framework with highly customized operators (i.e. crossover, mutation, distribution estimation), a set of architectures that trade off multiple competing objectives can be obtained in a single run efficiently. This allows designers to choose a suitable network a-posteriori, as opposed to specifying a preference weighting of each objective prior to the search. Further post-optimal analysis of the set of non-dominated architectures oftentimes reveals valuable design principles, which stands as another benefit of posing NAS as a multi-objective optimization problem, as in the case of this dissertation.

Surrogate Modeling: In this dissertation, we show an alternative approach to solve the nested bi-level NAS problem, i.e., simultaneously optimizing the architecture and learning the optimal model weights. However, instead of gradient based relaxations, we advocate for surrogate models. At the upper level, we construct multiple predictor models and then adaptively select one predictor model for predicting an architecture's performance from its encoding. This process itself reduces the overall search cost by a factor of 45x compared to previous approaches. At the lower level, we deploy a supernet that allows the training of an architecture to start from a much more advanced stage as compared to a random initialization, leading to a 30x reduction in training cost (i.e. the required training epochs reduce from 150 to 5).

Scalability via Transfer Learning: Most existing NAS approaches require one complete search for each deployment specification of hardware or objective. This is a computationally impractical endeavor given the potentially large number of application scenarios. We use an online search-guided approach, as opposed to conventional transfer learning, to rapidly adapt a supernet that is learned on upstream tasks (e.g. ImageNet) to a new task at hand, including both different types of datasets and the required computational constraints. Such an approach not only allows an architecture search to be efficiently executed, but also ensures the performance is beyond the current best practice, i.e. transfer learning.

Competitive and Comprehensive Results: This dissertation demonstrates the scalability and practicality of the proposed approach on many datasets corresponding to diverse scenarios: large-scale multi-class (ImageNet, CINIC-10 [127]), medium-scale multi-class (CIFAR-10, CIFAR-100), small-scale multi-class (STL-10 [128]), large-scale fine-grained (Food-101 [134]), medium-scale fine-grained (Stanford Cars [135], FGVC Aircraft [136]) and small-scale fine-grained (DTD [137], Oxford-IIIT Pets [138], Oxford Flowers102 [129]) datasets. We further demonstrate the scalability and utility of the proposed approach across many objectives. Optimizing accuracy, model size and one of MAdds, CPU or GPU latency, the obtained architectures dominate MobileNet-v3 [21] across all objectives. We also consider a 12-objective problem of finding a common architecture across eleven datasets while minimizing MAdds. The best trade-off architecture dominates all baseline models across these datasets under mobile settings.

Real-world Applications: Towards the end of this dissertation, we apply our proposed algorithm to various practical applications, including medical imaging for classifying common thorax diseases, improving robustness against adversarial attacks, dense image prediction for multi-view car key-point alignment, and custom layer design with no trainable parameters.
7.2 Discussion and Future Work

Despite the steady stream of empirical improvements on many datasets and applications shown in this dissertation, there are still tremendous scopes and opportunities left to be explored. In the remainder of this section, we provide a brief discussion on some of the promising future extensions and directions.

Higher-level Tasks: This dissertation mainly focuses on object recognition, which is the fundamental task in computer vision. In addition, there exist many other more challenging tasks including object detection, semantic segmentation, panoptic segmentation, human pose analysis, alignment, and image generation, to name a few. All of these aforementioned tasks still rely heavily on human engineering innovations to progress the architectural design. Adapting the proposed algorithm in this dissertation to automate the design of network architectures for these tasks will be important and impactful for the computer vision community.

Search Space: Designing a search space (possibly with transformer architectures) may help bridge the gap between sequential/tabular data (natural language processing) and pixel data (computer vision), both in terms of research innovations and in the way information is reasoned about. There is significant interest in tasks that involve simultaneously handling visual and textual inputs, which is essentially required to achieve true autonomy in machine learning. These bimodal tasks have been historically known to be difficult with conventional architectures. NAS with such a search space could be a promising avenue to tackle this challenging but important problem.

Search based DNN Training: Gradient-based weight learning through back-propagation has been the cornerstone of deep learning. Its polynomial time nature with respect to the network parameters (along with advanced computing hardware) has made DNN training not only possible but also (relatively) efficient. However, it has also led to several compromises, including the choice of layers being simple or at least differentiable, single or scalarized objectives in training, and heavy reliance on initialization (He initialization [11]) and normalization (BatchNorm), to name a few. Freeing DNN training from gradients has great potential in discovering more novel architectures with complex operations, and can be easily extended to handle multiple objectives.

7.3 Concluding Remarks

The work in this dissertation demonstrated the potential of evolutionary computation in automating the design of neural network solutions to many ML applications. The topology and the weights of the architecture can be optimized simultaneously to meet the requirements of the task, resulting in superior performance under multiple objectives. This dissertation primarily showed that the computer vision domain (object recognition in particular) can greatly benefit from evolutionary neural architecture search; many more domains remain where the work in this dissertation can be applied to improve the user experience or productivity.

BIBLIOGRAPHY

[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.

[2] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in European Conference on Computer Vision (ECCV), 2018.

[3] F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Sequential model-based optimization for general algorithm configuration," in International Conference on Learning and Intelligent Optimization. Springer, 2011, pp. 507-523.

[4] A. Wan, X. Dai, P. Zhang, Z. He, Y. Tian, S. Xie, B. Wu, M. Yu, T. Xu, K. Chen et al., "FBNetV2: Differentiable neural architecture search for spatial and channel dimensions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[5] B.Zoph,V.Vasudevan,J.Shlens,andQ.V.Le,\Learningtransferablearchitectures forscalableimagerecognition,"in IEEEConferenceonComputerVisionandPattern Recognition(CVPR) ,2018. [6] E.Real,A.Aggarwal,Y.Huang,andQ.V.Le,\Regularizedevolutionforimage architecturesearch,"in AAAIConferenceonAIntelligence ,2019. [7] V.NareshBoddeti,T.Kanade,andB.VijayaKumar,\Correlationersforob- jectalignment,"in IEEEConferenceonComputerVisionandPatternRecognition (CVPR) ,2013. [8] A.Newell,K.Yang,andJ.Deng,\Stackedhourglassnetworksforhumanposeesti- mation,"in EuropeanConferenceonComputerVision(ECCV) ,2016. [9] F.VanVeenandS.Leijnen,\Theneuralnetworkzoo,"https://www.asimovinstitute. org/neural-network-zoo,accessed:2020-05-17. [10] A.Sinha,P.Malo,andK.Deb,\Areviewonbileveloptimization:Fromclassical toevolutionaryapproachesandapplications," IEEETransactionsonEvolutionary Computation ,vol.22,no.2,pp.276{295,2017. 113 [11] K.He,X.Zhang,S.Ren,andJ.Sun,\Deepresiduallearningforimagerecognition," in IEEEConferenceonComputerVisionandPatternRecognition(CVPR) ,2016. [12] M.Sandler,A.Howard,M.Zhu,A.Zhmoginov,andL.-C.Chen,\Mobilenetv2:In- vertedresidualsandlinearbottlenecks,"in IEEEConferenceonComputerVisionand PatternRecognition(CVPR) ,2018. [13] Z.Zhong,J.Yan,W.Wu,J.Shao,andC.-L.Liu,\Practicalblock-wiseneuralnet- workarchitecturegeneration,"in IEEEConferenceonComputerVisionandPattern Recognition(CVPR) ,2018. [14] E.ZitzlerandL.Thiele,\Multiobjectiveoptimizationusingevolutionaryalgorithms |acomparativecasestudy,"in ParallelProblemSolvingfromNature|PPSNV , A.E.Eiben,T.ack,M.Schoenauer,andH.-P.Schwefel,Eds.Berlin,Heidelberg: SpringerBerlinHeidelberg,1998,pp.292{301. [15] Z.Lu,I.Whalen,V.Boddeti,Y.Dhebar,K.Deb,E.Goodman,andW.Banzhaf, \Nsga-net:Neuralarchitecturesearchusingmulti-objectivegeneticalgorithm,"in GeneticandEvolutionaryComputationConference(GECCO) ,2019. [16] N.Nayman,A.Noy,T.Ridnik,I.Friedman,R.Jin,andL.Zelnik,\Xnas:Neural architecturesearchwithexpertadvice,"in AdvancesinNeuralInformationProcessing Systems(NeurIPS) ,2019. [17] M.TanandQ.V.Le,tnet:Rethinkingmodelscalingforconvolutionalneural networks,"in ICML ,2019. [18] D.Berthelot,N.Carlini,I.Goodfellow,N.Papernot,A.Oliver,andC.A. \Mixmatch:Aholisticapproachtosemi-supervisedlearning,"in AdvancesinNeural InformationProcessingSystems(NeurIPS) ,2019. [19] X.Wang,D.Kihara,J.Luo,andG.-J.Qi,\Enaet:Self-trainedensembleautoencoding transformationsforsemi-supervisedlearning," arXivpreprintarXiv:1911.09265 ,2019. [20] M.Tan,B.Chen,R.Pang,V.Vasudevan,M.Sandler,A.Howard,andQ.V.Le, \Mnasnet:Platform-awareneuralarchitecturesearchformobile,"in IEEEConference onComputerVisionandPatternRecognition(CVPR) ,2019. [21] A.Howard,M.Sandler,G.Chu,L.-C.Chen,B.Chen,M.Tan,W.Wang,Y.Zhu, R.Pang,V.Vasudevan,Q.V.Le,andH.Adam,\Searchingformobilenetv3,"in InternationalConferenceonComputerVision(ICCV) ,2019. 114 [22] Z.Lu,K.Deb,andV.N.Boddeti,\MUXConv:Informationmultiplexingincon- volutionalneuralnetworks,"in IEEEConferenceonComputerVisionandPattern Recognition(CVPR) ,2020. [23] X.Wang,Y.Peng,L.Lu,Z.Lu,M.Bagheri,andR.M.Summers,\Chestx-ray8: Hospital-scalechestx-raydatabaseandbenchmarksonweakly-supervised andlocalizationofcommonthoraxdiseases,"in IEEEConferenceonComputerVision andPatternRecognition(CVPR) ,2017. [24] B.Zhou,A.Khosla,A.Lapedriza,A.Oliva,andA.Torralba,\Learningdeepfeatures fordiscriminativelocalization,"in IEEEConferenceonComputerVisionandPattern Recognition(CVPR) ,2016. 
[25] I.J.Goodfellow,J.Shlens,andC.Szegedy,\Explainingandharnessingadversarial examples," arXivpreprintarXiv:1412.6572 ,2014. [26] A.Krizhevsky,I.Sutskever,andG.E.Hinton,\Imagenetwithdeep convolutionalneuralnetworks,"in AdvancesinNeuralInformationProcessingSystems (NeurIPS) ,2012. [27] A.Graves,A.-r.Mohamed,andG.Hinton,\Speechrecognitionwithdeeprecurrent neuralnetworks,"in IEEEInternationalconferenceonacoustics,speechandsignal processing(ICASSP) ,2013. [28] R.CollobertandJ.Weston,\Aarchitecturefornaturallanguageprocessing: Deepneuralnetworkswithmultitasklearning,"in InternationalConferenceonMa- chineLearning(ICML) ,2008. [29] K.SimonyanandA.Zisserman,\VeryDeepConvolutionalNetworksforLarge-scale ImageRecognition,"in InternationalConferenceonLearningRepresentations(ICLR) , 2015. [30] C.Szegedy,W.Liu,Y.Jia,P.Sermanet,S.Reed,D.Anguelov,D.Erhan,V.Van- houcke,andA.Rabinovich,\Goingdeeperwithconvolutions,"in IEEEConference onComputerVisionandPatternRecognition(CVPR) ,2015. [31] G.Huang,Z.Liu,L.VanDerMaaten,andK.Q.Weinberger,\Denselyconnected convolutionalnetworks,"in IEEEConferenceonComputerVisionandPatternRecog- nition(CVPR) ,2017. 115 [32] X.Zhang,X.Zhou,M.Lin,andJ.Sun,\ShAnextremelytconvolu- tionalneuralnetworkformobiledevices,"in IEEEConferenceonComputerVision andPatternRecognition(CVPR) ,2018. [33] R.Hecht-Nielsen,\Theoryofthebackpropagationneuralnetwork,"in Neuralnetworks forperception .Elsevier,1992,pp.65{93. [34] A.Krizhevsky,G.Hinton etal. ,\Learningmultiplelayersoffeaturesfromtinyimages," Citeseer,Tech.Rep.,2009. [35] K.DebandC.Myburgh,\Breakingthebillion-variablebarrierinreal-worldoptimiza- tionusingacustomizedevolutionaryalgorithm,"in ProceedingsoftheGeneticand EvolutionaryComputationConference2016 ,2016,pp.653{660. [36] B.ZophandQ.V.Le,\Neuralarchitecturesearchwithreinforcementlearning,"in InternationalConferenceonLearningRepresentations(ICLR) ,2017. [37] B.Baker,O.Gupta,N.Naik,andR.Raskar,\Designingneuralnetworkarchitectures usingreinforcementlearning,"in InternationalConferenceonLearningRepresenta- tions(ICLR) ,2017. [38] H.Liu,K.Simonyan,andY.Yang,\DARTS:tiablearchitecturesearch,"in InternationalConferenceonLearningRepresentations(ICLR) ,2019. [39] R.Luo,F.Tian,T.Qin,E.Chen,andT.-Y.Liu,\Neuralarchitectureoptimization," in AdvancesinNeuralInformationProcessingSystems(NeurIPS) ,2018. [40] H.Pham,M.Guan,B.Zoph,Q.Le,andJ.Dean,tneuralarchitecturesearch viaparameterssharing,"in InternationalConferenceonMachineLearning(ICML) , 2018. [41] H.Cai,L.Zhu,andS.Han,\ProxylessNAS:Directneuralarchitecturesearchontarget taskandhardware,"in InternationalConferenceonLearningRepresentations(ICLR) , 2019. [42] L.XieandA.Yuille,\Geneticcnn,"in InternationalConferenceonComputerVision (ICCV) ,2017. [43] E.Real,S.Moore,A.Selle,S.Saxena,Y.L.Suematsu,J.Tan,Q.V.Le,andA.Ku- rakin,\Large-scaleevolutionofimagein InternationalConferenceonMa- chineLearning(ICML) ,2017. 116 [44] H.Liu,K.Simonyan,O.Vinyals,C.Fernando,andK.Kavukcuoglu,\Hierarchicalrep- resentationsforcientarchitecturesearch,"in InternationalConferenceonLearning Representations(ICLR) ,2018. [45] A.Chen,J.Kim,S.Lee,andY.Kim,\Stochasticmulti-objectivemodelsfornetwork designproblem," ExpertSystemswithApplications ,vol.37,no.2,pp.1608{1619,2010. [46] W.FanandR.Machemehl,\Bi-leveloptimizationmodelforpublictransportation networkredesignproblem:Accountingforequityissues," TransportationResearch Record:JournaloftheTransportationResearchBoard ,vol.2263,pp.151{162,2011. 
[47] M.A.AmouzegarandK.Moshirvaziri,\Determiningoptimalpollutioncontrolpolicies: Anapplicationofbilevelprogramming," EuropeanJournalofOperationalResearch , vol.119,no.1,pp.100{120,1999. [48] P.A.ClarkandA.W.Westerberg,\Bilevelprogrammingforsteady-statechemical processdesign-i.fundamentalsandalgorithms," ComputersandChemicalEngineering , vol.14,no.1,pp.87{97,1990. [49] A.U.RaghunathanandL.T.Biegler,\Mathematicalprogramswithequilibriumcon- straints(mpecs)inprocessengineering," ComputersandChemicalEngineering ,vol.27, no.10,pp.1381{1392,2003. [50] S.Christiansen,M.Patriksson,andL.Wynter,\Stochasticbilevelprogrammingin structuraloptimization," Structuralandmultidisciplinaryoptimization ,vol.21,no.5, pp.361{371,2001. [51] J.Herskovits,A.Leontiev,G.Dias,andG.Santos,\Contactshapeoptimization:a bilevelprogrammingapproach," Structuralandmultidisciplinaryoptimization ,vol.20, no.3,pp.214{221,2000. [52] G.Brown,M.Carlyle,J.on,andK.Wood,\Defendingcriticalinfrastructure," Interfaces ,vol.36,no.6,pp.530{544,2006. [53] E.IsraeliandR.K.Wood,\Shortest-pathnetworkinterdiction," Networks ,vol.40, no.2,pp.97{111,2002. [54] H.Ku˘cukaydin,N.Aras,andI.K.\Competitivefacilitylocationproblem withattractivenessadjustmentofthefollower:Abilevelprogrammingmodelandits solution," EuropeanJournalofOperationalResearch ,vol.208,no.3,pp.206{220, 2011. 117 [55] T.Uno,H.Katagiri,andK.Kato,\Anevolutionarymulti-agentbasedsearchmethod forstackelbergsolutionsofbilevelfacilitylocationproblems," InternationalJournalof InnovativeComputing,InformationandControl ,vol.4,no.5,pp.1033{1042,2008. [56] J.LiangandR.Miikkulainen,\Evolutionarybileveloptimizationforcomplexcontrol tasks,"in GeneticandEvolutionaryComputationConference(GECCO) ,2015. [57] B.Barnhart,Z.Lu,M.Bostian,A.Sinha,K.Deb,L.Kurkalova,M.Jha,andG.Whit- taker,\Handlingpracticalitiesinagriculturalpolicyoptimizationforwaterqualityim- provements,"in GeneticandEvolutionaryComputationConference(GECCO) ,2017. [58] R.Mathieu,L.Pittard,andG.Anandalingam,\Geneticalgorithmbasedapproachto bi-levellinearprogramming," OperationsResearch ,vol.28,no.1,pp.1{21,1994. [59] Y.Yin,\Geneticalgorithmbasedapproachforbilevelprogrammingmodels," Journal ofTransportationEngineering ,vol.126,no.2,pp.115{120,2000. [60] H.LiandY.Wang,\Ahybridgeneticalgorithmforsolvingnonlinearbilevelprogram- mingproblemsbasedonthesimplexmethod," InternationalConferenceonNatural Computation ,vol.4,pp.91{95,2007. [61] X.Zhu,Q.Yu,andX.Wang,\Ahybridtialevolutionalgorithmforsolving nonlinearbilevelprogrammingwithlinearconstraints,"in CognitiveInformatics,2006. ICCI2006.5thIEEEInternationalConferenceon ,vol.1.IEEE,2006,pp.126{131. [62] J.Angelo,E.Krempser,andH.Barbosa,tialevolutionforbilevelprogram- ming,"in Proceedingsofthe2013CongressonEvolutionaryComputation(CEC-2013) . IEEEPress,2013. [63] J.S.AngeloandH.J.C.Barbosa,\Astudyontheuseofheuristicstosolveabilevel programmingproblem," InternationalTransactionsinOperationalResearch ,2015. [64] J.A.Mirrlees,\Thetheoryofmoralhazardandunobservablebehaviour:Parti," The ReviewofEconomicStudies ,vol.66,no.1,pp.3{21,1999. [65] S.R.Hejazi,A.Memariani,G.Jahanshahloo,andM.M.Sepehri,\Linearbilevelpro- grammingsolutionbygeneticalgorithm," Computers&OperationsResearch ,vol.29, no.13,pp.1913{1925,2002. [66] Y.Wang,Y.C.Jiao,andH.Li,\Anevolutionaryalgorithmforsolvingnonlinear bilevelprogrammingbasedonanewconstraint-handlingscheme," IEEETransactions 118 onSystems,Man,andCybernetics,PartC:ApplicationsandReviews ,vol.32,no.2, pp.221{232,2005. 
[67] Y.Wang,H.Li,andC.Dang,\Anewevolutionaryalgorithmforaclassofnonlinear bilevelprogrammingproblemsanditsglobalconvergence," INFORMSJournalon Computing ,vol.23,no.4,pp.618{629,2011. [68] Y.Jiang,X.Li,C.Huang,andX.Wu,\Applicationofparticleswarmoptimization basedonchkssmoothingfunctionforsolvingnonlinearbilevelprogrammingproblem," AppliedMathematicsandComputation ,vol.219,no.9,pp.4332{4339,2013. [69] H.Li,\Ageneticalgorithmusingasearchspaceforsolvingnonlinear/linear fractionalbilevelprogrammingproblems," AnnalsofOperationsResearch ,pp.1{16, 2015. [70] Z.Wan,G.Wang,andB.Sun,\Ahybridintelligentalgorithmbycombiningparti- cleswarmoptimizationwithchaossearchingtechniqueforsolvingnonlinearbilevel programmingproblems," SwarmandEvolutionaryComputation ,vol.8,pp.26{32, 2013. [71] A.Sinha,P.Malo,andK.Deb,\Evolutionaryalgorithmforbileveloptimizationusing approximationsofthelowerleveloptimalsolutionmapping," EuropeanJournalof OperationalResearch ,vol.257,no.2,pp.395{411,2017. [72] K.Deb, Multi-objectiveoptimizationusingevolutionaryalgorithms .Chichester,UK: Wiley,2001. [73] C.Coello,G.Lamont,andD.VanVeldhuizen, EvolutionaryAlgorithmsforSolving Multi-objectiveProblems .Springer,NewYork,2ndedition,2007. [74] K.Deb,A.Pratap,S.Agarwal,andT.Meyarivan,\Afastandelitistmultiobjective geneticalgorithm:Nsga-ii," IEEETransactionsonEvolutionaryComputation ,vol.6, no.2,pp.182{197,2002. [75] E.Zitzler,M.Laumanns,andL.Thiele,\SPEA2:ImprovingtheStrengthPareto EvolutionaryAlgorithm,"in EUROGEN2001.EvolutionaryMethodsforDesign,Op- timizationandControlwithApplicationstoIndustrialProblems ,Athens,Greece,2002, pp.95{100. [76] K.DebandH.Jain,\Anevolutionarymany-objectiveoptimizationalgorithmusing reference-point-basednondominatedsortingapproach,parti:Solvingproblemswith 119 boxconstraints," IEEETransactionsonEvolutionaryComputation ,vol.18,no.4,pp. 577{601,Aug2014. [77] Q.ZhangandH.Li,\Moea/d:Amultiobjectiveevolutionaryalgorithmbasedon decomposition," IEEETransactionsonevolutionarycomputation ,vol.11,pp.712{ 731,2007. [78] H.Ishibuchi,H.Masuda,Y.Tanigaki,andY.Nojima,\Modistancecalculationin generationaldistanceandinvertedgenerationaldistance,"in Internationalconference onevolutionarymulti-criterionoptimization .Springer,2015,pp.110{125. [79] J.Yaochu,\Surrogate-assistedevolutionarycomputation:Recentadvancesandfuture challenges," SwarmandEvolutionaryComputation ,vol.1,no.2,pp.61{70,2011. [80] A.Forrester,A.Sobester,andA.Keane, Engineeringdesignviasurrogatemodelling: Apracticalguide .JohnWiley&Sons,2008. [81] K.Shimoyama,K.Sato,S.Jeong,andS.Obayashi,\Updatingkrigingsurrogatemod- elsbasedonthehypervolumeindicatorinmulti-objectiveoptimization," Journalof MechanicalDesign ,vol.135,no.9,p.094503,2013. [82] A.CassioliandF.Schoen,\Globaloptimizationofexpensiveblackboxproblemswith aknownlowerbound," JournalofGlobalOptimization ,pp.177{190,2013. [83] D.R.Jones,\Ataxonomyofglobaloptimizationmethodsbasedonresponsesurfaces," Journalofglobaloptimization ,vol.21,no.4,pp.345{383,2001. [84] X.Yao,\Evolvingneuralnetworks," ProceedingsoftheIEEE ,vol.87,no.9, pp.1423{1447,1999. [85] K.O.StanleyandR.Miikkulainen,\Evolvingneuralnetworksthroughaugmenting topologies," EvolutionaryComputation ,vol.10,no.2,pp.99{127,2002. [86] G.S.Hornby,\Alps:theage-layeredpopulationstructureforreducingtheprob- lemofprematureconvergence,"in GeneticandEvolutionaryComputationConference (GECCO) ,2006. 
[87] M. Suganuma, S. Shirakawa, and T. Nagao, "A genetic programming approach to designing convolutional neural network architectures," in Genetic and Evolutionary Computation Conference (GECCO), 2017.
[88] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, and M. Zhang, "Surrogate-assisted evolutionary deep learning using an end-to-end random forest-based performance predictor," IEEE Transactions on Evolutionary Computation, 2020.
[89] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, "Completely automated CNN architecture design based on blocks," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[90] Y. Sun, B. Xue, M. Zhang, G. G. Yen, and J. Lv, "Automatically designing CNN architectures using the genetic algorithm for image classification," IEEE Transactions on Cybernetics, pp. 1–15, 2020.
[91] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College, Cambridge, UK, May 1989. [Online]. Available: http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf
[92] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing neural network architectures using reinforcement learning," in International Conference on Learning Representations (ICLR), 2017.
[93] C.-H. Hsu, S.-H. Chang, J.-H. Liang, H.-P. Chou, C.-H. Liu, S.-C. Chang, J.-Y. Pan, Y.-T. Chen, W. Wei, and D.-C. Juan, "MONAS: Multi-objective neural architecture search using reinforcement learning," arXiv preprint arXiv:1806.10332, 2018.
[94] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger, "The Penn Treebank: annotating predicate argument structure," in Proceedings of the Workshop on Human Language Technology. Association for Computational Linguistics, 1994, pp. 114–119.
[95] X. Dong and Y. Yang, "Searching for a robust neural architecture in four GPU hours," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[96] S. Xie, H. Zheng, C. Liu, and L. Lin, "SNAS: stochastic neural architecture search," in International Conference on Learning Representations (ICLR), 2019.
[97] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, "FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[98] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei, "Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[99] L. Li and A. Talwalkar, "Random search and reproducibility for neural architecture search," arXiv preprint arXiv:1902.07638, 2019.
[100] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing, "Neural architecture search with Bayesian optimisation and optimal transport," in Advances in Neural Information Processing Systems (NeurIPS), 2018.
[101] Y. Jin and B. Sendhoff, "Pareto-based multiobjective machine learning: An overview and case studies," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 3, pp. 397–415, 2008.
[102] Y.-H. Kim, B. Reddy, S. Yun, and C. Seo, "NEMO: Neuro-evolution with multiobjective optimization of deep neural network for speed and accuracy," in International Conference on Machine Learning (ICML) AutoML Workshop, 2017.
[103] J.-D. Dong, A.-C. Cheng, D.-C. Juan, W. Wei, and M. Sun, "DPP-Net: Device-aware progressive search for Pareto-optimal neural architectures," in European Conference on Computer Vision (ECCV), 2018.
[104] T. Elsken, J. H. Metzen, and F. Hutter, "Efficient multi-objective neural architecture search via Lamarckian evolution," in International Conference on Learning Representations (ICLR), 2019.
[105] T. Wei, C. Wang, Y. Rui, and C. W. Chen, "Network morphism," in International Conference on Machine Learning (ICML), 2016.
[106] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[107] F. Juefei-Xu, V. N. Boddeti, and M. Savvides, "Local binary convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[108] Y. Leung, Y. Gao, and Z.-B. Xu, "Degree of population diversity - a perspective on premature convergence in genetic algorithms and its Markov chain analysis," IEEE Transactions on Neural Networks, vol. 8, no. 5, pp. 1165–1176, 1997.
[109] B. L. Miller, D. E. Goldberg et al., "Genetic algorithms, tournament selection, and the effects of noise," Complex Systems, vol. 9, no. 3, pp. 193–212, 1995.
[110] K. Deb and R. B. Agrawal, "Simulated binary crossover for continuous search space," Complex Systems, vol. 9, no. 2, pp. 115–148, 1995.
[111] M. Pelikan, D. E. Goldberg, and E. Cantú-Paz, "BOA: The Bayesian optimization algorithm," in Genetic and Evolutionary Computation Conference (GECCO), 1999.
[112] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[113] J. Malik, "Technical perspective: What led computer vision to deep learning?" Communications of the ACM, vol. 60, no. 6, pp. 82–83, 2017.
[114] S. Xie, A. Kirillov, R. Girshick, and K. He, "Exploring randomly wired neural networks for image recognition," in International Conference on Computer Vision (ICCV), 2019, pp. 1284–1293.
[115] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," in International Conference on Learning Representations (ICLR), 2017.
[116] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[117] K. Deb and A. Srinivasan, "Innovization: Innovating design principles through optimization," in Genetic and Evolutionary Computation Conference (GECCO), 2006.
[118] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in Advances in Neural Information Processing Systems (NeurIPS) Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[119] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms," arXiv preprint arXiv:1708.07747, 2017.
[120] B. Baker, O. Gupta, R. Raskar, and N. Naik, "Accelerating neural architecture search using performance prediction," arXiv preprint arXiv:1705.10823, 2017.
[121] X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia et al., "ChamNet: Towards efficient network design through platform-aware model adaptation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[122] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, "Understanding and simplifying one-shot architecture search," in International Conference on Machine Learning (ICML), 2018.
[123] A. Brock, T. Lim, J. Ritchie, and N. Weston, "SMASH: One-shot model architecture search through hypernetworks," in International Conference on Learning Representations (ICLR), 2018.
[124] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, "Once-for-all: Train one network and specialize it for efficient deployment," in International Conference on Learning Representations (ICLR), 2020.
[125] M. Tan and Q. V. Le, "MixConv: Mixed depthwise convolutional kernels," in British Machine Vision Conference (BMVC), 2019.
[126] J. Mei, Y. Li, X. Lian, X. Jin, L. Yang, A. Yuille, and J. Yang, "AtomNAS: Fine-grained end-to-end neural architecture search," in International Conference on Learning Representations (ICLR), 2020.
[127] L. N. Darlow, E. J. Crowley, A. Antoniou, and A. J. Storkey, "CINIC-10 is not ImageNet or CIFAR-10," arXiv preprint arXiv:1810.03505, 2018.
[128] A. Coates, A. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
[129] M. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
[130] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[131] I. Das and J. E. Dennis, "Normal-boundary intersection: A new method for generating the Pareto surface in nonlinear multicriteria optimization problems," SIAM Journal on Optimization, vol. 8, no. 3, pp. 631–657, Mar. 1998. [Online]. Available: https://doi.org/10.1137/S1052623496307510
[132] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, "Single path one-shot neural architecture search with uniform sampling," arXiv preprint arXiv:1904.00420, 2019.
[133] X. Chu, B. Zhang, R. Xu, and J. Li, "FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search," arXiv preprint arXiv:1907.01845, 2019.
[134] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 – Mining discriminative components with random forests," in European Conference on Computer Vision (ECCV), 2014.
[135] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, "3D object representations for fine-grained categorization," in IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[136] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi, "Fine-grained visual classification of aircraft," Tech. Rep., 2013.
[137] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, "Describing textures in the wild," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[138] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, "Cats and dogs," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[139] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, "RandAugment: Practical automated data augmentation with a reduced search space," arXiv preprint arXiv:1909.13719, 2019.
[140] Y. Li, X. Jin, J. Mei, X. Lian, L. Yang, C. Xie, Q. Yu, Y. Zhou, S. Bai, and A. Yuille, "Neural architecture search for lightweight non-local networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[141] C. Li, J. Peng, L. Yuan, G. Wang, X. Liang, L. Lin, and X. Chang, "Blockwisely supervised neural architecture search with knowledge distillation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[142] J. Yu, P. Jin, H. Liu, G. Bender, P.-J. Kindermans, M. Tan, T. Huang, X. Song, R. Pang, and Q. Le, "BigNAS: Scaling up neural architecture search with big single-stage models," arXiv preprint arXiv:2003.11142, 2020.
[143] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in AAAI Conference on Artificial Intelligence, 2017.
[144] S. Kornblith, J. Shlens, and Q. V. Le, "Do better ImageNet models transfer better?" in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[145] L. Yao, E. Poblenz, D. Dagunts, B. Covington, D. Bernard, and K. Lyman, "Learning to diagnose from scratch by exploiting dependencies among labels," arXiv preprint arXiv:1710.10501, 2017.
[146] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya et al., "CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning," arXiv preprint arXiv:1711.05225, 2017.
[147] Google Research Blog, "AutoML for large scale image classification and object detection," Google Research, https://research.googleblog.com/2017/11/automl-for-large-scaleimage.html, 2017.
[148] J. Liang, E. Meyerson, B. Hodjat, D. Fink, K. Mutch, and R. Miikkulainen, "Evolutionary neural AutoML for deep learning," in Genetic and Evolutionary Computation Conference (GECCO), 2019.
[149] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in IEEE Symposium on Security and Privacy (SP), 2017.