LARGE-SCALE HIGH DIMENSIONAL DISTANCE METRIC LEARNING AND ITS APPLICATION TO COMPUTER VISION

By

Qi Qian

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2015

ABSTRACT

LARGE-SCALE HIGH DIMENSIONAL DISTANCE METRIC LEARNING AND ITS APPLICATION TO COMPUTER VISION

By

Qi Qian

Learning an appropriate distance function (i.e., similarity) is one of the key tasks in machine learning, especially for distance based machine learning algorithms, e.g., the k-nearest neighbor classifier, k-means clustering, etc. Distance metric learning (DML), the subject to be studied in this dissertation, is designed to learn a metric that pulls the examples from the same class together and pushes the examples from different classes away from each other. Although many DML algorithms have been developed in the past decade, most of them can handle only small data sets with hundreds of features, significantly limiting their applicability to real world applications that often involve millions of training examples represented by hundreds of thousands of features. Three main challenges are encountered when learning the metric from such large-scale high dimensional data: (i) to make sure that the learned metric is a Positive Semi-Definite (PSD) matrix, a projection onto the PSD cone is required at every iteration, whose cost is cubic in the dimensionality, making it unsuitable for high dimensional data; (ii) the number of variables that needs to be optimized in DML is quadratic in the dimensionality, which results in a slow convergence rate in optimization and a high requirement of memory storage; (iii) the number of constraints used by DML is at least quadratic, if not cubic, in the number of examples, depending on whether pairwise constraints or triplet constraints are used in DML. Besides, features can be redundant due to high dimensional representations (e.g., face features), and DML with feature selection is preferred for these applications.

The main contribution of this dissertation is to address these challenges both theoretically and empirically. First, for the challenge arising from the PSD projection, we exploit the mini-batch strategy and adaptive sampling with a smooth loss function to significantly reduce the number of updates (i.e., projections) while keeping a similar performance. Second, for the challenge arising from high dimensionality, we propose a dual random projection approach, which enjoys the light computation due to the usage of random projection and, at the same time, significantly improves the effectiveness of random projection. Third, for the challenge with large-scale constraints, we develop a novel multi-stage metric learning framework. It divides the original optimization problem into multiple stages. It reduces the computation by adaptively sampling a small subset of constraints at each stage. Finally, to handle redundant features with a group property, we develop a greedy algorithm that selects a feature group and learns the corresponding metric simultaneously at each iteration, leading to further improvement of learning efficiency when combined with an adaptive mini-batch strategy and incremental sampling. Besides the theoretical and empirical investigation of DML on benchmark datasets of machine learning, we also apply the proposed methods to several important computer vision applications (i.e., fine-grained visual categorization (FGVC) and face recognition).
ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor Dr. Rong Jin. I met him for the first time when I was a master student at Nanjing University. The talk with him opened a window with a different but very insightful and beautiful scenery for me. When I began my PhD study at MSU, he devoted a lot of his time to training me to think in a mathematical way to solve learning problems, where I found my passion and happiness in research. His strict but friendly guidance is exhaustive and tireless, which makes me more and more confident about my career. His relentless passion for academic research also influences me much beyond the study. I'm very fortunate to have Dr. Jin as my advisor.

I also would like to thank the other committee members: Dr. Pang-Ning Tan, Dr. Sara Selin Aviyente and Dr. Xiaoming Liu. I would like to thank my colleagues at MSU. They are: Fengjie Li, Mehrdad Mahdavi, Tianbao Yang, Lijun Zhang, Jinfeng Yi, Zheyun Feng, Beibei Liu, Lanbo She and Qiaozi Gao. They not only discussed my research with me, but also helped me to make my life in East Lansing better. I also would like to thank our professional and patient staff at the department of computer science: Norma Teague, Linda Moore and Katherine Trinklein. I wish Norma and Linda enjoy their retired life.

I would like to thank my colleagues at NEC Laboratories America: Shenghuo Zhu, Yuanqing Lin and Xiaoyu Wang. I had two summer internships there and my mentor Shenghuo Zhu inspired me a lot on both research and engineering. I would like to thank the people in the Alibaba Seattle office and silicon valley office, where I took my recent summer internship. I would like to thank Luo Si, Jun Wang, Jun Tao, Zhi Pan Li and Xu Xie.

I want to thank Dr. Zhi-Hua Zhou, who is my master advisor at Nanjing University. Without his help, I would not have touched machine learning and begun such a wonderful research trip. His devoted attitude towards both work and life has benefited me a lot since then.

Finally, I would like to thank my dear family. Without support from them, I could not finish my study so smoothly. I thank my wife Juhua Hu, who helps me much in both research and life. Every publication of mine has her contribution. The sweetest acknowledgement is to my son, Michael Jiabo Qian, who brings me a totally different role and life.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
Chapter 1 Introduction
1.1 Distance Metric
1.2 Supervision in Distance Metric Learning
1.3 Computational Challenges in Distance Metric Learning
1.4 Applications of Distance Metric Learning to Computer Vision
1.5 Contributions
Chapter 2 Literature Survey
2.1 PSD Constraint in Distance Metric Learning
2.2 Linear Distance Metric Learning
2.2.1 Distance Metric Learning with Pairwise Constraints
2.2.2 Distance Metric Learning with Triplet Constraints
2.3 Nonlinear Distance Metric Learning
2.4 Improving Efficiency of Distance Metric Learning
2.5 Generalization Performance of Distance Metric Learning
Chapter 3 Large-scale Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD)
3.1 Improved SGD for DML by Mini-batch and Adaptive Sampling
3.1.1 Mini-batch SGD for DML (Mini-SGD)
3.1.2 Adaptive Sampling based SGD for DML (AS-SGD)
3.1.3 Hybrid Approaches: Combining Mini-batch with Adaptive Sampling for DML
3.2 Experiments
3.2.1 Parameter Setting
3.2.2 Experiment (I): Effectiveness of the Proposed SGD Algorithms for DML
3.2.3 Experiment (II): Efficiency of the Proposed SGD Algorithms for DML
3.2.4 Experiment (III): Comparison with State-of-the-art Online DML Methods
3.3 Conclusions
Chapter 4 Distance Metric Learning for High Dimensional Data
4.1 Dual Random Projection for Distance Metric Learning
4.1.1 Dual Random Projection for Distance Metric Learning
4.1.2 Main Theoretical Results
4.2 Experiments
4.2.1 Experimental Setting
4.2.2 Efficiency of the Proposed Method
4.2.3 Evaluation by Ranking
4.2.4 Evaluation by Classification
4.3 Conclusions
Chapter 5 Fine-Grained Visual Categorization via Multi-stage Metric Learning
5.1 Multi-stage Metric Learning
5.1.1 Constraints Challenge: Multi-stage Division
5.1.2 Computational Challenge: Dual Random Projection
5.1.3 Storage Challenge: Low Rank Approximation
5.2 Experiments
5.2.1 Oxford Cats & Dogs
5.2.2 Oxford 102 Flowers
5.2.3 Birds-2011
5.2.4 Stanford Dogs
5.2.5 Comparison of Efficiency
5.3 Conclusions
Chapter 6 Feature Selection for Face Recognition: A Distance Metric Learning Approach
6.1 DML for feature selection
6.1.1 Adaptive Mini-batch Strategy
6.1.2 Incremental Sampling Strategy
6.2 Experiments
6.2.1 Experiment I: Face Verification
6.2.2 Experiment II: Face Classification
6.3 Conclusion
Chapter 7 Conclusions & Future Plan
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Statistics for the ten datasets used in our empirical study.

Table 3.2: Classification error (%) of k-NN (k=3) using the distance metrics learned by the proposed SGD methods for DML. Standard deviation computed from five trials is included in the parenthesis.

Table 3.3: Classification error (%) of k-NN (k=3) using the distance metrics learned by the baseline SGD method, online learning algorithms and a batch learning approach for DML. Standard deviation computed from five trials is included in the parenthesis.

Table 3.4: The comparison of running time (seconds) for OASIS and the hybrid methods. Average results over five trials are reported.

Table 4.1: Examples of applications with high dimensional features.

Table 4.2: Statistics for the datasets used in our empirical study. #C is the number of classes. #F is the number of original features. #Train and #Test represent the number of training data and test data, respectively.

Table 4.3: CPU time (minutes) for different methods for DML. All algorithms are implemented in Matlab except for LMNN, whose core part is implemented in C and is more efficient than our Matlab implementation.

Table 4.4: Comparison of ranking results measured by mAP (%) for different metric learning algorithms.

Table 5.1: Comparison of mean accuracy (%) on the cats & dogs dataset. "#" means that more information (e.g., ground truth segmentation) is used by the method.

Table 5.2: Comparison of mean accuracy (%) on the 102flowers dataset. "#" means that more information (e.g., ground truth segmentation) is used by the method.

Table 5.3: Comparison of mean accuracy (%) on the birds11 dataset. "*" denotes the method that mirrors training images.

Table 5.4: Comparison of mean accuracy (%) on the S-dogs dataset. "*" denotes the method that mirrors training images.

Table 5.5: Comparison of running time (seconds).

Table 6.1: Comparison of running time (seconds). The reported running time of Greco-mini and Greco-hybrid is that at which they achieve the same training performance as Greco with 100 iterations.

LIST OF FIGURES

Figure 1.1: Illustration of the problem of Euclidean distance. The left image is closer than the right one to the target image under Euclidean distance while it is actually from a different class "Common dandelion".

Figure 2.1: The figure is from the work [114] and illustrates the learning procedure of LMNN. The learned metric pushes away the examples that are from different classes but in the nearest neighbors with a large margin.

Figure 3.1: The training and testing errors over epochs for dataset dna.
Figure 3.2: The comparison of running time (seconds) for various SGD methods. Note that LMNN, a batch DML algorithm, is mainly implemented in C, which is computationally more efficient than our Matlab implementation. All the other methods are implemented in Matlab.

Figure 3.3: The comparison of the number of updates for various SGD methods. Note that since POLA and LEGO optimize pairwise constraints, we decompose each triplet constraint into two pairwise constraints for these two methods. As a result, the number of constraints is doubled for these two methods.

Figure 4.1: The eigenvalue distribution of datasets used in our empirical study.

Figure 4.2: The comparison of different stochastic algorithms for ranking.

Figure 4.3: The comparison of different stochastic algorithms for classification.

Figure 5.1: Illustration of how DML learns the embedding that pulls together the data points from the same class and pushes apart the data points from different classes. Blue points are from the class "English marigold" while red ones are "Barberton daisy". An important note here is that our DML does not require collapsing data points from each class and this allows the flexibility to model intra-class variance. A big challenge now is how to deal with the high-dimensional feature representation which is typical for image-level visual features. To this end, we propose a multi-stage scheme for metric learning.

Figure 5.2: The framework of the proposed method.

Figure 5.3: Examples of retrieved images. The first column are query images highlighted by green bounding boxes. Columns 2-4 include the most similar images measured by Euclid. Columns 5-7 show those by the metric from LMNN. Columns 8-10 are from the metric of MsML. Images in columns 2-10 are highlighted by red bounding boxes when they share the same category as queries, and blue bounding boxes if they are not.

Figure 5.4: Convergence curve of the proposed method on 102flowers.

Figure 5.5: Comparison with different sizes of classes on birds11.

Figure 5.6: Examples of retrieved images. The first column are query images highlighted by green bounding boxes. Columns 2-4 include the most similar images measured by Euclid. Columns 5-7 show those by the metric from LMNN. Columns 8-10 are from the metric of MsML. Images in columns 2-10 are highlighted by red bounding boxes when they share the same category as queries, and blue bounding boxes if they are not.

Figure 6.1: Illustration of feature selection for face verification. Although the descriptors are over-complete, a subset of descriptors (e.g., eyes, nose, etc.) can capture most of the differences.

Figure 6.2: Comparison of training error on face verification. The red star denotes the position where Greco-mini achieves the same performance as Greco with 100 iterations. The red cross denotes the position where Greco-hybrid achieves the same performance as Greco with 100 iterations.

Figure 6.3: Comparison of test error on face verification.

Figure 6.4: Comparison of training error on face classification. The red star denotes the position where Greco-mini achieves the same performance as Greco with 100 iterations. The red cross denotes the position where Greco-hybrid achieves the same performance as Greco with 100 iterations.

Figure 6.5: Comparison of test error on face classification.

Figure 6.6: Comparison of training error on face classification with the baseline method developed by NEC. The red cross denotes the position where Greco-hybrid-5 achieves the same performance as NEC's baseline method with 100 iterations. The red star denotes the position where Greco-hybrid-10 achieves the same performance as NEC's baseline method with 100 iterations.

Figure 6.7: Comparison of test error on face classification with the baseline method developed by NEC.

LIST OF ALGORITHMS

Algorithm 1: Mini-batch Stochastic Gradient Descent (Mini-SGD) for DML
Algorithm 2: Adaptive Sampling Stochastic Gradient Descent (AS-SGD) for DML
Algorithm 3: A Framework of Hybrid Stochastic Gradient Descent (Hybrid-SGD) for DML
Algorithm 4: Dual Random Projection Method (DuRP) for DML
Algorithm 5: An Efficient Algorithm for Recovering M and Projecting It onto the PSD Cone
Algorithm 6: The Multi-stage Metric Learning Framework for High Dimensional DML (MsML)
Algorithm 7: Greedy Coordinate Descent Metric Learning for Face Verification (Greco)
Algorithm 8: Greedy Coordinate Descent Metric Learning with Adaptive Mini-batch (Greco-mini)
Algorithm 9: Greedy Coordinate Descent Metric Learning with Incremental Sampling (Greco-isamp)
Algorithm 10: Greedy Coordinate Descent Metric Learning with Hybrid Strategies (Greco-hybrid)

Chapter 1
Introduction

Machine learning, as a subfield of artificial intelligence, involves automatically improving from experience and has been successfully applied to many real world applications [83]. Distance functions are essential to many machine learning tasks. For example, the k-nearest neighbor (k-NN) classifier [1] assigns to a test example the most frequent label among its k nearest neighbors from the training set. The k-means clustering algorithm [53] outputs clusters according to the distance from each instance to the k centers. However, Euclidean distance with handcrafted features may not be best suited to capture the difference between different classes or clusters. Fig. 1.1 provides an example where the k-NN classifier with Euclidean distance is used to identify the flower type "English marigold". Using LLC features [113], we found that the example from a different class (i.e., "Common dandelion") is closer to the target example than the one from the same class. Therefore, learning an appropriate distance function becomes the key for distance based methods.

In this dissertation, we will study distance metric learning (DML), which aims to learn a metric that pulls the examples from the same class together and pushes the examples from different classes away from each other according to the supervised information. We start the discussion by introducing the form of the distance metric and the supervision in DML.

Figure 1.1: Illustration of the problem of Euclidean distance. The left image is closer than the right one to the target image under Euclidean distance while it is actually from a different class "Common dandelion".

1.1 Distance Metric

A distance function measures the distance between any two examples, which could be denoted as dist(x_i, x_j) for d-dimensional examples x_i and x_j. Most existing DML methods adopt the form of the "Mahalanobis distance" [79]:

dist_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)

where M is called a distance metric with the size of d x d. It is easy to verify that this distance function is equivalent to the Euclidean distance when M is an identity matrix. The target of DML is to learn a better distance function (i.e., distance metric) than the Euclidean distance to evaluate the distances between pairs of examples correctly. As a distance function, it should satisfy certain properties, such as symmetry, non-negativity and the triangle inequality. Consequently, the learned metric is required to be a positive semi-definite (PSD) matrix, which is often referred to as the PSD constraint. We will elaborate strategies on how to handle it in Section 2.1.
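As a quick illustration (a minimal numpy sketch for exposition only; the function names here are hypothetical and the dissertation's own implementations are in Matlab), the Mahalanobis distance above can be computed as follows, and it reduces to the squared Euclidean distance when M is the identity matrix:

import numpy as np

def mahalanobis_sq(xi, xj, M):
    # dist_M(xi, xj) = (xi - xj)^T M (xi - xj)
    diff = xi - xj
    return float(diff @ M @ diff)

d = 5
xi, xj = np.random.rand(d), np.random.rand(d)
# With M = I the learned distance reduces to the squared Euclidean distance.
assert np.isclose(mahalanobis_sq(xi, xj, np.eye(d)), np.sum((xi - xj) ** 2))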
1.2 Supervision in Distance Metric Learning

To learn a good distance metric, certain supervised information from the data is needed. Different from many supervised [106] or semi-supervised [123] machine learning approaches, the supervision for distance metric learning appears as pairwise or triplet constraints rather than labels for the training examples. Each pairwise constraint consists of two examples. The metric is learned to guarantee that the distance of pairs from the same class is smaller than a pre-defined threshold while that of pairs from different classes is large [115]. In contrast, three examples are included in a triplet constraint [114]. A good metric should make sure that the examples from the same class are separated from the examples of different classes with a large margin. It is obvious that a single triplet constraint could be divided into three pairwise constraints. Recently, some researchers proposed quadruplet constraints, which contain four examples in a constraint and could be considered as a variant of triplet constraints [76].

These constraints could be derived from labels or provided by the applications directly. For example, pairwise constraints consist of pairs of examples with the same label and pairs of those from different classes. Some applications can also provide the pairwise constraints directly. In face recognition [63], each training example contains two face images and the label is given as an indicator of whether these two images belong to the same person or not. Given these constraints, various DML methods have been developed, and representative ones with pairwise or triplet constraints are described in Section 2.2.1 and Section 2.2.2, respectively.

Note that although dimension reduction methods, e.g., principal component analysis (PCA) [95] and linear discriminant analysis (LDA) [54], or unsupervised methods, e.g., locally linear embedding (LLE) [94] and isometric feature mapping (ISOMAP) [101], are sometimes also categorized as distance metric learning, we will focus on the discussion of general purpose DML methods with pairwise or triplet supervision in this study.

1.3 Computational Challenges in Distance Metric Learning

Given this supervised information, the objective of DML is to optimize over pairwise or triplet constraints while keeping the learned metric in the PSD cone. Many DML methods have been developed under this framework; however, few of them try to address the fundamental issues in DML. The computational challenges are mainly from three aspects.

• To make sure the learned metric is a PSD matrix, most DML methods have to project the intermediate solution onto the PSD cone at every iteration. The projection step requires the eigen-decomposition operator, which costs O(d^3) (at least O(d^2)), where d is the dimensionality of the data.

• The number of variables that need to be optimized increases from O(d) in a linear model to O(d^2) in DML. It results in a slower convergence rate in solving the related optimization problem [89]. In addition, d^2 variables become too huge to be stored in memory when the dimensionality of the data is sufficiently large.

• A large number of constraints is needed to avoid overfitting when learning a metric. The number of constraints could be up to O(n^3) (i.e., triplet constraints), where n is the number of examples. It means that the total cost of DML could be up to O(d^3 n^3).

Besides, to capture the details of images in computer vision (e.g., face recognition), over-complete descriptors are sampled from each image, which results in high dimensional representations and redundant features. The occurrence of all these problems makes large-scale high dimensional DML an extremely difficult task, and we will study it extensively in this work.

1.4 Applications of Distance Metric Learning to Computer Vision

Distance metric learning is an important subject in machine learning, and has found applications in many domains, including information retrieval [56], supervised classification [114], clustering [115, 23] and domain adaptation [93]. Besides the machine learning community, other research groups, e.g., computer vision and data mining [110, 111], also realize the importance of DML, and have applied it to many real world applications.
In computer vision, DML was first found to be very helpful for image classification [49, 82, 108, 38], visual object recognition [36, 74] and image retrieval [57, 24, 112]. Image classification and object recognition tasks are often solved as multi-class classification problems. They first obtain a metric according to the pairwise or triplet information and then apply a k-NN classifier for classification. Since images are very sensitive to different poses, light conditions and angles, DML can learn an invariant space to obtain the semantic difference between different objects from low level features. In a standard image retrieval scenario, the training set consists of pairwise or triplet constraints and a distance metric is learned to rank these pairs appropriately. When a query image arrives, the system should output the images that are most similar to the query one, which is equivalent to obtaining the images closest to the query under the learned metric.

In addition, DML is also adopted for the face verification problem [59, 50, 29, 17, 71]. Face verification is very practical and has already been used in many real businesses, e.g., terrorist detection, border control and access control systems [64]. After obtaining a metric via optimizing lots of pairs of images from the same or different persons, an access control system could identify a person automatically and keys are not necessary for the house. For example, when a stranger comes, the camera in the door can take a photo of him and then compare it to the photos of the people who live in the house. If the pairwise comparison under the learned metric returns true, the door will open; otherwise it will stay closed. Compared with traditional access badges, the face is naturally unique for each person and is more difficult or even impossible to copy, which makes the house much safer. Note that, different from the conventional multi-class classification task, there could be billions of people in the dataset and each person has a small number of images (e.g., one per person), which makes most existing classifiers, especially one-vs-all methods, fail for this classification problem with a huge number of classes. Besides these applications, DML is also adopted for object tracking [27, 104], video event detection [4, 87], etc. in computer vision.

In this work, we will introduce DML to fine-grained visual categorization (FGVC) [7]. In contrast to classifying an image into a basic class, FGVC requires categorizing the image into a subordinate class, where classes only have subtle differences and the number of classes could be very large. Furthermore, the images in the same class could be very different due to the different poses, examples, etc., which means the intra-class variance is large. The properties of DML, which is independent of the number of classes compared to the one-vs-all strategy and flexible to handle large intra-class variance, make it appropriate for FGVC. To the best of our knowledge, this is the first work to apply DML to the challenging FGVC task.

1.5 Contributions

In this study, we develop several randomized algorithms to alleviate the challenges described in Section 1.3. First, to reduce the number of PSD projections, we combine the mini-batch stochastic gradient descent method with an adaptive sampling strategy. Second, we develop a dual random projection method to handle the large number of variables. Finally, we propose a multi-stage metric learning framework to divide the learning procedure into a series of subproblems, where each one only has an epoch (O(n)) of active constraints, and the total computational cost of the resulting new framework of DML is linear in the dimensionality and the number of examples (O(dn)). Besides these, we investigate the feature selection problem and develop a greedy method to select features and learn metrics simultaneously. The detailed contributions of the dissertation are described as follows.
• We first exploit the combination of the mini-batch strategy with a smooth loss function for DML. Then, we propose an adaptive sampling approach for efficient DML. We verify, both theoretically and empirically, the efficiency and effectiveness of the mini-batch strategy with smooth loss and the adaptive sampling approach for DML, respectively. Finally, we present two hybrid approaches that exploit the combination of the mini-batch strategy with adaptive sampling for DML. To the best of our knowledge, it is the first work that reduces the number of updates in DML while keeping a similar performance.

• We propose a dual random projection approach for high dimensional DML. Our approach, on one hand, enjoys the light computation of random projection, and on the other hand, significantly improves the effectiveness of random projection. We verify the effectiveness of the proposed algorithms both empirically and theoretically.

• We develop a novel multi-stage metric learning framework for high dimensional DML to address high dimensional DML with a large number of constraints. We divide the original optimization problem into multiple stages. At each stage, only a small subset of constraints that are difficult to be classified by the currently learned metric will be adaptively sampled and used to improve the learned metric. Then we extend the dual random projection strategy to solve each subproblem. The empirical study with standard FGVC benchmark datasets verifies that our framework is both effective and efficient compared to the state-of-the-art FGVC approaches.

• We propose a greedy algorithm that can select descriptors (i.e., features) and learn the corresponding metrics simultaneously for face recognition with a guaranteed convergence rate. To alleviate the high computational cost of the exhaustive search that finds only one feature group at each iteration, we investigate the adaptive mini-batch strategy and incremental sampling respectively, and the hybrid algorithm that combines these two strategies is also studied. The empirical study on face recognition confirms the effectiveness and efficiency of the proposed methods.

Chapter 2
Literature Survey

In this chapter, we will introduce the existing DML methods briefly. Section 2.1 describes the general strategies to address the PSD constraint that all DML methods have to deal with. After that, Sections 2.2-2.4 present the popular DML methods. Finally, Section 2.5 summarizes the theoretical analysis for DML.

2.1 PSD Constraint in Distance Metric Learning

Before introducing the specific DML methods, we first demonstrate the strategies to handle the PSD constraint, which keeps the learned metric in the PSD cone, in DML. Both early DML methods that require the gradient from all constraints at each iteration [115, 45], and efficient approaches [66, 98] that deal with only one constraint at each iteration by exploiting the techniques of online learning or stochastic optimization, share one common strategy: in order to ensure that the learned distance metric is PSD, these approaches require, at each iteration, projecting the learned distance metric M~ onto the PSD cone by solving the problem

min_{M is PSD} ||M - M~||_F^2

The solution keeps the positive eigenvalues with the corresponding eigenvectors [47] (i.e., M = Σ_{λ_i > 0} λ_i v_i v_i^T, where λ_i and v_i are the i-th eigenvalue and the corresponding eigenvector of M~). Therefore, it has to perform the eigen-decomposition of a given matrix, which is computationally expensive (i.e., at least O(d^2)).
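For concreteness, a minimal numpy sketch of this projection step (hypothetical helper names only, not the dissertation's Matlab code) makes the cost explicit: the symmetric eigen-decomposition dominates and scales cubically with the dimensionality d.

import numpy as np

def project_psd(M_tilde):
    # Keep only the non-negative eigenvalues: P(M~) = sum over lambda_i > 0 of lambda_i * v_i v_i^T
    eigvals, eigvecs = np.linalg.eigh(M_tilde)   # eigen-decomposition of a symmetric matrix, O(d^3)
    eigvals = np.maximum(eigvals, 0.0)
    return (eigvecs * eigvals) @ eigvecs.T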
Several studies have been proposed to avoid projections in SGD. In [55], the authors developed a projection free SGD algorithm that replaces the projection step with a constrained linear programming problem. In [80], the authors proposed an SGD algorithm with only one projection that is performed at the end of the iterations. Unfortunately, the improvement of the two algorithms in computational efficiency is limited, because they have to compute, at each iteration, the minimum eigenvalue and the corresponding eigenvector of the updated distance metric, an operation with O(d^2) cost, where d is the dimensionality of the data.

It is noteworthy to mention that in a recent study [24], the authors show empirically that it is possible to learn a good distance metric using online learning without having to perform the projection at each iteration. In fact, only one projection onto the PSD cone is performed at the end of online learning to ensure that the resulting matrix is PSD. It is different from the algorithms presented in [55, 80] and no additional mechanisms are needed to prevent the intermediate solutions from being too far away from the PSD cone. We refer to this strategy as the one-projection paradigm in the rest of this dissertation.

2.2 Linear Distance Metric Learning

Most conventional DML approaches are equivalent to seeking a linear transformation of the input space. In this section, we will describe representative approaches in detail according to the type of constraints used. More examples could be found in two survey papers [117, 73].

2.2.1 Distance Metric Learning with Pairwise Constraints

As mentioned in Section 1.2, DML usually involves two kinds of constraints: pairwise or triplet. Most early works for distance metric learning focus on optimizing pairwise constraints. Xing, et al. [115] proposed one of the earliest methods for the typical DML problem as studied in this dissertation. The objective is maximizing the distance between the pairs of examples from different classes while keeping the distance between the pairs of examples with the same label less than a pre-defined threshold. Although the problem is solved via a semi-definite programming (SDP) method [12] that is only runnable on small datasets, they show the advantage of DML for distance based methods and stimulate more research on this topic. The POLA method [96] was then proposed to alleviate the computational cost of DML. It is an online learning method, which only receives a single pairwise constraint at each iteration. If the current metric makes a mistake, it will be updated accordingly; otherwise the metric will remain the same. The cost of each iteration is much lower since only the gradient from a single constraint is computed. In addition, the PSD projection is only needed when a pair of examples from the same class is misclassified, and the cost could be as low as O(d^2) since the update is a rank one matrix. MCML [45] is the convex relaxed version of NCA [46]. It represents the extreme case of DML where all data points in the same class should be mapped into a single location by the learned metric. For this purpose, it minimizes the KL divergence between the distribution returned by the learned metric and that from the ideal case. It also proposed to keep a low-rank copy after obtaining the metric to achieve dimension reduction. Until now, all of these DML methods may require a PSD projection at every iteration, which could be O(d^3). ITML [32] applied the LogDet regularizer to avoid PSD projections. LogDet divergence is a Bregman matrix divergence [13] between two matrices. By introducing this regularizer, the metric is updated in the inverse form with a rank one matrix at each iteration, which could be finished without computing the inverse in practice via the Sherman-Morrison-Woodbury formula. Meanwhile, the updating rule makes sure that the intermediate learned metric is still in the PSD cone. Although there is no PSD projection for ITML, the additional cost is included and the cost for each iteration is still no less than O(d^2).

2.2.2 Distance Metric Learning with Triplet Constraints

DML methods with pairwise constraints aim to learn a metric that collapses all examples in each class to a small cluster whose radius is the pre-defined threshold. However, it is difficult when the intra-class variance is large, which is often the case in real world applications.
Thus, triplet constraints are proposed to handle this problem more flexibly [114]. Each triplet constraint contains three examples: two examples with the same label, where one is among the k-nearest neighbors of the target example, and one example from a different class that is also among the nearest neighbors of the target example. LMNN [114] then learns the metric to push away the example of the different class with a large margin. Fig. 2.1 illustrates the effectiveness of LMNN. Compared with previous DML approaches, it only requires that the examples in the nearest neighbors could be collapsed to a small cluster, which preserves the large intra-class variance and is more appropriate for combining with a k-NN classifier. Since it borrows the concept of large margin, it also could be analyzed from the view of the support vector machine (SVM) [107], where LMNN is similar to learning a series of local SVM-like models [33].

Figure 2.1: The figure is from the work [114] and illustrates the learning procedure of LMNN. The learned metric pushes away the examples that are from different classes but in the nearest neighbors with a large margin.

After the success of LMNN, Chechik, et al. [24] applied the triplet constraints in an online learning fashion for efficiency and achieved a good performance for ranking. Surprisingly, they found empirically that one PSD projection at the end of the algorithm will not affect the performance, which implies the potential possibility of getting rid of the expensive PSD projection. Since the gradient descent method requires projecting the intermediate solution onto the PSD cone, the authors of [98] adopted a mini-batch stochastic gradient descent method to balance the cost of computing the gradient and that of the PSD projection. Furthermore, it also reduces the variance of the stochastic method, which only randomly samples a single triplet at each iteration.

2.3 Nonlinear Distance Metric Learning

Sometimes, a linear transformation could not exploit complicated data distributions sufficiently, e.g., face images usually lie on a nonlinear manifold [59]. To handle this nonlinear feature space, the kernel trick, which has been well studied in linear models [15], is applied to extend the existing DML methods [45, 32, 114, 102]. Its basic idea is mapping the input examples to a much higher or even infinite dimensional space and then obtaining the linear transformation there. Although it sometimes performs better, the size of the learned metric (i.e., dual variables) is O(n^2) for DML, which is more expensive when n >> d. The alternative way to provide nonlinear learning capacity is deep learning (also called deep neural networks), which is very popular these years due to its encouraging performance [72]. Unlike the kernel trick, deep learning deals with the input features directly and learns the nonlinear mapping between each layer according to nonlinear activation functions, e.g., tanh, sigmoid, etc. The original structure of deep learning is designed for classification and the pipeline is optimized with a single image as input. By replacing the loss function with a pairwise [59] or triplet distance [112] and concatenating the pipelines in parallel, it is convenient to adapt to DML tasks.

Besides these extensions for single metric learning, there are some studies on learning multiple metrics simultaneously. The key part is defining the metric for appropriate components. Some works apply multi-metric learning on the data. That is, they learn a metric for each class or local cluster, and sometimes even for each single example [42, 84]. For example, MM-LMNN [114] learns a metric per class and the distance between two examples is defined on the metric corresponding to the label of the examples. On the other hand, the multiple metrics also could be learned according to different feature blocks [61, 29].
For example, each face image could be divided into different parts, e.g., nose, eyes, ears, etc., and then PMML [29] learns a single metric on each feature block, which represents a different part. Although multiple metrics may improve the performance over a single metric, it intensifies the already expensive optimization problem by increasing the number of metrics. Moreover, some good properties of a single metric are sacrificed, e.g., global transformation, interactions between different components, etc. Since all of these extensions are based on conventional DML, we will focus on learning a single linear distance metric in this dissertation.

2.4 Improving Efficiency of Distance Metric Learning

Although many DML methods have been developed, the cost of the update at each iteration is still O(d^2) due to the number of variables. Recently, some works aim to alleviate the high computational cost via sparsity. Different from linear classifiers, the sparsity of a metric could come from two aspects: low rank and item sparsity.

Since each metric is a PSD matrix, it could be decomposed as M = LL^T, where M is a d x d matrix, L is a d x r matrix and r is the rank of M. When r << d, the total number of variables for optimization decreases from d^2 to rd, which is only O(d). Some existing methods apply the low rank strategy to accelerate the learning process [114, 31]. The drawback is that the corresponding optimization problem is not convex anymore; hence, it is not guaranteed to obtain a globally optimal metric. In addition, these methods require fixing r before applying the DML methods, which easily leads to a suboptimal solution when the true rank is larger than the pre-defined rank. Some studies apply trace norm or fantope regularization [77] to obtain a low rank metric, while the cost of the training stage is still at least O(d^2).

For item sparsity, it could be further categorized into two parts: sparsity on the single item and sparsity on the columns. The former assumes that only s items have nonzero values in each column/row, where s << d, and corresponds to an L1 regularizer for each column/row [89]. In the latter scenario, there is only a limited number of columns with nonzero items, which is equivalent to penalizing the whole matrix with an L2,1 regularizer [78]. These methods could output a sparse metric which is efficient for the test stage; however, the learning process still involves O(d^2) variables.

2.5 Generalization Performance of Distance Metric Learning

Besides developing efficient DML methods for applications, some studies begin to analyze the generalization performance of DML methods theoretically [68, 16, 6]. Given a specific DML method and a training set that is i.i.d. sampled from an unknown distribution, we are interested in the learned model's prediction error on an arbitrary pairwise constraint from the true distribution, which is known as the generalization error. However, the only evaluation that we could have is its prediction error on the training set, which is usually referred to as the empirical error. According to the theory of stability, Jin, et al. [68] proposed the first work that uncovered the relationship between them in DML and found that the empirical error converges to the generalization error at the rate of O(1/sqrt(n)) with high probability. After that, some other studies show similar results via different techniques [16, 6]. Cao, et al. further researched the influence of different regularizers, such as the Frobenius norm, L1 norm, L2,1 norm and trace norm.

These analyses are based on the classification accuracy for pairwise constraints, while the more interesting question is the performance of the k-NN classifier with the learned metric. Guo, et al. [51] proposed the theory that bridges this gap and shows that the generalization error of the linear SVM with the similarity matrix from the learned metric is upper bounded by the generalization error of the corresponding metric learning, which means that the generalization performance of the linear classifier is guaranteed by that of metric learning.
Chapter 3
Large-scale Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD)

In this chapter(1), we will address the challenge from PSD projections (O(d^3)). Unlike previous efforts for handling the PSD constraint as mentioned in Section 2.1, we will focus on reducing the number of PSD projections rather than optimizing each projection in SGD, which is the popular strategy for large-scale data. As a result, the key challenge in developing efficient SGD algorithms for DML is how to reduce the number of projections without affecting the performance of DML.

A common approach for reducing the number of updates and projections in DML is to use a non-smooth loss function. A popular choice of the non-smooth loss function is the hinge loss, whose derivative becomes zero when the input value exceeds a certain threshold. Many online learning algorithms for DML [24, 32, 66] take advantage of the non-smooth loss function to reduce the number of updates and projections. In [98], the authors proposed a structure preserving metric learning algorithm (SPML) that combines a mini-batch strategy with the hinge loss to further reduce the number of updates for DML. It groups multiple constraints into a mini-batch and performs only one update of the distance metric for each mini-batch. But, according to our empirical study, although SPML reduces the running time of the standard SGD algorithm, it results in a significantly worse performance for several datasets, due to the deployment of the mini-batch strategy.

In this chapter, we first develop a new mini-batch based SGD algorithm for DML, termed Mini-SGD. Unlike SPML that relies on the hinge loss, the proposed Mini-SGD algorithm exploits a smooth loss function for DML. By using a smooth loss function, the proposed algorithm is able to effectively take advantage of the reduction in the variance of gradients achieved by the mini-batch, which in return leads to a better regret bound for online learning [28] and consequentially a more accurate prediction for the learned distance metric. We show theoretically that by using a smooth loss function, Mini-SGD is able to achieve a similar convergence rate as the standard SGD algorithm but with a significantly smaller number of updates. The second contribution of this work is to develop a new strategy, termed adaptive sampling, for reducing the number of projections in DML. The key idea of adaptive sampling is to first measure the "difficulty" in classifying a constraint using the learned distance metric, and then perform stochastic updating based on the difficulty measure. Finally, we develop two hybrid approaches that combine adaptive sampling with mini-batch to further improve the computational efficiency of SGD for DML. We conduct an extensive empirical study to verify the effectiveness and efficiency of the proposed algorithms for DML.

(1) This chapter is adapted from the published paper: Q. Qian, R. Jin, J. Yi, L. Zhang and S. Zhu. Efficient Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD). Machine Learning Journal (MLJ), 99(3):353-372, 2015.

We summarize the main contributions of this work as follows:

• To the best of our knowledge, this is the first work that exploits the combination of the mini-batch strategy with a smooth loss function for DML. We verify, both theoretically and empirically, the efficiency and effectiveness of the mini-batch strategy with smooth loss for DML.

• We propose an adaptive sampling approach for efficient DML. We verify, both theoretically and empirically, the efficiency and effectiveness of the adaptive sampling approach for DML.

• We present two hybrid approaches that exploit the combination of the mini-batch strategy with adaptive sampling for DML. Our extensive empirical study verifies that the hybrid approaches are significantly more efficient than both the mini-batch strategy and the adaptive sampling approach.
The rest of this chapter is organized as follows: Section 3.1 describes the proposed SGD algorithms for DML based on mini-batch and adaptive sampling. Two hybrid approaches are presented that combine mini-batch and adaptive sampling for DML. The theoretical guarantees for both mini-batch based and adaptive sampling based SGD are also presented in Section 3.1. Section 3.2 summarizes the results of the empirical study, and Section 3.3 concludes this chapter.

3.1 Improved SGD for DML by Mini-batch and Adaptive Sampling

We first review the basic framework of DML with triplet constraints. We then present two strategies to improve the computational efficiency of SGD for DML, one by mini-batch and the other by adaptive sampling. We present the theoretical guarantees for both strategies, and defer the more detailed analysis to the appendix. At the end of this section, we present two hybrid approaches that combine mini-batch with adaptive sampling for more efficient DML.

Let X be the domain for input patterns in R^d, where d is the dimensionality. For the convenience of analysis, we assume all the input patterns have bounded norm, i.e., for any x in X, ||x||_2 <= r. Given a distance metric M in R^{d x d}, the squared distance between x_a and x_b, denoted by dist_M(x_a, x_b), is measured by

dist_M(x_a, x_b) = (x_a - x_b)^T M (x_a - x_b)

Let Ω = {M : M is PSD, ||M||_F <= R} be the domain for the distance metric M, where R specifies the domain size. Let D = {(x_i^1, x_j^1, x_k^1), ..., (x_i^N, x_j^N, x_k^N)} be the set of triplet constraints used for DML, where x_i^t is expected to be closer to x_j^t than to x_k^t. Let ℓ(z) be the convex loss function. Define Δ(x_i^t, x_j^t, x_k^t; M) as

Δ(x_i^t, x_j^t, x_k^t; M) = dist_M(x_i^t, x_k^t) - dist_M(x_i^t, x_j^t) = <M, (x_i^t - x_k^t)(x_i^t - x_k^t)^T - (x_i^t - x_j^t)(x_i^t - x_j^t)^T> = <M, A_t>

where

A_t = (x_i^t - x_k^t)(x_i^t - x_k^t)^T - (x_i^t - x_j^t)(x_i^t - x_j^t)^T

Given the triplet constraints in D and the domain Ω, we learn an optimal distance metric M in R^{d x d} by solving the following optimization problem

min_{M in Ω} L(M) = (1/N) Σ_{t=1}^N ℓ(Δ(x_i^t, x_j^t, x_k^t; M))    (3.1)

We also define the expectation of the loss function as

L̄(M) = E[ℓ(Δ(x_i, x_j, x_k; M))]    (3.2)

where the expectation is taken over x_i, x_j and x_k.

The key idea of online DML is to minimize the empirical loss L(M) by updating the distance metric based on one sampled constraint at each iteration. More specifically, at iteration t, it samples a triplet constraint (x_i^t, x_j^t, x_k^t), and updates the distance metric M_t to M_{t+1} by

M_{t+1} <- Π_Ω(M_t - η ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t)) A_t)

where η > 0 is the step size, ℓ'(·) is the derivative and Π_Ω(M) projects a matrix M onto the domain Ω. The following proposition shows that Π_Ω(M) can be computed in two steps, i.e., first projecting M onto the PSD cone, and then scaling the projected M to fit the constraint ||M||_F <= R.

Proposition 1. [12] We have

Π_Ω(M) = P(M) / max(||P(M)||_F / R, 1)

Here P(M) projects the matrix M onto the PSD cone and is computed as P(M) = Σ_{i=1}^d max(λ_i, 0) v_i v_i^T, where (λ_i, v_i), i = 1, ..., d are the eigenvalues and corresponding eigenvectors of M.

As indicated by Proposition 1, Π_Ω(M) requires projecting the distance metric M onto the PSD cone, an expensive operation that requires the eigen-decomposition of M.

Finally, we approximate the hinge loss by a smooth loss in our study

ℓ(z) = (1/L) log(1 + exp(-L(z - 1)))    (3.3)

where L > 0 is a parameter that controls the approximation error: the larger the L, the closer ℓ(z) is to the hinge loss. Note that the smooth approximation of the hinge loss was first suggested in [122] for classification and was later verified by an empirical study in [119]. The key properties of the loss function ℓ(z) in (3.3) are given in the following proposition.

Proposition 2. For the loss function defined in (3.3), we have, for any z in R,

|ℓ'(z)| <= 1  and  |ℓ'(z)| <= L ℓ(z)

Compared to the hinge loss function, the main advantage of the loss function in (3.3) is that it is a smooth loss function. As will be revealed by our analysis, it is the smoothness of the loss function that allows us to effectively explore both the mini-batch and adaptive sampling strategies for more efficient DML without having to sacrifice the prediction performance.
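To make the shape of this loss concrete, here is a minimal Python/numpy sketch of ℓ(z) and its derivative (the helper names are hypothetical; the experiments later use L = 3, and the dissertation's own code is in Matlab):

import numpy as np

def smooth_hinge(z, L=3.0):
    # l(z) = (1/L) * log(1 + exp(-L*(z - 1))), a smooth approximation of the hinge loss
    # np.logaddexp(0, x) computes log(1 + exp(x)) in a numerically stable way
    return np.logaddexp(0.0, -L * (z - 1.0)) / L

def smooth_hinge_deriv(z, L=3.0):
    # l'(z) = -1 / (1 + exp(L*(z - 1))); note |l'(z)| <= 1, consistent with Proposition 2
    return -1.0 / (1.0 + np.exp(L * (z - 1.0)))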
3.1.1 Mini-batch SGD for DML (Mini-SGD)

Mini-batch SGD improves the computational efficiency of online DML by grouping multiple constraints into a mini-batch and only updating the distance metric once for each mini-batch. For brevity, we will refer to this algorithm as Mini-SGD for the rest of the chapter.

Algorithm 1 Mini-batch Stochastic Gradient Descent (Mini-SGD) for DML
1: Input: triplet constraints {(x_i^t, x_j^t, x_k^t)}_{t=1}^N, step size η, mini-batch size b, and domain size R
2: Initialize M_1 = I and T = N/b
3: for t = 1, ..., T do
4:   Sample b triplet constraints {(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})}_{s=1}^b
5:   Update the distance metric by M_{t+1} <- Π_Ω(M_t - η ∇ℓ_t(M_t))
6: end for
7: return M̄ = (1/T) Σ_{t=1}^T M_t

Let b be the batch size. At iteration t, it samples b triplet constraints, denoted by (x_i^{t,s}, x_j^{t,s}, x_k^{t,s}), s = 1, ..., b, and defines the mini-batch loss at iteration t as

ℓ_t(M_t) = (1/b) Σ_{s=1}^b ℓ(Δ(x_i^{t,s}, x_j^{t,s}, x_k^{t,s}; M_t))

Mini-batch DML updates the distance metric M_t to M_{t+1} using the gradient of the mini-batch loss function ℓ_t(M), i.e.,

M_{t+1} <- Π_Ω(M_t - η ∇ℓ_t(M_t))

Algorithm 1 gives the detailed steps of Mini-SGD for DML, where in step 5 Proposition 1 is used to efficiently compute the projection Π_Ω(·).

The theorem below provides the theoretical guarantee for the Mini-SGD algorithm for DML using the smooth loss function defined in (3.3).

Theorem 1. Let M̄ be the solution output by Algorithm 1 that uses the loss function defined in (3.3). Let M* be the optimal solution that minimizes L̄(M). Assuming ||A_t||_F <= A for any triplet constraint and η <= 1/(3LA^2), we have

E[L̄(M̄)] <= L̄(M*) / (1 - 3ηLA^2/2) + bR^2 / (2ηN(1 - 3ηLA^2/2))    (3.4)

where the expectation is taken over the sequence of triplet constraints.

Remark 1 First, we observe that the second term in the upper bound in (3.4), i.e., bR^2/[2ηN(1 - 3ηLA^2/2)], has a linear dependence on the mini-batch size b, implying that the larger the b, the less accurate the distance metric learned by Algorithm 1. Hence, by adjusting the parameter b, the size of the mini-batch, we are able to make an appropriate tradeoff between the prediction accuracy and the computational efficiency: the smaller the b, the more accurate the distance metric, but with more updates and consequentially a higher computational cost. When L̄(M*) = 0, we have E[L̄(M̄)] = O(1/N), i.e., the expected prediction error will be reduced at the rate of b/N, significantly faster than that of the mini-batch SGD algorithm (i.e., O(1/sqrt(N))) given in [28]. Second, if we set η as

η = 1 / (3LA^2 (1 + 2γ)),  where  γ = sqrt(3bLR^2A^2 / (N L̄(M*)))    (3.5)

we have

E[L̄(M̄)] <= 2 L̄(M*) + 6bLA^2R^2 / N    (3.6)

Although the step size in (3.5) requires the knowledge of L̄(M*) that is usually unavailable, as suggested in [67], L̄(M*) can be estimated empirically using part of the training examples. Third, the bound in (3.4) reveals the importance of using a smooth loss function, as the last term in (3.4) is proportional to L that measures the smoothness of the loss function. As a result, using a non-smooth loss function (e.g., the hinge loss) in DML will not be able to benefit from the advantage of the mini-batch strategy. Finally, unlike the analysis in [98] (i.e., Theorem 2) that only considers the case when b = 1, Theorem 1 provides a general result for any mini-batch size b.
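A minimal numpy sketch of one Mini-SGD update (hypothetical helper names, assuming the smooth loss with parameter L from (3.3); this is an illustrative sketch, not the dissertation's Matlab implementation) may help make the roles of A_t, the mini-batch gradient and the projection Π_Ω explicit:

import numpy as np

def triplet_matrix(xi, xj, xk):
    # A_t = (xi - xk)(xi - xk)^T - (xi - xj)(xi - xj)^T
    dk, dj = xi - xk, xi - xj
    return np.outer(dk, dk) - np.outer(dj, dj)

def project_domain(M, R):
    # Proposition 1: project onto the PSD cone, then scale into the Frobenius ball of radius R
    eigvals, eigvecs = np.linalg.eigh(M)
    M_psd = (eigvecs * np.maximum(eigvals, 0.0)) @ eigvecs.T
    norm = np.linalg.norm(M_psd)
    return M_psd if norm <= R else M_psd * (R / norm)

def mini_sgd_step(M, batch, eta, R, L=3.0):
    # One update of Algorithm 1: average the gradient over the mini-batch,
    # take a gradient step, then project back onto the domain Omega.
    grad = np.zeros_like(M)
    for xi, xj, xk in batch:
        A = triplet_matrix(xi, xj, xk)
        z = np.sum(M * A)                                    # <M, A> = dist_M(xi, xk) - dist_M(xi, xj)
        grad += (-1.0 / (1.0 + np.exp(L * (z - 1.0)))) * A   # l'(z) * A
    grad /= len(batch)
    return project_domain(M - eta * grad, R)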
3.1.2 Adaptive Sampling based SGD for DML (AS-SGD)

We now develop a new approach for reducing the number of updates in SGD in order to improve the computational efficiency of DML. Instead of updating the distance metric at each iteration, the proposed strategy introduces a random binary variable to decide if the distance metric M_t will be updated given a triplet constraint (x_i^t, x_j^t, x_k^t). More specifically, it computes the derivative ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t)), and samples a random variable Z_t with probability

Pr(Z_t = 1) = |ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t))|

The distance metric will be updated only when Z_t = 1. According to Proposition 2, we have |ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t))| <= L ℓ(Δ(x_i^t, x_j^t, x_k^t; M_t)) for the smooth loss function given in (3.3), implying that a triplet constraint has a high chance of being used for updating the distance metric if it has a large loss. Therefore, the essential idea of the proposed adaptive sampling strategy is to give a large chance of updating the distance metric when the triplet is difficult to be classified and a low chance when the triplet can be classified correctly with a large margin. We note that an alternative strategy is to sample a triplet constraint (x_i^t, x_j^t, x_k^t) based on its loss ℓ(Δ(x_i^t, x_j^t, x_k^t; M_t)). We did not choose the loss as the basis for updating because it is the derivative, not the loss, that will be used by SGD for updating the distance metric. The detailed steps of adaptive sampling based SGD for DML are given in Algorithm 2. We refer to this algorithm as AS-SGD for short in the rest of this chapter.

Algorithm 2 Adaptive Sampling Stochastic Gradient Descent (AS-SGD) for DML
1: Input: triplet constraints {(x_i^t, x_j^t, x_k^t)}_{t=1}^N, step size η, and domain size R
2: Initialize M_1 = I
3: for t = 1, ..., N do
4:   Sample a binary random variable Z_t with Pr(Z_t = 1) = |ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t))|
5:   if Z_t = 1 then
6:     Update the distance metric by γ_t = sign(ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t))), M_{t+1} <- Π_Ω(M_t - η γ_t A_t)
7:   end if
8: end for
9: return M̄ = (1/N) Σ_{t=1}^N M_t

The theorem below provides the performance guarantee for AS-SGD. It also bounds the number of updates Σ_{t=1}^N Z_t for AS-SGD.

Theorem 2. Let M̄ be the solution output by Algorithm 2 that uses the loss function defined in (3.3). Let M* be the optimal solution that minimizes L̄(M). Assuming ||A_t||_F <= A for any triplet constraint and η < 2/(LA^2), we have

E[L̄(M̄)] <= L̄(M*) / (1 - ηLA^2/2) + R^2 / (2ηN(1 - ηLA^2/2))    (3.7)

and the number of updates bounded by

E[Σ_{t=1}^N Z_t] <= N L L̄(M*) / (1 - ηLA^2/2) + L R^2 / (2η(1 - ηLA^2/2))    (3.8)

where the expectation is taken over both the binary random variables {Z_t}_{t=1}^N and the sequence of triplet constraints.

Remark 2 If we set η as

η = 1 / (2LA^2 (1 + 2γ)),  where  γ = sqrt(LA^2R^2 / (4N L̄(M*)))

we have

E[L̄(M̄)] <= 2 L̄(M*) + LR^2A^2 / N    (3.9)

and

E[Σ_{t=1}^N Z_t] <= 2NL L̄(M*) + L^2A^2R^2    (3.10)

The bounds given in (3.7) and (3.9) share similar structures as those given in (3.4) and (3.6), except that they do not have the mini-batch size b that can be used to make a tradeoff between the number of updates and the classification accuracy. The number of updates performed by Algorithm 2 is bounded by (3.10). The dominant term in (3.10) is O(L̄(M*) N), implying that Algorithm 2 will have a small number of updates if the optimal distance metric only makes a small number of mistakes on the given set of training triplets. In the extreme case when L̄(M*) -> 0, the expected number of updates will be bounded by a constant L^2A^2R^2. We note that this is consistent with our intuition: it will be easy to learn a good distance metric when the optimal one only makes a few mistakes, and as a result, only a few updates are needed to find a distance metric that is consistent with most of the training triplets. Compared with the result of the perceptron method [96], we do not assume that the dataset is separable, which makes our bound for the number of updates more practically useful.
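Again as a rough illustration only (reusing the hypothetical triplet_matrix and project_domain helpers from the Mini-SGD sketch above, and the smooth loss with parameter L), one AS-SGD iteration can be sketched as:

import numpy as np

def as_sgd_step(M, triplet, eta, R, L=3.0, rng=np.random):
    # One iteration of Algorithm 2: update only with probability |l'(<M, A_t>)|,
    # so well-separated triplets rarely trigger the expensive PSD projection.
    xi, xj, xk = triplet
    A = triplet_matrix(xi, xj, xk)
    z = np.sum(M * A)
    g = -1.0 / (1.0 + np.exp(L * (z - 1.0)))   # l'(z), lies in (-1, 0)
    if rng.rand() < abs(g):                    # Pr(Z_t = 1) = |l'(z)|
        M = project_domain(M - eta * np.sign(g) * A, R)   # gamma_t = sign(l'(z))
    return M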
3.1.3 Hybrid Approaches: Combining Mini-batch with Adaptive Sampling for DML

Since mini-batch and adaptive sampling improve the computational efficiency of SGD from different aspects, it is natural to combine them together for more efficient DML. Similar to the Mini-SGD algorithm, the hybrid approaches group multiple triplet constraints into a mini-batch. But, unlike Mini-SGD that updates the distance metric for every mini-batch of constraints, the hybrid approaches follow the idea of adaptive sampling, and introduce a binary random variable to decide if the distance metric will be updated for every mini-batch of constraints. By combining the strengths of mini-batch and adaptive sampling for SGD, the hybrid approaches are able to make a further improvement in the computational efficiency of DML. Algorithm 3 highlights the key steps of the hybrid approaches.

Algorithm 3 A Framework of Hybrid Stochastic Gradient Descent (Hybrid-SGD) for DML
1: Input: triplet constraints {(x_i^t, x_j^t, x_k^t)}_{t=1}^N, step size η, mini-batch size b, and domain size R
2: Initialize M_1 = I and T = N/b
3: for t = 1, ..., T do
4:   Sample b triplets {(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})}_{s=1}^b
5:   Compute the sampling probability ρ_t({(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})}_{s=1}^b; M_t) as in Eqn. (3.11) or (3.12)
6:   Sample a binary random variable Z_t with Pr(Z_t = 1) = ρ_t
7:   if Z_t = 1 then
8:     Update the distance metric by η_t = η/ρ_t, M_{t+1} <- Π_Ω(M_t - η_t ∇ℓ_t(M_t))
9:   end if
10: end for
11: return M̄ = (1/T) Σ_{t=1}^T M_t

One of the key steps in the hybrid approaches (step 5 in Algorithm 3) is to choose an appropriate sampling probability ρ_t for every mini-batch of constraints (x_i^{t,s}, x_j^{t,s}, x_k^{t,s}), s = 1, ..., b. In this work, we study two different choices for the sampling probability ρ_t:

• The first approach chooses ρ_t based on a triplet constraint randomly sampled from the mini-batch. More specifically, given a mini-batch of triplet constraints {(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})}_{s=1}^b, it randomly samples an index s in the range [1, b]. It then sets the sampling probability ρ_t to be the derivative for the randomly sampled triplet, i.e.,

ρ_t = |ℓ'(Δ(x_i^{t,s}, x_j^{t,s}, x_k^{t,s}; M_t))|    (3.11)

We refer to this approach as HR-SGD.

• The second approach is based on an average case analysis. It sets the sampling probability as the average derivative measured by the norm of the gradient ∇ℓ_t(M_t), i.e.,

ρ_t = (1/W) ||∇ℓ_t(M_t)||_F    (3.12)

where W = max_t ||∇ℓ_t(M_t)||_F and is estimated by sampling. We refer to this approach as HA-SGD.
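The two choices of ρ_t can be sketched as follows (a rough, hypothetical illustration reusing the triplet_matrix helper from the Mini-SGD sketch above; the clipping to at most 1 in the second function is an implementation detail added here only to keep ρ_t a valid probability, and W is assumed to be a pre-computed estimate):

import numpy as np

def hr_probability(M, batch, L=3.0, rng=np.random):
    # HR-SGD (Eqn. 3.11): derivative magnitude of one randomly picked triplet
    xi, xj, xk = batch[rng.randint(len(batch))]
    z = np.sum(M * triplet_matrix(xi, xj, xk))
    return abs(-1.0 / (1.0 + np.exp(L * (z - 1.0))))

def ha_probability(M, batch, W, L=3.0):
    # HA-SGD (Eqn. 3.12): norm of the averaged mini-batch gradient, scaled by the estimate W
    grad = np.zeros_like(M)
    for xi, xj, xk in batch:
        A = triplet_matrix(xi, xj, xk)
        z = np.sum(M * A)
        grad += (-1.0 / (1.0 + np.exp(L * (z - 1.0)))) * A
    grad /= len(batch)
    return min(1.0, np.linalg.norm(grad) / W)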
3.2 Experiments

Ten datasets are used to validate the effectiveness of the proposed algorithms. Table 3.1 summarizes the information of these datasets. Datasets dna, letter [58], protein and sensit [35] are downloaded from LIBSVM [22]. Datasets tdt30 and rcv20 are document corpora: tdt30 is the subset of the tdt2 data [14] comprised of the documents from the 30 most popular categories, and rcv20 is the subset of the larger rcv1 dataset [5] consisting of documents from the 20 most popular categories. Following [24], we reduce the dimensionality of these document datasets to 200 by principal component analysis (PCA). All the other datasets are downloaded directly from the UCI repository [40]. For all the datasets used in this study, we use the standard training/testing split provided by the original dataset, except for datasets semeion, connect4 and tdt30, where 70% of the data is randomly selected for training and the remaining 30% is used for testing. All the experiments are repeated five times, and both the average results and their standard deviation are reported. All the experiments are run on a laptop with 8GB memory and two 2.50GHz Intel Core i5-2520M CPUs.

Table 3.1: Statistics for the ten datasets used in our empirical study.

          #class  #feature  #train     #test
semeion   10      256       1,115      478
dna       3       180       2,000      1,186
isolet    26      617       6,238      1,559
tdt30     30      200       6,575      2,819
letter    26      16        15,000     5,000
protein   3       357       17,766     6,621
connect4  3       42        47,289     20,268
sensit    3       100       78,823     19,705
rcv20     20      200       477,141    14,185
poker     10      10        1,000,000  25,010

3.2.1 Parameter Setting

The parameter L in the loss function (3.3) is set to be 3 according to the suggestion in [122]. The number of triplet constraints N is set to be 100,000 for all the datasets except for the two small datasets semeion and dna, where N = 20n. To construct triplet constraints, we follow the active sampling strategy given in [114]: at each iteration t, we first randomly pick a training example x_i^t, and then x_j^t from the 3 positive nearest neighbors(2) of x_i^t; we then randomly select a triplet constraint from the set of active constraints(3) involving x_i^t and x_j^t. We note that this is different from [24], where triplet constraints are selected completely randomly. Our empirical study shows that the active sampling strategy is more effective than choosing triplet constraints completely randomly. This is because the actively sampled constraints are more informative for the learned distance metric than a completely random choice. Furthermore, to verify that the choice of N does not lead to overfitting of the training data, particularly for the two small datasets, in Fig. 3.1 we show the training and test errors for dataset dna. It is clear that both errors decline over epochs, suggesting that no overfitting is found even for the small dataset.

For Mini-SGD and the hybrid approaches, we set b = 10 for the size of the mini-batch as in [98], leading to a total of T = 10,000 iterations for these approaches. We evaluate the learned distance metric by the classification error of a k-NN classifier on the test data, where the number of nearest neighbors k is set to be 3 based on our experience.

Parameter R in the proposed algorithms determines the domain size for the distance metric to be learned. We observe that the classification error of k-NN remains almost unchanged when varying R in the range of {100, 1000, 10000}. We thus set R = 1,000 for all the experiments. Another important parameter used by the proposed algorithms is the step size η. We evaluate the impact of the step size by measuring the classification error of a k-NN algorithm that uses the distance metric learned by the Mini-SGD algorithm with η = {0.1, 1, 10}. We observe that η = 1 yields a low classification error for almost all datasets. We thus fix η = 1 for the proposed algorithms in all the experiments.

(2) x_j is a positive nearest neighbor of x_i if x_j and x_i share the same class assignment.
(3) A constraint is active if its hinge loss based on the Euclidean distance is non-zero.

Figure 3.1: The training and testing errors over epochs for dataset dna.

3.2.2 Experiment (I): Effectiveness of the Proposed SGD Algorithms for DML

In this experiment, we compare the performance of the proposed SGD algorithms for DML, i.e., Mini-SGD, AS-SGD and the two hybrid approaches (HR-SGD and HA-SGD), to the full version of SGD for DML (SGD). We also include the Euclidean distance as the reference method in our comparison. Table 3.2 shows the classification error of k-NN (k = 3) using the proposed DML algorithms and the baseline algorithms, respectively. First, it is not surprising to observe that all the distance metric learning algorithms improve the classification performance of k-NN compared to the Euclidean distance. Second, for almost all datasets, we observe that all the proposed DML algorithms (i.e., Mini-SGD, AS-SGD, HR-SGD, and HA-SGD) yield similar classification performance as SGD, the full version of the SGD algorithm for DML. This result confirms that the proposed SGD algorithms are effective for DML despite the modifications we made to the SGD algorithm.

Table 3.2: Classification error (%) of k-NN (k=3) using the distance metrics learned by the proposed SGD methods for DML. Standard deviation computed from five trials is included in the parenthesis.
          Euclid   SGD           Mini-SGD      AS-SGD        HR-SGD        HA-SGD
semeion   8.79     4.39 (0.30)   4.60 (0.53)   4.23 (0.60)   4.27 (0.41)   4.18 (0.26)
dna       20.71    6.90 (0.16)   6.64 (0.33)   7.15 (0.42)   6.80 (0.21)   6.86 (0.15)
isolet    8.98     5.98 (0.15)   4.23 (0.19)   6.09 (0.13)   4.59 (0.30)   4.77 (0.17)
tdt30     5.96     4.51 (0.07)   3.52 (0.08)   4.53 (0.06)   3.70 (0.20)   3.65 (0.09)
letter    4.42     2.26 (0.09)   2.54 (0.06)   2.25 (0.10)   2.31 (0.08)   2.23 (0.07)
protein   49.95    39.46 (0.42)  38.16 (0.24)  39.49 (0.51)  40.76 (0.20)  40.03 (0.30)
connect4  29.48    20.16 (0.08)  20.20 (0.08)  20.22 (0.12)  21.45 (0.71)  20.41 (0.14)
sensit    27.28    23.62 (0.04)  22.95 (0.07)  23.70 (0.06)  23.39 (0.20)  23.33 (0.18)
rcv20     9.13     7.76 (0.16)   8.42 (0.04)   7.74 (0.11)   8.40 (0.04)   8.37 (0.02)
poker     37.98    35.89 (0.06)  35.22 (0.18)  35.87 (0.08)  35.74 (0.41)  35.66 (0.16)

3.2.3 Experiment (II): Efficiency of the Proposed SGD Algorithms for DML

Fig. 3.2 summarizes the running time for the proposed DML algorithms and the baseline SGD algorithm. We note that the running times in Fig. 3.2 do not take into account the time for constructing triplet constraints since it is shared by all the methods in comparison. It is not surprising to observe that all the proposed SGD algorithms, including Mini-SGD, AS-SGD, HA-SGD and HR-SGD, significantly reduce the running time of SGD. For instance, for dataset isolet, it takes SGD more than 35,000 seconds to learn a distance metric, while the running time is reduced to less than 4,000 seconds when applying the proposed SGD algorithms, roughly a factor of 10 reduction in running time. Comparing the running time of AS-SGD to that of Mini-SGD, we observe that each method has its own advantage: AS-SGD is more efficient on dataset semeion while Mini-SGD is more efficient on the other datasets. This is because different mechanisms are employed by AS-SGD and Mini-SGD to reduce the computational cost: AS-SGD improves the computational efficiency of DML by skipping the constraints that are easy to be classified, while Mini-SGD improves the computational efficiency of SGD by performing the update of the distance metric once for multiple triplet constraints. Finally, we observe that the two hybrid approaches, which combine the strengths of both adaptive sampling and mini-batch SGD, are computationally most efficient for almost all datasets. We also observe that HR-SGD appears to be more efficient than HA-SGD on six datasets and only loses its edge on datasets protein and rcv20. This is because HR-SGD computes the sampling probability ρ_t based on one randomly sampled triplet while HA-SGD needs to compute the average derivative for each mini-batch of triplet constraints for the sampling probability.

Figure 3.2: The comparison of running time (seconds) for various SGD methods on (a) semeion, (b) dna, (c) isolet, (d) tdt30, (e) letter, (f) protein, (g) connect4, (h) sensit, (i) rcv20 and (j) poker. Note that LMNN, a batch DML algorithm, is mainly implemented in C, which is computationally more efficient than our Matlab implementation. All the other methods are implemented in Matlab.

To further examine the computational efficiency of the proposed SGD algorithms for DML, we summarize in Fig. 3.3 the number of updates performed by the proposed SGD algorithms and the baseline SGD algorithm, respectively. We observe that all the proposed SGD algorithms for DML are able to reduce the number of updates significantly compared to SGD. Comparing Mini-SGD to AS-SGD, we observe that for semeion, the number of updates performed by AS-SGD is significantly less than Mini-SGD, while it is the other way around for the other datasets. This is again due to the fact that AS-SGD and Mini-SGD deploy different mechanisms for reducing computational costs. As we expect, the two hybrid approaches are able to further reduce the number of updates performed by AS-SGD and Mini-SGD, making them more efficient algorithms for DML.

By comparing the running time in Fig. 3.2 to the number of updates in Fig. 3.3, we observe that a small number of updates does NOT always guarantee a short running time.
This is exhibited by the comparison between the two hybrid approaches: although HA-SGD performs a similar number of updates as HR-SGD on datasets semeion and dna, it takes HA-SGD significantly longer to finish the computation than HR-SGD. This is also exhibited by comparing the results across different datasets for a fixed method. For example, for the HA-SGD method, the number of updates for the protein dataset is nearly the same as that for the poker dataset, but the running time for the protein dataset is about 100 times longer than that for the poker dataset. This result may sound counterintuitive at first glance. A more careful analysis, however, reveals that in addition to the number of updates, the running time of DML is also affected by the computational cost per iteration, which explains the consistency between Fig. 3.2 and Fig. 3.3. In the case of comparing the two hybrid approaches, we observe that HA-SGD is subject to a higher computational cost per iteration than HR-SGD because HA-SGD has to compute the norm of the average gradient over each mini-batch, while HR-SGD only needs to compute the derivative of one randomly sampled triplet constraint for each mini-batch. In the case of comparing the running time across different datasets, the protein dataset has a significantly higher dimensionality than the poker dataset and is therefore subject to a higher computational cost per iteration, because the computational cost of projecting an updated distance metric onto the PSD cone increases at least quadratically in the dimensionality.

3.2.4 Experiment (III): Comparison with State-of-the-art Online DML Methods

We compare the proposed SGD algorithms to three state-of-the-art online algorithms and one batch method for DML:

• SPML [98]: an online learning algorithm for DML that is based on mini-batch SGD and the hinge loss,
• OASIS [24]: a state-of-the-art online DML algorithm; the symmetric version with only one projection is applied,
• LEGO [66]: an online version of the information theoretic based DML algorithm [32],
• POLA [96]: a Perceptron based online DML algorithm.

Finally, for sanity checking, we also compare the proposed SGD algorithms to LMNN [114], a state-of-the-art batch learning algorithm for DML.

Both SPML and OASIS use the same set of triplet constraints to learn a distance metric as the proposed SGD algorithms. Since LEGO and POLA are designed for pairwise constraints, for fair comparison we generate pairwise constraints for LEGO and POLA by splitting each triplet constraint (x_i^t, x_j^t, x_k^t) into two pairwise constraints: a must-link constraint (x_i^t, x_j^t) and a cannot-link constraint (x_i^t, x_k^t). This splitting operation results in a total of 2N pairwise constraints for LEGO and POLA. Finally, we note that since LMNN is a batch learning method, it is allowed to utilize any triplet constraint derived from the data, and is not restricted to the set of triplet constraints we generate for the SGD methods. All the baseline DML algorithms are implemented using the codes from the original authors except for SPML, for which we made appropriate changes to the original code in order to avoid large matrix multiplication and improve the computational efficiency. SPML, OASIS, LEGO and POLA are implemented in Matlab, while the core parts of LMNN are implemented in C, which is usually deemed to be more efficient than Matlab. The default parameters suggested by the original authors are used in the baseline algorithms. The step size of LEGO is set to 1, as it was observed in [24] that the prediction performance of LEGO is in general insensitive to the step size. In all experiments, all the baseline methods set the initial solution for the distance metric to be an identity matrix.

Table 3.3: Classification error (%) of k-NN (k = 3) using the distance metrics learned by the baseline SGD method, online learning algorithms and the batch learning approach for DML. Standard deviation computed from five trials is included in the parenthesis.
            Baseline   Batch          Online Learning
            SGD        LMNN           POLA           LEGO           OASIS          SPML
semeion     4.18       7.11 (0.39)    19.25 (1.95)   12.89 (1.84)   6.74 (0.34)    4.81 (0.59)
dna         6.64       4.89 (0.29)    7.32 (0.55)    7.39 (0.55)    11.75 (0.43)   6.78 (0.58)
isolet      4.23       4.11 (0.08)    5.18 (0.38)    18.08 (6.98)   4.37 (0.26)    4.36 (0.18)
tdt30       3.52       2.80 (0.00)    5.93 (0.38)    21.11 (3.68)   3.92 (0.08)    3.47 (0.13)
letter      2.23       3.20 (0.00)    3.10 (0.22)    5.24 (0.45)    3.92 (0.05)    3.98 (0.53)
protein     38.16      39.86 (0.16)   38.38 (0.56)   42.60 (1.13)   37.83 (0.23)   40.12 (0.53)
connect4    20.16      21.60 (0.26)   25.67 (0.85)   26.06 (1.30)   22.37 (0.63)   24.60 (0.70)
sensit      22.95      24.45 (0.02)   27.51 (0.39)   26.50 (1.37)   22.12 (0.24)   23.48 (0.25)
rcv20       7.74       N/A            7.96 (0.08)    8.49 (0.18)    8.08 (0.06)    8.61 (0.12)
poker       35.22      N/A            41.26 (1.70)   40.58 (1.23)   45.12 (2.14)   39.42 (0.71)

Table 3.3 summarizes the classification results of k-NN (k = 3) using the distance metrics learned by the proposed method and by the baseline algorithms, respectively. SGD denotes the best result of the proposed methods in Table 3.2. First, we observe that LEGO and POLA perform significantly worse than the proposed DML algorithms for four datasets, including semeion, connect4, sensit and poker. LEGO also performs poorly on isolet and tdt30. This can be explained by the fact that LEGO and POLA use pairwise constraints for DML, while the other methods in comparison use triplet constraints. According to [24, 98, 114], triplet constraints are in general more effective than pairwise constraints. Second, although both SPML and Mini-SGD are based on the mini-batch strategy, SPML performs significantly worse than Mini-SGD on three datasets, i.e., protein, connect4, and poker. The performance difference between SPML and Mini-SGD can be explained by the fact that Mini-SGD uses a smooth loss function while a hinge loss is used by SPML. According to our analysis and the analysis in [28], using a smooth loss function is critical for the success of the mini-batch strategy. Third, OASIS yields similar performance as the proposed algorithms for almost all datasets except for datasets semeion, dna and poker, for which OASIS performs significantly worse. Overall, we conclude that the proposed DML algorithms yield similar, if not better, performance as the state-of-the-art online learning algorithms for DML.

Compared to LMNN, a state-of-the-art batch learning algorithm for DML, we observe that the proposed SGD algorithms yield similar performance on four datasets. They however perform significantly better than LMNN on datasets semeion and letter, and significantly worse on datasets dna and tdt30. We attribute the difference in classification error to the fact that the proposed DML algorithms are restricted to 100,000 randomly sampled triplet constraints, while LMNN is allowed to use all the triplet constraints that can be derived from the data. The restriction in triplet constraints could sometimes limit the classification performance but at other times helps avoid the overfitting problem. We also observe that LMNN is unable to run on the two large datasets rcv20 and poker, indicating that LMNN does not scale well with the size of the datasets.

Table 3.4: The comparison of running time (seconds) for OASIS and the hybrid methods. Average results over five trials are reported.
Methods    semeion   dna    isolet   tdt30   letter
OASIS      4.4       5.5    156.6    10.4    2.3
HR-SGD     3.6       5.1    363.2    21.7    1.4
HA-SGD     7.6       8.1    597.9    28.4    1.1

Methods    protein   connect4   sensit   rcv20   poker
OASIS      161.4     4.5        19.0     46.5    1.5
HR-SGD     275.7     3.0        23.5     139.2   0.9
HA-SGD     164.6     2.5        15.3     65.6    1.2

The running times for the proposed algorithms and the baseline algorithms are summarized in Fig. 3.2. The number of updates for both groups of algorithms is provided in Fig. 3.3. It is not surprising to observe that the two online DML algorithms (SPML, OASIS) are significantly more efficient than SGD in terms of both running time and the number of updates. We also observe that Mini-SGD and SPML share the same number of updates and similar running time for all datasets because they use the same mini-batch strategy. Furthermore, compared to the three online DML algorithms (SPML, LEGO and POLA), the two hybrid approaches are significantly more efficient in both running time and the number of updates. Table 3.4 compares the detailed running time of OASIS and the hybrid methods. We also observe that the hybrid methods are more efficient than OASIS on six datasets (i.e., semeion, dna, letter, connect4, sensit and poker), even though OASIS only performs the projection once at the end of the program. We attribute the efficiency of the hybrid approaches to the reduced number of updates they have to perform on the learned metric. Finally, since LMNN is implemented in C, it is not surprising to observe that LMNN shares similar running time as the other online DML algorithms for relatively small datasets. It is however significantly less efficient than the online learning algorithms for datasets of modest size (e.g., letter, connect4 and sensit), and becomes computationally infeasible for the two large datasets rcv20 and poker. Overall, we observe that the two hybrid approaches are significantly more efficient than the other DML algorithms in comparison.

Figure 3.3: The comparison of the number of updates for various SGD methods on the ten datasets (panels (a)-(j): semeion, dna, isolet, tdt30, letter, protein, connect4, sensit, rcv20, poker). Note that since POLA and LEGO optimize pairwise constraints, we decompose each triplet constraint into two pairwise constraints for these two methods. As a result, the number of constraints is doubled for these two methods.

3.3 Conclusions

In this chapter, we propose two strategies to improve the computational efficiency of SGD for DML, i.e., mini-batch and adaptive sampling. The key idea of mini-batch is to group multiple triplet constraints into a mini-batch and only update the distance metric once for each mini-batch; the key idea of adaptive sampling is to perform stochastic updating by giving a difficult triplet constraint more chance to be used for updating the distance metric than an easy triplet constraint. We develop theoretical guarantees for both strategies. We also develop two variants of hybrid approaches that combine mini-batch with adaptive sampling for more efficient DML. Our empirical study confirms that the proposed algorithms yield similar, if not better, prediction performance as the state-of-the-art online learning algorithms for DML but with significantly less running time. In the future, we could analyze the theoretical guarantees for the hybrid methods.

Chapter 4
Distance Metric Learning for High Dimensional Data

In this chapter, we will address the challenge arising from the large number of variables (O(d²)). We follow the one-projection paradigm as in Section 2.1 and focus on optimizing O(d²) variables. Nowadays, high dimensional representations become more and more popular in various applications, especially for images and videos, as the examples in Table 4.1 show. In contrast, DML is only applicable to low dimensional data (e.g., some hundreds of features) due to the involvement of d² variables. Some studies try to alleviate the issue as described in Section 2.4, but all of them have to hold some strong assumptions.

Table 4.1: Examples of applications with high dimensional features.
Applications                     #Features
Face Verification [25]           100,000
Image Classification [90]        250,000
Video Event Detection [87]       500,000

The straightforward way to solve the high dimensional DML problem is to reduce the dimensionality of the data using dimensionality reduction methods such as principal component analysis (PCA) [114] or random projection (RP) [103]. Although RP is computationally more efficient than PCA, it often yields significantly worse performance than PCA unless the number of random projections is sufficiently large [39]. We note that RP has been successfully applied to many machine learning tasks, e.g., classification [91], clustering [11] and regression [81]; however, only a few studies examined the application of RP to DML, and most of them with limited success.

In this chapter, we propose a dual random projection approach for high dimensional DML. Our approach, on one hand, enjoys the light computation of random projection, and on the other hand, significantly improves the effectiveness of random projection. The main limitation of using random projection for DML is that all the columns/rows of the learned metric will lie in the subspace spanned by the random vectors. We address this limitation of random projection by

• first estimating the dual variables based on the randomly projected vectors and,
• then reconstructing the distance metric using the estimated dual variables and the data vectors in the original space.

Since the final distance metric is computed using the original vectors, not the randomly projected vectors, the column/row space of the learned metric will NOT be restricted to the subspace spanned by the random projection, thus alleviating the limitation of random projection. We verify the effectiveness of the proposed algorithms both empirically and theoretically.

We finally note that our work is built upon the recent work [120] on random projection, where a dual random projection algorithm is developed for linear classification. Our work differs from [120] in that we apply the theory of dual random projection to DML. More importantly, we have made important progress in advancing the theory of dual random projection. Unlike the theory in [120], where the data matrix is assumed to be low rank or approximately low rank, our new theory of dual random projection is applicable to any data matrix, even when it is NOT approximately low rank. This new analysis significantly broadens the application domains where dual random projection is applicable, which is further verified by our empirical study.

The rest of the chapter is organized as follows: Section 4.1 describes the proposed dual random projection approach for DML and the detailed algorithm for solving the dual problem in the subspace spanned by the random projection. Section 4.2 summarizes the results of the empirical study, and Section 4.3 concludes this work.

4.1 Dual Random Projection for Distance Metric Learning

Let X = (x_1, ..., x_n) ∈ R^{d×n} denote the collection of training examples. Given a PSD matrix M, the distance between two examples x_i and x_j is given as

dist_M(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j).

The proposed framework for DML will be based on triplet constraints, not pairwise constraints. This is because several previous studies have suggested that triplet constraints are more effective than pairwise constraints [114, 24, 98]. Let D = {(x_i^1, x_j^1, x_k^1), ..., (x_i^N, x_j^N, x_k^N)} be the set of triplet constraints used for training, where x_i^t is expected to be more similar to x_j^t than to x_k^t. Our goal is to learn a metric function M that is consistent with most of the triplet constraints in D, i.e.
∀ (x_i^t, x_j^t, x_k^t):   (x_i^t − x_j^t)^T M (x_i^t − x_j^t) + 1 ≤ (x_i^t − x_k^t)^T M (x_i^t − x_k^t)

Following the empirical risk minimization framework, we cast the triplet constraint based DML into the following optimization problem:

min_{M ∈ S^d}   (λ/2) ||M||_F² + (1/N) Σ_{t=1}^N ℓ(⟨M, A_t⟩)        (4.1)

where S^d stands for the set of symmetric matrices of size d×d, λ ≥ 0 is the regularization parameter, ℓ(·) is a convex loss function, A_t = (x_i^t − x_k^t)(x_i^t − x_k^t)^T − (x_i^t − x_j^t)(x_i^t − x_j^t)^T, and ⟨·,·⟩ stands for the dot product between two matrices. We note that we did not enforce M in (4.1) to be PSD because we follow the one-projection paradigm proposed in [24], which first learns a symmetric matrix M by solving the optimization problem in (4.1) and then projects the learned matrix M onto the PSD cone. We emphasize that unlike [120], we did not assume ℓ(·) to be smooth, making it possible to apply the proposed approach to the hinge loss.

Let ℓ*(·) be the convex conjugate of ℓ(·). The dual problem of (4.1) is given by

max_{α ∈ R^N}   −(1/N) Σ_{t=1}^N ℓ*(α_t) − (1/(2λ)) || (1/N) Σ_{t=1}^N α_t A_t ||_F²

which is equivalent to

max_{α ∈ [−1,0]^N}   −Σ_{t=1}^N ℓ*(α_t) − (1/(2λN)) α^T G α        (4.2)

where α = (α_1, ..., α_N) and G = [G_{a,b}]_{N×N} is an N×N matrix with G_{a,b} = ⟨A_a, A_b⟩. We denote by M_* ∈ R^{d×d} the optimal primal solution to (4.1), and by α_* ∈ R^N the optimal dual solution to (4.2). Using the first order condition for optimality, we have

M_* = −(1/(λN)) Σ_{t=1}^N α_{*,t} A_t        (4.3)

4.1.1 Dual Random Projection for Distance Metric Learning

Directly solving the primal problem in (4.1) or the dual problem in (4.2) could be computationally expensive when the data is of high dimension and the number of training triplets is very large. We address this challenge by introducing a random matrix R ∈ R^{d×m}, where m ≪ d and R_{i,j} ~ N(0, 1/m), and projecting all the data points into the low dimensional space using the random matrix, i.e., x̂_i = R^T x_i. As a result, A_t, after random projection, becomes Â_t = R^T A_t R.

A typical approach of using random projection for DML is to obtain a matrix M_s of size m×m by solving the primal problem with the randomly projected vectors {x̂_i}_{i=1}^n, i.e.

min_{M ∈ S^m}   (λ/2) ||M||_F² + (1/N) Σ_{t=1}^N ℓ(⟨M, Â_t⟩)        (4.4)

Given the learned metric M_s, for any two data points x and x', their distance is measured by (x − x')^T R M_s R^T (x − x') = (x − x')^T M̂ (x − x'), where M̂ = R M_s R^T ∈ R^{d×d} is the effective metric in the original space R^d. The key limitation of this random projection approach is that both the column and the row space of M̂ are restricted to the subspace spanned by the vectors in the random matrix R.

Instead of solving the primal problem, we propose to solve the dual problem using the randomly projected data points {x̂_i}_{i=1}^n, i.e.

max_{α ∈ R^N}   −Σ_{t=1}^N ℓ*(α_t) − (1/(2λN)) α^T Ĝ α        (4.5)

where Ĝ_{a,b} = ⟨R^T A_a R, R^T A_b R⟩. After obtaining the optimal solution α̂ for (4.5), we reconstruct the metric by using the dual variables α̂ and the data matrix X in the original space, i.e.

M̂ = −(1/(λN)) Σ_{t=1}^N α̂_t A_t        (4.6)

It is important to note that unlike the random projection approach, the recovered metric M̂ in (4.6) is not restricted to the subspace spanned by the random vectors, a key to the success of the proposed algorithm.

Alg. 4 summarizes the key steps of the proposed dual random projection method for DML. Following the one-projection paradigm [24], we project the learned symmetric matrix M̂ onto the PSD cone at the end of the algorithm. The key component of Alg. 4 is to solve the optimization problem in (4.5) at Step 4 accurately. We choose the stochastic dual coordinate ascent (SDCA) method for solving the dual problem (4.5) because it enjoys a linear convergence rate when the loss function is smooth, and is shown empirically to be significantly faster than other stochastic optimization methods [97]. We use the combination strategy recommended in [97], denoted by CSDCA, which uses SGD for the first epoch and then applies SDCA for the remaining epochs.
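To make the pipeline concrete before stating it formally in Alg. 4 below, the following Python sketch traces the same steps under simplifying assumptions: it uses the hinge loss (so each dual variable lies in [−1, 0]), replaces the CSDCA solver with plain projected gradient ascent on (4.5), and builds the full N×N Gram matrix of the projected constraints, which is only practical for a small number of triplets. It is an illustration only, not the implementation used in this dissertation.

import numpy as np

def durp_sketch(X, triplets, lam=1e-3, m=10, iters=200, step=1e-3, seed=None):
    """Minimal sketch of dual random projection for DML with the hinge loss.

    X        : d x n data matrix (each column is one example)
    triplets : list of (i, j, k) index triples; x_i should be closer to x_j than to x_k
    lam      : regularization parameter (lambda in (4.1))
    m        : number of random projections
    """
    rng = np.random.default_rng(seed)
    d, _ = X.shape
    N = len(triplets)

    # Steps 2-3 of Alg. 4: random projection of the data, R_ij ~ N(0, 1/m).
    R = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))
    Xh = R.T @ X                                    # m x n projected examples

    # Projected triplet matrices A_hat_t = R^T A_t R, kept as small m x m arrays.
    A_hat = []
    for (i, j, k) in triplets:
        dik, dij = Xh[:, i] - Xh[:, k], Xh[:, i] - Xh[:, j]
        A_hat.append(np.outer(dik, dik) - np.outer(dij, dij))

    # Gram matrix G_hat of the projected constraints (O(N^2 m^2), illustration only).
    G = np.array([[np.sum(Aa * Ab) for Ab in A_hat] for Aa in A_hat])

    # Step 4: solve the dual (4.5); for the hinge loss, ell*(a) = a on [-1, 0],
    # so the objective is -sum(alpha) - alpha^T G alpha / (2 lam N).
    alpha = np.zeros(N)
    for _ in range(iters):
        grad = -1.0 - G @ alpha / (lam * N)
        alpha = np.clip(alpha + step * grad, -1.0, 0.0)

    # Step 5: recover the metric in the ORIGINAL d-dimensional space via (4.6).
    M = np.zeros((d, d))
    for a, (i, j, k) in zip(alpha, triplets):
        dik, dij = X[:, i] - X[:, k], X[:, i] - X[:, j]
        M -= a / (lam * N) * (np.outer(dik, dik) - np.outer(dij, dij))

    # Step 6: a single projection onto the PSD cone at the very end.
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.maximum(w, 0.0)) @ V.T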
Algorithm 4: Dual Random Projection Method (DuRP) for DML
1: Input: the triplet constraints D and the number of random projections m
2: Generate a random matrix R ∈ R^{d×m} with R_{i,j} ~ N(0, 1/m)
3: Project each example as x̂ = R^T x
4: Solve the optimization problem (4.5) and obtain the optimal solution α̂
5: Recover the solution in the original space by M̂ = −(1/(λN)) Σ_t α̂_t A_t
6: Output: P_PSD(M̂)

4.1.2 Main Theoretical Results

First, similar to [120], we consider the case when the data matrix X is of low rank. The theorem below shows that, under the low rank assumption, with a high probability the distance metric recovered by Algorithm 4 is nearly optimal.

Theorem 3. Let M_* be the optimal solution to (4.1). Let α̂ be the optimal solution for (4.5), and let M̂ be the solution recovered from α̂ using (4.6). Under the assumption that all the data points lie in a subspace of dimension r, for any 0 < ε ≤ 1/6, with a probability at least 1 − δ, we have

||P_PSD(M̂) − P_PSD(M_*)||_F  ≤  (3ε / (1 − 3ε)) ||M_*||_F

provided m ≥ (r + 1) log(2r/δ) / (c ε²), where the constant c is at least 1/3.

The proof of Theorem 3 can be found in the appendix. Theorem 3 indicates that if the number of random projections is sufficiently large (i.e., m ≥ Ω(r log r)), we can recover the optimal solution in the original space with a small error. It is important to note that our analysis, unlike [120], can be applied to non-smooth losses such as the hinge loss.

In the second case, we assume the loss function ℓ(·) is γ-smooth (i.e., |ℓ'(z) − ℓ'(z')| ≤ γ|z − z'|). The theorem below shows that the dual variables obtained by solving the optimization problem in (4.5) can be close to the optimal dual variables, even when the data matrix X is NOT low rank or approximately low rank. For the presentation of the theorem, we first define a few important quantities. Define matrices M^1, M^2, M^3, M^4 ∈ R^{N×N} as

M^1_{a,b} = ||x_i^a − x_k^a||_2² ||x_i^b − x_k^b||_2²
M^2_{a,b} = ||x_i^a − x_j^a||_2² ||x_i^b − x_j^b||_2²
M^3_{a,b} = ||x_i^a − x_k^a||_2² ||x_i^b − x_j^b||_2²
M^4_{a,b} = ||x_i^a − x_j^a||_2² ||x_i^b − x_k^b||_2²

and define Λ as the maximum of the spectral norms of the four matrices, i.e.

Λ = max{ ||M^1||_2, ||M^2||_2, ||M^3||_2, ||M^4||_2 }        (4.7)

where ||·||_2 stands for the spectral norm of a matrix.

Theorem 4. Assume ℓ(z) is γ-smooth. Let α_* be the optimal solution to the dual problem in (4.2), and let α̂ be an approximately optimal solution for (4.5) with suboptimality ε. Then, with a probability at least 1 − δ, we have

||α̂ − α_*||_2  ≤  max{ 8Λ/λ², 2 } √(γε)

where Λ is defined in (4.7), provided m ≥ (8Λ²/λ²) ln(8N/δ).

The proof of Theorem 4 can be found in the appendix. Unlike Theorem 3, where the data matrix X is assumed to be low rank, Theorem 4 holds without any prior assumption about the data matrix. It shows that despite the random projection, the dual solution can be recovered approximately using the randomly projected vectors, provided that the number of random projections m is sufficiently large, Λ is small, and the approximately optimal solution α̂ is sufficiently accurate. In the case when most of the training examples are not linearly dependent, we could have Λ ≤ O(N/d), which could be a modest number when d is very large. The result in Theorem 4 essentially justifies the key idea of our approach, i.e., computing the dual variables first and recovering the distance metric later. Finally, since ||α̂ − α_*||_2, the approximation error in the recovered dual variables, is proportional to the square root of the suboptimality ε, an accurate solution for (4.5) is needed to ensure a small approximation error. We note that given Theorem 4, it is straightforward to bound ||M̂ − M_*||_2 using the relationship between the dual variables and the primal variables in (4.3).

4.2 Experiments

We will first describe the experimental setting, and then present our empirical study for ranking and classification tasks on various datasets.

4.2.1 Experimental Setting

Datasets. Eight datasets are used to validate the effectiveness of the proposed algorithm for DML. Table 4.2 summarizes the information of these datasets. imagenet5 consists of 5 randomly selected classes from ImageNet [92], while cats5 contains 5 cat species from the same archive. caltech30 is a subset of the Caltech256 image dataset [48] and we use the version pre-processed by [24].
tdt30 is a subset of the tdt2 dataset [14]. Both caltech30 and tdt30 are comprised of the examples from the 30 most popular categories. All the other datasets are downloaded from LIBSVM [22], where rcv30 is a subset of the original dataset consisting of documents from the 30 most popular categories. The datasets tdt30, 20news and rcv30 are comprised of documents represented by vectors of 50,000 dimensions. Since it is expensive to compute and maintain a matrix of 50,000 × 50,000, for these three datasets we follow the procedure in [24] that maps all documents to a space of 1,000 dimensions. More specifically, we first keep the top 20,000 most popular words for each collection, and then reduce their dimensionality to 1,000 by using PCA. We emphasize that for several datasets in our testbeds, the data matrices cannot be well approximated by low rank matrices. Fig. 4.1 summarizes the eigenvalue distribution of the eight datasets used in our experiment. We observe that four of these datasets (i.e., caltech30, tdt30, 20news, rcv30) have a flat eigenvalue distribution, indicating that the associated data matrices cannot be well approximated by a low rank matrix. This justifies the importance of removing the low rank assumption from the theory of dual random projection, an important contribution of this work.

Table 4.2: Statistics for the datasets used in our empirical study. #C is the number of classes. #F is the number of original features. #Train and #Test represent the number of training data and test data, respectively.

            #C    #F      #Train     #Test
protein     3     357     17,766     6,621
isolet      26    617     6,238      1,559
imagenet5   5     1,000   4,582      1,964
cats5       5     1,000   5,128      2,199
caltech30   30    1,000   5,502      2,355
tdt30       30    1,000   6,575      2,819
20news      20    1,000   15,935     3,993
rcv30       30    1,000   507,585    15,195

Figure 4.1: The eigenvalue distribution of the datasets used in our empirical study (panels (a)-(h): protein, isolet, imagenet5, cats5, caltech30, tdt30, 20news, rcv30; each panel plots the eigenvalues against their index).

For the datasets used in this study, we use the standard training/testing split provided by the original datasets, except for datasets imagenet5, cats5, tdt30, caltech30 and rcv30. For imagenet5, cats5, tdt30 and caltech30, we randomly select 70% of the data for training and use the remaining 30% for testing; for rcv30, we switch the training and test sets defined by the original package to ensure that the number of training examples is sufficiently large.

Evaluation metrics. To measure the quality of learned distance metrics, two types of evaluations are adopted in our study. First, we follow the evaluation protocol in [24] and evaluate the learned metric by its ranking performance. More specifically, we treat each test instance q as a query, and rank the other test instances in ascending order of their distance to q using the learned metric. The mean average precision (mAP) given below is used to evaluate the quality of the ranking list

mAP = (1/|Q|) Σ_{i=1}^{|Q|} (1/r_i) Σ_{j=1}^{r_i} P(x_{ij})

where |Q| is the size of the query set, r_i is the number of relevant instances for the i-th query, and P(x_{ij}) is the precision for the first j ranked instances when the instance ranked at the j-th position is relevant to the query q. Here, an instance x is relevant to a query q if they belong to the same class. Second, we evaluate the learned metric by its classification performance with a k-nearest neighbor classifier. More specifically, for each test instance q, we apply the learned metric to find the k training examples with the shortest distance, and predict the class assignment for q by taking the majority vote among the k nearest neighbors. Finally, we also evaluate the computational efficiency of the proposed algorithm for DML by its running time.
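As a concrete reference for the ranking protocol above, the following Python sketch computes mAP exactly as in the formula, given a dense learned metric M; it is an illustration only (the function and variable names are ours) and is not optimized for large test sets.

import numpy as np

def mean_average_precision(X_test, y_test, M):
    """mAP of the ranking induced by dist_M, following the formula above.

    X_test : d x n matrix of test instances (each column is one instance)
    y_test : length-n array of class labels
    M      : d x d learned metric
    """
    n = X_test.shape[1]
    ap_sum = 0.0
    for q in range(n):
        diff = X_test - X_test[:, [q]]
        dist = np.einsum('dn,de,en->n', diff, M, diff)   # Mahalanobis distance to every instance
        order = np.argsort(dist)
        order = order[order != q]                        # rank only the OTHER test instances
        relevant = (y_test[order] == y_test[q])
        r = relevant.sum()                               # r_i: number of relevant instances
        if r == 0:
            continue
        hits = np.cumsum(relevant)
        ranks = np.arange(1, len(order) + 1)
        precision_at_hits = hits[relevant] / ranks[relevant]
        ap_sum += precision_at_hits.sum() / r            # average precision of query q
    return ap_sum / n                                    # mean over the |Q| = n queries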
Baselines. Besides the Euclidean distance, which is used as a baseline similarity measure, six state-of-the-art DML methods are compared in our empirical study:

• DuOri: This algorithm first applies Combined Stochastic Dual Coordinate Ascent (CSDCA) [97] to solve the dual problem in (4.2) and then computes the distance metric using the learned dual variables.
• DuRP: This is the proposed algorithm for DML (i.e., Algorithm 4).
• SRP: This algorithm applies random projection to project the data into a low dimensional space, and then employs CSDCA to learn the distance metric in this subspace.
• SPCA: This algorithm uses PCA as the initial step to reduce the dimensionality, and then applies CSDCA to learn the distance metric in the subspace generated by PCA.
• OASIS [24]: A state-of-the-art online learning algorithm for DML that learns the optimal distance metric directly from the original space without any dimensionality reduction.
• LMNN [114]: A state-of-the-art batch learning algorithm for DML. It performs dimensionality reduction using PCA before starting DML.

Implementation details. We randomly select N = 100,000 active triplets (i.e., triplets that incur a positive hinge loss under the Euclidean distance) and set the number of epochs to 3 for all stochastic methods (i.e., DuOri, DuRP, SRP, SPCA and OASIS), which yields sufficiently accurate solutions in our experiments and is also consistent with the observation in [97]. We search λ in {10^-5, 10^-4, 10^-3, 10^-2} and fix it as 1/N since the results are insensitive to it. The step size of CSDCA is set according to the analysis in [97]. For all stochastic optimization methods, we follow the one-projection paradigm by projecting the learned metric onto the PSD cone. The hinge loss is used in the implementation of the proposed algorithm. Both OASIS and LMNN use the implementation provided by the original authors, and parameters are tuned based on the recommendations of the original authors. All methods are implemented in Matlab, except for LMNN, whose core part is implemented in C, which is shown to be more efficient than our Matlab implementation. All stochastic optimization methods are repeated five times and the average result over five trials is reported. All experiments are run on a Linux server with 64GB memory and 12 × 2.4GHz CPUs, and only a single thread is permitted for each experiment.

4.2.2 Efficiency of the Proposed Method

Table 4.3: CPU time (minutes) for different methods for DML. All algorithms are implemented in Matlab except for LMNN, whose core part is implemented in C and is more efficient than our Matlab implementation.

            Metric in Original Space                           Metric in Subspace
            DuOri             DuRP        OASIS                SRP         SPCA        LMNN
protein     214.5 ± 5.8       0.6 ± 0.0   81.9 ± 3.3           0.2 ± 0.0   0.2 ± 0.0   488.9
isolet      198.4 ± 7.2       0.6 ± 0.0   12.6 ± 0.2           0.1 ± 0.0   0.1 ± 0.0   384.0
imagenet5   776.3 ± 31.2      1.2 ± 0.0   68.3 ± 1.3           0.1 ± 0.0   0.2 ± 0.0   1,374.1
cats5       782.3 ± 70.9      1.1 ± 0.0   74.1 ± 1.1           0.1 ± 0.0   0.3 ± 0.1   1,467.0
caltech30   1,214.5 ± 229.5   1.3 ± 0.0   640.4 ± 121.2        0.2 ± 0.0   0.5 ± 0.0   2,197.9
tdt30       1,029.9 ± 16.8    0.8 ± 0.0   140.8 ± 4.5          0.2 ± 0.0   0.4 ± 0.0   624.2
20news      1,212.9 ± 154.3   1.0 ± 0.0   216.3 ± 48.8         0.2 ± 0.0   0.5 ± 0.0   1,893.6
rcv30       1,121.3 ± 79.4    1.3 ± 0.0   432.5 ± 7.7          0.2 ± 0.0   4.2 ± 0.0   N/A

In this experiment, we set the number of random projections to 10, which, according to the experimental results in Sections 4.2.3 and 4.2.4, yields almost the optimal performance for the proposed algorithm. For fair comparison, the number of reduced dimensions is also set to 10 for LMNN.
Table 4.3 compares the CPU time (in minutes) of the different methods. Notice that the time for sampling triplets is not taken into account, as it is consumed by all the methods, while all the other operations (e.g., random projection and PCA) are included. It is not surprising to observe that DuRP, SRP and SPCA have similar CPU times, and are significantly more efficient than the other methods due to the effect of dimensionality reduction. Since DuRP and SRP share the same procedure for computing the dual variables in the subspace, the only difference between them lies in the procedure for reconstructing the distance metric from the estimated dual variables, a computational overhead that makes DuRP slightly slower than SRP. For all datasets, we observe that DuRP is at least 200 times faster than DuOri and 20 times faster than OASIS. Compared to the stochastic optimization methods, LMNN is the least efficient on six datasets (i.e., protein, isolet, imagenet5, cats5, caltech30 and 20news), mostly due to the fact that it is a batch learning algorithm.

4.2.3 Evaluation by Ranking

In the first experiment, we set the number of random projections used by SRP, SPCA and the proposed DuRP algorithm to 10, which is roughly 1% of the dimensionality of the original space. For fair comparison, the number of reduced dimensions for LMNN is also set to 10. We measure the quality of the learned metrics by their ranking performance using the metric of mAP. Table 4.4 summarizes the performance of the different methods for DML.

Table 4.4: Comparison of ranking results measured by mAP (%) for different metric learning algorithms.

                     Metric in Original Space                  Metric in Subspace
            Euclid   DuOri        DuRP         OASIS           SRP          SPCA         LMNN
protein     39.0     47.0 ± 0.1   49.1 ± 0.1   45.7 ± 0.1      37.7 ± 0.1   41.9 ± 0.1   41.9
isolet      46.7     77.1 ± 0.5   70.5 ± 0.6   67.6 ± 0.2      11.9 ± 1.2   36.5 ± 2.6   68.7
imagenet5   25.4     34.6 ± 0.1   34.1 ± 0.3   29.2 ± 0.1      21.5 ± 0.3   28.1 ± 0.2   31.9
cats5       22.5     26.3 ± 0.1   27.3 ± 0.3   23.7 ± 0.1      21.0 ± 0.1   21.5 ± 0.2   24.8
caltech30   16.4     23.8 ± 0.1   25.5 ± 0.1   25.4 ± 0.2      8.1 ± 0.4    19.5 ± 0.0   16.3
tdt30       36.8     65.9 ± 0.2   69.4 ± 0.3   55.9 ± 0.1      11.2 ± 0.3   49.7 ± 0.2   66.4
20news      8.4      20.1 ± 0.2   24.9 ± 0.3   16.2 ± 0.1      5.3 ± 0.1    12.2 ± 0.1   22.5
rcv30       16.7     65.7 ± 0.1   63.2 ± 0.2   68.6 ± 0.1      12.8 ± 0.4   46.5 ± 0.0   N/A

First, we observe that DuRP significantly outperforms SRP and SPCA for all datasets. In fact, SRP is worse than the Euclidean distance, which computes the distance in the original space. SPCA is only able to perform better than the Euclidean distance and is outperformed by all the other DML algorithms. Second, we observe that for all the datasets, DuRP yields similar performance as DuOri. The only difference between DuRP and DuOri is that DuOri solves the dual problem without using random projection. The comparison between DuRP and DuOri indicates that the random projection step has minimal impact on the learned distance metric, justifying the design of the proposed algorithm. Third, compared to OASIS, we observe that DuRP performs significantly better on three datasets (i.e., imagenet5, tdt30 and 20news) and has comparable performance on the other datasets. Finally, we observe that for all datasets, the proposed DuRP method significantly outperforms LMNN, a state-of-the-art batch learning algorithm for DML. We also note that because of limited memory, we are unable to run LMNN on dataset rcv30.

In the second experiment, we vary the number of random projections from 10 to 50. All stochastic methods are run with five trials, and Fig. 4.2 reports the average results with standard deviation. Note that the performance of OASIS and DuOri remains unchanged with a varied number of projections because they do not use projection. It is surprising to observe that DuRP almost achieves its best performance with only 10 projections for all datasets. This is in contrast to SRP and SPCA, whose performance usually improves with an increasing number of projections. We also observe that DuRP outperforms DuOri for several datasets (i.e., protein, imagenet5, cats5, caltech30, tdt30 and 20news). We suspect that the better performance of DuRP is due to the implicit regularization induced by the random projection.
We plan to investigate the regularization capability of random projection further in the future. We finally point out that with a sufficiently large number of projections, SPCA is able to outperform OASIS on 5 datasets (i.e., protein, imagenet5, cats5, tdt30 and 20news), indicating that the comparison result may be sensitive to the number of projections.

Figure 4.2: The comparison of different stochastic algorithms for ranking (panels (a)-(h): protein, isolet, imagenet5, cats5, caltech30, tdt30, 20news, rcv30; each panel plots mAP (%) against the number of projections for OASIS, DuOri, DuRP, SRP and SPCA).

4.2.4 Evaluation by Classification

In this experiment, we evaluate the learned metric by its classification accuracy with a k-NN (k = 5) classifier. We emphasize that the purpose of this experiment is to evaluate the metrics learned by different DML algorithms, not to demonstrate that the learned metric will result in state-of-the-art classification performance.¹ Similar to the evaluation by ranking, all experiments are run five times and the results averaged over five trials with standard deviation are reported in Fig. 4.3. We essentially have the same observations as for the ranking experiments reported in Section 4.2.3, except that for most datasets the three methods DuRP, DuOri, and OASIS yield very similar performance.

Note that the main concern of this chapter is running time, while the size of the learned metric is d × d. It is straightforward to store the learned metric efficiently by keeping a low-rank approximation of it.

¹ Many studies (e.g., [114, 116]) have shown that metric learning does not yield better classification accuracy than standard classification algorithms (e.g., SVM) given a sufficiently large number of training data.

Figure 4.3: The comparison of different stochastic algorithms for classification (panels (a)-(h): protein, isolet, imagenet5, cats5, caltech30, tdt30, 20news, rcv30; each panel plots accuracy (%) against the number of projections).

4.3 Conclusions

In this chapter, we propose a dual random projection method to learn the distance metric for large-scale high-dimensional datasets. The main idea is to solve the dual problem in the subspace spanned by the random projection, and then recover the distance metric in the original space using the estimated dual variables. We develop theoretical guarantees showing that, with a high probability, the proposed method can accurately recover the optimal solution with small error when the data matrix is of low rank, and the optimal dual variables even when the data matrix cannot be well approximated by a low rank matrix. Our empirical study confirms both the effectiveness and the efficiency of the proposed algorithm for DML by comparing it to the state-of-the-art algorithms for DML. In the future, we may further improve the efficiency of our method by exploiting the scenario when the optimal distance metric can be well approximated by a low rank matrix.
Chapter 5
Fine-Grained Visual Categorization via Multi-stage Metric Learning

In this chapter¹, we will address the challenge arising from the large number of constraints (O(n³)) and propose a new DML framework for FGVC by combining it with the approach in the last chapter. FGVC aims to distinguish objects in subordinate classes. For instance, dog images are classified into different breeds of dogs, such as "Chihuahua", "Pug", "Samoyed" and so on [70, 88]. As a result, FGVC classification has to handle the co-occurrence of two somewhat contradictory requirements: 1) it needs to distinguish many similar classes (e.g., dog breeds that have only subtle differences), and 2) it needs to deal with large intra-class variance (e.g., caused by different poses, examples, etc.). Fig. 5.1 demonstrates an example, and such co-occurring requirements make FGVC very challenging.

¹ This chapter is adapted from the published paper: Q. Qian, R. Jin, S. Zhu and Y. Lin. Fine-Grained Visual Categorization via Multi-stage Metric Learning. In: Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15), Boston, MA, 2015.

Figure 5.1: Illustration of how DML learns the embedding that pulls together the data points from the same class and pushes apart the data points from different classes. Blue points are from the class "English marigold" while red ones are "Barberton daisy". An important note here is that our DML does not require collapsing the data points from each class, and this allows the flexibility to model intra-class variance. A big challenge now is how to deal with the high-dimensional feature representation which is typical for image-level visual features. To this end, we propose a multi-stage scheme for metric learning.

The popular pipeline for FGVC consists of two steps, a feature extraction step and a classification step. The feature extraction step, which is sometimes combined with segmentation [2, 21, 88], part localization [8, 118] or both [20], extracts image level representations, and popular choices are LLC features [2], Fisher vectors [44] and so on. A recent development is to use a convolutional neural network (CNN) [72] trained on a large-scale image classification dataset (e.g., ImageNet [92]) and then use the trained model to extract features [34]. These so-called deep learning features have demonstrated state-of-the-art performance on FGVC datasets [34]. Note that training a CNN directly on FGVC datasets is difficult because the existing FGVC benchmarks are often too small [34] (only several tens of thousands of training images or less). In this chapter, we simply take the state-of-the-art deep learning features without any other operators (e.g., segmentation) and focus on studying a better classification approach to address the aforementioned two co-occurring requirements in FGVC.

For the classification step, many existing FGVC methods simply learn a single classifier for each fine-grained class according to the one-vs-all strategy [2, 8, 21, 118]. Apparently, this strategy does not scale well with the number of fine-grained classes, while the number of subordinate classes in FGVC can be very large (e.g., 200 classes in the birds11 dataset). Additionally, such a one-vs-all scheme only addresses the first of the two issues, namely, it makes an effort to separate different classes without modeling intra-class variance. In contrast, this chapter proposes a DML approach, aiming to explicitly handle the two co-occurring requirements with a single metric. Fig. 5.1 illustrates how DML works. The key idea of DML is twofold: 1) it learns a distance metric that pulls neighboring data points of the same class close to each other and pushes data points from different classes far apart, and 2) it has the flexibility to define the neighborhood size and requires only a portion of the data points from the same class to be close to each other. Consequently, the flexibility of the neighborhood size acts as a new way to model large intra-class variance. With a learned metric, a k-nearest neighbor classifier will be applied to find the class assignment for a test image.
There are three challenges in learning a metric directly from the original high dimensional space:

• Large number of constraints: A large number of training constraints is usually required to avoid overfitting in high dimensional DML. The total number of triplet constraints can be up to O(n³), where n is the number of examples.
• Computational challenge: DML has to learn a matrix of size d×d, where d is the dimensionality of the data and d = 134,016 in our study. The O(d²) number of variables leads to two computational challenges in finding the optimal metric. First, it results in a slower convergence rate in solving the related optimization problem [89]. Second, to ensure the learned metric to be positive semi-definite (PSD), most DML algorithms require, at every iteration of optimization, projecting the intermediate solution onto the PSD cone, an expensive operation with complexity of O(d³) (at least O(d²)).
• Storage limitation: It can be expensive to simply save O(d²) variables in memory. For instance, in our study it would take more than 130GB to store the complete metric in memory, which adds more complexity to the already difficult optimization problem.

In this chapter, we propose a multi-stage metric learning framework for high dimensional DML to explicitly address these challenges. First, to deal with the large number of constraints used by high dimensional DML, we divide the original optimization problem into multiple stages. At each stage, only a small subset of constraints that are difficult to classify by the currently learned metric will be adaptively sampled and used to improve the learned metric. By setting the regularizer appropriately, we can prove that the final solution is optimized over all appeared constraints. Second, to handle the computational challenge in each subproblem, we extend the theory of dual random projection [120], which was originally developed for linear classification problems, to metric learning. The proposed method, on one hand, enjoys the efficiency of random projection, and on the other hand learns a distance metric of size d×d. This is in contrast to most dimensionality reduction methods that learn a metric in a reduced space. Finally, to handle the storage problem, we propose to maintain a low rank copy of the learned metric by a randomized algorithm for low rank matrix approximation. It not only accelerates the whole learning process but also regularizes the learned metric to avoid overfitting. Extensive comparisons on benchmark FGVC datasets verify the effectiveness and efficiency of the proposed method.

The rest of the chapter is organized as follows: Section 5.1 describes the details of the proposed method. Section 5.2 shows the results of the empirical study, and Section 5.3 concludes this work.

5.1 Multi-stage Metric Learning

The proposed DML algorithm focuses on triplet constraints so as to pull the small portion of nearest examples from the same class together [114]. Let D = {(x_i, y_i), i = 1, ..., n} be a collection of n training images, where x_i ∈ R^d and y_i is the class assignment of x_i. Given a distance metric M, the distance between two data points x_i and x_j is measured by

dist_M(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j)

Let {x_i^t, x_j^t, x_k^t} (t = 1, ..., N) be a set of N triplet constraints derived from the training examples in D. Since in each constraint (x_i^t, x_j^t, x_k^t), x_i^t and x_j^t are assumed to share the same class, which is different from that of x_k^t, we expect

dist_M(x_i^t, x_j^t) ≤ dist_M(x_i^t, x_k^t) − 1

Casting these constraints into a regularized loss minimization problem over PSD matrices (M ⪰ 0) gives the formulation referred to below as (5.1), where ℓ(·) is the smoothed hinge loss with smoothing parameter γ:

ℓ(x) = 0 if x ≥ 1;   1 − x − γ/2 if x < 1 − γ;   (1 − x)²/(2γ) otherwise.

One main computational challenge of DML comes from the PSD constraint M ⪰ 0 in (5.1). We address this challenge by following the one-projection paradigm [24] that first learns a metric M without the PSD constraint and then projects M onto the PSD cone at the very end of the learning process. Hence, in this study, we will focus on the following optimization problem for FGVC

min_{M ∈ S^d}   (λ/2) ||M||_F² + Σ_{t=1}^N ℓ(⟨A_t, M⟩)        (5.2)

where A_t = (x_i^t − x_k^t)(x_i^t − x_k^t)^T − (x_i^t − x_j^t)(x_i^t − x_j^t)^T is introduced as a matrix representation for each triplet constraint, and ⟨·,·⟩ represents the dot product between two matrices.
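Because each A_t is built from only two difference vectors, the inner product ⟨A_t, M⟩ in (5.2) never needs A_t to be formed as a d×d matrix; the short Python sketch below illustrates this, together with the smoothed hinge loss defined above. The function names and the default smoothing parameter are our own choices for illustration, not part of the published method.

import numpy as np

def smoothed_hinge(z, gamma=1.0):
    """Smoothed hinge loss ell(z) used in (5.2), with smoothing parameter gamma."""
    if z >= 1.0:
        return 0.0
    if z < 1.0 - gamma:
        return 1.0 - z - gamma / 2.0
    return (1.0 - z) ** 2 / (2.0 * gamma)

def triplet_loss(M, xi, xj, xk, gamma=1.0):
    """ell(<A_t, M>) for one triplet (xi, xj, xk) without materializing A_t.

    <A_t, M> = (xi - xk)^T M (xi - xk) - (xi - xj)^T M (xi - xj),
    i.e. the margin by which the impostor xk is farther from xi than the target xj.
    """
    dik = xi - xk
    dij = xi - xj
    margin = dik @ M @ dik - dij @ M @ dij
    return smoothed_hinge(margin, gamma)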
We will now discuss the strategies to address the three challenges of high dimensional DML, and summarize the framework of high dimensional DML for FGVC at the end of this section.

5.1.1 Constraints Challenge: Multi-stage Division

In order to reliably determine the distance metric in a high dimensional space, a large number of training examples is needed to avoid the overfitting problem. Since the number of triplet constraints can be O(n³), the number of summation terms in (5.2) can be extremely large, making it difficult to effectively solve the optimization problem in (5.2). Although active learning may help reduce the number of constraints [114], the number of active constraints can still be very large, since many images in FGVC from different categories are visually similar, leading to many mistakes. To address this challenge, we divide the learning process into multiple stages. At the s-th stage, let M_{s−1} be the distance metric learned from the last stage. We sample a subset of active triplet constraints that are difficult to classify by M_{s−1} (i.e., incur a large hinge loss). Given M_{s−1} and the sampled triplet constraints N_s, we update the distance metric by solving the following optimization problem

min_{M_s ∈ S^d}   (λ/2) ||M_s − M_{s−1}||_F² + Σ_{t ∈ N_s} ℓ(⟨A_t, M_s⟩)        (5.3)

Although only a small set of constraints is used to improve the metric at each stage, we have

Theorem 5. The metric learned by solving the problem (5.3) is equivalent to optimizing the following objective function

min_{M ∈ S^d}   (λ/2) ||M||_F² + Σ_{k=1}^s Σ_{t ∈ N_k} ℓ(⟨A_t, M⟩)

Proof. Consider the objective function for the first s stages

min_{M ∈ S^d}   [ (λ/2) ||M||_F² + Σ_{k=1}^{s−1} Σ_{t ∈ N_k} ℓ(⟨A_t, M⟩) ]  +  Σ_{t ∈ N_s} ℓ(⟨A_t, M⟩)        (5.4)

where we denote the bracketed term by L_{s−1}(M). It is obvious that L_{s−1} is strongly convex, so we have (Chapter 9, [12])

L_{s−1}(M) = L_{s−1}(M_{s−1}) + ⟨∇L_{s−1}(M_{s−1}), M − M_{s−1}⟩ + (1/2) ⟨M − M_{s−1}, ∇²L_{s−1}(M̃)(M − M_{s−1})⟩

for some M̃ between M and M_{s−1}. Since M_{s−1}, the solution obtained from the first s−1 stages, approximately optimizes L_{s−1}(M) and L_{s−1} is λ-strongly convex, we have

L_{s−1}(M) ≥ L_{s−1}(M_{s−1}) + (λ/2) ||M − M_{s−1}||_F²        (5.5)

We finish the proof by replacing L_{s−1}(M) in (5.4) with the approximation in (5.5).

Remark. This theorem demonstrates that the metric learned at the last stage is optimized over the constraints from all stages. Therefore, the original problem can be divided into several subproblems, each of which involves only an affordable number of active constraints. Fig. 5.2 summarizes the framework of the multi-stage learning procedure.

Figure 5.2: The framework of the proposed method.

5.1.2 Computational Challenge: Dual Random Projection

Now we will solve the high dimensional subproblem by the dual random projection technique, which is similar to that in the last chapter. To simplify the analysis, we investigate the subproblem at the first stage; the following stages can be analyzed in the same way. By introducing the convex conjugate ℓ* for ℓ in (5.3), the dual problem of DML is

max_{α ∈ R^{|N_1|}}   −Σ_{t=1}^{|N_1|} ℓ*(α_t) − (1/(2λ)) α^T G α        (5.6)

where α_t is the dual variable for A_t and G is a matrix defined as G_{a,b} = ⟨A_a, A_b⟩, and M_1 = −(1/λ) Σ_{t=1}^{|N_1|} α_t A_t, obtained by setting the gradient with respect to M_1 to zero. Let R_1, R_2 ∈ R^{d×m} be two Gaussian random matrices, where m is the number of random projections (m ≪ d) and R^1_{i,j}, R^2_{i,j} ~ N(0, 1/m). For each triplet constraint, we project its representation A_t into the low dimensional space using the random matrices, i.e., Â_t = R_1^T A_t R_2. By using double random projections, which is different from the single random projection in [120], we have the following lemma.
Lemma 1. For any pair ⟨A_a, A_b⟩, the double random projections preserve the pairwise similarity in expectation:

E[⟨Â_a, Â_b⟩] = ⟨A_a, A_b⟩

The proof is straightforward. According to the lemma, the dual variables in (5.6) can be estimated in the low dimensional space as

max_{α̂ ∈ R^{|N_1|}}   −Σ_{t=1}^{|N_1|} ℓ*(α̂_t) − (1/(2λ)) α̂^T Ĝ α̂        (5.7)

where Ĝ_{a,b} = ⟨Â_a, Â_b⟩. Then, by the definition of the convex conjugate and the smoothness of the loss function (which is different from the analysis in Chapter 4), each dual variable α̂_t in (5.7) can further be estimated by ℓ'(⟨Â_t, M̂_1⟩), where M̂_1 ∈ R^{m×m} is the metric learned in the reduced space. Generally, M̂_s is learned by solving the following optimization problem

min_{M̂_s ∈ S^m}   (λ/2) ||M̂_s − M̂_{s−1}||_F² + Σ_{t=1}^{|N_s|} ℓ(⟨Â_t, M̂_s⟩)        (5.8)

Since the size of M̂_s ∈ R^{m×m} is significantly smaller than that of M_s, (5.8) can be solved much more efficiently than (5.3). In our implementation, a simple stochastic gradient descent is used to efficiently solve the optimization problem in (5.8). Given M̂_1, the final distance metric M_1 ∈ R^{d×d} in the original space is estimated as

M_1 = −(1/λ) Σ_{t=1}^{|N_1|} α̂_t A_t        (5.9)

5.1.3 Storage Challenge: Low Rank Approximation

Although (5.9) allows us to recover the distance metric M in the original d dimensional space from the dual variables {α̂_t}_{t=1}^{|N|}, it is expensive, if not impossible, to save M in memory since d is very large in FGVC [2]. To address this challenge, instead of saving M, we propose to save a low rank approximation of M. More specifically, let λ_1 ≥ ... ≥ λ_r be the first r ≪ d eigenvalues of M, and let u_1, ..., u_r be the corresponding eigenvectors. We approximate M by a low rank matrix M̃ = Σ_{i=1}^r λ_i u_i u_i^T = L L^T. Different from existing DML methods that directly optimize L [113], we obtain M first and then decompose it, which avoids a suboptimal solution. Unlike M, which requires O(d²) storage space, it only takes O(rd) space to save M̃, and r can be an arbitrary value. In addition, the low rank metric also accelerates the sampling step by reducing the cost of computing a distance from O(d) to O(r). Low rank is also a popular regularizer to avoid overfitting when learning a high dimensional metric [78]. However, the key issue is how to efficiently compute the eigenvectors and eigenvalues of M at each stage. This is particularly challenging in our case, as M in (5.9) cannot even be computed explicitly due to its large size.

To address this problem, we first investigate the structure of the recovering step for the s-th stage as in (5.9):

M_s = M_{s−1} − (1/λ) Σ_{t=1}^{|N_s|} α̂_t^s A_t^s
    = M_{s−2} − (1/λ) ( Σ_{t=1}^{|N_s|} α̂_t^s A_t^s + Σ_{t=1}^{|N_{s−1}|} α̂_t^{s−1} A_t^{s−1} )
    = −(1/λ) Σ_{k=1}^s Σ_{t=1}^{|N_k|} α̂_t^k A_t^k

Therefore, we can express the summation as a matrix multiplication. In particular, for each triplet (x_i^t, x_j^t, x_k^t), we denote its dual variable by α = ℓ'(⟨Â, M̂⟩) and set the corresponding entries in a sparse matrix C as

C(i,j) = α,   C(j,i) = α,   C(j,j) = −α
C(i,k) = −α,  C(k,i) = −α,  C(k,k) = α        (5.10)

It is easy to verify that M can be written as

M = X C X^T        (5.11)

Second, we exploit randomized matrix approximation theory [52] to efficiently compute the eigendecomposition of M. More specifically, let R ∈ R^{d×q} (q ≪ d) be a Gaussian random matrix. According to [52], with an overwhelming probability, most of the top r eigenvectors of M lie in the subspace spanned by the column vectors of MR, provided q ≥ r + k, where k is a constant independent of d. The limitation of the method is that it requires access to the matrix M for computing MR, while keeping the whole matrix is difficult here. Fortunately, by replacing M with XCX^T according to (5.11), we can approximate the top eigenvectors of M using XCX^T R, which is of size d×q and can be computed efficiently since C is a sparse matrix. The overall computational cost of the proposed algorithm for low rank approximation is only O(qnd), which is linear in d. Note that the sparse matrix C is accumulated over all stages.
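Putting (5.10), (5.11) and the randomized range finder together gives a procedure that never materializes the d×d metric. The Python sketch below illustrates that computation; the naming and the use of scipy.sparse are our own, and Alg. 5 below states the actual procedure used in this chapter.

import numpy as np
import scipy.sparse as sp

def low_rank_recover(X, triplets, alphas, q, r):
    """Rank-r factor L with M ~= L L^T, where M = X C X^T as in (5.11).

    X        : d x n data matrix
    triplets : list of (i, j, k) index triples
    alphas   : dual variable alpha for each triplet, as in (5.10)
    q        : number of random combinations (q >= r)
    """
    d, n = X.shape
    C = sp.lil_matrix((n, n))
    for (i, j, k), a in zip(triplets, alphas):       # fill the sparse matrix C as in (5.10)
        C[i, j] += a; C[j, i] += a; C[j, j] -= a
        C[i, k] -= a; C[k, i] -= a; C[k, k] += a
    C = C.tocsr()

    R = np.random.randn(d, q)
    Y = X @ (C @ (X.T @ R))                          # = M R, multiplied right-to-left, O(qnd)
    Q, _ = np.linalg.qr(Y)                           # orthonormal basis for the range of M R
    B = Q.T @ (X @ (C @ (X.T @ Q)))                  # small q x q matrix Q^T M Q
    w, U = np.linalg.eigh((B + B.T) / 2)
    idx = np.argsort(w)[::-1][:r]                    # top-r eigenvalues of the small problem
    w, U = w[idx], U[:, idx]
    keep = w > 0                                     # keep only the positive eigenvalues (PSD part)
    L = (Q @ U[:, keep]) * np.sqrt(w[keep])          # columns are sqrt(lambda_i) * u_i
    return L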
Alg. 5 summarizes the key steps of the proposed approach for low rank approximation, where qr and eig stand for the QR decomposition and the eigendecomposition of a matrix. Note that distributed computing is particularly effective for the realization of the algorithm, because the matrix multiplication XCX^T R can be accomplished in parallel, which is helpful when n is also large.

Algorithm 5: An Efficient Algorithm for Recovering M from M̂ and Projecting It onto the PSD Cone
1: Input: dataset X ∈ R^{d×n}, M̂ ∈ R^{m×m}, the number of random combinations q
2: Compute a Gaussian random matrix R ∈ R^{d×q}
3: Compute a sparse matrix C using (5.10)
4: Y = R^T × X, Y = Y × C, Y = Y × X^T
5: [Q, R] = qr(Y^T)
6: B = Q^T × X, B = B × C, B = B × X^T, B = B × Q
7: [U, Σ] = eig(B)
8: U = QU
9: return L = [√λ_1 u_1, ..., √λ_r u_r] and M̃ = LL^T, where u_i is the i-th column of U and λ_i is the i-th positive diagonal element of Σ

Alg. 6 shows the whole picture of the proposed method. The total cost of the framework is only O(dn).

Algorithm 6: The Multi-stage Metric Learning Framework for High Dimensional DML (MsML)
1: Input: dataset X ∈ R^{d×n}, the number of random projections m, the number of random combinations q, and the number of stages T
2: Compute two Gaussian random matrices R_1, R_2 ∈ R^{d×m}
3: Initialize M̂_0 = 0 ∈ R^{m×m} and M_0 = 0 ∈ R^{d×d}
4: for s = 1, ..., T do
5:   Sample one epoch (O(n)) of active triplet constraints using M_{s−1}
6:   Estimate M̂_s by solving the optimization problem in (5.8) using SGD
7:   Recover the distance metric M_s in the d dimensional space using Alg. 5
8: end for
9: return M_T

5.2 Experiments

DeCAF features [34] are extracted as the image representations in the experiments. Although they come from the activations of a deep convolutional network trained on ImageNet [72], they outperform conventional visual features on many general tasks [34]. We concatenate features from the last three fully connected layers (i.e., DeCAF_5+6+7) and the dimension of the resulting features is 51,456.

We apply the proposed algorithm to learn a distance metric and use the learned metric together with a smoothed k-nearest neighbor classifier, a variant of k-NN, to predict the class assignments for test examples. Different from traditional k-NN, smoothed k-NN first obtains k reference centers for each class by clustering the images in each class into k clusters. To predict the class assignment for a test image x, it computes a distance between x and a class C using the reference centers C_1, ..., C_k as

dis(x, C) = −γ log( Σ_{j=1}^k exp( −|x − C_j|² / γ ) )        (5.12)

and assigns x to the class with the shortest distance. The distance function given in (5.12) computes the soft min among the distances between x and the C_j, and we use the hard min, min_{1≤j≤k} |x − C_j|², when γ = 0. We find that smoothed k-NN is more efficient than the traditional k-NN, especially for large-scale learning problems. We refer to the classification approach based on the metric learned by the proposed algorithm and the smoothed k-NN as MsML, and the smoothed k-NN with the Euclidean distance in the original space as Euclid. Although the size of the covariance matrix is very large (51,456 × 51,456), its rank is low due to the small number of training examples, and thus PCA can be computed explicitly. The state-of-the-art DML algorithm LMNN [114], with PCA as a preprocessing step, is also included in the comparison. The one-vs-all strategy, based on the implementation of LIBLINEAR [37], is used as a baseline for FGVC, with the regularization parameter varied in the range {10^i} (i = −2, ..., 3). We refer to it as LSVM. We also include the state-of-the-art results for FGVC in our evaluation. All the parameters used by MsML are set empirically, with the number of random projections m = 100 and the number of random combinations q = 600. PCA is applied for LMNN to reduce the dimensionality to m before DML is learned. LMNN is implemented by the code from the original authors and the recommended parameters are used.² To ensure that the baseline method fully exploits the training data, we set the maximum number of iterations for LMNN to 10⁴. These parameter values are used throughout all the experiments. All training/test splits are provided by the datasets.
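Before turning to the datasets and results, the smoothed k-NN rule in (5.12) can be written down in a few lines; the Python sketch below is a toy illustration with our own naming, assuming the per-class reference centers have already been obtained by clustering in the space induced by the learned metric.

import numpy as np

def smoothed_knn_predict(x, class_centers, gamma=1.0):
    """Predict the class of x with the smoothed k-NN rule (5.12).

    class_centers : dict mapping class label -> (k x d) array of reference
                    centers obtained by clustering the training images of
                    that class (e.g. k-means in the learned embedding).
    gamma         : smoothing parameter; gamma = 0 recovers the hard min.
    """
    best_label, best_dist = None, np.inf
    for label, centers in class_centers.items():
        sq = np.sum((centers - x) ** 2, axis=1)                       # |x - C_j|^2 for each center
        if gamma <= 0:
            dist = sq.min()                                           # hard min
        else:
            dist = -gamma * np.log(np.sum(np.exp(-sq / gamma)))       # soft min of (5.12)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label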
Mean accuracy, a standard evaluation metric for FGVC, is used to evaluate the classification performance. All experiments are run on a single machine with 16 2.10GHz cores and 96GB memory.

² We did vary the parameters slightly from the recommended values and did not find any noticeable change in the classification accuracy.

5.2.1 Oxford Cats & Dogs

cats&dogs contains 7,349 images from 37 cat and dog species [88]. There are about 100 images per class for training and the rest are for testing. Table 5.1 summarizes the results. First, we observe that MsML is more accurate than the baseline LSVM. This is not surprising, because the distance metric is learned from the training examples of all class assignments. This is in contrast to the one-vs-all approach used in LSVM, in which the classification function for a class C is learned only from the class assignments of C. Second, our method performs significantly better than the baseline DML method, indicating that the unsupervised dimension reduction method PCA may result in suboptimal solutions for DML. Fig. 5.3 compares the images that are most similar to the query images using the metric learned by the proposed algorithm (Columns 8-10) to those based on the metric learned by LMNN (Columns 5-7) and Euclid (Columns 2-4). We observe that more similar images are found by the metric learned by MsML than by LMNN. For example, MsML is able to capture the difference between two cat species (long hair vs. short hair), while LMNN returns very similar images with the wrong label. More examples are given in Fig. 5.6. Third, MsML has overwhelming performance compared to all state-of-the-art FGVC approaches. Although the method [88] using the ground truth head bounding box and segmentation achieves 59.21%, MsML is 20% better than it with only image information, which shows the advantage of the proposed method. Finally, it takes less than 0.2 seconds to extract DeCAF features per image based on a CPU implementation, while a simple segmentation operator costs more than 2.5 seconds as reported in the study [2], making the proposed method for FGVC practically more appealing.

Table 5.1: Comparison of mean accuracy (%) on the cats&dogs dataset. "#" means that more information (e.g., ground truth segmentation) is used by the method.

Methods                     Mean Accuracy (%)
Image only [88]             39.64
Det+Seg [2]                 54.30
Image+Head+Body# [88]       59.21
Euclid                      72.60
LSVM                        77.63
LMNN                        76.24
MsML                        80.45
MsML+                       81.18

Figure 5.3: Examples of retrieved images. The first column shows query images highlighted by green bounding boxes. Columns 2-4 include the most similar images measured by Euclid. Columns 5-7 show those by the metric from LMNN. Columns 8-10 are from the metric of MsML. Images in columns 2-10 are highlighted by red bounding boxes when they share the same category as the queries, and by blue bounding boxes if they do not.

To evaluate the performance of MsML for extremely high dimensional features, we concatenate conventional features by using the pipeline for visual feature extraction that is outlined in [2]. Specifically, we extract HOG [30] features at 4 different scales and encode them with an 8K-dimensional feature dictionary by the LLC method [113]. A max pooling strategy is then used to aggregate local features into a single vector representation. Finally, 82,560 features are extracted from each image and the total dimension is up to 134,016. MsML with the combined features is denoted as MsML+, and we can observe that it further improves the performance by about 1%, as shown in Table 5.1. Note that the time for extracting these high dimensional conventional features is only 0.5 seconds per image, which is still much cheaper than any segmentation or localization operator.

5.2.2 Oxford 102 Flowers

102flowers is the Oxford flowers dataset for flower species [86], which consists of 8,189 images from 102 classes. Each class has 20 images for training and the rest are for testing. Table 5.2 shows the results from different methods. We have a similar conclusion for the methods using the same DeCAF features. That is, MsML outperforms LSVM and LMNN significantly.
Although LSVM already performs very well, MsML further improves the accuracy. Additionally, it is observed that the performance of state-of-the-art methods, even with segmentation operators, is much worse than that of MsML. Note that GT [86] uses hand annotated segmentations followed by a multiple kernel SVM, while MsML outperforms it by about 3% without any such supervised information, which confirms the effectiveness of the proposed method.

Fig. 5.4 illustrates the changing trend of test mean accuracy as the number of stages increases. We observe that MsML converges very fast, which verifies that the multi-stage division is essential in our proposed framework.

Table 5.2: Comparison of mean accuracy (%) on the 102flowers dataset. "#" means that more information (e.g., ground truth segmentation) is used by the method.

Methods                     Mean Accuracy (%)
Combined CoHoG [65]         74.80
Combined Features [85]      76.30
BiCoS-MT [19]               80.00
Det+Seg [2]                 80.66
TriCoS [21]                 85.20
GT# [86]                    85.60
Euclid                      76.21
LSVM                        87.14
LMNN                        81.93
MsML                        88.39
MsML+                       89.45

Figure 5.4: Convergence curve of the proposed method on 102flowers (test mean accuracy versus the number of stages).

5.2.3 Birds-2011

birds11 is the Caltech-UCSD-200-2011 birds dataset for bird species [109]. There are 200 classes with a total of 11,788 images, and each class has roughly 30 images for training. We use the version with the ground truth bounding box. Table 5.3 compares the proposed method to the state-of-the-art baselines. First, it is obvious that the performance of MsML is significantly better than all baseline methods, consistent with the observations above. Second, although Symb [20] combines segmentation and localization, MsML outperforms it by 9% without any time consuming operator. Third, Symb* and Ali* mirror the training images to improve their performance, while MsML is even better without this trick. Finally, MsML outperforms the method combining DeCAF features and DPD models [121], which is due to the fact that most research on FGVC ignores choosing an appropriate base classifier and simply adopts a linear SVM with the one-vs-all strategy. For comparison, we also report the result with mirrored training images, denoted as MsML+*. It provides another 1% improvement over MsML+, as shown in Table 5.3.

Table 5.3: Comparison of mean accuracy (%) on the birds11 dataset. "*" denotes a method that mirrors the training images.

Methods                     Mean Accuracy (%)
Symb [20]                   56.60
POOF [9]                    56.78
Symb* [20]                  59.40
Ali* [44]                   62.70
DeCAF+DPD [34]              64.96
Euclid                      46.85
LSVM                        61.44
LMNN                        51.04
MsML                        65.84
MsML+                       66.61
MsML+*                      67.86

Figure 5.5: Comparison with different numbers of auxiliary classes on birds11 (mean accuracy of LSVM and MsML versus the number of auxiliary classes).

To illustrate the capacity of MsML in exploring the correlation among classes, which makes it more effective than a simple one-vs-all classifier for FGVC, we conduct one additional experiment. We randomly select 50 classes from birds11 as the target classes and use the test images from the target classes for evaluation. When learning the metric, besides the training images from the 50 target classes, we sample k classes from the 150 unselected ones as the auxiliary classes, and use training images from the auxiliary classes as additional training examples for distance metric learning. Fig. 5.5 compares the performance of LSVM and MsML with an increasing number of auxiliary classes. It is not surprising to observe that the performance of LSVM decreases a little, since it is unable to explore the supervision information in the auxiliary classes to improve the classification accuracy on the target classes, and more auxiliary classes just intensify the class imbalance problem. In contrast, the performance of MsML improves significantly with increasing auxiliary classes, indicating that MsML is capable of effectively exploring the training data from the auxiliary classes and is therefore particularly suitable for FGVC.
5.2.4 Stanford Dogs

S-dogs is the Stanford dog species dataset [70]. It contains 120 classes and 20,580 images, where 100 images from each class are used for training. Since it is a subset of ImageNet [92], on which the DeCAF model is trained, we just report the result in Table 5.4 as a reference.

Table 5.4: Comparison of mean accuracy (%) on the S-dogs dataset. "*" denotes a method that mirrors the training images.

Methods                     Mean Accuracy (%)
SIFT [70]                   22.00
Edge Templates [118]        38.00
Symb [20]                   44.10
Symb* [20]                  45.60
Ali* [44]                   50.10
Euclid                      59.22
LSVM                        65.00
LMNN                        62.17
MsML                        69.07
MsML+                       69.80
MsML+*                      70.31

5.2.5 Comparison of Efficiency

In this section, we compare the training time of the proposed algorithm for high dimensional DML to that of LSVM and LMNN. MsML is implemented in Julia, which is a little slower than C,³ while LSVM uses the LIBLINEAR package, the state-of-the-art algorithm for solving linear SVM, implemented mostly in C. The core part of LMNN is also implemented in C. The time for feature extraction is not included here because it is shared by all the methods in comparison. The running time for MsML includes all operational costs (i.e., the cost for triplet constraint sampling, computing the random projection, and the low rank approximation).

³ A detailed comparison can be found at http://julialang.org

Table 5.5: Comparison of running time (seconds).

Methods    cats&dogs   102flowers   birds11   S-dogs
LSVM       196.2       309.8        1,417.0   1,724.8
LMNN       832.58      702.7        1,178.2   1,643.6
MsML       164.91      174.4        413.1     686.3
MsML+      337.20      383.7        791.3     1,229.7

Table 5.5 summarizes the results of the comparison. First, it takes MsML about 1/3 of the time of LMNN to complete the computation. This is because MsML employs a stochastic optimization method to find the optimal distance metric while LMNN is a batch learning method. Second, we observe that the proposed method is significantly more efficient than LSVM on most of the datasets. The high computational cost of LSVM mostly comes from two aspects. First, LSVM has to train one classification model for each class, and becomes significantly slower when the number of classes is large. Second, the fact that images from different classes are visually similar makes it computationally difficult to find the optimal linear classifier that can separate images of one class from images of the other classes. In contrast, the training time of MsML is independent of the number of classes, making it more appropriate for FGVC. Finally, the running time of MsML+ with 134,016 features only doubles that of MsML, which verifies that the proposed method is linear in the dimensionality (O(d)).

5.3 Conclusions

In this chapter, we develop a multi-stage metric learning framework for the high dimensional FGVC problem, which addresses the challenges arising in conventional DML. More specifically, it divides the original problem into multiple stages to handle the challenge arising from too many triplet constraints, extends the theory of dual random projection to address the computational challenge of high dimensional data, and uses a randomized algorithm for low rank matrix approximation to handle the storage challenge. The empirical study shows that the proposed method with general purpose features yields performance that is significantly better than the state-of-the-art approaches for FGVC. In the future, we could try to combine the proposed DML algorithm with segmentation and localization to further improve the performance of FGVC.

Figure 5.6: Examples of retrieved images. The first column shows query images highlighted by green bounding boxes. Columns 2-4 include the most similar images measured by Euclid. Columns 5-7 show those by the metric from LMNN. Columns 8-10 are from the metric of MsML. Images in columns 2-10 are highlighted by red bounding boxes when they share the same category as the queries, and by blue bounding boxes if they do not.
Chapter 6

Feature Selection for Face Recognition: A Distance Metric Learning Approach

In this chapter, we study the feature selection problem in face recognition. Face recognition involves identifying whether two face images belong to the same person (i.e., face verification) or which person a face image belongs to (i.e., face classification). It is an important application of computer vision and has been studied extensively in the past decades [26, 100, 3, 62, 25]. Many face recognition methods require estimating the similarity between images appropriately using the technique of distance metric learning [50, 60, 61].

One challenge in applying DML to estimate the similarity of face images is the high dimensionality of the face representations used to capture the subtle differences between faces [25, 3]. The high dimensionality comes from the fact that over-complete descriptors are sampled for face images, and the total number of features can be up to 1,000,000 [61]. Most existing DML methods project the high-dimensional face features into a low-dimensional space with unsupervised dimension-reduction techniques (e.g., PCA) and then learn a metric in the low-dimensional space [60, 50]. The resulting solution can be suboptimal since important information about the face images is lost after dimension reduction.

By further investigating the problem, we find that although the dimensionality of the feature vectors for face images is very high, the number of features used by each descriptor is significantly smaller and is only 45 in our study. It is thus efficient to learn a metric for the group of features from each descriptor. On the other hand, since the descriptors are over-complete, a subset of them may be sufficient to verify face images. Fig. 6.1 illustrates the feature selection (i.e., descriptor selection) for face verification. Each descriptor corresponds to a small patch on the face, and a subset of them (e.g., eyes, nose, etc.) can capture most of the differences between faces.

Figure 6.1: Illustration of feature selection for face verification. Although the descriptors are over-complete, a subset of them (e.g., eyes, nose, etc.) can capture most of the differences between faces.

In this chapter, we propose a DML method that can select a subset of descriptors and learn a distance metric for each selected descriptor simultaneously. Unlike previous methods that learn a single metric in the original feature space, we learn a metric for the feature group corresponding to the descriptor selected at every iteration, and the ensemble of metrics is used for verification (a small sketch of this ensemble distance is given at the end of this introduction). The method avoids dealing with high-dimensional features directly while keeping useful information from all descriptors.

The main contributions of this work are summarized as follows.

• We propose a DML method to handle the over-complete descriptors in face verification. At each iteration, a linear optimization problem is solved to simultaneously select a descriptor and learn the corresponding metric. This problem has a closed-form solution, which makes the computation efficient.

• We prove that the proposed method has an O(1/T) convergence rate, where T is the number of iterations, which equals the number of selected descriptors in this work.

• To reduce the high computational cost of the exhaustive search that selects one single descriptor at each iteration, we exploit a mini-batch strategy and incremental sampling, respectively. It is shown theoretically and empirically that the accelerated methods are able to yield similar performance with significantly reduced computational costs. The hybrid method combining both strategies achieves the best efficiency in our empirical study.

The rest of this chapter is organized as follows: Section 6.1 describes the proposed approach. Section 6.2 presents the results of the empirical study, and Section 6.3 concludes this work with future research directions.
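To make the grouping idea concrete, the short Python sketch below splits a d-dimensional face representation into descriptor groups of k = 45 features each and evaluates the ensemble distance as the sum of per-group Mahalanobis distances over the selected descriptors. The function names and the summation over selected groups are illustrative assumptions, not the exact interface of our implementation.

```python
import numpy as np

def split_into_groups(x, k=45):
    """Split a d-dimensional face representation into groups of k features,
    one group per local descriptor (d is assumed to be a multiple of k)."""
    return x.reshape(-1, k)                  # shape (S, k) with S = d / k

def ensemble_distance(x, y, metrics, k=45):
    """Distance between two faces under an ensemble of per-descriptor metrics.

    metrics : dict mapping a selected descriptor index s to its k x k PSD matrix M_s.
    Only the selected descriptors contribute to the distance.
    """
    xg, yg = split_into_groups(x, k), split_into_groups(y, k)
    dist = 0.0
    for s, M_s in metrics.items():
        diff = xg[s] - yg[s]
        dist += diff @ M_s @ diff            # (x^s - y^s)^T M_s (x^s - y^s)
    return dist
```

A pair of faces would then be declared to belong to the same person when this ensemble distance falls below a learned threshold.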
6.1 DML for feature selection

Given a set of face images X \in R^{d \times n}, DML aims to learn a metric that estimates pairwise distances as

    dist_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j)

In face verification, a good metric can separate the images from the same person and those from different persons as

    \forall x_i, x_j: dist_M(x_i, x_j) \le b - 1;    \forall x_i, x_k: dist_M(x_i, x_k) \ge b + 1

where x_i and x_j are from the same person while x_k is from a different person, and b is a pre-defined threshold.

The metric can be learned by solving the corresponding optimization problem

    \min_{M \in S_+^d} L(M) = \frac{1}{N} \sum_{i,j} \ell(Y_{ij}(dist_M(x_i, x_j) - b))        (6.1)

where S_+^d is the set of positive semi-definite (PSD) matrices of size d \times d, and \ell(\cdot) can be any smooth loss function; it is the logistic loss in our work:

    \ell(z) = \log(1 + \exp(z + 1))

Y_{ij} \in \{-1, +1\} is the indicator, where Y_{ij} = 1 if x_i and x_j are from the same person and Y_{ij} = -1 otherwise.

It is hard to handle the problem in (6.1) directly by conventional DML methods due to the high dimensionality of the feature space. On the other hand, the features are known to be redundant and the number of features from each descriptor is small, making it possible to efficiently learn a metric for every group of features. Therefore, we propose an iterative algorithm that is able to perform feature selection and metric learning simultaneously: it selects one informative descriptor at each iteration and learns a metric for the features of the selected descriptor.

At the t-th iteration, we select the descriptor as

    \arg\min_{s=1,\dots,S} \min_{M_t \in S_+^k, \|M_t\|_F \le 1} \langle \nabla L(Z_{t-1}), z_t^s \rangle

where S is the number of descriptors, k is the dimensionality of the features from a single descriptor, and kS = d. Here \nabla L(Z_{t-1}) = [\ell'(Z_{t-1}^{i,j})], z_t^s = [Y_{i,j}(dist_{M_t}(x_i^s, x_j^s) - b)], and x^s is the s-th feature group of an example. It is a linear optimization problem and is equivalent to

    \arg\min_{s=1,\dots,S} \min_{M_t \in S_+^k, \|M_t\|_F \le 1} \sum_{i,j} Y_{i,j} \ell'(Z_{t-1}^{i,j}) \langle A_{i,j}^s, M_t \rangle        (6.2)

where A_{i,j}^s = (x_i^s - x_j^s)(x_i^s - x_j^s)^\top. The intuition of Eqn. 6.2 is to learn the optimal metric for each feature group and then greedily select the feature group whose distance is closest to the negative direction of the gradient under the learned metric.

The metric is learned from the subproblem

    \min_{M_t \in S_+^k, \|M_t\|_F \le 1} \sum_{i,j} Y_{i,j} \ell'(Z_{t-1}^{i,j}) \langle A_{i,j}^s, M_t \rangle

According to the K.K.T. condition [12], the problem has the closed-form solution

    M_t = \Pi_{PSD, \|M\|_F = 1}\Big(-\sum_{i,j} Y_{i,j} \ell'(Z_{t-1}^{i,j}) A_{i,j}^s\Big)        (6.3)

where \Pi_{PSD, \|M\|_F = 1}(M) projects the matrix onto the PSD cone and rescales it to unit Frobenius norm. By substituting Eqn. 6.3 back into Eqn. 6.2, the selection criterion becomes

    \arg\min_{s=1,\dots,S} Score_s = -\Big\|\Pi_{PSD}\Big(-\sum_{i,j} Y_{i,j} \ell'(Z_{t-1}^{i,j}) A_{i,j}^s\Big)\Big\|_F        (6.4)

After selecting the descriptor with the minimal score, the distance is updated as

    Z_t^{i,j} = (1 - \eta_t) Z_{t-1}^{i,j} + \eta_t z_t^{i,j}

The step size \eta_t can be optimized by solving the problem

    \min_{\eta_t: 0 \le \eta_t < 1} \frac{1}{N} \sum_{i,j} \ell\big((1 - \eta_t) Z_{t-1}^{i,j} + \eta_t z_t^{i,j}\big)

with Newton's method, or set to the pre-defined value \eta_t = 2/(t+1). Alg. 7 summarizes the key steps of the proposed method.

Algorithm 7 Greedy Coordinate Descent Metric Learning for Face Verification (Greco)
1: Input: dataset X \in R^{d \times n}, #iterations T
2: for t = 1, \dots, T do
3:   Select a single descriptor according to Eqn. 6.4
4:   Compute the corresponding metric as in Eqn. 6.3
5:   Update the distance using the selected feature group and the learned metric as Z_t^{i,j} = (1 - \eta_t) Z_{t-1}^{i,j} + \eta_t z_t^{i,j}
6: end for
7: return Z_T

The greedy coordinate descent method in Alg. 7 is similar to the Frank-Wolfe method [41, 43] or the conditional gradient descent method [75], but it learns an ensemble of metrics rather than a single model as in previous work. We have the following theorem as a guarantee of the performance.

Theorem 6. Let Z_T be the solution of Alg. 7 after T iterations and Z^* be the optimal solution. Assume that the loss function \ell(\cdot) is convex and \beta-smooth and \forall i,j: \|z_i - z_j\|^2 \le D. By setting \eta_t = 2/(t+1), we have

    L(Z_T) - L(Z^*) \le \frac{2\beta D}{T+1}

The proof is similar to that of the Frank-Wolfe method.

Proof.

    L(Z_t) - L(Z^*) = L((1 - \eta_t) Z_{t-1} + \eta_t z_t) - L(Z^*)
                    \le L(Z_{t-1}) - L(Z^*) + \eta_t \langle \nabla L(Z_{t-1}), z_t - Z_{t-1} \rangle + \frac{\beta \eta_t^2}{2} \|Z_{t-1} - z_t\|^2        (6.5)
                    \le L(Z_{t-1}) - L(Z^*) + \eta_t \langle \nabla L(Z_{t-1}), Z^* - Z_{t-1} \rangle + \frac{\beta \eta_t^2}{2} D        (6.6)
                    \le (1 - \eta_t)(L(Z_{t-1}) - L(Z^*)) + \frac{\beta \eta_t^2}{2} D        (6.7)

Eqn. 6.5 holds because the loss function is \beta-smooth. Eqn. 6.6 follows from the fact that z_t is the optimal solution of Problem 6.2 and Z^* lies in the same convex hull as z_t. Eqn. 6.7 is due to the convexity of L.

By setting \eta_t = 2/(t+1), we prove the theorem by induction with the hypothesis

    L(Z_T) - L(Z^*) \le \frac{2\beta D}{T+1}

When T = 1, \eta_1 = 1 and it is obvious that

    L(Z_1) - L(Z^*) \le \frac{\beta D}{2}

For T = t, we have

    L(Z_t) - L(Z^*) \le \Big(1 - \frac{2}{t+1}\Big)\frac{2\beta D}{t} + \frac{2\beta D}{(t+1)^2} = \frac{2\beta D (t^2 + t - 1)}{t(t+1)^2} \le \frac{2\beta D}{t+1}
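The following Python sketch implements one iteration of Alg. 7 on a toy pair set, following Eqns. 6.3 and 6.4. It is a minimal illustration assuming dense NumPy arrays and an explicit loop over pairs, not the optimized implementation used in our experiments, and the helper names are hypothetical.

```python
import numpy as np

def psd_part(G):
    """Project a symmetric matrix onto the PSD cone (clip negative eigenvalues)."""
    w, V = np.linalg.eigh((G + G.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T

def greco_step(X_groups, pairs, Y, Z_prev, b, t):
    """One iteration of greedy descriptor selection and metric learning (sketch of Alg. 7).

    X_groups : list of S arrays, each of shape (n, k), one per descriptor group
    pairs    : list of (i, j) index pairs
    Y        : array of +1 / -1 labels for the pairs
    Z_prev   : array with the current estimates Z_{t-1}^{i,j}, one entry per pair
    """
    # derivative of the logistic loss l(z) = log(1 + exp(z + 1)) at the current estimates
    grad = 1.0 / (1.0 + np.exp(-(Z_prev + 1.0)))

    best_score, best_s, best_M = np.inf, None, None
    for s, Xs in enumerate(X_groups):
        # G^s = sum_{i,j} Y_ij * l'(Z_{t-1}^{i,j}) * (x_i^s - x_j^s)(x_i^s - x_j^s)^T
        G = np.zeros((Xs.shape[1], Xs.shape[1]))
        for (i, j), y, g in zip(pairs, Y, grad):
            d = Xs[i] - Xs[j]
            G += y * g * np.outer(d, d)
        P = psd_part(-G)
        norm = np.linalg.norm(P, 'fro')
        score = -norm                                 # Eqn. 6.4: the smaller, the better
        if score < best_score:
            best_score, best_s = score, s
            best_M = P / norm if norm > 0 else P      # Eqn. 6.3: unit Frobenius norm

    # update the distance estimates with the selected descriptor (step 5 of Alg. 7)
    Xs = X_groups[best_s]
    z_new = np.array([y * ((Xs[i] - Xs[j]) @ best_M @ (Xs[i] - Xs[j]) - b)
                      for (i, j), y in zip(pairs, Y)])
    eta = 2.0 / (t + 1)                               # pre-defined step size
    return (1.0 - eta) * Z_prev + eta * z_new, best_s, best_M
```

Running this step for T iterations and keeping the per-iteration pair (best_s, best_M) yields the ensemble of per-descriptor metrics used for verification.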
6.1.1 Adaptive Mini-batch Strategy

Although the proposed method can handle the high-dimensionality challenge in face verification, it has to exhaustively search through all descriptors to select only one descriptor at each iteration, which is computationally expensive when the number of descriptors is very large. We propose two variants of the DML-based feature selection method to improve the efficiency.

First, we investigate the mini-batch strategy, which selects a mini-batch of descriptors rather than a single descriptor at each iteration. After computing the score of each descriptor as in Eqn. 6.4, the top m descriptors are selected. Then, a larger metric (i.e., of size mk \times mk) is learned from all features of the selected descriptors as

    M_t = \Pi_{PSD, \|M_t\|_F = 1/m}\Big(-\sum_{i,j} Y_{i,j} \ell'(Z_{t-1}^{i,j}) A_{i,j}^{s_1,\dots,s_m}\Big)

Note that the norm of M_t is shrunk by 1/m to make sure that the norm of z_t is still bounded over multiple feature groups. This is because

    \|A_{i,j}^{s_1,\dots,s_m}\|_F \le m \big(\max_{i,j,s} \|A_{i,j}^s\|_F\big)

This shrinkage ratio can also be estimated empirically as \max_{i,j}\|A_{i,j}^{s_1,\dots,s_m}\|_F / \max_{i,j,s}\|A_{i,j}^s\|_F for a tighter approximation.

Given the normalized score of the mini-batch of descriptors, the update is performed adaptively by comparing it to the score of the optimal single descriptor. If the former score is better than the latter, the distance is updated by the mini-batch of feature groups. Otherwise, the single optimal descriptor is used. Alg. 8 shows the method with the mini-batch strategy.

Algorithm 8 Greedy Coordinate Descent Metric Learning with Adaptive Mini-batch (Greco-mini)
1: Input: dataset X \in R^{d \times n}, #iterations T, mini-batch size m
2: for t = 1, \dots, T do
3:   Select the top m descriptors according to Eqn. 6.4
4:   Compute a single metric from the selected feature groups
5:   if \langle \nabla L(Z_{t-1}), z_{M_{mini-batch}} \rangle \le \langle \nabla L(Z_{t-1}), z_{M_{single}} \rangle then
6:     M_t = M_{mini-batch}
7:   else
8:     M_t = M_{single}
9:   end if
10:  Update the distance using the selected feature groups and the learned metric as Z_t^{i,j} = (1 - \eta_t) Z_{t-1}^{i,j} + \eta_t z_t^{i,j}
11: end for
12: return Z_T

We can prove that the mini-batch strategy will not affect the convergence rate if its score is better than that of the single descriptor, which is stated in the following theorem.

Theorem 7. Let Z_T be the solution of Alg. 8 after T iterations and Z^* be the optimal solution. Assume that the loss function \ell(\cdot) is convex and \beta-smooth and \forall i,j: \|z_i - z_j\|^2 \le D. By setting \eta_t = 2/(t+1) and assuming that the score of the mini-batch of descriptors is better than the score of the optimal single descriptor by \epsilon \ge 0, we have

    L(Z_T) - L(Z^*) \le \frac{2\beta D - 2\epsilon}{T+1}

The proof is straightforward and we list it for completeness.

Proof. We consider the case where the solution of each iteration is slightly better than the single optimal solution by a factor \epsilon, where \epsilon \ge 0 and \langle \nabla L(Z_{t-1}), \tilde{z}_t \rangle \le \langle \nabla L(Z_{t-1}), z_t \rangle - \epsilon. Following a similar process as above, we have

    L(Z_t) - L(Z^*) = L((1 - \eta_t) Z_{t-1} + \eta_t \tilde{z}_t) - L(Z^*)
                    \le L(Z_{t-1}) - L(Z^*) + \eta_t \langle \nabla L(Z_{t-1}), z_t - Z_{t-1} \rangle - \eta_t \epsilon + \frac{\beta \eta_t^2}{2} \|Z_{t-1} - \tilde{z}_t\|^2
                    \le (1 - \eta_t)(L(Z_{t-1}) - L(Z^*)) - \eta_t \epsilon + \frac{\beta \eta_t^2}{2} D

We prove the result by induction. We set \eta_t = 2/(t+1) as before and take the induction hypothesis

    L(Z_T) - L(Z^*) \le \frac{2\beta D - 2\epsilon}{T+1}

When T = 1, it is

    L(Z_1) - L(Z^*) \le \frac{\beta D}{2} - \epsilon

When T = t, we have

    L(Z_t) - L(Z^*) \le \Big(1 - \frac{2}{t+1}\Big)\frac{2\beta D - 2\epsilon}{t} - \frac{2\epsilon}{t+1} + \frac{2\beta D}{(t+1)^2} \le \frac{2\beta D - 2\epsilon}{t+1}

Remark. There is a trade-off between the performance and the number of selected descriptors. Although the number of iterations for the method with the mini-batch strategy is smaller than that of the original method to achieve the same performance, the number of selected descriptors can be larger than that of the original one. However, if the solution of the mini-batch is good enough, i.e.,

    2\epsilon \ge 2\beta D \Big(1 - \frac{T+1}{Tm+1}\Big)

where m is the mini-batch size, we have

    L(Z_T) - L(Z^*) \le \frac{2\beta D}{Tm+1}

which means that even with the same number of descriptors, the method with the mini-batch strategy performs the same as the original method while the running time is reduced by a factor of m in this ideal case.

6.1.2 Incremental Sampling Strategy

In this section, we exploit a sampling strategy to reduce the number of candidate descriptors at each iteration. Different from uniform sampling, a subset of size (1 - \lambda_t^{t+1})S is sampled at the t-th iteration, where 0 \le \lambda_t