LARGE-SCALE HIGH DIMENSIONAL DISTANCE METRIC LEARNING AND ITS APPLICATION TO COMPUTER VISION

By

Qi Qian

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2015

ABSTRACT

LARGE-SCALE HIGH DIMENSIONAL DISTANCE METRIC LEARNING AND ITS APPLICATION TO COMPUTER VISION

By

Qi Qian

Learning an appropriate distance function (i.e., similarity) is one of the key tasks in machine learning, especially for distance based machine learning algorithms, e.g., the k-nearest neighbor classifier, k-means clustering, etc. Distance metric learning (DML), the subject to be studied in this dissertation, is designed to learn a metric that pulls the examples from the same class together and pushes the examples from different classes away from each other. Although many DML algorithms have been developed in the past decade, most of them can handle only small data sets with hundreds of features, significantly limiting their applicability to real world applications that often involve millions of training examples represented by hundreds of thousands of features. Three main challenges are encountered when learning the metric from such large-scale high dimensional data: (i) to make sure that the learned metric is a Positive Semi-Definite (PSD) matrix, a projection onto the PSD cone is required at every iteration, whose cost is cubic in the dimensionality, making it unsuitable for high dimensional data; (ii) the number of variables that needs to be optimized in DML is quadratic in the dimensionality, which results in a slow convergence rate in optimization and a high requirement of memory storage; (iii) the number of constraints used by DML is at least quadratic, if not cubic, in the number of examples, depending on whether pairwise constraints or triplet constraints are used in DML. Besides, features can be redundant due to high dimensional representations (e.g., face features), and DML with feature selection is preferred for these applications.

The main contribution of this dissertation is to address these challenges both theoretically and empirically. First, for the challenge arising from the PSD projection, we exploit the mini-batch strategy and adaptive sampling with a smooth loss function to significantly reduce the number of updates (i.e., projections) while keeping a similar performance. Second, for the challenge arising from high dimensionality, we propose a dual random projection approach, which enjoys the light computation due to the usage of random projection and, at the same time, significantly improves the effectiveness of random projection. Third, for the challenge with large-scale constraints, we develop a novel multi-stage metric learning framework. It divides the original optimization problem into multiple stages. It reduces the computation by adaptively sampling a small subset of constraints at each stage. Finally, to handle redundant features with a group property, we develop a greedy algorithm that selects a feature group and learns the corresponding metric simultaneously at each iteration, leading to further improvement of learning efficiency when combined with an adaptive mini-batch strategy and incremental sampling. Besides the theoretical and empirical investigation of DML on benchmark datasets of machine learning, we also apply the proposed methods to several important computer vision applications (i.e., fine-grained visual categorization (FGVC) and face recognition).
ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor Dr. Rong Jin. I met him for the first time when I was a master student at Nanjing University. The talk with him opened a window with a different but very insightful and beautiful scenery for me. When I began my PhD study at MSU, he devoted a lot of his time to training me to think in a mathematical way to solve learning problems, where I found my passion and happiness in research. His strict but friendly guidance is exhaustive and tireless, which makes me more and more confident about my career. His relentless passion for academic research also influences me much beyond the study. I'm very fortunate to have Dr. Jin as my advisor.

I also would like to thank the other committee members: Dr. Pang-Ning Tan, Dr. Sara Selin Aviyente and Dr. Xiaoming Liu. I would like to thank my colleagues at MSU. They are: Fengjie Li, Mehrdad Mahdavi, Tianbao Yang, Lijun Zhang, Jinfeng Yi, Zheyun Feng, Beibei Liu, Lanbo She and Qiaozi Gao. They not only discussed my research with me, but also helped me to make my life in East Lansing better. I also would like to thank our professional and patient staff at the department of computer science: Norma Teague, Linda Moore and Katherine Trinklein. I wish Norma and Linda enjoy their retired life.

I would like to thank my colleagues at NEC Laboratories America: Shenghuo Zhu, Yuanqing Lin and Xiaoyu Wang. I had two summer internships there and my mentor Shenghuo Zhu inspired me a lot on both research and engineering. I would like to thank the people in the Alibaba Seattle office and silicon valley office, where I took my recent summer internship. I would like to thank Luo Si, Jun Wang, Jun Tao, Zhi Pan Li and Xu Xie.

I want to thank Dr. Zhi-Hua Zhou, who is my master advisor at Nanjing University. Without his help, I would not have touched machine learning and begun such a wonderful research trip. His devoted attitude towards both work and life has benefited me a lot since then.

Finally, I would like to thank my dear family. Without support from them, I could not finish my study so smoothly. I thank my wife Juhua Hu, who helps me much in both research and life. Every publication of mine has her contribution. The sweetest acknowledgement is to my son, Michael Jiabo Qian, who brings me a totally different role and life.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
Chapter 1 Introduction
1.1 Distance Metric
1.2 Supervision in Distance Metric Learning
1.3 Computational Challenges in Distance Metric Learning
1.4 Applications of Distance Metric Learning to Computer Vision
1.5 Contributions
Chapter 2 Literature Survey
2.1 PSD Constraint in Distance Metric Learning
2.2 Linear Distance Metric Learning
2.2.1 Distance Metric Learning with Pairwise Constraints
2.2.2 Distance Metric Learning with Triplet Constraints
2.3 Nonlinear Distance Metric Learning
2.4 Improving Efficiency of Distance Metric Learning
2.5 Generalization Performance of Distance Metric Learning
Chapter 3 Large-scale Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD)
3.1 Improved SGD for DML by Mini-batch and Adaptive Sampling
3.1.1 Mini-batch SGD for DML (Mini-SGD)
3.1.2 Adaptive Sampling based SGD for DML (AS-SGD)
3.1.3 Hybrid Approaches: Combining Mini-batch with Adaptive Sampling for DML
3.2 Experiments
3.2.1 Parameter Setting
3.2.2 Experiment (I): Effectiveness of the Proposed SGD Algorithms for DML
3.2.3 Experiment (II): Efficiency of the Proposed SGD Algorithms for DML
3.2.4 Experiment (III): Comparison with State-of-the-art Online DML Methods
3.3 Conclusions
Chapter 4 Distance Metric Learning for High Dimensional Data
4.1 Dual Random Projection for Distance Metric Learning
4.1.1 Dual Random Projection for Distance Metric Learning
4.1.2 Main Theoretical Results
4.2 Experiments
4.2.1 Experimental Setting
4.2.2 Efficiency of the Proposed Method
4.2.3 Evaluation by Ranking
4.2.4 Evaluation by Classification
4.3 Conclusions
Chapter 5 Fine-Grained Visual Categorization via Multi-stage Metric Learning
5.1 Multi-stage Metric Learning
5.1.1 Constraints Challenge: Multi-stage Division
5.1.2 Computational Challenge: Dual Random Projection
5.1.3 Storage Challenge: Low Rank Approximation
5.2 Experiments
5.2.1 Oxford Cats & Dogs
5.2.2 Oxford 102 Flowers
5.2.3 Birds-2011
5.2.4 Stanford Dogs
5.2.5 Comparison of Efficiency
5.3 Conclusions
Chapter 6 Feature Selection for Face Recognition: A Distance Metric Learning Approach
6.1 DML for feature selection
6.1.1 Adaptive Mini-batch Strategy
6.1.2 Incremental Sampling Strategy
6.2 Experiments
6.2.1 Experiment I: Face Verification
6.2.2 Experiment II: Face Classification
6.3 Conclusion
Chapter 7 Conclusions & Future Plan
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Statistics for the ten datasets used in our empirical study.

Table 3.2: Classification error (%) of k-NN (k=3) using the distance metrics learned by the proposed SGD methods for DML. Standard deviation computed from five trials is included in the parenthesis.

Table 3.3: Classification error (%) of k-NN (k=3) using the distance metrics learned by the baseline SGD method, online learning algorithms and a batch learning approach for DML. Standard deviation computed from five trials is included in the parenthesis.

Table 3.4: The comparison of running time (seconds) for OASIS and the hybrid methods. Average results over five trials are reported.

Table 4.1: Examples of applications with high dimensional features.

Table 4.2: Statistics for the datasets used in our empirical study. #C is the number of classes. #F is the number of original features. #Train and #Test represent the number of training data and test data, respectively.

Table 4.3: CPU time (minutes) for different methods for DML. All algorithms are implemented in Matlab except for LMNN, whose core part is implemented in C and is more efficient than our Matlab implementation.

Table 4.4: Comparison of ranking results measured by mAP (%) for different metric learning algorithms.

Table 5.1: Comparison of mean accuracy (%) on the cats & dogs dataset. "#" means that more information (e.g., ground truth segmentation) is used by the method.

Table 5.2: Comparison of mean accuracy (%) on the 102flowers dataset. "#" means that more information (e.g., ground truth segmentation) is used by the method.

Table 5.3: Comparison of mean accuracy (%) on the birds11 dataset. "*" denotes the method that mirrors training images.

Table 5.4: Comparison of mean accuracy (%) on the S-dogs dataset. "*" denotes the method that mirrors training images.

Table 5.5: Comparison of running time (seconds).

Table 6.1: Comparison of running time (seconds). The reported running time of Greco-mini and Greco-hybrid is that at which they achieve the same training performance as Greco with 100 iterations.

LIST OF FIGURES

Figure 1.1: Illustration of the problem of Euclidean distance. The left image is closer than the right one to the target image under Euclidean distance while it is actually from a different class "Common dandelion".

Figure 2.1: The figure is from the work [114] and illustrates the learning procedure of LMNN. The learned metric pushes away the examples that are from different classes but in the nearest neighbors with a large margin.

Figure 3.1: The training and testing errors over epochs for dataset dna.
Figure 3.2: The comparison of running time (seconds) for various SGD methods. Note that LMNN, a batch DML algorithm, is mainly implemented in C, which is computationally more efficient than our Matlab implementation. All the other methods are implemented in Matlab.

Figure 3.3: The comparison of the number of updates for various SGD methods. Note that since POLA and LEGO optimize pairwise constraints, we decompose each triplet constraint into two pairwise constraints for these two methods. As a result, the number of constraints is doubled for these two methods.

Figure 4.1: The eigenvalue distribution of datasets used in our empirical study.

Figure 4.2: The comparison of different stochastic algorithms for ranking.

Figure 4.3: The comparison of different stochastic algorithms for classification.

Figure 5.1: Illustration of how DML learns the embedding that pulls together the data points from the same class and pushes apart the data points from different classes. Blue points are from the class "English marigold" while red ones are "Barberton daisy". An important note here is that our DML does not require collapsing data points from each class and this allows the flexibility to model intra-class variance. A big challenge now is how to deal with the high-dimensional feature representation which is typical for image-level visual features. To this end, we propose a multi-stage scheme for metric learning.

Figure 5.2: The framework of the proposed method.

Figure 5.3: Examples of retrieved images. The first column are query images highlighted by green bounding boxes. Columns 2-4 include the most similar images measured by Euclid. Columns 5-7 show those by the metric from LMNN. Columns 8-10 are from the metric of MsML. Images in columns 2-10 are highlighted by red bounding boxes when they share the same category as queries, and blue bounding boxes if they are not.

Figure 5.4: Convergence curve of the proposed method on 102flowers.

Figure 5.5: Comparison with different sizes of classes on birds11.

Figure 5.6: Examples of retrieved images. The first column are query images highlighted by green bounding boxes. Columns 2-4 include the most similar images measured by Euclid. Columns 5-7 show those by the metric from LMNN. Columns 8-10 are from the metric of MsML. Images in columns 2-10 are highlighted by red bounding boxes when they share the same category as queries, and blue bounding boxes if they are not.

Figure 6.1: Illustration of feature selection for face verification. Although the descriptors are over-complete, a subset of descriptors (e.g., eyes, nose, etc.) can capture most of the differences.

Figure 6.2: Comparison of training error on face verification. The red star denotes the position where Greco-mini achieves the same performance as Greco with 100 iterations. The red cross denotes the position where Greco-hybrid achieves the same performance as Greco with 100 iterations.

Figure 6.3: Comparison of test error on face verification.

Figure 6.4: Comparison of training error on face classification. The red star denotes the position where Greco-mini achieves the same performance as Greco with 100 iterations. The red cross denotes the position where Greco-hybrid achieves the same performance as Greco with 100 iterations.

Figure 6.5: Comparison of test error on face classification.

Figure 6.6: Comparison of training error on face classification with the baseline method developed by NEC. The red cross denotes the position where Greco-hybrid-5 achieves the same performance as NEC's baseline method with 100 iterations. The red star denotes the position where Greco-hybrid-10 achieves the same performance as NEC's baseline method with 100 iterations.

Figure 6.7: Comparison of test error on face classification with the baseline method developed by NEC.

LIST OF ALGORITHMS

Algorithm 1: Mini-batch Stochastic Gradient Descent (Mini-SGD) for DML
Algorithm 2: Adaptive Sampling Stochastic Gradient Descent (AS-SGD) for DML
Algorithm 3: A Framework of Hybrid Stochastic Gradient Descent (Hybrid-SGD) for DML
Algorithm 4: Dual Random Projection Method (DuRP) for DML
Algorithm 5: An Efficient Algorithm for Recovering M and Projecting It onto the PSD Cone
Algorithm 6: The Multi-stage Metric Learning Framework for High Dimensional DML (MsML)
Algorithm 7: Greedy Coordinate Descent Metric Learning for Face Verification (Greco)
Algorithm 8: Greedy Coordinate Descent Metric Learning with Adaptive Mini-batch (Greco-mini)
Algorithm 9: Greedy Coordinate Descent Metric Learning with Incremental Sampling (Greco-isamp)
Algorithm 10: Greedy Coordinate Descent Metric Learning with Hybrid Strategies (Greco-hybrid)

Chapter 1
Introduction

Machine learning, as a subfield of artificial intelligence, involves automatically improving from experience and has been successfully applied to many real world applications [83]. Distance functions are essential to many machine learning tasks. For example, the k-nearest neighbor (k-NN) classifier [1] assigns to a test example the most frequent label among its k nearest neighbors from the training set. The k-means clustering algorithm [53] outputs clusters according to the distance from each instance to the k centers. However, Euclidean distance with handcrafted features may not be best suited to capture the difference between different classes or clusters. Fig. 1.1 provides an example where the k-NN classifier with Euclidean distance is used to identify the flower type "English marigold". Using LLC features [113], we found that the example from a different class (i.e., "Common dandelion") is closer to the target example than the one from the same class. Therefore, learning an appropriate distance function becomes the key for distance based methods.

In this dissertation, we will study distance metric learning (DML), which aims to learn a metric that pulls the examples from the same class together and pushes the examples from different classes away from each other according to the supervised information. We start the discussion by introducing the form of the distance metric and the supervision in DML.

Figure 1.1: Illustration of the problem of Euclidean distance. The left image is closer than the right one to the target image under Euclidean distance while it is actually from a different class "Common dandelion".

1.1 Distance Metric

A distance function measures the distance between any two examples, which could be denoted as dist(x_i, x_j) for d-dimensional examples x_i and x_j. Most existing DML methods adopt the form of the "Mahalanobis distance" [79]:

dist_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)

where M is called a distance metric with the size of d x d. It is easy to verify that this distance function is equivalent to the Euclidean distance when M is an identity matrix. The target of DML is to learn a better distance function (i.e., distance metric) than the Euclidean distance to evaluate the distances between pairs of examples correctly. As a distance function, it should satisfy certain properties, such as symmetry, non-negativity and the triangle inequality. Consequently, the learned metric is required to be a positive semi-definite (PSD) matrix, which is often referred to as the PSD constraint. We will elaborate strategies on how to handle it in Section 2.1.
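As a quick illustration (a minimal numpy sketch for exposition only; the function names here are hypothetical and the dissertation's own implementations are in Matlab), the Mahalanobis distance above can be computed as follows, and it reduces to the squared Euclidean distance when M is the identity matrix:

import numpy as np

def mahalanobis_sq(xi, xj, M):
    # dist_M(xi, xj) = (xi - xj)^T M (xi - xj)
    diff = xi - xj
    return float(diff @ M @ diff)

d = 5
xi, xj = np.random.rand(d), np.random.rand(d)
# With M = I the learned distance reduces to the squared Euclidean distance.
assert np.isclose(mahalanobis_sq(xi, xj, np.eye(d)), np.sum((xi - xj) ** 2))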
1.2 Supervision in Distance Metric Learning

To learn a good distance metric, certain supervised information from the data is needed. Different from many supervised [106] or semi-supervised [123] machine learning approaches, the supervision for distance metric learning appears as pairwise or triplet constraints rather than labels for the training examples. Each pairwise constraint consists of two examples. The metric is learned to guarantee that the distance of pairs from the same class is smaller than a pre-defined threshold while that of pairs from different classes is large [115]. In contrast, three examples are included in a triplet constraint [114]. A good metric should make sure that the examples from the same class are separated from the examples of different classes with a large margin. It is obvious that a single triplet constraint could be divided into three pairwise constraints. Recently, some researchers proposed quadruplet constraints, which contain four examples in a constraint and could be considered as a variant of triplet constraints [76].

These constraints could be derived from labels or provided by the applications directly. For example, pairwise constraints consist of pairs of examples with the same label and pairs of those from different classes. Some applications can also provide the pairwise constraints directly. In face recognition [63], each training example contains two face images and the label is given as an indicator of whether these two images belong to the same person or not. Given these constraints, various DML methods have been developed, and representative ones with pairwise or triplet constraints are described in Section 2.2.1 and Section 2.2.2, respectively.

Note that although dimension reduction methods, e.g., principal component analysis (PCA) [95] and linear discriminant analysis (LDA) [54], or unsupervised methods, e.g., locally linear embedding (LLE) [94] and isometric feature mapping (ISOMAP) [101], are sometimes also categorized as distance metric learning, we will focus on the discussion of general purpose DML methods with pairwise or triplet supervision in this study.

1.3 Computational Challenges in Distance Metric Learning

Given this supervised information, the objective of DML is to optimize over pairwise or triplet constraints while keeping the learned metric in the PSD cone. Many DML methods have been developed under this framework; however, few of them try to address the fundamental issues in DML. The computational challenges are mainly from three aspects.

• To make sure the learned metric is a PSD matrix, most DML methods have to project the intermediate solution onto the PSD cone at every iteration. The projection step requires the eigen-decomposition operator, which costs O(d^3) (at least O(d^2)), where d is the dimensionality of the data.

• The number of variables that need to be optimized increases from O(d) in a linear model to O(d^2) in DML. It results in a slower convergence rate in solving the related optimization problem [89]. In addition, d^2 variables become too huge to be stored in memory when the dimensionality of the data is sufficiently large.

• A large number of constraints is needed to avoid overfitting when learning a metric. The number of constraints could be up to O(n^3) (i.e., triplet constraints), where n is the number of examples. It means that the total cost of DML could be up to O(d^3 n^3).

Besides, to capture the details of images in computer vision (e.g., face recognition), over-complete descriptors are sampled from each image, which results in high dimensional representations and redundant features. The occurrence of all these problems makes large-scale high dimensional DML an extremely difficult task, and we will study it extensively in this work.

1.4 Applications of Distance Metric Learning to Computer Vision

Distance metric learning is an important subject in machine learning, and has found applications in many domains, including information retrieval [56], supervised classification [114], clustering [115, 23] and domain adaptation [93]. Besides the machine learning community, other research groups, e.g., computer vision and data mining [110, 111], also realize the importance of DML, and have applied it to many real world applications.
In computer vision, DML was first found to be very helpful for image classification [49, 82, 108, 38], visual object recognition [36, 74] and image retrieval [57, 24, 112]. Image classification and object recognition tasks are often solved as multi-class classification problems. They first obtain a metric according to the pairwise or triplet information and then apply a k-NN classifier for classification. Since images are very sensitive to different poses, light conditions and angles, DML can learn an invariant space to obtain the semantic difference between different objects from low level features. In a standard image retrieval scenario, the training set consists of pairwise or triplet constraints and a distance metric is learned to rank these pairs appropriately. When a query image arrives, the system should output the images that are most similar to the query one, which is equivalent to obtaining the images closest to the query under the learned metric.

In addition, DML is also adopted for the face verification problem [59, 50, 29, 17, 71]. Face verification is very practical and has already been used in many real businesses, e.g., terrorist detection, border control and access control systems [64]. After obtaining a metric via optimizing lots of pairs of images from the same or different persons, an access control system could identify a person automatically and keys are not necessary for the house. For example, when a stranger comes, the camera in the door can take a photo of him and then compare it to the photos of the people who live in the house. If the pairwise comparison under the learned metric returns true, the door will open; otherwise it will stay closed. Compared with traditional access badges, the face is naturally unique for each person and is more difficult or even impossible to copy, which makes the house much safer. Note that, different from the conventional multi-class classification task, there could be billions of people in the dataset and each person has a small number of images (e.g., one per person), which makes most existing classifiers, especially one-vs-all methods, fail for this classification problem with a huge number of classes. Besides these applications, DML is also adopted for object tracking [27, 104], video event detection [4, 87], etc. in computer vision.

In this work, we will introduce DML to fine-grained visual categorization (FGVC) [7]. In contrast to classifying an image into a basic class, FGVC requires categorizing the image into a subordinate class, where classes only have subtle differences and the number of classes could be very large. Furthermore, the images in the same class could be very different due to the different poses, examples, etc., which means the intra-class variance is large. The properties of DML, which is independent of the number of classes compared to the one-vs-all strategy and flexible to handle large intra-class variance, make it appropriate for FGVC. To the best of our knowledge, this is the first work to apply DML to the challenging FGVC task.

1.5 Contributions

In this study, we develop several randomized algorithms to alleviate the challenges described in Section 1.3. First, to reduce the number of PSD projections, we combine the mini-batch stochastic gradient descent method with an adaptive sampling strategy. Second, we develop a dual random projection method to handle the large number of variables. Finally, we propose a multi-stage metric learning framework to divide the learning procedure into a series of subproblems, where each one only has an epoch (O(n)) of active constraints, and the total computational cost of the resulting new framework of DML is linear in the dimensionality and the number of examples (O(dn)). Besides these, we investigate the feature selection problem and develop a greedy method to select features and learn metrics simultaneously. The detailed contributions of the dissertation are described as follows.
• We first exploit the combination of the mini-batch strategy with a smooth loss function for DML. Then, we propose an adaptive sampling approach for efficient DML. We verify, both theoretically and empirically, the efficiency and effectiveness of the mini-batch strategy with smooth loss and the adaptive sampling approach for DML, respectively. Finally, we present two hybrid approaches that exploit the combination of the mini-batch strategy with adaptive sampling for DML. To the best of our knowledge, it is the first work that reduces the number of updates in DML while keeping a similar performance.

• We propose a dual random projection approach for high dimensional DML. Our approach, on one hand, enjoys the light computation of random projection, and on the other hand, significantly improves the effectiveness of random projection. We verify the effectiveness of the proposed algorithms both empirically and theoretically.

• We develop a novel multi-stage metric learning framework for high dimensional DML to address high dimensional DML with a large number of constraints. We divide the original optimization problem into multiple stages. At each stage, only a small subset of constraints that are difficult to be classified by the currently learned metric will be adaptively sampled and used to improve the learned metric. Then we extend the dual random projection strategy to solve each subproblem. The empirical study with standard FGVC benchmark datasets verifies that our framework is both effective and efficient compared to the state-of-the-art FGVC approaches.

• We propose a greedy algorithm that can select descriptors (i.e., features) and learn the corresponding metrics simultaneously for face recognition with a guaranteed convergence rate. To alleviate the high computational cost of the exhaustive search that finds only one feature group at each iteration, we investigate the adaptive mini-batch strategy and incremental sampling respectively, and the hybrid algorithm that combines these two strategies is also studied. The empirical study on face recognition confirms the effectiveness and efficiency of the proposed methods.

Chapter 2
Literature Survey

In this chapter, we will introduce the existing DML methods briefly. Section 2.1 describes the general strategies to address the PSD constraint that all DML methods have to deal with. After that, Sections 2.2-2.4 present the popular DML methods. Finally, Section 2.5 summarizes the theoretical analysis for DML.

2.1 PSD Constraint in Distance Metric Learning

Before introducing the specific DML methods, we first demonstrate the strategies to handle the PSD constraint, which keeps the learned metric in the PSD cone, in DML. Both early DML methods that require the gradient from all constraints at each iteration [115, 45], and efficient approaches [66, 98] that deal with only one constraint at each iteration by exploiting the techniques of online learning or stochastic optimization, share one common strategy: in order to ensure that the learned distance metric is PSD, these approaches require, at each iteration, projecting the learned distance metric M~ onto the PSD cone by solving the problem

min_{M is PSD} ||M - M~||_F^2

The solution keeps the positive eigenvalues with the corresponding eigenvectors [47] (i.e., M = Σ_{λ_i > 0} λ_i v_i v_i^T, where λ_i and v_i are the i-th eigenvalue and the corresponding eigenvector of M~). Therefore, it has to perform the eigen-decomposition of a given matrix, which is computationally expensive (i.e., at least O(d^2)).
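For concreteness, a minimal numpy sketch of this projection step (hypothetical helper names only, not the dissertation's Matlab code) makes the cost explicit: the symmetric eigen-decomposition dominates and scales cubically with the dimensionality d.

import numpy as np

def project_psd(M_tilde):
    # Keep only the non-negative eigenvalues: P(M~) = sum over lambda_i > 0 of lambda_i * v_i v_i^T
    eigvals, eigvecs = np.linalg.eigh(M_tilde)   # eigen-decomposition of a symmetric matrix, O(d^3)
    eigvals = np.maximum(eigvals, 0.0)
    return (eigvecs * eigvals) @ eigvecs.T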
Several studies have been proposed to avoid projections in SGD. In [55], the authors developed a projection free SGD algorithm that replaces the projection step with a constrained linear programming problem. In [80], the authors proposed an SGD algorithm with only one projection that is performed at the end of the iterations. Unfortunately, the improvement of the two algorithms in computational efficiency is limited, because they have to compute, at each iteration, the minimum eigenvalue and the corresponding eigenvector of the updated distance metric, an operation with O(d^2) cost, where d is the dimensionality of the data.

It is noteworthy to mention that in a recent study [24], the authors show empirically that it is possible to learn a good distance metric using online learning without having to perform the projection at each iteration. In fact, only one projection onto the PSD cone is performed at the end of online learning to ensure that the resulting matrix is PSD. It is different from the algorithms presented in [55, 80] and no additional mechanisms are needed to prevent the intermediate solutions from being too far away from the PSD cone. We refer to this strategy as the one-projection paradigm in the rest of this dissertation.

2.2 Linear Distance Metric Learning

Most conventional DML approaches are equivalent to seeking a linear transformation of the input space. In this section, we will describe representative approaches in detail according to the type of constraints used. More examples could be found in two survey papers [117, 73].

2.2.1 Distance Metric Learning with Pairwise Constraints

As mentioned in Section 1.2, DML usually involves two kinds of constraints: pairwise or triplet. Most early works for distance metric learning focus on optimizing pairwise constraints. Xing, et al. [115] proposed one of the earliest methods for the typical DML problem as studied in this dissertation. The objective is maximizing the distance between the pairs of examples from different classes while keeping the distance between the pairs of examples with the same label less than a pre-defined threshold. Although the problem is solved via a semi-definite programming (SDP) method [12] that is only runnable on small datasets, they show the advantage of DML for distance based methods and stimulate more research on this topic. The POLA method [96] was then proposed to alleviate the computational cost of DML. It is an online learning method, which only receives a single pairwise constraint at each iteration. If the current metric makes a mistake, it will be updated accordingly; otherwise the metric will remain the same. The cost of each iteration is much lower since only the gradient from a single constraint is computed. In addition, the PSD projection is only needed when a pair of examples from the same class is misclassified, and the cost could be as low as O(d^2) since the update is a rank one matrix. MCML [45] is the convex relaxed version of NCA [46]. It represents the extreme case of DML where all data points in the same class should be mapped into a single location by the learned metric. For this purpose, it minimizes the KL divergence between the distribution returned by the learned metric and that from the ideal case. It also proposed to keep a low-rank copy after obtaining the metric to achieve dimension reduction. Until now, all of these DML methods may require a PSD projection at every iteration, which could be O(d^3). ITML [32] applied the LogDet regularizer to avoid PSD projections. LogDet divergence is a Bregman matrix divergence [13] between two matrices. By introducing this regularizer, the metric is updated in the inverse form with a rank one matrix at each iteration, which could be finished without computing the inverse in practice via the Sherman-Morrison-Woodbury formula. Meanwhile, the updating rule makes sure that the intermediate learned metric is still in the PSD cone. Although there is no PSD projection for ITML, the additional cost is included and the cost for each iteration is still no less than O(d^2).

2.2.2 Distance Metric Learning with Triplet Constraints

DML methods with pairwise constraints aim to learn a metric that collapses all examples in each class to a small cluster whose radius is the pre-defined threshold. However, it is difficult when the intra-class variance is large, which is often the case in real world applications.
Thus, triplet constraints are proposed to handle this problem more flexibly [114]. Each triplet constraint contains three examples: two examples with the same label, where one is among the k-nearest neighbors of the target example, and one example from a different class that is also among the nearest neighbors of the target example. LMNN [114] then learns the metric to push away the example of the different class with a large margin. Fig. 2.1 illustrates the effectiveness of LMNN. Compared with previous DML approaches, it only requires that the examples in the nearest neighbors could be collapsed to a small cluster, which preserves the large intra-class variance and is more appropriate for combining with a k-NN classifier. Since it borrows the concept of large margin, it also could be analyzed from the view of the support vector machine (SVM) [107], where LMNN is similar to learning a series of local SVM-like models [33].

Figure 2.1: The figure is from the work [114] and illustrates the learning procedure of LMNN. The learned metric pushes away the examples that are from different classes but in the nearest neighbors with a large margin.

After the success of LMNN, Chechik, et al. [24] applied the triplet constraints in an online learning fashion for efficiency and achieved a good performance for ranking. Surprisingly, they found empirically that one PSD projection at the end of the algorithm will not affect the performance, which implies the potential possibility of getting rid of the expensive PSD projection. Since the gradient descent method requires projecting the intermediate solution onto the PSD cone, the authors of [98] adopted a mini-batch stochastic gradient descent method to balance the cost of computing the gradient and that of the PSD projection. Furthermore, it also reduces the variance of the stochastic method, which only randomly samples a single triplet at each iteration.

2.3 Nonlinear Distance Metric Learning

Sometimes, a linear transformation could not exploit complicated data distributions sufficiently, e.g., face images usually lie on a nonlinear manifold [59]. To handle this nonlinear feature space, the kernel trick, which has been well studied in linear models [15], is applied to extend the existing DML methods [45, 32, 114, 102]. Its basic idea is mapping the input examples to a much higher or even infinite dimensional space and then obtaining the linear transformation there. Although it sometimes performs better, the size of the learned metric (i.e., dual variables) is O(n^2) for DML, which is more expensive when n >> d. The alternative way to provide nonlinear learning capacity is deep learning (also called deep neural networks), which is very popular these years due to its encouraging performance [72]. Unlike the kernel trick, deep learning deals with the input features directly and learns the nonlinear mapping between each layer according to nonlinear activation functions, e.g., tanh, sigmoid, etc. The original structure of deep learning is designed for classification and the pipeline is optimized with a single image as input. By replacing the loss function with a pairwise [59] or triplet distance [112] and concatenating the pipelines in parallel, it is convenient to adapt to DML tasks.

Besides these extensions for single metric learning, there are some studies on learning multiple metrics simultaneously. The key part is defining the metric for appropriate components. Some works apply multi-metric learning on the data. That is, they learn a metric for each class or local cluster, and sometimes even for each single example [42, 84]. For example, MM-LMNN [114] learns a metric per class and the distance between two examples is defined on the metric corresponding to the label of the examples. On the other hand, the multiple metrics also could be learned according to different feature blocks [61, 29].
For example, each face image could be divided into different parts, e.g., nose, eyes, ears, etc., and then PMML [29] learns a single metric on each feature block, which represents a different part. Although multiple metrics may improve the performance over a single metric, it intensifies the already expensive optimization problem by increasing the number of metrics. Moreover, some good properties of a single metric are sacrificed, e.g., global transformation, interactions between different components, etc. Since all of these extensions are based on conventional DML, we will focus on learning a single linear distance metric in this dissertation.

2.4 Improving Efficiency of Distance Metric Learning

Although many DML methods have been developed, the cost of the update at each iteration is still O(d^2) due to the number of variables. Recently, some works aim to alleviate the high computational cost via sparsity. Different from linear classifiers, the sparsity of a metric could come from two aspects: low rank and item sparsity.

Since each metric is a PSD matrix, it could be decomposed as M = LL^T, where M is a d x d matrix, L is a d x r matrix and r is the rank of M. When r << d, the total number of variables for optimization decreases from d^2 to rd, which is only O(d). Some existing methods apply the low rank strategy to accelerate the learning process [114, 31]. The drawback is that the corresponding optimization problem is not convex anymore; hence, it is not guaranteed to obtain a globally optimal metric. In addition, these methods require fixing r before applying the DML methods, which easily leads to a suboptimal solution when the true rank is larger than the pre-defined rank. Some studies apply trace norm or fantope regularization [77] to obtain a low rank metric, while the cost of the training stage is still at least O(d^2).

For item sparsity, it could be further categorized into two parts: sparsity on the single item and sparsity on the columns. The former assumes that only s items have nonzero values in each column/row, where s << d, and corresponds to an L1 regularizer for each column/row [89]. In the latter scenario, there is only a limited number of columns with nonzero items, which is equivalent to penalizing the whole matrix with an L2,1 regularizer [78]. These methods could output a sparse metric which is efficient for the test stage; however, the learning process still involves O(d^2) variables.

2.5 Generalization Performance of Distance Metric Learning

Besides developing efficient DML methods for applications, some studies begin to analyze the generalization performance of DML methods theoretically [68, 16, 6]. Given a specific DML method and a training set that is i.i.d. sampled from an unknown distribution, we are interested in the learned model's prediction error on an arbitrary pairwise constraint from the true distribution, which is known as the generalization error. However, the only evaluation that we could have is its prediction error on the training set, which is usually referred to as the empirical error. According to the theory of stability, Jin, et al. [68] proposed the first work that uncovered the relationship between them in DML and found that the empirical error converges to the generalization error at the rate of O(1/sqrt(n)) with high probability. After that, some other studies show similar results via different techniques [16, 6]. Cao, et al. further researched the influence of different regularizers, such as the Frobenius norm, L1 norm, L2,1 norm and trace norm.

These analyses are based on the classification accuracy for pairwise constraints, while the more interesting question is the performance of the k-NN classifier with the learned metric. Guo, et al. [51] proposed the theory that bridges this gap and shows that the generalization error of the linear SVM with the similarity matrix from the learned metric is upper bounded by the generalization error of the corresponding metric learning, which means that the generalization performance of the linear classifier is guaranteed by that of metric learning.
Chapter 3
Large-scale Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD)

In this chapter(1), we will address the challenge from PSD projections (O(d^3)). Unlike previous efforts for handling the PSD constraint as mentioned in Section 2.1, we will focus on reducing the number of PSD projections rather than optimizing each projection in SGD, which is the popular strategy for large-scale data. As a result, the key challenge in developing efficient SGD algorithms for DML is how to reduce the number of projections without affecting the performance of DML.

A common approach for reducing the number of updates and projections in DML is to use a non-smooth loss function. A popular choice of the non-smooth loss function is the hinge loss, whose derivative becomes zero when the input value exceeds a certain threshold. Many online learning algorithms for DML [24, 32, 66] take advantage of the non-smooth loss function to reduce the number of updates and projections. In [98], the authors proposed a structure preserving metric learning algorithm (SPML) that combines a mini-batch strategy with the hinge loss to further reduce the number of updates for DML. It groups multiple constraints into a mini-batch and performs only one update of the distance metric for each mini-batch. But, according to our empirical study, although SPML reduces the running time of the standard SGD algorithm, it results in a significantly worse performance for several datasets, due to the deployment of the mini-batch strategy.

In this chapter, we first develop a new mini-batch based SGD algorithm for DML, termed Mini-SGD. Unlike SPML that relies on the hinge loss, the proposed Mini-SGD algorithm exploits a smooth loss function for DML. By using a smooth loss function, the proposed algorithm is able to effectively take advantage of the reduction in the variance of gradients achieved by the mini-batch, which in return leads to a better regret bound for online learning [28] and consequentially a more accurate prediction for the learned distance metric. We show theoretically that by using a smooth loss function, Mini-SGD is able to achieve a similar convergence rate as the standard SGD algorithm but with a significantly smaller number of updates. The second contribution of this work is to develop a new strategy, termed adaptive sampling, for reducing the number of projections in DML. The key idea of adaptive sampling is to first measure the "difficulty" in classifying a constraint using the learned distance metric, and then perform stochastic updating based on the difficulty measure. Finally, we develop two hybrid approaches that combine adaptive sampling with mini-batch to further improve the computational efficiency of SGD for DML. We conduct an extensive empirical study to verify the effectiveness and efficiency of the proposed algorithms for DML.

(1) This chapter is adapted from the published paper: Q. Qian, R. Jin, J. Yi, L. Zhang and S. Zhu. Efficient Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD). Machine Learning Journal (MLJ), 99(3):353-372, 2015.

We summarize the main contributions of this work as follows:

• To the best of our knowledge, this is the first work that exploits the combination of the mini-batch strategy with a smooth loss function for DML. We verify, both theoretically and empirically, the efficiency and effectiveness of the mini-batch strategy with smooth loss for DML.

• We propose an adaptive sampling approach for efficient DML. We verify, both theoretically and empirically, the efficiency and effectiveness of the adaptive sampling approach for DML.

• We present two hybrid approaches that exploit the combination of the mini-batch strategy with adaptive sampling for DML. Our extensive empirical study verifies that the hybrid approaches are significantly more efficient than both the mini-batch strategy and the adaptive sampling approach.
The rest of this chapter is organized as follows: Section 3.1 describes the proposed SGD algorithms for DML based on mini-batch and adaptive sampling. Two hybrid approaches are presented that combine mini-batch and adaptive sampling for DML. The theoretical guarantees for both mini-batch based and adaptive sampling based SGD are also presented in Section 3.1. Section 3.2 summarizes the results of the empirical study, and Section 3.3 concludes this chapter.

3.1 Improved SGD for DML by Mini-batch and Adaptive Sampling

We first review the basic framework of DML with triplet constraints. We then present two strategies to improve the computational efficiency of SGD for DML, one by mini-batch and the other by adaptive sampling. We present the theoretical guarantees for both strategies, and defer the more detailed analysis to the appendix. At the end of this section, we present two hybrid approaches that combine mini-batch with adaptive sampling for more efficient DML.

Let X be the domain for input patterns in R^d, where d is the dimensionality. For the convenience of analysis, we assume all the input patterns have bounded norm, i.e., for any x in X, ||x||_2 <= r. Given a distance metric M in R^{d x d}, the squared distance between x_a and x_b, denoted by dist_M(x_a, x_b), is measured by

dist_M(x_a, x_b) = (x_a - x_b)^T M (x_a - x_b)

Let Ω = {M : M is PSD, ||M||_F <= R} be the domain for the distance metric M, where R specifies the domain size. Let D = {(x_i^1, x_j^1, x_k^1), ..., (x_i^N, x_j^N, x_k^N)} be the set of triplet constraints used for DML, where x_i^t is expected to be closer to x_j^t than to x_k^t. Let ℓ(z) be the convex loss function. Define Δ(x_i^t, x_j^t, x_k^t; M) as

Δ(x_i^t, x_j^t, x_k^t; M) = dist_M(x_i^t, x_k^t) - dist_M(x_i^t, x_j^t) = <M, (x_i^t - x_k^t)(x_i^t - x_k^t)^T - (x_i^t - x_j^t)(x_i^t - x_j^t)^T> = <M, A_t>

where

A_t = (x_i^t - x_k^t)(x_i^t - x_k^t)^T - (x_i^t - x_j^t)(x_i^t - x_j^t)^T

Given the triplet constraints in D and the domain Ω, we learn an optimal distance metric M in R^{d x d} by solving the following optimization problem

min_{M in Ω} L(M) = (1/N) Σ_{t=1}^N ℓ(Δ(x_i^t, x_j^t, x_k^t; M))    (3.1)

We also define the expectation of the loss function as

L̄(M) = E[ℓ(Δ(x_i, x_j, x_k; M))]    (3.2)

where the expectation is taken over x_i, x_j and x_k.

The key idea of online DML is to minimize the empirical loss L(M) by updating the distance metric based on one sampled constraint at each iteration. More specifically, at iteration t, it samples a triplet constraint (x_i^t, x_j^t, x_k^t), and updates the distance metric M_t to M_{t+1} by

M_{t+1} <- Π_Ω(M_t - η ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t)) A_t)

where η > 0 is the step size, ℓ'(·) is the derivative and Π_Ω(M) projects a matrix M onto the domain Ω. The following proposition shows that Π_Ω(M) can be computed in two steps, i.e., first projecting M onto the PSD cone, and then scaling the projected M to fit the constraint ||M||_F <= R.

Proposition 1. [12] We have

Π_Ω(M) = P(M) / max(||P(M)||_F / R, 1)

Here P(M) projects the matrix M onto the PSD cone and is computed as P(M) = Σ_{i=1}^d max(λ_i, 0) v_i v_i^T, where (λ_i, v_i), i = 1, ..., d are the eigenvalues and corresponding eigenvectors of M.

As indicated by Proposition 1, Π_Ω(M) requires projecting the distance metric M onto the PSD cone, an expensive operation that requires the eigen-decomposition of M.

Finally, we approximate the hinge loss by a smooth loss in our study

ℓ(z) = (1/L) log(1 + exp(-L(z - 1)))    (3.3)

where L > 0 is a parameter that controls the approximation error: the larger the L, the closer ℓ(z) is to the hinge loss. Note that the smooth approximation of the hinge loss was first suggested in [122] for classification and was later verified by an empirical study in [119]. The key properties of the loss function ℓ(z) in (3.3) are given in the following proposition.

Proposition 2. For the loss function defined in (3.3), we have, for any z in R,

|ℓ'(z)| <= 1  and  |ℓ'(z)| <= L ℓ(z)

Compared to the hinge loss function, the main advantage of the loss function in (3.3) is that it is a smooth loss function. As will be revealed by our analysis, it is the smoothness of the loss function that allows us to effectively explore both the mini-batch and adaptive sampling strategies for more efficient DML without having to sacrifice the prediction performance.
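To make the shape of this loss concrete, here is a minimal Python/numpy sketch of ℓ(z) and its derivative (the helper names are hypothetical; the experiments later use L = 3, and the dissertation's own code is in Matlab):

import numpy as np

def smooth_hinge(z, L=3.0):
    # l(z) = (1/L) * log(1 + exp(-L*(z - 1))), a smooth approximation of the hinge loss
    # np.logaddexp(0, x) computes log(1 + exp(x)) in a numerically stable way
    return np.logaddexp(0.0, -L * (z - 1.0)) / L

def smooth_hinge_deriv(z, L=3.0):
    # l'(z) = -1 / (1 + exp(L*(z - 1))); note |l'(z)| <= 1, consistent with Proposition 2
    return -1.0 / (1.0 + np.exp(L * (z - 1.0)))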
3.1.1 Mini-batch SGD for DML (Mini-SGD)

Mini-batch SGD improves the computational efficiency of online DML by grouping multiple constraints into a mini-batch and only updating the distance metric once for each mini-batch. For brevity, we will refer to this algorithm as Mini-SGD for the rest of the chapter.

Algorithm 1 Mini-batch Stochastic Gradient Descent (Mini-SGD) for DML
1: Input: triplet constraints {(x_i^t, x_j^t, x_k^t)}_{t=1}^N, step size η, mini-batch size b, and domain size R
2: Initialize M_1 = I and T = N/b
3: for t = 1, ..., T do
4:   Sample b triplet constraints {(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})}_{s=1}^b
5:   Update the distance metric by M_{t+1} <- Π_Ω(M_t - η ∇ℓ_t(M_t))
6: end for
7: return M̄ = (1/T) Σ_{t=1}^T M_t

Let b be the batch size. At iteration t, it samples b triplet constraints, denoted by (x_i^{t,s}, x_j^{t,s}, x_k^{t,s}), s = 1, ..., b, and defines the mini-batch loss at iteration t as

ℓ_t(M_t) = (1/b) Σ_{s=1}^b ℓ(Δ(x_i^{t,s}, x_j^{t,s}, x_k^{t,s}; M_t))

Mini-batch DML updates the distance metric M_t to M_{t+1} using the gradient of the mini-batch loss function ℓ_t(M), i.e.,

M_{t+1} <- Π_Ω(M_t - η ∇ℓ_t(M_t))

Algorithm 1 gives the detailed steps of Mini-SGD for DML, where in step 5 Proposition 1 is used to efficiently compute the projection Π_Ω(·).

The theorem below provides the theoretical guarantee for the Mini-SGD algorithm for DML using the smooth loss function defined in (3.3).

Theorem 1. Let M̄ be the solution output by Algorithm 1 that uses the loss function defined in (3.3). Let M* be the optimal solution that minimizes L̄(M). Assuming ||A_t||_F <= A for any triplet constraint and η <= 1/(3LA^2), we have

E[L̄(M̄)] <= L̄(M*) / (1 - 3ηLA^2/2) + bR^2 / (2ηN(1 - 3ηLA^2/2))    (3.4)

where the expectation is taken over the sequence of triplet constraints.

Remark 1 First, we observe that the second term in the upper bound in (3.4), i.e., bR^2/[2ηN(1 - 3ηLA^2/2)], has a linear dependence on the mini-batch size b, implying that the larger the b, the less accurate the distance metric learned by Algorithm 1. Hence, by adjusting the parameter b, the size of the mini-batch, we are able to make an appropriate tradeoff between the prediction accuracy and the computational efficiency: the smaller the b, the more accurate the distance metric, but with more updates and consequentially a higher computational cost. When L̄(M*) = 0, we have E[L̄(M̄)] = O(1/N), i.e., the expected prediction error will be reduced at the rate of b/N, significantly faster than that of the mini-batch SGD algorithm (i.e., O(1/sqrt(N))) given in [28]. Second, if we set η as

η = 1 / (3LA^2 (1 + 2γ)),  where  γ = sqrt(3bLR^2A^2 / (N L̄(M*)))    (3.5)

we have

E[L̄(M̄)] <= 2 L̄(M*) + 6bLA^2R^2 / N    (3.6)

Although the step size in (3.5) requires the knowledge of L̄(M*) that is usually unavailable, as suggested in [67], L̄(M*) can be estimated empirically using part of the training examples. Third, the bound in (3.4) reveals the importance of using a smooth loss function, as the last term in (3.4) is proportional to L that measures the smoothness of the loss function. As a result, using a non-smooth loss function (e.g., the hinge loss) in DML will not be able to benefit from the advantage of the mini-batch strategy. Finally, unlike the analysis in [98] (i.e., Theorem 2) that only considers the case when b = 1, Theorem 1 provides a general result for any mini-batch size b.
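A minimal numpy sketch of one Mini-SGD update (hypothetical helper names, assuming the smooth loss with parameter L from (3.3); this is an illustrative sketch, not the dissertation's Matlab implementation) may help make the roles of A_t, the mini-batch gradient and the projection Π_Ω explicit:

import numpy as np

def triplet_matrix(xi, xj, xk):
    # A_t = (xi - xk)(xi - xk)^T - (xi - xj)(xi - xj)^T
    dk, dj = xi - xk, xi - xj
    return np.outer(dk, dk) - np.outer(dj, dj)

def project_domain(M, R):
    # Proposition 1: project onto the PSD cone, then scale into the Frobenius ball of radius R
    eigvals, eigvecs = np.linalg.eigh(M)
    M_psd = (eigvecs * np.maximum(eigvals, 0.0)) @ eigvecs.T
    norm = np.linalg.norm(M_psd)
    return M_psd if norm <= R else M_psd * (R / norm)

def mini_sgd_step(M, batch, eta, R, L=3.0):
    # One update of Algorithm 1: average the gradient over the mini-batch,
    # take a gradient step, then project back onto the domain Omega.
    grad = np.zeros_like(M)
    for xi, xj, xk in batch:
        A = triplet_matrix(xi, xj, xk)
        z = np.sum(M * A)                                    # <M, A> = dist_M(xi, xk) - dist_M(xi, xj)
        grad += (-1.0 / (1.0 + np.exp(L * (z - 1.0)))) * A   # l'(z) * A
    grad /= len(batch)
    return project_domain(M - eta * grad, R)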
3.1.2 Adaptive Sampling based SGD for DML (AS-SGD)

We now develop a new approach for reducing the number of updates in SGD in order to improve the computational efficiency of DML. Instead of updating the distance metric at each iteration, the proposed strategy introduces a random binary variable to decide if the distance metric M_t will be updated given a triplet constraint (x_i^t, x_j^t, x_k^t). More specifically, it computes the derivative ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t)), and samples a random variable Z_t with probability

Pr(Z_t = 1) = |ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t))|

The distance metric will be updated only when Z_t = 1. According to Proposition 2, we have |ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t))| <= L ℓ(Δ(x_i^t, x_j^t, x_k^t; M_t)) for the smooth loss function given in (3.3), implying that a triplet constraint has a high chance of being used for updating the distance metric if it has a large loss. Therefore, the essential idea of the proposed adaptive sampling strategy is to give a large chance of updating the distance metric when the triplet is difficult to be classified and a low chance when the triplet can be classified correctly with a large margin. We note that an alternative strategy is to sample a triplet constraint (x_i^t, x_j^t, x_k^t) based on its loss ℓ(Δ(x_i^t, x_j^t, x_k^t; M_t)). We did not choose the loss as the basis for updating because it is the derivative, not the loss, that will be used by SGD for updating the distance metric. The detailed steps of adaptive sampling based SGD for DML are given in Algorithm 2. We refer to this algorithm as AS-SGD for short in the rest of this chapter.

Algorithm 2 Adaptive Sampling Stochastic Gradient Descent (AS-SGD) for DML
1: Input: triplet constraints {(x_i^t, x_j^t, x_k^t)}_{t=1}^N, step size η, and domain size R
2: Initialize M_1 = I
3: for t = 1, ..., N do
4:   Sample a binary random variable Z_t with Pr(Z_t = 1) = |ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t))|
5:   if Z_t = 1 then
6:     Update the distance metric by γ_t = sign(ℓ'(Δ(x_i^t, x_j^t, x_k^t; M_t))), M_{t+1} <- Π_Ω(M_t - η γ_t A_t)
7:   end if
8: end for
9: return M̄ = (1/N) Σ_{t=1}^N M_t

The theorem below provides the performance guarantee for AS-SGD. It also bounds the number of updates Σ_{t=1}^N Z_t for AS-SGD.

Theorem 2. Let M̄ be the solution output by Algorithm 2 that uses the loss function defined in (3.3). Let M* be the optimal solution that minimizes L̄(M). Assuming ||A_t||_F <= A for any triplet constraint and η < 2/(LA^2), we have

E[L̄(M̄)] <= L̄(M*) / (1 - ηLA^2/2) + R^2 / (2ηN(1 - ηLA^2/2))    (3.7)

and the number of updates bounded by

E[Σ_{t=1}^N Z_t] <= N L L̄(M*) / (1 - ηLA^2/2) + L R^2 / (2η(1 - ηLA^2/2))    (3.8)

where the expectation is taken over both the binary random variables {Z_t}_{t=1}^N and the sequence of triplet constraints.

Remark 2 If we set η as

η = 1 / (2LA^2 (1 + 2γ)),  where  γ = sqrt(LA^2R^2 / (4N L̄(M*)))

we have

E[L̄(M̄)] <= 2 L̄(M*) + LR^2A^2 / N    (3.9)

and

E[Σ_{t=1}^N Z_t] <= 2NL L̄(M*) + L^2A^2R^2    (3.10)

The bounds given in (3.7) and (3.9) share similar structures as those given in (3.4) and (3.6), except that they do not have the mini-batch size b that can be used to make a tradeoff between the number of updates and the classification accuracy. The number of updates performed by Algorithm 2 is bounded by (3.10). The dominant term in (3.10) is O(L̄(M*) N), implying that Algorithm 2 will have a small number of updates if the optimal distance metric only makes a small number of mistakes on the given set of training triplets. In the extreme case when L̄(M*) -> 0, the expected number of updates will be bounded by a constant L^2A^2R^2. We note that this is consistent with our intuition: it will be easy to learn a good distance metric when the optimal one only makes a few mistakes, and as a result, only a few updates are needed to find a distance metric that is consistent with most of the training triplets. Compared with the result of the perceptron method [96], we do not assume that the dataset is separable, which makes our bound for the number of updates more practically useful.
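Again as a rough illustration only (reusing the hypothetical triplet_matrix and project_domain helpers from the Mini-SGD sketch above, and the smooth loss with parameter L), one AS-SGD iteration can be sketched as:

import numpy as np

def as_sgd_step(M, triplet, eta, R, L=3.0, rng=np.random):
    # One iteration of Algorithm 2: update only with probability |l'(<M, A_t>)|,
    # so well-separated triplets rarely trigger the expensive PSD projection.
    xi, xj, xk = triplet
    A = triplet_matrix(xi, xj, xk)
    z = np.sum(M * A)
    g = -1.0 / (1.0 + np.exp(L * (z - 1.0)))   # l'(z), lies in (-1, 0)
    if rng.rand() < abs(g):                    # Pr(Z_t = 1) = |l'(z)|
        M = project_domain(M - eta * np.sign(g) * A, R)   # gamma_t = sign(l'(z))
    return M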
3.1.3 Hybrid Approaches: Combining Mini-batch with Adaptive Sampling for DML

Since mini-batch and adaptive sampling improve the computational efficiency of SGD from different aspects, it is natural to combine them together for more efficient DML. Similar to the Mini-SGD algorithm, the hybrid approaches group multiple triplet constraints into a mini-batch. But, unlike Mini-SGD that updates the distance metric for every mini-batch of constraints, the hybrid approaches follow the idea of adaptive sampling, and introduce a binary random variable to decide if the distance metric will be updated for every mini-batch of constraints. By combining the strengths of mini-batch and adaptive sampling for SGD, the hybrid approaches are able to make a further improvement in the computational efficiency of DML. Algorithm 3 highlights the key steps of the hybrid approaches.

Algorithm 3 A Framework of Hybrid Stochastic Gradient Descent (Hybrid-SGD) for DML
1: Input: triplet constraints {(x_i^t, x_j^t, x_k^t)}_{t=1}^N, step size η, mini-batch size b, and domain size R
2: Initialize M_1 = I and T = N/b
3: for t = 1, ..., T do
4:   Sample b triplets {(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})}_{s=1}^b
5:   Compute the sampling probability ρ_t({(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})}_{s=1}^b; M_t) as in Eqn. (3.11) or (3.12)
6:   Sample a binary random variable Z_t with Pr(Z_t = 1) = ρ_t
7:   if Z_t = 1 then
8:     Update the distance metric by η_t = η/ρ_t, M_{t+1} <- Π_Ω(M_t - η_t ∇ℓ_t(M_t))
9:   end if
10: end for
11: return M̄ = (1/T) Σ_{t=1}^T M_t

One of the key steps in the hybrid approaches (step 5 in Algorithm 3) is to choose an appropriate sampling probability ρ_t for every mini-batch of constraints (x_i^{t,s}, x_j^{t,s}, x_k^{t,s}), s = 1, ..., b. In this work, we study two different choices for the sampling probability ρ_t:

• The first approach chooses ρ_t based on a triplet constraint randomly sampled from the mini-batch. More specifically, given a mini-batch of triplet constraints {(x_i^{t,s}, x_j^{t,s}, x_k^{t,s})}_{s=1}^b, it randomly samples an index s in the range [1, b]. It then sets the sampling probability ρ_t to be the derivative for the randomly sampled triplet, i.e.,

ρ_t = |ℓ'(Δ(x_i^{t,s}, x_j^{t,s}, x_k^{t,s}; M_t))|    (3.11)

We refer to this approach as HR-SGD.

• The second approach is based on an average case analysis. It sets the sampling probability as the average derivative measured by the norm of the gradient ∇ℓ_t(M_t), i.e.,

ρ_t = (1/W) ||∇ℓ_t(M_t)||_F    (3.12)

where W = max_t ||∇ℓ_t(M_t)||_F and is estimated by sampling. We refer to this approach as HA-SGD.
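The two choices of ρ_t can be sketched as follows (a rough, hypothetical illustration reusing the triplet_matrix helper from the Mini-SGD sketch above; the clipping to at most 1 in the second function is an implementation detail added here only to keep ρ_t a valid probability, and W is assumed to be a pre-computed estimate):

import numpy as np

def hr_probability(M, batch, L=3.0, rng=np.random):
    # HR-SGD (Eqn. 3.11): derivative magnitude of one randomly picked triplet
    xi, xj, xk = batch[rng.randint(len(batch))]
    z = np.sum(M * triplet_matrix(xi, xj, xk))
    return abs(-1.0 / (1.0 + np.exp(L * (z - 1.0))))

def ha_probability(M, batch, W, L=3.0):
    # HA-SGD (Eqn. 3.12): norm of the averaged mini-batch gradient, scaled by the estimate W
    grad = np.zeros_like(M)
    for xi, xj, xk in batch:
        A = triplet_matrix(xi, xj, xk)
        z = np.sum(M * A)
        grad += (-1.0 / (1.0 + np.exp(L * (z - 1.0)))) * A
    grad /= len(batch)
    return min(1.0, np.linalg.norm(grad) / W)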
3.2 Experiments

Ten datasets are used to validate the effectiveness of the proposed algorithms. Table 3.1 summarizes the information of these datasets. Datasets dna, letter [58], protein and sensit [35] are downloaded from LIBSVM [22]. Datasets tdt30 and rcv20 are document corpora: tdt30 is the subset of the tdt2 data [14] comprised of the documents from the 30 most popular categories, and rcv20 is the subset of the larger rcv1 dataset [5] consisting of documents from the 20 most popular categories. Following [24], we reduce the dimensionality of these document datasets to 200 by principal component analysis (PCA). All the other datasets are downloaded directly from the UCI repository [40]. For all the datasets used in this study, we use the standard training/testing split provided by the original dataset, except for datasets semeion, connect4 and tdt30, where 70% of the data is randomly selected for training and the remaining 30% is used for testing. All the experiments are repeated five times, and both the average results and their standard deviation are reported. All the experiments are run on a laptop with 8GB memory and two 2.50GHz Intel Core i5-2520M CPUs.

Table 3.1: Statistics for the ten datasets used in our empirical study.

          #class  #feature  #train     #test
semeion   10      256       1,115      478
dna       3       180       2,000      1,186
isolet    26      617       6,238      1,559
tdt30     30      200       6,575      2,819
letter    26      16        15,000     5,000
protein   3       357       17,766     6,621
connect4  3       42        47,289     20,268
sensit    3       100       78,823     19,705
rcv20     20      200       477,141    14,185
poker     10      10        1,000,000  25,010

3.2.1 Parameter Setting

The parameter L in the loss function (3.3) is set to be 3 according to the suggestion in [122]. The number of triplet constraints N is set to be 100,000 for all the datasets except for the two small datasets semeion and dna, where N = 20n. To construct triplet constraints, we follow the active sampling strategy given in [114]: at each iteration t, we first randomly pick a training example x_i^t, and then x_j^t from the 3 positive nearest neighbors(2) of x_i^t; we then randomly select a triplet constraint from the set of active constraints(3) involving x_i^t and x_j^t. We note that this is different from [24], where triplet constraints are selected completely randomly. Our empirical study shows that the active sampling strategy is more effective than choosing triplet constraints completely randomly. This is because the actively sampled constraints are more informative for the learned distance metric than a completely random choice. Furthermore, to verify that the choice of N does not lead to overfitting of the training data, particularly for the two small datasets, in Fig. 3.1 we show the training and test errors for dataset dna. It is clear that both errors decline over epochs, suggesting that no overfitting is found even for the small dataset.

For Mini-SGD and the hybrid approaches, we set b = 10 for the size of the mini-batch as in [98], leading to a total of T = 10,000 iterations for these approaches. We evaluate the learned distance metric by the classification error of a k-NN classifier on the test data, where the number of nearest neighbors k is set to be 3 based on our experience.

Parameter R in the proposed algorithms determines the domain size for the distance metric to be learned. We observe that the classification error of k-NN remains almost unchanged when varying R in the range of {100, 1000, 10000}. We thus set R = 1,000 for all the experiments. Another important parameter used by the proposed algorithms is the step size η. We evaluate the impact of the step size by measuring the classification error of a k-NN algorithm that uses the distance metric learned by the Mini-SGD algorithm with η = {0.1, 1, 10}. We observe that η = 1 yields a low classification error for almost all datasets. We thus fix η = 1 for the proposed algorithms in all the experiments.

(2) x_j is a positive nearest neighbor of x_i if x_j and x_i share the same class assignment.
(3) A constraint is active if its hinge loss based on the Euclidean distance is non-zero.

Figure 3.1: The training and testing errors over epochs for dataset dna.

3.2.2 Experiment (I): Effectiveness of the Proposed SGD Algorithms for DML

In this experiment, we compare the performance of the proposed SGD algorithms for DML, i.e., Mini-SGD, AS-SGD and the two hybrid approaches (HR-SGD and HA-SGD), to the full version of SGD for DML (SGD). We also include the Euclidean distance as the reference method in our comparison. Table 3.2 shows the classification error of k-NN (k = 3) using the proposed DML algorithms and the baseline algorithms, respectively. First, it is not surprising to observe that all the distance metric learning algorithms improve the classification performance of k-NN compared to the Euclidean distance. Second, for almost all datasets, we observe that all the proposed DML algorithms (i.e., Mini-SGD, AS-SGD, HR-SGD, and HA-SGD) yield similar classification performance as SGD, the full version of the SGD algorithm for DML. This result confirms that the proposed SGD algorithms are effective for DML despite the modifications we made to the SGD algorithm.

Table 3.2: Classification error (%) of k-NN (k=3) using the distance metrics learned by the proposed SGD methods for DML. Standard deviation computed from five trials is included in the parenthesis.
          Euclid   SGD           Mini-SGD      AS-SGD        HR-SGD        HA-SGD
semeion   8.79     4.39 (0.30)   4.60 (0.53)   4.23 (0.60)   4.27 (0.41)   4.18 (0.26)
dna       20.71    6.90 (0.16)   6.64 (0.33)   7.15 (0.42)   6.80 (0.21)   6.86 (0.15)
isolet    8.98     5.98 (0.15)   4.23 (0.19)   6.09 (0.13)   4.59 (0.30)   4.77 (0.17)
tdt30     5.96     4.51 (0.07)   3.52 (0.08)   4.53 (0.06)   3.70 (0.20)   3.65 (0.09)
letter    4.42     2.26 (0.09)   2.54 (0.06)   2.25 (0.10)   2.31 (0.08)   2.23 (0.07)
protein   49.95    39.46 (0.42)  38.16 (0.24)  39.49 (0.51)  40.76 (0.20)  40.03 (0.30)
connect4  29.48    20.16 (0.08)  20.20 (0.08)  20.22 (0.12)  21.45 (0.71)  20.41 (0.14)
sensit    27.28    23.62 (0.04)  22.95 (0.07)  23.70 (0.06)  23.39 (0.20)  23.33 (0.18)
rcv20     9.13     7.76 (0.16)   8.42 (0.04)   7.74 (0.11)   8.40 (0.04)   8.37 (0.02)
poker     37.98    35.89 (0.06)  35.22 (0.18)  35.87 (0.08)  35.74 (0.41)  35.66 (0.16)

3.2.3 Experiment (II): Efficiency of the Proposed SGD Algorithms for DML

Fig. 3.2 summarizes the running time for the proposed DML algorithms and the baseline SGD algorithm. We note that the running times in Fig. 3.2 do not take into account the time for constructing triplet constraints since it is shared by all the methods in comparison. It is not surprising to observe that all the proposed SGD algorithms, including Mini-SGD, AS-SGD, HA-SGD and HR-SGD, significantly reduce the running time of SGD. For instance, for dataset isolet, it takes SGD more than 35,000 seconds to learn a distance metric, while the running time is reduced to less than 4,000 seconds when applying the proposed SGD algorithms, roughly a factor of 10 reduction in running time. Comparing the running time of AS-SGD to that of Mini-SGD, we observe that each method has its own advantage: AS-SGD is more efficient on dataset semeion while Mini-SGD is more efficient on the other datasets. This is because different mechanisms are employed by AS-SGD and Mini-SGD to reduce the computational cost: AS-SGD improves the computational efficiency of DML by skipping the constraints that are easy to be classified, while Mini-SGD improves the computational efficiency of SGD by performing the update of the distance metric once for multiple triplet constraints. Finally, we observe that the two hybrid approaches, which combine the strengths of both adaptive sampling and mini-batch SGD, are computationally most efficient for almost all datasets. We also observe that HR-SGD appears to be more efficient than HA-SGD on six datasets and only loses its edge on datasets protein and rcv20. This is because HR-SGD computes the sampling probability ρ_t based on one randomly sampled triplet while HA-SGD needs to compute the average derivative for each mini-batch of triplet constraints for the sampling probability.

Figure 3.2: The comparison of running time (seconds) for various SGD methods on (a) semeion, (b) dna, (c) isolet, (d) tdt30, (e) letter, (f) protein, (g) connect4, (h) sensit, (i) rcv20 and (j) poker. Note that LMNN, a batch DML algorithm, is mainly implemented in C, which is computationally more efficient than our Matlab implementation. All the other methods are implemented in Matlab.

To further examine the computational efficiency of the proposed SGD algorithms for DML, we summarize in Fig. 3.3 the number of updates performed by the proposed SGD algorithms and the baseline SGD algorithm, respectively. We observe that all the proposed SGD algorithms for DML are able to reduce the number of updates significantly compared to SGD. Comparing Mini-SGD to AS-SGD, we observe that for semeion, the number of updates performed by AS-SGD is significantly less than Mini-SGD, while it is the other way around for the other datasets. This is again due to the fact that AS-SGD and Mini-SGD deploy different mechanisms for reducing computational costs. As we expect, the two hybrid approaches are able to further reduce the number of updates performed by AS-SGD and Mini-SGD, making them more efficient algorithms for DML.

By comparing the running time in Fig. 3.2 to the number of updates in Fig. 3.3, we observe that a small number of updates does NOT always guarantee a short running time.
This is exhibited by the comparison between the two hybrid approaches: although HA-SGD performs a similar number of updates as HR-SGD on datasets semeion and dna, it takes HA-SGD significantly longer to finish the computation than HR-SGD. This is also exhibited by comparing the results across different datasets for a fixed method. For example, for the HA-SGD method, the number of updates for the protein dataset is nearly the same as that for the poker dataset, but the running time for the protein dataset is about 100 times longer than that for the poker dataset. This result may sound counterintuitive at first glance. A more careful analysis, however, reveals that in addition to the number of updates, the running time of DML is also affected by the computational cost per iteration, which explains the consistency between Fig. 3.2 and Fig. 3.3. In the case of comparing the two hybrid approaches, we observe that HA-SGD is subject to a higher computational cost per iteration than HR-SGD because HA-SGD has to compute the norm of the average gradient over each mini-batch, while HR-SGD only needs to compute the derivative of one randomly sampled triplet constraint for each mini-batch. In the case of comparing the running time across different datasets, the protein dataset has a significantly higher dimensionality than the poker dataset and is therefore subject to a higher computational cost per iteration, because the computational cost of projecting an updated distance metric onto the PSD cone increases at least quadratically in the dimensionality.

3.2.4 Experiment (III): Comparison with State-of-the-art Online DML Methods

We compare the proposed SGD algorithms to three state-of-the-art online algorithms and one batch method for DML:

• SPML [98]: an online learning algorithm for DML that is based on mini-batch SGD and the hinge loss,
• OASIS [24]: a state-of-the-art online DML algorithm; the symmetric version with only one projection is applied,
• LEGO [66]: an online version of the information theoretic based DML algorithm [32],
• POLA [96]: a Perceptron based online DML algorithm.

Finally, for sanity checking, we also compare the proposed SGD algorithms to LMNN [114], a state-of-the-art batch learning algorithm for DML.

Both SPML and OASIS use the same set of triplet constraints to learn a distance metric as the proposed SGD algorithms. Since LEGO and POLA are designed for pairwise constraints, for fair comparison we generate pairwise constraints for LEGO and POLA by splitting each triplet constraint (x_i^t, x_j^t, x_k^t) into two pairwise constraints: a must-link constraint (x_i^t, x_j^t) and a cannot-link constraint (x_i^t, x_k^t). This splitting operation results in a total of 2N pairwise constraints for LEGO and POLA. Finally, we note that since LMNN is a batch learning method, it is allowed to utilize any triplet constraint derived from the data, and is not restricted to the set of triplet constraints we generate for the SGD methods. All the baseline DML algorithms are implemented using the codes from the original authors except for SPML, for which we made appropriate changes to the original code in order to avoid large matrix multiplication and improve the computational efficiency. SPML, OASIS, LEGO and POLA are implemented in Matlab, while the core parts of LMNN are implemented in C, which is usually deemed to be more efficient than Matlab. The default parameters suggested by the original authors are used in the baseline algorithms. The step size of LEGO is set to 1, as it was observed in [24] that the prediction performance of LEGO is in general insensitive to the step size. In all experiments, all the baseline methods set the initial solution for the distance metric to be an identity matrix.

Table 3.3: Classification error (%) of k-NN (k = 3) using the distance metrics learned by the baseline SGD method, online learning algorithms and the batch learning approach for DML. Standard deviation computed from five trials is included in the parenthesis.
            Baseline   Batch          Online Learning
            SGD        LMNN           POLA           LEGO           OASIS          SPML
semeion     4.18       7.11 (0.39)    19.25 (1.95)   12.89 (1.84)   6.74 (0.34)    4.81 (0.59)
dna         6.64       4.89 (0.29)    7.32 (0.55)    7.39 (0.55)    11.75 (0.43)   6.78 (0.58)
isolet      4.23       4.11 (0.08)    5.18 (0.38)    18.08 (6.98)   4.37 (0.26)    4.36 (0.18)
tdt30       3.52       2.80 (0.00)    5.93 (0.38)    21.11 (3.68)   3.92 (0.08)    3.47 (0.13)
letter      2.23       3.20 (0.00)    3.10 (0.22)    5.24 (0.45)    3.92 (0.05)    3.98 (0.53)
protein     38.16      39.86 (0.16)   38.38 (0.56)   42.60 (1.13)   37.83 (0.23)   40.12 (0.53)
connect4    20.16      21.60 (0.26)   25.67 (0.85)   26.06 (1.30)   22.37 (0.63)   24.60 (0.70)
sensit      22.95      24.45 (0.02)   27.51 (0.39)   26.50 (1.37)   22.12 (0.24)   23.48 (0.25)
rcv20       7.74       N/A            7.96 (0.08)    8.49 (0.18)    8.08 (0.06)    8.61 (0.12)
poker       35.22      N/A            41.26 (1.70)   40.58 (1.23)   45.12 (2.14)   39.42 (0.71)

Table 3.3 summarizes the classification results of k-NN (k = 3) using the distance metrics learned by the proposed method and by the baseline algorithms, respectively. SGD denotes the best result of the proposed methods in Table 3.2. First, we observe that LEGO and POLA perform significantly worse than the proposed DML algorithms for four datasets, including semeion, connect4, sensit and poker. LEGO also performs poorly on isolet and tdt30. This can be explained by the fact that LEGO and POLA use pairwise constraints for DML, while the other methods in comparison use triplet constraints. According to [24, 98, 114], triplet constraints are in general more effective than pairwise constraints. Second, although both SPML and Mini-SGD are based on the mini-batch strategy, SPML performs significantly worse than Mini-SGD on three datasets, i.e., protein, connect4, and poker. The performance difference between SPML and Mini-SGD can be explained by the fact that Mini-SGD uses a smooth loss function while a hinge loss is used by SPML. According to our analysis and the analysis in [28], using a smooth loss function is critical for the success of the mini-batch strategy. Third, OASIS yields similar performance as the proposed algorithms for almost all datasets except for datasets semeion, dna and poker, for which OASIS performs significantly worse. Overall, we conclude that the proposed DML algorithms yield similar, if not better, performance as the state-of-the-art online learning algorithms for DML.

Compared to LMNN, a state-of-the-art batch learning algorithm for DML, we observe that the proposed SGD algorithms yield similar performance on four datasets. They however perform significantly better than LMNN on datasets semeion and letter, and significantly worse on datasets dna and tdt30. We attribute the difference in classification error to the fact that the proposed DML algorithms are restricted to 100,000 randomly sampled triplet constraints, while LMNN is allowed to use all the triplet constraints that can be derived from the data. The restriction in triplet constraints could sometimes limit the classification performance but at other times helps avoid the overfitting problem. We also observe that LMNN is unable to run on the two large datasets rcv20 and poker, indicating that LMNN does not scale well with the size of the datasets.

Table 3.4: The comparison of running time (seconds) for OASIS and the hybrid methods. Average results over five trials are reported.
Methods    semeion   dna    isolet   tdt30   letter
OASIS      4.4       5.5    156.6    10.4    2.3
HR-SGD     3.6       5.1    363.2    21.7    1.4
HA-SGD     7.6       8.1    597.9    28.4    1.1

Methods    protein   connect4   sensit   rcv20   poker
OASIS      161.4     4.5        19.0     46.5    1.5
HR-SGD     275.7     3.0        23.5     139.2   0.9
HA-SGD     164.6     2.5        15.3     65.6    1.2

The running times for the proposed algorithms and the baseline algorithms are summarized in Fig. 3.2. The number of updates for both groups of algorithms is provided in Fig. 3.3. It is not surprising to observe that the two online DML algorithms (SPML, OASIS) are significantly more efficient than SGD in terms of both running time and the number of updates. We also observe that Mini-SGD and SPML share the same number of updates and similar running time for all datasets because they use the same mini-batch strategy. Furthermore, compared to the three online DML algorithms (SPML, LEGO and POLA), the two hybrid approaches are significantly more efficient in both running time and the number of updates. Table 3.4 compares the detailed running time of OASIS and the hybrid methods. We also observe that the hybrid methods are more efficient than OASIS on six datasets (i.e., semeion, dna, letter, connect4, sensit and poker), even though OASIS only performs the projection once at the end of the program. We attribute the efficiency of the hybrid approaches to the reduced number of updates they have to perform on the learned metric. Finally, since LMNN is implemented in C, it is not surprising to observe that LMNN shares similar running time as the other online DML algorithms for relatively small datasets. It is however significantly less efficient than the online learning algorithms for datasets of modest size (e.g., letter, connect4 and sensit), and becomes computationally infeasible for the two large datasets rcv20 and poker. Overall, we observe that the two hybrid approaches are significantly more efficient than the other DML algorithms in comparison.

Figure 3.3: The comparison of the number of updates for various SGD methods on the ten datasets (panels (a)-(j): semeion, dna, isolet, tdt30, letter, protein, connect4, sensit, rcv20, poker). Note that since POLA and LEGO optimize pairwise constraints, we decompose each triplet constraint into two pairwise constraints for these two methods. As a result, the number of constraints is doubled for these two methods.

3.3 Conclusions

In this chapter, we propose two strategies to improve the computational efficiency of SGD for DML, i.e., mini-batch and adaptive sampling. The key idea of mini-batch is to group multiple triplet constraints into a mini-batch and only update the distance metric once for each mini-batch; the key idea of adaptive sampling is to perform stochastic updating by giving a difficult triplet constraint more chance to be used for updating the distance metric than an easy triplet constraint. We develop theoretical guarantees for both strategies. We also develop two variants of hybrid approaches that combine mini-batch with adaptive sampling for more efficient DML. Our empirical study confirms that the proposed algorithms yield similar, if not better, prediction performance as the state-of-the-art online learning algorithms for DML but with significantly less running time. In the future, we could analyze the theoretical guarantees for the hybrid methods.

Chapter 4
Distance Metric Learning for High Dimensional Data

In this chapter, we will address the challenge arising from the large number of variables (O(d²)). We follow the one-projection paradigm as in Section 2.1 and focus on optimizing O(d²) variables. Nowadays, high dimensional representations become more and more popular in various applications, especially for images and videos, as the examples in Table 4.1 show. In contrast, DML is only applicable to low dimensional data (e.g., some hundreds of features) due to the involvement of d² variables. Some studies try to alleviate the issue as described in Section 2.4, but all of them have to hold some strong assumptions.

Table 4.1: Examples of applications with high dimensional features.
Applications                     #Features
Face Verification [25]           100,000
Image Classification [90]        250,000
Video Event Detection [87]       500,000

The straightforward way to solve the high dimensional DML problem is to reduce the dimensionality of the data using dimensionality reduction methods such as principal component analysis (PCA) [114] or random projection (RP) [103]. Although RP is computationally more efficient than PCA, it often yields significantly worse performance than PCA unless the number of random projections is sufficiently large [39]. We note that RP has been successfully applied to many machine learning tasks, e.g., classification [91], clustering [11] and regression [81]; however, only a few studies examined the application of RP to DML, and most of them with limited success.

In this chapter, we propose a dual random projection approach for high dimensional DML. Our approach, on one hand, enjoys the light computation of random projection, and on the other hand, significantly improves the effectiveness of random projection. The main limitation of using random projection for DML is that all the columns/rows of the learned metric will lie in the subspace spanned by the random vectors. We address this limitation of random projection by

• first estimating the dual variables based on the randomly projected vectors and,
• then reconstructing the distance metric using the estimated dual variables and the data vectors in the original space.

Since the final distance metric is computed using the original vectors, not the randomly projected vectors, the column/row space of the learned metric will NOT be restricted to the subspace spanned by the random projection, thus alleviating the limitation of random projection. We verify the effectiveness of the proposed algorithms both empirically and theoretically.

We finally note that our work is built upon the recent work [120] on random projection, where a dual random projection algorithm is developed for linear classification. Our work differs from [120] in that we apply the theory of dual random projection to DML. More importantly, we have made important progress in advancing the theory of dual random projection. Unlike the theory in [120], where the data matrix is assumed to be low rank or approximately low rank, our new theory of dual random projection is applicable to any data matrix, even when it is NOT approximately low rank. This new analysis significantly broadens the application domains where dual random projection is applicable, which is further verified by our empirical study.

The rest of the chapter is organized as follows: Section 4.1 describes the proposed dual random projection approach for DML and the detailed algorithm for solving the dual problem in the subspace spanned by the random projection. Section 4.2 summarizes the results of the empirical study, and Section 4.3 concludes this work.

4.1 Dual Random Projection for Distance Metric Learning

Let X = (x_1, ..., x_n) ∈ R^{d×n} denote the collection of training examples. Given a PSD matrix M, the distance between two examples x_i and x_j is given as

dist_M(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j).

The proposed framework for DML will be based on triplet constraints, not pairwise constraints. This is because several previous studies have suggested that triplet constraints are more effective than pairwise constraints [114, 24, 98]. Let D = {(x_i^1, x_j^1, x_k^1), ..., (x_i^N, x_j^N, x_k^N)} be the set of triplet constraints used for training, where x_i^t is expected to be more similar to x_j^t than to x_k^t. Our goal is to learn a metric function M that is consistent with most of the triplet constraints in D, i.e.
∀ (x_i^t, x_j^t, x_k^t):   (x_i^t − x_j^t)^T M (x_i^t − x_j^t) + 1 ≤ (x_i^t − x_k^t)^T M (x_i^t − x_k^t)

Following the empirical risk minimization framework, we cast the triplet constraint based DML into the following optimization problem:

min_{M ∈ S^d}   (λ/2) ||M||_F² + (1/N) Σ_{t=1}^N ℓ(⟨M, A_t⟩)        (4.1)

where S^d stands for the set of symmetric matrices of size d×d, λ ≥ 0 is the regularization parameter, ℓ(·) is a convex loss function, A_t = (x_i^t − x_k^t)(x_i^t − x_k^t)^T − (x_i^t − x_j^t)(x_i^t − x_j^t)^T, and ⟨·,·⟩ stands for the dot product between two matrices. We note that we did not enforce M in (4.1) to be PSD because we follow the one-projection paradigm proposed in [24], which first learns a symmetric matrix M by solving the optimization problem in (4.1) and then projects the learned matrix M onto the PSD cone. We emphasize that unlike [120], we did not assume ℓ(·) to be smooth, making it possible to apply the proposed approach to the hinge loss.

Let ℓ*(·) be the convex conjugate of ℓ(·). The dual problem of (4.1) is given by

max_{α ∈ R^N}   −(1/N) Σ_{t=1}^N ℓ*(α_t) − (1/(2λ)) || (1/N) Σ_{t=1}^N α_t A_t ||_F²

which is equivalent to

max_{α ∈ [−1,0]^N}   −Σ_{t=1}^N ℓ*(α_t) − (1/(2λN)) α^T G α        (4.2)

where α = (α_1, ..., α_N) and G = [G_{a,b}]_{N×N} is an N×N matrix with G_{a,b} = ⟨A_a, A_b⟩. We denote by M_* ∈ R^{d×d} the optimal primal solution to (4.1), and by α_* ∈ R^N the optimal dual solution to (4.2). Using the first order condition for optimality, we have

M_* = −(1/(λN)) Σ_{t=1}^N α_{*,t} A_t        (4.3)

4.1.1 Dual Random Projection for Distance Metric Learning

Directly solving the primal problem in (4.1) or the dual problem in (4.2) could be computationally expensive when the data is of high dimension and the number of training triplets is very large. We address this challenge by introducing a random matrix R ∈ R^{d×m}, where m ≪ d and R_{i,j} ~ N(0, 1/m), and projecting all the data points into the low dimensional space using the random matrix, i.e., x̂_i = R^T x_i. As a result, A_t, after random projection, becomes Â_t = R^T A_t R.

A typical approach of using random projection for DML is to obtain a matrix M_s of size m×m by solving the primal problem with the randomly projected vectors {x̂_i}_{i=1}^n, i.e.

min_{M ∈ S^m}   (λ/2) ||M||_F² + (1/N) Σ_{t=1}^N ℓ(⟨M, Â_t⟩)        (4.4)

Given the learned metric M_s, for any two data points x and x', their distance is measured by (x − x')^T R M_s R^T (x − x') = (x − x')^T M̂ (x − x'), where M̂ = R M_s R^T ∈ R^{d×d} is the effective metric in the original space R^d. The key limitation of this random projection approach is that both the column and the row space of M̂ are restricted to the subspace spanned by the vectors in the random matrix R.

Instead of solving the primal problem, we propose to solve the dual problem using the randomly projected data points {x̂_i}_{i=1}^n, i.e.

max_{α ∈ R^N}   −Σ_{t=1}^N ℓ*(α_t) − (1/(2λN)) α^T Ĝ α        (4.5)

where Ĝ_{a,b} = ⟨R^T A_a R, R^T A_b R⟩. After obtaining the optimal solution α̂ for (4.5), we reconstruct the metric by using the dual variables α̂ and the data matrix X in the original space, i.e.

M̂ = −(1/(λN)) Σ_{t=1}^N α̂_t A_t        (4.6)

It is important to note that unlike the random projection approach, the recovered metric M̂ in (4.6) is not restricted to the subspace spanned by the random vectors, a key to the success of the proposed algorithm.

Alg. 4 summarizes the key steps of the proposed dual random projection method for DML. Following the one-projection paradigm [24], we project the learned symmetric matrix M̂ onto the PSD cone at the end of the algorithm. The key component of Alg. 4 is to solve the optimization problem in (4.5) at Step 4 accurately. We choose the stochastic dual coordinate ascent (SDCA) method for solving the dual problem (4.5) because it enjoys a linear convergence rate when the loss function is smooth, and is shown empirically to be significantly faster than other stochastic optimization methods [97]. We use the combination strategy recommended in [97], denoted by CSDCA, which uses SGD for the first epoch and then applies SDCA for the remaining epochs.
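To make the pipeline concrete before stating it formally in Alg. 4 below, the following Python sketch traces the same steps under simplifying assumptions: it uses the hinge loss (so each dual variable lies in [−1, 0]), replaces the CSDCA solver with plain projected gradient ascent on (4.5), and builds the full N×N Gram matrix of the projected constraints, which is only practical for a small number of triplets. It is an illustration only, not the implementation used in this dissertation.

import numpy as np

def durp_sketch(X, triplets, lam=1e-3, m=10, iters=200, step=1e-3, seed=None):
    """Minimal sketch of dual random projection for DML with the hinge loss.

    X        : d x n data matrix (each column is one example)
    triplets : list of (i, j, k) index triples; x_i should be closer to x_j than to x_k
    lam      : regularization parameter (lambda in (4.1))
    m        : number of random projections
    """
    rng = np.random.default_rng(seed)
    d, _ = X.shape
    N = len(triplets)

    # Steps 2-3 of Alg. 4: random projection of the data, R_ij ~ N(0, 1/m).
    R = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))
    Xh = R.T @ X                                    # m x n projected examples

    # Projected triplet matrices A_hat_t = R^T A_t R, kept as small m x m arrays.
    A_hat = []
    for (i, j, k) in triplets:
        dik, dij = Xh[:, i] - Xh[:, k], Xh[:, i] - Xh[:, j]
        A_hat.append(np.outer(dik, dik) - np.outer(dij, dij))

    # Gram matrix G_hat of the projected constraints (O(N^2 m^2), illustration only).
    G = np.array([[np.sum(Aa * Ab) for Ab in A_hat] for Aa in A_hat])

    # Step 4: solve the dual (4.5); for the hinge loss, ell*(a) = a on [-1, 0],
    # so the objective is -sum(alpha) - alpha^T G alpha / (2 lam N).
    alpha = np.zeros(N)
    for _ in range(iters):
        grad = -1.0 - G @ alpha / (lam * N)
        alpha = np.clip(alpha + step * grad, -1.0, 0.0)

    # Step 5: recover the metric in the ORIGINAL d-dimensional space via (4.6).
    M = np.zeros((d, d))
    for a, (i, j, k) in zip(alpha, triplets):
        dik, dij = X[:, i] - X[:, k], X[:, i] - X[:, j]
        M -= a / (lam * N) * (np.outer(dik, dik) - np.outer(dij, dij))

    # Step 6: a single projection onto the PSD cone at the very end.
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.maximum(w, 0.0)) @ V.T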
Algorithm 4: Dual Random Projection Method (DuRP) for DML
1: Input: the triplet constraints D and the number of random projections m
2: Generate a random matrix R ∈ R^{d×m} with R_{i,j} ~ N(0, 1/m)
3: Project each example as x̂ = R^T x
4: Solve the optimization problem (4.5) and obtain the optimal solution α̂
5: Recover the solution in the original space by M̂ = −(1/(λN)) Σ_t α̂_t A_t
6: Output: P_PSD(M̂)

4.1.2 Main Theoretical Results

First, similar to [120], we consider the case when the data matrix X is of low rank. The theorem below shows that, under the low rank assumption, with a high probability the distance metric recovered by Algorithm 4 is nearly optimal.

Theorem 3. Let M_* be the optimal solution to (4.1). Let α̂ be the optimal solution for (4.5), and let M̂ be the solution recovered from α̂ using (4.6). Under the assumption that all the data points lie in a subspace of dimension r, for any 0 < ε ≤ 1/6, with a probability at least 1 − δ, we have

||P_PSD(M̂) − P_PSD(M_*)||_F  ≤  (3ε / (1 − 3ε)) ||M_*||_F

provided m ≥ (r + 1) log(2r/δ) / (c ε²), where the constant c is at least 1/3.

The proof of Theorem 3 can be found in the appendix. Theorem 3 indicates that if the number of random projections is sufficiently large (i.e., m ≥ Ω(r log r)), we can recover the optimal solution in the original space with a small error. It is important to note that our analysis, unlike [120], can be applied to non-smooth losses such as the hinge loss.

In the second case, we assume the loss function ℓ(·) is γ-smooth (i.e., |ℓ'(z) − ℓ'(z')| ≤ γ|z − z'|). The theorem below shows that the dual variables obtained by solving the optimization problem in (4.5) can be close to the optimal dual variables, even when the data matrix X is NOT low rank or approximately low rank. For the presentation of the theorem, we first define a few important quantities. Define matrices M^1, M^2, M^3, M^4 ∈ R^{N×N} as

M^1_{a,b} = ||x_i^a − x_k^a||_2² ||x_i^b − x_k^b||_2²
M^2_{a,b} = ||x_i^a − x_j^a||_2² ||x_i^b − x_j^b||_2²
M^3_{a,b} = ||x_i^a − x_k^a||_2² ||x_i^b − x_j^b||_2²
M^4_{a,b} = ||x_i^a − x_j^a||_2² ||x_i^b − x_k^b||_2²

and define Λ as the maximum of the spectral norms of the four matrices, i.e.

Λ = max{ ||M^1||_2, ||M^2||_2, ||M^3||_2, ||M^4||_2 }        (4.7)

where ||·||_2 stands for the spectral norm of a matrix.

Theorem 4. Assume ℓ(z) is γ-smooth. Let α_* be the optimal solution to the dual problem in (4.2), and let α̂ be an approximately optimal solution for (4.5) with suboptimality ε. Then, with a probability at least 1 − δ, we have

||α̂ − α_*||_2  ≤  max{ 8Λ/λ², 2 } √(γε)

where Λ is defined in (4.7), provided m ≥ (8Λ²/λ²) ln(8N/δ).

The proof of Theorem 4 can be found in the appendix. Unlike Theorem 3, where the data matrix X is assumed to be low rank, Theorem 4 holds without any prior assumption about the data matrix. It shows that despite the random projection, the dual solution can be recovered approximately using the randomly projected vectors, provided that the number of random projections m is sufficiently large, Λ is small, and the approximately optimal solution α̂ is sufficiently accurate. In the case when most of the training examples are not linearly dependent, we could have Λ ≤ O(N/d), which could be a modest number when d is very large. The result in Theorem 4 essentially justifies the key idea of our approach, i.e., computing the dual variables first and recovering the distance metric later. Finally, since ||α̂ − α_*||_2, the approximation error in the recovered dual variables, is proportional to the square root of the suboptimality ε, an accurate solution for (4.5) is needed to ensure a small approximation error. We note that given Theorem 4, it is straightforward to bound ||M̂ − M_*||_2 using the relationship between the dual variables and the primal variables in (4.3).

4.2 Experiments

We will first describe the experimental setting, and then present our empirical study for ranking and classification tasks on various datasets.

4.2.1 Experimental Setting

Datasets. Eight datasets are used to validate the effectiveness of the proposed algorithm for DML. Table 4.2 summarizes the information of these datasets. imagenet5 consists of 5 randomly selected classes from ImageNet [92], while cats5 contains 5 cat species from the same archive. caltech30 is a subset of the Caltech256 image dataset [48] and we use the version pre-processed by [24].
tdt30 is a subset of the tdt2 dataset [14]. Both caltech30 and tdt30 are comprised of the examples from the 30 most popular categories. All the other datasets are downloaded from LIBSVM [22], where rcv30 is a subset of the original dataset consisting of documents from the 30 most popular categories. The datasets tdt30, 20news and rcv30 are comprised of documents represented by vectors of 50,000 dimensions. Since it is expensive to compute and maintain a matrix of 50,000 × 50,000, for these three datasets we follow the procedure in [24] that maps all documents to a space of 1,000 dimensions. More specifically, we first keep the top 20,000 most popular words for each collection, and then reduce their dimensionality to 1,000 by using PCA. We emphasize that for several datasets in our testbeds, the data matrices cannot be well approximated by low rank matrices. Fig. 4.1 summarizes the eigenvalue distribution of the eight datasets used in our experiment. We observe that four of these datasets (i.e., caltech30, tdt30, 20news, rcv30) have a flat eigenvalue distribution, indicating that the associated data matrices cannot be well approximated by a low rank matrix. This justifies the importance of removing the low rank assumption from the theory of dual random projection, an important contribution of this work.

Table 4.2: Statistics for the datasets used in our empirical study. #C is the number of classes. #F is the number of original features. #Train and #Test represent the number of training data and test data, respectively.

            #C    #F      #Train     #Test
protein     3     357     17,766     6,621
isolet      26    617     6,238      1,559
imagenet5   5     1,000   4,582      1,964
cats5       5     1,000   5,128      2,199
caltech30   30    1,000   5,502      2,355
tdt30       30    1,000   6,575      2,819
20news      20    1,000   15,935     3,993
rcv30       30    1,000   507,585    15,195

Figure 4.1: The eigenvalue distribution of the datasets used in our empirical study (panels (a)-(h): protein, isolet, imagenet5, cats5, caltech30, tdt30, 20news, rcv30; each panel plots the eigenvalues against their index).

For the datasets used in this study, we use the standard training/testing split provided by the original datasets, except for datasets imagenet5, cats5, tdt30, caltech30 and rcv30. For imagenet5, cats5, tdt30 and caltech30, we randomly select 70% of the data for training and use the remaining 30% for testing; for rcv30, we switch the training and test sets defined by the original package to ensure that the number of training examples is sufficiently large.

Evaluation metrics. To measure the quality of learned distance metrics, two types of evaluations are adopted in our study. First, we follow the evaluation protocol in [24] and evaluate the learned metric by its ranking performance. More specifically, we treat each test instance q as a query, and rank the other test instances in ascending order of their distance to q using the learned metric. The mean average precision (mAP) given below is used to evaluate the quality of the ranking list

mAP = (1/|Q|) Σ_{i=1}^{|Q|} (1/r_i) Σ_{j=1}^{r_i} P(x_{ij})

where |Q| is the size of the query set, r_i is the number of relevant instances for the i-th query, and P(x_{ij}) is the precision for the first j ranked instances when the instance ranked at the j-th position is relevant to the query q. Here, an instance x is relevant to a query q if they belong to the same class. Second, we evaluate the learned metric by its classification performance with a k-nearest neighbor classifier. More specifically, for each test instance q, we apply the learned metric to find the k training examples with the shortest distance, and predict the class assignment for q by taking the majority vote among the k nearest neighbors. Finally, we also evaluate the computational efficiency of the proposed algorithm for DML by its running time.
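As a concrete reference for the ranking protocol above, the following Python sketch computes mAP exactly as in the formula, given a dense learned metric M; it is an illustration only (the function and variable names are ours) and is not optimized for large test sets.

import numpy as np

def mean_average_precision(X_test, y_test, M):
    """mAP of the ranking induced by dist_M, following the formula above.

    X_test : d x n matrix of test instances (each column is one instance)
    y_test : length-n array of class labels
    M      : d x d learned metric
    """
    n = X_test.shape[1]
    ap_sum = 0.0
    for q in range(n):
        diff = X_test - X_test[:, [q]]
        dist = np.einsum('dn,de,en->n', diff, M, diff)   # Mahalanobis distance to every instance
        order = np.argsort(dist)
        order = order[order != q]                        # rank only the OTHER test instances
        relevant = (y_test[order] == y_test[q])
        r = relevant.sum()                               # r_i: number of relevant instances
        if r == 0:
            continue
        hits = np.cumsum(relevant)
        ranks = np.arange(1, len(order) + 1)
        precision_at_hits = hits[relevant] / ranks[relevant]
        ap_sum += precision_at_hits.sum() / r            # average precision of query q
    return ap_sum / n                                    # mean over the |Q| = n queries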
Baselines. Besides the Euclidean distance, which is used as a baseline similarity measure, six state-of-the-art DML methods are compared in our empirical study:

• DuOri: This algorithm first applies Combined Stochastic Dual Coordinate Ascent (CSDCA) [97] to solve the dual problem in (4.2) and then computes the distance metric using the learned dual variables.
• DuRP: This is the proposed algorithm for DML (i.e., Algorithm 4).
• SRP: This algorithm applies random projection to project the data into a low dimensional space, and then employs CSDCA to learn the distance metric in this subspace.
• SPCA: This algorithm uses PCA as the initial step to reduce the dimensionality, and then applies CSDCA to learn the distance metric in the subspace generated by PCA.
• OASIS [24]: A state-of-the-art online learning algorithm for DML that learns the optimal distance metric directly from the original space without any dimensionality reduction.
• LMNN [114]: A state-of-the-art batch learning algorithm for DML. It performs dimensionality reduction using PCA before starting DML.

Implementation details. We randomly select N = 100,000 active triplets (i.e., triplets that incur a positive hinge loss under the Euclidean distance) and set the number of epochs to 3 for all stochastic methods (i.e., DuOri, DuRP, SRP, SPCA and OASIS), which yields sufficiently accurate solutions in our experiments and is also consistent with the observation in [97]. We search λ in {10^-5, 10^-4, 10^-3, 10^-2} and fix it as 1/N since the results are insensitive to it. The step size of CSDCA is set according to the analysis in [97]. For all stochastic optimization methods, we follow the one-projection paradigm by projecting the learned metric onto the PSD cone. The hinge loss is used in the implementation of the proposed algorithm. Both OASIS and LMNN use the implementation provided by the original authors, and parameters are tuned based on the recommendations of the original authors. All methods are implemented in Matlab, except for LMNN, whose core part is implemented in C, which is shown to be more efficient than our Matlab implementation. All stochastic optimization methods are repeated five times and the average result over five trials is reported. All experiments are run on a Linux server with 64GB memory and 12 × 2.4GHz CPUs, and only a single thread is permitted for each experiment.

4.2.2 Efficiency of the Proposed Method

Table 4.3: CPU time (minutes) for different methods for DML. All algorithms are implemented in Matlab except for LMNN, whose core part is implemented in C and is more efficient than our Matlab implementation.

            Metric in Original Space                           Metric in Subspace
            DuOri             DuRP        OASIS                SRP         SPCA        LMNN
protein     214.5 ± 5.8       0.6 ± 0.0   81.9 ± 3.3           0.2 ± 0.0   0.2 ± 0.0   488.9
isolet      198.4 ± 7.2       0.6 ± 0.0   12.6 ± 0.2           0.1 ± 0.0   0.1 ± 0.0   384.0
imagenet5   776.3 ± 31.2      1.2 ± 0.0   68.3 ± 1.3           0.1 ± 0.0   0.2 ± 0.0   1,374.1
cats5       782.3 ± 70.9      1.1 ± 0.0   74.1 ± 1.1           0.1 ± 0.0   0.3 ± 0.1   1,467.0
caltech30   1,214.5 ± 229.5   1.3 ± 0.0   640.4 ± 121.2        0.2 ± 0.0   0.5 ± 0.0   2,197.9
tdt30       1,029.9 ± 16.8    0.8 ± 0.0   140.8 ± 4.5          0.2 ± 0.0   0.4 ± 0.0   624.2
20news      1,212.9 ± 154.3   1.0 ± 0.0   216.3 ± 48.8         0.2 ± 0.0   0.5 ± 0.0   1,893.6
rcv30       1,121.3 ± 79.4    1.3 ± 0.0   432.5 ± 7.7          0.2 ± 0.0   4.2 ± 0.0   N/A

In this experiment, we set the number of random projections to 10, which, according to the experimental results in Sections 4.2.3 and 4.2.4, yields almost the optimal performance for the proposed algorithm. For fair comparison, the number of reduced dimensions is also set to 10 for LMNN.
Table 4.3 compares the CPU time (in minutes) of the different methods. Notice that the time for sampling triplets is not taken into account, as it is consumed by all the methods, while all the other operations (e.g., random projection and PCA) are included. It is not surprising to observe that DuRP, SRP and SPCA have similar CPU times, and are significantly more efficient than the other methods due to the effect of dimensionality reduction. Since DuRP and SRP share the same procedure for computing the dual variables in the subspace, the only difference between them lies in the procedure for reconstructing the distance metric from the estimated dual variables, a computational overhead that makes DuRP slightly slower than SRP. For all datasets, we observe that DuRP is at least 200 times faster than DuOri and 20 times faster than OASIS. Compared to the stochastic optimization methods, LMNN is the least efficient on six datasets (i.e., protein, isolet, imagenet5, cats5, caltech30 and 20news), mostly due to the fact that it is a batch learning algorithm.

4.2.3 Evaluation by Ranking

In the first experiment, we set the number of random projections used by SRP, SPCA and the proposed DuRP algorithm to 10, which is roughly 1% of the dimensionality of the original space. For fair comparison, the number of reduced dimensions for LMNN is also set to 10. We measure the quality of the learned metrics by their ranking performance using the metric of mAP. Table 4.4 summarizes the performance of the different methods for DML.

Table 4.4: Comparison of ranking results measured by mAP (%) for different metric learning algorithms.

                     Metric in Original Space                  Metric in Subspace
            Euclid   DuOri        DuRP         OASIS           SRP          SPCA         LMNN
protein     39.0     47.0 ± 0.1   49.1 ± 0.1   45.7 ± 0.1      37.7 ± 0.1   41.9 ± 0.1   41.9
isolet      46.7     77.1 ± 0.5   70.5 ± 0.6   67.6 ± 0.2      11.9 ± 1.2   36.5 ± 2.6   68.7
imagenet5   25.4     34.6 ± 0.1   34.1 ± 0.3   29.2 ± 0.1      21.5 ± 0.3   28.1 ± 0.2   31.9
cats5       22.5     26.3 ± 0.1   27.3 ± 0.3   23.7 ± 0.1      21.0 ± 0.1   21.5 ± 0.2   24.8
caltech30   16.4     23.8 ± 0.1   25.5 ± 0.1   25.4 ± 0.2      8.1 ± 0.4    19.5 ± 0.0   16.3
tdt30       36.8     65.9 ± 0.2   69.4 ± 0.3   55.9 ± 0.1      11.2 ± 0.3   49.7 ± 0.2   66.4
20news      8.4      20.1 ± 0.2   24.9 ± 0.3   16.2 ± 0.1      5.3 ± 0.1    12.2 ± 0.1   22.5
rcv30       16.7     65.7 ± 0.1   63.2 ± 0.2   68.6 ± 0.1      12.8 ± 0.4   46.5 ± 0.0   N/A

First, we observe that DuRP significantly outperforms SRP and SPCA for all datasets. In fact, SRP is worse than the Euclidean distance, which computes the distance in the original space. SPCA is only able to perform better than the Euclidean distance and is outperformed by all the other DML algorithms. Second, we observe that for all the datasets, DuRP yields similar performance as DuOri. The only difference between DuRP and DuOri is that DuOri solves the dual problem without using random projection. The comparison between DuRP and DuOri indicates that the random projection step has minimal impact on the learned distance metric, justifying the design of the proposed algorithm. Third, compared to OASIS, we observe that DuRP performs significantly better on three datasets (i.e., imagenet5, tdt30 and 20news) and has comparable performance on the other datasets. Finally, we observe that for all datasets, the proposed DuRP method significantly outperforms LMNN, a state-of-the-art batch learning algorithm for DML. We also note that because of limited memory, we are unable to run LMNN on dataset rcv30.

In the second experiment, we vary the number of random projections from 10 to 50. All stochastic methods are run with five trials, and Fig. 4.2 reports the average results with standard deviation. Note that the performance of OASIS and DuOri remains unchanged with a varied number of projections because they do not use projection. It is surprising to observe that DuRP almost achieves its best performance with only 10 projections for all datasets. This is in contrast to SRP and SPCA, whose performance usually improves with an increasing number of projections. We also observe that DuRP outperforms DuOri for several datasets (i.e., protein, imagenet5, cats5, caltech30, tdt30 and 20news). We suspect that the better performance of DuRP is due to the implicit regularization induced by the random projection.
We plan to investigate the regularization capability of random projection further in the future. We finally point out that with a sufficiently large number of projections, SPCA is able to outperform OASIS on 5 datasets (i.e., protein, imagenet5, cats5, tdt30 and 20news), indicating that the comparison result may be sensitive to the number of projections.

Figure 4.2: The comparison of different stochastic algorithms for ranking (panels (a)-(h): protein, isolet, imagenet5, cats5, caltech30, tdt30, 20news, rcv30; each panel plots mAP (%) against the number of projections for OASIS, DuOri, DuRP, SRP and SPCA).

4.2.4 Evaluation by Classification

In this experiment, we evaluate the learned metric by its classification accuracy with a k-NN (k = 5) classifier. We emphasize that the purpose of this experiment is to evaluate the metrics learned by different DML algorithms, not to demonstrate that the learned metric will result in state-of-the-art classification performance.¹ Similar to the evaluation by ranking, all experiments are run five times and the results averaged over five trials with standard deviation are reported in Fig. 4.3. We essentially have the same observations as for the ranking experiments reported in Section 4.2.3, except that for most datasets the three methods DuRP, DuOri, and OASIS yield very similar performance.

Note that the main concern of this chapter is running time, while the size of the learned metric is d × d. It is straightforward to store the learned metric efficiently by keeping a low-rank approximation of it.

¹ Many studies (e.g., [114, 116]) have shown that metric learning does not yield better classification accuracy than standard classification algorithms (e.g., SVM) given a sufficiently large number of training data.

Figure 4.3: The comparison of different stochastic algorithms for classification (panels (a)-(h): protein, isolet, imagenet5, cats5, caltech30, tdt30, 20news, rcv30; each panel plots accuracy (%) against the number of projections).

4.3 Conclusions

In this chapter, we propose a dual random projection method to learn the distance metric for large-scale high-dimensional datasets. The main idea is to solve the dual problem in the subspace spanned by the random projection, and then recover the distance metric in the original space using the estimated dual variables. We develop theoretical guarantees showing that, with a high probability, the proposed method can accurately recover the optimal solution with small error when the data matrix is of low rank, and the optimal dual variables even when the data matrix cannot be well approximated by a low rank matrix. Our empirical study confirms both the effectiveness and the efficiency of the proposed algorithm for DML by comparing it to the state-of-the-art algorithms for DML. In the future, we may further improve the efficiency of our method by exploiting the scenario when the optimal distance metric can be well approximated by a low rank matrix.
Chapter 5
Fine-Grained Visual Categorization via Multi-stage Metric Learning

In this chapter¹, we will address the challenge arising from the large number of constraints (O(n³)) and propose a new DML framework for FGVC by combining it with the approach in the last chapter. FGVC aims to distinguish objects in subordinate classes. For instance, dog images are classified into different breeds of dogs, such as "Chihuahua", "Pug", "Samoyed" and so on [70, 88]. As a result, FGVC classification has to handle the co-occurrence of two somewhat contradictory requirements: 1) it needs to distinguish many similar classes (e.g., dog breeds that have only subtle differences), and 2) it needs to deal with large intra-class variance (e.g., caused by different poses, examples, etc.). Fig. 5.1 demonstrates an example, and such co-occurring requirements make FGVC very challenging.

¹ This chapter is adapted from the published paper: Q. Qian, R. Jin, S. Zhu and Y. Lin. Fine-Grained Visual Categorization via Multi-stage Metric Learning. In: Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15), Boston, MA, 2015.

Figure 5.1: Illustration of how DML learns the embedding that pulls together the data points from the same class and pushes apart the data points from different classes. Blue points are from the class "English marigold" while red ones are "Barberton daisy". An important note here is that our DML does not require collapsing the data points from each class, and this allows the flexibility to model intra-class variance. A big challenge now is how to deal with the high-dimensional feature representation which is typical for image-level visual features. To this end, we propose a multi-stage scheme for metric learning.

The popular pipeline for FGVC consists of two steps, a feature extraction step and a classification step. The feature extraction step, which is sometimes combined with segmentation [2, 21, 88], part localization [8, 118] or both [20], extracts image level representations, and popular choices are LLC features [2], Fisher vectors [44] and so on. A recent development is to use a convolutional neural network (CNN) [72] trained on a large-scale image classification dataset (e.g., ImageNet [92]) and then use the trained model to extract features [34]. These so-called deep learning features have demonstrated state-of-the-art performance on FGVC datasets [34]. Note that training a CNN directly on FGVC datasets is difficult because the existing FGVC benchmarks are often too small [34] (only several tens of thousands of training images or less). In this chapter, we simply take the state-of-the-art deep learning features without any other operators (e.g., segmentation) and focus on studying a better classification approach to address the aforementioned two co-occurring requirements in FGVC.

For the classification step, many existing FGVC methods simply learn a single classifier for each fine-grained class according to the one-vs-all strategy [2, 8, 21, 118]. Apparently, this strategy does not scale well with the number of fine-grained classes, while the number of subordinate classes in FGVC can be very large (e.g., 200 classes in the birds11 dataset). Additionally, such a one-vs-all scheme only addresses the first of the two issues, namely, it makes an effort to separate different classes without modeling intra-class variance. In contrast, this chapter proposes a DML approach, aiming to explicitly handle the two co-occurring requirements with a single metric. Fig. 5.1 illustrates how DML works. The key idea of DML is twofold: 1) it learns a distance metric that pulls neighboring data points of the same class close to each other and pushes data points from different classes far apart, and 2) it has the flexibility to define the neighborhood size and requires only a portion of the data points from the same class to be close to each other. Consequently, the flexibility of the neighborhood size acts as a new way to model large intra-class variance. With a learned metric, a k-nearest neighbor classifier will be applied to find the class assignment for a test image.
There are three challenges in learning a metric directly from the original high dimensional space:

• Large number of constraints: A large number of training constraints is usually required to avoid overfitting in high dimensional DML. The total number of triplet constraints can be up to O(n³), where n is the number of examples.
• Computational challenge: DML has to learn a matrix of size d×d, where d is the dimensionality of the data and d = 134,016 in our study. The O(d²) number of variables leads to two computational challenges in finding the optimal metric. First, it results in a slower convergence rate in solving the related optimization problem [89]. Second, to ensure the learned metric to be positive semi-definite (PSD), most DML algorithms require, at every iteration of optimization, projecting the intermediate solution onto the PSD cone, an expensive operation with complexity of O(d³) (at least O(d²)).
• Storage limitation: It can be expensive to simply save O(d²) variables in memory. For instance, in our study it would take more than 130GB to store the complete metric in memory, which adds more complexity to the already difficult optimization problem.

In this chapter, we propose a multi-stage metric learning framework for high dimensional DML to explicitly address these challenges. First, to deal with the large number of constraints used by high dimensional DML, we divide the original optimization problem into multiple stages. At each stage, only a small subset of constraints that are difficult to classify by the currently learned metric will be adaptively sampled and used to improve the learned metric. By setting the regularizer appropriately, we can prove that the final solution is optimized over all appeared constraints. Second, to handle the computational challenge in each subproblem, we extend the theory of dual random projection [120], which was originally developed for linear classification problems, to metric learning. The proposed method, on one hand, enjoys the efficiency of random projection, and on the other hand learns a distance metric of size d×d. This is in contrast to most dimensionality reduction methods that learn a metric in a reduced space. Finally, to handle the storage problem, we propose to maintain a low rank copy of the learned metric by a randomized algorithm for low rank matrix approximation. It not only accelerates the whole learning process but also regularizes the learned metric to avoid overfitting. Extensive comparisons on benchmark FGVC datasets verify the effectiveness and efficiency of the proposed method.

The rest of the chapter is organized as follows: Section 5.1 describes the details of the proposed method. Section 5.2 shows the results of the empirical study, and Section 5.3 concludes this work.

5.1 Multi-stage Metric Learning

The proposed DML algorithm focuses on triplet constraints so as to pull the small portion of nearest examples from the same class together [114]. Let D = {(x_i, y_i), i = 1, ..., n} be a collection of n training images, where x_i ∈ R^d and y_i is the class assignment of x_i. Given a distance metric M, the distance between two data points x_i and x_j is measured by

dist_M(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j)

Let {x_i^t, x_j^t, x_k^t} (t = 1, ..., N) be a set of N triplet constraints derived from the training examples in D. Since in each constraint (x_i^t, x_j^t, x_k^t), x_i^t and x_j^t are assumed to share the same class, which is different from that of x_k^t, we expect

dist_M(x_i^t, x_j^t) ≤ dist_M(x_i^t, x_k^t) − 1

Casting these constraints into a regularized loss minimization problem over PSD matrices (M ⪰ 0) gives the formulation referred to below as (5.1), where ℓ(·) is the smoothed hinge loss with smoothing parameter γ:

ℓ(x) = 0 if x ≥ 1;   1 − x − γ/2 if x < 1 − γ;   (1 − x)²/(2γ) otherwise.

One main computational challenge of DML comes from the PSD constraint M ⪰ 0 in (5.1). We address this challenge by following the one-projection paradigm [24] that first learns a metric M without the PSD constraint and then projects M onto the PSD cone at the very end of the learning process. Hence, in this study, we will focus on the following optimization problem for FGVC

min_{M ∈ S^d}   (λ/2) ||M||_F² + Σ_{t=1}^N ℓ(⟨A_t, M⟩)        (5.2)

where A_t = (x_i^t − x_k^t)(x_i^t − x_k^t)^T − (x_i^t − x_j^t)(x_i^t − x_j^t)^T is introduced as a matrix representation for each triplet constraint, and ⟨·,·⟩ represents the dot product between two matrices.
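Because each A_t is built from only two difference vectors, the inner product ⟨A_t, M⟩ in (5.2) never needs A_t to be formed as a d×d matrix; the short Python sketch below illustrates this, together with the smoothed hinge loss defined above. The function names and the default smoothing parameter are our own choices for illustration, not part of the published method.

import numpy as np

def smoothed_hinge(z, gamma=1.0):
    """Smoothed hinge loss ell(z) used in (5.2), with smoothing parameter gamma."""
    if z >= 1.0:
        return 0.0
    if z < 1.0 - gamma:
        return 1.0 - z - gamma / 2.0
    return (1.0 - z) ** 2 / (2.0 * gamma)

def triplet_loss(M, xi, xj, xk, gamma=1.0):
    """ell(<A_t, M>) for one triplet (xi, xj, xk) without materializing A_t.

    <A_t, M> = (xi - xk)^T M (xi - xk) - (xi - xj)^T M (xi - xj),
    i.e. the margin by which the impostor xk is farther from xi than the target xj.
    """
    dik = xi - xk
    dij = xi - xj
    margin = dik @ M @ dik - dij @ M @ dij
    return smoothed_hinge(margin, gamma)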
We will now discuss the strategies to address the three challenges of high dimensional DML, and summarize the framework of high dimensional DML for FGVC at the end of this section.

5.1.1 Constraints Challenge: Multi-stage Division

In order to reliably determine the distance metric in a high dimensional space, a large number of training examples is needed to avoid the overfitting problem. Since the number of triplet constraints can be O(n³), the number of summation terms in (5.2) can be extremely large, making it difficult to effectively solve the optimization problem in (5.2). Although active learning may help reduce the number of constraints [114], the number of active constraints can still be very large, since many images in FGVC from different categories are visually similar, leading to many mistakes. To address this challenge, we divide the learning process into multiple stages. At the s-th stage, let M_{s−1} be the distance metric learned from the last stage. We sample a subset of active triplet constraints that are difficult to classify by M_{s−1} (i.e., incur a large hinge loss). Given M_{s−1} and the sampled triplet constraints N_s, we update the distance metric by solving the following optimization problem

min_{M_s ∈ S^d}   (λ/2) ||M_s − M_{s−1}||_F² + Σ_{t ∈ N_s} ℓ(⟨A_t, M_s⟩)        (5.3)

Although only a small set of constraints is used to improve the metric at each stage, we have

Theorem 5. The metric learned by solving the problem (5.3) is equivalent to optimizing the following objective function

min_{M ∈ S^d}   (λ/2) ||M||_F² + Σ_{k=1}^s Σ_{t ∈ N_k} ℓ(⟨A_t, M⟩)

Proof. Consider the objective function for the first s stages

min_{M ∈ S^d}   [ (λ/2) ||M||_F² + Σ_{k=1}^{s−1} Σ_{t ∈ N_k} ℓ(⟨A_t, M⟩) ]  +  Σ_{t ∈ N_s} ℓ(⟨A_t, M⟩)        (5.4)

where we denote the bracketed term by L_{s−1}(M). It is obvious that L_{s−1} is strongly convex, so we have (Chapter 9, [12])

L_{s−1}(M) = L_{s−1}(M_{s−1}) + ⟨∇L_{s−1}(M_{s−1}), M − M_{s−1}⟩ + (1/2) ⟨M − M_{s−1}, ∇²L_{s−1}(M̃)(M − M_{s−1})⟩

for some M̃ between M and M_{s−1}. Since M_{s−1}, the solution obtained from the first s−1 stages, approximately optimizes L_{s−1}(M) and L_{s−1} is λ-strongly convex, we have

L_{s−1}(M) ≥ L_{s−1}(M_{s−1}) + (λ/2) ||M − M_{s−1}||_F²        (5.5)

We finish the proof by replacing L_{s−1}(M) in (5.4) with the approximation in (5.5).

Remark. This theorem demonstrates that the metric learned at the last stage is optimized over the constraints from all stages. Therefore, the original problem can be divided into several subproblems, each of which involves only an affordable number of active constraints. Fig. 5.2 summarizes the framework of the multi-stage learning procedure.

Figure 5.2: The framework of the proposed method.

5.1.2 Computational Challenge: Dual Random Projection

Now we will solve the high dimensional subproblem by the dual random projection technique, which is similar to that in the last chapter. To simplify the analysis, we investigate the subproblem at the first stage; the following stages can be analyzed in the same way. By introducing the convex conjugate ℓ* for ℓ in (5.3), the dual problem of DML is

max_{α ∈ R^{|N_1|}}   −Σ_{t=1}^{|N_1|} ℓ*(α_t) − (1/(2λ)) α^T G α        (5.6)

where α_t is the dual variable for A_t and G is a matrix defined as G_{a,b} = ⟨A_a, A_b⟩, and M_1 = −(1/λ) Σ_{t=1}^{|N_1|} α_t A_t, obtained by setting the gradient with respect to M_1 to zero. Let R_1, R_2 ∈ R^{d×m} be two Gaussian random matrices, where m is the number of random projections (m ≪ d) and R^1_{i,j}, R^2_{i,j} ~ N(0, 1/m). For each triplet constraint, we project its representation A_t into the low dimensional space using the random matrices, i.e., Â_t = R_1^T A_t R_2. By using double random projections, which is different from the single random projection in [120], we have the following lemma.
Lemma 1. For any pair ⟨A_a, A_b⟩, the double random projections preserve the pairwise similarity in expectation:

E[⟨Â_a, Â_b⟩] = ⟨A_a, A_b⟩

The proof is straightforward. According to the lemma, the dual variables in (5.6) can be estimated in the low dimensional space as

max_{α̂ ∈ R^{|N_1|}}   −Σ_{t=1}^{|N_1|} ℓ*(α̂_t) − (1/(2λ)) α̂^T Ĝ α̂        (5.7)

where Ĝ_{a,b} = ⟨Â_a, Â_b⟩. Then, by the definition of the convex conjugate and the smoothness of the loss function (which is different from the analysis in Chapter 4), each dual variable α̂_t in (5.7) can further be estimated by ℓ'(⟨Â_t, M̂_1⟩), where M̂_1 ∈ R^{m×m} is the metric learned in the reduced space. Generally, M̂_s is learned by solving the following optimization problem

min_{M̂_s ∈ S^m}   (λ/2) ||M̂_s − M̂_{s−1}||_F² + Σ_{t=1}^{|N_s|} ℓ(⟨Â_t, M̂_s⟩)        (5.8)

Since the size of M̂_s ∈ R^{m×m} is significantly smaller than that of M_s, (5.8) can be solved much more efficiently than (5.3). In our implementation, a simple stochastic gradient descent is used to efficiently solve the optimization problem in (5.8). Given M̂_1, the final distance metric M_1 ∈ R^{d×d} in the original space is estimated as

M_1 = −(1/λ) Σ_{t=1}^{|N_1|} α̂_t A_t        (5.9)

5.1.3 Storage Challenge: Low Rank Approximation

Although (5.9) allows us to recover the distance metric M in the original d dimensional space from the dual variables {α̂_t}_{t=1}^{|N|}, it is expensive, if not impossible, to save M in memory since d is very large in FGVC [2]. To address this challenge, instead of saving M, we propose to save a low rank approximation of M. More specifically, let λ_1 ≥ ... ≥ λ_r be the first r ≪ d eigenvalues of M, and let u_1, ..., u_r be the corresponding eigenvectors. We approximate M by a low rank matrix M̃ = Σ_{i=1}^r λ_i u_i u_i^T = L L^T. Different from existing DML methods that directly optimize L [113], we obtain M first and then decompose it, which avoids a suboptimal solution. Unlike M, which requires O(d²) storage space, it only takes O(rd) space to save M̃, and r can be an arbitrary value. In addition, the low rank metric also accelerates the sampling step by reducing the cost of computing a distance from O(d) to O(r). Low rank is also a popular regularizer to avoid overfitting when learning a high dimensional metric [78]. However, the key issue is how to efficiently compute the eigenvectors and eigenvalues of M at each stage. This is particularly challenging in our case, as M in (5.9) cannot even be computed explicitly due to its large size.

To address this problem, we first investigate the structure of the recovering step for the s-th stage as in (5.9):

M_s = M_{s−1} − (1/λ) Σ_{t=1}^{|N_s|} α̂_t^s A_t^s
    = M_{s−2} − (1/λ) ( Σ_{t=1}^{|N_s|} α̂_t^s A_t^s + Σ_{t=1}^{|N_{s−1}|} α̂_t^{s−1} A_t^{s−1} )
    = −(1/λ) Σ_{k=1}^s Σ_{t=1}^{|N_k|} α̂_t^k A_t^k

Therefore, we can express the summation as a matrix multiplication. In particular, for each triplet (x_i^t, x_j^t, x_k^t), we denote its dual variable by α = ℓ'(⟨Â, M̂⟩) and set the corresponding entries in a sparse matrix C as

C(i,j) = α,   C(j,i) = α,   C(j,j) = −α
C(i,k) = −α,  C(k,i) = −α,  C(k,k) = α        (5.10)

It is easy to verify that M can be written as

M = X C X^T        (5.11)

Second, we exploit randomized matrix approximation theory [52] to efficiently compute the eigendecomposition of M. More specifically, let R ∈ R^{d×q} (q ≪ d) be a Gaussian random matrix. According to [52], with an overwhelming probability, most of the top r eigenvectors of M lie in the subspace spanned by the column vectors of MR, provided q ≥ r + k, where k is a constant independent of d. The limitation of the method is that it requires access to the matrix M for computing MR, while keeping the whole matrix is difficult here. Fortunately, by replacing M with XCX^T according to (5.11), we can approximate the top eigenvectors of M using XCX^T R, which is of size d×q and can be computed efficiently since C is a sparse matrix. The overall computational cost of the proposed algorithm for low rank approximation is only O(qnd), which is linear in d. Note that the sparse matrix C is accumulated over all stages.
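Putting (5.10), (5.11) and the randomized range finder together gives a procedure that never materializes the d×d metric. The Python sketch below illustrates that computation; the naming and the use of scipy.sparse are our own, and Alg. 5 below states the actual procedure used in this chapter.

import numpy as np
import scipy.sparse as sp

def low_rank_recover(X, triplets, alphas, q, r):
    """Rank-r factor L with M ~= L L^T, where M = X C X^T as in (5.11).

    X        : d x n data matrix
    triplets : list of (i, j, k) index triples
    alphas   : dual variable alpha for each triplet, as in (5.10)
    q        : number of random combinations (q >= r)
    """
    d, n = X.shape
    C = sp.lil_matrix((n, n))
    for (i, j, k), a in zip(triplets, alphas):       # fill the sparse matrix C as in (5.10)
        C[i, j] += a; C[j, i] += a; C[j, j] -= a
        C[i, k] -= a; C[k, i] -= a; C[k, k] += a
    C = C.tocsr()

    R = np.random.randn(d, q)
    Y = X @ (C @ (X.T @ R))                          # = M R, multiplied right-to-left, O(qnd)
    Q, _ = np.linalg.qr(Y)                           # orthonormal basis for the range of M R
    B = Q.T @ (X @ (C @ (X.T @ Q)))                  # small q x q matrix Q^T M Q
    w, U = np.linalg.eigh((B + B.T) / 2)
    idx = np.argsort(w)[::-1][:r]                    # top-r eigenvalues of the small problem
    w, U = w[idx], U[:, idx]
    keep = w > 0                                     # keep only the positive eigenvalues (PSD part)
    L = (Q @ U[:, keep]) * np.sqrt(w[keep])          # columns are sqrt(lambda_i) * u_i
    return L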
Alg. 5 summarizes the key steps of the proposed approach for low rank approximation, where qr and eig stand for the QR decomposition and the eigendecomposition of a matrix. Note that distributed computing is particularly effective for the realization of the algorithm, because the matrix multiplication XCX^T R can be accomplished in parallel, which is helpful when n is also large.

Algorithm 5: An Efficient Algorithm for Recovering M from M̂ and Projecting It onto the PSD Cone
1: Input: dataset X ∈ R^{d×n}, M̂ ∈ R^{m×m}, the number of random combinations q
2: Compute a Gaussian random matrix R ∈ R^{d×q}
3: Compute a sparse matrix C using (5.10)
4: Y = R^T × X, Y = Y × C, Y = Y × X^T
5: [Q, R] = qr(Y^T)
6: B = Q^T × X, B = B × C, B = B × X^T, B = B × Q
7: [U, Σ] = eig(B)
8: U = QU
9: return L = [√λ_1 u_1, ..., √λ_r u_r] and M̃ = LL^T, where u_i is the i-th column of U and λ_i is the i-th positive diagonal element of Σ

Alg. 6 shows the whole picture of the proposed method. The total cost of the framework is only O(dn).

Algorithm 6: The Multi-stage Metric Learning Framework for High Dimensional DML (MsML)
1: Input: dataset X ∈ R^{d×n}, the number of random projections m, the number of random combinations q, and the number of stages T
2: Compute two Gaussian random matrices R_1, R_2 ∈ R^{d×m}
3: Initialize M̂_0 = 0 ∈ R^{m×m} and M_0 = 0 ∈ R^{d×d}
4: for s = 1, ..., T do
5:   Sample one epoch (O(n)) of active triplet constraints using M_{s−1}
6:   Estimate M̂_s by solving the optimization problem in (5.8) using SGD
7:   Recover the distance metric M_s in the d dimensional space using Alg. 5
8: end for
9: return M_T

5.2 Experiments

DeCAF features [34] are extracted as the image representations in the experiments. Although they come from the activations of a deep convolutional network trained on ImageNet [72], they outperform conventional visual features on many general tasks [34]. We concatenate features from the last three fully connected layers (i.e., DeCAF_5+6+7) and the dimension of the resulting features is 51,456.

We apply the proposed algorithm to learn a distance metric and use the learned metric together with a smoothed k-nearest neighbor classifier, a variant of k-NN, to predict the class assignments for test examples. Different from traditional k-NN, smoothed k-NN first obtains k reference centers for each class by clustering the images in each class into k clusters. To predict the class assignment for a test image x, it computes a distance between x and a class C using the reference centers C_1, ..., C_k as

dis(x, C) = −γ log( Σ_{j=1}^k exp( −|x − C_j|² / γ ) )        (5.12)

and assigns x to the class with the shortest distance. The distance function given in (5.12) computes the soft min among the distances between x and the C_j, and we use the hard min, min_{1≤j≤k} |x − C_j|², when γ = 0. We find that smoothed k-NN is more efficient than the traditional k-NN, especially for large-scale learning problems. We refer to the classification approach based on the metric learned by the proposed algorithm and the smoothed k-NN as MsML, and the smoothed k-NN with the Euclidean distance in the original space as Euclid. Although the size of the covariance matrix is very large (51,456 × 51,456), its rank is low due to the small number of training examples, and thus PCA can be computed explicitly. The state-of-the-art DML algorithm LMNN [114], with PCA as a preprocessing step, is also included in the comparison. The one-vs-all strategy, based on the implementation of LIBLINEAR [37], is used as a baseline for FGVC, with the regularization parameter varied in the range {10^i} (i = −2, ..., 3). We refer to it as LSVM. We also include the state-of-the-art results for FGVC in our evaluation. All the parameters used by MsML are set empirically, with the number of random projections m = 100 and the number of random combinations q = 600. PCA is applied for LMNN to reduce the dimensionality to m before DML is learned. LMNN is implemented by the code from the original authors and the recommended parameters are used.² To ensure that the baseline method fully exploits the training data, we set the maximum number of iterations for LMNN to 10⁴. These parameter values are used throughout all the experiments. All training/test splits are provided by the datasets.
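Before turning to the datasets and results, the smoothed k-NN rule in (5.12) can be written down in a few lines; the Python sketch below is a toy illustration with our own naming, assuming the per-class reference centers have already been obtained by clustering in the space induced by the learned metric.

import numpy as np

def smoothed_knn_predict(x, class_centers, gamma=1.0):
    """Predict the class of x with the smoothed k-NN rule (5.12).

    class_centers : dict mapping class label -> (k x d) array of reference
                    centers obtained by clustering the training images of
                    that class (e.g. k-means in the learned embedding).
    gamma         : smoothing parameter; gamma = 0 recovers the hard min.
    """
    best_label, best_dist = None, np.inf
    for label, centers in class_centers.items():
        sq = np.sum((centers - x) ** 2, axis=1)                       # |x - C_j|^2 for each center
        if gamma <= 0:
            dist = sq.min()                                           # hard min
        else:
            dist = -gamma * np.log(np.sum(np.exp(-sq / gamma)))       # soft min of (5.12)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label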
Mean accuracy, a standard evaluation metric for FGVC, is used to evaluate the classification performance. All experiments are run on a single machine with 16 2.10GHz cores and 96GB memory.

² We did vary the parameters slightly from the recommended values and did not find any noticeable change in the classification accuracy.

5.2.1 Oxford Cats & Dogs

cats&dogs contains 7,349 images from 37 cat and dog species [88]. There are about 100 images per class for training and the rest are for testing. Table 5.1 summarizes the results. First, we observe that MsML is more accurate than the baseline LSVM. This is not surprising, because the distance metric is learned from the training examples of all class assignments. This is in contrast to the one-vs-all approach used in LSVM, in which the classification function for a class C is learned only from the class assignments of C. Second, our method performs significantly better than the baseline DML method, indicating that the unsupervised dimension reduction method PCA may result in suboptimal solutions for DML. Fig. 5.3 compares the images that are most similar to the query images using the metric learned by the proposed algorithm (Columns 8-10) to those based on the metric learned by LMNN (Columns 5-7) and Euclid (Columns 2-4). We observe that more similar images are found by the metric learned by MsML than by LMNN. For example, MsML is able to capture the difference between two cat species (long hair vs. short hair), while LMNN returns very similar images with the wrong label. More examples are given in Fig. 5.6. Third, MsML has overwhelming performance compared to all state-of-the-art FGVC approaches. Although the method [88] using the ground truth head bounding box and segmentation achieves 59.21%, MsML is 20% better than it with only image information, which shows the advantage of the proposed method. Finally, it takes less than 0.2 seconds to extract DeCAF features per image based on a CPU implementation, while a simple segmentation operator costs more than 2.5 seconds as reported in the study [2], making the proposed method for FGVC practically more appealing.

Table 5.1: Comparison of mean accuracy (%) on the cats&dogs dataset. "#" means that more information (e.g., ground truth segmentation) is used by the method.

Methods                     Mean Accuracy (%)
Image only [88]             39.64
Det+Seg [2]                 54.30
Image+Head+Body# [88]       59.21
Euclid                      72.60
LSVM                        77.63
LMNN                        76.24
MsML                        80.45
MsML+                       81.18

Figure 5.3: Examples of retrieved images. The first column shows query images highlighted by green bounding boxes. Columns 2-4 include the most similar images measured by Euclid. Columns 5-7 show those by the metric from LMNN. Columns 8-10 are from the metric of MsML. Images in columns 2-10 are highlighted by red bounding boxes when they share the same category as the queries, and by blue bounding boxes if they do not.

To evaluate the performance of MsML for extremely high dimensional features, we concatenate conventional features by using the pipeline for visual feature extraction that is outlined in [2]. Specifically, we extract HOG [30] features at 4 different scales and encode them with an 8K-dimensional feature dictionary by the LLC method [113]. A max pooling strategy is then used to aggregate local features into a single vector representation. Finally, 82,560 features are extracted from each image and the total dimension is up to 134,016. MsML with the combined features is denoted as MsML+, and we can observe that it further improves the performance by about 1%, as shown in Table 5.1. Note that the time for extracting these high dimensional conventional features is only 0.5 seconds per image, which is still much cheaper than any segmentation or localization operator.

5.2.2 Oxford 102 Flowers

102flowers is the Oxford flowers dataset for flower species [86], which consists of 8,189 images from 102 classes. Each class has 20 images for training and the rest are for testing. Table 5.2 shows the results from different methods. We have a similar conclusion for the methods using the same DeCAF features. That is, MsML outperforms LSVM and LMNN significantly.
Although LSVM already performs very well, MsML further improves the accuracy. Additionally, it is observed that the performance of state-of-the-art methods, even with segmentation operators, is much worse than that of MsML. Note that GT [86] uses hand annotated segmentations followed by a multiple kernel SVM, while MsML outperforms it by about 3% without any such supervised information, which confirms the effectiveness of the proposed method.

Fig. 5.4 illustrates the changing trend of test mean accuracy as the number of stages increases. We observe that MsML converges very fast, which verifies that the multi-stage division is essential in our proposed framework.

Table 5.2: Comparison of mean accuracy (%) on the 102flowers dataset. "#" means that more information (e.g., ground truth segmentation) is used by the method.

Methods                     Mean Accuracy (%)
Combined CoHoG [65]         74.80
Combined Features [85]      76.30
BiCoS-MT [19]               80.00
Det+Seg [2]                 80.66
TriCoS [21]                 85.20
GT# [86]                    85.60
Euclid                      76.21
LSVM                        87.14
LMNN                        81.93
MsML                        88.39
MsML+                       89.45

Figure 5.4: Convergence curve of the proposed method on 102flowers (test mean accuracy versus the number of stages).

5.2.3 Birds-2011

birds11 is the Caltech-UCSD-200-2011 birds dataset for bird species [109]. There are 200 classes with a total of 11,788 images, and each class has roughly 30 images for training. We use the version with the ground truth bounding box. Table 5.3 compares the proposed method to the state-of-the-art baselines. First, it is obvious that the performance of MsML is significantly better than all baseline methods, consistent with the observations above. Second, although Symb [20] combines segmentation and localization, MsML outperforms it by 9% without any time consuming operator. Third, Symb* and Ali* mirror the training images to improve their performance, while MsML is even better without this trick. Finally, MsML outperforms the method combining DeCAF features and DPD models [121], which is due to the fact that most research on FGVC ignores choosing an appropriate base classifier and simply adopts a linear SVM with the one-vs-all strategy. For comparison, we also report the result with mirrored training images, denoted as MsML+*. It provides another 1% improvement over MsML+, as shown in Table 5.3.

Table 5.3: Comparison of mean accuracy (%) on the birds11 dataset. "*" denotes a method that mirrors the training images.

Methods                     Mean Accuracy (%)
Symb [20]                   56.60
POOF [9]                    56.78
Symb* [20]                  59.40
Ali* [44]                   62.70
DeCAF+DPD [34]              64.96
Euclid                      46.85
LSVM                        61.44
LMNN                        51.04
MsML                        65.84
MsML+                       66.61
MsML+*                      67.86

Figure 5.5: Comparison with different numbers of auxiliary classes on birds11 (mean accuracy of LSVM and MsML versus the number of auxiliary classes).

To illustrate the capacity of MsML in exploring the correlation among classes, which makes it more effective than a simple one-vs-all classifier for FGVC, we conduct one additional experiment. We randomly select 50 classes from birds11 as the target classes and use the test images from the target classes for evaluation. When learning the metric, besides the training images from the 50 target classes, we sample k classes from the 150 unselected ones as the auxiliary classes, and use training images from the auxiliary classes as additional training examples for distance metric learning. Fig. 5.5 compares the performance of LSVM and MsML with an increasing number of auxiliary classes. It is not surprising to observe that the performance of LSVM decreases a little, since it is unable to explore the supervision information in the auxiliary classes to improve the classification accuracy on the target classes, and more auxiliary classes just intensify the class imbalance problem. In contrast, the performance of MsML improves significantly with increasing auxiliary classes, indicating that MsML is capable of effectively exploring the training data from the auxiliary classes and is therefore particularly suitable for FGVC.
5.2.4 Stanford Dogs

S-dogs is the Stanford dog species dataset [70]. It contains 120 classes and 20,580 images, where 100 images from each class are used for training. Since it is a subset of ImageNet [92], on which the DeCAF model is trained, we just report the result in Table 5.4 as a reference.

Table 5.4: Comparison of mean accuracy (%) on the S-dogs dataset. "*" denotes a method that mirrors the training images.

Methods                     Mean Accuracy (%)
SIFT [70]                   22.00
Edge Templates [118]        38.00
Symb [20]                   44.10
Symb* [20]                  45.60
Ali* [44]                   50.10
Euclid                      59.22
LSVM                        65.00
LMNN                        62.17
MsML                        69.07
MsML+                       69.80
MsML+*                      70.31

5.2.5 Comparison of Efficiency

In this section, we compare the training time of the proposed algorithm for high dimensional DML to that of LSVM and LMNN. MsML is implemented in Julia, which is a little slower than C,³ while LSVM uses the LIBLINEAR package, the state-of-the-art algorithm for solving linear SVM, implemented mostly in C. The core part of LMNN is also implemented in C. The time for feature extraction is not included here because it is shared by all the methods in comparison. The running time for MsML includes all operational costs (i.e., the cost for triplet constraint sampling, computing the random projection, and the low rank approximation).

³ A detailed comparison can be found at http://julialang.org

Table 5.5: Comparison of running time (seconds).

Methods    cats&dogs   102flowers   birds11   S-dogs
LSVM       196.2       309.8        1,417.0   1,724.8
LMNN       832.58      702.7        1,178.2   1,643.6
MsML       164.91      174.4        413.1     686.3
MsML+      337.20      383.7        791.3     1,229.7

Table 5.5 summarizes the results of the comparison. First, it takes MsML about 1/3 of the time of LMNN to complete the computation. This is because MsML employs a stochastic optimization method to find the optimal distance metric while LMNN is a batch learning method. Second, we observe that the proposed method is significantly more efficient than LSVM on most of the datasets. The high computational cost of LSVM mostly comes from two aspects. First, LSVM has to train one classification model for each class, and becomes significantly slower when the number of classes is large. Second, the fact that images from different classes are visually similar makes it computationally difficult to find the optimal linear classifier that can separate images of one class from images of the other classes. In contrast, the training time of MsML is independent of the number of classes, making it more appropriate for FGVC. Finally, the running time of MsML+ with 134,016 features only doubles that of MsML, which verifies that the proposed method is linear in the dimensionality (O(d)).

5.3 Conclusions

In this chapter, we develop a multi-stage metric learning framework for the high dimensional FGVC problem, which addresses the challenges arising in conventional DML. More specifically, it divides the original problem into multiple stages to handle the challenge arising from too many triplet constraints, extends the theory of dual random projection to address the computational challenge of high dimensional data, and uses a randomized algorithm for low rank matrix approximation to handle the storage challenge. The empirical study shows that the proposed method with general purpose features yields performance that is significantly better than the state-of-the-art approaches for FGVC. In the future, we could try to combine the proposed DML algorithm with segmentation and localization to further improve the performance of FGVC.

Figure 5.6: Examples of retrieved images. The first column shows query images highlighted by green bounding boxes. Columns 2-4 include the most similar images measured by Euclid. Columns 5-7 show those by the metric from LMNN. Columns 8-10 are from the metric of MsML. Images in columns 2-10 are highlighted by red bounding boxes when they share the same category as the queries, and by blue bounding boxes if they do not.
Chapter 6

Feature Selection for Face Recognition: A Distance Metric Learning Approach

In this chapter, we study the feature selection problem in face recognition. Face recognition involves identifying whether two face images belong to the same person (i.e., face verification) or which person a face image belongs to (i.e., face classification). It is an important application of computer vision and has been studied extensively in the past decades [26, 100, 3, 62, 25]. Many face recognition methods require estimating the similarity between images appropriately using the technique of distance metric learning [50, 60, 61].

One challenge in applying DML to estimate the similarity of face images is the high dimensionality of the face representations used to capture the subtle differences between faces [25, 3]. The high dimensionality comes from the fact that over-complete descriptors are sampled for face images, and the total number of features can be up to 1,000,000 [61]. Most existing DML methods project the high-dimensional face features into a low-dimensional space with unsupervised dimension-reduction techniques (e.g., PCA) and then learn a metric in the low-dimensional space [60, 50]. The resulting solution can be suboptimal since important information about the face images is lost after dimension reduction.

By further investigating the problem, we find that although the dimensionality of the feature vectors for face images is very high, the number of features used by each descriptor is significantly smaller and is only 45 in our study. It is thus efficient to learn a metric for the group of features from each descriptor. On the other hand, since the descriptors are over-complete, a subset of them may be sufficient to verify face images. Fig. 6.1 illustrates the feature selection (i.e., descriptor selection) for face verification. Each descriptor corresponds to a small patch on the face, and a subset of them (e.g., eyes, nose, etc.) can capture most of the differences between faces.

Figure 6.1: Illustration of feature selection for face verification. Although the descriptors are over-complete, a subset of them (e.g., eyes, nose, etc.) can capture most of the differences between faces.

In this chapter, we propose a DML method that can select a subset of descriptors and learn a distance metric for each selected descriptor simultaneously. Unlike previous methods that learn a single metric in the original feature space, we learn a metric for the feature group corresponding to the descriptor selected at every iteration, and the ensemble of metrics is used for verification (a small sketch of this ensemble distance is given at the end of this introduction). The method avoids dealing with high-dimensional features directly while keeping useful information from all descriptors.

The main contributions of this work are summarized as follows.

• We propose a DML method to handle the over-complete descriptors in face verification. At each iteration, a linear optimization problem is solved to simultaneously select a descriptor and learn the corresponding metric. This problem has a closed-form solution, which makes the computation efficient.

• We prove that the proposed method has an O(1/T) convergence rate, where T is the number of iterations, which equals the number of selected descriptors in this work.

• To reduce the high computational cost of the exhaustive search that selects one single descriptor at each iteration, we exploit a mini-batch strategy and incremental sampling, respectively. It is shown theoretically and empirically that the accelerated methods are able to yield similar performance with significantly reduced computational costs. The hybrid method combining both strategies achieves the best efficiency in our empirical study.

The rest of this chapter is organized as follows: Section 6.1 describes the proposed approach. Section 6.2 presents the results of the empirical study, and Section 6.3 concludes this work with future research directions.
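To make the grouping idea concrete, the short Python sketch below splits a d-dimensional face representation into descriptor groups of k = 45 features each and evaluates the ensemble distance as the sum of per-group Mahalanobis distances over the selected descriptors. The function names and the summation over selected groups are illustrative assumptions, not the exact interface of our implementation.

```python
import numpy as np

def split_into_groups(x, k=45):
    """Split a d-dimensional face representation into groups of k features,
    one group per local descriptor (d is assumed to be a multiple of k)."""
    return x.reshape(-1, k)                  # shape (S, k) with S = d / k

def ensemble_distance(x, y, metrics, k=45):
    """Distance between two faces under an ensemble of per-descriptor metrics.

    metrics : dict mapping a selected descriptor index s to its k x k PSD matrix M_s.
    Only the selected descriptors contribute to the distance.
    """
    xg, yg = split_into_groups(x, k), split_into_groups(y, k)
    dist = 0.0
    for s, M_s in metrics.items():
        diff = xg[s] - yg[s]
        dist += diff @ M_s @ diff            # (x^s - y^s)^T M_s (x^s - y^s)
    return dist
```

A pair of faces would then be declared to belong to the same person when this ensemble distance falls below a learned threshold.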
6.1 DML for feature selection

Given a set of face images X \in R^{d \times n}, DML aims to learn a metric that estimates pairwise distances as

    dist_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j)

In face verification, a good metric can separate the images from the same person and those from different persons as

    \forall x_i, x_j: dist_M(x_i, x_j) \le b - 1;    \forall x_i, x_k: dist_M(x_i, x_k) \ge b + 1

where x_i and x_j are from the same person while x_k is from a different person, and b is a pre-defined threshold.

The metric can be learned by solving the corresponding optimization problem

    \min_{M \in S_+^d} L(M) = \frac{1}{N} \sum_{i,j} \ell(Y_{ij}(dist_M(x_i, x_j) - b))        (6.1)

where S_+^d is the set of positive semi-definite (PSD) matrices of size d \times d, and \ell(\cdot) can be any smooth loss function; it is the logistic loss in our work:

    \ell(z) = \log(1 + \exp(z + 1))

Y_{ij} \in \{-1, +1\} is the indicator, where Y_{ij} = 1 if x_i and x_j are from the same person and Y_{ij} = -1 otherwise.

It is hard to handle the problem in (6.1) directly by conventional DML methods due to the high dimensionality of the feature space. On the other hand, the features are known to be redundant and the number of features from each descriptor is small, making it possible to efficiently learn a metric for every group of features. Therefore, we propose an iterative algorithm that is able to perform feature selection and metric learning simultaneously: it selects one informative descriptor at each iteration and learns a metric for the features of the selected descriptor.

At the t-th iteration, we select the descriptor as

    \arg\min_{s=1,\dots,S} \min_{M_t \in S_+^k, \|M_t\|_F \le 1} \langle \nabla L(Z_{t-1}), z_t^s \rangle

where S is the number of descriptors, k is the dimensionality of the features from a single descriptor, and kS = d. Here \nabla L(Z_{t-1}) = [\ell'(Z_{t-1}^{i,j})], z_t^s = [Y_{i,j}(dist_{M_t}(x_i^s, x_j^s) - b)], and x^s is the s-th feature group of an example. It is a linear optimization problem and is equivalent to

    \arg\min_{s=1,\dots,S} \min_{M_t \in S_+^k, \|M_t\|_F \le 1} \sum_{i,j} Y_{i,j} \ell'(Z_{t-1}^{i,j}) \langle A_{i,j}^s, M_t \rangle        (6.2)

where A_{i,j}^s = (x_i^s - x_j^s)(x_i^s - x_j^s)^\top. The intuition of Eqn. 6.2 is to learn the optimal metric for each feature group and then greedily select the feature group whose distance is closest to the negative direction of the gradient under the learned metric.

The metric is learned from the subproblem

    \min_{M_t \in S_+^k, \|M_t\|_F \le 1} \sum_{i,j} Y_{i,j} \ell'(Z_{t-1}^{i,j}) \langle A_{i,j}^s, M_t \rangle

According to the K.K.T. condition [12], the problem has the closed-form solution

    M_t = \Pi_{PSD, \|M\|_F = 1}\Big(-\sum_{i,j} Y_{i,j} \ell'(Z_{t-1}^{i,j}) A_{i,j}^s\Big)        (6.3)

where \Pi_{PSD, \|M\|_F = 1}(M) projects the matrix onto the PSD cone and rescales it to unit Frobenius norm. By substituting Eqn. 6.3 back into Eqn. 6.2, the selection criterion becomes

    \arg\min_{s=1,\dots,S} Score_s = -\Big\|\Pi_{PSD}\Big(-\sum_{i,j} Y_{i,j} \ell'(Z_{t-1}^{i,j}) A_{i,j}^s\Big)\Big\|_F        (6.4)

After selecting the descriptor with the minimal score, the distance is updated as

    Z_t^{i,j} = (1 - \eta_t) Z_{t-1}^{i,j} + \eta_t z_t^{i,j}

The step size \eta_t can be optimized by solving the problem

    \min_{\eta_t: 0 \le \eta_t < 1} \frac{1}{N} \sum_{i,j} \ell\big((1 - \eta_t) Z_{t-1}^{i,j} + \eta_t z_t^{i,j}\big)

with Newton's method, or set to the pre-defined value \eta_t = 2/(t+1). Alg. 7 summarizes the key steps of the proposed method.

Algorithm 7 Greedy Coordinate Descent Metric Learning for Face Verification (Greco)
1: Input: dataset X \in R^{d \times n}, #iterations T
2: for t = 1, \dots, T do
3:   Select a single descriptor according to Eqn. 6.4
4:   Compute the corresponding metric as in Eqn. 6.3
5:   Update the distance using the selected feature group and the learned metric as Z_t^{i,j} = (1 - \eta_t) Z_{t-1}^{i,j} + \eta_t z_t^{i,j}
6: end for
7: return Z_T

The greedy coordinate descent method in Alg. 7 is similar to the Frank-Wolfe method [41, 43] or the conditional gradient descent method [75], but it learns an ensemble of metrics rather than a single model as in previous work. We have the following theorem as a guarantee of the performance.

Theorem 6. Let Z_T be the solution of Alg. 7 after T iterations and Z^* be the optimal solution. Assume that the loss function \ell(\cdot) is convex and \beta-smooth and \forall i,j: \|z_i - z_j\|^2 \le D. By setting \eta_t = 2/(t+1), we have

    L(Z_T) - L(Z^*) \le \frac{2\beta D}{T+1}

The proof is similar to that of the Frank-Wolfe method.

Proof.

    L(Z_t) - L(Z^*) = L((1 - \eta_t) Z_{t-1} + \eta_t z_t) - L(Z^*)
                    \le L(Z_{t-1}) - L(Z^*) + \eta_t \langle \nabla L(Z_{t-1}), z_t - Z_{t-1} \rangle + \frac{\beta \eta_t^2}{2} \|Z_{t-1} - z_t\|^2        (6.5)
                    \le L(Z_{t-1}) - L(Z^*) + \eta_t \langle \nabla L(Z_{t-1}), Z^* - Z_{t-1} \rangle + \frac{\beta \eta_t^2}{2} D        (6.6)
                    \le (1 - \eta_t)(L(Z_{t-1}) - L(Z^*)) + \frac{\beta \eta_t^2}{2} D        (6.7)

Eqn. 6.5 holds because the loss function is \beta-smooth. Eqn. 6.6 follows from the fact that z_t is the optimal solution of Problem 6.2 and Z^* lies in the same convex hull as z_t. Eqn. 6.7 is due to the convexity of L.

By setting \eta_t = 2/(t+1), we prove the theorem by induction with the hypothesis

    L(Z_T) - L(Z^*) \le \frac{2\beta D}{T+1}

When T = 1, \eta_1 = 1 and it is obvious that

    L(Z_1) - L(Z^*) \le \frac{\beta D}{2}

For T = t, we have

    L(Z_t) - L(Z^*) \le \Big(1 - \frac{2}{t+1}\Big)\frac{2\beta D}{t} + \frac{2\beta D}{(t+1)^2} = \frac{2\beta D (t^2 + t - 1)}{t(t+1)^2} \le \frac{2\beta D}{t+1}
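The following Python sketch implements one iteration of Alg. 7 on a toy pair set, following Eqns. 6.3 and 6.4. It is a minimal illustration assuming dense NumPy arrays and an explicit loop over pairs, not the optimized implementation used in our experiments, and the helper names are hypothetical.

```python
import numpy as np

def psd_part(G):
    """Project a symmetric matrix onto the PSD cone (clip negative eigenvalues)."""
    w, V = np.linalg.eigh((G + G.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T

def greco_step(X_groups, pairs, Y, Z_prev, b, t):
    """One iteration of greedy descriptor selection and metric learning (sketch of Alg. 7).

    X_groups : list of S arrays, each of shape (n, k), one per descriptor group
    pairs    : list of (i, j) index pairs
    Y        : array of +1 / -1 labels for the pairs
    Z_prev   : array with the current estimates Z_{t-1}^{i,j}, one entry per pair
    """
    # derivative of the logistic loss l(z) = log(1 + exp(z + 1)) at the current estimates
    grad = 1.0 / (1.0 + np.exp(-(Z_prev + 1.0)))

    best_score, best_s, best_M = np.inf, None, None
    for s, Xs in enumerate(X_groups):
        # G^s = sum_{i,j} Y_ij * l'(Z_{t-1}^{i,j}) * (x_i^s - x_j^s)(x_i^s - x_j^s)^T
        G = np.zeros((Xs.shape[1], Xs.shape[1]))
        for (i, j), y, g in zip(pairs, Y, grad):
            d = Xs[i] - Xs[j]
            G += y * g * np.outer(d, d)
        P = psd_part(-G)
        norm = np.linalg.norm(P, 'fro')
        score = -norm                                 # Eqn. 6.4: the smaller, the better
        if score < best_score:
            best_score, best_s = score, s
            best_M = P / norm if norm > 0 else P      # Eqn. 6.3: unit Frobenius norm

    # update the distance estimates with the selected descriptor (step 5 of Alg. 7)
    Xs = X_groups[best_s]
    z_new = np.array([y * ((Xs[i] - Xs[j]) @ best_M @ (Xs[i] - Xs[j]) - b)
                      for (i, j), y in zip(pairs, Y)])
    eta = 2.0 / (t + 1)                               # pre-defined step size
    return (1.0 - eta) * Z_prev + eta * z_new, best_s, best_M
```

Running this step for T iterations and keeping the per-iteration pair (best_s, best_M) yields the ensemble of per-descriptor metrics used for verification.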
6.1.1 Adaptive Mini-batch Strategy

Although the proposed method can handle the high-dimensionality challenge in face verification, it has to exhaustively search through all descriptors to select only one descriptor at each iteration, which is computationally expensive when the number of descriptors is very large. We propose two variants of the DML-based feature selection method to improve the efficiency.

First, we investigate the mini-batch strategy, which selects a mini-batch of descriptors rather than a single descriptor at each iteration. After computing the score of each descriptor as in Eqn. 6.4, the top m descriptors are selected. Then, a larger metric (i.e., of size mk \times mk) is learned from all features of the selected descriptors as

    M_t = \Pi_{PSD, \|M_t\|_F = 1/m}\Big(-\sum_{i,j} Y_{i,j} \ell'(Z_{t-1}^{i,j}) A_{i,j}^{s_1,\dots,s_m}\Big)

Note that the norm of M_t is shrunk by 1/m to make sure that the norm of z_t is still bounded over multiple feature groups. This is because

    \|A_{i,j}^{s_1,\dots,s_m}\|_F \le m \big(\max_{i,j,s} \|A_{i,j}^s\|_F\big)

This shrinkage ratio can also be estimated empirically as \max_{i,j}\|A_{i,j}^{s_1,\dots,s_m}\|_F / \max_{i,j,s}\|A_{i,j}^s\|_F for a tighter approximation.

Given the normalized score of the mini-batch of descriptors, the update is performed adaptively by comparing it to the score of the optimal single descriptor. If the former score is better than the latter, the distance is updated by the mini-batch of feature groups. Otherwise, the single optimal descriptor is used. Alg. 8 shows the method with the mini-batch strategy.

Algorithm 8 Greedy Coordinate Descent Metric Learning with Adaptive Mini-batch (Greco-mini)
1: Input: dataset X \in R^{d \times n}, #iterations T, mini-batch size m
2: for t = 1, \dots, T do
3:   Select the top m descriptors according to Eqn. 6.4
4:   Compute a single metric from the selected feature groups
5:   if \langle \nabla L(Z_{t-1}), z_{M_{mini-batch}} \rangle \le \langle \nabla L(Z_{t-1}), z_{M_{single}} \rangle then
6:     M_t = M_{mini-batch}
7:   else
8:     M_t = M_{single}
9:   end if
10:  Update the distance using the selected feature groups and the learned metric as Z_t^{i,j} = (1 - \eta_t) Z_{t-1}^{i,j} + \eta_t z_t^{i,j}
11: end for
12: return Z_T

We can prove that the mini-batch strategy will not affect the convergence rate if its score is better than that of the single descriptor, which is stated in the following theorem.

Theorem 7. Let Z_T be the solution of Alg. 8 after T iterations and Z^* be the optimal solution. Assume that the loss function \ell(\cdot) is convex and \beta-smooth and \forall i,j: \|z_i - z_j\|^2 \le D. By setting \eta_t = 2/(t+1) and assuming that the score of the mini-batch of descriptors is better than the score of the optimal single descriptor by \epsilon \ge 0, we have

    L(Z_T) - L(Z^*) \le \frac{2\beta D - 2\epsilon}{T+1}

The proof is straightforward and we list it for completeness.

Proof. We consider the case where the solution of each iteration is slightly better than the single optimal solution by a factor \epsilon, where \epsilon \ge 0 and \langle \nabla L(Z_{t-1}), \tilde{z}_t \rangle \le \langle \nabla L(Z_{t-1}), z_t \rangle - \epsilon. Following a similar process as above, we have

    L(Z_t) - L(Z^*) = L((1 - \eta_t) Z_{t-1} + \eta_t \tilde{z}_t) - L(Z^*)
                    \le L(Z_{t-1}) - L(Z^*) + \eta_t \langle \nabla L(Z_{t-1}), z_t - Z_{t-1} \rangle - \eta_t \epsilon + \frac{\beta \eta_t^2}{2} \|Z_{t-1} - \tilde{z}_t\|^2
                    \le (1 - \eta_t)(L(Z_{t-1}) - L(Z^*)) - \eta_t \epsilon + \frac{\beta \eta_t^2}{2} D

We prove the result by induction. We set \eta_t = 2/(t+1) as before and take the induction hypothesis

    L(Z_T) - L(Z^*) \le \frac{2\beta D - 2\epsilon}{T+1}

When T = 1, it is

    L(Z_1) - L(Z^*) \le \frac{\beta D}{2} - \epsilon

When T = t, we have

    L(Z_t) - L(Z^*) \le \Big(1 - \frac{2}{t+1}\Big)\frac{2\beta D - 2\epsilon}{t} - \frac{2\epsilon}{t+1} + \frac{2\beta D}{(t+1)^2} \le \frac{2\beta D - 2\epsilon}{t+1}

Remark. There is a trade-off between the performance and the number of selected descriptors. Although the number of iterations for the method with the mini-batch strategy is smaller than that of the original method to achieve the same performance, the number of selected descriptors can be larger than that of the original one. However, if the solution of the mini-batch is good enough, i.e.,

    2\epsilon \ge 2\beta D \Big(1 - \frac{T+1}{Tm+1}\Big)

where m is the mini-batch size, we have

    L(Z_T) - L(Z^*) \le \frac{2\beta D}{Tm+1}

which means that even with the same number of descriptors, the method with the mini-batch strategy performs the same as the original method while the running time is reduced by a factor of m in this ideal case.

6.1.2 Incremental Sampling Strategy

In this section, we exploit a sampling strategy to reduce the number of candidate descriptors at each iteration. Different from uniform sampling, a subset of size (1 - \lambda_t^{t+1})S is sampled at the t-th iteration, where 0 \le \lambda_t