DATAANALYSISANDSELECTIONFORSTATISTICALMACHINETRANSLATIONBySaulehEetemadiADISSERTATIONSubmittedtoMichiganStateUniversityinpartialoftherequirementsforthedegreeofElectricalEngineeringŒDoctorofPhilosophy2016ABSTRACTDATAANALYSISANDSELECTIONFORSTATISTICALMACHINETRANSLATIONBySaulehEetemadiStatisticalMachineTranslationhasreceivedattentionfromtheacademiccommu-nityoverthepastdecadewhichhasledtoimprovementsinmachinetranslationquality.Asaresultitiswidelyadoptedintheindustry(Google,Microsoft,Twitter,Facebook,...etc.)aswellasthegovernment(http://nist.gov).Thebiggestfactorinthisimprovementhasbeentheavailabilityofeverincreasingsourcesoftrainingdataasdigitalmultilingualcommunicationandinformationdisseminationbecomeubiquitous.Relativelylittleresearchhasbeendoneontrainingdataanalysisandselection,despitetrainingdatabeingthemaincontributorofmachinetranslationquality.Inthiswork,weexaminefundamentalpropertiesoftranslatedandauthoredtext.Weintroduceanewlinguisticallymotivatedfeature(PartofSpeechTagMinimalTranslationUnits)thatoutperformspriorworkinsentenceleveltranslationdirectiondetection.Next,wedevelopacross-domaindatamatrixthatenablescomparisonbetweendifferentfeaturesinthetranslationdirectiondetectiontask.Weextendourpreviouslyintroducedfeaturefortranslationdirectiondetectiontousestatisticallytrainedbrownclustersinsteadofpartofspeechtags.Thisnewfeatureoutperformsallpriorworkinallcross-domaindatamatrixcombinations.Dataselectioninmachinetranslationisperformedindifferentscenarioswithdifferentobjec-tivesincluding:reducingtrainingresourceconsumption,domainadaptation,improvingqualityorreducingdeploymentsize.Wedevelopanef(computationalcomplexityandmemorycon-sumptionislinearintrainingdatasize)frameworkfortrainingdataselectionandcompressioncalledVocabularySaturationFilter(VSF).InourexperimentsweshowthemachinetranslationsystemtrainedondataselectedusingVSFiscomparabletopriordataselectionmethodswithquadraticcomputationalcomplexity.However,VSFissensitivetodataorder.Thereforeweexper-imentwithdifferentorderingsofthedataandcomparetheresults.Finally,wedevelopahighlyscalableandxibledataselectionframeworkwherearbitrarysentencelevelfeaturescanbeusedfordataselection.Inaddition,avariablethresholdfunctioncanbeusedtoincorporateanyscoringfunctionthatisconstantthroughouttheselectionprocess.Afterintroducingthisframework,inspiredbythefeaturesweintroducedfordetectingtranslationdirection,weusejointmodelsofsourceandtargetusingMinimalTranslationUnits(MTU)inadditiontosourcesidecontextusingbrownclusterstocomparevariousfeaturesandthresholdfunctionswithinthisframework.Werunend-to-endexperimentsusingdataselectedbyvariousmethodsandcomparethestatisticaltranslationmodelsusingvarioustestsetsandphrasetablecomparisonmetrics.CopyrightbySAULEHEETEMADI2016Toanyonewhowillreadanduseit!vACKNOWLEDGEMENTSIstartedmyPh.D.in2006shortlyafterIjoinedMicrosoftResearchasasoftwareengineer.Indeed,myjobatMicrosoftResearchwastheonlyaspectofmylifethatdidnotchangeduringtheseyears.Wewererenting,thenweboughtacondo.Iwasoverweight,IstartedthenewsportofBrazilianJujitsu,Ilostweightandeventuallyearnedablackbelt.Iwenttopilgrimage.Then,Iwentagain.Ibecameafather.Wetravelledtomanydifferentcountriesmanytimes.Weleftthecondo,becamealandlordandstartedrentingagain.Wehadourseconddaughter.Werenovatedourhouse.Mywifewentthroughlifethreateningsurgeriesandrecovered.Ilostmygrandfatherandgrandmotherandsoon....WhenIstarted,Ineverthoughtitwouldtakeme10yearstodefendmydissertationandgraduate.WhileIalwaysknew,I'mnotgoingtogiveup,IneverknewhowmuchhelpIneeded.Firstandforemost,mydeepestgratitudeisfortheOnewhopresentedmewiththischallengealongwithmanyotherchallengesinlife,soImaybecomebetter.Iwouldliketothankmyadviser,Dr.Radha.WhilehewarnedmeofthedifofdoingaPh.D.andworkingafulltimejobatthesametime,hisintricatecombinationofpatience,supportandchallengewasinstrumentalinhelpingmetoovercomethisdif.Ifitwasn'tfortheincrediblehelpandsupportofmyco-advisorDr.KristinaToutanova,IwouldnothavebeenabletocompletemyPh.D.Iwanttoexpressmygratitudetoherformentoringmeandprovidingmewithsupportiveandcriticalfeedback.IwouldalsoliketothankDr.RongJinwhoservedonmycommitteebeforeleavingMSU.Ifitwasn'tforhissuggestiontohaveaco-advisorthatcanmentormeintheareaofStatisticalMachineTranslation,IwouldnothavehadKristinaasmyco-advisor.IalsowanttothankDr.EsfahanianwhoagreedtoreplaceDr.Jininmycommitteeonlyfewmonthsbeforemydefense.I'mthankfultomycommitteemembersDr.DellerandDr.Aviyentefortheirsupportandfeedbackaswell.I'mextremelythankfultomymangersandcolleaguesattheMachineTranslationGroupinMicrosoftResearchwhowerealwayssupportiveandencouragingtowardsthepursuitofmyPh.D.viMymanagers,Steve,Andreas,Federico,Mei-Yuh,ArulandVishalwerealwaysxibleandsup-portiveofmyPh.D.andforthat,I'mverythankful.WhileIlackedtheofhavinglabmatesformanyyears,JonClarkandQinGaothisgapastheyjoinedourgroupfewyearsagowhiletheywerePh.D.studentsaswell.Westartedhavingweeklymeetingswhichwasincrediblyhelpfulinpushingustowardsgraduationandforthis,I'mverythankful.IhadmanyfruitfuldiscussionswithmycolleagueandmentorDr.WilliamLewiswhoalsosharedmypassionfortheroleofdatainmachinetranslation.Hisoptimism,dedicationandsupportwasapositiveforcethroughoutmyPhD.SincethepursuitofPhDwastheotherbigthinginmylifeaftermyjob,thesubjecthadcomeuponceormorewithallcolleaguesI'vehadthroughouttheyears.Icannotnamethemallhere,butwithoutexception,theywereallencouragingandplayedapositiveroleinthisendeavourandforthatI'mverythankful.Whilebeingsupportedatworkandschoolbyincrediblepeople,IstillneededtimetoworkonmyPh.D.aftermyresponsibilitiesatwork.Mywifeandsoulmate,HodahasbeenincrediblygraciousandthroughouttheseyearswhileshestartedandhermastersdegreeinArchitecture,startedherPh.D.inConstructionManagement,hadtwokidsanddefendedherPh.D.lastyear.Thanks!Attheageof36,Ihavenowofbeeninschoolforover30years.FromthedayIwenttokindergartentillthisveryday,mymomalwaysencouragedmeandbelievedinme.Shegotmeintogoodschools,shestayeduplatetohelpmewithmyhomework,getgoodgrades,stayingoodschools,getintouniversityandsoon.ThewordscannotbegintodescribehowmuchsheformyeducationandsuccessandforthisI'mforeverthankful.Mydadalwayswasandcontinuestobemyrolemodel.Heblessedmewithhiswisdom,greatadvice,directionandsupportthroughouttheyearsandIcannotthankhimenoughforthis.Ihavebeenawayfrommyparentsnowforoveradecadeinpursuitofhighereducationandcareer.Althoughthispursuithastakenmuchlongerthananticipated,theyremainedencouragingandsupportiveandforthisI'mforeverindebttothem.IalwaysfromdiscussionsaboutcareerandhighereducationwithmyyoungerbrotherviiAmeen,throughouthisyearsatMicrosoftaswellasthetwoyearssincehestartedhisPh.D.atUCDavis.MyparentsinlawhavealsoblesseduswithvisitingushereintheStatesandhelpinguswithourtwodaughtersinadditiontobeinggreathostsformywifeanddaughtersduringthesummermonthswhileIwasnotwiththem.I'mincrediblythankfulandindebttothem.Manyfriendsandfamilieshavesupportedmewiththeirthoughtsandprayersthroughouttheyears.WhileIcannotnameallofthemhere,Idothankthemfortheirhelpandsupport.Finally,mydaughtersAulaandHannahhaveprovidedmewithincrediblejoy,challengeandheadacheforthebetterhalfofmyPh.D.WhileIhavespentmanynights,weekendsandsummermonthsawayfromthemworkingonmydissertation,Iwasalwayswelcomedandacceptedbythem.Ifitwasn'tforthem,althoughthisdissertationmighthavebeencompletedearlier,butoverallIamabetterpersonbecauseofthem.ThankYou!viiiTABLEOFCONTENTSLISTOFTABLES.......................................xiiLISTOFFIGURES.......................................xiiiCHAPTER1INTRODUCTION...............................1CHAPTER2BACKGROUND................................52.1EstimatingLanguageModelProbability.......................72.2EstimatingTranslationModelProbability......................92.2.1WordAlignment...............................92.2.2PhraseTable..................................102.3MachineTranslationEvaluation...........................10CHAPTER3LITERATUREREVIEW...........................133.1Introduction......................................133.2ApplicationScenarios.................................163.3NotationandTerminology..............................183.4ProblemFormulation.................................203.5SentenceLevelScoringFunctions..........................243.5.1ContextDependentFunctions........................243.5.1.1N-GramCoverageFunctions....................243.5.1.2Phrase-TableBasedFunctions...................273.5.1.3LanguageModelBasedFunctions.................293.5.1.4DecoderBasedFunctions.....................313.5.2ContextIndependentFunctions........................333.5.2.1LanguageModelBasedFunctions.................333.5.2.2AlignmentModelBasedFunctions................353.5.2.3OtherScoringFunctions......................383.6FeatureCombinationandParameterTuning.....................403.7SelectionAlgorithms.................................413.7.1ThresholdBasedFiltering..........................413.7.2GreedySearch................................423.7.3SubmodularOptimization..........................423.8ActiveLearning....................................443.9Batch-ModeLearning.................................453.10EvaluationMethods..................................473.11ComparativeSummaryofCitedWork........................483.12RelatedResearchAreas................................513.13Summary.......................................52CHAPTER4TRANSLATIONDIRECTIONDETECTION.................54ix4.1Introduction......................................554.2RelatedWork.....................................584.3BilingualFeaturesforTranslationDirection..............594.3.1POSTagMTUs................................594.3.2Distortion...................................604.4SingleDomainExperiments.............................604.4.1Data......................................604.4.2PreprocessingandFeatureExtraction....................614.4.3Results....................................624.4.4Analysis....................................624.5Cross-DomainExperiments..............................634.5.1DataMatrix..................................644.5.2PreprocessingandFeatureExtraction....................654.5.3Online.............................664.5.4EvaluationMethod..............................674.5.5Results....................................684.6ConclusionandFutureWork.............................71CHAPTER5EFFECTIVEANDEFFICIENTDATASELECTION............735.1VocabularySaturationFilter.............................735.2ComparisontonearOptimumSolution........................755.2.1ExperimentalSetup..............................755.2.2Results....................................765.3DataOrder.......................................785.4ExperimentSetup...................................785.5ExperimentResults..................................805.6EffectsofVSFonn-gramdistribution........................825.7Conclusion......................................84CHAPTER6THESATURATIONFILTERFRAMEWORK................856.1FeatureSpace.....................................866.1.1SourceVocabulary..............................876.1.2SourceandTargetVocabulary........................876.1.3BigramSourceVocabulary..........................876.1.4SourceVocabularywithContext.......................886.1.5MinimalTranslationUnit...........................886.1.6MinimalTranslationUnitwithTargetBrownCluster............886.2ThresholdFunction..................................896.3IncrementalAlgorithm................................906.4ExperimentSetup...................................916.5ExperimentResults..................................926.5.1FeatureComparison..............................946.5.2ThresholdFunctionComparison.......................976.6Conclusion......................................108xCHAPTER7CONCLUSION................................110BIBLIOGRAPHY........................................111xiLISTOFTABLESTable3.1ExplorationversusExploitation..........................27Table3.2RelatedWork....................................50Table4.1POSMTUfeatureswithhighestweight.FE#indicatesthenumberoftimesthisfeatureappearedwhentranslatingfromFrenchtoEnglish...........62Table4.2featuresandtheirlabels........................63Table4.3Cross-DomainDataSets..............................64Table5.1English-sideSentenceCounts(inmillions)fordifferentthresholdsforVSF,VSFafterorderingthedatabasedonnormalizedcombinedalignmentscoreandrandombaselines................................79Table5.2English-sideWordCountsfordifferentthresholdsforVSF,VSFafterorderingthedatabasedonnormalizedcombinedalignmentscoreandrandombaselines..80Table5.3Wordalignmenttimes(hh:mm)fordifferentthresholdsforVSF,VSFaftermodelscoreordering,andarandombaseline...................81Table5.4BLEUScoreresultsforVSF,S+VSF(SortedVSF),andRandomBaselineatdifferentthresholdst.................................81Table5.5OOVratesforVSF,S+VSF(SortedVSF),andRandomBaselineatdifferentthresholdst......................................82Table6.1Dataselectionfeaturesandtheirlabels.......................92Table6.2IncrementalFeatureSaturationFilteralgorithmsizesforeachfeature.......93Table6.3IncrementalSaturationFilteralgorithmsizesfordifferentthresholdfunctions...93xiiLISTOFFIGURESFigure2.1NoisyChannelModelforStatisticalMachineTranslation[50]..........6Figure2.2Probabilityofﬁpassedtheexamﬂandﬁfailedtheexamﬂwordsequencesac-cordingtoLanguageModelstrainedonpublishedbookseveryyearsince1900withasmoothingfactorofthree([42])....................8Figure3.1TaxonomyofdataselectioninMachineTranslation...............15Figure3.2DataSelection:Thisdiagramdemonstratestheroleofeachdatasetinthedataselectionprocess................................20Figure4.1EuroParlWordCloudDataVisualization(TranslatedvsOriginal)........56Figure4.2EffectsofChunkSizeonTranslationeseDetectionAccuracy..........57Figure4.3Sentenceleveltranslationdirectiondetectionprecisionusingdifferentfea-tureswithn-gramlengthsof1through5......................61Figure4.4POSTaggedandBrownClusterAlignedSentencePairs.............65Figure4.5ComparingareaundertheROCcurveforthetranslationdirectiondetectiontaskwhentrainingandtestingondifferentcorporausingeachoftheeightfeaturesets.SeeTable4.2forexperimentlabeldescription............67Figure4.6Translationdetectionperformancematrixfortrainingandtestingonthreedifferentcorpora.Eachrowcorrespondstoresultsoftrainingonacorpuswhiletestresultsarelaidoutinacolumnofthe33matrix.UnlikeFigure4.5wherewereportAUCvaluesforalln-grams,inthisdiagramweonlyreportthehighestAUCvalueforeachfeature.Thenumbernexttoeachdatapointindicatesthen-gramlengththatachievesthehighestperformanceforthatfeature.SeeTable4.2forexperimentlabeldescription..........68Figure4.7Translationdetectionperformancematrixfortrainingandtestingonthreedifferentcorpora-Weranexperimentsforn-gramsofuptolengthveforeachfeature(SeeTable4.2forfeaturelabeldescriptions).UnlikeFigure4.5wherewereportAUCvaluesforalln-gramlengths,inthisgraphweonlypresentthehighestAUCnumberforeachfeature.Eachmarkertypeindi-catesatrainingandtestsetcombination.Theformatofexperimentlabelsinthelegendis[TrainingSet]-[TestSet]andEU:EuroParl,H:Hansard,HC:HansardCommittees.Forexample,EU-HCmeanstrainingonEuroParlcor-pusandtestingonHansardCommitteescorpus..................70xiiiFigure4.8TranslationdirectiondetectionAUCperformancerankforeachtrainingandtestsetcombination.ForcorpuscombinationabbreviationsseedescriptionofFigure4.7.ForfeaturelabeldescriptionsseeTable4.2.............72Figure5.1UnigramEckvs.UnigramVSF..........................76Figure5.2BigramEckvs.UnigramVSF..........................77Figure5.3ComparativeBLEUscoresfortwoVSFimplementations,againstarandomlysampledbaseline..................................82Figure5.4ComparativeOOVratesfortwoVSFimplementations,againstarandomlysampledbaseline..................................83Figure5.5log2-scaleUnigramFrequencyscatterplotbeforeVSFversusafterVSF....84Figure6.1BLEUscorecomparisonofdifferentdataselectionfeatureswithconstantthresholdwiththeWMT2009testset.......................95Figure6.2BLEUscorecomparisonofdifferentdataselectionfeatureswithconstantthresholdwiththeWMT2013testset.......................95Figure6.3Comparisonoftotalnumberofuniquesourcephrasesextractedforeachdataselectionfeature...................................96Figure6.4Comparisonoftotalnumberofuniquephrasepairsextractedforeachdataselectionfeature...................................96Figure6.5Averagenumberoftargetphrasespersourcephraseforeachdataselectionfeature........................................96Figure6.6Samplingbiasinrandomdataselection.X-Axisislogarithmoffrequencyforwordsinallthedata.Y-Axisislogarithmoffrequencyforwordsinselecteddata.Thegraphisgeneratedforsubsetsofthedatainsizes1M,4M,8Mand16M(JSD:JensenŒShannonDivergence).....................98Figure6.7Samplingbiasinuniformthresholdfunction.X-Axisislogarithmoffre-quencyforwordsinallthedata.Y-Axisislogarithmoffrequencyforwordsinselecteddata.Thegraphisgeneratedforsubsetsofthedatainsizes1M,4M,8Mand16M(JSD:JensenŒShannonDivergence)..............99Figure6.8Samplingbiasindataselectionusinglogarithmoffrequencythresholdfunc-tion.X-Axisislogarithmoffrequencyforwordsinallthedata.Y-Axisislogarithmoffrequencyforwordsinselecteddata.Thegraphisgeneratedforsubsetsofthedatainsizes1M,4M,8Mand15M(JSD:JensenŒShannonDivergence).....................................100xivFigure6.9Samplingbiasindataselectionusingentropythresholdfunction.X-Axisislogarithmoffrequencyforwordsinallthedata.Y-Axisislogarithmoffrequencyforwordsinselecteddata.Thegraphisgeneratedforsubsetsofthedatainsizes1M,4M,9Mand17M(JSD:JensenŒShannonDivergence)...101Figure6.10BLEUscorecomparisonofdifferentdataselectionthresholdfunctionusingsourcevocabularyfeaturewiththeWMT2009testset...............102Figure6.11BLEUscorecomparisonofdifferentdataselectionthresholdfunctionusingsourcevocabularyfeaturewiththeSpeechEX1conversationaltestset([67])...102Figure6.12BLEUscorecomparisonofdifferentdataselectionthresholdfunctionusingsourcevocabularyfeaturewiththeWMT2009testset...............103Figure6.13BLEUscorecomparisonofdifferentdataselectionthresholdfunctionusingsourcevocabularyfeaturewiththeSpeechEX1conversationaltestset([67])...104Figure6.14Thresholdfunctionscomparisonforaveragenumberoftargetphrasespersourcephrasesextractedforsourcevocabularyfeature..............104Figure6.15Thresholdfunctionscomparisonforaveragenumberoftargetphrasespersourcephrasesextractedforbigramsourcevocabularyfeature..........105Figure6.16Thresholdfunctionscomparisonfortotalnumberofphrasepairsextractedforsourcevocabularyfeature............................105Figure6.17Thresholdfunctionscomparisonfortotalnumberofphrasepairsextractedforbigramsourcevocabularyfeature........................105Figure6.18Thresholdfunctionscomparisonfornumberofuniquesourcephrasesex-tractedforsourcevocabularyfeature........................106Figure6.19Thresholdfunctionscomparisonfornumberofuniquesourcephrasesex-tractedforbigramsourcevocabularyfeature....................106Figure6.20BLEUscorecomparisonfortheselectionoflowerdatasizesusinglogarithmoffrequencythresholdfunctionandMTUfeatureatdifferentselectionsizes...107Figure6.21BLEUscorecomparisonfortheselectionoflowerdatasizesusinglogarithmoffrequencythresholdfunctionandMTUwithtargetBrownclusterfeatureatdifferentselectionsizes.............................108xvCHAPTER1INTRODUCTIONWiththedigitalagethereisevermorenaturallanguagepreservedindigitalformat.Theavailabilityofnaturallanguageincreaseswitheverynewspiece,newarticle,blogpost,recordedconversationandvirtuallyeverythingspokenorwrittenbybillionsofpeopleontheplanet.Itisonlynaturaltodevelopalgorithmsandtechniquesenablinghumanstoconsume,absorbandleveragethiseverexpandingsourceofdata.Forexample,summarisationenablespeopletoabsorbmoreinformationwithlesseffort.Sentimentdetectionbuildsontopofthatandprovidesaggregateinformationon,forexample,productreviewswrittenbyconsumers.MachineTranslationprovidescosteffectiveandseamlessaccesstoinformationthatwouldhavebeenotherwiseinaccessible.Researchersdevelopincreasinglymorecomplexalgorithmstoperformthesetaskswithhigherprecision.Manyofthesealgorithmsarenotinherentlydecomposableintopiecesthatcanefruninparallel.Also,notallresearchershaveaccesstolargescaleclusterstodeveloporrunsuchdecomposablealgorithms.Anexampleofthisisdeepneuralnetworks.Althoughtheoriginalideawasdevelopedinthe70s,ithasonlyrecentlybeenusedeffectivelyandsuccessfullyinarangeofnaturallanguageprocessingproblemssuchasspeechrecognitionandmachinetranslation.Thisimpliesitispossibleforarangeofcurrentlyavailablealgorithmsthatarenotscalable,toonlybeusedinseveralyearsordecadeswithimprovedcomputationpower.Thishasmotivatedanewapproachtothisgab.Theapproachistocompacttrainingdatatoasmallenoughsizesuchthatthealgorithmscantrainonitinareasonableamountoftime.Anexampleofsuchworkintheofmachinelearningiscore-sets([36]).ThefocusofthisthesisisthetrainingdatausedinMachineTranslation.MachineTranslationhasbeenaroundforseveraldecades.ThequalityofMachineTranslationhoweverhasimprovedoverthepastseveralyearswiththedevelopmentofstatisticalmethodsthatcanlearnfromexistingtranslationdatainsteadofhavinglinguistsdeveloptranslationrulesforeachlanguage1pair.MachineTranslationhasreceivedasecondboostwiththedevelopmentofdatacollectionmethodsthathaveharvestedthedigitalresourcessuchasthewebtocomposemoreandmoretrainingdata.WhiletheconsensusisthatmoredataimprovesmachinetranslationtheefforttocollectmoredatahasdiminishedfordominantlanguagessuchasEnglish,SpanishandFrenchsincethecomputationcapacitytoeftrainamachinetranslationsystemonsuchlargedatasetshasbeensaturated.Takethelanguagemodelingtask.Thesizeoftheindexedwebisestimatedtobe50billionwebpages[105].Assuminganaveragewordsperpageof429.2([78]),thisisapproximately21.4terawords.Thesizeofthedeepwebisestimatedtobe500timestheindexedweb([14]).Thatbringsthetotalestimatednaturallanguageavailableonthewebcloseto10.7petawords.ThelargestlanguagemodeleverreportedtobebuiltisbyGoogleover1terawords[38].Thatis5%oftheindexedweband0.09%oftheentireweb.Thesecondtypeoftrainingdataisparalleldata.French-Englishhasthelargestpubliclyavailableparallelcorpuswithcloseto40millionsentencepairsor2.8billiontotalwords[21].Whilethereisalotmoreparalleldataavailablereportedbytheindustry([66]),thereislittleefforttocollectsuchdatasincethetrainingalgorithmsarealreadycomputationallysaturatedwithexistinghardware.Inthisthesis,wefocusonafundamentalaspectofmachinetranslationtrainingdata.Parallelnaturallanguagedataisuniquetomachinetranslation.Itisalsothemainandkeytypeoftrainingdataformachinetranslation.Insomerecentworkinmachinetranslationusingdeepneuralnetworksitistheonlytypeoftrainingdata([94]).Asfarasweareaware,paralleldatahasonlybeenusedforthepurposeofdirectlyorindirectlytrainingamachinetranslation.TheonlyothertaskthatcanalsobeperformedusingparalleldataisTranslationesedetection.InaparallelcorpussettingwecallthisTranslationDirectionDetectionorTDDforshort.Theofthistaskisthatitcanbeevaluatedobjectively,accuratelyandmeaningfully.Incomparison,theevaluationofamachinetranslationalgorithmoverparalleldataisofallcomputationallycomplex.Second,theevaluationislessmeaningfulasasentencecanbetranslatedcorrectlyinmanydifferentwaysandforthatreason,,theevaluationisnotaccurateasacompletelyvalidtranslationcanbeevaluatedasincorrectduetoreferencemismatch.TDDdoesnothave2anyoftheseissues.Thusitisasuitabletaskforexplorationandaccuratecomparisonofmanydifferentfeaturefunctions.InChapter4wedevelopandcomparetheeffectivenessofdifferentfeaturesinselectingsentencepairsofadirection.AlthoughthesefeaturesaredevelopedandevaluatedinthecontextofTDD,weshowinChapter5thatsomeofthesefeaturesarealsomosteffectiveindataselection.ThedataselectionliteratureinmachinetranslationisreviewedinChapter3.Wefocusonthequalityimprovementandtrainingresourcelimitationscenariofromtheliteratureusingdatacompactingtechniquestotrainonthemostinformativesubsetofthedatawhenitisnotpracticaltotrainontheentiredataset.InChapter5,weintroducealinearbutsuboptimalsearchalgorithmforoptimizingacommonlyusedsurrogateobjectivefunction(n-gramcoverage)fordataselection.Weshowthequalityofthemachinetranslationoutputproducedbymodelstrainedonthedatasetchosenbyoursuboptimalalgorithmisnotstatisticallydifferentfromtheoutputproducedbymodelstrainedontheoptimalsubset.Inotherwords,weareabletoproducesamequalitymachinetranslationoutputusingalinearandsuboptimalalgorithm.Finally,inChapter6weintroduceaxibleframeworkfordataselectionthatisabletouseanysentencepairlevelfeaturealongwithanarbitrarythresholdfunction.WeproposeusingmoresophisticatedfeaturesfromChapter4withthesamelinearsearchalgorithm.Thesefeaturesusewordalignmentinformationcombinedwithsemanticfeaturesfrombrownclusterstocapturejointlexicalandsemanticinformation.Thesefeaturesareabletocaptureinformationthathasnotbeendirectlyusedinpriordataselectionmethods.Inaddition,wefurtherinvestigatetheeffectsofdataselectiononvocabularydistributionandimprovetheJensenŒShannondivergencebetweenselecteddataandselectionpoolusinganentropybasedvariablethresholdfunction.Improvinguponthepreviouslyintroducedefsearchalgorithmweareabletopartitionverylargedatasets(mostpriorworkdonotscaletosuchlargedatasets)intoanorderedandtunablenumberofpartitions1andoutperformallpriorworkthatcanscaletosuchlargedatasets.Weperform1Partitionsareintheorderofmoredesirabledata.Thedesirednumbersentencepairscanbe3endtoendstatisticalmachinetranslationexperimentscomparingdifferentfeaturesetsandvariablethresholdfunctions.WeusetranslationmodelstatisticsinadditiontoBLEUscoretocomparetheeffectivenessofvariousselectiontechniques.selectedbyselectingpartitionsinorder.4CHAPTER2BACKGROUNDTranslationbetweendifferentlanguagesisnearlyasoldasspokenandwrittenlanguageitself.Duetothelaborioustaskoftranslation,theinterestinusingcomputersfortranslationcameaboutassoonascomputersbecameavailable.WarrenWeaver,co-authorofthelandmarkworkoncommunication,TheMathematicalTheoryofCommunication,withClaudShannonofBellLab-oratories,iswidelyrecognizedasthepioneerofMachineTranslation.OnMarch4th,1947,inhislettertoProfessorNorbertWienerofMassachusettsInstituteofTechnologyhewrote([103]):WhenIlookatanarticleinRussian,IsayﬁThisisreallywritteninEnglish,butithasbeencodedinsomestrangesymbols.Iwillnowproceedtodecode.ﬂHaveyoueverthoughtaboutthis?Asalinguistandexpertoncomputers,doyouthinkitisworththinkingabout?Also,inhis1949ﬁTranslationﬂmemoranduminreferencetoShannon'sworkonTheoryofCommunication,hewrites([103]):ProbablyonlyShannonhimself,atthisstage,canbeagoodjudgeofthepossi-bilitiesinthisdirection;butaswasexpressedinW.W.'soriginallettertoWiener,itisverytemptingtosaythatabookwritteninChineseissimplyabookwritteninEn-glishwhichwascodedintotheﬁChinesecode.ﬂIfwehaveusefulmethodsforsolvingalmostanycryptographicproblem,mayitnotbethatwithproperinterpretationwealreadyhaveusefulmethodsfortranslation?InspiredbytheworkofWarner,intheirseminalwork,ﬁTheMathematicsofStatisticalMa-chineTranslatoion:ParameterEstimationﬂ([23]),Brownetal.introducedtheconceptofusingShannon'sNoisyChannelModel([89])forthepurposeofStatisticalMachineTranslation(Fig-ure2.1).AnoisychannelframeworkformachinetranslationworkssimilartohowWarnerde-5Figure2.1NoisyChannelModelforStatisticalMachineTranslation[50]scribedinhismemorandum:Forexample,anEnglishsentencegoesthroughaﬁnoisyﬂchannelandcomesoutasFrench.Nowgiventheobservedﬁnoisyﬂsentence(whichisnowinFrench),whatisthemostlikelyEnglishsentenceatthesourceofthenoisychannel?RecoveringtheoriginalEnglishrequiresreasoningabout1)HowtheEnglishlanguageiswritten(ﬁsourcemodelingﬂ)and2)HowEnglishisturnedintoFrench(ﬁchannelmodelingﬂ)([55]).Formally,whentranslatingaFrenchwordsequence,f,toEnglish,weassumeaFrenchnativespeakerhadasequenceofEnglishwords,e,inmindwhenproducingtheFrenchwordsequence.TheprocessoftranslatingfintoEnglishistotheEnglishwordsequence‹ethatmaximizesP(ejf).UsingBayes'rule,wecanwrite:P(ejf)=P(fje)P(e)P(f)(2.1)SincethedenominatorinEquation2.1isconstantforallvaluesofe,themaximizationproblemfor‹ecanbeformulatedasEquation2.2,alsoknownastheFundamentalEquationofMachineTranslation([23]).6‹e=argmaxeP(fje)P(e)(2.2)Thisequationpresentsthethreefundamentalcomputationalproblemsofstatisticalmachinetranslation:1.Estimatingthelanguagemodelprobability,P(e).2.Estimatingthetranslationmodelprobability,P(fje).3.FindingthewordsequenceethatmaximizesEquation2.2(Alsocalleddecoding).2.1EstimatingLanguageModelProbabilityAlanguagemodelisformallyasafunctionthattakesawordsequenceinagivenlanguageasinputandreturnstheprobabilitythatthewordsequencewasproducedbyanativespeakerofthatlanguage([57]).LanguageModel:PLM(e1;e2;:::;en)(2.3)ForExample1:PLM(passedtheexam)>PLM(failedtheexam).N-Gramlanguagemodelsaretheleadingmethodforlanguagemodeling.Thismethodisbasedonthelikelihoodthatindividualwordsfolloweachother.TomakethelanguagemodelingproblemtractableaMarkovChainwithaedmemory.ThememorylengthusedintheMarkovChainiscalledthelanguagemodelorder2.Atrigram3languagemodelassumesaMarkovchainoflength1SeeFigure2.2formoredetailsonthisexample([42])2BytheorderofalanguagemodelistwoplusthememorylengthoftheassumedMarkovchain.Thatis,abigramlanguagemodelassumesamemorylessMarkovchain.3Alanguagemodelwithorder3isalsocalledatrigramlanguagemodel.7Figure2.2ProbabilityofﬁpassedtheexamﬂandﬁfailedtheexamﬂwordsequencesaccordingtoLanguageModelstrainedonpublishedbookseveryyearsince1900withasmoothingfactorofthree([42]).two.Thatis,theprobabilityofthenextwordinasequenceofwordsisonlydependentonitsprevioustwowords.PLM(e1;e2;:::;en)=nÕi=1PLM(eije1;:::;ei1)=nÕi=1PLM(eijeim1;:::;ei1)(MarkovChainwithmemorym)(2.4)An-gramlanguagemodelisestimatedusingalargecorpusinthedesiredlanguagebygatheringn-gramcountsofvariouslengthsuptothelanguagemodelorderandcalculatingsufstatisticsforthelanguagemodel.Inthisthesis,weregularlybuildanduselanguagemodelingtoolsusingoff-the-shelftoolssuchasKenLM([46])andSRILM([93]).Sinceourcontributionsinthisthesisarenotdirectlyrelatedtolanguagemodels,ageneralunderstandingoflanguagemodelsandhowtheyareusedasdescribedaboveissufforunderstandingthisthesis.Formoreinformationsee[57].82.2EstimatingTranslationModelProbabilityThereareseveraldifferentmodelingtechniquesdevelopedforestimatingtranslationmodelprob-abilities.Wordbasedmodels([23]),phrasebasedmodels([58])andhierarchicalphrasebasedmodels([29])aresomeofthemostcommonlyusedmodels.Aprerequisiteforallofthesemodel-ingtechniquesiswordalignment.2.2.1WordAlignmentGivenasentencepairthataretranslationsofeachother,thewordalignmenttaskistoaligneachwordinonesentencetoitstranslationintheothersentence.Thisisformallyusinganalignmentfunction,a:i!jwhichmapsthewordinpositioniinonesentencetoanotherwordinpositionjintheothersentencewherethesetwowordsaretranslationsofeachother.OneofthemaincontributionsofBrownetal.wasmodelingthisalignmentfunctionanddevelopinganExpectationMaximizationalgorithmtoestimateit([23]).InitssimplestformthealignmentfunctioncanbemodeledusingIBMModel1([23])inEquation2.5whereeistheEnglishortargetsentencewithlengthle,fistheFrenchorsourcesentencewithlengthlf,eistheasP(lejlf)andaisthealignmentfunction.P(aje;f)=P(e;ajf)P(e;f)chainrule(2.5)P(e;ajf)=e(lf+1)leleÕj=1P(ejjfa(j))(2.6)Afterreplacementand([57])wecancalculateP(aje;f)asbelow.P(aje;f)=leÕj=1P(ejjfa(j)ålfi=0P(ejjfi)(2.7)IBMModel1onlytakesintoaccountlexicalinformationforalignment.Morecomplexmodelshavebeendeveloped(IBMModels2,3,4and5)toconsiderthepositionofsourceandtargetwords9inthesentence,fertility4anddistortion5.Forthepurposeofthisthesisunderstandingtheconceptofwordalignmentisessential,butestimationmethodsandfurtherdetailsaboutalignmentmodelsarenotnecessary.Givenaword-alignedparallelcorpus6theprobabilityofeachtargetword(s)givenasourcewordcanbeestimatedbygatheringsufcountsfromthecorpusandapplyingthechainrule(Formoredetailssee[57]and[55]).2.2.2PhraseTableWordbasedtranslationmodelsdonotallowforthestatisticalmodelstodirectlylearnthetrans-lationofphraseslongerthanlengthone.Phrasebasedmodelshavebeendevelopedtoaddressthisy.Differentphraseextractiontechniqueshavebeendevelopedforthispurpose.Thephraseextractionprocesstakesaword-alignedparallelcorpusasinputandoutputsatranslationtablewithentriescorrespondingtosourcephrase,targetphraseandtheprobabilityofthetargetphrasegiventhesourcephrase.Formoredetailssee[57]and[58].2.3MachineTranslationEvaluationAutomaticevaluationmethodsformachinetranslationoutputhasbeenoneofthekeycontribut-ingfactorstotheimprovementofmachinetranslationqualityoverthepastdecade.AmongstdifferentevaluationmetricssuchasBLEU7([83])andMETEOR([12]),BLEUiscur-rentlythemostpopularautomaticevaluationmetric([57]).InthisthesiswemakeextensiveuseofBLEUforcomparingthequalityofthetranslationsystemtrainedondifferenttrainingdata,en-4Thefertilityofawordinthecontextofmachinetranslationisasthenumberofwordsonthetargetgeneratedby(oralignedto)asinglewordinthesourcesentence.5Inthecontextofmachinetranslationdistortionisastherelativepositionofwordinthetargetsentencecomparedtothepositionofitscorrespondingwordinthesource.6Aparallelcorpusissimplyacollectionofsentencepairsintwolanguageswhereeachsentenceispairedwithitscorrespondingtranslationsentence.7BLEU:BiLingualEvaluationUnderstudy10ablingustohaveanobjectivecomparisonbetweenvariousdataselectiontechniquesintheabsenceofahumantranslationevaluator.Similartootherevaluationtechniques,BLEUmakesuseofreferencetranslations.BLEUisaprecisionbasedmetric.Thatis,ittakesintoaccountthenumbern-gramsinthetranslationthatoccurinanyofthereferencetranslationsandignoresanyn-graminthereferencetranslationsthatdonotappearintheproducedtranslation.Thedownsideofthisapproachisthatatranslationoutputcangetaperfectscoreifitonlygeneratesasinglematchingunigram.ThisproblemisaddressedinBLEUbyintroducingabrevitypenalty(Equation2.8).brevity-penalty=min(1;output-lengthreference-length)(2.8)Theprecisionforeachn-gramlengthiscalculatedbycountingthetotalnumberofn-gramsintheoutputthatalsoexistsinthereferencedividedbythetotalnumberofn-gramsofthatlengththatappearinreferencetranslation(Equation2.9).precisioni=ångi2Outputångi2Reference1ångi2Reference(2.9)ngi:n-gramoflengthiABLEUscorecanbecalculatedforann-gramofuptoacertainlength(BLEU-n).BLEU-n=brevity-penaltyeåni=1lilogprecisioni(2.10)Then-gramlengthoffouranduniformlambdaweightsisacommonusedforevaluation([57]).ThisEquation2.10toEquation2.11BLEU-4=min(1;output-lengthreference-length)4Õi=1precisioni(2.11)11FormoredetailsonmultiplereferencesandmoredetaileddiscussiononBLEUandothereval-uationmetricssee[57].Inthischapter,wecoveredabriefhistoryofmachinetranslation,keyideasandconceptsessen-tialforunderstandingthisthesis.ForacomprehensiveanddetaileddiscussiononvarioustopicsrelatedtoStatisticalMachineTranslation,seethebookwrittenwiththesametitlebyKoehn([57]).12CHAPTER3LITERATUREREVIEW3.1IntroductionTrainingdataforstatisticalmachinetranslationhasincreasedoverthepastseveralyearsandiscontinuingtogrowevenfurtherasdigitalmultilingualcommunicationandinformationdisseminationbecomeubiquitous.However,thedataisalsocominginanincreasinglywiderspectrumofquality,fromsentencelevelprofessionaltranslationstocrowd-sourcedtranslations,tomachine-generatedoutputsfromfreeonlinetranslators.Thisisoneofthereasonswhydataselectionandcleaninghasbecomeacommonstepinthedevelopmentofmachinetranslationsystems.Atthesametime,whiletrainingdataforstatisticalmachinetranslationisabundantinsomelanguagepairs,developmentofhighqualitymachinetranslationsystemsforlow-resourcelanguagesremainsachallengeduetolackoftrainingdata.Usinghumantranslatorstoproducenewtrainingdatahasbeenacommonsolutiontothisproblem([6,5,24,44]),yetthecostoftranslationhasbeenalimitingfactor.Asaresult,dataselectiontechniqueshavebeendevelopedtoselectmonolingualdatathat,iftranslated,canimprovethequalityofamachinetranslationsystemthemost.Whileindustryleadersinthehavetheinfrastructureandresourcestobuildanditerateoverevergrowingdatasets,thisisnotcosteffectiveforsmallbusinessesandacademics.And,eveninthepresenceofabundanttrainingdataforstatisticalmachinetranslation,itisstilldesirabletoselectasubsetofthedatathatenablesthebesttranslationqualityinastatisticalmachinetranslationsystem.Finally,ofmobileapplicationswhichgenerallyhaveconstrainedmemorylimitationsareanotherimportantapplicationofdataselectioninmachinetranslation.Asaresult,aofdatacleaningandselectionhasformedintheofmachinetrans-lation,andsomeofthemostpracticalmethodsfordatacleaningandselectionhaveappearedinsystemsubmissionpapers.Thischapterattemptstoexhaustivelyreviewtheliteratureinmachine13translationdataselectionandofferacomparativeoverviewoftheproposedmethodsineachappli-cationscenario.Alldataselectionandcleaningalgorithmsselectasubsetofsomeoriginaldataset.Theydifferinthreemaincharacteristics:1.ApplicationScenario:Theapplicationscenarioforeachmethod(e.g.resourcereductionforhighorlow-resourcelanguages,qualityimprovementingeneralordomainadaptationsettings)thetypeofdatathatisbeingselected(parallelormonolingual),andtheproblemthatthemethodaimstosolve(e.g.maximizetranslationqualitywithminimaltrainingtimememoryconsumption).2.ScoringFunctions:Sinceitisinfeasibletotrainamachinetranslationsystemoneachsubsetofthedataandevaluatethequalityoftheresultingsystemforeachsubset,researchershavedevelopedscoringfunctions(features)onsubsetsofthedata,whichareexpectedtocorrelatewiththeusefulnessofthedatasubsetfortrainingahigh-qualitymachinetranslationsystem.Differentapplicationscenariosmotivatedifferentscoringfunctions.3.SelectionAlgorithms:Givenoneormorescoringfunctionsforevaluatingasubsetofdata,weneedtothesubsetthatmaximizesthisscoringfunctionsubjecttosomeconstraints.Dependingonthepropertiesofthescoringfunction,exactorapproximatesearchalgorithmshavebeendeveloped.Intherestofthechapterwereviewthevariationsintheliteraturebasedonthemaincom-ponentslistedabove(Fig.3.1).Nextweprovideacomparativeoverviewofliteratureonhoweachworkhascombinedthecomponentsaboveandhowtheycomparewithotherresearch.Thischapterisorganizedasfollows:InSection3.2webreakoutanddetailtheapplicationscenariosfoundationaltodataselectionalgorithms,asreviewedin(1)above.InSection3.4,weformalizethedataselectionproblemasaconstrainedoptimizationproblemandelaborateondifferentfor-mulationsofsuchproblemscorrespondingtodifferentapplicationscenarios.Anexhaustivelistof1NumbersinFig.3.1indicaterelatedsectionnumbersinthischapter.14DataSelectioninMachineTranslationApplicationScenariosLimitedRe-sourcesTrainingRe-sourcesDeploymentRe-sourcesHumanLabelingRe-sourcesImproveQualityNoiseReduc-tionUnhelpfulDataRe-ductionDomainImprove-mentSentenceScoringFunctionsContextDepen-dentLanguageModelPhraseTableDecodern-gramStatisticContextIndepen-dentAlignmentModelOtherSelectionAlgo-rithmsThresholdBasedFilteringGreedySearchSub-modularOpt.BatchModelLearningActiveLearningEvaluationMethodsFeatureCombina-tionProblemFormula-tion3.23.103.63.43.53.5.23.5.13.73.7.13.7.23.7.33.93.83.5.2.33.5.2.23.5.2.13.5.1.33.5.1.23.5.1.13.5.1.4Figure3.1TaxonomyofdataselectioninMachineTranslation1dataselectionscoringfunctionsispresentedinSection3.5followedbyfeaturecombinationandparametertuninginSection3.6.Thescoringfunctionsareorganizedbasedonthestatisticalmodel(e.g.,languagemodel,translationmodel,...)theyuse.SearchmethodsusedtosolveoptimizationproblemsarelistedinSections3.7,3.8and3.9.MethodsofevaluatingtheeffectivenessofdataselectionmethodsarediscussedinSection3.10.Weprovideacomparativesummaryofallpriorworkinsection3.11,alongwithrecommendationsforapplyingthesemethods,andlistrelatedareasofresearchinSection3.12.153.2ApplicationScenariosDataselectionisperformedinavarietyofscenarios.Therearetwobroadapplicationscenariocategories:thosewherethegoalistominimizeresourceconsumption,andthosefocusedonqual-ityimprovementsornoisereduction.Theseinturncanbebrokendownintosub-categories,asdescribedbelow,allbasedonselectioncriteriaandscoringfunctions,whichwillbediscussedinmoredetailinSection3.4:1.SatisfyingResourceConstraints:Acommonscenarioindataselectionformachinetranslationisthattrainingdatasizeneedstobereducedduetosomeresourceconstraints([66,40,26]).Applicationscenariosinthiscategorydifferbasedontheconstrainedre-source.Althoughmodelpruningcanbeusedtosatisfydeploymentsizeconstraints([41]),selectingasubsetofthetrainingdatacanbeusedtosatisfyanyoftheresourceconstraintlistedbelow.Withanexponentialnumberofsubsetstochoosefrom,thegoalistochooseasubsetofthetrainingdatathatwillultimatelyresultinthehighesttranslationquality.a)TrainingResources:Withtheincreasingavailabilityoftrainingdataforstatisticalmachinetranslation,trainingresourcerequirements(e.g.,numberofcomputers,mainmemorysizeortrainingtime)increaseaswell.Thegoalhereistoreducetrainingresourcerequirementsbytrainingonasubsetofthetrainingdatawithlittleornoimpactontranslationquality.b)DeploymentSize:Inscenarioswhereatranslationsystemishostedondeviceswithlimitedhardwarecapabilities,suchasmobiledevices,itisdesirabletoproducetrans-lationmodelsthataresmallinsizeandcanbehostedonsuchdeviceswithminimumimpactontranslationquality.Onemethodforachievingthisobjectiveisbyselectingasubsetofthetrainingdata([33]).Alternatively,thisobjectivecanbeachievedbypruningthetranslationmodels(Suchmethodsareoutsidethescopeofthisthesis.See[110]foracomparisonofdifferentpruningmethods).16c)ManualTranslationCost:Whenimprovingtranslationqualityforlowresourcelan-guages2wherepre-existingparalleltrainingdataislimited,acommonapproachistoselectasubsetofamonolingualcorpustobemanuallytranslatedbyhumansandaddedtothetrainingdata([5,6]).Sincethereisacostelementinvolved,thegoalistoachievehighestimprovementataedcostormeetapredeterminedqualitybarwithminimumcost.2.QualityImprovement:Unliketheresourceconstraintscenario,thefocusofapplicationsthatfollowthisscenarioisonqualityimprovement.Inthesescenarios,thegoalistoselectadatasubsetwhichwillresultinahigherqualitySMT3systemcomparedtoanSMTsystemtrainedonthefulldataset.Thiscanbedonebyoutsentencesthatarenoisyorthathaveadiftranslationphenomena(e.g.,phrasebasedtranslationsystemsoftenlearnincorrectphrasetranslationsfromfreelytranslated4text),orselectingasubsetthatisrelevanttoadomaindifferentfromthedomainoftheinputdata.Althoughdatasizewilloftenbereducedinthesescenarios,thatisaside-effect,notthegoal.a)Noisereduction:Acommonsourceoftrainingdataisfromcrawledmultilingualwebsites([86]).Webdatacanbeverynoisy,hence,noisereductionhasbecomeacommonpreprocessingstepforthiskindofdata([32]).Forparalleldata,apairofsentencesthatarenotgoodtranslationsofeachothercanbeconsiderednoise.b)ReductionofUnhelpfulData:Althoughtheobjectiveissimilarandalsopartiallyachievedbynoisereduction,itmaybedesirabletooutdatathatarenottradi-tionallyasnoise.Forexample,non-literaltranslationsarenotdesirablefortrainingphrase-basedstatisticalMTsystems(atleastusingcurrenttechnology)..2LowResourceLanguagesarelanguagesthathaverelativelylittlemonolingualorparalleldataavailablefortraining,tuningandevaluation.3SMT:StatisticalMachineTranslation4Freetranslationorfreelytranslatedtextcontrarytoliteral,directorword-for-wordtranslationconveystheoverallmeaningofasentenceorphrasewithoutnecessarilyaword-for-wordcorre-spondencebetweensourceandtranslatedtext[45].17c)DomainImprovement:Acommonapproachforimprovingtranslationqualityistotraindomaintranslationmodels([74,8]).Tothatend,asubsetofthetrainingdatathatismorerepresentativeofthetargetdomaincanbeselected.Inthistask,anin-domaindatasetisusedtoguidethetrainingdataselectionprocess.3.3NotationandTerminologyPriorworkindataselectioninmachinetranslationhaveborrowedterminologyandnotationfromseveralrelatedincludingactivelearning,machinelearningandqualityestimation.Welistthetermsusedintheliteratureforeachconceptindataselectionandpresentthetermsandnotationsweusethroughoutthischapter.Dataselectioninmachinetranslationhasalsobeenreferredtoasdatacleaning([49,80]),noisereduction([95,52,81]),improvingtrainingdataquality([70,2])andactivelearning([5,7]).Althoughweuseallofthesetermsinthecontextofcorrespondingscenariosormethods,werefertothegeneraltaskofselectingasubsetofdatafortrainingasﬁdataselectionﬂindependentofitspurposeormethod.Thedataselectiontaskistoselecttheﬁbestﬂsubsetofthedataformachinetranslationmodeltraining.WerefertothefullsetofavailabledataasﬁselectionpoolﬂorSpoolanddenotetheﬁbestﬂsubsetbyﬁoptimumsubsetﬂ,S,regardlessoftheoptimizationcriteria.Theunitofdataisalwayssentencesorsentencepairs5.Wedonotcalloutparalleldataorsentencepairsversusmonolingualdataorsentenceswhenitisapparentfromthecontext.Whendistinctionbetweentwosidesofaparallelcorpusisnecessary,weusetheletterﬁsﬂwithappropriatecasingtoindicatethesourcesideoftheparalleldataandﬁtﬂforthetargetside.Theselectionpoolisalsocalledunlabeleddatainsomepriorwork([43]).Insomescenariosthereisaseedparallelormonolingualcorpusgivenaspartofthedataselectiontask.WecallthisdatasettheﬁseedcorpusﬂanddenoteitbySseed.Inthisscenario,theassumptionisthattheseedcorpusisalready5Thetermstranslationunitsorutterancescouldalsobeused,buttheyarefunctionallyequiva-lentinthiscontext.18selectedandthetaskistoselectasubsetoftheselectionpooltobeaddedtotheseedcorpus.Dataselectiontasksfordomainadaptationarealsogivenanﬁin-domainﬂdevelopmentset,Sinandanﬁout-of-domainﬂdevelopmentset,Sout.Insomecases,theﬁselectionpoolﬂisalsousedastheout-of-domaindevelopmentset.Asinglesentenceinanyofthesetsaboveislowercasesitodenotetheithsentenceinthedataset.Inmostproposedmethodsfordataselection,oneormorefunctionsareusedtoevaluatetheusefulnessofasentenceorsentencepair.Theliteraturecallssuchfunctions:features,featurefunctions,objectivefunctionsandﬁscoringfunctionsﬂ.Wechosethelasttermanddenoteitasf(si).Manymethodsproposedfordataselectionuseiterativeselectionalgorithmswhereoneormoresentencesfromtheselectionpoolareselectedineachiteration.Whilethealgorithmhasselectedjsentences,thesentencesthathavebeenselectedsofararecalledtheﬁselectedpoolﬂandaredenotedasSj1(sentences1throughj).Inthiscontextitisassumedthataftereachiteration,theﬁselectedpoolﬂisaddedtotheﬁseedcorpusﬂ.Theremainingsentencesarenotedasﬁcandidatepoolﬂ,Scandidate.Scoringfunctionsthatdependontheselectedpoolarecalledﬁcontextdependentﬂscoringfunctionsandaredenotedasf(sj+1jSj1).TranslationofsentencesiusingmodelstrainedondatasetSisdenotedasti=TX(S)(si)whereX(S)isanymodeltrainedondatasetS(e.g.,languagemodel,LM(S),alignmentmodel,AM(S),ortranslationmodel,TM(S)).Lowercasesm1indicateswords1throughminsentence,s.PLM(S)(si)representstheprobabilityofsentencesiappearingintheevaluationsetaccordingtoalanguagemodeltrainedoncorpusS.PTM(S)(si;ti)andPAM(S)(si;ti)areinterpretedsimilarlyforatransla-tionmodelandanalignmentmodel.Figure3.2demonstratestheroleofeachdatasetinthedataselectionprocess.Theselectionpoolcanbeusedtobuildstatisticalmodels(e.g.,languagemodel,translationmodel,...)toprovidestatisticalinsight.Thesemodelsarecomputedonceandnotupdated.Ontheotherhand,statisticalmodelscomputedusingtheselectedpoolseparatelyorcombinedwiththeseedcorpusareupdatedastheselectedpoolchangesduringthedataselectionprocess.Scoringfunctionsthatdependonthemodelsthatneedtobeupdatedarecalledcontextdependentscoringfunctions.Inaddition,adecoderusingmodelsbuiltbytheselectionorselectedpoolcanbeusedasanotherinputto19Figure3.2DataSelection:Thisdiagramdemonstratestheroleofeachdatasetinthedataselectionprocess.contextdependentorcontextindependentscoringfunction.Adevelopmentsetcanbeusedtotrainweightswhenmorethanonefeaturefunctionsisused.Ultimately,thescoringfunctionisusedtoscoresentencesorsentencepairsintheselectionpoolwithanoptimizerusedtoselecttheoptimumsubsetoftheselectionpool.Theoptimizationstepisoftendonebygreedilyaddingsentencepairsfromselectionpoolintotheselectedpool.Forcontextdependentfunctions,themodelsorstatisticsneedtoberecomputedaftertheselectionofeachsentencepair,orNsentencepairsinthecaseofbatchlearning.3.4ProblemFormulationDataselectioncanbeformulatedasaconstrainedoptimizationproblem.Theconstraintisthattheoptimumsubsethasagivenmaximumsize,whichisusuallymeasuredbythenumberofsentencesorsentencepairsinthesubset,denotedbyjSj(Sizecanalsobemeasuredintotalnumberof20characters,tokens,tokentypes,etcdependingonthescenario.)Thetruescoringfunctionthatadataselectionmethodaimstomaximizemeasurestheexpectedtranslationqualityusingdatadrawnfromsomedesiredtargetdistribution(whichcouldbeofgeneraldomainoratargetdomaininthecaseofdomainadaptation).Belowweformulateaconstrainedoptimizationprobleminagenericform,whichexpresseswhatdataselectionmethodsaimtoachieveintheory.Sinceitisimpracticaltosolvethiscon-strainedoptimizationproblem6,differentdataselectionmethodsusealternativeformulations.Thetruescoringfunctionistoselectasubset,S,fromtheselectionpool,Spool,thatmaximizesexpectedtranslationquality7whentranslatedusingmodelstrainedonSforanydatasample,D,thatisdrawnfromsomedesiredtargetdistributionP(ngram),subjecttosomesizeorresourceconstraintC,asshowninEquation3.1.S=argmaxSˆSpool;Size(S)<>:0ng2mSi=1NG(sm;n)1otherwise(3.8)norm1(s)=jNG(s;n)jInthisapproachrareandfrequentn-gramsarevaluedequally.Intheirefforttoassignahigherweighttomorefrequentn-gramsintheselectionpool,Spool,Ecketal.usethen-gramfrequencyastheweightintheirimprovedscoringfunction(Equation3.9).Usingnormalizationfactorsofjsjiwhereitakesonvaluesof0,1and2,areothervariationsattemptedbythiswork.weight2(ng;m)=8><>:0ng2mSi=1NG(sm;n)Freq(ng;Spool)otherwise(3.9)Freq(ng;S)=ås2Sång2NG(s;len(ng))125Aclearshortcomingofthisfunction(Equation3.9)isthatonceann-gramexistsintheselectedsentences,itisofzerovaluewhenselectingnewsentences.Thisdoesnotallowforthescoringfunctiontodiscriminateagainstann-gramwithfrequentoccurrencesintheselectedsentencesversusann-gramthathasonlyappearedonce.Thisisthemotivationforintroducingafeaturedecayfunctionforthen-gramweight([4,16]).weight3(ng;m)=Freq(ng;S)elFreq(ng;fs1:::smg)(3.10)l:ExponentialdecayhyperparameterFortheirlastvariationontheweightfunction,Ecketal.usethecosinesimilaritybetweenTF-IDF10vectorsoftheselectedpoolandsentencesinthecandidatepool([33]).InaninformationretrievalsettingforTF-IDF,allsentencesintheselectedpoolplaytheroleofthesearchquerywhileeachsentenceinthecandidatepoolisconsideredapotentialmatchingdocument.Thegoalinthiscaseistothedocument(candidatesentence)thatismostdissimilartothesearchquery(selectedpool).theweightfunction,weight(ng;m),asfollows,resultsinEquation3.7whichusescosinesimilarityofTF-IDFvectorsasabove,asthesentencescoringfunction.tf(ng;S)=Freq(ng;S)idf(ng;S)=jSjåsi2S1(ng2NG(si;len(ng)))1(ng2NG)=8><>:1ng2NG0otherwiseweight4(ng;m)=tf(ng;Sm1)Logidf(ng;SjSjm+1)(3.11)SincethegoaloftheTF-IDFweightfunctionistothemostdissimilarsentence,theoveralloptimizationproblem(Equation3.6)turnsintoaminimization,unlikeothern-grambasedscoringfunctions.10TermFrequency-InverseDocumentFrequency263.5.1.2Phrase-TableBasedFunctionsScoringfunctionsinthissectionrequireaphrasetable.Dependingonthefunction,theﬁseedcorpusﬂ,theﬁselectionpoolﬂ,theﬁselectedpoolﬂorﬁcandidatepoolﬂoracombinationofthemareusedtotrainthephrasetable.Whenselectingfromamonolingualpoolofsentences,afullSMTsystemtrainedontheseedcorpuscanbeusedtoturnthemonolingualcorpusintoaparallelcorpustotrainthephrasetable[43].Haffariintroducestheideaofexplorationversusexploitationinthiscontext([43]).Theideaistoﬁexploreﬂbyselectingsentenceswithnewphrasepairstoexpandcoveragewhileﬁexploitingﬂphrasepairsthatarenotnew,toimprovetheirprobabilityestimation.WeuseTable3.1tofurtherexplainthisidea.Phrasepairsthatoccurfrequentlyintheselectionpoolbutdonotoccuroroccurrarelyintheseedcorpusareofmostvalue.Iftheydonotexistintheseedcorpus(A),theyaddnewcoverage(exploration).Ontheotherhand,theyprovidebetterestimationiftheyoccurrarelyintheseedcorpus(B,exploitation).Phrasepairswithlowfrequencyintheselectionpoolcanalsobeusefuliftheyoccurrarelyintheseedcorpus(D)ordonotoccuratall(C).Althoughthisintuitionisexplainedinthecontextofphrase-pairsitisapplicableton-gramandlanguagemodelbasedscoringfunctionsaswell.SelectionPoolPhrasePairshighfrequencylowfrequencyzerofrequencySeedCorpushighfrequencylowfrequencyBDzerofrequencyACTable3.1ExplorationversusExploitationA:Improvedcoverageforimportant11phrasepairsB:ImprovedestimationforimportantphrasepairsC:ImprovedcoverageforphrasepairsD:ImprovedestimationforphrasepairsHaffariintroducesaphrasepairutilityfunctiontocaptureexplorationversusexploitation(Equation3.12).Thesentencescoringfunctionisthencomputedusingarithmetic(Equation3.13)11Theassumptionhereisthataphrasepairthatoccursfrequentlyintheselectionpool,islikelytobeusedinatranslationtask,andthereforeimportant.27orgeometric(Equation3.14)averageofthephrasepairutilityfunctionoverallphrasepairs.PPUtil(ppj(s;t)j11)=P(PP1=ppjPP12j1Sk=1PPs(sk;tk))P(PP2=ppjPP22jSjSk=jPPs(sk;tk))(3.12)pp:Aphrasepairmappingaphraseinonelanguagetoitstranslationinasecondlanguage.PPs(s;t):Thesetofallphrasespairsextractedfromsentencepair(s,t).f2((sj;tj)j(s;t)j11)=0@Õpp2PPs((sj;tj))LogPPUtil(ppj(s;t)j11)1A1jSj(3.13)f3((sj;tj)j(s;t)j11)=1jSjåpp2PPs((sj;tj))PPUtil(ppj(s;t)j11)(3.14)Ambatitakesadifferentapproachinusingaphrasetableforscoringsentences([4]).Thegoalistogiveahigherscoretosourcephraseswherethephrasetableismoreuncertainabouttheirtrans-lation.Sinceentropyisameasureofuncertainty,Ambatiphrasalentropy(Equation3.15)tocalculatethelevelofuncertaintyaphrasetablehasregardingasourcephrase.PhrEnt12(psj(s;t)j1)=åpt2Trans(ps)PTM(Sj1)(ptjps)logPTM(Sj1)(ptjps)jTrans(ps)jTrans(ps):Targetphrasesforphrasepsaccordingtophrasetabletrainedon(s;t)j1(3.15)Usingphrasalentropy,thescoringfunctionforasentenceisasthesumofthephrasalentropyoverallsourcephrasesinasentence.Thisscoringfunctioncanbeextendedtousetwo12ByconditionalentropyofH(YjX)isasåx2X;y2Yp(x;y)logp(yjx).But,theprovidedforPhrEnthereismissingaPTM(Sj1)(ps)insidethesummationbutoutsidethelog.Werecognizethisinaccuracybutkeeptheformulaconsistentwiththereferencedwork([4]).28differentphrasetables(onetrainedfromsourcelanguagetotargetlanguageandtheothertargettosource).Usingthesecondphrasetable,uncertaintyabouttargetphrasescanalsobetakenintoaccount.3.5.1.3LanguageModelBasedFunctionsLanguagemodelbasedscoringfunctionsaresimilarton-grambasedscoringfunctions.Inbothcases,n-gramsarethebasisforscoringsentences.Theadvantageoflanguagemodelbasedscoringfunctionsistheirhigheraccuracyinestimatingn-gramprobability,especiallywhenthedataissparse(Thisisduetoback-offandsmoothingstrategiesusedinlanguagemodeling.)Ontheotherhand,trainingalanguagemodelhasahighercomputationalcomplexitycomparedtocollectingn-gramstatisticsforn-grambasedscoringfunctions.Intheabsenceofaphrasetable(whenparalleldataisnotavailableorlesscomputationalintensityisdesirable),Haffarietal.suggestsusingalanguagemodeloverallpossiblephrases(alln-gramsofuptoacertainsize)tocalculatetheequivalentofthephrasepairutilityfunctioninEquation3.12,n-gramutility.N-gramutilityfollowsthesameprinciple,butusesn-gramsandtwolanguagemodels(onelanguagemodeltrainedontheseedcorpusandanothertrainedontheselectionpool)insteadofphrasepairsandtwophrasetables(Equation3.16).NGramUtility(ngj(s;t)j11)=PLM(SjSjj)(NG1=ng)PLM(Sj11)(NG2=ng)(3.16)PLM(Sj11)(NG=ng):Probabilityofn-gramngaccordingtolanguagemodeltrainedonSj11.Thesentencescoringfunctionisthenasanarithmeticorgeometricaverageofthen-gramutilityfunctionofalln-gramsuptoacertainlengthderivedfromthesentence.InaslightvariationtoEquation3.16,Mandaletal.useperplexity13insteadofprobabilityusingthesame13Theperplexityofarandomvariableisastwo(oranygivenbasenumber)tothepowerofitsentropy.Innaturallanguageprocessingthisfunctioniscommonlyusedasameasureofhowsurprisedalanguagemodeliswhenobservingasequenceofwords.29languagemodels([71]).Liuetal.attempttoassignthehighestscoretosentenceswithmostinformativen-grams([69]).Theymeasureinformativenessofn-gramsbyn-gram14entropy.Inaddition,theymultiplytheentropybythesquarerootofthen-gramlengthtoencouragelongerphrasesastheysuggestlongerphrasescontributetobettertranslation.NGEntropy(ngjSj11)=log PLM(SjSjj)(ng)!plen(ng)1ng2NG(SjSjj)15(3.17)Equation3.17usesindividualentropyasameasureofinformativeness.Wewouldliketopointoutthatamoreappropriateuseofentropywouldbeusingitasameasureofuncertainty.Forthispurpose,givenan-gramngofsizem,fw1:::wmg,theentropyofthen-gramcanbeinthecontextofalln-gramsofsizem+1,fw1:::wm+1g,thathaven-gramngastheir(Equation3.18).NGEntropyLM(Sj1)(wm1)=åw2Vocab(Sj1)PLM(Sj1)(wjwm1)logPLM(Sj1)(wjwm1)Vocab(Sj1):ThesetofalluniquewordsinSj1.(3.18)Finally,Ambatiattemptstomaketheprobabilitydistributionfunctionoftheselectedpoolascloseaspossibletotheprobabilitydistributionfunctionoftheselectionpool.Twolanguagemodelsaretrainedforthispurpose.Thelanguagemodelistrainedontheseedcorpusandselectedpool,whilethesecondlanguagemodelistrainedontheselectionpool.Theaimisto14Forallpracticalpurposesintheirwork([69]),aphraseisthesameasann-gramastheyconsiderphrasesforasentencetobealln-gramsofuptoacertainlength.15SeeEquation3.7forofNG(SjSjj).30minimizetheKullback-Leiblerdivergence16betweenthesetwoprobabilitydistributionfunctions.Toachievethis,thescoringfunctionforasentenceequalsthesumofitsn-gramcontributionstototheoverallKLdivergence(Equation3.19).KL-Div(ng)=PLM(Sj11)(ng)logPLM(Sj11)(ng)PLM(SjSj1)(ng)(3.19)BasedonEquation3.19,thesentencelevelscoringfunctionisthesumoftheKL-Divfunctionoverallpossiblen-gramsinasentence.3.5.1.4DecoderBasedFunctionsDecoderbasedscoringfunctionsrequireafullSMTsystemtobetrainedovertheseedcorpus,theselectionpoolorboth.Ifthegoalistoselectfromaparallelpoolofsentences,thesimpleapproachistouseafullSMTsystemtrainedovertheseedcorpustotranslatesentencesintheselectionpool.Thescoringfunctionforasentenceisbasedontheerrororthedistancebetweentheproducedtranslationandthetargetsideofthesentencepair.TheideaisthatiftheSMTsystemmakesanerroronasentence,thesentenceislikelytocontainusefulinformation.Mostofthefunctionsintroducedinthiscategoryassumetheselectionpoolismonolingualandthereforeuseothermeanstomeasurethelikelihoodthatthedecoderwillproduceanerroneoustranslation.HaffariusesanSMTtrainedontheseedcorpustotranslatethemonolingualselectionpool.Thescoringfunctionisobtainedbynormalizingthemodelscoreproducedbythedecoderusingsentencelength.Onedrawbackofthisscoringfunctionisthat,althoughdecoderscoreisanobjec-tivescorewhencomparingtranslationsofthesamesourcesentence,itisnotacomparablescorewhenconsideringdifferentsourcesentences.However,differencesinthisscorecanbeameaningfulmeasureandcaneffectivelybeusedfordataselection([44,4]).16Kullback-Leiblerdivergenceisaninformationtheoreticmeasureofthedistancebetweentwoprobabilitydistributions.31AnotherapproachtomeasuringtranslationerrorisusingRoundTripTranslationAccuracy(([44])).InthismethodtwoSMTsystemsaretrainedontheseedcorpus,oneineachdirection.Foreachsentenceintheselectionpool,itistranslatedoncefromsourcetotargetandthenfromtargettosource.Thescoringfunctionisthentheerrorfunctionbetweenthetranslationandtheoriginalsentence.Inter-SystemDisagreementisanotherapproachdevelopedbyMandaletal.([71]).Inthismethoddifferentstatisticalmachinetranslationsystems(e.g.,ahierarchicalsystemaswellasaphrasebasedsystem)aretrainedonaseedcorpus.Next,sentencesinaparalleldevelopmentsetaretranslatedusingallstatisticaltranslationsystems.Foreachtranslationsystem,aninter-systemdisagreementerroriscalculatedforitstranslationoutputusingallothersystems'outputasreferences.Thetargetsideofthedevelopmentsetisthenusedtotrainalinearregressionmodelthatpredictsthetranslationerrorbasedontheinter-systemdisagreementerrorfromeachtranslationsystem.Oncethelinearregressionmodelislearned,inter-systemdisagreementerrorsarecalculatedfortheselectionpoolandfedintothelinearregressionmodeltoproducethescoringfunction.Theideaisthatthesentenceswherethetranslationoutputfromdifferentsystemsdifferthemostarethemostpoorlytranslatedsentencesandthereforemostinformativeifaddedtotrainingdatawithhumantranslation.TheworkofAnanthakrishnanetal.differsfromallotherworkinthiscategory,sinceitdoesnotuseasentencelevelscoringfunction([6]).Instead,theytrainasentencelevelpairwisecomparatortosortsentencesintheselectionpoolbasedontheirestimatedtranslationerrorreduc-tion.Thepairwisecomparatorusesn-gramfeaturestocomparesentences.However,weincludethisworkinthissection,sinceadecoderwithmodelstrainedontheseedcorpusisusedtocreatethedevelopmentset.First,aparallelcorpusisrequiredtocreatethedevelopmentset.Next,thesourcesideoftheparallelcorpusisdecodedandtheTER17iscalculatedforeachsentencepair.Ananthakrishnanetal.suggestifthesentencewiththehighestTERisaddedtotheseedcorpus,itwilllikelyreducethetranslationerrorthemost.Sortingthesentencepairsinthedevelopment17TranslationEditRate(TER)measurestheamountofeditingahumanwouldneedtoperformonamachinetranslationoutputtoexactlymatchareferencetranslation([90]).32setaccordingtotheirTERscoreproducesthenaldevelopmentset.Theytrainthepairwisecom-paratorweightstobestmatchtheorderofsentencesinthedevelopmentset.Afterthecomparatoristrained,itisusedtosortthesentences.Thescoringfunctioninthiscaseistherankofeachsentenceinthesortedlist.Finally,Kauchakattemptstoestimatethecontributionofeachsentencepairintheselectionpoolbysamplingalargenumberofrandomsubsetsoftheselectionpoolwithreplacementandtrainingastatisticalmachinetranslationsystemsovereachsubsetofthetrainingdata.Aftercal-culatingtheperformance(e.g.,BLEUscore)ofeachsystem,acontributionscoreisestimatedforreachsentencepairbasedonitsmembershipindifferentsubsetsandtheperformanceofthosesubsets.Thismethodisverycomputationallyintensiveanditsuseisnotfeasibleinpracticalsce-narios.However,itcanbecalculatedonsmalldatasetsandusedasanoraclescoretoevaluateotherdataselectionfeaturesortunehyperparametersforvariousdataselectionmethods([51]).3.5.2ContextIndependentFunctionsContextindependentfunctionshavemostlybeeninspiredbyfeaturesusedintheareaofTranslationQualityEstimation([98]).InTranslationQualityEstimation,givenasentenceanditstranslation,thequalityofthetranslationisestimatedintheabsenceofareferencetranslation.Wehavegroupedthesefunctionsbasedonthestatisticormodelthatisrequiredfortheircomputation.3.5.2.1LanguageModelBasedFunctionsLanguagemodelscanbeusedindifferentwaystoassesstheusefulnessofasentenceorsentencepair.Denkowskietal.useanexternalsourceandtargetlanguagemodeltrainedoncleandatatooutungrammaticalsentences(Equation3.20).Inthiscasethescoringfunctionisthelengthnormalizedlanguagemodelscoreofthesentence([32]).Allauzenetal.usethelanguagemodelperplexityscoreofthesentenceasitsscoringfunctiontooutforeignlanguagesentences([3])(Equation3.21).33f4(si)=PLM(Sclean)(si)(3.20)f5(sijSclean)=2HindLM(Sclean)(si)(3.21)HindLM(S)(si)=PLM(S)(si)logPLM(S)(si)(individualentropy)Yasudaetal.usealanguagemodeltrainedonanin-domain18developmentsettoscoresen-tencesintheselectionpool([108]).Inthisapproachthescoringfunctionequalsthegeometricalaverageofthesourceandtargetentropy19ofthesentencepairaccordingtotheindomainlanguagemodels(Equation3.20).f6(sijSin)=rHindLM(Sin-src)(si)HindLM(Sin-tgt)(si)(3.22)Sincethelanguagemodelusedincomputingtheentropyisanestimateofthetrueprobabilitydistributionfunction,cross-entropy20isusedinstead.Auniformprobabilitydistributionfunctionisusedtoestimatecross-entropy.Thegoalinthismethodistogivesentencessimilartothein-domaindevelopmentsetahigherscore.However,theshortcomingofthismethodisthatitwillalsogivegenerallypopularsentences(thatarealsopopularinthein-domaindevelopmentset)ahighscoreaswell.Thisisnotdesirableinadomain-adaptationdataselectiontask.MooreandLewisaddressthisissuebyintroducingasecondlanguagemodeltrainedonageneral-domainde-velopmentset([74]).Intheirapproach,theyusethecross-entropydifferenceofthetwolanguagemodelstoscoresentences(Equation3.23).f7(sijSin;Sout)=HindLM(Sin)(si)HindLM(Sout)(si)(3.23)18Theseedcorpuscanbeusedforthispurposeaswell.19Entropy,H(X),measurestheamountofinformationoruncertaintyinarandomvariable([72]).20Cross-Entropyofarandomvariable,X,withthetrueprobabilitydistributionfunctionofP(X)andanestimatedprobabilitydistributionfunctionofQ(X)isformallyasthesumoftheentropyofXplustheKL-divergencebetweenP(X)andQ(X):H(P(X);Q(X))=H(X;P(X))+DKL(P(X)jjQ(X))([30]).34Axelrodetal.extendthecross-entropydifferencemethodtoincludethetargetsideofpar-alleldata([10]).Intheirmethod,bilingualcross-entropydifference,twoseparatein-domainandgeneral-domainlanguagemodelsaretrainedforthesourceandtargetlanguage.Foragivensentencepair,itsscoreiscalculatedbyaddingthecross-entropydifferenceofsourceandtargettogether(Equation3.24).f8(sijSin;Sout)=f7(sisrcjSin-src;Sout-src)+f7(sitgtjSin-tgt;Sout-tgt)(3.24)3.5.2.2AlignmentModelBasedFunctionsAlignment21basedscoringfunctionsareonlyapplicabletoparalleltrainingdataselection.Thesescoringfunctionsattempttoquantifythetranslationqualityandaccuracyofthesentencepair.Anywordalignmentmodel(e.g.,HMM-basedwordalignmentmodel[99])canbeusedtocomputethisscoringfunctionbyaligningeachwordinasentencetocorrespondingword(s)initspair.Anumberofscoringfunctionscanbecomputedusingthisalignmentmodel.KhadiviandNeysuggesttraininganalignmentmodel22onallsentencesintheselectionpool.ThescoreforeachsentenceequalstheaverageViterbialignmentprobabilitiesaccordingtosource-to-targetandtarget-to-sourcealignmentmodels(Equation3.25).21Awordalignmentmodelisastatisticalmodeltrainedandusedtoalignindividualwordsinasentencetotheirtranslationsinthetranslatedsentence.Aword-alignedsentencepaircontainsalignmentlinksusedtoalignwordsinonesentencetotheother([23]).22Intheirwork,KhadiviandNeyuseIBMmodel1,HMMandIBMmodel2insuccessiontotrainthemodel.But,theirscoringfunctiondoesnotdependonanyparticularalignmentmodel.Hence,anyalignmentmodelcanbeused.35f9((si)m1;(ti)n1)=1mlogmaxam1PAM(si)m1;am1j(ti)n1+1nlogmaxan1PAM(ti)n1;an1j(si)m1(3.25)(si)m1:words1throughminsentencesi:am1:targetwordalignmentlinkforsourcewords1throughm:Taghipouretal.introducealignmententropyasascoringfunction.Alignmententropy,asinEquation3.26,isasmoothmeasureofuniformdistributionofalignmentlinks.Inotherwords,givenanalignmentlinkcountofn,howuncertainthemodelisaboutpredictingthewordcorrespondingtothelink23.Thehighestvalueforalignmententropyisachievedwhenallwordshavethesamenumberofalignmentlinksassociatedwiththem.f10(si;ti)=1nå(si)n1p1((si)n1)logp1((si)n1)1må(ti)m1p1((ti)m1)logp1((ti)m1)(3.26)p1((si)k)=a((si)k)åka((si)k)a((si)k):Thenumberoflinksthatendonword(si)kTheyalsouseanumberoffeaturesinspiredbytheworkofMunteanuandMarcuonﬁexploita-tionofnon-parallelcorpusformachinetranslationﬂbasedonnumberofalignmentlinksandword23Wethinkamorenaturalderivationforalignmententropyofasentenceisaveragingalignmententropiesofeachsourceortargetwordaccordingtoallpossiblealignments.Thiswillprovideameasureofhowuncertainanalignmentmodelisaboutwordalignmentsinasentence.ThisisamorenaturaluseofentropycomparedtotheuncertaintyaboutawordgiventhenumberofViterbialignmentlinksfortheword.36fertility24.f11(si)=Nsrc(si)Ntgt(si)(3.27)f12(si)=Nsrc(si)jsij(3.28)f13(si)=Ntgt(si)jsij(3.29)f14(si)=argmaxwi2sifertility(wi)(3.30)Nsrc(s):NumberofnullalignedsourcewordsinsentencesNtgt(s):NumberofnullalignedtargetwordsinsentencesDenkowskietal.addthefollowingscoringfunctionstothelistabove([32]).f15(si)=Asrc(si)jsij(3.31)f16(si)=Atgt(si)jsij(3.32)Asrc(s):NumberofalignedsourcewordsinsentencesAtgt(s):NumberofalignedtargetwordsinsentencesAmbatiusesbidirectionalalignmentscorestoprovidethescoreforasinglealignmentinasentencepair([4]).Denkowskietal.assignasentencepairscorebyaveragingthesealignmentscores([32])(Equation3.33).f17(si;ti)=qPAM(sijti)PAM(tijsi)(3.33)Inaddition,Ambatiproposeusingthedistancebetweenthebidirectionalalignmentscoresforthehighestprobabilityalignmentandthesecondhighestprobabilityalignmentforasourceortargetwordasameasureofalignmentuncertainty.Thecloserthedistance,themoreuncertainthealignmentmodelisaboutthealignmentforthewordinquestion(Equation3.34).Although,they24Inwordalignment,fertilityofawordisasthenumberofalignmentlinksinitiatedfromthatword.37havenotextendedthisscoringfunctiontothesentencelevel,anaverageoverallsourceortargetwordscanbeusedtoprovideasentencelevelscoringfunction.3.5.2.3OtherScoringFunctionsWeprovidealistofsentencelevelscoringfunctionsthatdonotthescoringfunctioncategoriesabove.Theseincludelength-basedfunctions,character-basedfunctionsandfunctionsindicatingliteralnessoftranslation.Length-basedfunctionsareprimarilyusedtooutnoisydata.Inthefollowinglengthbasedfunctions,sentencelengthcanbecalculatedasthenumberoftokens/wordsornumberofcharacters.Adevelopmentsetcanbeusedtotunethresholdsfortheselengthbasedfunctions[52,95].Alternatively,anunsupervisedoutlierthresholdcanbedevelopedbasedonthecandidatesentencepoolitself([32]).1.Sourcesentencetotargetsentencelengthratioandviceversa([52,95,32])2.Differenceofsourceandtargetsentencelengths[95])3.Sentenceslength([32])4.Lengthoflongesttokeninthesentence([32])Afewothersimplescoringfunctionscanalsobeusedtooutnoisydata.1.Sentenceendpunctuationagreement([52])2.Percentageofalphanumericcharacters([52,32])3.Existenceofcontrolcharactersandinvalidunicodecharacters([32])4.Co-occurrenceofspecialtokenssuchasemailaddress,URL,dateornumberonbothsidesofasentencepair([95]).Inadditiontothefunctionsabove,Hanetal.introducetwofunctionstoassesslexicalandsyntacticcompatibilityofasentencepair([45]).Theirapproachistoassignscorestosentence38pairsaccordingtothepotentialforastaticalmachinetranslationtolearnformthesentencepair.Theideaisthatsentencepairswithliteraltranslationsaremorelikelytobeusefulcomparedtoanabstractorconceptualtranslation.First,theyattempttoestimatetheliteralnessofatranslationgivenasentencepairusinglexicalfeatures.Thisfeatureiscalculatedbyexpandingthewordsofsourceandtargetsentencesbyasynonymdictionaryandthencountingthenumberofwordsthataretranslationsofeachothergivenabilingualdictionary.Theynormalizethisvaluebythetotalnumberofwordsanduseitasthefunctionscore(Equation3.34).f18(si;ti)=SkBiDicsrc!tgtSynonymssrc(sik)T SkSynonymstgt(tik)!SkSynonymstgt(tik)+SkSynonymssrc(sik)T SkBiDictgt!srcSynonymstgt(tik)!SkSynonymssrc(sik)(3.34)BiDicsrc!tgt(wi):Translationsforwordwiaccordingtoagivendictionary.Synonymssrc(wi):Synonymsforwordwiaccoringtoagivendictionary.sik:kthwordinsentencesiIntheirsecondapproach,Hanetal.usepartofspeechtagstoevaluatethegrammaticalcompat-ibilityofasentencepair([45]).First,theymanuallymapcommonPOStagsofthetwolanguages(e.g.,theﬁnounﬂPOStaginEnglishismappedtotheﬁnounﬂPOStaginChinese).Foragivensentencepair,thenumberofoccurrencesofeachsourcetaganditstargetcounterparttagiscal-culated.Theirdistanceiscomputedusingtheratioofthesmallercounttothelargercount.Aweightedaverage25ofthedistancemeasuresforallcommontagsisusedtocalculatethecross-lingualgrammaticalcompatibility(Equation3.35).25Theweightsaretrainedusingamanuallyanalyzeddevelopmentsetofsize1000.39f19(si;ti)=åtagltagmin(Count(si;tag);Count(ti;tag))+1max(Count(si;tag);Count(ti;tag))+1(3.35)3.6FeatureCombinationandParameterTuningSomedataselectionscenariosandscoringfunctionsrequireparametertuning.1.Scoringfunctionparameter(s):Somescoringfunctionscontainaparameterthatneedstobetuned(e.g.,[45]).2.FeatureCombination:Sincedifferentscoringfunctionscapturedifferentaspectsofthedata,itisoftendesirabletocombinemorethanonefunctionfordataselection.Forthispurposeacombinationmodel(e.g.,linear)isrequiredalongwithanalgorithmtooptimummodelparameters(e.g.,[52]).3.Thresholdtuning:Dataselectionmethodsthatuseathresholdtodata,requirethethresholdparametertobetuned(e.g.,[32]).Acommonrequirementforparametertuningandfeaturecombinationistheavailabilityofadevelopmentset.Inasupervisedlearningscenario,humanannotatorsareusedtocreatethedevelopmentset.Forexample,Hanetal.usehumanannotatorstolabelsentencepairsasliteraltranslationsversusconceptualtranslationstocreateadevelopmentset.Sincethisisanexpensiveandtimeconsumingtask,alternativetechniquesfordevelopmentdatacreationinsemi-supervisedlearningmethodshavebeendeveloped([52,95,7]).1.Mixingcleanandnoisydata:Intheirwork,KhadiviandNeyaddnoisydatatocleandata(labeledaccordingly)asthedevelopmentsettotraintheirnoisealgorithm.2.SinglePassActiveLearning:Intheirefforttotrainweightsforapairwisesentencecom-parator,Ananthakrishnanetal.useaparallelcorpusorderedbyTER26(between26TER:TranslationEditRate[90]40referencetranslationsandtranslationsofdecoderusingmodelstrainedonseedcorpus)astheirdevelopmentset([6]).Haffariusesasimilarmethodtogenerateadevelopmentsettotraintheirfeaturecombinationmodelweights([43]).3.UsingSelectionPool:Denkowskietal.usethecandidateselectionpoolofsentencestocomputeoutlierdetectionparameters([32]).Theycalculateaverageandstandarddeviationontheselectionpoolforanumberoffeatures(e.g.,sentencelengthratioandalphanumericdensity)tooutoutliers.3.7SelectionAlgorithmsGivenaconstrainedoptimizationproblemformulation,consistingofscoringfunctions,acombi-nationmodelandsizeconstraints,aselectionalgorithmisusedtosolvetheoptimizationproblem.3.7.1ThresholdBasedFilteringAcontextindependentfunctioncanbecomputedforallsentencepairsinonepassandthesubsetoftrainingdatathatpassesacertainthresholdcanbeselectedinlineartime.MethodsthatdependonathresholdtunethethresholdpriortoForexample,Denkowskietal.setsthethresh-oldsforlengthbasedatavg(sN1)2stddev(sN1).Otherselectionmethodswithcontextindependentscoringfunctionsiterativelyselecthighestscoringfunctionsuntiltheconstraintismet.ThesemethodsrequirethesentencestobesortedpriortoForalineartime,radixsortcanbeused.TheworkofMooreandLewisisanexampleofconstraintbasedassentencesaresortedbasedontheircross-entropydifferencebeforetheisapplied.Clas-basedapproachessuchas[95,52]followthesameparadigm.Sincetheseapproachescombinemultiplefeatures,theytrainaonadevelopmentsetandthenapplythetotheselectionpooltolabelsentencesasselectedversusnotselected,orcleanversusnoisy.413.7.2GreedySearchTheselectionmethodformostmethodswithcontextdependentscoringfunctionsisgreedysearch.Thehighestscoringsentenceisselectedineachiterationandthemodelsusedforthescoringfunctionsareupdated.Thisprocessisrepeateduntilaconstraint(e.g.,thetotalnumberofwordsinselectedpooloraminimumqualitybar)ismet.[33],[44]and[6]areexamplesofthisselectionmethod.Duetohighcomputationalcomplexity,forpracticalpurposes,greedyselectionalgorithmsareoftenpairedwithbatch-learningtechniques(seeSection3.9).3.7.3SubmodularOptimizationKirchhoffandBilmeshaverecentlyintroducedsubmodularityfordataselectioninstatisticalma-chinetranslation([53]).Submodularfunctionsareaclassofsetfunctionsthatformalizethecon-ceptofﬁdiminishingreturnsﬂ.Thismakesthemanaturaltheoreticalframeworkfordataselection.Formally,afunction,F,iscalledsubmodularifaddinganinstancextoasetSincreasesF(S)morethan(orequal)ifitwasaddedtoasupersetofS,SˆS0(Equation3.36).F(S[fxg)F(S)F(S0[fxg)F(S0)(3.36)Ingeneral,solvingEquation3.4exactlyisNP-complete.However,ifthesetfunctionFissubmodulartheproblemcanbesolvedusingagreedyalgorithmwithworst-casesolutionofF(X)>(11=e)f(Xopt).Thatis,intheworstcase,thefunctionvalueforthesolutionfromthegreedyalgorithmisapproximately0:63oftheoptimumsolution[53].Moreover,dependingonthecurvatureofthesubmodularfunction,theworstcaseguaranteecanbeimproved.Graphbasedsubmodularfunctionsareanaturalfordataselectionaseachsentenceisrepresentedasanodeinthegraphandedgeweightsarecomputedbasedonthesimilarityofthetwosentences.Theorderofgraphconstructioncomplexityisquadraticinthenumberofsentences.Thismakesthisclassofsubmodularfunctionsimpracticalforlargedatasets.AnotherclassofsubmodularfunctionsthataremorepracticalforlargedatasetsisbasedonbipartitegraphsG=(V;U;E;w).Inthissetting42sentencesaretheleftvertices,V.Features(e.g.,n-grams)aretherightvertices,U,withweightw.Connectingeachsentencetoitsfeaturesaddsuptothesetofedges,E.Inthissetting,KirchhoffandBilmesafeature-basedsubmodularfunctionasfollows.f(X)=åu2Uwufu(mu(X))(3.37)Assumingwutobepositive,mu(X)tobeanon-negativemodularfunction27andfutobeanon-negative,non-decreasingconcavefunction,F(X)asinEquation3.37issubmodular.Thisisaxibleframeworkfordataselectionthatisscalabletolargedatasizes.Usingthisframeworkrequiresthecomponentsbelow:1.U:Uisthesetoffeaturespresentinallsentencesintheselectionpool.N-gramsareanaturalchoiceforthis.2.mu(x):Thisfunctioncalculatestherelevanceofeachfeatureutosentencex.KirchhoffandBilmesusetfidf(u;x)forthisfunction.Themainconstraintwiththisfunctionisthatithastobemodular.3.w(u):Eachfeaturecanhaveaweightfunctiontopresentitsimportance.Inadomain-adaptationtask,thefrequencyofthen-graminanin-domaindevelopmentsetcanbeusedforthisweightfunction.4.fu(a):Thisdecayfunctiondeterminestherateofdiminishingreturnsforredundantinstancesofafeatureu.Asthisconcavefunctionbecomesthefeaturelosesitsabilitytoprovideadditionalimprovementtothevalueofacandidatesubset.KirchhoffandBilmesexperimentwithsquarerootandlogarithmicfunctions.ThegreedyalgorithmproposedbyKirchhoffandBilmesissimilarthegreedyalgorithmsmen-tionedinSection3.7.2.Theiralgorithmchoosesthesentenceinthesentencepoolthat,ifaddedto27Asetfunctionismodularifandonlyifthevalueofthefunctionoverasetequalsthesumofthefunctionvalueoveritsindividualelements.43theselectedsentences,improvesthesubmodularfunction'svalueovertheselectedsentencesthemost.Thisstepisrepeateduntilthebudgetconstraintismet.3.8ActiveLearningActivelearningisatechniqueusedinmachinelearningwherethealgorithmcanchoosethenewdatafromwhichitlearns[88].Thistechniqueisoftenemployedinproblemswhereunlabeleddataisabundantwhiletheprocessofobtaininglabeleddataisexpensiveortime-consuming.Inanactivelearningframework,thelearningalgorithmqueriesanoracle28fornewdatapointstobelabeled.First,thelearningalgorithmcanobtainnewdatapointsinseveralways.1.PoolBased:Thelearningalgorithmselectsdatapointsfromapoolofunlabeleddata.2.StreamBased:Allavailableunlabeleddataisstreamedthroughthelearningalgorithmandthelearningalgorithmcanchoosetoqueryforthedatapointordiscard.3.QuerySynthesis:Thelearningalgorithmgeneratesnewdatapointstobelabeledbytheoracle.ThelearningalgorithmcanuseseveralstrategiesforselectingnewdatapointssuchasUncer-taintySampling,QuerybyCommittee,ExpectedModelChange,ExpectedErrorReduction,Vari-anceReductionandDensity-WeightedMethods([88]).Activelearningselectsnewdatapointsinaniterativeprocess.Ineachstep,itselectsnewdatapoint(s),queriestheoracleforthelabelandlearnsfromtheresults.Incontrast,inpassivelearning,thelearningalgorithmhasnocontributiontothedataselectionandlabelingprocess.Fromanotherperspective,inpassivelearningthedataselectionprocessdoesnotusethelearningalgorithmtoselectnewdatapointsforlabeling.Al-though,thedataselectiontaskforpassivelearningcanalsobeiterative,thekeydistinctionisthatthelearningalgorithmdoesnotplayaroleinthedataselectiontask.28Anoraclecanbeahumanorahumanlabeleddatasetthatprovidesthetruelabelforaquery.Inthecontextofmachinetranslation,theoracleisahumantranslatororaparallelcorpuswherethesourcetexthasalreadybeentranslatedbyhumans.44Instatisticalmachinetranslationadataselectionmethodcanbeasanactivelearningmethodonlyifitfollowstheiterativedataselectionprocessdescribedabove.ContextDependentFunctionslistedinSection3.5.1thisparadigm.Inaddition,theselectionprocessmustuseanelementofstatisticalmachinetranslationlearningtobeconsideredactivelearning.Althoughonlysomepriorworkrefertotheirmethodasactivelearning[44,4,7],allscoringfunctionslistedinSections3.5.1.2,3.5.1.3and3.5.1.4theactivelearningframework.3.9Batch-ModeLearningBatch-modelearningisapplicabletobothactivelearningaswellasdataselectionforpassivelearning.Theideaistoselectnewdatapointsinbatchesratherthanoneatatime.Withoutbatch-modelearning,afterasingledatapointisselectedandlabeledbytheoracle,thelearningmodel(e.g.,LanguageModelorTranslationModel)orthedataselectionscoringfunctionstatistic(e.g.,n-gramfrequencyinselectedsentences)isupdatedwiththenewdatapoint,andalldatapointscoresarerecomputed.Thisprocesscanbeverytimeconsuming,especiallyifSMTmodelsneedtoberetrainedinordertorecomputedatapointscores.Batch-modelearningaddressesthisconcernbyselectingmultiplenewdatapointsatatimeandthereforereducingthenumberoftimesthelearningmodelhastobeupdatedandsentencescoresneedtoberecalculated.UsingthetopNsentenceswithhighestscoresoftendoesnotproducethebestresultssinceitfailstoconsideroverlapbetweendatapoints.Thisiscommonlyaddressedbyminimizingoverlapbetweendatapointswithinabatch.ToselectabatchofsizeK,Ananthakrishnanetal.gothroughthedataKtimes29([6]).Eachtimetheyselectthesentencewithmaximumscoreandupdatethescoringfunctiontoexcludefeatures(inthiscasen-grams)usedinscoringtheselectedsentences.InspiredbytheworkofDasguptaandHsuonHierarchicalSampling30,Haffarietal.pro-29Inthisbatch-modelearningstrategytheideaistoavoidretrainingallSMTmodelsafterselec-tionofeachsentence.Alessexpensiveupdateisusedtoimprovethediversityofthebatchwhereonlythescoringfunctionisupdated.Aftertheentirebatchisselected,thenallSMTmodelsareupdated.30HierarchicalSamplingattemptstoleveragetheclusterstructureofthedataforsamplinginan45poseHierarchicalAdaptiveSamplingforfeaturecombinationandbatch-learningatthesametime([43]).TheirworkdiffersfromHierarchicalSamplingbynothavingastatichierarchi-calclusterovertheselectionpool.Instead,thehierarchyiscreateddynamicallyinsteps.UnlikeHierarchicalSampling,theiralgorithmalwaysselectssamplesfromtheclusternodethathasbeenselectedforfurtherpartitioning.Ateachstep,Ksentencesarerandomlychosenfromtheselectedclusternodeandqueriedforhumantranslations.TheselectedsentencesareaddedtothesetofselectedsentencepairsandtheSMTmodelsareretrainedonthelargersetofselectedsentencepairs.Thetwopointsbelowareessentialtounderstandtherestofthisalgorithm.1.Choiceofclusterforfurtherpartitioning:AclusternodewithlowestaveragemodelscorewhentranslatedusingtheupdatedSMTmodelsischosenforfurtherpartitioning.2.Partitioningmechanismforachosenclusternode:Thesentencesinthepartitionaresortedaccordingtotheirsimilaritytotheselectedsentences.Theapercentageofthesen-tencesareputinthechildclusternode,andtherestinthesecondchildclusternode.Ineffect,HierarchicalAdaptiveSamplingiscombiningtheﬁaveragemodelscoreﬂfeaturewiththeﬁsimilaritytotheselectedsentencesﬂfeature.Whiletheﬁaveragemodelscoreﬂfeatureselectstheclusterthatismostdiftotranslate,theﬁsimilaritytotheselectedsentencesﬂfeatureensuresdiversity.However,inpractice,thesetwofeaturescanbereplacedbymanyofthescoringfunctionsintroducedinSection3.5.Inadditiontoitshighcomputationalcomplexity,anothershortcomingofthisapproachisthatitdoesnotensurediversitywithinasinglebatch.Inafollowupactivelearningsetting([31]).Inthiswork,astatichierarchicalclusterstructureoftheunlabeleddataisgiven.Asetofclusternodesforsamplingismaintainedthroughoutthealgorithm.Initially,thesamplingsetonlycontainstherootnodeoftheclusterwhichcontainsallnodes.Randomsamplesaredrawnfromthesamplingsetandqueriedfromtheoracle.Basedonthesequeries,eachnodeinthehierarchicalclustermaintainsstatisticsaboutitspositiveandnegativelabels.Clusternodesinthesamplingsetwithmixedlabelsandmorepurechildnodesareremovedfromthesamplingsetandtheirchildnodesareadded.Thesestepsarerepeateduntilallnodesreachalevelofpurity.Thismethodismotivatedbyaddressingtheﬁsamplingbiasﬂproblem([87])inactivelearningandprovidestheoreticalguaranteesforabetterlearningperformancethanrandomsampling.46totheirworkreferencedabove,Ananthakrishnanetal.introducethreelevelsofscoringfunctions([7]):1.NeverUpdated:Thesefunctionsareonlycomputedonceforsentencesintheselectionpoolandareneverupdated.Theyusen-gramoverlapwithanin-domaindevelopmentsetasameasureofdomainrelevanceforthispurpose.2.UpdatedPerBatch:Thesefunctionsareupdatedafterabatchofsentencesareselectedandqueriedforhumantranslation.TheycomputetranslationdifbyretrainingtheSMTmodelswiththenewbatchandderivetranslationdiffeaturesbasedontheperformanceofnewmodelsforthenextbatchofsentences.3.UpdatedPerSentence:Thesefunctionsarerecomputedeverytimeasinglesentenceisaddedtothepool.N-gramoverlapwithexistingsentencesinthebatch(batchdiversity)isusedforthispurposeandhastobeupdatedpersentence.Althoughonlybatchdiversityisrecomputedaftertheselectionofeachsentence,Ananthakrishnanetal.useallthreefunctionsabovetoselectthehighestscoringsentenceineachstep.Thisiscontinueduntilthebatchsizeisreached.3.10EvaluationMethodsAllpriorworkevaluatetheeffectivenessofdataselectionmethodsbytrainingonaselectedpar-allelcorpusandtestingonarandomlyheld-outtestset.Indataselectionfordomainadaptationscenarios([74,8]thetestsetisanin-domaintestset.Asformoststatisticalmachinetranslationtasks,BLEUisthemostcommonevaluationmethod.Whileinsomecasesselectingasubsetofthedataoutperformscasesusingtheentiredataset[53,33],thegeneraltrendisthatselectingmoredataimprovesevaluationresults.Forthisreason,comparisonbetweendifferentselectionmethodsisoftendonebyrunningeachselectionalgorithmmultipletimes.Eachtimetheselectionconstraint(e.g.,maximumnumberofwordsselected)issettoadifferentpercentageofthedata.47AnSMTsystemistrainedoneachselectedsubsetandtestedontheheld-outtestset.Thisprovidesseveralcomparisonpointsbetweentheselectionmethods.Theobjectiveincostfocusedselectionmethodsistomeetaqualitybaratminimumcost([19]).Inthesescenarios,differentselectionmethodsarecomparedbasedontheircosttoreachaBLEUscore.3.11ComparativeSummaryofCitedWorkTheworkofEcketal.isnoteworthyastheworkpublishedondataselectionforstatisticalmachinetranslation([33]).Featuresintroducedinthisworksuchasn-gramcoverageandtfidfarestillthebasisformuchofthemostrecentwork.ApracticalextensionoftheirworkistheVo-cabularySaturationFilter([66])whichrelaxesthedependencyonpreviouslyselectedsentences;thisallowsforlinear-timeapplicationofn-gramcoveragetoverylargedatasets.Anotherexten-siontotheworkofEcketal.isusingfeaturedecayfunctions([5,18])toimplementtheideaofdiminishingreturns.Whileintheiroriginalwork,Ecketal.maintainedn-gramcountstatistics,otherwork([44,68])usedlanguagemodelsformoreaccurateprobabilityestimationinstead.MooreandLewisintroducecross-entropydifferencewhichhasbecomeoneofthemostcom-monlyusedapproachesindataselectioninthedomainadaptationsetting[74].Bilingualcross-entropydifferenceisanaturalextensionofthisworkthattakesbothsidesoftheparalleldataintoconsideration([8]).TheworkofHaffarietal.isuniquelyobjectiveinthattheyusetheperformanceofatranslationsystemtrainedontheselecteddataintranslatingnewdataasameasureoftranslationdifandthustheusefulnessofnewdata([43]).Theirapproachtoactivelearningiscomputationallyintensive.Intheirextensiontothiswork,Ananthakrishnanetal.addressthecomputationalintensityproblembytrainingatocomparesentencesbasedontheirtranslationdif.InafollowupworktheycombinetheapproachesofEcketal.,MooreandLewisandHaffarietal.todevelopamulti-layeractivelearningstrategythatcombinestranslationdif,domainappropriatenessanddiversityintoasingleframework([7]).TheworkofKirchhoffandBilmesoffersthemosteffectivedataselectionsearchframework48withtheoreticalguarantees([53]).Althoughtheydonotintroduceanynewfeaturesintheirsub-modularoptimizationmethodfordataselection,theyofferaxibleframeworkwherecustomfunctionscanbeusedforsentencefeatures,relevancescores,featureweightsandadecayfunc-tion.TheworkofAmbatietal.isuniqueinthat,theyintroduceacustomizedactivelearningstrategyforcrowdsourcingwhileintroducingnewfeaturessuchasphrasalentropyandKL-divergence[4].BloodgoodandCallison-Burchofferacostfocusedactivelearningstrategybyaskingacrowdfortranslationofphraseswithinasentenceinsteadoffullsentencetranslation([19]).Usingthismethodtheyareabletooutperformallpreviousapproachesonaperwordtranslationpricingscheme.Taghipouretal.andKhadiviandNeyuseabasedapproachtooutnoisydatawhileDenkowskietal.usesimpleaverageandstandarddeviationstatisticstoachievethesamegoal([95,52,32]).Hanetal.haveauniqueapproachtodataselectionwheretheyuselexicalandsyntacticcom-patibilitybetweensourceandtargetsentencestoestimatetheliteralnessofthetranslation([45]).Theirideaistoselectmoreliteraltranslationsasastatisticalmachinetranslationsystemcanonlylearnfromliteraltranslationsandwouldnotfromfreetranslations.FinallytheworkofKauchakisuniqueastheyprovidethemostobjectivemeasureofsentencepairusefulnessforSMT,althoughtheirmethodisnotpracticalforanyreal-worldscenario([51]).Intheirwork,theycreate500randomsubsetsfromtheselectionpooltotrain500differentsetsofmodels.Theusefulnessofeachsentencepairisestimatedbasedonitsmembershipintrainingdatafordifferentmodelsetsandthemodelsets'performance.RelatedworksintheliteraturehavebeenlistedinTable3.2accordingtothemaincomponentsoftrainingdataselectionlisted.49ReferenceScenarioFeaturesAlgorithm[33]LowResourceLanguage31N-gramCoverageTF-IDFGreedySearch[44]LowResourceLanguageRoundTripTranslationAccuracyPhrasePairUtilityN-gramUtilityN-gramCoverageDecoderTranslationModelScoreGreedySearchHAS32[71]LowResourceLanguageInter-SystemDisagreementLanguageModels'PerplexityRatioGreedySearchThresholdFiltering[61]DomainAdaptationTF-IDFThresholdFiltering[95]NoiseReductionWordAlignmentProbabilityWordtoNullAlignmentCountWordtoNullAlignmentRatioAlignmentEntropyTop-3WordFertilitiesFactoidCo-AppearanceSentenceLengthRatioLMProbabilityDifferenceLMProbabilityRatio[32]NoiseReductionLanguageModelScoreCombinedAlignmentScoreLengthRatioAlphanumericIntensityMaximumTokenLengthThresholdFiltering[6]LowResourceLanguageExpectedTranslationErrorReductionGreedySearch[45]NoiseReductionLexicalCompatibilityGrammaticalCompatibilityThresholdFiltering[104]TrainingResourceReductionTF-IDFSubmodularOpt.[69]TrainingResourceReductionWeightedPhraseEntropyWeightedSub-treeEntropyGreedySearch[74]DomainAdaptationCrossEntropyDifferenceThresholdFiltering[27]TrainingResourceReductionN-gramCo-OccurrenceGreedySearch[9]DomainAdaptationCrossEntropyDifferenceThresholdFiltering[40]TrainingResourceReductionAverageAlignmentProbabilityGreedySearch[4]LowResourceLanguagePhraseProbabilityUnseenN-gramsGreedySearch[52]NoiseReductionSentenceLengthRatioAlphanumericCharacterPresenceSentenceEndingSymbolSimilarityAutomaticLanguageAlignmentProbability[19]QualityImprovementUnseenN-GramsThresholdFiltering[51]LowResourceLanguageEstimatedContributionScoreThresholdFilteringGreedySearch[80]NoiseReductionMinimumN-GramOccurrenceSentenceLengthGreedySearch[3]NoiseReductionLMPerplexityThresholdFiltering[108]DomainAdaptationLMPerplexityThresholdFiltering[34]QualityImprovementN-gramCo-OccurrenceGreedySearch[66]TrainingResourceReductionMinimumN-GramOccurrenceAlignmentProbabilityThresholdFilteringTable3.2RelatedWork503.12RelatedResearchAreasDataselectioninmachinetranslationiscloselyrelatedtoqualityestimation.Inaqualityestima-tiontask,givenasentenceanditstranslationusinganSMTdecoder,thequalityofthetranslationisestimatedwithoutareferencetranslation([91]).Thisisaverysimilartasktodataselection.Infactmanyfeaturesusedindataselectionhavebeeninspiredbyfeaturesusedinqualityestimation([32]).Dataselectionisapracticalmethodofadaptingstatisticalmachinetranslationmodelstoadomainandisoftenusedindomainadaptationtasks.Otherdomainadaptationtechniquesincludetranslationmodeladaptation([28]).Mostoftheactivelearningliteratureinmachinelearningisfocusedontaskswithalimitednumberoffeatures.Therefore,thefocusisonselectingdatapointsthataremostusefulinlearningfeatureweights.Activelearningformachinetranslationisuniqueinthatthenumberoffeaturestobelearnedisverylargeor,forallpracticalpurposes,unlimitedforlargedatasets.Theactivelearningtaskinthiscaseistoidentifydatapointswiththemostusefulfeaturesinadditiontodatapointsthatarethemostusefulinestimatingtheweightsforthefeatures.Thismakesactivelearningformachinetranslationuniqueintheactivelearningliterature.Co-trainingandself-trainingarealsocloselyrelatedareastodataselectionformachinetrans-lation([107,43]).Bothtechniquesareusedwhenlabeleddataisscarceandunlabeleddataisabundant.Inasimpleco-trainingsetting,twotrainedonlabeleddataclassifythesameunlabeleddatasetusingtwodifferentbutcomplimentaryfeaturesets.Iftheagreeintheirlabels,thedatapointsandtheirlabelsareaddedastrainingdataandareretrained.Inself-training,unlabeleddataislabeledusingasingleertrainedonthelimitedlabeleddata.Datapointswithhighareaddedtothetrainingasnewdatapoints.Thesetechniqueshavebeenleveragedindataselectionformachinetranslationaswell.Inhiswork,Haffariexperimentswithself-trainingwhiletheworkof[71]usesfeaturesthatareclosely31LowResourceLanguage32HierarchicalAdaptiveSampling51relatedtoco-training.Finally,anotherrelatedresearchareaismodelpruninginstatisticalmachinetranslation.Indataselectionscenarioswithmodelsizedeploymentconstraintsmodelpruningcanbeusedasanalternativetodataselection.Theresearchinthisareafocusesonphrasetablepruning([106])andlanguagemodelpruning([41]).3.13SummaryInpractice,performingadataselectiontaskforstatisticalmachinetranslationrequiresaddressingtwotechnicalquestions:1.ScoringFunction:Howisthevalueofasentenceorsentencepairobjectivelyevaluated?2.SelectionAlgorithm:Givenascoringfunctionandaconstrainthowistheoptimumdatasubsetselected?Inthischapterwereviewedalonglistofobjectivefunctionsalongwithanumberofselec-tionalgorithms.Thiscombinationoffersanoverwhelmingnumberofdifferenttoperformadataselectiontask.However,inpractice,thenumberofapplicablesolutionscanbenarroweddownbytheapplicationscenario(Section3.2)anddatasize.Multipledataselectionscenariosareoftenemployedinthedevelopmentofasinglestatisticalmachinetranslationsystem.Forexample,whendevelopingageneralpurposemachinetranslationserviceforEnglish$Frenchtherearethreedifferentissueswithtrainingdatathatcanbeaddressedusingdataselection.1.Muchoftheavailableparalleldataiscollectedthroughwebcrawlingandisnoisy.2.Thereistoomuchparalleldatatotrainanditerateoninatimelymanner([66]).3.Evenifwecouldtrainonallthedata,ifweweretodeploytheSMTsystemonanofmobiledevice,themodelsizesaretoolarge.52InthisscenariothedatacanbecleanedusingcontextindependentscoringfunctionslistedinEquation3.20-3.33andSection3.5.2.3.Investingincreatingadevelopmentsettotrainaclas-([95])wouldenabletrainingweightsformultiplescoringfunctions.Iffewscoringfunctionsareused,usingaverageandstandarddeviationforthresholdbased([32])canbeeffectiveaswell.Thenoiselevelofasentencepaircanbeestimatedindependentlyusingcontextindependentscoringfunctions.However,whenthereistoomuchcleandata,selectingasubsetofthedatathatcanperformthebestonablindtestsetismorecomplicated.Activelearningprovidesanidealsolutionasitisthemostobjective.OntheotherhanditrequiresretrainingafullSMTsystemoneveryiterationwhichmakesitimpracticalforlargedatasets.Activelearninginherentlyusescontextdependentscoringfunctionssinceitusesmodelsbuiltfromtheselectedpooltoevaluatenewsentencesfromtheselectionpool.Othercontextdependentscoringfunctions([33])usein-crementalstatisticsfromtheselectedpooltoscorenewsentencesintheselectionpool.Ageneralandefframeworkfortheuseofcontextdependentscoringfunctionsissubmodularopti-mization([53]).ForverylargedatasetsthatacomputationalcomplexityofO(NlogN)isnotpractical,VSF33([66])offersalinearsolution.Thedeploymentresourcerestrictionscenarioissimilartotrainingresourcelimitationscenariowiththeexceptionthatphrasetableorlanguagemodelpruningisaviablealternativetodataselection.AnothercommonscenarioforthedataselectiontaskinSMTisdevelopinganSMTsystemforlowresourcelanguages.Inthisscenariothereisasmallparallelseedcorpusandalargemonolingualselectionpool.Theideaistoselectalimitednumberofsentencesfromtheselectionpooltobemanuallytranslatedbyhumansandaddedtotheseedcorpus.Inthisscenarioactivelearningstrategieshaveprovedtobethemostsuccessful([7,4,44].TheactivelearningstrategiescanbeusedwithanyofthecontextdependentscoringfunctionsreviewedinSection3.5.1whilebatch-modelearningstrategiesreviewedinSection3.9canbeusedtospeedupthedataselectionprocess.33VSF:VocabularySaturationFilter([66])53CHAPTER4TRANSLATIONDIRECTIONDETECTIONThegoalofthisdissertationistoselectasubsetofparalleltrainingdatathatresultsintrans-lationmodelsthatyieldhighestqualitytranslationoutput.AsdescribedinChapter3thistaskisoftenbrokendowntosentencelevelfeaturefunctions.Byaselectedparalleldatasubsetcannotbeimmediatelyandindependentlyevaluatedwithoutrelyingonthedownstreamtranslationmodeltrainingtaskandtestset.Thedownstreamtaskitselfcantakeonmanyvariations(e.g.,trainingaphrasebasedsystem[109]versushierarchicalphrasedbasedtranslationmodel[29]ortree-to-stringtranslationmodel[85])withmanytuningparameters(maximumphraselength,max-imumdistortion,pruningparameters,...)inadditiontothechoiceofdevelopmentsetandturningmethod(e.g.,PRO[47],MERT[79]orMIRA[102]).Finally,therearedifferentevaluationmeth-ods(BLEU[83],METEOR[12]orhumanevaluation[83]).Inadditiontoallvariationsintrainingastatisticalmachinetranslationsystem,thetrainingprocessisalsocomputationallyintensive.Theaforementionedfactorsmakesanobjective,fairandaccuratecomparisonbetweenawiderangeoffeaturefunctionsintractable.Thereforeinthischapterweselectedataskforparalleldatathatcanbeobjectively,independentlyandaccuratelyevaluatedwithouttheneedforaanydownstreamtask.Thisenablesustoexploreawiderangeofsentencepairfeaturefunctionsindifferentsettings(samedomainversuscross-domain)andaccuratelymeasuretheireffectiveness.Theonlytaskthatmeetsthecriteriamentionedaboveisthe"TranslationDirectionDetection"task.Inthischapter,wethetask,reviewrelatedwork,introducearangeofnewfeaturesandeval-uatetheireffectivenessinanexistingdataset.Inaddition,wedevelopathreewaycross-domaindatasetandevaluatetheeffectivenessofpreviouslyintroducedfeaturesaswellasnewfeatureinthiscross-domainsetting.ThebestperformingfeaturesareusedinChapter5andChapter??toimprovetheperformanceofthedataselectiontaskforstatisticalmachinetranslation.544.1IntroductionIthasbeenknownformanyyearsinlinguisticsthattranslatedtexthasdistinctpatternscomparedtooriginalorauthoredtext[11].ThetermﬁTranslationeseﬂisoftenusedtorefertothecharacteristicsoftranslatedtext.PatternsofTranslationesecanbecategorizedasfollows[101]:1.:Theprocessoftranslationisoftencoupledwithaprocessatseverallevels.Forexample,theretendstobelesslexicalvarietyintranslatedtextandrarewordsareoftenavoided.2.Explicitation:Translatorsoftenhavetobemoreexplicitintheirtranslationsduetolackoftheculturalcontextthatspeakersofthesourcelanguagehave.Anothermanifestationofthispatternismakingargumentsmoreexplicitwhichcanbeobservedintheheavyuseofcohesivemarkerslikeﬁthereforeﬂandﬁmoreoverﬂintranslatedtext[59].3.Normalization:Translatedtextoftencontainsmoreformalandrepeatinglanguage.4.Interference:Atranslatorislikelytoproduceatranslationthatisstructurallyandgrammat-icallyclosertothesourcetextortheirnativelanguage.InFigure4.1thesizeofawordintheﬁTranslatedﬂsectionisproportionaltothedifferencebetweenthefrequencyofthewordinoriginalandinthetranslatedtext[37].Forexample,itisapparentthatthewordﬁtheﬂisover-representedintranslatedEnglishasnotedbyotherresearch[101].Inaddition,cohesivemarkersareclearlymorecommonintranslatedtext.InthepastfewyearstherehasbeenworkonmachinelearningtechniquesforidentifyingTrans-lationese.StandardmachinelearningalgorithmslikeSVMs[13]andBayesianLogisticRegression[59]havebeenemployedtotrainforoneofthefollowingtasks:i.Givenachunkoftextinalanguage,classifyitasﬁOriginalﬂorﬁTranslatedﬂ.ii.Givenachunkoftranslatedtext,predictthesourcelanguageofthetranslation.iii.Givenatextchunkpairandtheirlanguages,predictthedirectionoftranslation.55Figure4.1EuroParlWordCloudDataVisualization(TranslatedvsOriginal)1Inadditiontoourstatedmotivationofexploringtheeffectivenessofdifferentfeaturefunctionsinparalleldatacltherearetwomotivationsstatedinpriorresearchforthetasksabove:empiricalvalidationoflinguistictheoriesaboutTranslationese[101],andsecond,improvingstatisticalmachinetranslationbyleveragingtheknowledgeofthetranslationdirectionintrainingandtestdata[63,65,64].FewparallelcorporaincludingacustomizedversionofEuroParl[48]andaprocessedversionofHansard[60]arelabeledfortranslatedversusoriginaltext.Usingtheselimitedresources,ithasbeenshownthattakingthetranslationdirectionintoaccountwhentrainingastatisticalmachinetranslationsystemcanimprovetranslationquality[65].However,improvingstatisticalmachinetranslationusingtranslationdirectioninformationhasbeenlimitedbyseveralfactors.1.LimitedLabeledData:Theamountoflabeleddataislimitedbylanguageanddomainandthereforebyitselfisnotenoughtomakeaimprovementinstatisticalmachinetranslation.2.Cross-DomainScalability:CurrentmethodsofTranslationesedetectiondonotscaleacross1ThiswordcloudwascreatedusingthewordcloudandtmRpackages[37]fromEuroParlparalleldataannotatedfortranslationdirection[48]obtainedfromhttp://www.hucompute.org/ressourcen/corpora/56.56differentcorpora.Forexample,atrainedonEuroParlcorpus[56]hadin-domainaccuracyof92.7%butout-of-domainaccuracyof64.8%[59].3.TextChunkSize:ThereportedhighaccuracyofTranslationesedetectionisbasedonrela-tivelylarge(approximately1500tokens)textchunks[59].Whensimilartasksareperformedatthesentenceleveltheaccuracydropsby15percentagepointsormore[60].Figure4.2showshowdetectionaccuracydropswiththereductionoftheinputtextchunksize.Sinceparalleldataareoftenavailableatthesentencelevelorsmallchunksoftext,existingdetec-tionmethodsaren'tsuitableforthistypeofdata.Figure4.2EffectsofChunkSizeonTranslationeseDetectionAccuracy2Motivatedbytheselimitationsandtheaforementionedgoalofthischapter,inthisworkwefocusonimprovingsentence-levelaccuracybyusingbilingualfeaturesatthesentencelevel.Inadditiontoimprovingaccuracy,thesefeaturesmaybebetterabletoexistingtheoriesordiscovernewlinguisticphenomenathatoccurinthetranslationprocess.Weuseafastlineartrainedwithonlinelearning,VowpalWabbit[62].TheHansardFrench-Englishdataset[60]isusedfortrainingandtestdatainallexperiments.2ThisisareproductionoftheresultsofKoppelandOrdan[59]usingfunctionwordfrequenciesasfeaturesforalogisticregression.Basedonthedescriptionofhowtextchunkswerecreated,theresultsofthepaper(92.7%accuracy)arebasedontextchunksizesofapproximately1500tokens.574.2RelatedWorkVolanskyetal.[101]provideacomprehensivelistofmonolingualfeaturesusedforTranslationesedetection.ThesefeaturesincludePOSn-grams,charactern-grams,functionwordfrequency,punc-tuationfrequency,meanwordlength,meansentencelength,wordn-gramsandtype/tokenratio.WhiledistinctpatternsofTranslationesehavebeenstudiedwidelyinthepast,theworkofBaroniandBernardini[13]isthetointroduceacomputationalmethodfordetectingTranslationesewithhighaccuracy.Priorworkhasshownin-domainaccuracycanbeveryhighatthechunk-leveliffullylexicalizedfeaturesareused[101],butthenthephenomenalearnedareclearlynotgener-alizableacrossdomains.Forexample,inFigure4.1,itcanbeobservedthatcontentwordslikeﬁcommissionﬂ,ﬁcouncilﬂorﬁunionﬂcanbeusedeffectivelyforwhiletheydonotcaptureanygenerallinguisticphenomenaandareunlikelytoscaletoothercorpora.Thisisalsobyanaveragehumanperformanceof72.7%precisionwith82.1%recallonasimilartaskwherethetestsubjectswerenotfamiliarwiththedomainandwerenotabletousedomain-lexicalfeatures[13].Amoregeneralfeaturesetstillwithhighin-domainaccuracyisPOStagswithlexicalizationoffunctionwords[13,60].Webuildonthisfeaturesetandexplorebilingualfeatures.Weareawareofonlyonepriorworkthatpresentedacross-domainevaluation.KoppelandOrdan[59]usealogisticregressionwithfunctionwordunigramfrequenciestoachieve92.7%accuracywithtenfoldcrossvalidationontheEuroParl[56]corpusand86.3%ontheIHTcorpus.HowevertestingtheEuroParltrainedontheIHTcorpusyieldsanaccuracyof64.8%(andtheaccuracyis58.8%whentheistrainedonIHTandtestedonEuroParl).Theinthisstudyaretrainedandtestedontextblocksofapproximately1500tokens,andthereisnocomparativeevaluationofmodelsusingdifferentfeaturesets.WearealsoawareofonepriorworksthatinvestigateTranslationesedetectionaccuracyatthesentencelevel.Kurokawaetal[60]usetheHansardEnglish-Frenchcorpusfortheirexperiments.ForsentenceleveltranslationdirectiondetectiontheyreachF-scoreof77%usingwordn-gramsandstayslightlybelow70%F-scorewithPOSn-gramsusinganSVM.Thisisalso58theonlyworktoconsiderfeaturesofthetwoparallelchunks(oneoriginal,onetranslated).Theysimplyusedtheunionofthen-grammixed-POS3featuresofthetwosides;thesearemonolingualfeaturesoftheoriginalandtranslatedtextanddonotlookattranslationphenomenadirectly.Theirworkisalsotheonlyworktolookatsentenceleveldetectionaccuracyandreport15percentagepointsdropinaccuracywhengoingfromchunkleveltosentencelevelOurworkinthischapterisuniqueinseveralways:1)Weexploreseveralbilingualfeaturesnotexploredbefore,2)Weimproveonthebestresultsreportedforsentence-levelaccuracyand3)Wedevelopacross-domaindatasetandoutperformallpriormethodsinallcross-domainsettings.4.3BilingualFeaturesforTranslationDirectionWeareinterestedinlearningcommonlocalizedlinguisticphenomenathatoccurduringthetrans-lationprocesswhentranslatinginonedirectionbutnottheother.4.3.1POSTagMTUsMinimaltranslationunits(MTUs)forasentencepairareaspairsofsourceandtargetwordsetsthatsatisfythefollowingconditions[84,111].1.NoalignmentlinksbetweendistinctMTUs.2.MTUsarenotdecomposableintosmallerMTUswithoutviolatingthepreviousrule.WeusePOStagstocapturelinguisticstructuresandMTUstomaplinguisticstructuresofthetwolanguages.ToobtainPOSMTUsfromaparallelcorpus,theparallelcorpusiswordaligned.Next,thesourceandtargetsideofthecorpusaretaggedindependently.Finally,wordsarereplacedwiththeircorrespondingPOStaginword-alignedsentencepairs.MTUswereextractedfromthePOStaggedword-alignedsentencepairsfromlefttorightandlistedinsourceorder.3OnlyreplacingcontentwordswiththeirPOStagswhileleavingfunctionwordsasis.59Unigram,bi-gram,andhigherordern-gramfeatureswerebuiltoverthissequenceofPOSMTUs.Forexample,forthesentencepairinFigure4.4,thefollowingPOSMTUswillbeextracted:VBZ)D,PRP)(N,V),RB)ADV,JJ)N,.)PUNC.4.3.2DistortionInadditiontothemappingoflinguisticstructures,anotherinterestingphenomenonisthereorder-ingoflinguisticstructuresduringtranslation.Onehypothesisisthatwhentranslatingfromaed-ordertoafree-orderlanguage,theorderofthetargetwillbeverybythesource(almostmonotonetranslation),butwhentranslatingintoaedorderlanguage,morere-orderingisrequiredtoensuregrammaticalityofthetarget.TocapturethispatternweadddistortiontoPOSTagMTUfeatures.Weexperimentwithabsolutedistortion(wordpositiondifferencebetweensourceandtargetofalink)aswellasHMMdistortion(wordpositiondifferencebetweenthetar-getofalinkandthetargetofthepreviouslink).Webinthedistortionsintothreebins:ﬁ=0ﬂ,ﬁ>0ﬂandﬁ<0ﬂ,toreducesparsity.4.4SingleDomainExperimentsForthetranslationdirectiondetectiontaskexplainedinsection4.1,weuseafastlineartrainedwithonlinelearning,VowpalWabbit[62].Trainingdataandfeaturesareexplainedinsection4.4.1and4.4.2.4.4.1DataForthistaskwerequireaparallelcorpuswithsentencepairsavailableinbothdirections(sentencesauthoredinlanguageAandthentranslatedtolanguageBandviceversa).WhilethecustomizedversionofEuroParl[48]containssentencepairsformanylanguagepairs,noneofthelanguagepairshavesentencepairsavailableinbothdirections(e.g.,itdoescontainsentencesauthoredinEnglishandtranslatedintoFrenchbutnotviceversa).TheCanadianHansardcorpusontheother60Figure4.3Sentenceleveltranslationdirectiondetectionprecisionusingdifferentfeatureswithn-gramlengthsof1through5.handtherequirementasithas742,408sentencepairstranslatedfromFrenchtoEnglishand2,203,504sentencespairsthatweretranslatedfromEnglishtoFrench[60].WeusetheHansarddatafortrainingFortrainingtheHMMwordalignmentmodelusedtofeatures,weusealargersetoftenbillionwordsofparalleltextfromtheWMTEnglish-Frenchcorpus.4.4.2PreprocessingandFeatureExtractionWeusedalanguage4,deduplication5andlengthratiotocleanthedata.Afterwewereleftwith1,890,603English-Frenchsentencepairsand640,117French-Englishsentencepairs.TheStanfordPOStagger[96]wasusedtotagtheEnglishandtheFrenchsidesofthecorpus.TheHMMalignmentmodel[100]trainedonWMTdatawasusedtoword-aligntheHansardcorpuswhilereplacingwordswiththeircorrespondingPOStags.DuetodifferencesinwordbreakingbetweenthePOStaggertoolandourwordalignmenttoolthereweresomemis-matches.Forsimplicitywedroppedtheentiresentencepairwheneveratokenmismatchoccurred.Thisleftuswith401,569POStagalignedsentencepairsintheFrenchtoEnglishdirectionand4Acharactern-gramlanguagemodelisusedtodetectthelanguageofsourceandtargetsidetextandthemoutiftheydonotmatchtheirannotatedlanguage.5Duplicatesentencespairsareout.611,184,702pairsintheotherdirection.Wechosetocreateabalanceddatasetandreducedthenum-berofEnglish-Frenchsentencesto401,679with20,000sentencepairsheldoutfortestingineachdirection.POSMTU(E)F)FE#EF#Example1NNPS)(N,C)33612quebecers(NNPS))québécoises(N)et(C)desquébécois2IN)(CL,V)691027afewdaysago(IN))ily(CL)a(V)quelques3PRP)(N,V)18663he(PRP)is)ledéputé(N)à(V)4(NNP,POS))A15528quebec(NNP)'s(POS)history)histoirequébécoises(A)5(FW,FW))ADV7195pro(FW)bono(FW)work)bénévolement(ADV)travailler6(RB,MD))V2112moneyalone(RB)could(MD)solve)argentsufV)àrésoudreTable4.1POSMTUfeatureswithhighestweight.FE#indicatesthenumberoftimesthisfeatureappearedwhentranslatingfromFrenchtoEnglish.64.4.3ResultsTheresultsofourexperimentsonthetranslationdirectiondetectiontaskarelistedinTable4.5.Wewouldliketopointoutseveralresultsfromthetable.First,whenusingonlyunigramfeatures,thehighestaccuracyisachievedbytheﬁPOS-MTU+HMMDistortionﬂfeature,whichusesPOSminimaltranslationunitstogetherwithdistortion.ThehighestaccuracyoverallifobtainedbyaﬁPOS-MTUﬂtrigrammodel,showingtheadvantageofbilingualfeaturesoverpriorworkusingonlyaunionofmonolingualfeatures(reproducedbytheﬁEnglish-POS+French-POSﬂration).Whilehigherorderfeaturesgenerallyshowbetterin-domainaccuracy,theadvantageoflow-orderbilingualfeaturesmightbeevenhigherincross-domain4.4.4AnalysisAninterestingaspectofthisworkisthatitisabletoextractfeaturesthatcanbelinguisticallyinterpreted.Althoughlinguisticanalysisofthesefeaturesisoutsidethescopeofthiswork,welistPOSMTUfeatureswithhighestpositiveornegativeweightsinTable4.1.Althoughthetopfeature,6FordescriptionofEnglishPOStagssee[73]and[1]forFrench62LabelDescriptionENU.LEXEnglishwordn-gramsFRA.LEXFrenchwordn-gramsENU.POSEnglishPOSTagn-gramsFRA.POSFrenchPOSTagn-grmasENU.BCEnglishBrownClustern-gramsFRA.BCFrenchBrownClustern-gramsPOS.MTUPOSMTUn-gramsBC.MTUBrownClusterMTUn-gramsTable4.2featuresandtheirlabels.NNPS)(N,C)7,inthiscontextisoriginatingfromacommonphraseusedbyFrenchspeakingmembersoftheCanadianParliament,québécoisesetdesquébécois,itdoeshighlightanunderlyinglinguisticphenomenonthatisnottotheCanadianParliament.WhentranslatingapluralnounfromEnglishtoFrenchitislikelythatonlythemasculineformofthenounappears,whileifitwasauthoredinFrenchwithbothformsofthenouns,asinglepluralnounwouldappearinEnglishasEnglishdoesn'thavemasculineandfeminineformsoftheword.AmorecompleteformofthisfeaturewouldhavebeenNNPS)(N,C,N),butsincewordalignmentmodels,ingeneral,discourageone-to-manyalignments,theextractedMTUonlycoversthenounandconjunction.4.5Cross-DomainExperimentsThegoalthissectionistocomparenovelandpreviouslyintroducedfeaturesinacross-domainsetting.Duetothevolumeofexperimentsrequiredforcomparison,foraninitialstudy,weselectalimitednumberoffeaturesetsforcomparison.PriorworksclaimPOSn-gramfeaturescapturelinguisticphenomenaoftranslationandshouldgeneralizeacrossdomains[60].WechosesourceandtargetPOSn-gramfeaturesforn=1:::5totestthisclaim.Anotherfeaturewehavechosenisthebestperformingfeaturefromsingledomainexperiments:POSMTU8n-gramfeatures.POSMTUsincorporatesourceandtargetsideinformationinadditiontowordalignment.Priorworkhasalsoclaimedlexicalfeaturessuchaswordn-gramsdonotgeneralizeacrossdomainsdueto7NNPS:PluralNoun,N:Noun,C:Conjunction8MinimalTranslationUnits[84]63CorpusAuthoredLanguageTranslationLanguageTrainingSentencesTestSentencesEuroParlEnglishFrench62k6kEuroParlFrenchEnglish43k4kHansardEnglishFrench1,697k169kHansardFrenchEnglish567k56kHansard-CommitteesEnglishFrench2,930k292kHansard-CommitteesFrenchEnglish636k63kTable4.3Cross-DomainDataSetscorpusvocabulary[101].Wetestthishypothesisusingsourceandtargetwordn-gramfeatures.Usingn-gramsoflength1through5werun45(Ninedatamatrixentriestimesn-gramlengthsofve)experimentsforeachfeaturesetmentionedabove.Inadditiontothefeaturesmentionedabove,wemakeasmalltothefeatureusedtoobtainthebestsingledomainsentencelevelperformancetoderiveanewtypeoffeatures.WeintroduceBrowncluster[22]MTUsinstead.OuruseofBrownclustersisinspiredbyrecentsuccessontheiruseinstatisticalmachinetranslationsystems[15].Finally,wealsoincludesourceandtargetBrownclustern-gramsasacomparisonpointtobetterunderstandtheireffectivenesscomparedtoPOSn-gramsandtheircontributiontotheeffectivenessofBrownclusterMTUs.Giventhese8featuretypessummarizedinTable4.2,n-gramlengthsofupto5andthe33datamatrixexplainedinthenextsection,werun360experimentsforthiscross-domainstudy.4.5.1DataMatrixWechosetheEnglish-Frenchlanguagepairforourcross-domainexperimentsbasedonpriorworkandavailabilityoflabeleddata.Existingsentence-paralleldatasetsusedfortrainingmachinetrans-lationsystems,donotnormallycontaingold-standardtranslationdirectioninformation,andaddi-tionalprocessingisnecessarytocompileadatasetwithsuchinformation(labels).Kurokawaetal[60]extracttranslationdirectioninformationfromtheEnglish-FrenchHansardparalleldatasetusingspeakerlanguagetags.Weusethisdataset,andtreatthetwosectionsﬁmainparliamentaryproceedingsﬂandﬁcommitteehearingsﬂastwodifferentcorpora.Thesetwocorporahaveslightlydifferentdomains,althoughtheysharemanycommontopicsaswell.Weadditionallychoosea64Figure4.4POSTaggedandBrownClusterAlignedSentencePairsthirdcorpus,whosedomainismoredistinctfromthesetwo,fromtheEuroParlEnglish-Frenchcorpus.[48]providedacustomizedversionofEuroparlwithtranslationdirectionlabels,butthisdatasetonlycontainssentencesthatwereauthoredinEnglishandtranslatedtoFrench,anddoesnotcontainexamplesforwhichtheoriginallanguageofauthoringwasFrench.WethusprepareanewdatasetfromEuroParlandwillmakeitpubliclyavailableforuse.TheoriginalunprocessedversionofEuroParl[56]containsspeakerlanguagetags(originallanguageofauthoring)fortheFrenchandEnglishsidesoftheparallelcorpus.Weoutinconsistenciesinthecorpus.First,weoutsectionswherethelanguagetagismissingfromoneorbothsides.Wealsooutsectionswithlanguagetags.Parallelsectionswithdifferentnumberofsentencesarealsodiscardedtomaintainsentencealignment.Thisleavesuswiththreedatasets(twoHansardandoneEuroParl)withtranslationdirectioninformationavailable,andwhichcontainsentencesauthoredinbothlanguages.Weholdout10%ofeachdatasetfortestingandusetherestfortraining.Our33corpusdatamatrixconsistsofallninecombinationsoftrainingononecorpusandtestingonanother(Table4.3).4.5.2PreprocessingandFeatureExtractionSimilartopreprocessingstepsforsingledomainexperiments,wecleanalldatasetsusingthefollowingsimpletechniques.Sentenceswithlowalphanumericdensityarediscarded.Acharactern-grambasedlanguagedetectiontoolisusedtoidentifythelanguageofeach65sentence.Wediscardsentenceswithadetectedlanguageotherthantheirlabel.Wediscardsentenceswithinvalidunicodecharactersorcontrolcharacters.Sentenceslongerthan2000charactersareexcluded.Next,anHMMwordalignmentmodel[100]trainedontheWMTEnglish-Frenchcorpus[20]word-alignssentencepairs.Wediscardsentencepairswherethewordalignmentfails.WeusetheStanfordPOStagger[96]forEnglishandFrenchtotagallsentencepairs.AcopyofthealignmentwithwordsreplacedwiththeirPOStagsisalsogenerated.FrenchandEnglishBrownclustersaretrainedseparatelyontheFrenchandEnglishsidesoftheWMTEnglish-Frenchcorpus[20].TheproducedmodelsassignclusterIDstowordsineachsentencepair.WecreateacopyofthealignmentwithclusterIDsinsteadofwordsaswell.Theofourchoice(Section4.5.3)extractsn-gramfeatureswithnasanop-tion.Inpreparationfortrainingandtesting,featureextractiononlyneedstoproducetheunigramfeatureswhilepreservingtheorder(n-gramsofhigherlengthareautomaticallyextractedbythePOS,word,andBrownclustern-gramfeaturesaregeneratedbyusingtherespectiverepresen-tationforsequencesoftokensinthesentences.ForPOSandBrownclusterMTUfeatures,thesequenceofMTUsisastheleft-to-rightinsourceordersequence(duetoreordering,theexactenumerationorderofMTUsmatters).Forexample,forthesentencepairinFigure4.4,thesequenceofBrownclusterMTUsis:73)(390,68),208)24,7689)3111,7321)1890,2)16.4.5.3OnlineWechosetheVowpalWabbit[62](VW)onlinelinearsinceitisfast,scalableandithasspecial(bagofwordsandn-gramgeneration)optionsfortextWefoundthatVWwascomparableinaccuracytoabatchlogisticregression.Fortrainingandtestingthe,wecreatedbalanceddatasetswiththesamenumberoftrainingexamplesinbothdirections.ThiswasachievedbyrandomlyremovingsentencepairsfromtheEnglishtoFrench66Figure4.5ComparingareaundertheROCcurveforthetranslationdirectiondetectiontaskwhentrainingandtestingondifferentcorporausingeachoftheeightfeaturesets.SeeTable4.2forexperimentlabeldescription.directionuntilitmatchestheFrenchtoEnglishdirection.Forexample,636ksentencepairsarerandomlychosenfromthe2,930ksentencepairsinEnglishtoFrenchHansard-CommitteescorpustomatchthenumberofexamplesintheFrenchtoEnglishdirection.4.5.4EvaluationMethodWeareinterestedincomparingtheperformanceofvariousfeaturesetsintranslationdirectiondetection.Performanceevaluationofdifferentfeaturesobjectivelyischallengingintheabsenceofadownstreamtask.,dependingonthepreferredbalancebetweenprecisionandrecall,differentfeaturescanbesuperior.IdeallyanROCgraph[35]visualizesthetradeoffbetweenprecisionandrecallandcanserveasanobjectivecomparisonbetweendifferentfeaturesets.However,itisnotpracticaltopresentROCgraphsfor360experiments.Hence,weresorttotheAreaUndertheROCgraph(AUC)measureasagoodmeasuretoprovideanobjectivecomparison.Theoretically,theareaunderthecurvecanbeinterpretedastheprobability67Figure4.6Translationdetectionperformancematrixfortrainingandtestingonthreedifferentcorpora.Eachrowcorrespondstoresultsoftrainingonacorpuswhiletestresultsarelaidoutinacolumnofthe33matrix.UnlikeFigure4.5wherewereportAUCvaluesforalln-grams,inthisdiagramweonlyreportthehighestAUCvalueforeachfeature.Thenumbernexttoeachdatapointindicatesthen-gramlengththatachievesthehighestperformanceforthatfeature.SeeTable4.2forexperimentlabeldescription.thatthescoresarandomnegativeexamplehigherthanarandompositiveexample[35].Asapointofreference,wealsoprovideF-scoresforexperimentalsettingsthatarecomparabletothepriorworkreviewedinSection??.4.5.5ResultsFigure4.5presentsAUCpointsforallexperiments.Rowsandcolumnsarelabeledwithcorpusnamesfortrainingandtestdatasetsrespectively.Forexample,thegraphonthethirdrowandcolumncorrespondstotrainingontheHansard-CommitteescorpusandtestingonEuroParl.WithineachgraphwecomparetheAUCperformanceofdifferentfeatureswithn-gramlengthsof681through5.Graphsonthediagonaldemonstratehigherperformancecomparedtooffdiagonalgraphs.Thisthebasicassumptionthatcross-domaintranslationdirectiondetectionisamorediftaskasthegraphsonthediagonalcorrespondtoin-domaindetection.TheoverallperformanceisalsohigherwhentrainedontheHansardcorpusandtestedonHansard-Commiteeandviceversa.ThisisbecuasetheHansardcorpusismoresimilartotheHansard-CommitteescorpuscomparedtotheEuroParlcorpus.Itisalsoobservablethatthevariationinperformanceofdifferentfeaturesdiminishesasthetrainingandtestcorporabecomemoredissimilar.Forinstance,thisphenomenoncanbeobservedonthesecondrowofgraphswherethefeaturesaremostspreadoutwhentestedontheHansardcorpus.TheyarelessspreadoutwhentestedontheHansard-Committeescorpus,andcompressedtogetherwhentestedontheEuroParlcorpus.Thesamephenomenoncanbeobservedfortrainedonothercorpora.Fordifferentfeaturetypes,differentn-gramorderofthefeaturesisbest,dependingonthefea-turegranularity.Tomakeiteasiertoobservepatternsintheperformanceofdifferentfeaturetypes,Figure4.7showstheperformanceforeachfeaturetypeandeachtrain-testcorpuscombinationasasinglepoint,byusingthebestn-gramorderforthatfeature/datacombination.Eachofthe9train/testdatacombinationsisshownasacurveoverfeaturetypes.WecanseethatMTUfeatures(whichlookatbothlanguagesatthesametime)outperformindividualsourceortargetfeatures(POSorBrowncluster)foralldatasets.Brownclustersareunsupervisedandcanprovidedifferentlevelsofgranularity.Ontheotherhand,POStagsusuallyprovideaedgranularityandrequirelexiconsorlabeleddatatotrain.WeseethatBrownclustersoutperformcorrespondingPOStagsacrossdatasettings.Asanexample,whentrainingandtestingontheHansardcorpusFRA.BCoutperformsFRA.POSbycloseto20AUCpoints.LexicalfeaturesoutperformmonolingualPOSandBrownclusterfeaturesinmostsettingsalthoughtheiradvantagesdiminishasthetrainingandtestcorpusbecomemoredissimilar.ThisissomewhatcontrarytopriorclaimsthatlexicalfeatureswillnotgeneralizewellacrossdomainsŒweseethatlexicalfeaturesdocaptureimportantgeneralizationsacrossdomainsandmodelsthat69Figure4.7Translationdetectionperformancematrixfortrainingandtestingonthreedifferentcorpora-Weranexperimentsforn-gramsofuptolengthveforeachfeature(SeeTable4.2forfeaturelabeldescriptions).UnlikeFigure4.5wherewereportAUCvaluesforalln-gramlengths,inthisgraphweonlypresentthehighestAUCnumberforeachfeature.Eachmarkertypeindicatesatrainingandtestsetcombination.Theformatofexperimentlabelsinthelegendis[TrainingSet]-[TestSet]andEU:EuroParl,H:Hansard,HC:HansardCommittees.Forexample,EU-HCmeanstrainingonEuroParlcorpusandtestingonHansardCommitteescorpus.70useonlyPOStagfeatureshavelowerperformance,bothinandout-of-domain.Figure4.8showstherankofeachfeatureamongstall8differentfeaturesforeachentryinthecross-corpusdatamatrix(SimilartoFigure4.7thehighestperformingn-gramlengthhasbeenchosenforeachfeature).BrownclusterMTUsoutperformallotherfeatureswithrankoneinalldatasetcombinations.SourceandtargetPOStagfeaturesarethelowestperformingfeaturesin8outof9datasetcombinations.ThePOS.MTUhasitslowestranks(7and8)whenitistrainedontheEuroParlcorpusanditshighestranks(2and3)whentrainedontheHansard-Committeescorpus.HighnumberoffeaturesinPOS.MTUrequiresalargedatasetfortraining.ThevariationinperformanceforPOS.MTUcanbeexplainedbythedifferenceintrainingdatasizebetweenEuroParlandHansard-Committees.Finally,whileFRA.LEXandENU.LEXaremostlyinrank2and3(afterBC.MTU)theyhavetheirlowestranks(6and4)incorss-corpussettings(HC-EUandHC-H).Finally,wereportprecisionandrecallnumberstoenablecomparisonbetweenourexperimentsandpreviousworkreportedinSection4.2.WhentrainingandtestingontheHansardcorpus,BC.MTUachieves0.80precisionwith0.85recall.Incomparison,ENU.POSachieves0.65preci-sionwith0.64recallandPOS.MTUachieves0.73precisionand0.74recall.Thesearethehighestperformanceofeachfeaturewithn-gramsofuptolength5.4.6ConclusionandFutureWorkFromamongeightstudiedsetsoffeatures,BrownclusterMTUswerethemosteffectiveatidenti-fyingtranslationdirectionatthesentencelevel.Theyweresuperiorinbothin-domainandcross-domainsettings.AlthoughEnglish-LexicalfeaturesdidnotperformaswellasBrownclusterMTUs,theyperformedbetterthanmostothermethods.Infuturework,weplantoinvestigatelexicalMTUsandtoconsiderfeaturesetscontaininganysubsetoftheeightormorebasicfeaturetypeswehaveconsideredhere.Withtheseexperimentswehopetogainfurtherinsightintotheperformanceoffeaturesetsininoutout-of-domainsettingsandtoimprovethestate-of-the-artinrealistictranslationdirectiondetectiontasks.Additionally,weplantousethistoextend71Figure4.8TranslationdirectiondetectionAUCperformancerankforeachtrainingandtestsetcombination.ForcorpuscombinationabbreviationsseedescriptionofFigure4.7.ForfeaturelabeldescriptionsseeTable4.2.theworkofTwitto-Shmuel([97])bybuildingamoreaccurateandlargerparallelcorpuslabeledfortranslationdirectiontofurtherimproveSMTquality.72CHAPTER5EFFECTIVEANDEFFICIENTDATASELECTIONLackofaneffectiveandefdataselectionalgorithmisapparentfromtheliteraturereviewinChapter3.MosteffectivedataselectionmethodsinvolveretrainingafullsetofSMTmodelsateverystepofthedataselectionprocess(e.g.,[43,7]).Thesemethodsarebarelypracticalforsmalldatasetswith100Ksentencesintrainingdata,letalonemediumorlargedatasetswithoverfewmillionsentences.Thecontributionofmoreefmethodsisintwofolds:1)Introducingasurrogatefunction(e.g.,maximumnumberofunseentokentypescoveredwithminimumnum-berofsentences([33]))toeliminatetheneedforretrainingafullSMTsystemand2)providingamechanismtooptimizethesurrogateobjectivefunction(e.g.,GreedySearch,alsoin[33]).Inthischapterwechallengetheofachievingtheoptimumvalueforthesimplesurro-gatefunctiondescribedabovebyprovidingasimpleandefalgorithmthatachievessimilarresultsusingthesamesurrogatefunctionandalinearsearchalgorithmthatdoesnotprovideanytheoreticalguaranteesforachievingtheoptimumsolution.Werunexperimentsatseveraldatasetsizesandcompareitwithrandomselection.Inaddition,weexperimentwithdifferentorderingsofthedataandprovideexperimentalresults.5.1VocabularySaturationFilterThemostcommonfeaturefunctionthathasbeenusedwithsomevariations(seeChapter3)fordataselectioninstatisticalmachinetranslationisthenumberofnewn-gramsinasentencepair.Theobjectiveofthepreviouslyintroducedsearchalgorithms(e.g.,[33])istomaximizethenumberofuniquen-gramsinasubsetofthedatawithaconstraintonthetotalnumberoftokens.AsexplainedinChapter3thisisdoneusingagreedyalgorithmwiththecomplexityofO(jSSelectionPoolj2).Incontrastweproposealinearalgorithm,O(jSSelectionPoolj),thatachievescomparableresults.Aversionofthisalgorithm(VocabularySaturationFilterorVSF)ismakingasingle73passthroughthedataandselectingsentencepairssuchthatforalln-gramsthereisatleastoneselectedsentencethatcontainsthatn-gram.Thealgorithmwouldgothroughthesentencepairsinanarbitraryorderandonlyselectasentencepair,ifitcontainsanewn-gram.Toincreasethexibilityofthealgorithmweintroduceaﬁminimumoccurrenceﬂthreshold,t,whichisequaltooneinthealgorithmdescribedabove.Thealgorithmwillselectasentenceifandonlyifitcontainsatleastonen-gramthathasoccurredlessthanttimesintheselectedpool.TheimplementationofVSFisdescribedinAlgorithm1below.Input:SelectionPool,t,`Output:SelectedPoolforeachsp2SelectionPooldoNGsrc EnumNgrams(sp:src,`);NGtgt EnumNgrams(sp:tgt,`);selected false;foreachng2NGsrcdoifSrcCnt[ng]