VERB SEMANTICS AS DENOTING CHANGE OF STATE IN THE PHYSICAL WORLD

By
Malcolm Doering

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
Computer Science - Master of Science

2015

ABSTRACT

In the not too distant future we anticipate the existence of robots that will help around the house, in particular in the kitchen. It is therefore critical that robots can understand the language commonly used within this domain. In this work we explore the semantics of verbs that frequently occur when describing cooking activities. Motivated by linguistic theory on the lexical semantics of concrete action verbs and by data collected via crowdsourcing, an ontology of the changes of state of the physical world as denoted by concrete action verbs is presented. Furthermore, additional datasets are collected for the purpose of validating the ontology, exploring the effect of context on verbal change of state semantics, and testing the automatic identification of changes of state denoted by verbs. In conclusion, several areas of further investigation are suggested.

ACKNOWLEDGEMENTS

Writing a master's thesis has been a much more difficult undertaking than originally anticipated. I would not have been successful without the help of several people. Foremost, I would like to thank my advisor, Dr. Joyce Y. Chai, for her guidance through the arduous process of formulating a thesis topic and along every step of the way to its completion. I would also like to thank Dr. Rui Fang for taking me under his wing during my time as a master's student, Shaohua Yang for his input regarding visual features and classification results, my thesis committee members Dr. Cristina Schmitt and Dr. Xiaoming Liu for their insightful questions and feedback, Zach Richardson for his annotation efforts, and all the remaining members of the Language and Interaction Research Laboratory who made my time at MSU enjoyable. Lastly, I would like to thank my family and friends for their continued moral support throughout.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction and Motivation
Chapter 2 Related Work
Chapter 3 Change of State in Verb Semantics
Chapter 4 A Pilot Study and Ontology of Change of State
  4.1 Crowdsource Setup
  4.2 Ontology Design and Annotation
  4.3 Analysis
    4.3.1 Types of CoS Descriptions
    4.3.2 Multiple CoS Labels per Verb
    4.3.3 The Role of Visual Context
    4.3.4 Effect of Direct Object on CoS
    4.3.5 Verb Semantic Similarity based on CoS
Chapter 5 Automated Identification of CoS
  5.1 The Cucumber Dataset
  5.2 Multilabel Classification
  5.3 Results
    5.3.1 Linguistic and Visual Features
    5.3.2 Comparison of Features on Cucumber Dataset
    5.3.3 Predicting the Attribute Described by NL Descriptions of CoS
Chapter 6 Complexity of Verb Semantics based on CoS
  6.1 Multilevel Dataset and Crowdsource Study
    6.1.1 Bread and Cucumber Multilevel Dataset
    6.1.2 24 Activities Multilevel Dataset
  6.2 Results of Human Studies
    6.2.1 Coverage of CoS Ontology
    6.2.2 Effect of Visual Context
    6.2.3 Level of Detail's Effect on CoS
    6.2.4 Level of Detail's Effect on Verb Frequency and Distribution
Chapter 7 Discussion and Conclusion
BIBLIOGRAPHY

LIST OF TABLES

Table 4.1: Verbs and objects used for data collection (pilot)
Table 4.2: Attributes and result values for change of state
Table 4.3: Variability between CoS frequencies per object and scene conditions (pilot dataset)
Table 4.4: Label cardinality per verb (pilot dataset)
Table 4.5: Jensen-Shannon divergence between CoS distributions for verb pairs (pilot dataset)
Table 5.1: Verb counts and label cardinality of top ten most frequent verbs (cucumber dataset)
Table 5.2: CoS attribute classification accuracy (cucumber dataset)
Table 5.3: CoS attribute classification accuracy using various feature sets (cucumber dataset)
Table 5.4: Per verb CoS attribute classification accuracy using various feature sets (cucumber dataset)
Table 5.5: CoS attribute classification accuracy for open ended change of state descriptions (pilot dataset)
Table 6.1: Coverage of the CoS frame options (bread and cucumber multilevel dataset)
Table 6.2: Coverage of the CoS frame options (24 activities multilevel dataset)
Table 6.3: Comparison between +/- scene conditions (bread and cucumber multilevel dataset)
Table 6.4: The entropy of the distributions of each verb over three levels (24 activities multilevel dataset)

LIST OF FIGURES

Figure 3.1: Event schema for verbs that denote externally caused state changes [14]
Figure 4.1: Example of CoS frame applied to a description of change of state
Figure 4.2: Percentage of samples describing 0 to 3 changes of state (pilot and cucumber datasets)
Figure 4.3: CoS distributions over attributes for clean and rinse (pilot dataset)
Figure 6.1: CoS distributions over attributes for verbs at three levels of detail (24 activities multilevel dataset)
Figure 6.2: Frequencies of occurrence of the top twenty verbs (24 activities multilevel dataset)

Chapter 1 Introduction and Motivation

In the future robots will work closely with humans in the home to aid in various domestic tasks. Foremost are tasks in the kitchen. For a robot to help out and learn new abilities in the kitchen domain it must be able to understand a human's instructions. This work focuses in particular on how a robot may represent concrete verbs in the kitchen domain.

Concrete action verbs are verbs that denote concrete activities performed by an agent in the world. These actions are visually perceivable events that can potentially be understood by computer vision algorithms. Furthermore, they can be categorized into two classes based on their semantics: Result Verbs and Manner Verbs [14, 13]. Manner verbs typically denote the Form or the Manner in which the action denoted by the verb is performed, whereas Result verbs (the focus of this work) denote the Change of State (CoS) that the object of the verb undergoes as a result of the action denoted by the verb. In order to ground these verbs to the environment, the robot must have a rich representation of the changes of state associated with the verb. Existing verb resources such as VerbNet [20] do not contain this rich information. Although VerbNet's semantic representation for various verbs may indicate that a change of state is involved, it does not always provide the specifics associated with the verb's meaning. For example, the change will occur to some attribute of the verb's direct object such as color, number of pieces, speed, etc.

This work presents an ontology of different types of CoS to fill out the VerbNet representation. Specifically, this work is focused on verbs in the cooking domain, with the forethought that this method can be generalized to other domains. We carried out a series of data collection experiments via crowdsourcing designed to elicit natural language descriptions of changes of state as denoted by various verbs. These descriptions provided insight into how mid-level visual features such as state can be realized linguistically at the surface level, and they guided the design of the CoS ontology, which categorizes different types of change of state.

This data allows us to ask some questions: How does the object of a verb affect the meaning of the verb? That is, does the verb denote different changes of state depending on the type of direct object? Furthermore, how does the presence or absence of a scene in combination with the verbal description affect the CoS which the viewer attributes to the verb? Do ambiguities in the CoS for a single sense of a verb indicate sub-categorizations of verb senses? Answers to these questions are important for a robot whose understanding of a verb is based on its context of use. By determining on-line the CoS indicated by a verb, the robot can focus its sensing resources on the indicated CoS (active sensing).

In the end, I hope that this will be a useful resource for Situated NLP researchers.

Chapter 2 Related Work

The related work can be divided into sections including theoretical linguistics work on the lexical semantics of verbs and computer vision work on recognizing visual attributes of objects and understanding events in videos.

Previous work in theoretical linguistics has defined the types of concrete action verbs we are interested in, mainly Result verbs, which indicate their object's change of state [14, 13]. Kennedy and McNally provide a more detailed analysis of gradable predicates in terms of scale structure, which is applicable to Result verbs [6]. Lastly, [20] presents a digital dictionary of verbs (VerbNet) together with their categorizations from [7]. Our work supplements the semantic specifications of some of the verbs contained in VerbNet by categorizing the types of changes of state in the world which the verbs may denote. More information on verb semantics is contained in Chapter 3.

In traditional verb semantics the meaning of a verb may be ambiguous. There is a static number of senses (denoting subtleties of meaning) for a verb, one of which must be selected based on the verb's context of use. Alternatively, the Generative Lexicon (GL) argues against the traditional fixed number of senses for a word, as they may not apply to novel uses [12].
GL proposes that the meanings of verbs are dependent not only on the lexical specifications of the verb itself, but also on interactions with the complex semantic representations of its arguments. That is, the properties of nouns affect the meaning of the verb. Thus, "the semantic load is spread evenly across the lexicon". The current work provides examples of how a verb's meaning may depend on its patient argument, i.e. examples of sub-categorizations of verb senses. Also, we show how visual context affects the human's interpretation of the verb sense in terms of change of state.

Gillette et al. carried out studies of vocabulary learning using the Human Simulation Paradigm [4]. College students were tasked with identifying verbs when presented with different information including the nouns that appeared in the sentence with the verb, the syntactic information of the sentence containing the verb, and extra-linguistic information (i.e., a video which the verb describes). The verb itself was not presented. Experimenters found that verbs with a high degree of 'imageability' or 'concreteness' were more easily identified from the extra-linguistic context, whereas syntax was a more useful cue for abstract verbs. This suggests that the visual context may be important to identifying the changes of state in the world which these concrete verbs denote. In our work we focus primarily on verbs from the cooking domain with a high degree of imageability (e.g., cut, rinse, etc.).

Visual attributes are high level visual features of (objects in) scenes which have corresponding natural language descriptions. For example, 'green' refers to a specific range of RGB values. Visual attributes are semantically meaningful, discriminative, and generalizable across different object types and can be used for object recognition [11, 22, 27]. Attributes roughly correspond to adjectives (describing states of objects such as colors and properties), but can also describe other properties not named by adjectives (e.g. 'has wing' for birds); therefore, the semantics of a CoS verb may be grounded in changes of an object's visual attributes.

Chao et al. jointly model action and object categories made up of the synsets from WordNet [2, 10]. Specifically, they modeled the affordances of the object categories, which indicate the functions of objects, in order to improve action recognition. Whereas Chao et al. model the interactions between actions and objects, in this work we are interested in the interactions between verbs and direct objects (often nouns) and how the object affects the meaning of the verb. A verb/noun is different from an action/object because different senses of the verb/noun may appear in separate action/object categories.

Siskind et al. demonstrate how an action in a video can be recognized automatically by using principles and constraints of human perception [24, 23]. Actions are represented via temporal predicate logic. Furthermore, they demonstrate that actions may be defined in terms of relations (e.g., support, contact, and attachment) between the entities rather than low level kinematic representations (e.g., joint angles and velocities). Beyond relations, our work explores changes of state, another high level feature integral to action representation/verb meaning.

Various methods for generating natural language descriptions of images and videos are presented by [9, 19, 28].

Chapter 3 Change of State in Verb Semantics

Lexical semantics is important to designing methods for robots to learn verbs because it indicates what must be learned as part of the verb representation. Verbs can be divided into two broad categories: stative verbs that denote states (such as know, depend, loathe) and action verbs which denote actions (such as run, throw, cook). In this work we are primarily interested in the latter.

A concrete action verb is one that, in combination with its arguments and modifiers, denotes an action in the world (as opposed to denoting a state or an abstract action not visible in the world). Rappaport Hovav and Levin [14, 13] further divide the types of action verbs into Manner verbs, which "specify as part of their meaning a manner of carrying out an action", and Result verbs, which "specify the coming about of a result state". For example,

Manner verbs: nibble, rub, scribble, sweep, laugh, run, swim...
Result verbs: clean, cover, empty, freeze, kill, melt, open, arrive, die, enter, faint...

In this work we focus specifically on result verbs, i.e. verbs of Change of State (CoS). A set of "canonical realization rules" specify how a particular change of state is incorporated into a verb's semantics. Semantics are determined based on the combination of a "root", which is particular to the verb (e.g., a result-state), and an "event schema" template as shown in Figure 3.1.

Figure 3.1: Event schema for verbs that denote externally caused state changes [14]

Previous work has further classified result verbs into three categories: Change of State verbs, which denote a change of state to a property of the verb's object (e.g. 'to warm'), Inherently Directed Motion verbs, which denote movement along a path in relation to a landmark object (e.g. 'to arrive'), and Incremental Theme Verbs, which denote the incremental change of volume or area of the object (e.g. 'to eat') [8]. In our work we propose a specific set of result-states that may be used to define the semantics of most concrete action verbs in the kitchen domain. Note that we use the term Change of State in a more general way throughout this thesis, such that the location and volume or area of an object are part of its state.

Rappaport Hovav and Levin also claim that some verbs that denote change of state events lexically specify a scale [14, 13]. That is, they are verbs of scalar change. A scale is "a set of points on a particular dimension (e.g. height, temperature, cost)". In the case of verbs, the dimension is an attribute of the object of the verb. For example, "John cooled the coffee" means that the temperature attribute of the object coffee has decreased. Kennedy and McNally give a very detailed description of scale structure and its variations [6]. Our work simplifies scale structure by representing change of state as a frame consisting of an object, one of its attributes, and the resulting value of the attribute.

Verbs of scalar change can be further split into two categories: verbs with two points on the scale and verbs with multiple points on the scale [14]. Verbs on a two point scale are verbs of achievement, where the change of state is "conceptualized as instantaneous" (pg.
30). Examples of these verbs are 'crack' and 'arrive'. Verbs on a multiple point scale are associated with changes in attributes that can have multiple values. Examples of these verbs are 'advance', 'descend', 'fall', 'recede', 'rise', 'warm', 'cool', etc. These are called verbs of gradual change or degree achievement. In this work these subtle differences are not taken into account, but they may be worth including in future work.

Manner verbs are verbs of non-scalar change. They are different from verbs of scalar change because they are more complex: they involve change in more than a single attribute or do not specify change in a single direction on a scale [14]. They may involve a combination of multiple changes. Some examples are 'walk' and 'jog', which specify a sequence of changes on several attributes. Additionally, verbs of non-scalar change do not always have to be specific about what changes are involved. For example, 'exercise' may denote any of several varieties of physical (and sometimes mental) activity (pg. 33). We will not focus on manner verbs in this work.

Beavers and Koontz-Garboden introduce a set of diagnostics for judging whether a verb is a manner verb or result verb [1]. They also argue against Rappaport Hovav and Levin's claim that a verb cannot lexicalize both manner and result [14, 13]. Indeed, from our own observations it would seem that a verb can have both change of state and manner components (e.g. 'chop').

Levin categorizes verbs based on their syntax, i.e. based on which alternations (argument structure) a verb can take [7]. This categorization scheme resulted in classes of verbs with semantic similarities. VerbNet is a digital resource based on Levin's verb classes [20]. For each verb, it provides the possible argument structures and a logical representation of their semantics. This has been a useful resource for NLP researchers in the past. But, in the case of result and manner verbs that denote physically observable events, it does not provide enough detail to enable a robot to ground the verb in its perceptions. For example, the VerbNet semantic representation of 'cut' specifies only that there is some 'degradation of the material integrity' of the object as a result of the action.

Chapter 4 A Pilot Study and Ontology of Change of State

4.1 Crowdsource Setup

In order to build an ontology of the types of changes of state that verbs in the cooking domain may denote, we conducted a pilot study. Verb-object pairs were presented to turkers via Amazon Mechanical Turk (AMT) and turkers were asked to describe the changes of state that occur to the object as a result of the verb. Then the turkers' open-ended descriptions were analyzed and categorized.

Verbs and objects from the TACoS corpus were chosen for this crowdsourcing study. The TACoS corpus [17] is a collection of natural language descriptions of the actions that occur in a set of cooking videos. It contains 18,227 sentences collected via AMT that describe various cooking events (preparing a cucumber, scrambling eggs, etc.). This is an ideal corpus to explore the types of changes of state and manners that verbs denote since it contains mainly descriptions of concrete actions. Moreover, possibly because most actions in the cooking domain are goal-directed, a majority of the verbs in the descriptions denote results of action (changes of state).

The ten verbs (shown in Table 4.1) were chosen based on the criteria that they take an agent argument and that they occur relatively frequently in the corpus and with a variety of different direct objects. Furthermore, they must be concrete, meaning that they denote some observable event in the world. Verbs of this type are the most relevant for a kitchen robot. Lastly, five of the verbs were chosen because they only denote a change of state, and the other five were chosen because they denote some manner of action (possibly in addition to change of state).

| Verb | Object 1 | Object 2 | Object 3 |
|---|---|---|---|
| clean | cutting board | dishes | counter |
| rinse | cutting board | dishes | ginger |
| wipe | counter | knife | hands |
| cut | cucumber | beans | leek |
| chop | cucumber | beans | leek |
| mix | eggs | leeks, salt, and pepper | ingredients |
| stir | eggs | leeks, salt, and pepper | ingredients |
| add | water | eggs | leeks |
| open | bread packaging | drawer | pomegranate |
| shake | spices | bowl | broccoli |

Table 4.1: Verbs and objects used for data collection (pilot)

To examine how CoS depends on the context, we paired verbs with different objects (3 objects per verb, shown in Table 4.1) and presented the verbs to turkers with and without a video of the action described by the verb (+/- scene). Objects were chosen based on the criteria that they are dissimilar to each other, since we hypothesize that the change of state indicated by the verb will differ depending on the object's features. For example, broccoli and bowl were chosen as objects for the verb shake because one is a vegetable and the other a kitchen utensil, having very different features. Thus, there are 10 × 3 × 2 = 60 conditions (10 verbs, 3 objects per verb, with and without the scene). For each condition we collected 30 turker responses.

In addition to turkers' responses about what changes of state the verbs indicated, we also collected responses about the manner of the action.

4.2 Ontology Design and Annotation

Based on the types of descriptions the turkers provided we developed an ontology of change of state, which is shown in Table 4.2. This section will explain this ontology in detail.

18 attributes of state changes were identified from the change of state descriptions provided by the turkers. Because we are interested in lower level state changes, which can easily be sensed with a camera and computer vision algorithm, we did not include higher level attributes in the categorization such as Cleanliness, which may be identified in terms of lower level attributes but are more difficult to sense automatically. Note, however, that some of these attributes are at different levels in terms of visual perception, e.g. Shape vs. Wetness. Wetness must be visually identified in terms of some lower level features such as color, whereas shape is directly perceivable. Some examples of descriptions containing these attributes are shown below. It is also worth noting that some descriptions actually describe multiple changes of state, as shown in Figure 4.1.

Given that both adjectives and CoS verbs have their semantics defined in terms of a scale structure (for gradable verbs and adjectives), some of the above attributes are motivated by the semantic types of adjectives from Dixon and Aikhenvald's categorization [3]. These adjective categories include Dimension, Color, Physical Property, Quantification, and Position.

Although the instructions given to the turkers specifically requested a description of the change of state undergone by the object, several responses contained descriptions of changes of other objects. Often, a part of the direct object was described, rather than the whole object. And sometimes some completely different object, one that was still associated with the action, was described. Thus, CoS descriptions can be categorized as describing a change to the Direct Object, Part Of Object, or Associated Object. Some examples are shown below.

Direct Object: Cut-cucumber: "The size of the cucumber changes"
Part Of Object: Wipe-knife: "The knife gets cleaner. More metal is showing"
Associated Object: Clean-dishes: "Debris and residue fall away from the dishes"

Figure 4.1: Example of CoS frame applied to a description of change of state

In addition to the attribute of change and the object undergoing change, the turkers' descriptions often contained a third important aspect of a change of state: the result value, i.e. the value of the attribute after it changes. These values can be categorized in several different ways depending on the attribute, but generally there are two polar values. For example, the SizeLengthVolumeThickness, Wetness, NumberOfPieces, etc. attributes may Increase or Decrease in value. On the other hand, not all result values can be categorized in this way. For example, the Shape attribute is usually described simply as having changed in some vague way, or to have undergone a specific change. Thus, Specific and Change are two more general result values.

These three aspects of a change of state, the Attribute, Object, and result Value, make up the CoS frame, which can be used to label a verb-object pair, as shown in Figure 4.1 for a sentence which describes three changes of state. Thus, the CoS ontology presented here consists of a CoS frame and the options used to fill the frame slots.

4.3 Analysis

4.3.1 Types of CoS Descriptions

The data shows that generally turkers described changes of state in one of three ways.
(1) They describe the attribute directly, e.g. cut-cucumber: "The size of the cucumber changes". (2) They describe the change of state with a resultative phrase, e.g. cut-cucumber: "The cucumber is cut into small pieces." Here, the CoS is indicated by the semantics of "small pieces". (3) The CoS can be described with another verb that denotes the same or similar CoS as the verb presented to the turker, e.g. stir-ingredients: "The ingredients are mixed together".

| Type of CoS | Attribute | Attribute Result Value |
|---|---|---|
| Dimension | Size, length, volume, thickness | Changes, increases, decreases, specific |
| Dimension | Shape | Changes, specific (cylindrical, etc.) |
| Color/Texture | Color | Appear, disappear, changes, mix, separate, specific (becomes green, red, etc.) |
| Color/Texture | Texture | Changes, specific (slippery, frothy, bubbly, soft, etc.) |
| Physical Property | Weight | Increase, decrease |
| Physical Property | Flavor, smell | Changes, intensifies, specific |
| Physical Property | Solidity | Specific (paste, soggy, etc.) |
| Physical Property | Wetness | Becomes wet(ter), dry(er) |
| Physical Property | Visibility | Appears, disappears |
| Physical Property | Temperature | Increases, decreases |
| Physical Property | Containment | Becomes emptied, hollow |
| Physical Property | Surface Integrity | A hole or opening appears |
| Quantification | Number of pieces | Increases, one becomes many, decreases, many becomes one |
| Position | Location | Changes, enter/exit container, specific |
| Position | Occlusion | Becomes covered, uncovered |
| Position | Attachment | Becomes detached |
| Position | Presence | No longer present, becomes present |

Table 4.2: Attributes and result values for change of state

4.3.2 Multiple CoS Labels per Verb

A turker's change of state description does not necessarily contain only a single change of state. In fact, all the descriptions described between 0 and 3 changes of state, as seen in Figure 4.2. The largest share of descriptions (43%) contained only a single change of state. Also, a large percentage (36%) contained no change of state. In actuality, some of the descriptions that were annotated as containing no change of state described changes of state with high level attributes which do not fit into our categories (e.g. Cleanliness). Others contained verbal descriptions of CoS (e.g. Stir-ingredients: "The ingredients are mixed together."). We did not annotate descriptions which contained these circular definitions.

For each verb, we calculated the distribution of CoS annotations over each attribute.
Figure 4.3 shows the CoS attribute distributions for two verbs, clean and rinse. The CoS distributions are closely related to the semantics of the verbs they label. For example, CoS labels with the attributes Wetness and PresenceOfObject (referring to dirt that is removed) are more frequent for the verbs clean and rinse than CoS labels with other attributes. This is because the semantics of these verbs indicate that some object is cleaned away, possibly with water. Notice that clean, the result verb, has a much lower frequency of the Wetness attribute (which is related more to the manner of cleaning) and a higher frequency of PresenceOfObject (which is related to the intended result). On the other hand, the manner verb rinse has these distributions the other way around.

Figure 4.2: Percentage of samples describing 0 to 3 changes of state (pilot and cucumber datasets)

Figure 4.3: CoS distributions over attributes for clean and rinse (pilot dataset)

Another observation regarding the CoS distributions in the pilot dataset is that not all the descriptions describe the same attributes. For example, for clean, most of the descriptions describe Wetness and PresenceOfObject, but there is also some distribution over the attributes Texture, OcclusionBySecondObject, Color, etc. One reason this may happen is that when a verb-object pair is presented to a turker without an accompanying scene, the turker may rely more on their imagination when describing the change of state, whereas when the scene is shown, they can see the change directly. For example, if the verb-object pair is shake broccoli but the turker does not see that the broccoli is covered with water, they will not describe the water droplets that come off as it is shaken. Moreover, even in conditions when the scene is shown, the turkers may describe the same change of state in different ways, resulting in different CoS annotations. For example, clean dishes: "Food is removed from the dishes" describes the PresenceOfObject attribute of the Associated Object, while clean dishes: "Dish surface is cleared of debris and/or muck." describes the OcclusionBySecondObject attribute of the Direct Object.

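The per-verb attribute distributions discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CoSFrame` class, the `attribute_distribution` helper, and the example annotations are hypothetical names and made-up data, not the thesis's annotation tooling or results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CoSFrame:
    """One change-of-state annotation (hypothetical container for the three frame slots)."""
    obj: str        # "DirectObject", "PartOfObject", or "AssociatedObject"
    attribute: str  # e.g. "Wetness", "NumberOfPieces"
    value: str      # e.g. "Increase", "Decrease", "Change", "Specific"


def attribute_distribution(frames):
    """Return the relative frequency of each attribute among the given frames."""
    counts = Counter(frame.attribute for frame in frames)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {attr: n / total for attr, n in counts.items()}


# Made-up annotations for the verb "rinse" (illustrative only, not thesis data).
rinse_frames = [
    CoSFrame("DirectObject", "Wetness", "Increase"),
    CoSFrame("DirectObject", "Wetness", "Increase"),
    CoSFrame("AssociatedObject", "PresenceOfObject", "NoLongerPresent"),
    CoSFrame("DirectObject", "Wetness", "Increase"),
]
rinse_distribution = attribute_distribution(rinse_frames)
print(rinse_distribution)
```

Keeping the frame as an immutable record makes it straightforward to count over any slot; counting over `obj` or `value` instead of `attribute` would give the corresponding distributions.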
4.3.3 The Role of Visual Context

To determine how the presence of a scene or how the object of a verb affects the types of CoS turkers describe in their responses, we define a Jensen-Shannon divergence (JSD) based metric, variability. The JSD of two distributions P and Q is given by the formula below.

$$\mathrm{JSD}(P \parallel Q) = \frac{1}{2} D(P \parallel M) + \frac{1}{2} D(Q \parallel M) \tag{4.1}$$

where

$$M = \frac{1}{2}(P + Q) \tag{4.2}$$

D is the Kullback-Leibler divergence, a non-symmetric measure of the difference between two distributions.

$$D(P \parallel Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)} \tag{4.3}$$

The advantage of using JSD is that it is a symmetric measure. It measures the dissimilarity between two distributions, equaling 0 when the distributions are the same and growing as they become more different, up to a maximum of ln 2 ≈ 0.69 with the natural logarithm of Equation 4.3 (the maximum is 1 only with a base-2 logarithm).

| Verb | +/- Scene | 3 Objects | Objects +scene | Objects -scene |
|---|---|---|---|---|
| clean | 0.03 | 0.04 | 0.08 | 0.05 |
| rinse | 0.01 | 0.05 | 0.09 | 0.07 |
| wipe | 0.02 | 0.14 | 0.15 | 0.19 |
| cut | 0.01 | 0.02 | 0.01 | 0.05 |
| chop | 0.02 | 0.03 | 0.03 | 0.04 |
| mix | 0.05 | 0.13 | 0.15 | 0.17 |
| stir | 0.09 | 0.21 | 0.24 | 0.25 |
| add | 0.12 | 0.22 | 0.19 | 0.33 |
| open | 0.09 | 0.32 | 0.34 | 0.41 |
| shake | 0.18 | 0.42 | 0.42 | 0.43 |

Table 4.3: Variability between CoS frequencies per object and scene conditions (pilot dataset)

The variability describes how the CoS distribution of a verb differs depending on a certain variable (+/- scene or object). We compute the variability by averaging the JSD over every pair of CoS distributions of the verb, where the distributions of each pair are taken over different values of the variable. For example, the variability of the verb shake over the scene conditions is found by dividing the JSD of the CoS distributions in the +scene and -scene conditions by 1 (the number of pairs of conditions). Moreover, the variability over the object conditions is found by summing the JSD of the CoS distributions for each pair of object conditions (three unique pairs), and dividing by 3. The general variability formula is shown in Equation 4.4.

$$\text{variability} = \frac{\sum_{\text{distr pairs } (i,j)} \mathrm{JSD}(d_i, d_j)}{\text{num pairs}} \tag{4.4}$$

The variabilities between various conditions for each verb in the pilot dataset are shown in Table 4.3. The variability metric shows that there is indeed some difference between the CoS distributions in the +scene and -scene conditions. Moreover, the variability is much higher for some verbs (shake 0.18, add 0.12) than others (cut 0.01, rinse 0.01). This may be because for verbs like shake and add, without the accompanying scene it is not clear how the state of the object will change.

| Verb | Label card. | +Scene | -Scene |
|---|---|---|---|
| add | 0.64 | 0.78 | 0.52 |
| chop | 1.46 | 1.44 | 1.48 |
| clean | 0.88 | 0.99 | 0.78 |
| cut | 1.42 | 1.41 | 1.43 |
| mix | 0.71 | 0.68 | 0.76 |
| open | 0.82 | 0.91 | 0.73 |
| rinse | 0.88 | 0.89 | 0.87 |
| shake | 0.52 | 0.53 | 0.53 |
| stir | 0.52 | 0.61 | 0.47 |
| wipe | 0.65 | 0.78 | 0.53 |

Table 4.4: Label cardinality per verb (pilot dataset)

Table 4.4 shows the average number of CoS labels for each of the turkers' descriptions. For most verbs (6 of 10), more changes of state are described when the scene is shown, indicating that the scene presents more information about CoS to the turker for these verbs. Taken together, this data shows that visual context is important for determining the change of state denoted by a verb.

4.3.4 Effect of Direct Object on CoS

Table 4.3 also shows the variability for each verb in the pilot dataset computed over the three object conditions. The variability among objects is different for each verb, showing that for some verbs in this kitchen domain the CoS depends more on the object of the verb than for others. The verb with the highest variability is again shake (0.42). The resulting state change from shaking a (wet) piece of broccoli is very different from that of shaking a container of spices over food, or a bowl with eggs. This shows that even though the verb sense is the same in all these descriptions, the CoS indicated by the verb may depend on the object of the verb.

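As a rough illustration of Equations 4.1 to 4.4, the sketch below computes the JSD and the variability for distributions given as equal-length lists of probabilities. The function names and the example distributions are assumptions made for this illustration; they are not the code or the data used in the study.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """D(P || Q) with the natural logarithm (Equation 4.3).

    Terms with P(i) == 0 contribute nothing; Q(i) must be positive wherever
    P(i) is positive, which always holds when Q is the mixture M below.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence (Equations 4.1 and 4.2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def variability(distributions):
    """Average JSD over all unordered pairs of distributions (Equation 4.4)."""
    pairs = list(combinations(distributions, 2))
    return sum(jsd(p, q) for p, q in pairs) / len(pairs)


# Hypothetical CoS distributions of one verb over three object conditions,
# each defined over the same three attributes (made-up numbers).
per_object = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
print(variability(per_object))
```

With two conditions (for example +scene and -scene) there is a single pair, so the variability reduces to the JSD of the two distributions, matching the worked description in the text.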
4.3.5 Verb Semantic Similarity based on CoS

To compare the CoS distributions between each pair of verbs in the pilot dataset we computed the Jensen-Shannon divergence of each pair. Table 4.5 shows that the distributions for verbs from the pilot data are very similar for verbs with similar semantics (e.g. JSD(cut, chop) = 0.01 and JSD(mix, stir) = 0.03 vs. JSD(cut, shake) = 0.59 and JSD(rinse, chop) = 0.68). This shows that the CoS frame is capturing relevant semantic information.

| Verb pair | JSD | Verb pair | JSD | Verb pair | JSD |
|---|---|---|---|---|---|
| cut-chop | 0.01 | mix-shake | 0.36 | rinse-stir | 0.45 |
| mix-stir | 0.03 | chop-stir | 0.37 | rinse-mix | 0.45 |
| rinse-wipe | 0.04 | cut-stir | 0.39 | chop-add | 0.47 |
| clean-rinse | 0.05 | wipe-add | 0.39 | cut-add | 0.48 |
| clean-wipe | 0.06 | clean-shake | 0.39 | wipe-open | 0.51 |
| add-shake | 0.11 | chop-open | 0.40 | clean-open | 0.53 |
| stir-add | 0.20 | chop-mix | 0.42 | rinse-open | 0.57 |
| add-open | 0.23 | rinse-add | 0.42 | chop-shake | 0.59 |
| mix-add | 0.27 | wipe-stir | 0.42 | cut-shake | 0.59 |
| open-shake | 0.30 | cut-open | 0.43 | wipe-chop | 0.67 |
| wipe-shake | 0.31 | cut-mix | 0.43 | clean-chop | 0.67 |
| stir-shake | 0.32 | clean-mix | 0.43 | wipe-cut | 0.68 |
| stir-open | 0.32 | clean-stir | 0.43 | clean-cut | 0.68 |
| rinse-shake | 0.33 | wipe-mix | 0.44 | rinse-cut | 0.68 |
| mix-open | 0.34 | clean-add | 0.45 | rinse-chop | 0.68 |

Table 4.5: Jensen-Shannon divergence between CoS distributions for verb pairs (pilot dataset)

Chapter 5 Automated Identification of CoS

5.1 The Cucumber Dataset

To discover which features are important for predicting CoS labels, we collected a second dataset by automatically identifying 553 verbs from the TACoS corpus and manually annotating them with CoS frame labels. In particular, we collected all the verbs within sentences describing the cucumber preparation activity (slicing a cucumber and placing it on a plate).
A student with no previous knowledge of the project was recruited and trained to label each verb with up to three CoS frames. They were shown the verb, its patient (both automatically identified), and the original sentence (e.g., "The person chops the cucumber into slices on the cutting board"). Then they were tasked with annotating the change of state that occurred to the patient as a result of the verb by choosing options from the CoS ontology to fill up to three CoS frames. If the change of state was not clear from this information, the student could view the videos.

Figure 4.2 shows the percentage of the TACoS sentences that received 0 to 3 CoS labels. Compared to the crowdsourced data, almost all of these verbs received at least one CoS annotation. This is because in the pilot study some of the turkers' responses did not describe changes of state and some of their descriptions did not fit into our CoS categories.

Table 5.1 shows the number of tokens of the top ten most frequent verbs from the TACoS cucumber video descriptions, their average number of CoS labels, and the number of different objects (e.g. 'cucumber', 'bowl', etc.) they occurred with. The average number of object types that each verb takes is 12.46.

| Verb | Count | Label card. | Num obj. |
|---|---|---|---|
| get | 106 | 1.44 | 17 |
| take | 88 | 1.51 | 15 |
| wash | 58 | 1.10 | 9 |
| cut | 56 | 1.09 | 14 |
| rinse | 50 | 1.36 | 9 |
| slice | 30 | 1.07 | 17 |
| place | 30 | 1.07 | 16 |
| peel | 24 | 2.67 | 10 |
| put | 21 | 1.00 | 8 |
| remove | 16 | 2.31 | 10 |

Table 5.1: Verb counts and label cardinality of top ten most frequent verbs (cucumber dataset)

5.2 Multilabel Classification

To explore the effect of different feature sets on CoS prediction, we formulated the problem as a multilabel classification problem. The goal of a typical classification problem (i.e. supervised learning with discrete labels) is to learn a classifier that can predict the correct label for a sample given a vector of feature values that represent the sample. A set of N samples $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, where $x_i$ is the vector of feature values for sample i and $y_i$ is sample i's label, is used to train and test a classification model. In a multilabel classification problem, $y_i$ is a set of labels rather than a single label [25, 15].

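To make the multilabel formulation concrete, the sketch below encodes each sample's label set $y_i$ as a 0/1 indicator vector over a fixed attribute list. The attribute subset, the `to_indicator` helper, and the three samples are hypothetical and chosen only for illustration; the full ontology has 18 attributes.

```python
# Hypothetical subset of the CoS attributes, in a fixed order.
ATTRIBUTES = ["NumberOfPieces", "Wetness", "Location", "Shape"]


def to_indicator(label_set, attributes=ATTRIBUTES):
    """Encode a set of attribute labels as a 0/1 vector aligned with `attributes`."""
    return [1 if attribute in label_set else 0 for attribute in attributes]


# Each sample pairs a verb-object description with its *set* of labels (made-up data).
samples = [
    ({"verb": "slice", "object": "cucumber"}, {"NumberOfPieces", "Shape"}),
    ({"verb": "rinse", "object": "cucumber"}, {"Wetness"}),
    ({"verb": "get", "object": "knife"}, {"Location"}),
]

Y = [to_indicator(labels) for _, labels in samples]
print(Y)
```

The first sample carries two labels at once, which is exactly what distinguishes this setting from ordinary single-label classification.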
In this particular problem a sample $i$ consists of a single verb-object pair and its CoS frame annotations. The feature vector $x_i$ (e.g., linguistic features of the sentence containing the verb or visual features of the video it describes) is extracted from the sample. The label set $y_i$ consists of the Attributes of the annotations (for example, NumberOfPieces, Wetness, etc.). Note that although we collected Object and result Value information for each verb, we did not include it as part of the prediction problem.

We explored two methods of multilabel classification: Binary Relevance (BR) and Label Combination (LC). Binary relevance works by training 18 separate classifiers (one for each attribute) and then applying them independently to predict the labels for each sample. Thus, correlations between attributes are not taken into account with this method. In contrast, the label combination method takes into account correlations between the labels by treating each unique label set as a single label/class and applying traditional classification techniques.

Because multilabel classification is a significantly different problem from traditional classification, special metrics are used for evaluation. Three common metrics for multilabel classification are Exact Match [26, 15], Hamming Score [5, 26, 15], and Jaccard Index [26, 15, 16]. Exact match is an example-based metric of accuracy computed as the number of label sets that are correct divided by the number of samples. This is a strict metric because "a single false positive or false negative label makes the [sample] incorrect" [15]. Furthermore, Hamming score is a label-based metric of accuracy. It is computed by averaging the accuracy for each label calculated independently. This is considered a lax metric because it "tends to be overly lenient due to the typical sparsity of labels in multi-label data," rewarding highly conservative prediction [15]. Lastly, Jaccard index provides an accuracy metric between Exact match and Hamming score in terms of strictness. It is a "ratio of the size of the union and intersection of the predicted and actual labels sets, for each example and averaged over the number of examples" [15], i.e., the size of the intersection divided by the size of the union.
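The three metrics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used for the experiments: label sets are represented as Python sets, the attribute names are a hypothetical subset, and two empty label sets are assumed to have a Jaccard index of 1.

```python
def exact_match(true_sets, pred_sets):
    """Fraction of samples whose predicted label set is exactly right."""
    hits = sum(1 for t, p in zip(true_sets, pred_sets) if t == p)
    return hits / len(true_sets)


def hamming_score(true_sets, pred_sets, all_labels):
    """Accuracy of each label's present/absent decision, averaged over labels."""
    per_label = []
    for label in all_labels:
        correct = sum(1 for t, p in zip(true_sets, pred_sets)
                      if (label in t) == (label in p))
        per_label.append(correct / len(true_sets))
    return sum(per_label) / len(per_label)


def jaccard_index(true_sets, pred_sets):
    """Per-sample |intersection| / |union|, averaged over samples."""
    total = 0.0
    for t, p in zip(true_sets, pred_sets):
        union = t | p
        # Assumption: two empty label sets count as a perfect match.
        total += 1.0 if not union else len(t & p) / len(union)
    return total / len(true_sets)


# Hypothetical attribute subset and toy predictions.
labels = ["NumberOfPieces", "Wetness", "Location"]
y_true = [{"NumberOfPieces"}, {"Wetness"}, {"Location", "Wetness"}]
y_pred = [{"NumberOfPieces"}, {"Wetness", "Location"}, {"Location"}]

em = exact_match(y_true, y_pred)            # 1 of 3 sets exactly right
hs = hamming_score(y_true, y_pred, labels)  # 7 of 9 label decisions right
ji = jaccard_index(y_true, y_pred)          # (1 + 1/2 + 1/2) / 3
```

On this toy input the ordering exact match < Jaccard index < Hamming score mirrors the strictness ordering described above.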
5.3 Results

5.3.1 Linguistic and Visual Features

The first linguistic feature set, bag of words (BoW), is extracted from the sentence containing the verb-object pair. In the case of data from the pilot study, BoW features were extracted from the open-ended descriptions of CoS. This feature set contains the lemma of each word in the sentence concatenated to its part of speech (e.g., 'cut-verb', 'cucumber-noun', etc.). Thus some syntactic information is contained in the BoW feature set. The set is then converted into a vector of 1s and 0s, representing whether or not each PoS-lemma combination is present in the sentence. This procedure yielded 264 features for samples from the cucumber dataset and 1159 features for samples from the pilot dataset.

The verb+object feature set (VO) is the second linguistic feature set. It consists of a binary representation of the lemmatized verb and object of each sample's verb-object pair.

The third feature set is visual rather than linguistic. Before we extract the visual features, we make the assumption that the ground truth correspondence of the verb and object in the video is known. The visual features are extracted from the video clip described by the verb and include the following.

1. Difference in area of the object at the beginning and end of the video clip
2. Distance between the start and end location of the object
3. Difference in color (Euclidean distance) of the object between the start and end of the video clip
4. Difference in texture (Euclidean distance between HoG features) of the object at the start and end of the video clip
5. Difference in the object's moments of inertia at the start and end of the video clip (this may indicate change of orientation)
6. Whether or not the object was occluded at the beginning and end of the video clip
7. Whether or not the object was present in the scene at the beginning and end of the video clip

5.3.2 Comparison of Features on Cucumber Dataset

Three different feature sets and their combinations were tried. The goal is to evaluate how important different linguistic and visual features are to the identification of change of state.
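As an illustration of the BoW feature set from Section 5.3.1, the sketch below builds binary lemma-PoS vectors in Python. It assumes the sentences have already been lemmatized and part-of-speech tagged by some upstream tool; the input format (lists of `(lemma, pos)` pairs) and the function names are hypothetical.

```python
def build_vocabulary(tagged_sentences):
    """Map every distinct 'lemma-pos' feature to a column index."""
    features = sorted({f"{lemma}-{pos}"
                       for sentence in tagged_sentences
                       for lemma, pos in sentence})
    return {feature: index for index, feature in enumerate(features)}


def bow_vector(tagged_sentence, vocabulary):
    """Binary vector: 1 if the lemma-pos feature occurs in the sentence."""
    vector = [0] * len(vocabulary)
    for lemma, pos in tagged_sentence:
        index = vocabulary.get(f"{lemma}-{pos}")
        if index is not None:  # features unseen in the vocabulary are skipped
            vector[index] = 1
    return vector


# Toy input, assumed to be pre-lemmatized and PoS-tagged.
sentences = [
    [("person", "noun"), ("cut", "verb"), ("cucumber", "noun")],
    [("person", "noun"), ("wash", "verb"), ("cucumber", "noun")],
]
vocab = build_vocabulary(sentences)
first = bow_vector(sentences[0], vocab)
second = bow_vector(sentences[1], vocab)
```

The VO feature set could be produced the same way by passing only the verb and object lemmas of each sample instead of the whole sentence.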
                 Jaccard index     Hamming score     Exact match
DT+BR, BoW       0.601 +/- 0.121   0.951 +/- 0.019   0.363 +/- 0.259
DT+BR, BoW+VO    0.612 +/- 0.128   0.952 +/- 0.019   0.372 +/- 0.271
DT+LC, BoW       0.752 +/- 0.054   0.968 +/- 0.009   0.696 +/- 0.066
DT+LC, BoW+VO    0.855 +/- 0.048   0.983 +/- 0.007   0.790 +/- 0.067
LR+BR, BoW       0.698 +/- 0.052   0.965 +/- 0.006   0.602 +/- 0.045
LR+BR, BoW+VO    0.778 +/- 0.045   0.976 +/- 0.007   0.694 +/- 0.048
LR+LC, BoW       0.775 +/- 0.038   0.971 +/- 0.005   0.720 +/- 0.039
LR+LC, BoW+VO    0.854 +/- 0.021   0.983 +/- 0.003   0.801 +/- 0.028
BL               0.868             0.988             0.790

Table 5.2: CoS attribute classification accuracy (cucumber dataset)

Initially only the linguistic features were used in combination with two types of classifiers, decision tree (DT) and logistic regression (LR), and the two methods of multilabel classification, BR and LC, described in Section 5.2. The results for the cucumber dataset are shown in Table 5.2. The data was randomly split into 80% for training and 20% for testing. Five replicates of the experiment were done and their accuracy measurements averaged. The baseline (BL) predicts each verb to have the majority label set for that verb, i.e. it predicts label sets based on the verb.

Table 5.3 shows the prediction results for the cucumber dataset using all combinations of the two linguistic and one visual feature sets. The data was randomly split into 60% for training and 40% for testing. Logistic regression with l1 regularization was used for classification. The scores show the average of five replicates and the standard deviations.

The results show that only the BoW combined with the VO feature set, as well as all three feature sets combined, performed better than the baseline (exact match 0.8486 and 0.8495 vs. 0.790). The visual features alone do not perform nearly as well as the baseline (0.3667 vs. 0.790); however, they do perform better than random assignment of label sets (0.3667 vs. 0.0532), showing that they do contain some useful information related to the attributes undergoing change.
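The per-verb majority baseline (BL) used in Tables 5.2 and 5.3 can be sketched as follows. This is a minimal Python illustration of the idea as described in the text, not the original implementation; the function names, the toy attribute "Shape", and the choice to predict an empty label set for verbs unseen in training are all assumptions.

```python
from collections import Counter, defaultdict


def fit_baseline(training_samples):
    """Learn the most frequent label set for each verb.

    training_samples: iterable of (verb, set_of_attribute_labels).
    """
    counts = defaultdict(Counter)
    for verb, labels in training_samples:
        counts[verb][frozenset(labels)] += 1
    return {verb: counter.most_common(1)[0][0]
            for verb, counter in counts.items()}


def predict_baseline(model, verb, default=frozenset()):
    """Predict the stored majority label set.

    Assumption: verbs never seen in training get the empty label set.
    """
    return model.get(verb, default)


train = [
    ("cut", {"NumberOfPieces"}),
    ("cut", {"NumberOfPieces"}),
    ("cut", {"NumberOfPieces", "Shape"}),  # "Shape" is a hypothetical attribute
    ("wash", {"Wetness"}),
]
model = fit_baseline(train)
cut_prediction = predict_baseline(model, "cut")
unseen_prediction = predict_baseline(model, "peel")
```

Because the baseline looks only at the verb, it cannot separate different uses of the same verb, which is where sentence-level features have room to help.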
                                  Exact match         Jaccard index       Hamming score
BL                                0.790               0.868               0.984
Random assignment of label sets   0.0532 +/- 0.0125   0.1319 +/- 0.0093   0.8648 +/- 0.0029
BoW                               0.7369 +/- 0.0190   0.7833 +/- 0.0164   0.9736 +/- 0.0015
VO                                0.7595 +/- 0.0259   0.8393 +/- 0.0222   0.9819 +/- 0.0024
BoW+VO                            0.8486 +/- 0.0225   0.8844 +/- 0.0202   0.9870 +/- 0.0022
Visual                            0.3667 +/- 0.0301   0.4480 +/- 0.031    0.9306 +/- 0.0042
BoW+Visual                        0.7405 +/- 0.0244   0.7868 +/- 0.0201   0.9742 +/- 0.002
VO+Visual                         0.7450 +/- 0.0234   0.8328 +/- 0.0202   0.9813 +/- 0.0021
BoW+VO+Visual                     0.8495 +/- 0.0218   0.8847 +/- 0.0195   0.9871 +/- 0.0021

Table 5.3: CoS attribute classification accuracy using various feature sets (cucumber dataset)

Table 5.4 shows per-verb classification results for the three most frequent verbs. Predictions were made using logistic regression with l1 regularization with 60% of the data used for training and 40% for testing. The label combination method was used for multilabel classification. The scores show the average of five replicates and the standard deviations.

The results show that the feature sets that give the best performance depend on the verb. The verb get has the best exact match score with the visual and BoW features (0.819 vs. baseline 0.642). Adding the VO features offers no further improvement. Take has the best performance with only BoW features (exact match 0.872 vs. baseline 0.439). Adding the VO features, visual features, or both to the BoW offers no further improvement. Lastly, cut has the best performance with the BoW and VO features combined (exact match 0.791 vs. baseline 0.687). Moreover, when the visual features are used in addition to this feature set, the performance for this verb actually decreases to exact match 0.783. Overall, these results show that the most important features for predicting the CoS of a verb may depend on the verb, rather than there being a single best feature set for all.
5.3.3 Predicting the Attribute Described by NL Descriptions of CoS

We classified the CoS descriptions from the pilot data in order to determine which features are important in predicting the change of state. Note that this dataset provides a distinct problem from the other three datasets. In the case of the pilot data, the turker has provided a direct description of a change of state, from which the features for classification are extracted. On the other hand, for the cucumber dataset the features are extracted from sentences which describe the action (rather than change of state). If a robot is not able to understand the CoS from a human's narration of an action in the kitchen, then it should be able to ask what CoS is indicated and subsequently extract the CoS from the human's description.

                              get               take              cut
Num examples                  106               88                56
Num unique label sets         3                 5                 5
BL               EM           0.642 +/- 0.062   0.439 +/- 0.069   0.687 +/- 0.084
                 JI           0.809 +/- 0.004   0.700 +/- 0.003   0.748 +/- 0.007
                 HS           0.979 +/- 0.004   0.965 +/- 0.003   0.974 +/- 0.007
BoW              EM           0.814 +/- 0.015   0.872 +/- 0.065   0.774 +/- 0.058
                 JI           0.900 +/- 0.001   0.926 +/- 0.004   0.830 +/- 0.007
                 HS           0.989 +/- 0.001   0.991 +/- 0.004   0.983 +/- 0.007
VO               EM           0.619 +/- 0.035   0.628 +/- 0.084   0.757 +/- 0.052
                 JI           0.798 +/- 0.003   0.792 +/- 0.006   0.817 +/- 0.004
                 HS           0.978 +/- 0.003   0.975 +/- 0.006   0.982 +/- 0.004
BoW+VO           EM           0.800 +/- 0.019   0.872 +/- 0.065   0.791 +/- 0.051
                 JI           0.893 +/- 0.001   0.926 +/- 0.004   0.848 +/- 0.006
                 HS           0.988 +/- 0.001   0.991 +/- 0.004   0.985 +/- 0.006
Visual           EM           0.600 +/- 0.043   0.472 +/- 0.039   0.757 +/- 0.09
                 JI           0.788 +/- 0.003   0.714 +/- 0.002   0.817 +/- 0.007
                 HS           0.976 +/- 0.003   0.967 +/- 0.002   0.982 +/- 0.007
BoW+Visual       EM           0.819 +/- 0.023   0.872 +/- 0.065   0.774 +/- 0.064
                 JI           0.902 +/- 0.002   0.926 +/- 0.004   0.830 +/- 0.007
                 HS           0.989 +/- 0.002   0.991 +/- 0.004   0.983 +/- 0.007
VO+Visual        EM           0.605 +/- 0.044   0.617 +/- 0.121   0.765 +/- 0.065
                 JI           0.793 +/- 0.003   0.789 +/- 0.007   0.826 +/- 0.005
                 HS           0.977 +/- 0.003   0.975 +/- 0.007   0.983 +/- 0.005
BoW+VO+Visual    EM           0.819 +/- 0.023   0.872 +/- 0.065   0.783 +/- 0.073
                 JI           0.902 +/- 0.002   0.926 +/- 0.004   0.839 +/- 0.007
                 HS           0.989 +/- 0.002   0.991 +/- 0.004   0.984 +/- 0.007

Table 5.4: Per-verb CoS attribute classification accuracy using various feature sets (cucumber dataset). EM = exact match, JI = Jaccard index, HS = Hamming score.
Table 5.5 shows the results for classification of the pilot data using all combinations of the linguistic feature sets with two types of classifiers, decision tree (DT) and logistic regression (LR), and the two methods of multilabel classification, BR and LC, described in Section 5.2. The data was randomly split into 80% for training and 20% for testing. Five replicates of the experiment were done and their accuracy measurements averaged. The baseline (BL) predicts each verb to have the majority label set for that verb, i.e. it predicts label sets based on the verb.

                 Jaccard index     Hamming score     Exact match
DT+BR, BoW       0.426 +/- 0.197   0.949 +/- 0.019   0.271 +/- 0.256
DT+BR, BoW+VO    0.429 +/- 0.200   0.949 +/- 0.019   0.273 +/- 0.259
DT+LC, BoW       0.726 +/- 0.021   0.976 +/- 0.002   0.666 +/- 0.026
DT+LC, BoW+VO    0.732 +/- 0.022   0.977 +/- 0.002   0.674 +/- 0.022
LR+BR, BoW       0.691 +/- 0.027   0.970 +/- 0.002   0.624 +/- 0.025
LR+BR, BoW+VO    0.610 +/- 0.037   0.961 +/- 0.003   0.540 +/- 0.037
LR+LC, BoW       0.738             0.983             0.684
LR+LC, BoW+VO    0.743             0.983             0.689
BL               0.486             0.973             0.429

Table 5.5: CoS attribute classification accuracy for open-ended change of state descriptions (pilot dataset)

The results show that the simple bag of words (BoW) feature set was enough for performance above the baseline (BL) with the LC method; when the verb and object (VO) were also provided as features, accuracy increased in every configuration except LR+BR. Furthermore, the results show two interesting things. (1) LC, the multilabel classification method that takes into account correlations between labels, performs better than BR, which does not take correlations into consideration. This is in line with our observation that some attributes are correlated. (2) In comparison to the results for the cucumber dataset, the classification accuracy for the pilot dataset is not as high. We might expect the opposite since the pilot sentences are direct descriptions of the CoS. This may be because a large number of sentences from the pilot dataset are annotated as describing no CoS, resulting in fewer examples to learn from. Moreover, there is much more variation in the CoS descriptions (pilot data) because some turkers gave thorough, full-sentence responses while others only responded with one or two words.
Chapter 6
Complexity of Verb Semantics based on CoS

6.1 Multilevel Dataset and Crowdsource Study

The TACoS Multilevel corpus consists of descriptions of cooking videos (the same videos as from the original TACoS corpus) at three levels of detail, including single sentence, short (about five sentences), and detailed descriptions (no more than 15 sentences) [18, 21]. The corpus contains 20 triples of descriptions for each video and there are five videos per activity. An example of descriptions for the cucumber preparation activity is shown below. (Note that these sentences were processed to have the same subject, etc.)

Single sentence
"The person entered the kitchen and sliced a cucumber"

Short
"The person walked into the kitchen. The person got a cutting board, knife, and cucumber. The person washed the cucumber. The person put the cucumber on the cutting board. The person sliced the cucumber. The person put the cucumber on the plate."

Detailed
"The person walked into the kitchen. The person removed a cutting board and knife from the drawer. The person put the plate on the counter. The person washed the cucumber at the sink. The person placed the cucumber and the plate next to the cutting board..."

We used this corpus to examine several questions about how the level of detail of an activity description affects the CoS denoted by verbs in the description.

6.1.1 Bread and Cucumber Multilevel Dataset

To examine the effect of visual context we collected CoS annotations for the sentences describing the bread and cucumber activities of the TACoS multilevel corpus. Note that these sentences are distinct from the cucumber dataset from Section 5.1. CoS annotations were collected by presenting a sentence from the corpus containing a verb-object pair, sometimes accompanied by the video clip described by the sentence. The difference in responses to sentences with and without the clip will elucidate the effect of visual context on CoS, as in the pilot study. The annotations of verbs at different levels of detail allow for analyses about the relations between the levels in terms of CoS.
Turkers were tasked with filling out the CoS frame for up to three changes of state by selecting the frame slot options from a dropdown menu. They were also instructed to check 'Current change of state frame is not applicable' (CoS-NA) if none of the options satisfactorily described the change of state denoted by the verb. Furthermore, they could check 'No change of state' (No-CoS) if the verb did not denote a change of state. In the latter two cases turkers were prompted to provide a reason for their response.

Annotations for five of the tuples of descriptions (out of 20) were collected with a +/- scene condition for each of the ten bread and cucumber activity videos. We collected three turker responses for each verb to ensure inter-annotator agreement. Thus, with 10 videos × 21 sentences × 5 tuples × 2 video conditions × 3 replicates, we can expect 6300 responses. However, in reality we collected 5100 responses for the cucumber and bread videos. This is because not all the short and detailed descriptions contained five or fifteen sentences, respectively. In addition, not all sentences contained only a single verb, or a verb that takes a patient (the object that undergoes a change of state).

From the three turker responses collected for each verb-object-scene condition, we marked unanimous labels whenever at least two of the attribute, CoS-NA, or No-CoS labels match (e.g., a unanimous No-CoS label is marked if at least two of the three responses' No-CoS labels match). Because we are only examining the attributes of the CoS frames, we do not consider the object and value portions of the frame for now.
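The two-of-three agreement rule and the expected response count can be sketched as below. This is a minimal Python illustration under assumptions: each turker response is represented as a set of labels (attribute names, "CoS-NA", or "No-CoS"), the function name is hypothetical, and "Shape" is a made-up attribute used only for the example.

```python
from collections import Counter


def unanimous_labels(responses, min_agree=2):
    """Return labels chosen by at least `min_agree` of the responses.

    responses: list of label sets, one per turker, for a single
    verb-object-scene condition.
    """
    counts = Counter(label for response in responses for label in set(response))
    return {label for label, n in counts.items() if n >= min_agree}


# Three hypothetical responses for one verb-object-scene condition.
responses = [
    {"NumberOfPieces"},
    {"NumberOfPieces", "Shape"},
    {"No-CoS"},
]
agreed = unanimous_labels(responses)
no_agreement = unanimous_labels([{"Wetness"}, {"Location"}, {"No-CoS"}])

# Expected number of responses if every description had its full length:
# videos x sentences per tuple (1 + 5 + 15) x tuples x conditions x replicates.
expected_responses = 10 * (1 + 5 + 15) * 5 * 2 * 3
```

Under this rule a condition can end up with no unanimous label at all when the three responses disagree completely.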
6.1.2 24 Activities Multilevel Dataset

Lastly, a fourth dataset was collected in order to validate the CoS ontology by showing that the CoS frame and frame slot options apply to a wide range of kitchen activities. We chose one video of each of the remaining 24 activities from the TACoS multilevel corpus and turkers were tasked with annotating the accompanying descriptions. These activities include dicing an onion, frying eggs, etc. This broad scope of kitchen activities ensures that a variety of verbs will appear in the descriptions. Annotations for five (out of 20) of the tuples for each video were collected as in Section 6.1.1. Collecting annotations for this diverse set of activities should tell us how well the CoS ontology covers the kitchen domain.

With 24 videos × 21 sentences × 5 tuples × 3 replicates, we expected to collect 7560 turker responses. However, in total we collected 8256 turker responses for the same reasons as stated in Section 6.1.1. Unanimous labels were also computed as in the section above.

6.2 Results of Human Studies

6.2.1 Coverage of CoS Ontology

Table 6.1 and Table 6.2 show how well the CoS frame options cover the verbs in the bread and cucumber and the 24 activities datasets, respectively. These statistics were computed by counting the number of turker responses that contained 'Current change of state frame is not applicable' and 'No change of state' labels. Unanimous (UN) labels exist whenever two or more turkers agree on a label for a given verb. 0.5 percent of the bread and cucumber verb instances and 0.4 percent of the instances from the other 24 activities have unanimous CoS-NA labels. The low number of CoS-NA labels suggests that the coverage of the CoS ontology is quite thorough.

                         All      Single   Short    Detailed   +Scene   -Scene
Num. verb types          94       16       43       79         94       94
Num. verb tokens         1676     140      456      1080       838      838
Num. CoS-NA (UN)         8        0        5        3          5        3
Perc. CoS-NA (UN)        0.0048   0.0000   0.0110   0.0028     0.0060   0.0036
Num. No-CoS (UN)         4        0        2        2          1        3
Perc. No-CoS (UN)        0.0024   0.0000   0.0044   0.0019     0.0012   0.0036
Num. turker responses    5100     433      1384     3283       2548     2552
Num. CoS-NA              57       3        22       32         22       35
Perc. CoS-NA             0.0112   0.0069   0.0159   0.0097     0.0086   0.0137
Num. No-CoS              49       4        11       34         20       29
Perc. No-CoS             0.0096   0.0092   0.0079   0.0104     0.0078   0.0114

Table 6.1: Coverage of the CoS frame options (bread and cucumber multilevel dataset)

                         All      Single   Short    Detailed
Num. verb types          222      53       112      201
Num. verb tokens         2715     200      658      1857
Num. CoS-NA (UN)         12       0        4        8
Perc. CoS-NA (UN)        0.0044   0.0000   0.0061   0.0043
Num. No-CoS (UN)         15       0        2        13
Perc. No-CoS (UN)        0.0055   0.0000   0.0030   0.0070
Num. turker responses    8256     611      2005     5640
Num. CoS-NA              115      9        27       79
Perc. CoS-NA             0.0139   0.0147   0.0135   0.0140
Num. No-CoS              133      4        29       100
Perc. No-CoS             0.0161   0.0065   0.0145   0.0177

Table 6.2: Coverage of the CoS frame options (24 activities multilevel dataset)

Based on the feedback provided by the turkers, there are several reasons that they selected CoS-NA. The most frequent reason is a mistake by the automatic part of speech tagger: either a noun modifier was labeled as a verb (e.g. cutting board) or the labeled verb was actually a deverbal adjective describing a non-changing state of the verb (e.g. the crumbs that remained, sealed package). Also, sometimes there were mistakes made by the automatic semantic role labeler (e.g., mislabeling agents as patients). In some cases the verb presented to the turker was actually not a concrete action verb (e.g. start, finish, continue, need). Lastly, the CoS frame does not apply to sentences from the original TACoS corpus that did not make sense (e.g. The person leaked the folk.). None of the above reasons result from the design of the CoS ontology; they stem from other factors.
Some interesting verbs that the CoS ontology does not cover include 'taste' and 'test' (e.g., "The person tested the temperature with his finger"), where the CoS occurs in the knowledge of the agent, and 'took' along with other verbs where the CoS is a change of possession. Change of state of knowledge and change of possession, however, are abstract concepts, not observable in terms of concrete visual features, so they were intentionally left out of the ontology. There were also vague verbs for which the CoS was not clear from only the linguistic context, for example 'prepare' and 'make'. Another case is the verb 'use', which actually seems to indicate that its patient is an instrument in another action. Furthermore, in cases where the verb was used to state the purpose of the action (e.g., "The person took out an orange to make orange juice") or an attempt that may or may not be successful (e.g., "The person tried to remove the lid"), the verb does not clearly describe a CoS that actually occurs. These instances were labeled with No-CoS. Lastly, the only concrete change of state that should have been included in the CoS ontology was turn on/off. For example, this change of state is visible by the flow of water from a sink faucet.

6.2.2 Effect of Visual Context

Table 6.3 compares the label sets between the +/- scene conditions of the ten most frequent verbs from the multilevel bread and cucumber dataset. The metrics show the variability between corresponding verb-object instances' label sets in the two scene conditions. The label cardinality indicates the average number of labels each verb was annotated with.

Verb        Count   Variability   +S label card.   -S label card.
all verbs   838     0.005         0.86             0.86
slice       96      0.015         0.80             0.73
cut         81      0.018         0.74             0.80
put         67      0.018         0.96             1.00
get         61      0.017         0.93             0.97
place       58      0.006         1.02             1.02
take        57      0.016         0.96             0.93
takeout     42      0.010         0.95             0.88
wash        32      0             1.00             1.00
open        31      0.086         0.48             0.55
remove      26      0.047         0.88             0.88

Table 6.3: Comparison between +/- scene conditions (bread and cucumber multilevel dataset)

Overall, the data shows that the label sets do not vary greatly between the +scene and -scene conditions for all verbs (variability of 0.005). That is, the visual context does not have a large influence on humans' interpretations of these verbs. Moreover, the average number of labels is the same for both scene conditions (0.86 labels). However, the data shows that the CoS interpretation of some verbs depends on the visual context, such as open with variability 0.086 and label cardinality 0.48 in +scene and 0.55 in -scene. The visual context does not affect the interpretation of other verbs as much, such as wash with variability 0 and label cardinality 1.00 in both scene conditions.

6.2.3 Level of Detail's Effect on CoS

Does the level of detail in which a verb appears affect the change of state that it denotes? Figure 6.1 shows how the labels are distributed over the attributes for four verbs that occur frequently in all three levels of the 24 activity dataset. We can see that for some of the verbs (put and wash) the distributions are the same for all three levels. However, for the verbs cut and slice the distributions differ depending on the level of detail in which the verb occurs. This suggests that there is some ambiguity in determining which changes of state the verbs denote. Furthermore, the change of state may be determined based on the context of use.

Figure 6.1: CoS distributions over attributes for verbs at three levels of detail (24 activities multilevel dataset), panels (a)-(d)

6.2.4 Level of Detail's Effect on Verb Frequency and Distribution

How does the level of detail affect which words appear most frequently? Figure 6.2 shows the frequencies of the top twenty most common verbs in each of the levels from the 24 activity dataset. As observed previously in the TACoS multilevel corpus, there are differences between the verbs used at different levels of detail [18, 21]. For example, the single sentence descriptions contain vague words like cook, prepare, and make, as well as verbs such as demonstrate which describe the activity at a higher level of abstraction. These verbs appear less frequently or not at all at more detailed levels of description.
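The remainder of this section summarizes each verb's spread over the three levels with an entropy score (Equation 6.1 and Table 6.4). As a rough sketch of that computation, the Python below normalizes a verb's per-level relative frequency into a distribution and takes its entropy with the natural logarithm, which is the base consistent with the stated maximum of 1.0986 (ln 3). Dividing each count by the total number of verb tokens at that level (200, 658, and 1857 in Table 6.2) is an assumption inferred from the reported numbers; the text does not state the normalization explicitly. The function names are hypothetical.

```python
import math


def level_distribution(verb_counts, level_totals):
    """Normalize a verb's per-level relative frequencies to sum to 1.

    Assumption: each count is first divided by the total number of verb
    tokens at that level, then the three rates are renormalized.
    """
    rates = [count / total for count, total in zip(verb_counts, level_totals)]
    rate_sum = sum(rates)
    return [rate / rate_sum for rate in rates]


def entropy(distribution):
    """Shannon entropy in nats; zero-probability levels contribute nothing."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


# Verb token totals per level (single, short, detailed), from Table 6.2.
LEVEL_TOTALS = (200, 658, 1857)

# Counts for 'cut' and 'prepare' (single, short, detailed), from Table 6.4.
cut_dist = level_distribution((19, 69, 129), LEVEL_TOTALS)
prepare_dist = level_distribution((8, 1, 1), LEVEL_TOTALS)

cut_entropy = entropy(cut_dist)          # close to the maximum, ln 3
prepare_entropy = entropy(prepare_dist)  # close to the minimum, 0
```

With these inputs the sketch reproduces the rounded values reported for cut and prepare in Table 6.4, which lends some support to the assumed normalization.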
How are the tokens of each verb distributed across the three levels? Table 6.4 shows the distributions of each verb in the 24 activity dataset over three levels of description. Additionally, the entropy of each distribution shows how likely the verb is to appear equally in all three levels (high entropy, max. of 1.0986...) or only to appear in one level (low entropy, min. 0). The entropy of a verb's distribution among the three levels is given by Equation 6.1, where $p(v_l)$ is the distribution of verb $v$ in level $l$. The natural logarithm is used, so the maximum for three levels is $\ln 3 \approx 1.0986$.

Figure 6.2: Frequencies of occurrence of the top twenty verbs (24 activities multilevel dataset), panels (a)-(c)

$$H(V) = -\sum_{l} p(v_l) \ln p(v_l) \qquad (6.1)$$

The data shows that indeed some abstract verbs like prepare, cook, and make are more highly distributed at levels of less detail (with 95%, 72%, and 56% chance of appearing in a single sentence description, respectively). These verbs describe the main activity in the video, but there is no reason to use them at greater levels of detail where the sequence of actions is described. Consequently, these verbs have relatively low entropy (0.22, 0.74, and 0.93, respectively). On the other hand, the verbs cut, wash, and rinse are nearly equally likely to appear in any level (with high entropy 1.08, 1.08, and 1.06, respectively). When used in low levels of detail they describe the most salient part of the video, i.e., the goal of the action (e.g., "The person cut the onion"). Conversely, they can also be used in high levels of detail where they describe a specific action in a sequence of actions (e.g., "The person set the onion on the cutting board. He cut the onion into small pieces. He put the pieces in a bowl...").

Verb       Entropy   Single   Short    Det.     Single   Short   Det.    All
                     distr.   distr.   distr.   count    count   count   count
cut        1.08      0.35     0.39     0.26     19       69      129     217
wash       1.08      0.26     0.39     0.35     6        30      77      113
put        1.08      0.25     0.38     0.37     8        41      112     161
use        1.07      0.44     0.32     0.24     5        12      26      43
rinse      1.06      0.24     0.31     0.45     4        17      71      92
remove     1.02      0.17     0.35     0.48     3        21      81      105
throw      1.02      0.17     0.35     0.48     2        14      54      70
peel       1.01      0.50     0.34     0.16     18       41      54      113
dice       0.98      0.51     0.35     0.14     4        9       10      23
slice      0.98      0.56     0.25     0.18     17       25      51      93
chop       0.97      0.50     0.37     0.12     7        17      16      40
enter      0.97      0.56     0.28     0.16     13       21      34      68
place      0.96      0.13     0.32     0.55     3        25      120     148
make       0.93      0.56     0.34     0.10     7        14      12      33
cook       0.74      0.72     0.22     0.06     20       20      16      56
take       0.69      0.00     0.46     0.54     0        30      101     131
separate   0.68      0.78     0.12     0.10     4        2       5       11
takeout    0.68      0.00     0.42     0.58     0        19      75      94
get        0.67      0.00     0.38     0.62     0        20      91      111
add        0.63      0.00     0.32     0.68     0        8       48      56
prepare    0.22      0.95     0.04     0.01     8        1       1       10

Table 6.4: The entropy of the distributions of each verb over three levels (24 activities multilevel dataset)

Chapter 7
Discussion and Conclusion

In conclusion, we have designed an ontology of changes of state as denoted by verbs based on representations of verbal semantics. Furthermore, three datasets containing change of state frame annotations of verbs in the kitchen domain were collected by building on top of the preexisting TACoS and TACoS multilevel corpora. Several analyses showed how visual context, the object of the verb, and level of description affect the way in which a person understands a verb to indicate change of state. It was also demonstrated that the CoS ontology can be used to annotate the changes of state denoted by a wide variety of verbs in the kitchen domain. Lastly, we showed that the CoS indicated by a verb can be predicted to some degree automatically based on linguistic and visual features.
In the future it would be interesting to create a set of classifiers to predict all the slots of the CoS frame (attribute, object, and value) rather than only the attribute. Furthermore, more in-depth study is needed to show how specifically the features of the object, the surrounding linguistic context of the verb, and the visual context of the scene described by the verb determine which CoS a verb denotes. This work may draw more inspiration from Generative Lexicon, which specifies how the properties of noun arguments may affect the meanings of verbs [12].

The current work may be relevant to future work on action learning by instruction (in combination with demonstration) for robots, as it provides a set of changes of state synonymous with the goal of the action. Furthermore, we demonstrated with the prediction results that even if the verb is unknown (as is the case when hearing a novel verb), the number of CoS hypotheses can be narrowed down based on the surrounding linguistic and visual context.

Also, after the robot has learned a new action, the human may provide further feedback by describing the CoS of the verb (i.e., goal of the action) in more detail. For example, imagine that we are teaching the robot a novel word that means slice thinly. If the instructor sees that the robot's slices are too thick, an extra level of detail can be added to the robot's CoS representation for the verb, which states that the resulting slices should be no thicker than a certain thickness.

BIBLIOGRAPHY

[1] John Beavers and Andrew Koontz-Garboden. Manner and Result in the Roots of Verbal Meaning. Linguistic Inquiry, 43(3):331-369, 2012.
[2] Yu-Wei Chao, Zhan Wang, Rada Mihalcea, and Jia Deng. Mining semantic affordances of visual object categories. Journal of Computer Vision, 88(2):303-338, 2010.
[3] R. M. W. Dixon and A. Y. Aikhenvald. Adjective Classes: A Cross-linguistic Typology. Explorations in Language and Space C. OUP Oxford, 2006.
[4] Jane Gillette, Henry Gleitman, Lila Gleitman, and Anne Lederer. Human simulations of vocabulary learning. Cognition, 73(2):135-176, 1999.
[5] Shantanu Godbole and Sunita Sarawagi. Discriminative methods for multi-labeled classification. In Advances in Knowledge Discovery and Data Mining, volume LNCS 3056, pages 22-30. Springer, 2004.
[6] Christopher Kennedy and Louise McNally. Scale structure and the semantic typology of gradable predicates. Language, 81(2):345-381, 2005.
[7] Beth Levin. English verb classes and alternations: A preliminary investigation. University of Chicago Press, 1993.
[8] Beth Levin and Malka Rappaport Hovav. Lexicalized scales and verbs of scalar change. Presented at 46th Annual Meeting of the Chicago Linguistics Society, 2010.
[9] J M Mandler. How to build a baby: II. Conceptual primitives. Psychological Review, 99(4):587-604, 1992.
[10] George A. Miller. WordNet: A lexical database for English. Commun. ACM, 38(11):39-41, November 1995.
[11] Sanmit Narvekar and Kristen Grauman. Relative Attributes. In IEEE International Conference on Computer Vision, pages 503-510, 2011.
[12] James Pustejovsky. The generative lexicon. Computational Linguistics, 17(4):409-441, 1991.
[13] Malka Rappaport Hovav and Beth Levin. Reflections on manner/result complementarity. Lecture notes, 2008.
[14] Malka Rappaport Hovav and Beth Levin. Reflections on manner/result complementarity. In Lexical Semantics, Syntax, and Event Structure, pages 21-38. Oxford University Press, 2010.
[15] Jesse Read. Scalable Multi-label Classification. PhD thesis, University of Waikato, 2010.
[16] Jesse Read, Bernhard Pfahringer, G Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.
[17] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics (TACL), 1:25-36, 2013.
[18] Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. Coherent multi-sentence video description with variable level of detail. In GCPR, 2014.
[19] Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. Translating video content to natural language descriptions. In ICCV, 2013.
[20] Karin Kipper Schuler. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. PhD thesis, University of Pennsylvania, 2005.
[21] Anna Senina, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Sinkandar Amin, Mykhaylo Andrilika, Manfred Pinkal, and Bernt Schiele. Coherent multi-sentence video description with variable level of detail. arXiv:1403.6173, 2014.
[22] Aashish Sheshadri, Ian Endres, Derek Hoiem, and David Forsyth. Describing Objects by their Attributes. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[23] J Siskind. Grounding Lexical Semantics of Verbs in Visual Perception Using Force Dynamics and Event Logic. Journal of AI Research, 15:31-90, 2001.
[24] Jeffrey Mark Siskind. Grounding Language in Perception. Artificial Intelligence Review, 8:371-391, 1995.
[25] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1-13, 2007.
[26] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667-685. Springer, 2010.
[27] Josiah Wang, Katja Markert, and Mark Everingham. Learning Models for Object Recognition from Natural Language Descriptions. Proceedings of the British Machine Vision Conference 2009, pages 2.1-2.11, 2009.
[28] Ran Xu, Caiming Xiong, Wei Chen, and Jason J Corso. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In AAAI, 2015.