VERB SEMANTICS AS DENOTING CHANGE OF STATE IN THE PHYSICAL WORLD
By
Malcolm Doering
A THESIS
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Computer Science - Master of Science
2015
ABSTRACT
In the not too distant future we anticipate the existence of robots that will help around the house, in particular, the kitchen. Thus, it is critical that robots can understand the language commonly used within this domain. Therefore, in this work we explore the semantics of verbs that frequently occur when describing cooking activities. Motivated by linguistic theory on the lexical semantics of concrete action verbs and data collected via crowdsourcing, an ontology of the changes of states of the physical world as denoted by concrete action verbs is presented. Furthermore, additional datasets are collected for the purpose of validating the ontology, exploring the effect of context on verbal change of state semantics, and testing the automatic identification of changes of state denoted by verbs. In conclusion, several areas of further investigation are suggested.
ACKNOWLEDGEMENTS
Writing a master's thesis has been a much more difficult undertaking than originally anticipated. I would not have been successful without the help of several people. Foremost, I would like to thank my advisor, Dr. Joyce Y. Chai, for her guidance through the arduous process of formulating a thesis topic and along every step of the way to its completion. I would also like to thank Dr. Rui Fang for taking me under his wing during my time as a master's student, Shaohua Yang for his input regarding visual features and classification results, my thesis committee members Dr. Cristina Schmitt and Dr. Xiaoming Liu for their insightful questions and feedback, Zach Richardson for his annotation efforts, and all the remaining members of the Language and Interaction Research Laboratory who made my time at MSU enjoyable. Lastly, I would like to thank my family and friends for their continued moral support throughout.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction and Motivation
Chapter 2 Related Work
Chapter 3 Change of State in Verb Semantics
Chapter 4 A Pilot Study and Ontology of Change of State
    4.1 Crowdsource Setup
    4.2 Ontology Design and Annotation
    4.3 Analysis
        4.3.1 Types of CoS Descriptions
        4.3.2 Multiple CoS Labels per Verb
        4.3.3 The Role of Visual Context
        4.3.4 Effect of Direct Object on CoS
        4.3.5 Verb Semantic Similarity based on CoS
Chapter 5 Automated Identification of CoS
    5.1 The Cucumber Dataset
    5.2 Multilabel Classification
    5.3 Results
        5.3.1 Linguistic and Visual Features
        5.3.2 Comparison of Features on Cucumber Dataset
        5.3.3 Predicting the Attribute Described by NL Descriptions of CoS
Chapter 6 Complexity of Verb Semantics based on CoS
    6.1 Multilevel Dataset and Crowdsource Study
        6.1.1 Bread and Cucumber Multilevel Dataset
        6.1.2 24 Activities Multilevel Dataset
    6.2 Results of Human Studies
        6.2.1 Coverage of CoS Ontology
        6.2.2 Effect of Visual Context
        6.2.3 Level of Detail's Effect on CoS
        6.2.4 Level of Detail's Effect on Verb Frequency and Distribution
Chapter 7 Discussion and Conclusion
BIBLIOGRAPHY
LIST OF TABLES

Table 4.1: Verbs and objects used for data collection (pilot)
Table 4.2: Attributes and result values for change of state
Table 4.3: Variability between CoS frequencies per object and scene conditions (pilot dataset)
Table 4.4: Label cardinality per verb (pilot dataset)
Table 4.5: Jensen-Shannon divergence between CoS distributions for verb pairs (pilot dataset)
Table 5.1: Verb counts and label cardinality of top ten most frequent verbs (cucumber dataset)
Table 5.2: CoS attribute classification accuracy (cucumber dataset)
Table 5.3: CoS attribute classification accuracy using various feature sets (cucumber dataset)
Table 5.4: Per verb CoS attribute classification accuracy using various feature sets (cucumber dataset)
Table 5.5: CoS attribute classification accuracy for open ended change of state descriptions (pilot dataset)
Table 6.1: Coverage of the CoS frame options (bread and cucumber multilevel dataset)
Table 6.2: Coverage of the CoS frame options (24 activities multilevel dataset)
Table 6.3: Comparison between +/- scene conditions (bread and cucumber multilevel dataset)
Table 6.4: The entropy of the distributions of each verb over three levels (24 activities multilevel dataset)
LIST OF FIGURES

Figure 3.1: Event schema for verbs that denote externally caused state changes [14]
Figure 4.1: Example of CoS frame applied to a description of change of state
Figure 4.2: Percentage of samples describing 0 to 3 changes of state (pilot and cucumber datasets)
Figure 4.3: CoS distributions over attributes for clean and rinse (pilot dataset)
Figure 6.1: CoS distributions over attributes for verbs at three levels of detail (24 activities multilevel dataset)
Figure 6.2: Frequencies of occurrence of the top twenty verbs (24 activities multilevel dataset)
Chapter 1
Introduction and Motivation

In the future robots will work closely with humans in the home to aid in various domestic tasks. Foremost are tasks in the kitchen. For a robot to help out and learn new abilities in the kitchen domain it must be able to understand a human's instructions. This work focuses in particular on how a robot may represent concrete verbs in the kitchen domain.

Concrete action verbs are verbs that denote concrete activities performed by an agent in the world. These actions are visually perceivable events that can potentially be understood by computer vision algorithms. Furthermore, they can be categorized into two classes based on their semantics: Result Verbs and Manner Verbs [14, 13]. Manner verbs typically denote the Form or the Manner in which the action denoted by the verb is performed, whereas Result verbs (the focus of this work) denote the Change of State (CoS) that the object of the verb undergoes as a result of the action denoted by the verb. In order to ground these verbs to the environment, the robot must have a rich representation of the changes of state associated with the verb. Existing verb resources such as VerbNet [20] do not contain this rich information. In VerbNet, although its semantic representation for various verbs may indicate that a change of state is involved, it does not always provide the specifics associated with the verb's meaning. For example, the change will occur to some attribute of the verb's direct object such as color, number of pieces, speed, etc.
This work presents an ontology of different types of CoS to fill out the VerbNet representation. Specifically, this work is focused on verbs in the cooking domain, with the forethought that this method can be generalized to other domains. We carried out a series of data collection experiments via crowdsourcing designed to elicit natural language descriptions of changes of state as denoted by various verbs. These descriptions provided insight into how mid-level visual features such as state can be realized linguistically at the surface level and guided the design of the CoS ontology which categorizes different types of change of state.
This data allows us to ask some questions: How does the object of a verb affect the meaning of the verb? I.e., does the verb denote different changes of state depending on the type of direct object? Furthermore, how does the presence or absence of a scene in combination with the verbal description affect the CoS which the viewer attributes to the verb? Do ambiguities in the CoS for a single sense of a verb indicate sub-categorizations of verb senses? Answers to these questions are important for a robot whose understanding of a verb is based on its context of use. By determining on-line the CoS indicated by a verb, the robot can focus its sensing resources on the indicated CoS (active sensing).

In the end, I hope that this will be a useful resource for Situated NLP researchers.
Chapter 2
Related Work

The related work can be divided into sections including theoretical linguistics work on the lexical semantics of verbs and computer vision work on recognizing visual attributes of objects and understanding events in videos.

Previous work in theoretical linguistics has defined the types of concrete action verbs we are interested in: mainly Result verbs, which indicate their object's change of state [14, 13]. Kennedy provides a more detailed analysis of gradable predicates in terms of scale structure, which is applicable to Result verbs [6]. Lastly, [20] presents a digital dictionary of verbs (VerbNet) and their categorizations from [7]. Our work supplements the semantic specifications of some of the verbs contained in VerbNet by categorizing the types of changes of state in the world which the verbs may denote. More information on verb semantics is contained in Chapter 3.
In traditional verb semantics the meaning of a verb may be ambiguous. There is a static number of senses (denoting subtleties of meaning) for a verb, one of which must be selected based on the verb's context of use. Alternatively, Generative Lexicon (GL) argues against the traditional number of senses for a word, as they may not apply to novel uses [12]. GL proposes that the meanings of verbs are dependent not only on the lexical specification of the verb itself, but through interactions with the complex semantic representations of its arguments. That is, the properties of nouns affect the meaning of the verb. Thus, "the semantic load is spread evenly across the lexicon". The current work provides examples of how a verb's meaning may depend on its patient argument, i.e. examples of sub-categorizations of verb senses. Also, we show how visual context affects the human's interpretation of the verb sense in terms of change of state.
Gillette et al. carried out studies of vocabulary learning using the Human Simulation Paradigm [4]. College students were tasked with identifying verbs when presented with different information including the nouns that appeared in the sentence with the verb, the syntactic information of the sentence containing the verb, and extra-linguistic information (i.e., a video which the verb describes). The verb itself was not presented. Experimenters found that verbs with a high degree of 'imageability' or 'concreteness' were more easily identified from the extra-linguistic context, whereas syntax was a more useful cue for abstract verbs. This suggests that the visual context may be important to identifying the changes of state in the world which these concrete verbs denote. In our work we focus primarily on verbs from the cooking domain with a high degree of imageability (e.g., cut, rinse, etc.).
Visual attributes are high level visual features of (objects in) scenes which have corresponding natural language descriptions. For example, 'green' refers to a specific range of RGB values. Visual attributes are semantically meaningful, discriminative, and generalizable across different object types and can be used for object recognition [11, 22, 27]. Attributes roughly correspond to adjectives (describing states of objects, such as colors and properties), but can also describe other properties not named by adjectives (e.g. 'has wing' for birds); therefore, the semantics of a CoS verb may be grounded in changes of an object's visual attributes.
Chao et al. jointly model action and object categories made up of the synsets from WordNet [2, 10]. Specifically, they modeled the affordances of the object categories, which indicate the functions of objects, in order to improve action recognition. Whereas Chao et al. model the interactions between actions and objects, in this work we are interested in the interactions between verbs and direct objects (often nouns) and how the object affects the meaning of the verb. A verb/noun is different from an action/object because different senses of the verb/noun may appear in separate action/object categories.
Siskind et al. demonstrate how an action in a video can be recognized automatically by using principles and constraints of human perception [24, 23]. Actions are represented via temporal predicate logic. Furthermore, they demonstrate that actions may be defined in terms of relations (e.g., support, contact, and attachment) between the entities rather than low level kinematic representations (e.g., joint angles and velocities). Beyond relations, our work explores changes of state, another high level feature integral to action representation/verb meaning.

Various methods for generating natural language descriptions of images and videos are presented by [9, 19, 28].
Chapter 3
Change of State in Verb Semantics

Lexical semantics is important to designing methods for robots to learn verbs because it indicates what must be learned as part of the verb representation. Verbs can be divided into two broad categories: stative verbs that denote states (such as know, depend, loathe) and action verbs which denote actions (such as run, throw, cook). In this work we are primarily interested in the latter.
A concrete action verb is one that, in combination with its arguments and modifiers, denotes an action in the world (as opposed to denoting a state or an abstract action not visible in the world). Rappaport Hovav and Levin [14, 13] further divide the types of action verbs into Manner verbs, which "specify as part of their meaning a manner of carrying out an action", and Result verbs, which "specify the coming about of a result state". For example,

• Manner verbs: nibble, rub, scribble, sweep, laugh, run, swim...
• Result verbs: clean, cover, empty, freeze, kill, melt, open, arrive, die, enter, faint...

In this work we focus specifically on result verbs, i.e. verbs of Change of State (CoS). A set of "canonical realization rules" specify how a particular change of state is incorporated into a verb's semantics. Semantics are determined based on the combination of a "root", which is particular to the verb (e.g., a result-state), and an "event schema" template as shown in Figure 3.1.

Figure 3.1: Event schema for verbs that denote externally caused state changes [14]
Previous work has further classified result verbs into three categories: Change of State verbs, which denote a change of state to a property of the verb's object (e.g. 'to warm'), Inherently Directed Motion verbs, which denote movement along a path in relation to a landmark object (e.g. 'to arrive'), and Incremental Theme verbs, which denote the incremental change of volume or area of the object (e.g. 'to eat') [8]. In our work we propose a specific set of result-states that may be used to define the semantics of most concrete action verbs in the kitchen domain. Note that we use the term Change of State in a more general way throughout this paper such that the location and volume or area of an object are part of its state.
Rappaport Hovav and Levin also claim that some verbs that denote change of state events lexically specify a scale [14, 13]. I.e., they are verbs of scalar change. A scale is "a set of points on a particular dimension (e.g. height, temperature, cost)". In the case of verbs, the dimension is an attribute of the object of the verb. For example, "John cooled the coffee" means that the temperature attribute of the object coffee has decreased. Kennedy and McNally give a very detailed description of scale structure and its variations [6]. Our work simplifies scale structure by representing change of state as a frame consisting of an object, one of its attributes, and the resulting value of the attribute.
Verbs of scalar change can be further split into two categories: verbs with two points on the scale and verbs with multiple points on the scale [14]. Verbs on a two point scale are verbs of achievement, where the change of state is "conceptualized as instantaneous" (pg. 30). Examples of these verbs are 'crack' and 'arrive'. Verbs on a multiple point scale are associated with changes in attributes that can have multiple values. Examples of these verbs are 'advance', 'descend', 'fall', 'recede', 'rise', 'warm', 'cool', etc. These are called verbs of gradual change or degree achievement. In this work these subtle differences are not taken into account, but they may be worth including in future work.
Manner verbs are verbs of non-scalar change. They are different from verbs of scalar change because they are more complex: they involve change in more than a single attribute or do not specify change in a single direction on a scale [14]. They may involve a combination of multiple changes. Some examples are 'walk' and 'jog', which specify a sequence of changes on several attributes. Additionally, verbs of non-scalar change do not always have to be specific about what changes are involved. E.g. 'exercise' may denote any of several varieties of physical (and sometimes mental) activity (pg. 33). We will not focus on manner verbs in this work.
Beavers and Koontz-Garboden introduce a set of diagnostics for judging whether a verb is a manner verb or result verb [1]. They also argue against Rappaport Hovav and Levin's claim that a verb cannot lexicalize both manner and result [14, 13]. Indeed, from our own observations it would seem that a verb can have both change of state and manner components (e.g. 'chop').
Levin categorizes verbs based on their syntax, i.e. based on which alternations (argument structure) a verb can take [7]. This categorization scheme resulted in classes of verbs with semantic similarities. VerbNet is a digital resource based on Levin's verb classes [20]. For each verb, it provides the possible argument structures and a logical representation of their semantics. This has been a useful resource for NLP researchers in the past. But, in the case of result and manner verbs that denote physically observable events, it does not provide enough detail to enable a robot to ground the verb in its perceptions. For example, the VerbNet semantic representation of 'cut' specifies only that there is some 'degradation of the material integrity' of the object as a result of the action.
Chapter 4
A Pilot Study and Ontology of Change of State

4.1 Crowdsource Setup

In order to build an ontology of the types of changes of state that verbs in the cooking domain may denote, we conducted a pilot study. Verb-object pairs were presented to turkers via Amazon Mechanical Turk (AMT) and turkers were asked to describe the changes of state that occur to the object as a result of the verb. Then the turkers' open-ended descriptions were analyzed and categorized.
Verbs and objects from the TACoS corpus were chosen for this crowdsourcing study. The TACoS corpus [17] is a collection of natural language descriptions of the actions that occur in a set of cooking videos. I.e., it contains 18227 sentences collected via AMT that describe various cooking events (preparing a cucumber, scrambling eggs, etc.). This is an ideal corpus to explore the types of changes of state and manners that verbs denote since it contains mainly descriptions of concrete actions. Moreover, possibly because most actions in the cooking domain are goal-directed, a majority of the verbs in the descriptions denote results of action (changes of state).
The ten verbs (shown in Table 4.1) were chosen based on the criteria that they take an agent argument and that they occur relatively frequently in the corpus and with a variety of different direct objects. Furthermore, they must be concrete, meaning that they denote some observable event in the world. Verbs of this type are the most relevant for a kitchen robot. Lastly, five of the verbs were chosen because they only denote a change of state, and the other five were chosen because they denote some manner of action (possibly in addition to change of state).

Verb    Object 1          Object 2                   Object 3
clean   cutting board     dishes                     counter
rinse   cutting board     dishes                     ginger
wipe    counter           knife                      hands
cut     cucumber          beans                      leek
chop    cucumber          beans                      leek
mix     eggs              leeks, salt, and pepper    ingredients
stir    eggs              leeks, salt, and pepper    ingredients
add     water             eggs                       leeks
open    bread packaging   drawer                     pomegranate
shake   spices            bowl                       broccoli

Table 4.1: Verbs and objects used for data collection (pilot)
To examine how CoS depends on the context, we paired verbs with different objects (3 objects per verb, shown in Table 4.1) and presented the verbs to turkers with and without a video of the action described by the verb (+/- scene). Objects were chosen based on the criteria that they are dissimilar to each other, since we hypothesize that the change of state indicated by the verb will differ depending on the object's features. For example, broccoli and bowl were chosen as objects for the verb shake because one is a vegetable and the other a kitchen utensil, having very different features. Thus, there are 10 × 3 × 2 = 60 conditions. For each condition we collected 30 turker responses.

In addition to turkers' responses about what changes of state the verbs indicated, we also collected responses about the manner of the action.
4.2 Ontology Design and Annotation

Based on the types of descriptions the turkers provided we developed an ontology of change of state, which is shown in Table 4.2. This section will explain this ontology in detail.

18 attributes of state changes were identified from the change of state descriptions provided by the turkers. Because we are interested in lower level state changes, which can easily be sensed with a camera and computer vision algorithm, we did not include higher level attributes in the categorization such as Cleanliness, which may be identified in terms of lower level attributes but are more difficult to sense automatically. Note, however, that some of these attributes are at different levels in terms of visual perception, e.g. Shape vs. Wetness. Wetness must be visually identified in terms of some lower level features such as color, whereas shape is directly perceivable. Some examples of descriptions containing these attributes are shown below. It is also worth knowing that some descriptions actually describe multiple changes of state as shown in Figure 4.1.
Given that both adjectives and CoS verbs have their semantics defined in terms of a scale structure (for gradable verbs and adjectives), some of the above attributes are motivated by the semantic types of adjectives from Dixon and Aikhenvald's categorization [3]. These adjective categories include Dimension, Color, Physical Property, Quantification, and Position.
Although the instructions given to the turkers specifically requested a description of the change of state undergone by the object, several responses contained descriptions of changes of other objects. Often, a part of the direct object was described, rather than the whole object. And sometimes a completely different object, one that was still associated with the action, was described. Thus, CoS descriptions can be categorized as describing a change to the DirectObject, PartOfObject, or AssociatedObject. Some examples are shown below.

DirectObject: Cut-cucumber: "The size of the cucumber changes"
PartOfObject: Wipe-knife: "The knife gets cleaner. More metal is showing"
AssociatedObject: Clean-dishes: "Debris and residue fall away from the dishes"

Figure 4.1: Example of CoS frame applied to a description of change of state
In addition to the attribute of change and the object undergoing change, the turkers' descriptions often contained a third important aspect of a change of state: the result value, i.e. the value of the attribute after it changes. These values can be categorized in several different ways depending on the attribute, but generally there are two polar values. For example, the SizeLengthVolumeThickness, Wetness, NumberOfPieces, etc. attributes may Increase or Decrease in value. On the other hand, not all result values can be categorized in this way. For example, the Shape attribute is usually described simply as having changed in some vague way, or to have undergone a specific change. Thus, Specific and Change are two more general result values.
These three aspects of a change of state, the Attribute, Object, and result Value, make up the CoS frame which can be used to label a verb-object pair, as shown in Figure 4.1 for a sentence which describes three changes of state. Thus, the CoS ontology presented here consists of a CoS frame and the options used to fill the frame slots.
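
The three-slot frame can be pictured as a small record type. The following is only a minimal sketch under stated assumptions: the class name `CoSFrame`, the helper `label_pair`, and the example slot values are hypothetical and chosen for illustration; only the three slot roles (Attribute, Object, result Value) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoSFrame:
    """One change of state: which attribute changes, on what, and to what value."""
    attribute: str  # e.g. "NumberOfPieces", "Wetness", "Shape"
    obj: str        # "DirectObject", "PartOfObject", or "AssociatedObject"
    value: str      # e.g. "Increase", "Decrease", "Change", "Specific"


def label_pair(verb, direct_object, frames):
    """Attach a collection of CoS frames to a verb-object pair (hypothetical helper)."""
    return {"verb": verb, "object": direct_object, "frames": tuple(frames)}


# Illustrative annotation, not taken from the collected data.
example = label_pair("cut", "cucumber", [
    CoSFrame("NumberOfPieces", "DirectObject", "Increase"),
    CoSFrame("SizeLengthVolumeThickness", "DirectObject", "Decrease"),
])
```

Making the frame immutable (`frozen=True`) lets frames be hashed, so they can be counted or collected into sets when building per-verb distributions.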
Type of CoS         Attribute                          Attribute Result Value
Dimension           Size, length, volume, thickness    Changes, increases, decreases, specific
                    Shape                              Changes, specific (cylindrical, etc.)
Color/Texture       Color                              Appear, disappear, changes, mix, separate, specific (becomes green, red, etc.)
                    Texture                            Changes, specific (slippery, frothy, bubbly, soft, etc.)
Physical Property   Weight                             Increase, decrease
                    Flavor, smell                      Changes, intensifies, specific
                    Solidity                           Specific (paste, soggy, etc.)
                    Wetness                            Becomes wet(ter), dry(er)
                    Visibility                         Appears, disappears
                    Temperature                        Increases, decreases
                    Containment                        Becomes emptied, hollow
                    Surface Integrity                  A hole or opening appears
Quantification      Number of pieces                   Increases, one becomes many, decreases, many becomes one
Position            Location                           Changes, enter/exit container, specific
                    Occlusion                          Becomes covered, uncovered
                    Attachment                         Becomes detached
                    Presence                           No longer present, becomes present

Table 4.2: Attributes and result values for change of state

4.3 Analysis

4.3.1 Types of CoS Descriptions

The data shows that generally turkers described changes of state in one of three ways. (1) They describe the attribute directly, e.g. cut-cucumber: "The size of the cucumber changes". (2) They describe the change of state with a resultative phrase, e.g. cut-cucumber: "The cucumber is cut into small pieces." Here, the CoS is indicated by the semantics of small pieces. And, (3) the CoS can be described with another verb that denotes the same or similar CoS as the verb presented to the turker, e.g. stir-ingredients: "The ingredients are mixed together".
4.3.2 Multiple CoS Labels per Verb

A turker's change of state description does not necessarily contain only a single change of state. In fact, the descriptions contained between 0 and 3 changes of state, as seen in Figure 4.2. The largest share of descriptions (43%) contained only a single change of state. Also, a large percentage (36%) contained no change of state. In actuality, some of the descriptions that were annotated as containing no change of state described changes of state with high level attributes which do not fit into our categories (e.g. Cleanliness). Others contained verbal descriptions of CoS (e.g. Stir-ingredients: "The ingredients are mixed together."). We did not annotate descriptions which contained these circular definitions.
For each verb, we calculated the distribution of CoS annotations over each attribute. Figure 4.3 shows the CoS attribute distributions for two verbs, clean and rinse. The CoS distributions are closely related to the semantics of the verbs they label. For example, CoS labels with the attributes Wetness and PresenceOfObject (referring to dirt that is removed) are more frequent for the verbs clean and rinse than CoS labels with other attributes. This is because the semantics of these verbs indicate some object is cleaned away, possibly with water. Notice that clean, the result verb, has a much lower frequency of the Wetness attribute (which is related more to the manner of cleaning) and a higher frequency of PresenceOfObject (which is related to the intended result). On the other hand, the manner verb rinse has these distributions the other way around.

Figure 4.2: Percentage of samples describing 0 to 3 changes of state (pilot and cucumber datasets)

Figure 4.3: CoS distributions over attributes for clean and rinse (pilot dataset)
Another observation regarding the CoS distributions in the pilot dataset is that not all the descriptions describe the same attributes. For example, for clean, most of the descriptions describe Wetness and PresenceOfObject, but there is also some distribution over the attributes Texture, OcclusionBySecondObject, Color, etc. One reason this may happen is that when a verb-object pair is presented to a turker without an accompanying scene, the turker may rely more on their imagination when describing the change of state, whereas when the scene is shown, they can see the change directly. For example, if the verb-object pair is shake broccoli but the turker does not see that the broccoli is covered with water, they will not describe the water droplets that come off as it is shaken. Moreover, even in conditions when the scene is shown, the turkers may describe the same change of state in different ways resulting in different CoS annotations. For example, clean dishes: "Food is removed from the dishes" describes the PresenceOfObject attribute of the AssociatedObject, while clean dishes: "Dish surface is cleared of debris and/or muck." describes the OcclusionBySecondObject attribute of the DirectObject.
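4.3.3 The Role of Visual Context

To determine how the presence of a scene or how the object of a verb affects the types of CoS turkers describe in their responses, we define a Jensen-Shannon divergence (JSD) based metric, variability. The JSD of two distributions P and Q is given by the formula below.

\[ \mathrm{JSD}(P \parallel Q) = \frac{1}{2} D(P \parallel M) + \frac{1}{2} D(Q \parallel M) \tag{4.1} \]

where

\[ M = \frac{1}{2}(P + Q) \tag{4.2} \]

D is the Kullback-Leibler divergence, a non-symmetric measure of the difference between two distributions.

\[ D(P \parallel Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)} \tag{4.3} \]

The advantage of using JSD is that it is a symmetric measure. It quantifies how similar two distributions are, equaling 0 when the distributions are the same and growing as they become more different. With the natural logarithm of Equation 4.3 its maximum is ln 2 (about 0.69); with base-2 logarithms the maximum is 1.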
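
Equations 4.1 to 4.3 translate directly into a few lines of code. The sketch below is an illustration rather than the implementation used in this work; the function names are hypothetical, distributions are assumed to be equal-length lists of probabilities that each sum to 1, and terms with P(i) = 0 are taken to contribute 0 by the usual convention.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) with the natural log (Equation 4.3).

    Terms where P(i) == 0 are skipped (they contribute 0 by convention).
    Assumes Q(i) > 0 wherever P(i) > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence (Equation 4.1) using the mixture M of Equation 4.2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Because M(i) is positive whenever P(i) or Q(i) is positive, `jsd` never divides by zero even when the two distributions have disjoint support, which is the case where it reaches its maximum of ln 2.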
Verb    +/- Scene   3 Objects   Objects +scene   Objects -scene
clean   0.03        0.04        0.08             0.05
rinse   0.01        0.05        0.09             0.07
wipe    0.02        0.14        0.15             0.19
cut     0.01        0.02        0.01             0.05
chop    0.02        0.03        0.03             0.04
mix     0.05        0.13        0.15             0.17
stir    0.09        0.21        0.24             0.25
add     0.12        0.22        0.19             0.33
open    0.09        0.32        0.34             0.41
shake   0.18        0.42        0.42             0.43

Table 4.3: Variability between CoS frequencies per object and scene conditions (pilot dataset)
The variability describes how the CoS distribution of a verb differs depending on a certain variable (+/- scene or object). We compute the variability by averaging the JSD over all pairs of CoS distributions of the verb, where the distributions of each pair are taken over different values of the variable. For example, the variability of the verb shake over the scene conditions is found by dividing the JSD of the CoS distributions in the +scene and -scene conditions by 1 (the number of pairs of conditions). Moreover, the variability over the object conditions is found by summing the JSD of the CoS distributions for each pair of object conditions (three unique pairs), and dividing by 3. The general variability formula is shown in Equation 4.4.

\[ \mathrm{variability} = \frac{\sum_{\text{distr pairs } (i,j)} \mathrm{JSD}(d_i, d_j)}{\text{num pairs}} \tag{4.4} \]
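
As a rough illustration of Equation 4.4, the sketch below averages the JSD over all unordered pairs of distributions. It is not the code used to produce Table 4.3; the function names are hypothetical and the three per-object distributions are made-up numbers that only show the calling convention.

```python
import math
from itertools import combinations


def jsd(p, q):
    """Jensen-Shannon divergence with the natural log (Equations 4.1 to 4.3)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a(i) == 0 contribute 0 by convention.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def variability(distributions):
    """Mean JSD over all unordered pairs of distributions (Equation 4.4)."""
    pairs = list(combinations(distributions, 2))
    return sum(jsd(p, q) for p, q in pairs) / len(pairs)


# Hypothetical CoS attribute distributions for one verb under three object conditions.
shake_by_object = {
    "spices": [0.1, 0.7, 0.2],
    "bowl": [0.2, 0.6, 0.2],
    "broccoli": [0.6, 0.1, 0.3],
}
object_variability = variability(list(shake_by_object.values()))
```

With two conditions (+scene and -scene) there is a single pair, so the variability reduces to the JSD of that pair; with three object conditions it is the mean over three pairs.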
The variabilities between various conditions for each verb in the pilot dataset are shown in Table 4.3. The variability metric shows that there is indeed some difference between the CoS distributions in the +scene and -scene conditions. Moreover, the variability is much higher for some verbs (shake 0.18, add 0.12) than others (cut 0.01, rinse 0.01). This may be because for verbs like shake and add, without the accompanying scene it is not clear how the state of the object will change.
Verb    Label card.   +Scene   -Scene
add     0.64          0.78     0.52
chop    1.46          1.44     1.48
clean   0.88          0.99     0.78
cut     1.42          1.41     1.43
mix     0.71          0.68     0.76
open    0.82          0.91     0.73
rinse   0.88          0.89     0.87
shake   0.52          0.53     0.53
stir    0.52          0.61     0.47
wipe    0.65          0.78     0.53

Table 4.4: Label cardinality per verb (pilot dataset)
Table 4.4 shows the average number of CoS labels for each of the turkers' descriptions. For most verbs (6 of 10), more changes of state are described when the scene is shown, indicating that the scene presents more information about CoS to the turker for these verbs. Taken together this data shows that visual context is important to determine the change of state denoted by a verb.
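
Label cardinality, as reported in Table 4.4, is simply the mean number of labels per sample. The short sketch below assumes each description's annotation is held as a set of attribute names; the function name and the four example annotations are hypothetical.

```python
def label_cardinality(label_sets):
    """Average number of labels per sample; an empty set counts as zero labels."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)


# Four made-up annotations for one verb: one description received no CoS label.
annotations = [
    {"Wetness"},
    set(),
    {"Wetness", "PresenceOfObject"},
    {"Location"},
]
cardinality = label_cardinality(annotations)  # (1 + 0 + 2 + 1) / 4 = 1.0
```

Computing this separately over the +scene and -scene subsets of a verb's descriptions gives the last two columns of the table.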
4.3.4 Effect of Direct Object on CoS

Table 4.3 also shows the variability for each verb in the pilot dataset computed over the three object conditions. The variability among objects is different for each verb, showing that for some verbs in this kitchen domain the CoS depends more on the object of the verb than for others. The verb with the highest variability is again shake (0.42). The state change resulting from shaking a (wet) piece of broccoli is very different from that of shaking a container of spices over food, or a bowl with eggs. This shows that even though the verb sense is the same in all these descriptions, the CoS indicated by the verb may depend on the object of the verb.
4.3.5 Verb Semantic Similarity based on CoS

To compare the CoS distributions between each pair of verbs in the pilot dataset we computed the Jensen-Shannon divergence of each pair. Table 4.5 shows that the distributions for verbs from the pilot data are very similar for verbs with similar semantics (e.g. JSD(cut, chop) = 0.01 and JSD(mix, stir) = 0.03 vs. JSD(cut, shake) = 0.59 and JSD(rinse, chop) = 0.68). This shows that the CoS frame is capturing relevant semantic information.

Verb pair     JSD     Verb pair     JSD     Verb pair     JSD
cut-chop      0.01    mix-shake     0.36    rinse-stir    0.45
mix-stir      0.03    chop-stir     0.37    rinse-mix     0.45
rinse-wipe    0.04    cut-stir      0.39    chop-add      0.47
clean-rinse   0.05    wipe-add      0.39    cut-add       0.48
clean-wipe    0.06    clean-shake   0.39    wipe-open     0.51
add-shake     0.11    chop-open     0.40    clean-open    0.53
stir-add      0.20    chop-mix      0.42    rinse-open    0.57
add-open      0.23    rinse-add     0.42    chop-shake    0.59
mix-add       0.27    wipe-stir     0.42    cut-shake     0.59
open-shake    0.30    cut-open      0.43    wipe-chop     0.67
wipe-shake    0.31    cut-mix       0.43    clean-chop    0.67
stir-shake    0.32    clean-mix     0.43    wipe-cut      0.68
stir-open     0.32    clean-stir    0.43    clean-cut     0.68
rinse-shake   0.33    wipe-mix      0.44    rinse-cut     0.68
mix-open      0.34    clean-add     0.45    rinse-chop    0.68

Table 4.5: Jensen-Shannon divergence between CoS distributions for verb pairs (pilot dataset)
Chapter 5
Automated Identification of CoS

5.1 The Cucumber Dataset

To discover which features are important for predicting CoS labels we collected a second dataset by automatically identifying 553 verbs from the TACoS corpus and manually annotating them with CoS frame labels. In particular, we collected all the verbs within sentences describing the cucumber preparation activity (slicing a cucumber and placing it on a plate). A student with no previous knowledge of the project was recruited and trained to label each verb with up to three CoS frames. They were shown the verb, its patient (both automatically identified), and the original sentence (e.g., "The person chops the cucumber into slices on the cutting board"). Then they were tasked with annotating the change of state that occurred to the patient as a result of the verb by choosing options from the CoS ontology to fill up to three CoS frames. If the change of state was not clear from this information, the student could view the videos.
Figure 4.2 shows the percentage of the TACoS sentences that received 0 to 3 CoS labels. Compared to the crowdsourced data almost all of these verbs received at least one CoS annotation. This is because in the pilot study some of the turkers' responses did not describe changes of state and some of their descriptions did not fit into our CoS categories.
Table 5.1 shows the number of tokens of the top ten most frequent verbs from the TACoS cucumber video descriptions, their average number of CoS labels, and the number of different objects (e.g. 'cucumber', 'bowl', etc.) they occurred with. The average number of object types that each verb takes is 12.46.

Verb     Count   Label card.   Num obj.
get      106     1.44          17
take     88      1.51          15
wash     58      1.10          9
cut      56      1.09          14
rinse    50      1.36          9
slice    30      1.07          17
place    30      1.07          16
peel     24      2.67          10
put      21      1.00          8
remove   16      2.31          10

Table 5.1: Verb counts and label cardinality of top ten most frequent verbs (cucumber dataset)
5.2 Multilabel Classification

To explore the effect of different feature sets on CoS prediction, we formulated the problem as a multilabel classification problem. The goal of a typical classification problem (i.e. supervised learning with discrete labels) is to learn a classifier that can predict the correct label for a sample given a vector of feature values that represent the sample. A set of N samples $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, where $x_i$ is the vector of feature values for sample $i$ and $y_i$ is sample $i$'s label, is used to train and test a model. In a multilabel classification problem, $y_i$ is a set of labels rather than a single label [25, 15].
In this particular problem a sample $i$ consists of a single verb-object pair and its CoS frame annotations. The feature vector $x_i$ (e.g., linguistic features of the sentence containing the verb or visual features of the video it describes) is extracted from the sample. The label set $y_i$ consists of the Attributes of the annotations (for example, NumberOfPieces, Wetness, etc.). Note that although we collected Object and resultValue information for each verb, we did not include it as part of the prediction problem.
We explored two methods of multilabel classification: Binary Relevance (BR) and Label Combination (LC). Binary relevance works by training 18 separate classifiers (one for each attribute) and then applying them independently to predict the labels for each sample. Thus, correlations between attributes are not taken into account with this method. In contrast, the label combination method takes into account correlations between the labels by treating each unique label set as a single label/class and applying traditional classification techniques.
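The difference between the two problem transformations can be sketched in a few lines. This is an illustrative sketch only, not the code used in the experiments: the function names are hypothetical, only three attributes are listed instead of 18, and "Location" is an assumed attribute name (NumberOfPieces and Wetness are taken from the text).

```python
# Hypothetical attribute inventory; the thesis uses 18 attributes.
ATTRIBUTES = ["NumberOfPieces", "Wetness", "Location"]

def binary_relevance_targets(label_sets, attributes):
    """Binary relevance: one independent 0/1 target vector per attribute."""
    return {a: [1 if a in ys else 0 for ys in label_sets] for a in attributes}

def label_combination_targets(label_sets):
    """Label combination: each unique label set becomes one class id."""
    class_ids = {}
    targets = []
    for ys in label_sets:
        key = frozenset(ys)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # assign ids in order of first appearance
        targets.append(class_ids[key])
    return targets, class_ids

label_sets = [
    {"NumberOfPieces"},
    {"Wetness"},
    {"NumberOfPieces", "Location"},
    {"NumberOfPieces"},
]

br_targets = binary_relevance_targets(label_sets, ATTRIBUTES)
lc_targets, lc_classes = label_combination_targets(label_sets)
print(br_targets)   # three separate binary problems
print(lc_targets)   # one multiclass problem over unique label sets
```

Under binary relevance any single-label classifier is trained once per attribute; under label combination one multiclass classifier is trained on the class ids, so it can only ever predict label sets that were seen in training.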
Because multilabel classification is a significantly different problem from traditional classification, special metrics are used for evaluation. Three common metrics for multilabel classification are Exact Match [26, 15], Hamming Score [5, 26, 15], and Jaccard Index [26, 15, 16]. Exact match is an example-based metric of accuracy, computed as the number of predicted label sets that are entirely correct divided by the number of samples. This is a strict metric because "a single false positive or false negative label makes the [sample] incorrect" [15]. Hamming score is a label-based metric of accuracy. It is computed by averaging the accuracy for each label calculated independently. This is considered a lax metric because it "tends to be overly lenient due to the typical sparsity of labels in multi-label data," rewarding highly conservative prediction [15]. Lastly, Jaccard index provides an accuracy metric between exact match and Hamming score in terms of strictness. It is a "ratio of the size of the union and intersection of the predicted and actual labels sets, for each example and averaged over the number of examples" [15]; that is, for each example the size of the intersection is divided by the size of the union.
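The three metrics can be written directly from these definitions. The sketch below is a plain-Python illustration with hypothetical label names; scoring an example whose true and predicted sets are both empty as 1.0 for the Jaccard index is an assumed convention, not something specified in the text.

```python
def exact_match(true_sets, pred_sets):
    """Fraction of samples whose predicted label set equals the true set."""
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)

def hamming_score(true_sets, pred_sets, labels):
    """Accuracy computed independently per label, then averaged over labels."""
    n = len(true_sets)
    per_label = []
    for lab in labels:
        correct = sum((lab in t) == (lab in p) for t, p in zip(true_sets, pred_sets))
        per_label.append(correct / n)
    return sum(per_label) / len(per_label)

def jaccard_index(true_sets, pred_sets):
    """Per-sample |intersection| / |union|, averaged over samples."""
    scores = []
    for t, p in zip(true_sets, pred_sets):
        union = t | p
        # Assumed convention: two empty sets count as a perfect match.
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

labels = ["A", "B", "C", "D"]           # hypothetical label names
true_sets = [{"A"}, {"B", "C"}, {"D"}]
pred_sets = [{"A"}, {"B"}, set()]

em = exact_match(true_sets, pred_sets)             # 1 of 3 sets fully correct
hs = hamming_score(true_sets, pred_sets, labels)   # 10 of 12 label decisions correct
ji = jaccard_index(true_sets, pred_sets)           # (1 + 1/2 + 0) / 3
print(em, ji, hs)
```

On this toy input the scores come out ordered as exact match < Jaccard index < Hamming score, which mirrors the strictness ordering described above.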
5.3 Results
5.3.1 Linguistic and Visual Features
The first linguistic feature set, bag of words (BoW), is extracted from the sentence containing the verb-object pair. In the case of data from the pilot study, BoW features were extracted from the open-ended descriptions of CoS. This feature set contains the lemma of each word in the sentence concatenated to its part of speech (e.g., 'cut-verb', 'cucumber-noun', etc.). Thus some syntactic information is contained in the BoW feature set. The set is then converted into a vector of 1s and 0s, representing whether or not each PoS-lemma combination is present in the sentence. This procedure yielded 264 features for samples from the cucumber dataset and 1159 features for samples from the pilot dataset.
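The construction of the binary BoW vector can be sketched as follows. This is a minimal illustration that assumes the sentences have already been lemmatized and PoS-tagged by some external tool; the tag names and helper functions are hypothetical.

```python
def build_vocabulary(tagged_sentences):
    """Index every lemma-PoS combination seen in the data (sorted for stability)."""
    features = sorted({f"{lemma}-{pos}" for sent in tagged_sentences for lemma, pos in sent})
    return {feat: idx for idx, feat in enumerate(features)}

def bow_vector(tagged_sentence, vocab):
    """Binary vector: 1 if the lemma-PoS feature occurs in the sentence, else 0."""
    vec = [0] * len(vocab)
    for lemma, pos in tagged_sentence:
        idx = vocab.get(f"{lemma}-{pos}")
        if idx is not None:  # features unseen when building the vocabulary are ignored
            vec[idx] = 1
    return vec

# Pre-tagged toy sentences as (lemma, PoS) pairs; tags are illustrative.
sentences = [
    [("the", "det"), ("person", "noun"), ("cut", "verb"), ("the", "det"), ("cucumber", "noun")],
    [("the", "det"), ("person", "noun"), ("rinse", "verb"), ("the", "det"), ("cucumber", "noun")],
]

vocab = build_vocabulary(sentences)
vectors = [bow_vector(s, vocab) for s in sentences]
print(vocab)
print(vectors)
```

The vector length equals the number of distinct lemma-PoS combinations in the data, which is how the feature counts reported above (264 and 1159) arise.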
The verb+object feature set (VO) is the second linguistic feature set. It consists of a binary representation of the lemmatized verb and object of each sample's verb-object pair.
The third feature set is visual rather than linguistic. Before we extract the visual features, we make the assumption that the ground truth correspondence of the verb and object in the video is known. The visual features are extracted from the video clip described by the verb and include the following.
1. Difference in area of the object at the beginning and end of the video clip
2. Distance between the start and end location of the object
3. Difference in color (Euclidean distance) of the object between the start and end of the video clip
4. Difference in texture (Euclidean distance between HoG features) of the object at the start and end of the video clip
5. Difference in the object's moments of inertia at the start and end of the video clip (this may indicate a change of orientation)
6. Whether or not the object was occluded at the beginning and end of the video clip
7. Whether or not the object was present in the scene at the beginning and end of the video clip
5.3.2 Comparison of Features on Cucumber Dataset
Three different feature sets and their combinations were tried. The goal is to evaluate how important different linguistic and visual features are to the identification of change of state.

                 Jaccard index     Hamming score     Exact match
DT+BR, BoW       0.601 +/- 0.121   0.951 +/- 0.019   0.363 +/- 0.259
DT+BR, BoW+VO    0.612 +/- 0.128   0.952 +/- 0.019   0.372 +/- 0.271
DT+LC, BoW       0.752 +/- 0.054   0.968 +/- 0.009   0.696 +/- 0.066
DT+LC, BoW+VO    0.855 +/- 0.048   0.983 +/- 0.007   0.790 +/- 0.067
LR+BR, BoW       0.698 +/- 0.052   0.965 +/- 0.006   0.602 +/- 0.045
LR+BR, BoW+VO    0.778 +/- 0.045   0.976 +/- 0.007   0.694 +/- 0.048
LR+LC, BoW       0.775 +/- 0.038   0.971 +/- 0.005   0.720 +/- 0.039
LR+LC, BoW+VO    0.854 +/- 0.021   0.983 +/- 0.003   0.801 +/- 0.028
BL               0.868             0.988             0.790

Table 5.2: CoS attribute classification accuracy (cucumber dataset)

Initially only the linguistic features were used, in combination with two types of classifiers, decision tree (DT) and logistic regression (LR), and the two methods of multilabel classification, BR and LC, described in Section 5.2. The results for the cucumber dataset are shown in Table 5.2. The data was randomly split into 80% for training and 20% for testing. Five replicates of the experiment were done and their accuracy measurements averaged. The baseline (BL) predicts each verb to have the majority label set for that verb, i.e., it predicts label sets based only on the verb.
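The majority-label-set baseline is simple enough to state in code. The following is a minimal sketch of the idea as described, with hypothetical function names; the fallback of predicting an empty label set for a verb never seen in training is an assumption, and "Size" is an assumed attribute name.

```python
from collections import Counter, defaultdict

def fit_majority_baseline(train):
    """train: iterable of (verb, label_set). Returns verb -> most frequent label set."""
    counts = defaultdict(Counter)
    for verb, labels in train:
        counts[verb][frozenset(labels)] += 1
    return {verb: c.most_common(1)[0][0] for verb, c in counts.items()}

def predict(model, verb, default=frozenset()):
    """Predict the stored majority label set; unseen verbs get the default (assumed)."""
    return model.get(verb, default)

train = [
    ("cut", {"NumberOfPieces"}),
    ("cut", {"NumberOfPieces"}),
    ("cut", {"Size"}),
    ("rinse", {"Wetness"}),
]

model = fit_majority_baseline(train)
print(predict(model, "cut"))    # majority label set for 'cut'
print(predict(model, "peel"))   # unseen verb -> empty set
```

Because the baseline ignores everything except the verb lemma, any improvement over it has to come from the sentence context, the object, or the video.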
Table 5.3 shows the prediction results for the cucumber dataset using all combinations of the two linguistic feature sets and the one visual feature set. The data was randomly split into 60% for training and 40% for testing. Logistic regression with l1 regularization was used for classification. The scores show the average of five replicates and the standard deviations.
The results show that only the BoW combined with the VO feature set, as well as all three feature sets combined, performed better than the baseline (exact match 0.8486 and 0.8495 vs. 0.790). The visual features alone do not perform nearly as well as the baseline (0.3667 vs. 0.790); however, they do perform better than random assignment of label sets (0.3667 vs. 0.0532), showing that they do contain some useful information related to the attributes undergoing change.

                                  Exact match         Jaccard index       Hamming score
BL                                0.790               0.868               0.984
Random assignment of label sets   0.0532 +/- 0.0125   0.1319 +/- 0.0093   0.8648 +/- 0.0029
BoW                               0.7369 +/- 0.0190   0.7833 +/- 0.0164   0.9736 +/- 0.0015
VO                                0.7595 +/- 0.0259   0.8393 +/- 0.0222   0.9819 +/- 0.0024
BoW+VO                            0.8486 +/- 0.0225   0.8844 +/- 0.0202   0.9870 +/- 0.0022
Visual                            0.3667 +/- 0.0301   0.4480 +/- 0.031    0.9306 +/- 0.0042
BoW+Visual                        0.7405 +/- 0.0244   0.7868 +/- 0.0201   0.9742 +/- 0.002
VO+Visual                         0.7450 +/- 0.0234   0.8328 +/- 0.0202   0.9813 +/- 0.0021
BoW+VO+Visual                     0.8495 +/- 0.0218   0.8847 +/- 0.0195   0.9871 +/- 0.0021

Table 5.3: CoS attribute classification accuracy using various feature sets (cucumber dataset)

Table 5.4 shows per-verb classification results for the three most frequent verbs. Predictions were made using logistic regression with l1 regularization, with 60% of the data used for training and 40% for testing. The label combination method was used for multilabel classification. The scores show the average of five replicates and the standard deviations.
The results show that the feature sets that give the best performance depend on the verb. The verb get has the best exact match score with the visual and BoW features (0.819 vs. baseline 0.642). Adding the VO features offers no further improvement. Take has the best performance with only BoW features (exact match 0.872 vs. baseline 0.439). Adding the VO features, visual features, or both to the BoW offers no further improvement. Lastly, cut has the best performance with the BoW and VO features combined (exact match 0.791 vs. baseline 0.687). Moreover, when the visual features are used in addition to this feature set, the performance for this verb actually decreases to exact match 0.783. Overall, these results show that the most important features for predicting the CoS of a verb may depend on the verb, rather than there being a single best feature set for all verbs.

Feature set      Metric   get               take              cut
Num. examples             106               88                56
Num. unique label sets    3                 5                 5
BL               EM       0.642 +/- 0.062   0.439 +/- 0.069   0.687 +/- 0.084
                 JI       0.809 +/- 0.004   0.700 +/- 0.003   0.748 +/- 0.007
                 HS       0.979 +/- 0.004   0.965 +/- 0.003   0.974 +/- 0.007
BoW              EM       0.814 +/- 0.015   0.872 +/- 0.065   0.774 +/- 0.058
                 JI       0.900 +/- 0.001   0.926 +/- 0.004   0.830 +/- 0.007
                 HS       0.989 +/- 0.001   0.991 +/- 0.004   0.983 +/- 0.007
VO               EM       0.619 +/- 0.035   0.628 +/- 0.084   0.757 +/- 0.052
                 JI       0.798 +/- 0.003   0.792 +/- 0.006   0.817 +/- 0.004
                 HS       0.978 +/- 0.003   0.975 +/- 0.006   0.982 +/- 0.004
BoW+VO           EM       0.800 +/- 0.019   0.872 +/- 0.065   0.791 +/- 0.051
                 JI       0.893 +/- 0.001   0.926 +/- 0.004   0.848 +/- 0.006
                 HS       0.988 +/- 0.001   0.991 +/- 0.004   0.985 +/- 0.006
Visual           EM       0.600 +/- 0.043   0.472 +/- 0.039   0.757 +/- 0.09
                 JI       0.788 +/- 0.003   0.714 +/- 0.002   0.817 +/- 0.007
                 HS       0.976 +/- 0.003   0.967 +/- 0.002   0.982 +/- 0.007
BoW+Visual       EM       0.819 +/- 0.023   0.872 +/- 0.065   0.774 +/- 0.064
                 JI       0.902 +/- 0.002   0.926 +/- 0.004   0.830 +/- 0.007
                 HS       0.989 +/- 0.002   0.991 +/- 0.004   0.983 +/- 0.007
VO+Visual        EM       0.605 +/- 0.044   0.617 +/- 0.121   0.765 +/- 0.065
                 JI       0.793 +/- 0.003   0.789 +/- 0.007   0.826 +/- 0.005
                 HS       0.977 +/- 0.003   0.975 +/- 0.007   0.983 +/- 0.005
BoW+VO+Visual    EM       0.819 +/- 0.023   0.872 +/- 0.065   0.783 +/- 0.073
                 JI       0.902 +/- 0.002   0.926 +/- 0.004   0.839 +/- 0.007
                 HS       0.989 +/- 0.002   0.991 +/- 0.004   0.984 +/- 0.007

Table 5.4: Per-verb CoS attribute classification accuracy using various feature sets (cucumber dataset). EM = exact match, JI = Jaccard index, HS = Hamming score.

5.3.3 Predicting the Attribute Described by NL Descriptions of CoS
We classified the CoS descriptions from the pilot data in order to determine which features are important in predicting the change of state. Note that this dataset poses a distinct problem from the other three datasets. In the case of the pilot data, the turker has provided a direct description of a change of state, from which the features for classification are extracted. On the other hand, for the cucumber dataset the features are extracted from sentences which describe the action (rather than the change of state). If a robot is not able to understand the CoS from a human's narration of an action in the kitchen, then it should be able to ask what CoS is indicated and subsequently extract the CoS from the human's description.
Table 5.5 shows the results for classification of the pilot data using all combinations of the linguistic feature sets with two types of classifiers, decision tree (DT) and logistic regression (LR), and the two methods of multilabel classification, BR and LC, described in Section 5.2. The data was randomly split into 80% for training and 20% for testing. Five replicates of the experiment were done and their accuracy measurements averaged. The baseline (BL) predicts each verb to have the majority label set for that verb, i.e., it predicts label sets based only on the verb.

                 Jaccard index     Hamming score     Exact match
DT+BR, BoW       0.426 +/- 0.197   0.949 +/- 0.019   0.271 +/- 0.256
DT+BR, BoW+VO    0.429 +/- 0.200   0.949 +/- 0.019   0.273 +/- 0.259
DT+LC, BoW       0.726 +/- 0.021   0.976 +/- 0.002   0.666 +/- 0.026
DT+LC, BoW+VO    0.732 +/- 0.022   0.977 +/- 0.002   0.674 +/- 0.022
LR+BR, BoW       0.691 +/- 0.027   0.970 +/- 0.002   0.624 +/- 0.025
LR+BR, BoW+VO    0.610 +/- 0.037   0.961 +/- 0.003   0.540 +/- 0.037
LR+LC, BoW       0.738             0.983             0.684
LR+LC, BoW+VO    0.743             0.983             0.689
BL               0.486             0.973             0.429

Table 5.5: CoS attribute classification accuracy for open-ended change of state descriptions (pilot dataset)

The results show that the simple bag of words (BoW) feature set was enough for performance above the baseline (BL) in every configuration except DT+BR. When the verb and object (VO) were also provided as features, accuracy increased slightly in most configurations (LR+BR is the exception, where accuracy dropped). Furthermore, the results show two interesting things. (1) LC, the multilabel classification method that takes into account correlations between labels, performs better than BR, which does not take correlations into consideration. This is in line with our observation that some attributes are correlated. (2) In comparison to the results for the cucumber dataset, the classification accuracy for the pilot dataset is not as high. We might expect the opposite, since the pilot sentences are direct descriptions of the CoS. This may be because a large number of sentences from the pilot dataset are annotated as describing no CoS, resulting in fewer examples to learn from. Moreover, there is much more variation in the CoS descriptions (pilot data) because some turkers gave thorough, full-sentence responses while others responded with only one or two words.
Chapter 6
Complexity of Verb Semantics based on CoS
6.1 Multilevel Dataset and Crowdsource Study
The TACoS Multilevel corpus consists of descriptions of cooking videos (the same videos as in the original TACoS corpus) at three levels of detail: single sentence, short (about five sentences), and detailed descriptions (no more than 15 sentences) [18, 21]. The corpus contains 20 triples of descriptions for each video and there are five videos per activity. An example of descriptions for the cucumber preparation activity is shown below. (Note that these sentences were processed to have the same subject, etc.)
Single sentence
"The person entered the kitchen and sliced a cucumber"
Short
"The person walked into the kitchen. The person got a cutting board, knife, and cucumber. The person washed the cucumber. The person put the cucumber on the cutting board. The person sliced the cucumber. The person put the cucumber on the plate."
Detailed
"The person walked into the kitchen. The person removed a cutting board and knife from the drawer. The person put the plate on the counter. The person washed the cucumber at the sink. The person placed the cucumber and the plate next to the cutting board..."
We used this corpus to examine several questions about how the level of detail of an activity description affects the CoS denoted by verbs in the description.
6.1.1 Bread and Cucumber Multilevel Dataset
To examine the effect of visual context we collected CoS annotations for the sentences describing the bread and cucumber activities of the TACoS multilevel corpus. Note that these sentences are distinct from the cucumber dataset from Section 5.1. CoS annotations were collected by presenting a sentence from the corpus containing a verb-object pair, sometimes accompanied by the video clip described by the sentence. The difference in responses to sentences with and without the clip will elucidate the effect of visual context on CoS, as in the pilot study. The annotations of verbs at different levels of detail allow for analyses of the relations between the levels in terms of CoS.
Turkers were tasked with filling out the CoS frame for up to three changes of state by selecting the frame slot options from a dropdown menu. They were also instructed to check 'Current change of state frame is not applicable' (CoS-NA) if none of the options satisfactorily described the change of state denoted by the verb. Furthermore, they could check 'No change of state' (No-CoS) if the verb did not denote a change of state. In the latter two cases turkers were prompted to provide a reason for their response.
Annotations for five of the tuples of descriptions (out of 20) were collected with a +/- scene condition for each of the ten bread and cucumber activity videos. We collected three turker responses for each verb so that inter-annotator agreement could be assessed. Thus, with 10 videos $\times$ 21 sentences $\times$ 5 tuples $\times$ 2 video conditions $\times$ 3 replicates, we can expect 6300 responses. However, in reality we collected 5100 responses for the cucumber and bread videos. This is because not all the short and detailed descriptions contained five or fifteen sentences, respectively. In addition, not all sentences contained only a single verb, or a verb that takes a patient (the object that undergoes a change of state).
From the three turker responses collected for each verb-object-scene condition, we marked unanimous labels whenever at least two of the attribute, CoS-NA, or No-CoS labels match (e.g., a unanimous No-CoS label is marked if at least two of the three responses' No-CoS labels match). Because we are only examining the attributes of the CoS frames, we do not consider the object and value portions of the frame for now.
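The two-of-three agreement rule can be sketched as follows. This is an illustrative sketch with a hypothetical function name; it treats attribute labels, CoS-NA, and No-CoS uniformly as members of each turker's response set, which is an assumption about how the responses were encoded, and "Location" is an assumed attribute name.

```python
from collections import Counter

def unanimous_labels(responses, min_agree=2):
    """Return every label chosen by at least `min_agree` of the annotators.

    responses: one set of labels per turker for a single verb-object-scene item.
    """
    counts = Counter(label for response in responses for label in set(response))
    return {label for label, n in counts.items() if n >= min_agree}

# Hypothetical responses from three turkers for one item.
responses = [{"Location"}, {"Location", "Wetness"}, {"No-CoS"}]
print(unanimous_labels(responses))  # only 'Location' reaches two votes
```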
6.1.2 24 Activities Multilevel Dataset
Lastly, a fourth dataset was collected in order to validate the CoS ontology by showing that the CoS frame and frame slot options apply to a wide range of kitchen activities. We chose one video of each of the remaining 24 activities from the TACoS multilevel corpus and turkers were tasked with annotating the accompanying descriptions. These activities include dicing an onion, frying eggs, etc. This broad scope of kitchen activities ensures that a variety of verbs will appear in the descriptions. Annotations for five (out of 20) of the tuples for each video were collected as in Section 6.1.1. Collecting annotations for this diverse set of activities should tell us how well the CoS ontology covers the kitchen domain.
With 24 videos $\times$ 21 sentences $\times$ 5 tuples $\times$ 3 replicates, we expected to collect 7560 turker responses. However, in total we collected 8256 turker responses, for the same reasons as stated in Section 6.1.1. Unanimous labels were also computed as in the section above.
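The expected response counts for both multilevel datasets are simple products of the study design factors, which the short check below reproduces. The helper name is hypothetical, and the reading of 21 as the maximum number of sentences in one tuple (1 single sentence + 5 short + 15 detailed) is an inference from the corpus description rather than something stated explicitly.

```python
from math import prod

def expected_responses(*factors):
    """Multiply the design factors of a crowdsourcing study."""
    return prod(factors)

# Assumed reading: 1 single + 5 short + 15 detailed sentences per tuple.
SENTENCES_PER_TUPLE = 1 + 5 + 15

# videos x sentences x tuples x scene conditions x replicates
bread_cucumber = expected_responses(10, SENTENCES_PER_TUPLE, 5, 2, 3)
# videos x sentences x tuples x replicates (no scene condition)
activities_24 = expected_responses(24, SENTENCES_PER_TUPLE, 5, 3)

print(bread_cucumber, activities_24)
```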
6.2 Results of Human Studies
6.2.1 Coverage of CoS Ontology
Table 6.1 and Table 6.2 show how well the CoS frame options cover the verbs in the bread and cucumber and the 24 activities datasets, respectively. These statistics were computed by counting the number of turker responses that contained 'Current change of state frame is not applicable' and 'No change of state' labels. Unanimous (UN) labels exist whenever two or more turkers agree on a label for a given verb. 0.5 percent of the bread and cucumber verb instances and 0.4 percent of the instances from the other 24 activities have unanimous CoS-NA labels. The low number of CoS-NA labels suggests that the coverage of the CoS ontology is quite thorough.

                         All      Single   Short    Detailed   +Scene   -Scene
Num. verb types          94       16       43       79         94       94
Num. verb tokens         1676     140      456      1080       838      838
Num. CoS-NA (UN)         8        0        5        3          5        3
Perc. CoS-NA (UN)        0.0048   0.0000   0.0110   0.0028     0.0060   0.0036
Num. No-CoS (UN)         4        0        2        2          1        3
Perc. No-CoS (UN)        0.0024   0.0000   0.0044   0.0019     0.0012   0.0036
Num. turker responses    5100     433      1384     3283       2548     2552
Num. CoS-NA              57       3        22       32         22       35
Perc. CoS-NA             0.0112   0.0069   0.0159   0.0097     0.0086   0.0137
Num. No-CoS              49       4        11       34         20       29
Perc. No-CoS             0.0096   0.0092   0.0079   0.0104     0.0078   0.0114

Table 6.1: Coverage of the CoS frame options (bread and cucumber multilevel dataset)

                         All      Single   Short    Detailed
Num. verb types          222      53       112      201
Num. verb tokens         2715     200      658      1857
Num. CoS-NA (UN)         12       0        4        8
Perc. CoS-NA (UN)        0.0044   0.0000   0.0061   0.0043
Num. No-CoS (UN)         15       0        2        13
Perc. No-CoS (UN)        0.0055   0.0000   0.0030   0.0070
Num. turker responses    8256     611      2005     5640
Num. CoS-NA              115      9        27       79
Perc. CoS-NA             0.0139   0.0147   0.0135   0.0140
Num. No-CoS              133      4        29       100
Perc. No-CoS             0.0161   0.0065   0.0145   0.0177

Table 6.2: Coverage of the CoS frame options (24 activities multilevel dataset)

Based on the feedback provided by the turkers, there are several reasons that they selected CoS-NA. The most frequent reason is a mistake by the automatic part of speech tagger: either a noun modifier was labeled as a verb (e.g., cutting board) or the labeled verb was actually a deverbal adjective describing a non-changing state (e.g., the crumbs that remained, sealed package). Also, sometimes there were mistakes made by the automatic semantic role labeler (e.g., mislabeling agents as patients). In some cases the verb presented to the turker was actually not a concrete action verb (e.g., start, finish, continue, need). Lastly, the CoS frame does not apply to sentences from the original TACoS corpus that did not make sense (e.g., The person leaked the folk.). None of the above reasons result from the design of the CoS ontology; they stem from other factors.
Some interesting verbs that the CoS ontology does not cover include 'taste' and 'test' (e.g., "The person tested the temperature with his finger"), where the CoS occurs in the knowledge of the agent, and 'took' along with other verbs where the CoS is a change of possession. Change of state of knowledge and change of possession, however, are abstract concepts, not observable in terms of concrete visual features, so they were intentionally left out of the ontology. There were also vague verbs for which the CoS was not clear from only the linguistic context, for example 'prepare' and 'make'. Another such case is the verb 'use', which actually seems to indicate that its patient is an instrument in another action. Furthermore, in cases where the verb was used to state the purpose of the action (e.g., "The person took out an orange to make orange juice") or an attempt that may or may not be successful (e.g., "The person tried to remove the lid"), the verb does not clearly describe a CoS that actually occurs. These instances were labeled with No-CoS. Lastly, the only concrete change of state that should have been included in the CoS ontology was turn. For example, this change of state is visible by the flow of water from a sink faucet.
6.2.2 Effect of Visual Context
Table 6.3 compares the label sets between the +/- scene conditions of the top 20 most frequent verbs from the multilevel bread and cucumber dataset. The metrics show the variability between corresponding verb-object instances' label sets in the two scene conditions. The label cardinality indicates the average number of labels each verb was annotated with.
Overall, the data shows that the label sets do not vary greatly between the +scene and -scene conditions for all verbs (variability of 0.005). That is, the visual context does not have a large effect on humans' interpretations of these verbs. Moreover, the average number of labels is the same for both scene conditions (0.86 labels). However, the data shows that some verbs' CoS interpretation depends on the visual context, such as open, with variability 0.086 and label cardinality 0.48 in +scene and 0.55 in -scene. The visual context does not affect the interpretation of other verbs as much, such as wash, with variability 0 and label cardinality 1.00 in both scene conditions.

Verb        Count   Variability   +S label card.   -S label card.
all verbs   838     0.005         0.86             0.86
slice       96      0.015         0.80             0.73
cut         81      0.018         0.74             0.80
put         67      0.018         0.96             1.00
get         61      0.017         0.93             0.97
place       58      0.006         1.02             1.02
take        57      0.016         0.96             0.93
take out    42      0.010         0.95             0.88
wash        32      0             1.00             1.00
open        31      0.086         0.48             0.55
remove      26      0.047         0.88             0.88

Table 6.3: Comparison between +/- scene conditions (bread and cucumber multilevel dataset)

6.2.3 Level of Detail's Effect on CoS
Does the level of detail in which a verb appears affect the change of state that it denotes? Figure 6.1 shows how the labels are distributed over the attributes for four verbs that occur frequently in all three levels of the 24 activity dataset. We can see that for some of the verbs (put and wash) the distributions are the same for all three levels. However, for the verbs cut and slice the distributions differ depending on the level of detail in which the verb occurs. This suggests that there is some ambiguity in determining which changes of state the verbs denote. Furthermore, the change of state may be determined based on the context of use.
6.2.4 Level of Detail's Effect on Verb Frequency and Distribution
How does the level of detail affect which words appear most frequently? Figure 6.2 shows the frequencies of the top twenty most common verbs in each of the levels from the 24 activity dataset. As observed previously in the TACoS multilevel corpus, there are differences between the verbs used at different levels of detail [18, 21]. For example, the single sentence descriptions contain vague words like cook, prepare, and make, as well as verbs such as demonstrate, which describe the activity at a higher level of abstraction. These verbs appear less frequently or not at all at more detailed levels of description.

Figure 6.1: CoS distributions over attributes for verbs at three levels of detail, panels (a) to (d) (24 activities multilevel dataset)

How are the tokens of each verb distributed across the three levels? Table 6.4 shows the distributions of each verb in the 24 activity dataset over the three levels of description. Additionally, the entropy of each distribution shows how likely the verb is to appear equally in all three levels (high entropy, max. of $\ln 3 \approx 1.0986$) or only to appear in one level (low entropy, min. 0). The entropy of a verb's distribution among the three levels is given by Equation 6.1, where $P(v_l)$ is the distribution of verb $v$ in level $l$. The natural logarithm is used, consistent with the stated maximum of 1.0986 and with the entropy values in Table 6.4.

Figure 6.2: Frequencies of occurrence of the top twenty verbs, panels (a) to (c) (24 activities multilevel dataset)

$$H(V) = -\sum_{l} P(v_l) \ln P(v_l) \qquad (6.1)$$
The data shows that indeed some abstract verbs like prepare, cook, and make are concentrated at levels of less detail (with 95%, 72%, and 56% chance of appearing in a single sentence description, respectively). These verbs describe the main activity in the video, but there is no reason to use them at greater levels of detail where the sequence of actions is described. Consequently, these verbs have relatively low entropy (0.22, 0.74, and 0.93, respectively). On the other hand, the verbs cut, wash, and rinse are about equally likely to appear in any level (with high entropy 1.08, 1.08, and 1.06, respectively). When used in low levels of detail they describe the most salient part of the video, i.e., the goal of the action (e.g., "The person cut the onion"). Conversely, they can also be used in high levels of detail where they describe a specific action in a sequence of actions (e.g., "The person set the onion on the cutting board. He cut the onion into small pieces. He put the pieces in a bowl...").

Verb       Entropy   Single   Short    Det.     Single   Short   Det.    All
                     distr.   distr.   distr.   count    count   count   count
cut        1.08      0.35     0.39     0.26     19       69      129     217
wash       1.08      0.26     0.39     0.35     6        30      77      113
put        1.08      0.25     0.38     0.37     8        41      112     161
use        1.07      0.44     0.32     0.24     5        12      26      43
rinse      1.06      0.24     0.31     0.45     4        17      71      92
remove     1.02      0.17     0.35     0.48     3        21      81      105
throw      1.02      0.17     0.35     0.48     2        14      54      70
peel       1.01      0.50     0.34     0.16     18       41      54      113
dice       0.98      0.51     0.35     0.14     4        9       10      23
slice      0.98      0.56     0.25     0.18     17       25      51      93
chop       0.97      0.50     0.37     0.12     7        17      16      40
enter      0.97      0.56     0.28     0.16     13       21      34      68
place      0.96      0.13     0.32     0.55     3        25      120     148
make       0.93      0.56     0.34     0.10     7        14      12      33
cook       0.74      0.72     0.22     0.06     20       20      16      56
take       0.69      0.00     0.46     0.54     0        30      101     131
separate   0.68      0.78     0.12     0.10     4        2       5       11
take out   0.68      0.00     0.42     0.58     0        19      75      94
get        0.67      0.00     0.38     0.62     0        20      91      111
add        0.63      0.00     0.32     0.68     0        8       48      56
prepare    0.22      0.95     0.04     0.01     8        1       1       10

Table 6.4: The entropy of the distributions of each verb over three levels (24 activities multilevel dataset)
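The entries of Table 6.4 can be reproduced with a short calculation. The sketch below makes two assumptions that are inferred from the numbers rather than stated in the text: each verb's count at a level is first divided by the total number of verb tokens at that level (200, 658, and 1857 from Table 6.2) and the three rates are then normalized to sum to 1; and the entropy uses the natural logarithm. The function names are hypothetical.

```python
import math

# Verb tokens per level (single, short, detailed), taken from Table 6.2.
LEVEL_TOKENS = (200, 658, 1857)

def level_distribution(counts, level_totals=LEVEL_TOKENS):
    """Normalize per-level relative frequencies so they sum to 1 (assumed procedure)."""
    rates = [c / t for c, t in zip(counts, level_totals)]
    total = sum(rates)
    return [r / total for r in rates]

def entropy(dist):
    """Shannon entropy with natural log; zero-probability levels contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0)

cut = level_distribution((19, 69, 129))   # counts for 'cut' from Table 6.4
prepare = level_distribution((8, 1, 1))   # counts for 'prepare' from Table 6.4

print([round(p, 2) for p in cut], round(entropy(cut), 2))
print([round(p, 2) for p in prepare], round(entropy(prepare), 2))
```

Normalizing by level size matters here because the detailed level contains roughly nine times as many verb tokens as the single sentence level; raw counts alone would make almost every verb look like a detailed-level verb.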
Chapter 7
Discussion and Conclusion
In conclusion, we have designed an ontology of changes of state as denoted by verbs based on representations of verbal semantics. Furthermore, three datasets containing change of state frame annotations of verbs in the kitchen domain were collected by building on top of the preexisting TACoS and TACoS multilevel corpora. Several analyses showed how visual context, the object of the verb, and level of description affect the way in which a person understands a verb to indicate change of state. It was also demonstrated that the CoS ontology can be used to annotate the changes of state denoted by a wide variety of verbs in the kitchen domain. Lastly, we showed that the CoS indicated by a verb can be predicted to some degree automatically based on linguistic and visual features.
In the future it would be interesting to create a set of classifiers to predict all the slots of the CoS frame (attribute, object, and value) rather than only the attribute. Furthermore, more in-depth study is needed to show how specifically the features of the object, the surrounding linguistic context of the verb, and the visual context of the scene described by the verb determine which CoS a verb denotes. This work may draw more inspiration from Generative Lexicon, which specifies how the properties of noun arguments may affect the meanings of verbs [12].
The current work may be relevant to future work on action learning by instruction (in combination with demonstration) for robots, as it provides a set of changes of state synonymous with the goal of the action. Furthermore, we demonstrated with the prediction results that even if the verb is unknown (as is the case when hearing a novel verb), the number of CoS hypotheses can be narrowed down based on the surrounding linguistic and visual context.
Also, after the robot has learned a new action, the human may provide further feedback by describing the CoS of the verb (i.e., the goal of the action) in more detail. For example, imagine that we are teaching the robot a novel word that means slice thinly. If the instructor sees that the robot's slices are too thick, an extra level of detail can be added to the robot's CoS representation for the verb, which states that the resulting slices should be no greater than a certain thickness.
BIBLIOGRAPHY
[1] John Beavers and Andrew Koontz-Garboden. Manner and Result in the Roots of Verbal Meaning. Linguistic Inquiry, 43(3):331-369, 2012.
[2] Yu-Wei Chao, Zhan Wang, Rada Mihalcea, and Jia Deng. Mining semantic affordances of visual object categories. Journal of Computer Vision, 88(2):303-338, 2010.
[3] R. M. W. Dixon and A. Y. Aikhenvald. Adjective Classes: A Cross-linguistic Typology. Explorations in Language and Space C. OUP Oxford, 2006.
[4] Jane Gillette, Henry Gleitman, Lila Gleitman, and Anne Lederer. Human simulations of vocabulary learning. Cognition, 73(2):135-176, 1999.
[5] Shantanu Godbole and Sunita Sarawagi. Discriminative methods for multi-labeled classification. In Advances in Knowledge Discovery and Data Mining, volume LNCS 3056, pages 22-30. Springer, 2004.
[6] Christopher Kennedy and Louise McNally. Scale structure and the semantic typology of gradable predicates. Language, 81(2):345-381, 2005.
[7] Beth Levin. English verb classes and alternations: A preliminary investigation. University of Chicago Press, 1993.
[8] Beth Levin and Malka Rappaport Hovav. Lexicalized scales and verbs of scalar change. Presented at 46th Annual Meeting of the Chicago Linguistics Society, 2010.
[9] J. M. Mandler. How to build a baby: II. Conceptual primitives. Psychological Review, 99(4):587-604, 1992.
[10] George A. Miller. WordNet: A lexical database for English. Commun. ACM, 38(11):39-41, November 1995.
[11] Sanmit Narvekar and Kristen Grauman. Relative Attributes. In IEEE International Conference on Computer Vision, pages 503-510, 2011.
[12] James Pustejovsky. The generative lexicon. Computational Linguistics, 17(4):409-441, 1991.
[13] Malka Rappaport Hovav and Beth Levin. Reflections on manner/result complementarity. Lecture notes, 2008.
[14] Malka Rappaport Hovav and Beth Levin. Reflections on manner/result complementarity. In Lexical Semantics, Syntax, and Event Structure, pages 21-38. Oxford University Press, 2010.
[15] Jesse Read. Scalable Multi-label Classification. PhD thesis, University of Waikato, 2010.
[16] Jesse Read, Bernhard Pfahringer, G. Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.
[17] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics (TACL), 1:25-36, 2013.
[18] Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. Coherent multi-sentence video description with variable level of detail. In GCPR, 2014.
[19] Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. Translating video content to natural language descriptions. In ICCV, 2013.
[20] Karin Kipper Schuler. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. PhD thesis, University of Pennsylvania, 2005.
[21] Anna Senina, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Sinkandar Amin, Mykhaylo Andrilika, Manfred Pinkal, and Bernt Schiele. Coherent multi-sentence video description with variable level of detail. arXiv:1403.6173, 2014.
[22] Aashish Sheshadri, Ian Endres, Derek Hoiem, and David Forsyth. Describing Objects by their Attributes. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[23] J. Siskind. Grounding Lexical Semantics of Verbs in Visual Perception Using Force Dynamics and Event Logic. Journal of AI Research, 15:31-90, 2001.
[24] Jeffrey Mark Siskind. Grounding Language in Perception. Artificial Intelligence Review, 8:371-391, 1995.
[25] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1-13, 2007.
[26] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667-685. Springer, 2010.
[27] Josiah Wang, Katja Markert, and Mark Everingham. Learning Models for Object Recognition from Natural Language Descriptions. Proceedings of the British Machine Vision Conference 2009, pages 2.1-2.11, 2009.
[28] Ran Xu, Caiming Xiong, Wei Chen, and Jason J. Corso. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In AAAI, 2015.